Linux Trace Kernel


* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-05-01 22:21 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco, Jacob Xu, Darwin Guo
In-Reply-To: <CAEvNRgHRpvsEjtr1A_Qz3d4oMEaffTxESavrZ73Jtt6OobCwhA@mail.gmail.com>

Ackerley Tng <ackerleytng@google.com> writes:

>
> [...snip...]
>
>
> TLDR:
>
> + PRESERVE == guarantee that the process of setting memory attributes
>   doesn't change memory contents.
>     + implementation == do nothing in most cases, except -EOPNOTSUPP for
>       to-shared on TDX, since unmapping is a required part of setting
>       memory attributes to private, and a TDX side effect of unmapping
>       is zeroing memory,

-EOPNOTSUPP will only be for TDX, not SNP.

> + ZERO == guarantee that the process of setting memory attributes zeroes
>   memory contents.
>     + implementation == memset(zero) in most cases. For TDX, a future
>       optimization exists, where memset() can be skipped for pages that
>       were mapped in Secure EPTs before conversion
> + UNSPECIFIED == no guarantees
>     + implementation == guest_memfd does nothing explicitly about memory
>       contents. The implementation is pretty much the same as PRESERVE
>       except guest_memfd won't take into account vendor-specific side
>       effects of the process of conversion. Except for the test vehicle
>       KVM_X86_SW_PROTECTED_VMS, where memory is scrambled.
>

Found another use case internally for pre-finalize, SNP, to-shared,
PRESERVE, which works with the above smaller scope.

During SNP_LAUNCH_UPDATE, when inserting a CPUID page, the firmware
checks that the CPUID values would not lead to an insecure guest
state. If the check fails, SNP_LAUNCH_UPDATE returns an error and the
page remains shared in the RMP table.

Here's the proposed flow in the userspace VMM:

1. Load CPUID in shared guest_memfd memory
2. SET_MEMORY_ATTRIBUTES(PRIVATE, PRESERVE)
3. SNP_LAUNCH_UPDATE => get error since CPUID was insecure
4. SET_MEMORY_ATTRIBUTES(SHARED, PRESERVE)
5. Read shared guest_memfd memory, error if VMM disagrees
6. SET_MEMORY_ATTRIBUTES(PRIVATE, PRESERVE)
7. SNP_LAUNCH_UPDATE => successful, since CPUID is now corrected

Does that seem ok?
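
The seven steps above can be sketched as a small, self-contained C model.
Nothing here is a real KVM or firmware interface: set_attr_preserve()
stands in for SET_MEMORY_ATTRIBUTES with PRESERVE, fw_launch_update()
stands in for SNP_LAUNCH_UPDATE, and the single value field with the
FW_MAX_OK limit is a made-up replacement for the CPUID contents and the
firmware's policy. The model also assumes that, on rejection, the values
the firmware would accept end up in the page, which is what makes step 5
meaningful.

```c
#include <assert.h>

enum mem_attr { ATTR_SHARED, ATTR_PRIVATE };

/* Hypothetical stand-in for one guest_memfd page holding CPUID data. */
struct cpuid_page {
	enum mem_attr attr;
	unsigned int value;	/* toy replacement for the CPUID contents */
};

#define FW_MAX_OK 7u	/* made-up policy: larger values count as insecure */

/* Models SET_MEMORY_ATTRIBUTES(attr, PRESERVE): only the attribute changes. */
static void set_attr_preserve(struct cpuid_page *p, enum mem_attr attr)
{
	p->attr = attr;
}

/*
 * Models SNP_LAUNCH_UPDATE on a CPUID page. Assumption of this sketch:
 * on rejection the corrected value is left in the page and an error is
 * returned.
 */
static int fw_launch_update(struct cpuid_page *p)
{
	if (p->attr != ATTR_PRIVATE)
		return -1;
	if (p->value > FW_MAX_OK) {
		p->value = FW_MAX_OK;
		return -1;
	}
	return 0;
}

/* Returns 0 on success, -1 if the VMM rejects the corrected contents. */
int vmm_insert_cpuid(struct cpuid_page *p, unsigned int wanted,
		     unsigned int vmm_min)
{
	assert(p);
	p->attr = ATTR_SHARED;
	p->value = wanted;				/* step 1 */
	set_attr_preserve(p, ATTR_PRIVATE);		/* step 2 */
	if (fw_launch_update(p) == 0)			/* step 3 */
		return 0;
	set_attr_preserve(p, ATTR_SHARED);		/* step 4 */
	if (p->value < vmm_min)				/* step 5 */
		return -1;
	set_attr_preserve(p, ATTR_PRIVATE);		/* step 6 */
	return fw_launch_update(p);			/* step 7 */
}
```

The point of the model is that steps 4 and 6 must not alter the page
contents, otherwise the corrected values read in step 5 would not be the
ones submitted again in step 7.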

>>>
>>> [...snip...]
>>>

^ permalink raw reply

* Re: [PATCH] trace_printk: replace _______STR with __UNIQUE_ID(STR)
From: Qian-Yu Lin @ 2026-05-02  7:37 UTC (permalink / raw)
  To: David Laight; +Cc: rostedt, mhiramat, linux-kernel, linux-trace-kernel
In-Reply-To: <20260501221315.1f709d6d@pumpkin>

On Fri, May 01, 2026 at 10:13:15PM +0100, David Laight wrote:
> On Fri, 1 May 2026 22:40:17 +0800
> Qian-Yu Lin <tiffany019230@gmail.com> wrote:
> 
> ...
> > Yes. I measured compile time of kernel/trace/ring_buffer_benchmark.o
> > after make clean on an x86_64 machine running Ubuntu 24.04 LTS:
> > 
> >   - Original _______STR:                 49.8s
> >   - v1 with __UNIQUE_ID (compiler.h):    53.5s
> >   - compound literal (no extra include): 33.2s
> 
> That difference looks far too big to me.
> And the times are far too large to be measuring the actual compile time.
> 

You're right, my earlier measurements included dependency rebuilds
after make clean. I re-measured using touch to isolate the actual
compile time of ring_buffer_benchmark.o on x86_64:

  - Original _______STR:                    1.757s
  - v1 with __UNIQUE_ID (compiler.h):       1.836s
  - sizeof __stringify (your suggestion):   1.781s

> > 
> > I propose using a compound literal in v2, which eliminates the local
> > variable entirely and requires no extra include:
> > 
> > #define trace_printk(fmt, ...)                          \
> > do {                                                    \
> >     if (sizeof((char[])                             \
> >         {__stringify((__VA_ARGS__))}) > 3)      \
> >         do_trace_printk(fmt, ##__VA_ARGS__);    \
> 
> There has to be a better way to align that code.
> Although you should be able to use:
> 	if (sizeof __stringify((__VA_ARGS__)) > 3)
> (I've omitted one set of parenthesis for clarity)
> 
> You could change __stringify() to work with __VA_ARGS__, then you don't need
> the extra (); this works fine:
> #define _x(...) #__VA_ARGS__
> #define x(...) _x(__VA_ARGS__)
> #define z abcd
> int a = sizeof x(z, v); /* 8 */
> See: https://godbolt.org/z/zo4h4nr9b
> 
> -- David
> 

Yes, this works. I verified with objdump on the
samples/trace_printk module that all four cases branch correctly:
__trace_bputs, __trace_puts, __trace_bprintk, and __trace_printk.

I'll use this form in v3 since it's simpler than the compound literal.
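
For reference, the detection trick can be tried outside the kernel. This
is a standalone sketch, not the kernel header: STRINGIFY() and HAS_ARGS()
are hypothetical local names that mimic __stringify() and the sizeof
test, and the last check reproduces the "abcd, v" example from the quoted
mail.

```c
#include <assert.h>

/* Two-level stringification, so macro arguments are expanded first. */
#define STRINGIFY_1(...)	#__VA_ARGS__
#define STRINGIFY(...)		STRINGIFY_1(__VA_ARGS__)

/*
 * With no arguments the string is "()", which is 3 bytes including the
 * terminating NUL. Anything larger means at least one token was passed.
 */
#define HAS_ARGS(...)	(sizeof STRINGIFY((__VA_ARGS__)) > 3)

#define z abcd

/* All of these are evaluated at compile time; no variable is declared. */
static_assert(!HAS_ARGS(), "empty argument list is detected");
static_assert(HAS_ARGS(x), "a single argument is detected");
static_assert(sizeof STRINGIFY(z, v) == 8, "\"abcd, v\" plus NUL");
```

Because sizeof is applied to the string literal directly, nothing is
introduced into the caller's scope, so there is no name left to shadow.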

> >     else                                            \
> >         trace_puts(fmt);                        \
> > } while (0)
> > 
> > This fully eliminates the shadowing risk without any compile overhead.
> > 
> > Qian-Yu
> > 
> > > 
> > >   
> > > >  #include <linux/compiler_attributes.h>
> > > >  #include <linux/instruction_pointer.h>
> > > >  #include <linux/stddef.h>
> > > > @@ -84,15 +85,18 @@ do {									\
> > > >   * let gcc optimize the rest.
> > > >   */
> > > >  
> > > > -#define trace_printk(fmt, ...)				\
> > > > +#define ___trace_printk(fmt, str, ...)				\
> > > >  do {							\
> > > > -	char _______STR[] = __stringify((__VA_ARGS__));	\
> > > > -	if (sizeof(_______STR) > 3)			\
> > > > +	char str[] = __stringify((__VA_ARGS__));	\
> > > > +	if (sizeof(str) > 3)			\
> > > >  		do_trace_printk(fmt, ##__VA_ARGS__);	\
> > > >  	else						\
> > > >  		trace_puts(fmt);			\
> > > >  } while (0)
> > > >  
> > > > +#define trace_printk(fmt, ...) \
> > > > +	___trace_printk(fmt, __UNIQUE_ID(str), ##__VA_ARGS__)
> > > > +
> > > >  #define do_trace_printk(fmt, args...)					\
> > > >  do {									\
> > > >  	static const char *trace_printk_fmt __used			\  
> > >   
> 

^ permalink raw reply

* [PATCH v3] trace_printk: remove local variable for argument detection
From: Qian-Yu Lin @ 2026-05-02  7:55 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, david.laight.linux, linux-kernel, linux-trace-kernel,
	Qian-Yu Lin
In-Reply-To: <20260429165707.7020-1-tiffany019230@gmail.com>

The trace_printk() macro uses a local variable _______STR to detect
whether variadic arguments are present. This name can shadow outer
variables.

Replace the local variable with sizeof applied directly to the
stringified arguments:

  if (sizeof __stringify((__VA_ARGS__)) > 3)

This eliminates the shadowing risk entirely without introducing
any additional includes or local variables.

Verified with objdump on samples/trace_printk that all four cases
branch correctly: __trace_bputs, __trace_puts, __trace_bprintk,
and __trace_printk.

Suggested-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Qian-Yu Lin <tiffany019230@gmail.com>
---
 include/linux/trace_printk.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 2670ec7f4262..3d54f440dccf 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -86,8 +86,7 @@ do {									\

 #define trace_printk(fmt, ...)				\
 do {							\
-	char _______STR[] = __stringify((__VA_ARGS__));	\
-	if (sizeof(_______STR) > 3)			\
+	if (sizeof __stringify((__VA_ARGS__)) > 3)		\
 		do_trace_printk(fmt, ##__VA_ARGS__);	\
 	else						\
 		trace_puts(fmt);			\
-- 
2.43.0

^ permalink raw reply related

* [PATCH] tracing: probes: remove unused variable
From: Martin Kaiser @ 2026-05-02 13:57 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: Markus Schneider-Pargmann, linux-trace-kernel, linux-kernel,
	Martin Kaiser

params is always NULL in traceprobe_expand_meta_args(), so it can be removed.

Signed-off-by: Martin Kaiser <martin@kaiser.cx>
---
Would it be better to return ERR_PTR(-EOPNOTSUPP) instead of NULL, similar to
what is done in other places where NOSUP_BTFARG is logged? This would make the
parsing fail if $arg* is used without BTF support. At the moment, we skip the
$arg*...

 kernel/trace/trace_probe.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e0d3a0da26af..b627093a941e 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -1729,7 +1729,6 @@ const char **traceprobe_expand_meta_args(int argc, const char *argv[],
 					 int *new_argc, char *buf, int bufsize,
 					 struct traceprobe_parse_context *ctx)
 {
-	const struct btf_param *params = NULL;
 	int i, j, n, used, ret, args_idx = -1;
 	const char **new_argv __free(kfree) = NULL;
 
@@ -1747,7 +1746,7 @@ const char **traceprobe_expand_meta_args(int argc, const char *argv[],
 		if (args_idx != -1) {
 			/* $arg* requires BTF info */
 			trace_probe_log_err(0, NOSUP_BTFARG);
-			return (const char **)params;
+			return NULL;
 		}
 		*new_argc = argc;
 		return NULL;
-- 
2.43.7
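
The difference between the two return conventions raised in the note
above can be illustrated with a userspace sketch. ERR_PTR(), IS_ERR() and
PTR_ERR() are re-implemented here in simplified form, and expand_args()
and parse_probe() are hypothetical stand-ins, not the real traceprobe
functions. How the caller consumes the result is an assumption made for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical expander: NULL means "nothing to expand, use argv as is". */
static const char **expand_args(int uses_arg_star, int have_btf, int fail_hard)
{
	if (uses_arg_star && !have_btf)
		return fail_hard ? ERR_PTR(-EOPNOTSUPP) : NULL;
	return NULL;
}

/* Hypothetical caller: 0 if parsing continues, negative errno otherwise. */
int parse_probe(int uses_arg_star, int have_btf, int fail_hard)
{
	const char **new_argv = expand_args(uses_arg_star, have_btf, fail_hard);

	if (IS_ERR(new_argv))
		return (int)PTR_ERR(new_argv);
	return 0;	/* NULL: keep going with the original arguments */
}
```

With NULL the caller cannot tell "no expansion needed" from "$arg* could
not be expanded"; with an error pointer the second case stops parsing.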


^ permalink raw reply related

* [PATCH] tracing: Switch trace_recursion_record.c code over to use guard()
From: Yash Suthar @ 2026-05-02 17:47 UTC (permalink / raw)
  To: rostedt
  Cc: mhiramat, mathieu.desnoyers, linux-kernel, linux-trace-kernel,
	skhan, me, Yash Suthar

Switch mutex_lock()/mutex_unlock() to guard().
Also drop the ret local variable and return directly.

Signed-off-by: Yash Suthar <yashsuthar983@gmail.com>
---
 kernel/trace/trace_recursion_record.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_recursion_record.c b/kernel/trace/trace_recursion_record.c
index 784fe1fbb866..bac4bc844ccd 100644
--- a/kernel/trace/trace_recursion_record.c
+++ b/kernel/trace/trace_recursion_record.c
@@ -180,9 +180,8 @@ static const struct seq_operations recursed_function_seq_ops = {
 
 static int recursed_function_open(struct inode *inode, struct file *file)
 {
-	int ret = 0;
+	guard(mutex)(&recursed_function_lock);
 
-	mutex_lock(&recursed_function_lock);
 	/* If this file was opened for write, then erase contents */
 	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) {
 		/* disable updating records */
@@ -194,10 +193,9 @@ static int recursed_function_open(struct inode *inode, struct file *file)
 		atomic_set(&nr_records, 0);
 	}
 	if (file->f_mode & FMODE_READ)
-		ret = seq_open(file, &recursed_function_seq_ops);
-	mutex_unlock(&recursed_function_lock);
+		return seq_open(file, &recursed_function_seq_ops);
 
-	return ret;
+	return 0;
 }
 
 static ssize_t recursed_function_write(struct file *file,
-- 
2.43.0
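
The scope-based unlocking that guard() provides rests on the compiler's
cleanup variable attribute (a GCC/Clang extension). A userspace sketch of
the idea follows, with a toy lock type and hypothetical names (toy_mutex,
TOY_GUARD, demo_open) in place of the kernel's mutex and guard()
machinery; "return 1" merely stands in for the seq_open() call.

```c
#include <assert.h>

struct toy_mutex { int locked; };

static void toy_lock(struct toy_mutex *m)
{
	assert(!m->locked);
	m->locked = 1;
}

static void toy_unlock(struct toy_mutex *m)
{
	assert(m->locked);
	m->locked = 0;
}

/* Cleanup handlers receive a pointer to the variable leaving scope. */
static void toy_guard_release(struct toy_mutex **m)
{
	toy_unlock(*m);
}

/* Take the lock now, release it when the enclosing scope is left. */
#define TOY_GUARD(m)							\
	struct toy_mutex *toy_guard_					\
		__attribute__((cleanup(toy_guard_release), unused)) =	\
		(toy_lock(m), (m))

struct toy_mutex demo_lock;
int nr_records = 3;

int demo_open(int truncate, int readable)
{
	TOY_GUARD(&demo_lock);

	if (truncate)
		nr_records = 0;
	if (readable)
		return 1;	/* lock is released on this path */

	return 0;		/* and on this one */
}
```

Since every return path releases the lock, the function no longer needs
a ret variable just to reach a single unlock call.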


^ permalink raw reply related

* Re: [PATCH v19 0/7] ring-buffer: Making persistent ring buffers robust
From: Steven Rostedt @ 2026-05-02 19:23 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <177751968499.2136606.17388366710182662849.stgit@mhiramat.tok.corp.google.com>

[-- Attachment #1: Type: text/plain, Size: 628 bytes --]

Hi Masami,

I applied your patches and enabled your ptracingtest code. I noticed
that when there are dropped pages, the trace output is not in order:

 # trace-cmd start -B ptracingtest -e all -v -e '*lock*'
 # taskset -c 5 echo c > /proc/sysrq-trigger

On reboot, I ran:

 # trace-cmd show -B ptracingtest > /tmp/trace.out

Then executed the attached perl program:

  # ./read-ts.pl < /tmp/trace.out

And it errors out:

 30.212495 < 30.213534
           <...>-1048    [005] d....    30.212495: irq_enable: caller=irqentry_exit+0xf5/0x710 parent=0x0

That is, I think the zero timestamps may be messing with the order.
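
The attached read-ts.pl is not reproduced in the archive. A minimal C
sketch of the kind of check described is given here instead, assuming
each event line has the "task-pid [cpu] flags timestamp: event" layout
shown above. parse_ts() and first_out_of_order() are hypothetical names,
and this is not a port of the actual script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Extract the timestamp (in seconds) from one trace line; 0 on success. */
int parse_ts(const char *line, double *ts)
{
	assert(line && ts);
	/* Skip "task-pid", "[cpu]" and the flags field, then read the time. */
	return sscanf(line, " %*[^[][%*d] %*s %lf", ts) == 1 ? 0 : -1;
}

/* Index of the first line whose timestamp goes backwards, or -1 if none. */
long first_out_of_order(const char *const *lines, size_t n)
{
	double prev = 0.0, ts;
	size_t i;

	for (i = 0; i < n; i++) {
		if (parse_ts(lines[i], &ts))
			continue;	/* header or other non-event line */
		if (ts < prev)
			return (long)i;
		prev = ts;
	}
	return -1;
}
```

A line is reported when its timestamp is smaller than the previous
one, which is the condition the "30.212495 < 30.213534" output expresses.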

-- Steve

[-- Attachment #2: read-ts.pl --]
[-- Type: application/x-perl, Size: 300 bytes --]

^ permalink raw reply

* Re: [PATCH v19 0/7] ring-buffer: Making persistent ring buffers robust
From: Steven Rostedt @ 2026-05-02 22:17 UTC (permalink / raw)
  To: Masami Hiramatsu (Google)
  Cc: Catalin Marinas, Will Deacon, Mathieu Desnoyers, linux-kernel,
	linux-trace-kernel, Ian Rogers, linux-arm-kernel
In-Reply-To: <20260502152304.560a5954@robin>

On Sat, 2 May 2026 15:23:04 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Hi Masami,
> 
> I applied your patches and enabled your ptracingtest code. I noticed
> that when there are dropped pages, the trace output is not in order:
> 
>  # trace-cmd start -B ptracingtest -e all -v -e '*lock*'
>  # taskset -c 5 echo c > /proc/sysrq-trigger
> 
> On reboot, I ran:
> 
>  # trace-cmd show -B ptracingtest > /tmp/trace.out
> 
> Then executed the attached perl program:
> 
>   # ./read-ts.pl < /tmp/trace.out
> 
> And it errors out:
> 
>  30.212495 < 30.213534
>            <...>-1048    [005] d....    30.212495: irq_enable: caller=irqentry_exit+0xf5/0x710 parent=0x0
> 
> That is, I think the zero timestamps may be messing with the order.
> 

Ah, I think I found the problem. The iterator needs the same logic you
added for the consuming read:

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7bfbed0ac90c..90a7fa772fe3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6105,12 +6105,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
 	struct ring_buffer_per_cpu *cpu_buffer;
 	struct ring_buffer_event *event;
 	int nr_loops = 0;
+	int max_loops;
 
 	if (ts)
 		*ts = 0;
 
 	cpu_buffer = iter->cpu_buffer;
 	buffer = cpu_buffer->buffer;
+	max_loops = cpu_buffer->ring_meta ? cpu_buffer->nr_pages : 3;
 
 	/*
 	 * Check if someone performed a consuming read to the buffer
@@ -6133,7 +6135,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
 	 * the ring buffer with an active write as the consumer is.
 	 * Do not warn if the three failures is reached.
 	 */
-	if (++nr_loops > 3)
+	if (++nr_loops > max_loops)
 		return NULL;
 
 	if (rb_per_cpu_empty(cpu_buffer))


I'll test this some more, and make a proper patch.

-- Steve


^ permalink raw reply related

* Re: [PATCH] fprobe: Add unregister_fprobe_sync() for synchronous unregistration
From: kernel test robot @ 2026-05-03  3:25 UTC (permalink / raw)
  To: Masami Hiramatsu (Google), Steven Rostedt
  Cc: llvm, oe-kbuild-all, Mathieu Desnoyers, Jonathan Corbet,
	linux-kernel, linux-trace-kernel, linux-doc
In-Reply-To: <177729179863.401400.6063130067239479972.stgit@mhiramat.tok.corp.google.com>

Hi Masami,

kernel test robot noticed the following build errors:

[auto build test ERROR on trace/for-next]
[cannot apply to linus/master v7.1-rc1 next-20260430]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Masami-Hiramatsu-Google/fprobe-Add-unregister_fprobe_sync-for-synchronous-unregistration/20260427-214258
base:   https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace for-next
patch link:    https://lore.kernel.org/r/177729179863.401400.6063130067239479972.stgit%40mhiramat.tok.corp.google.com
patch subject: [PATCH] fprobe: Add unregister_fprobe_sync() for synchronous unregistration
config: s390-allmodconfig (https://download.01.org/0day-ci/archive/20260503/202605031133.LJkoT4xo-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260503/202605031133.LJkoT4xo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605031133.LJkoT4xo-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/trace/fprobe.c:983:14: error: call to undeclared function 'fprobe_registered'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     983 |         if (!fp || !fprobe_registered(fp))
         |                     ^
>> kernel/trace/fprobe.c:986:8: error: call to undeclared function 'unregister_fprobe_nolock'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     986 |         ret = unregister_fprobe_nolock(fp);
         |               ^
   kernel/trace/fprobe.c:986:8: note: did you mean 'unregister_fprobe_sync'?
   kernel/trace/fprobe.c:978:5: note: 'unregister_fprobe_sync' declared here
     978 | int unregister_fprobe_sync(struct fprobe *fp)
         |     ^
     979 | {
     980 |         int ret;
     981 | 
     982 |         guard(mutex)(&fprobe_mutex);
     983 |         if (!fp || !fprobe_registered(fp))
     984 |                 return -EINVAL;
     985 | 
     986 |         ret = unregister_fprobe_nolock(fp);
         |               ~~~~~~~~~~~~~~~~~~~~~~~~
         |               unregister_fprobe_sync
   2 errors generated.


vim +/fprobe_registered +983 kernel/trace/fprobe.c

   967	
   968	/**
   969	 * unregister_fprobe_sync() - Unregister fprobe synchronously with RCU grace period.
   970	 * @fp: A fprobe data structure to be unregistered.
   971	 *
   972	 * Unregister fprobe (and remove ftrace hooks from the function entries) and
   973	 * wait for the RCU grace period to finish. This is useful for preventing
   974	 * the fprobe from being used after it is unregistered.
   975	 *
   976	 * Return 0 if @fp is unregistered successfully, -errno if not.
   977	 */
   978	int unregister_fprobe_sync(struct fprobe *fp)
   979	{
   980		int ret;
   981	
   982		guard(mutex)(&fprobe_mutex);
 > 983		if (!fp || !fprobe_registered(fp))
   984			return -EINVAL;
   985	
 > 986		ret = unregister_fprobe_nolock(fp);
   987		if (ret)
   988			return ret;
   989	
   990		synchronize_rcu();
   991		return 0;
   992	}
   993	EXPORT_SYMBOL_GPL(unregister_fprobe_sync);
   994	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [RFC PATCH] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Aaron Tomlin @ 2026-05-03  3:52 UTC (permalink / raw)
  To: corbet, song, kpsingh, mattbobrowski, ast, daniel, andrii,
	eddyz87, memxor, rostedt, mhiramat
  Cc: skhan, jolsa, martin.lau, yonghong.song, mathieu.desnoyers,
	atomlin, neelx, sean, chjohnst, steve, mproche, nick.lange,
	linux-doc, linux-kernel, bpf, linux-trace-kernel

The primary remit of the eBPF verifier is to ensure that eBPF programs
can neither crash the kernel nor corrupt memory. Nevertheless,
administrative utilities such as "bpftrace --unsafe" permit the loading
of programs that employ destructive or mutating helpers, most notably
bpf_probe_write_user() and bpf_override_return().

Since commit b28573ebfabe ("bpf: Remove bpf_probe_write_user() warning
message"), the kernel no longer issues a warning when an attempt is made to
invoke such destructive helpers.

Consequently, this patch introduces a novel kernel taint flag,
TAINT_UNSAFE_BPF ("V"). Tainting the kernel establishes a permanent and
readily auditable indicator (i.e., /proc/sys/kernel/tainted) to alert
maintainers that the kernel's execution flow or user memory may have
been compromised by an eBPF program.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 Documentation/admin-guide/tainted-kernels.rst | 54 ++++++++++---------
 include/linux/panic.h                         |  3 +-
 kernel/panic.c                                |  1 +
 kernel/trace/bpf_trace.c                      |  3 ++
 4 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 9ead927a37c0..630f24996e7b 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -79,30 +79,31 @@ which bits are set::
 Table for decoding tainted state
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-===  ===  ======  ========================================================
-Bit  Log  Number  Reason that got the kernel tainted
-===  ===  ======  ========================================================
-  0  G/P       1  proprietary module was loaded
-  1  _/F       2  module was force loaded
-  2  _/S       4  kernel running on an out of specification system
-  3  _/R       8  module was force unloaded
-  4  _/M      16  processor reported a Machine Check Exception (MCE)
-  5  _/B      32  bad page referenced or some unexpected page flags
-  6  _/U      64  taint requested by userspace application
-  7  _/D     128  kernel died recently, i.e. there was an OOPS or BUG
-  8  _/A     256  ACPI table overridden by user
-  9  _/W     512  kernel issued warning
- 10  _/C    1024  staging driver was loaded
- 11  _/I    2048  workaround for bug in platform firmware applied
- 12  _/O    4096  externally-built ("out-of-tree") module was loaded
- 13  _/E    8192  unsigned module was loaded
- 14  _/L   16384  soft lockup occurred
- 15  _/K   32768  kernel has been live patched
- 16  _/X   65536  auxiliary taint, defined for and used by distros
- 17  _/T  131072  kernel was built with the struct randomization plugin
- 18  _/N  262144  an in-kernel test has been run
- 19  _/J  524288  userspace used a mutating debug operation in fwctl
-===  ===  ======  ========================================================
+===  ===   ======  ========================================================
+Bit  Log   Number  Reason that got the kernel tainted
+===  ===   ======  ========================================================
+  0  G/P        1  proprietary module was loaded
+  1  _/F        2  module was force loaded
+  2  _/S        4  kernel running on an out of specification system
+  3  _/R        8  module was force unloaded
+  4  _/M       16  processor reported a Machine Check Exception (MCE)
+  5  _/B       32  bad page referenced or some unexpected page flags
+  6  _/U       64  taint requested by userspace application
+  7  _/D      128  kernel died recently, i.e. there was an OOPS or BUG
+  8  _/A      256  ACPI table overridden by user
+  9  _/W      512  kernel issued warning
+ 10  _/C     1024  staging driver was loaded
+ 11  _/I     2048  workaround for bug in platform firmware applied
+ 12  _/O     4096  externally-built ("out-of-tree") module was loaded
+ 13  _/E     8192  unsigned module was loaded
+ 14  _/L    16384  soft lockup occurred
+ 15  _/K    32768  kernel has been live patched
+ 16  _/X    65536  auxiliary taint, defined for and used by distros
+ 17  _/T   131072  kernel was built with the struct randomization plugin
+ 18  _/N   262144  an in-kernel test has been run
+ 19  _/J   524288  userspace used a mutating debug operation in fwctl
+ 20  _/V  1048576  an unsafe eBPF program (mutating helper) was loaded
+===  ===  =======  ========================================================
 
 Note: The character ``_`` is representing a blank in this table to make reading
 easier.
@@ -189,3 +190,8 @@ More detailed explanation for tainting
  19) ``J`` if userspace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE
      to use the devices debugging features. Device debugging features could
      cause the device to malfunction in undefined ways.
+
+ 20) ``V`` if an eBPF program utilising unsafe, mutating helpers (such as
+     bpf_probe_write_user() or bpf_override_return()) was loaded. These helpers
+     bypass standard eBPF safety guarantees and can alter execution flow or
+     corrupt memory.
diff --git a/include/linux/panic.h b/include/linux/panic.h
index f1dd417e54b2..8622c02c2c24 100644
--- a/include/linux/panic.h
+++ b/include/linux/panic.h
@@ -88,7 +88,8 @@ static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
 #define TAINT_RANDSTRUCT		17
 #define TAINT_TEST			18
 #define TAINT_FWCTL			19
-#define TAINT_FLAGS_COUNT		20
+#define TAINT_UNSAFE_BPF		20
+#define TAINT_FLAGS_COUNT		21
 #define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
 
 struct taint_flag {
diff --git a/kernel/panic.c b/kernel/panic.c
index 20feada5319d..1ae19bd8fc1d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -825,6 +825,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
 	TAINT_FLAG(RANDSTRUCT,			'T', ' '),
 	TAINT_FLAG(TEST,			'N', ' '),
 	TAINT_FLAG(FWCTL,			'J', ' '),
+	TAINT_FLAG(UNSAFE_BPF,			'V', ' '),
 };
 
 #undef TAINT_FLAG
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index af7079aa0f36..4e7e5bf76dcb 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -155,6 +155,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
 BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
 {
+	add_taint(TAINT_UNSAFE_BPF, LOCKDEP_STILL_OK);
 	regs_set_return_value(regs, rc);
 	override_function_with_return(regs);
 	return 0;
@@ -344,6 +345,8 @@ BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
 	if (unlikely(!nmi_uaccess_okay()))
 		return -EPERM;
 
+	add_taint(TAINT_UNSAFE_BPF, LOCKDEP_STILL_OK);
+
 	return copy_to_user_nofault(unsafe_ptr, src, size);
 }
 
-- 
2.51.0


^ permalink raw reply related

* Re: [RFC PATCH] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Randy Dunlap @ 2026-05-03  4:29 UTC (permalink / raw)
  To: Aaron Tomlin, corbet, song, kpsingh, mattbobrowski, ast, daniel,
	andrii, eddyz87, memxor, rostedt, mhiramat
  Cc: skhan, jolsa, martin.lau, yonghong.song, mathieu.desnoyers, neelx,
	sean, chjohnst, steve, mproche, nick.lange, linux-doc,
	linux-kernel, bpf, linux-trace-kernel
In-Reply-To: <20260503035220.520479-1-atomlin@atomlin.com>

Hi,

On 5/2/26 8:52 PM, Aaron Tomlin wrote:
> The primary remit of the eBPF verifier is to ensure that eBPF programs
> can neither crash the kernel nor corrupt memory. Nevertheless,
> administrative utilities such as "bpftrace --unsafe" permit the loading
> of programs that employ destructive or mutating helpers, most notably
> bpf_probe_write_user() and bpf_override_return().
> 
> Since commit b28573ebfabe ("bpf: Remove bpf_probe_write_user() warning
> message"), the kernel no longer issues a warning when an attempt is made to
> invoke such destructive helpers.
> 
> Consequently, this patch introduces a novel kernel taint flag,
> TAINT_UNSAFE_BPF ("V"). Tainting the kernel establishes a permanent and
> readily auditable indicator (i.e., /proc/sys/kernel/tainted) to alert
> maintainers that the kernel's execution flow or user memory may have
> been compromised by an eBPF program.
> 
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
>  Documentation/admin-guide/tainted-kernels.rst | 54 ++++++++++---------
>  include/linux/panic.h                         |  3 +-
>  kernel/panic.c                                |  1 +
>  kernel/trace/bpf_trace.c                      |  3 ++
>  4 files changed, 36 insertions(+), 25 deletions(-)
> 
> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> index 9ead927a37c0..630f24996e7b 100644
> --- a/Documentation/admin-guide/tainted-kernels.rst
> +++ b/Documentation/admin-guide/tainted-kernels.rst
> @@ -79,30 +79,31 @@ which bits are set::
>  Table for decoding tainted state
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> -===  ===  ======  ========================================================
> -Bit  Log  Number  Reason that got the kernel tainted
> -===  ===  ======  ========================================================
> -  0  G/P       1  proprietary module was loaded
> -  1  _/F       2  module was force loaded
> -  2  _/S       4  kernel running on an out of specification system
> -  3  _/R       8  module was force unloaded
> -  4  _/M      16  processor reported a Machine Check Exception (MCE)
> -  5  _/B      32  bad page referenced or some unexpected page flags
> -  6  _/U      64  taint requested by userspace application
> -  7  _/D     128  kernel died recently, i.e. there was an OOPS or BUG
> -  8  _/A     256  ACPI table overridden by user
> -  9  _/W     512  kernel issued warning
> - 10  _/C    1024  staging driver was loaded
> - 11  _/I    2048  workaround for bug in platform firmware applied
> - 12  _/O    4096  externally-built ("out-of-tree") module was loaded
> - 13  _/E    8192  unsigned module was loaded
> - 14  _/L   16384  soft lockup occurred
> - 15  _/K   32768  kernel has been live patched
> - 16  _/X   65536  auxiliary taint, defined for and used by distros
> - 17  _/T  131072  kernel was built with the struct randomization plugin
> - 18  _/N  262144  an in-kernel test has been run
> - 19  _/J  524288  userspace used a mutating debug operation in fwctl
> -===  ===  ======  ========================================================
> +===  ===   ======  ========================================================
> +Bit  Log   Number  Reason that got the kernel tainted
> +===  ===   ======  ========================================================
> +  0  G/P        1  proprietary module was loaded
> +  1  _/F        2  module was force loaded
> +  2  _/S        4  kernel running on an out of specification system
> +  3  _/R        8  module was force unloaded
> +  4  _/M       16  processor reported a Machine Check Exception (MCE)
> +  5  _/B       32  bad page referenced or some unexpected page flags
> +  6  _/U       64  taint requested by userspace application
> +  7  _/D      128  kernel died recently, i.e. there was an OOPS or BUG
> +  8  _/A      256  ACPI table overridden by user
> +  9  _/W      512  kernel issued warning
> + 10  _/C     1024  staging driver was loaded
> + 11  _/I     2048  workaround for bug in platform firmware applied
> + 12  _/O     4096  externally-built ("out-of-tree") module was loaded
> + 13  _/E     8192  unsigned module was loaded
> + 14  _/L    16384  soft lockup occurred
> + 15  _/K    32768  kernel has been live patched
> + 16  _/X    65536  auxiliary taint, defined for and used by distros
> + 17  _/T   131072  kernel was built with the struct randomization plugin
> + 18  _/N   262144  an in-kernel test has been run
> + 19  _/J   524288  userspace used a mutating debug operation in fwctl
> + 20  _/V  1048576  an unsafe eBPF program (mutating helper) was loaded
> +===  ===  =======  ========================================================
>  
>  Note: The character ``_`` is representing a blank in this table to make reading
>  easier.
> @@ -189,3 +190,8 @@ More detailed explanation for tainting
>   19) ``J`` if userspace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE
>       to use the devices debugging features. Device debugging features could
>       cause the device to malfunction in undefined ways.
> +
> + 20) ``V`` if an eBPF program utilising unsafe, mutating helpers (such as
> +     bpf_probe_write_user() or bpf_override_return()) was loaded. These helpers
> +     bypass standard eBPF safety guarantees and can alter execution flow or
> +     corrupt memory.

(If this patch goes forward:)

In this same file (above), there is a little script around line 77 where
it should be changed:
s/20/21/

Also please update tools/debugging/kernel-chktaint for this taint flag.

-- 
~Randy


^ permalink raw reply

* Re: [RFC PATCH] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: bot+bpf-ci @ 2026-05-03  4:44 UTC (permalink / raw)
  To: atomlin, corbet, song, kpsingh, mattbobrowski, ast, daniel,
	andrii, eddyz87, memxor, rostedt, mhiramat
  Cc: skhan, jolsa, martin.lau, yonghong.song, mathieu.desnoyers,
	atomlin, neelx, sean, chjohnst, steve, mproche, nick.lange,
	linux-doc, linux-kernel, bpf, linux-trace-kernel, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260503035220.520479-1-atomlin@atomlin.com>


> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> index 9ead927a37c0..630f24996e7b 100644
> --- a/Documentation/admin-guide/tainted-kernels.rst
> +++ b/Documentation/admin-guide/tainted-kernels.rst
> @@ -79,30 +79,31 @@ which bits are set::
>  Table for decoding tainted state
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -===  ===  ======  ========================================================
> -Bit  Log  Number  Reason that got the kernel tainted
> -===  ===  ======  ========================================================

[ ... ]

> + 19  _/J   524288  userspace used a mutating debug operation in fwctl
> + 20  _/V  1048576  an unsafe eBPF program (mutating helper) was loaded
> +===  ===  =======  ========================================================

This isn't a bug, but the bottom border has 7 equals signs (=======) for
the Number column while the top and header borders have 6 (======). Will
this cause a Sphinx 'Malformed table' warning when rendering the
documentation?

Earlier in the same file, around line 77, there's a shell command example
for checking taint bits. With the addition of bit 20, shouldn't that
command be updated from 'seq 20' to 'seq 21' so it checks all 21 bits
(0-20)?

The current command only checks bits 0-19:

    $ for i in $(seq 20); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done

Since 'seq 20' produces 1..20, and the expression uses $(($i-1)), it only
computes bit positions 0..19, missing the newly added bit 20.
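
To make the off-by-one concrete, here is a minimal userspace C sketch (not
part of the patch) that walks every bit position from 0 through 20. The
TAINT_FLAGS_COUNT value mirrors the patched include/linux/panic.h; the
function names taint_bit_set() and print_taint_bits() are made up for
illustration only.

```c
#include <stdio.h>

/* Assumption: mirrors TAINT_FLAGS_COUNT after the patch (bits 0..20). */
#define TAINT_FLAGS_COUNT 21

/* Return 1 if the given taint bit is set in mask, else 0. */
static int taint_bit_set(unsigned long mask, unsigned int bit)
{
	return (int)((mask >> bit) & 1UL);
}

/*
 * Print "<bit> <state>" for every bit position 0..TAINT_FLAGS_COUNT-1,
 * like the documented shell loop would with 'seq 21'.
 * Returns the number of bits that are set.
 */
static int print_taint_bits(unsigned long mask)
{
	int set = 0;

	for (unsigned int bit = 0; bit < TAINT_FLAGS_COUNT; bit++) {
		int v = taint_bit_set(mask, bit);

		printf("%u %d\n", bit, v);
		set += v;
	}
	return set;
}
```

Called with the value read from /proc/sys/kernel/tainted, this prints 21
lines, one per bit. A loop bound of 20 would stop at bit 19 and never
report the new flag.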

> @@ -189,3 +190,8 @@ More detailed explanation for tainting
>   19) ``J`` if userspace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE
>       to use the devices debugging features. Device debugging features could
>       cause the device to malfunction in undefined ways.
> +
> + 20) ``V`` if an eBPF program utilising unsafe, mutating helpers (such as
> +     bpf_probe_write_user() or bpf_override_return()) was loaded. These helpers
> +     bypass standard eBPF safety guarantees and can alter execution flow or
> +     corrupt memory.
> diff --git a/include/linux/panic.h b/include/linux/panic.h
> index f1dd417e54b2..8622c02c2c24 100644
> --- a/include/linux/panic.h
> +++ b/include/linux/panic.h
> @@ -88,7 +88,8 @@ static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
>  #define TAINT_RANDSTRUCT		17
>  #define TAINT_TEST			18
>  #define TAINT_FWCTL			19
> -#define TAINT_FLAGS_COUNT		20
> +#define TAINT_UNSAFE_BPF		20
> +#define TAINT_FLAGS_COUNT		21
>  #define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
>
>  struct taint_flag {
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 20feada5319d..1ae19bd8fc1d 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -825,6 +825,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
>  	TAINT_FLAG(RANDSTRUCT,			'T', ' '),
>  	TAINT_FLAG(TEST,			'N', ' '),
>  	TAINT_FLAG(FWCTL,			'J', ' '),
> +	TAINT_FLAG(UNSAFE_BPF,			'V', ' '),
>  };
>
>  #undef TAINT_FLAG

There's a comment earlier in this file that explicitly states: "if you
modify the taint_flags or TAINT_FLAGS_COUNT, please also modify
tools/debugging/kernel-chktaint and
Documentation/admin-guide/tainted-kernels.rst".

While the Documentation was updated, was tools/debugging/kernel-chktaint
also updated? The script currently processes flags 0-19 and ends with the
J flag check. Without adding a section to check bit 20, the script won't
decode the 'V' character in the taint string output when
TAINT_UNSAFE_BPF is set.

The script would need something like:

    T=`expr $T / 2`
    if [ `expr $T % 2` -eq 0 ]; then
        addout " "
    else
        addout "V"
        echo " * an unsafe eBPF program (mutating helper) was loaded (#20)"
    fi
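
For readers who prefer C to shell, the same halve-and-test decoding can be
sketched as below. This is only an illustration, not a proposed change to
kernel-chktaint: the letters are copied from the table quoted earlier in
this thread, which shows bits 4 through 20 only, so bits 0..3 are skipped.
The names taint_letters_4_to_20 and taint_string() are hypothetical.

```c
#include <stddef.h>

/*
 * Letters for bits 4..20, taken from the table quoted in this thread.
 * Bits 0..3 are not shown in that excerpt and are skipped here.
 */
static const char taint_letters_4_to_20[] = "MBUDAWCIOELKXTNJV";

#define FIRST_BIT 4
#define LAST_BIT  20

/*
 * Fill out[] (at least LAST_BIT - FIRST_BIT + 2 bytes) with one character
 * per bit: the flag letter when the bit is set, a space otherwise.
 * Returns out, NUL-terminated.
 */
static char *taint_string(unsigned long mask, char *out)
{
	size_t n = 0;
	unsigned long t = mask >> FIRST_BIT;

	for (int bit = FIRST_BIT; bit <= LAST_BIT; bit++) {
		out[n] = (t % 2) ? taint_letters_4_to_20[n] : ' ';
		n++;
		t /= 2;	/* same halving step as "expr $T / 2" in the script */
	}
	out[n] = '\0';
	return out;
}
```

With a mask of 1048576 (only bit 20 set) the last character of the
resulting string is 'V' and every earlier position is a space.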

> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index a822c589c9bd..3f893d9ea764 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -183,6 +183,7 @@ unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx)
>  #ifdef CONFIG_BPF_KPROBE_OVERRIDE
>  BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
>  {
> +	add_taint(TAINT_UNSAFE_BPF, LOCKDEP_STILL_OK);
>  	regs_set_return_value(regs, rc);
>  	override_function_with_return(regs);
>  	return 0;

Since bpf_override_return is a helper invoked at runtime and can execute
frequently on hot kprobes, will calling add_taint() here cause severe
cache line bouncing?

Looking at add_taint(), it unconditionally calls set_bit() on the global
tainted_mask.
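
If the taint does stay in the helper, one generic way to keep a hot path
from writing the same shared word on every call is to test the bit before
setting it. The following is a userspace sketch using C11 atomics, not
kernel code: tainted_mask_sketch and taint_once() are hypothetical names,
and whether add_taint() or its call site should do this (or whether the
taint should move to load time instead) is a question for the patch
author.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's global taint mask. */
static atomic_ulong tainted_mask_sketch;

/*
 * Set the given bit only if it is not already set.
 * Returns true if this call performed the write, false if the bit was
 * already set (by an earlier call or by a racing CPU).
 */
static bool taint_once(unsigned int bit)
{
	unsigned long flag = 1UL << bit;

	/* Read-only fast path: no store, so the cache line can stay shared. */
	if (atomic_load_explicit(&tainted_mask_sketch,
				 memory_order_relaxed) & flag)
		return false;

	/* fetch_or returns the previous value, so a racing setter is detected. */
	return !(atomic_fetch_or(&tainted_mask_sketch, flag) & flag);
}
```

After the first call the function only loads the mask, so repeated
invocations from many CPUs no longer contend on a write to the same word.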

Also, the commit message and documentation mention the system is tainted
when an unsafe eBPF program 'was loaded'. By applying the taint here
during execution instead of at program load time (e.g., in the verifier),
could this allow an unsafe BPF program to remain dormant and avoid
detection until a specific trigger occurs?

> @@ -372,6 +373,8 @@ BPF_CALL_3(bpf_probe_write_user, void __user *, unsafe_ptr, const void *, src,
>  	if (unlikely(!nmi_uaccess_okay()))
>  		return -EPERM;
>
> +	add_taint(TAINT_UNSAFE_BPF, LOCKDEP_STILL_OK);
> +
>  	return copy_to_user_nofault(unsafe_ptr, src, size);
>  }

Similarly here, will invoking add_taint() on every execution of
bpf_probe_write_user() cause significant performance degradation when
executed concurrently across multiple CPUs?

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25269516443


* Re: [PATCH] blktrace: reject buf_size smaller than blk_io_trace
From: Deepanshu Kartikey @ 2026-05-03  5:13 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: linux-block, linux-kernel, linux-trace-kernel,
	syzbot+ed8bc247f231c1a48e21
In-Reply-To: <20260322051838.1137822-1-kartikey406@gmail.com>

On Sun, Mar 22, 2026 at 10:48 AM Deepanshu Kartikey
<kartikey406@gmail.com> wrote:
>
> blk_trace_setup() accepts any non-zero buf_size.
> If buf_size < sizeof(struct blk_io_trace), relay_reserve()
> always returns NULL and all trace events are silently dropped.
>
> Reject such values early with -EINVAL.
>
> Reported-by: syzbot+ed8bc247f231c1a48e21@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=ed8bc247f231c1a48e21
> Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
> ---
>  kernel/trace/blktrace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 8cd2520b4c99..6cc7d83ed1c2 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -773,7 +773,7 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>         if (ret)
>                 return -EFAULT;
>
> -       if (!buts.buf_size || !buts.buf_nr)
> +       if (buts.buf_size < sizeof(struct blk_io_trace) || !buts.buf_nr)
>                 return -EINVAL;
>
>         buts2 = (struct blk_user_trace_setup2) {
> --
> 2.43.0
>

Gentle ping on this patch. Please let me know its status.

Thanks


* Re: [PATCH] blktrace: reject buf_size smaller than blk_io_trace
From: Bart Van Assche @ 2026-05-03  5:52 UTC (permalink / raw)
  To: Deepanshu Kartikey, axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: linux-block, linux-kernel, linux-trace-kernel,
	syzbot+ed8bc247f231c1a48e21
In-Reply-To: <20260322051838.1137822-1-kartikey406@gmail.com>

On 3/22/26 6:18 AM, Deepanshu Kartikey wrote:
> Closes: https://syzkaller.appspot.com/bug?extid=ed8bc247f231c1a48e21
> [ ... ] 
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 8cd2520b4c99..6cc7d83ed1c2 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -773,7 +773,7 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>   	if (ret)
>   		return -EFAULT;
>   
> -	if (!buts.buf_size || !buts.buf_nr)
> +	if (buts.buf_size < sizeof(struct blk_io_trace) || !buts.buf_nr)
>   		return -EINVAL;
>   
>   	buts2 = (struct blk_user_trace_setup2) {

Why sizeof(struct blk_io_trace) instead of sizeof(struct blk_io_trace2)?
Even sizeof(struct blk_io_trace2) is too small if any additional data is
included.
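
The second point can be illustrated with a small userspace sketch: a
sub-buffer has to hold the fixed header plus whatever payload follows it,
so comparing buf_size against the header size alone gives only a lower
bound. struct trace_hdr_sketch and record_fits() are hypothetical
stand-ins; the real struct blk_io_trace2 layout is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical fixed-size trace header, for illustration only.
 * This is NOT the layout of struct blk_io_trace2.
 */
struct trace_hdr_sketch {
	uint32_t magic;
	uint32_t sequence;
	uint64_t time;
	uint16_t pdu_len;
};

/* A record fits only if the header plus its trailing payload fit. */
static bool record_fits(size_t buf_size, size_t pdu_len)
{
	/* Guard against size_t overflow in the addition below. */
	if (pdu_len > SIZE_MAX - sizeof(struct trace_hdr_sketch))
		return false;

	return buf_size >= sizeof(struct trace_hdr_sketch) + pdu_len;
}
```

A buffer of exactly sizeof(struct trace_hdr_sketch) bytes passes a
header-only check but cannot hold a record that carries even one byte of
additional data.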

Additionally, how can this patch fix the issue mentioned in the linked 
syzbot report? Is the syzbot link correct? From the syzbot report:

Oops: general protection fault, probably for non-canonical address 
0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:bvec_set_page include/linux/bvec.h:44 [inline]
RIP: 0010:__bio_add_page block/bio.c:992 [inline]
RIP: 0010:bio_add_page+0x462/0x6e0 block/bio.c:1048
Call Trace:
  <TASK>
  bio_add_folio+0x64/0x90 block/bio.c:1084
  io_submit_add_bh fs/ext4/page-io.c:465 [inline]
  ext4_bio_write_folio+0x1446/0x1ea0 fs/ext4/page-io.c:603
  mpage_map_and_submit_buffers fs/ext4/inode.c:2326 [inline]
  mpage_map_and_submit_extent fs/ext4/inode.c:2516 [inline]
  ext4_do_writepages+0x207e/0x46e0 fs/ext4/inode.c:2928
  ext4_writepages+0x241/0x3b0 fs/ext4/inode.c:3022
  do_writepages+0x32e/0x550 mm/page-writeback.c:2554
  __writeback_single_inode+0x133/0x11a0 fs/fs-writeback.c:1750
  writeback_sb_inodes+0x992/0x1a20 fs/fs-writeback.c:2042
  __writeback_inodes_wb+0x111/0x240 fs/fs-writeback.c:2118
  wb_writeback+0x46a/0xb70 fs/fs-writeback.c:2229
  wb_check_start_all fs/fs-writeback.c:2355 [inline]
  wb_do_writeback fs/fs-writeback.c:2381 [inline]
  wb_workfn+0x95b/0xf50 fs/fs-writeback.c:2414
  process_one_work+0x9ab/0x1780 kernel/workqueue.c:3288
  process_scheduled_works kernel/workqueue.c:3379 [inline]
  worker_thread+0xba8/0x11e0 kernel/workqueue.c:3465
  kthread+0x388/0x470 kernel/kthread.c:436
  ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
  </TASK>

Thanks,

Bart.


* Re: [PATCH] blktrace: reject buf_size smaller than blk_io_trace
From: Deepanshu Kartikey @ 2026-05-03  8:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: axboe, rostedt, mhiramat, mathieu.desnoyers, linux-block,
	linux-kernel, linux-trace-kernel, syzbot+ed8bc247f231c1a48e21
In-Reply-To: <b568c125-d827-42f3-97ab-521a8648d917@acm.org>

On Sun, May 3, 2026 at 11:22 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 3/22/26 6:18 AM, Deepanshu Kartikey wrote:
> > Closes: https://syzkaller.appspot.com/bug?extid=ed8bc247f231c1a48e21
> > [ ... ]
> > diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> > index 8cd2520b4c99..6cc7d83ed1c2 100644
> > --- a/kernel/trace/blktrace.c
> > +++ b/kernel/trace/blktrace.c
> > @@ -773,7 +773,7 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
> >       if (ret)
> >               return -EFAULT;
> >
> > -     if (!buts.buf_size || !buts.buf_nr)
> > +     if (buts.buf_size < sizeof(struct blk_io_trace) || !buts.buf_nr)
> >               return -EINVAL;
> >
> >       buts2 = (struct blk_user_trace_setup2) {
>
> Why sizeof(struct blk_io_trace) instead of sizeof(struct blk_io_trace2)?
> Even sizeof(struct blk_io_trace2) is too small if any additional data is
> included.
>
> Additionally, how can this patch fix the issue mentioned in the linked
> syzbot report? Is the syzbot link correct? From the syzbot report:
>
> Oops: general protection fault, probably for non-canonical address
> 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> RIP: 0010:bvec_set_page include/linux/bvec.h:44 [inline]
> RIP: 0010:__bio_add_page block/bio.c:992 [inline]
> RIP: 0010:bio_add_page+0x462/0x6e0 block/bio.c:1048
> Call Trace:
>   <TASK>
>   bio_add_folio+0x64/0x90 block/bio.c:1084
>   io_submit_add_bh fs/ext4/page-io.c:465 [inline]
>   ext4_bio_write_folio+0x1446/0x1ea0 fs/ext4/page-io.c:603
>   mpage_map_and_submit_buffers fs/ext4/inode.c:2326 [inline]
>   mpage_map_and_submit_extent fs/ext4/inode.c:2516 [inline]
>   ext4_do_writepages+0x207e/0x46e0 fs/ext4/inode.c:2928
>   ext4_writepages+0x241/0x3b0 fs/ext4/inode.c:3022
>   do_writepages+0x32e/0x550 mm/page-writeback.c:2554
>   __writeback_single_inode+0x133/0x11a0 fs/fs-writeback.c:1750
>   writeback_sb_inodes+0x992/0x1a20 fs/fs-writeback.c:2042
>   __writeback_inodes_wb+0x111/0x240 fs/fs-writeback.c:2118
>   wb_writeback+0x46a/0xb70 fs/fs-writeback.c:2229
>   wb_check_start_all fs/fs-writeback.c:2355 [inline]
>   wb_do_writeback fs/fs-writeback.c:2381 [inline]
>   wb_workfn+0x95b/0xf50 fs/fs-writeback.c:2414
>   process_one_work+0x9ab/0x1780 kernel/workqueue.c:3288
>   process_scheduled_works kernel/workqueue.c:3379 [inline]
>   worker_thread+0xba8/0x11e0 kernel/workqueue.c:3465
>   kthread+0x388/0x470 kernel/kthread.c:436
>   ret_from_fork+0x51e/0xb90 arch/x86/kernel/process.c:158
>   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>   </TASK>
>
> Thanks,
>
> Bart.

Hi Bart,

Thank you for the review.

You are right on both points:

1. The minimum size check should use sizeof(struct blk_io_trace2), as it
   is the larger of the two structs. We will fix this in v2.

2. The connection between buf_size being too small and the ext4
   null-ptr-deref is not clearly established. We will remove the syzbot
   link from the commit message in v2.

Will send v2 shortly.

Thanks,
Deepanshu Kartikey


* [PATCH v2] blktrace: reject buf_size smaller than blk_io_trace2
From: Deepanshu Kartikey @ 2026-05-03  8:55 UTC (permalink / raw)
  To: axboe, rostedt, mhiramat, mathieu.desnoyers, bvanassche
  Cc: linux-block, linux-kernel, linux-trace-kernel, Deepanshu Kartikey,
	Deepanshu Kartikey

blk_trace_setup() accepts any non-zero buf_size from
userspace and passes it directly to relay_open(). If
buf_size is smaller than sizeof(struct blk_io_trace2),
relay_reserve() always returns NULL and all trace
events are silently dropped.

Reject such values early with -EINVAL.

Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
---
Changes in v2:
  - Use sizeof(struct blk_io_trace2) instead of
    sizeof(struct blk_io_trace) as it is the larger
    of the two structs
  - Remove incorrect syzbot link from commit message
---
 kernel/trace/blktrace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 8cd2520b4c99..20f941495151 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -773,7 +773,7 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (ret)
 		return -EFAULT;
 
-	if (!buts.buf_size || !buts.buf_nr)
+	if (buts.buf_size < sizeof(struct blk_io_trace2) || !buts.buf_nr)
 		return -EINVAL;
 
 	buts2 = (struct blk_user_trace_setup2) {
-- 
2.43.0



* [PATCH] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: qiwu.chen @ 2026-05-03  8:57 UTC (permalink / raw)
  To: akpm, rostedt, kasong, mhocko, hannes, david, ljs, baohua,
	mhiramat
  Cc: linux-mm, linux-trace-kernel, qiwu.chen

Currently, reclaim_flags always contains RECLAIM_WB_ASYNC in lru_shrink
tracepoints since commit 41ac1999c3e35 ("mm: vmscan: do not stall on
writeback during memory compaction"), which is useless for debugging
memory pressure issues. Other RECLAIM_WB_* flags are not used anywhere
else, so they can be directly removed.
This patch reworks the lru_shrink and write_folio tracepoints for better
correlation and analysis:
 - traces each folio lru type instead of reclaim_flags.
 - traces each lru_shrink with reason.

Fixes: 41ac1999c3e35 ("mm: vmscan: do not stall on writeback during memory compaction")
Signed-off-by: qiwu.chen <qiwu.chen@transsion.com>
---
 include/trace/events/vmscan.h | 65 +++++++++++++++--------------------
 mm/vmscan.c                   |  9 ++---
 2 files changed, 32 insertions(+), 42 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4445a8d9218d..d0a7fcd265e2 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -11,22 +11,6 @@
 #include <linux/memcontrol.h>
 #include <trace/events/mmflags.h>
 
-#define RECLAIM_WB_ANON		0x0001u
-#define RECLAIM_WB_FILE		0x0002u
-#define RECLAIM_WB_MIXED	0x0010u
-#define RECLAIM_WB_SYNC		0x0004u /* Unused, all reclaim async */
-#define RECLAIM_WB_ASYNC	0x0008u
-#define RECLAIM_WB_LRU		(RECLAIM_WB_ANON|RECLAIM_WB_FILE)
-
-#define show_reclaim_flags(flags)				\
-	(flags) ? __print_flags(flags, "|",			\
-		{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
-		{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"},	\
-		{RECLAIM_WB_MIXED,	"RECLAIM_WB_MIXED"},	\
-		{RECLAIM_WB_SYNC,	"RECLAIM_WB_SYNC"},	\
-		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
-		) : "RECLAIM_WB_NONE"
-
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
 #define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
@@ -51,10 +35,11 @@ TRACE_DEFINE_ENUM(KSWAPD_CLEAR_HOPELESS_PCP);
 	{KSWAPD_CLEAR_HOPELESS_PCP,	"PCP"},		\
 	{KSWAPD_CLEAR_HOPELESS_OTHER,	"OTHER"}
 
-#define trace_reclaim_flags(file) ( \
-	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
-	(RECLAIM_WB_ASYNC) \
-	)
+#define trace_reclaim_reason_ops		\
+	{PGSTEAL_KSWAPD,	"KSWAPD"},	\
+	{PGSTEAL_DIRECT,	"DIRECT"},	\
+	{PGSTEAL_KHUGEPAGED,	"KHUGEPAGED"}, \
+	{PGSTEAL_PROACTIVE,	"PROACTIVE"}
 
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
@@ -362,19 +347,17 @@ TRACE_EVENT(mm_vmscan_write_folio,
 
 	TP_STRUCT__entry(
 		__field(unsigned long, pfn)
-		__field(int, reclaim_flags)
+		__field(int, lru)
 	),
 
 	TP_fast_assign(
 		__entry->pfn = folio_pfn(folio);
-		__entry->reclaim_flags = trace_reclaim_flags(
-						folio_is_file_lru(folio));
+		__entry->lru = folio_lru_list(folio);
 	),
 
-	TP_printk("page=%p pfn=0x%lx flags=%s",
+	TP_printk("page=%p lru=%s",
 		pfn_to_page(__entry->pfn),
-		__entry->pfn,
-		show_reclaim_flags(__entry->reclaim_flags))
+		__print_symbolic(__entry->lru, LRU_NAMES))
 );
 
 TRACE_EVENT(mm_vmscan_reclaim_pages,
@@ -426,9 +409,9 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 
 	TP_PROTO(int nid,
 		unsigned long nr_scanned, unsigned long nr_reclaimed,
-		struct reclaim_stat *stat, int priority, int file),
+		struct reclaim_stat *stat, int priority, int lru, int reason),
 
-	TP_ARGS(nid, nr_scanned, nr_reclaimed, stat, priority, file),
+	TP_ARGS(nid, nr_scanned, nr_reclaimed, stat, priority, lru, reason),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
@@ -443,7 +426,8 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 		__field(unsigned long, nr_ref_keep)
 		__field(unsigned long, nr_unmap_fail)
 		__field(int, priority)
-		__field(int, reclaim_flags)
+		__field(int, lru)
+		__field(int, reason)
 	),
 
 	TP_fast_assign(
@@ -459,10 +443,11 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 		__entry->nr_ref_keep = stat->nr_ref_keep;
 		__entry->nr_unmap_fail = stat->nr_unmap_fail;
 		__entry->priority = priority;
-		__entry->reclaim_flags = trace_reclaim_flags(file);
+		__entry->lru = lru;
+		__entry->reason = reason;
 	),
 
-	TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d nr_activate_file=%d nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d flags=%s",
+	TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d nr_activate_file=%d nr_ref_keep=%ld nr_unmap_fail=%ld priority=%d lru=%s reason=%s",
 		__entry->nid,
 		__entry->nr_scanned, __entry->nr_reclaimed,
 		__entry->nr_dirty, __entry->nr_writeback,
@@ -470,16 +455,17 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
 		__entry->nr_activate0, __entry->nr_activate1,
 		__entry->nr_ref_keep, __entry->nr_unmap_fail,
 		__entry->priority,
-		show_reclaim_flags(__entry->reclaim_flags))
+		__print_symbolic(__entry->lru, LRU_NAMES),
+		__print_symbolic(__entry->reason, trace_reclaim_reason_ops))
 );
 
 TRACE_EVENT(mm_vmscan_lru_shrink_active,
 
 	TP_PROTO(int nid, unsigned long nr_taken,
 		unsigned long nr_active, unsigned long nr_deactivated,
-		unsigned long nr_referenced, int priority, int file),
+		unsigned long nr_referenced, int priority, int lru, int reason),
 
-	TP_ARGS(nid, nr_taken, nr_active, nr_deactivated, nr_referenced, priority, file),
+	TP_ARGS(nid, nr_taken, nr_active, nr_deactivated, nr_referenced, priority, lru, reason),
 
 	TP_STRUCT__entry(
 		__field(int, nid)
@@ -488,7 +474,8 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
 		__field(unsigned long, nr_deactivated)
 		__field(unsigned long, nr_referenced)
 		__field(int, priority)
-		__field(int, reclaim_flags)
+		__field(int, lru)
+		__field(int, reason)
 	),
 
 	TP_fast_assign(
@@ -498,15 +485,17 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active,
 		__entry->nr_deactivated = nr_deactivated;
 		__entry->nr_referenced = nr_referenced;
 		__entry->priority = priority;
-		__entry->reclaim_flags = trace_reclaim_flags(file);
+		__entry->lru = lru;
+		__entry->reason = reason;
 	),
 
-	TP_printk("nid=%d nr_taken=%ld nr_active=%ld nr_deactivated=%ld nr_referenced=%ld priority=%d flags=%s",
+	TP_printk("nid=%d nr_taken=%ld nr_active=%ld nr_deactivated=%ld nr_referenced=%ld priority=%d lru=%s reason=%s",
 		__entry->nid,
 		__entry->nr_taken,
 		__entry->nr_active, __entry->nr_deactivated, __entry->nr_referenced,
 		__entry->priority,
-		show_reclaim_flags(__entry->reclaim_flags))
+		__print_symbolic(__entry->lru, LRU_NAMES),
+		__print_symbolic(__entry->reason, trace_reclaim_reason_ops))
 );
 
 TRACE_EVENT(mm_vmscan_node_reclaim_begin,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..4ee84db91635 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2044,7 +2044,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		sc->nr.file_taken += nr_taken;
 
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
+			nr_scanned, nr_reclaimed, &stat, sc->priority, lru, item);
 	return nr_reclaimed;
 }
 
@@ -2151,7 +2151,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	lruvec_lock_irq(lruvec);
 	lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
 	trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
-			nr_deactivate, nr_rotated, sc->priority, file);
+			nr_deactivate, nr_rotated, sc->priority, lru,
+			PGSTEAL_KSWAPD + reclaimer_offset(sc));
 }
 
 static unsigned int reclaim_folio_list(struct list_head *folio_list,
@@ -4854,9 +4855,10 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
 	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
+	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			scanned, reclaimed, &stat, sc->priority,
-			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON, item);
 
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
@@ -4892,7 +4894,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
 					stat.nr_demoted);
 
-	item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
 	mod_lruvec_state(lruvec, item, reclaimed);
 	mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed);
 
-- 
2.25.1



* Re: [PATCH v2] blktrace: reject buf_size smaller than blk_io_trace2
From: Bart Van Assche @ 2026-05-03 11:08 UTC (permalink / raw)
  To: Deepanshu Kartikey, axboe, rostedt, mhiramat, mathieu.desnoyers
  Cc: linux-block, linux-kernel, linux-trace-kernel
In-Reply-To: <20260503085519.138360-1-kartikey406@gmail.com>

On 5/3/26 10:55 AM, Deepanshu Kartikey wrote:
> blk_trace_setup() accepts any non-zero buf_size from
> userspace and passes it directly to relay_open(). If
> buf_size is smaller than sizeof(struct blk_io_trace2),
> relay_reserve() always returns NULL and all trace
> events are silently dropped.

That's the intended behavior, isn't it?

> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 8cd2520b4c99..20f941495151 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -773,7 +773,7 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
>   	if (ret)
>   		return -EFAULT;
>   
> -	if (!buts.buf_size || !buts.buf_nr)
> +	if (buts.buf_size < sizeof(struct blk_io_trace2) || !buts.buf_nr)
>   		return -EINVAL;
>   
>   	buts2 = (struct blk_user_trace_setup2) {

We may be better off not changing this code because there may be users 
who rely on the current behavior and who will report this change in
behavior as a regression.

Thanks,

Bart.


* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Nico Pache @ 2026-05-03 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <20260424065828.031775921990de37f83a2468@linux-foundation.org>

On 4/24/26 7:58 AM, Andrew Morton wrote:
> On Sun, 19 Apr 2026 12:57:37 -0600 Nico Pache <npache@redhat.com> wrote:
>
>> The following series provides khugepaged with the capability to collapse
>> anonymous memory regions to mTHPs.
>
> Lots of stuff here:
>       https://sashiko.dev/#/patchset/20260419185750.260784-1-npache@redhat.com
>
> It's going to take some time.  Hopefully worthwhile.
>
> As always, it's useful to hear about the usefulness of the AI review.

Ok I went through those! Can you please pull the changes from the
staging branch so I can resend it soon (probably after LSFMM, so no
rush)?

Here's my report:

Patch 2 - rather useless, as memcg does not have per-order/mTHP stats

Patch 4 - good point, although kind of minor; same as what Usama brought up

Patch 5 - either the concern is nonsense or I'm too dumb to understand
it. Maybe someone else can confirm.

Patch 5.2 - Not a real concern (I don't think), given we've already
fully locked down the PTE table, nothing should be able to reach it.
It notes a specific config might be an issue, I will test with that
on.

Patch 9 - was a real concern, and my fault for semi-lazily stripping
out a variable without fully considering the effects. David noted this
too. Good news is that it got the reasoning for why it is bad correct.
Oddly, I did not see a bug during testing, which I would have expected
to show up in the madvise tests. I just reverted my changes; I will try
to clean this up in a future cleanup series... Although there may be
no good way around this madvise behavior.

Patch 10.1 - Good point, I had considered this during my design, but
then convinced myself I was incorrect. This actually saves us a lot of
heap space :) Gotta retest a lot though. First few tests show no issue

Patch 10.2 - Not a concern and if we made it here, it's already been
checked. Furthermore, the result would be the same. Although not a bad
question from the AI

Patch 10.3 - I don't think this is a valid concern at all

Patch 10.4 - I don't think this is a valid concern at all

Patch 10.5 - The first half is valid (although it's what the next
patch does), so it's not really that valid, it's just missing the
series context. second half hmm

Patch 10.6 - Not a real concern

Patch 11 - Not a bad consideration

Patch 12 - real bug from my last refactor

Patch 12.2 - Decent consideration, but not a real concern, just a design choice.

So yes overall very smart to check sashiko :) But as someone currently
actively working on sashiko I was already a fan.

Cheers,
-- Nico

>


* Re: [PATCH] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: Matthew Wilcox @ 2026-05-03 12:35 UTC (permalink / raw)
  To: qiwu.chen
  Cc: akpm, rostedt, kasong, mhocko, hannes, david, ljs, baohua,
	mhiramat, linux-mm, linux-trace-kernel, qiwu.chen
In-Reply-To: <20260503085705.3276-1-qiwu.chen@transsion.com>

On Sun, May 03, 2026 at 04:57:05PM +0800, qiwu.chen wrote:
> Currently, reclaim_flags always contains RECLAIM_WB_ASYNC in lru_shrink
> tracepoints since commit 41ac1999c3e35 ("mm: vmscan: do not stall on
> writeback during memory compaction"), which is useless for debugging
> memory pressure issues. Other RECLAIM_WB_* flags are not used anywhere
> else, so they can be directly removed.
> This patch reworks the lru_shrink and write_folio tracepoints for better
> correlation and analysis:
>  - traces each folio lru type instead of reclaim_flags.
>  - traces each lru_shrink with reason.

You also removed the printing of the folio's PFN.  Was this deliberate?
If so, it merits a mention in the commit description.

Also if you are going to do this (and I suspect we should do this!)
we don't need to do the folio -> pfn -> page conversion.  We can just
store the folio pointer and print out the folio pointer.



* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-05-03 13:21 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel, aarcange,
	anshuman.khandual, apopple, baohua, baolin.wang, byungchul,
	catalin.marinas, cl, corbet, dave.hansen, david, dev.jain, gourry,
	hannes, hughd, jack, jackmanb, jannh, jglisse, joshua.hahnjy, kas,
	lance.yang, Liam.Howlett, ljs, mathieu.desnoyers, matthew.brost,
	mhiramat, mhocko, peterx, pfalcato, rakie.kim, raquini, rdunlap,
	richard.weiyang, rientjes, rostedt, rppt, ryan.roberts, shivankg,
	sunnanyong, surenb, thomas.hellstrom, tiwai, usamaarif642, vbabka,
	vishal.moola, wangkefeng.wang, will, willy, yang, ying.huang, ziy,
	zokeefe
In-Reply-To: <CAA1CXcBJFoqDrkQbRE6JnpV-gjfNXe2sUxaCyXPC82h3qk9Jig@mail.gmail.com>

On Sun, 3 May 2026 06:23:31 -0600 Nico Pache <npache@redhat.com> wrote:

>  Can you please pull the changes from the
> staging branch so I can resend it soon (probably after LSFMM, so no
> rush)?

np, I've removed v16 from mm.git.


* Re: [PATCH 7.2 v16 00/13] khugepaged: mTHP support
From: Andrew Morton @ 2026-05-03 13:32 UTC (permalink / raw)
  To: Nico Pache, linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
	aarcange, anshuman.khandual, apopple, baohua, baolin.wang,
	byungchul, catalin.marinas, cl, corbet, dave.hansen, david,
	dev.jain, gourry, hannes, hughd, jack, jackmanb, jannh, jglisse,
	joshua.hahnjy, kas, lance.yang, Liam.Howlett, ljs,
	mathieu.desnoyers, matthew.brost, mhiramat, mhocko, peterx,
	pfalcato, rakie.kim, raquini, rdunlap, richard.weiyang, rientjes,
	rostedt, rppt, ryan.roberts, shivankg, sunnanyong, surenb,
	thomas.hellstrom, tiwai, usamaarif642, vbabka, vishal.moola,
	wangkefeng.wang, will, willy, yang, ying.huang, ziy, zokeefe
In-Reply-To: <20260503062109.0469201428642a4f7fcfd915@linux-foundation.org>

On Sun, 3 May 2026 06:21:09 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 3 May 2026 06:23:31 -0600 Nico Pache <npache@redhat.com> wrote:
> 
> >  Can you please pull the changes from the
> > staging branch so I can resend it soon (probably after LSFMM, so no
> > rush)?
> 
> np, I've removed v16 from mm.git.

And that messed up Zi Yan's "Remove CONFIG_READ_ONLY_THP_FOR_FS and
enable file THP for writable files", so I've restored v16.

Please prepare v17 against mm-unstable.


* Re: [PATCH] mm: vmscan: rework lru_shrink and write_folio tracepoints
From: chenqiwu @ 2026-05-03 14:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: qiwu.chen, akpm, rostedt, kasong, mhocko, hannes, david, ljs,
	baohua, mhiramat, linux-mm, linux-trace-kernel
In-Reply-To: <afdBCWopkfF2ow-Z@casper.infradead.org>

On Sun, May 03, 2026 at 01:35:21PM +0100, Matthew Wilcox wrote:
> 
> You also removed the printing of the folio's PFN.  Was this deliberate?
> If so, it merits a mention in the commit description.
> 
> Also if you are going to do this (and I suspect we should do this!)
> we don't need to do the folio -> pfn -> page conversion.  We can just
> store the folio pointer and print out the folio pointer.
>
Yes, I think printing out the PFN is indeed unnecessary, but I failed to
mention that in the commit message. I will remove the unnecessary
conversion and mention the PFN removal in patch v2.


* Re: [PATCH v2] Documentation/rv: Replace stale website link
From: Jonathan Corbet @ 2026-05-03 15:11 UTC (permalink / raw)
  To: Gabriele Monaco, rdunlap, Steven Rostedt, Gabriele Monaco,
	linux-trace-kernel, linux-doc, linux-kernel
  Cc: matteo.martelli, skhan
In-Reply-To: <20260427131709.170505-2-gmonaco@redhat.com>

Gabriele Monaco <gmonaco@redhat.com> writes:

> The sched monitor page was linking to Daniel's website which is now
> down. The main purpose of the link was to point to a source for the
> models from the original author and that can be found also in his
> published paper.
>
> Replace the link with a reference to Daniel's "A thread synchronization
> model for the PREEMPT_RT Linux kernel" which can be found online and
> includes the models definitions as well as the work behind them (not the
> original patches but since they're based on a 5.0 kernel and are mostly
> included upstream, there's little value in keeping them in the docs).
>
> Fixes: 03abeaa63c08 ("Documentation/rv: Add docs for the sched monitors")
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
> V2: Add link to the PDF and fixed RST references
>
>  Documentation/trace/rv/monitor_sched.rst | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Applied, thanks.

jon


* Re: [RFC PATCH] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Aaron Tomlin @ 2026-05-03 15:21 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: corbet, song, kpsingh, mattbobrowski, ast, daniel, andrii,
	eddyz87, memxor, rostedt, mhiramat, skhan, jolsa, martin.lau,
	yonghong.song, mathieu.desnoyers, neelx, sean, chjohnst, steve,
	mproche, nick.lange, linux-doc, linux-kernel, bpf,
	linux-trace-kernel, martin.lau, clm, ihor.solodrai
In-Reply-To: <78ecdb5fbbc2bba3e5d0d001c499c9011ac2e5d7dca29f60b0736c4f70c29d9f@mail.kernel.org>


On Sun, May 03, 2026 at 04:44:33AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> > index 9ead927a37c0..630f24996e7b 100644
> > --- a/Documentation/admin-guide/tainted-kernels.rst
> > +++ b/Documentation/admin-guide/tainted-kernels.rst
> > @@ -79,30 +79,31 @@ which bits are set::
> >  Table for decoding tainted state
> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > -===  ===  ======  ========================================================
> > -Bit  Log  Number  Reason that got the kernel tainted
> > -===  ===  ======  ========================================================
> 
> [ ... ]
> 
> > + 19  _/J   524288  userspace used a mutating debug operation in fwctl
> > + 20  _/V  1048576  an unsafe eBPF program (mutating helper) was loaded
> > +===  ===  =======  ========================================================
> 
> This isn't a bug, but the bottom border has 7 equals signs (=======) for
> the Number column while the top and header borders have 6 (======). Will
> this cause a Sphinx 'Malformed table' warning when rendering the
> documentation?

[ ... ]

> Earlier in the same file, around line 77, there's a shell command example
> for checking taint bits. With the addition of bit 20, shouldn't that
> command be updated from 'seq 20' to 'seq 21' so it checks all 21 bits
> (0-20)?

[ ... ]

> The current command only checks bits 0-19:
> 
>     $ for i in $(seq 20); do echo $(($i-1)) $(($(cat /proc/sys/kernel/tainted)>>($i-1)&1));done
> 
> Since 'seq 20' produces 1..20, and the expression uses $(($i-1)), it only
> computes bit positions 0..19, missing the newly added bit 20.

Acknowledged. This will be resolved in the next iteration.
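
For reference, a minimal sketch of what the adjusted loop could look like,
not the wording of the next patch. The function name decode_taint_bits and
its optional argument are hypothetical, added only so the loop can be tried
on an arbitrary value instead of the running kernel's taint state:

```shell
# Print "<bit> <0|1>" for taint bits 0..20.
# Hypothetical helper: the taint value comes from the first argument, or
# from /proc/sys/kernel/tainted when no argument is given.
decode_taint_bits() {
    taint=${1:-$(cat /proc/sys/kernel/tainted)}
    # 'seq 21' yields 1..21, so (i - 1) covers bit positions 0..20.
    for i in $(seq 21); do
        echo "$((i - 1)) $(( (taint >> (i - 1)) & 1 ))"
    done
}
```

Called as "decode_taint_bits 1048576" (2^20), the last line printed is
"20 1", which is the bit the 'seq 20' version never reaches.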

> There's a comment earlier in this file that explicitly states: "if you
> modify the taint_flags or TAINT_FLAGS_COUNT, please also modify
> tools/debugging/kernel-chktaint and
> Documentation/admin-guide/tainted-kernels.rst".
> 
> While the Documentation was updated, was tools/debugging/kernel-chktaint
> also updated? The script currently processes flags 0-19 and ends with the
> J flag check. Without adding a section to check bit 20, the script won't
> decode the 'V' character in the taint string output when
> TAINT_UNSAFE_BPF is set.
> 
> The script would need something like:
> 
>     T=`expr $T / 2`
>     if [ `expr $T % 2` -eq 0 ]; then
>         addout " "
>     else
>         addout "V"
>         echo " * an unsafe eBPF program (mutating helper) was loaded (#20)"
>     fi

Acknowledged. I will incorporate the suggested check for the "V" flag into
tools/debugging/kernel-chktaint.

> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index a822c589c9bd..3f893d9ea764 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -183,6 +183,7 @@ unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx)
> >  #ifdef CONFIG_BPF_KPROBE_OVERRIDE
> >  BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
> >  {
> > +	add_taint(TAINT_UNSAFE_BPF, LOCKDEP_STILL_OK);
> >  	regs_set_return_value(regs, rc);
> >  	override_function_with_return(regs);
> >  	return 0;
> 
> Since bpf_override_return is a helper invoked at runtime and can execute
> frequently on hot kprobes, will calling add_taint() here cause severe
> cache line bouncing?
> 
> Looking at add_taint(), it unconditionally calls set_bit() on the global
> tainted_mask.

[ ... ]

> Also, the commit message and documentation mention the system is tainted
> when an unsafe eBPF program 'was loaded'. By applying the taint here
> during execution instead of at program load time (e.g., in the verifier),
> could this allow an unsafe BPF program to remain dormant and avoid
> detection until a specific trigger occurs?

Acknowledged. I'll move the taint application to check_helper_call().


Kind regards,
-- 
Aaron Tomlin


* Re: [RFC PATCH] bpf: introduce TAINT_UNSAFE_BPF for mutating helpers
From: Aaron Tomlin @ 2026-05-03 15:23 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: corbet, song, kpsingh, mattbobrowski, ast, daniel, andrii,
	eddyz87, memxor, rostedt, mhiramat, skhan, jolsa, martin.lau,
	yonghong.song, mathieu.desnoyers, neelx, sean, chjohnst, steve,
	mproche, nick.lange, linux-doc, linux-kernel, bpf,
	linux-trace-kernel
In-Reply-To: <e456f0f0-5e49-4de4-9184-32ebc53cd0a1@infradead.org>


On Sat, May 02, 2026 at 09:29:27PM -0700, Randy Dunlap wrote:
> > + 20) ``V`` if an eBPF program utilising unsafe, mutating helpers (such as
> > +     bpf_probe_write_user() or bpf_override_return()) was loaded. These helpers
> > +     bypass standard eBPF safety guarantees and can alter execution flow or
> > +     corrupt memory.
> 
> (If this patch goes forward:)
> 
> In this same file (above), there is a little script around line 77 where
> it should be changed:
> s/20/21/
> 
> Also please update tools/debugging/kernel-chktaint for this taint flag.
> 
Hi Randy,

Acknowledged. I'll address the above within the next iteration.


Kind regards,
-- 
Aaron Tomlin
