Linux Trace Kernel


* [PATCH net-next v2 03/14] tcp: refresh rcv_wnd snapshots at TCP write sites
From: atwellwea @ 2026-03-14 20:13 UTC
  To: netdev, davem, kuba, pabeni, edumazet, ncardwell
  Cc: linux-kernel, linux-api, linux-doc, linux-kselftest,
	linux-trace-kernel, mptcp, dsahern, horms, kuniyu, andrew+netdev,
	willemdebruijn.kernel, jasowang, skhan, corbet, matttbe,
	martineau, geliang, rostedt, mhiramat, mathieu.desnoyers,
	0x7f454c46
In-Reply-To: <20260314201348.1786972-1-atwellwea@gmail.com>

From: Wesley Atwell <atwellwea@gmail.com>

Refresh the live rwnd snapshot whenever TCP updates tp->rcv_wnd at the
normal write sites, including child setup, tcp_select_window(), and the
initial connect-time window selection.

This keeps the live sender-visible window paired with the scaling basis
that was actually advertised.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 net/ipv4/tcp_minisocks.c | 2 +-
 net/ipv4/tcp_output.c    | 8 ++++++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index d350d794a959..1c02c9cd13fe 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -603,7 +603,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->rx_opt.sack_ok = ireq->sack_ok;
 	newtp->window_clamp = req->rsk_window_clamp;
 	newtp->rcv_ssthresh = req->rsk_rcv_wnd;
-	newtp->rcv_wnd = req->rsk_rcv_wnd;
+	tcp_set_rcv_wnd(newtp, req->rsk_rcv_wnd);
 	newtp->rcv_mwnd_seq = newtp->rcv_wup + req->rsk_rcv_wnd;
 	newtp->rx_opt.wscale_ok = ireq->wscale_ok;
 	if (newtp->rx_opt.wscale_ok) {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 35c3b0ab5a0c..0b082726d7c4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -291,7 +291,7 @@ static u16 tcp_select_window(struct sock *sk)
 	 */
 	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
 		tp->pred_flags = 0;
-		tp->rcv_wnd = 0;
+		tcp_set_rcv_wnd(tp, 0);
 		tp->rcv_wup = tp->rcv_nxt;
 		tcp_update_max_rcv_wnd_seq(tp);
 		return 0;
@@ -315,7 +315,7 @@ static u16 tcp_select_window(struct sock *sk)
 		}
 	}
 
-	tp->rcv_wnd = new_win;
+	tcp_set_rcv_wnd(tp, new_win);
 	tp->rcv_wup = tp->rcv_nxt;
 	tcp_update_max_rcv_wnd_seq(tp);
 
@@ -4148,6 +4148,10 @@ static void tcp_connect_init(struct sock *sk)
 				  READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_window_scaling),
 				  &rcv_wscale,
 				  rcv_wnd);
+	/* tcp_select_initial_window() filled tp->rcv_wnd through its out-param,
+	 * so snapshot the scaling_ratio we will use for that initial rwnd.
+	 */
+	tcp_set_rcv_wnd(tp, tp->rcv_wnd);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
 	tp->rcv_ssthresh = tp->rcv_wnd;
-- 
2.43.0

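As a rough illustration of the connect-time case in this patch, the following userspace sketch models a window that is first written through an out-parameter and then re-stamped through the setter. The `model_*` names, the struct, and the 65535 value are assumptions made for the example; only the pairing of `rcv_wnd` with `rcv_wnd_scaling_ratio` comes from the series.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the few tcp_sock fields involved. */
struct model_tp {
	uint32_t rcv_wnd;              /* live advertised window */
	uint8_t scaling_ratio;         /* current len/truesize estimate */
	uint8_t rcv_wnd_scaling_ratio; /* basis paired with rcv_wnd, 0 if unknown */
};

/* Models the idea behind tcp_set_rcv_wnd(): write the window and snapshot
 * the ratio in force at that moment.
 */
static void model_set_rcv_wnd(struct model_tp *tp, uint32_t win)
{
	tp->rcv_wnd = win;
	tp->rcv_wnd_scaling_ratio = tp->scaling_ratio;
}

/* Hypothetical initial-window chooser. Like the out-param mentioned in the
 * patch comment, it writes the window directly and knows nothing about the
 * snapshot. The value 65535 is arbitrary.
 */
static void model_select_initial_window(uint32_t *rcv_wnd)
{
	*rcv_wnd = 65535;
}

static void model_connect_init(struct model_tp *tp)
{
	model_select_initial_window(&tp->rcv_wnd);
	/* Re-stamp: same window value, fresh basis. */
	model_set_rcv_wnd(tp, tp->rcv_wnd);
}
```

Passing `tp->rcv_wnd` back into the setter looks redundant, but in this model it is what replaces a stale basis with the ratio that applies to the freshly chosen window.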


* [PATCH net-next v2 02/14] tcp: snapshot advertise-time scaling for rcv_wnd
From: atwellwea @ 2026-03-14 20:13 UTC
  To: netdev, davem, kuba, pabeni, edumazet, ncardwell
  Cc: linux-kernel, linux-api, linux-doc, linux-kselftest,
	linux-trace-kernel, mptcp, dsahern, horms, kuniyu, andrew+netdev,
	willemdebruijn.kernel, jasowang, skhan, corbet, matttbe,
	martineau, geliang, rostedt, mhiramat, mathieu.desnoyers,
	0x7f454c46
In-Reply-To: <20260314201348.1786972-1-atwellwea@gmail.com>

From: Wesley Atwell <atwellwea@gmail.com>

Track the scaling basis that was in force when tp->rcv_wnd was last
advertised, and provide helpers to refresh or interpret that snapshot.

Later patches use this live-window basis to preserve sender-visible rwnd
accounting when receive-side memory costs drift after advertisement.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 .../networking/net_cachelines/tcp_sock.rst    |  1 +
 include/linux/tcp.h                           |  1 +
 include/net/tcp.h                             | 52 ++++++++++++++++++-
 net/ipv4/tcp.c                                |  1 +
 4 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index fecf61166a54..09ece1c59c2d 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -11,6 +11,7 @@ Type                          Name                    fastpath_tx_access  fastpa
 struct inet_connection_sock   inet_conn
 u16                           tcp_header_len          read_mostly         read_mostly         tcp_bound_to_half_wnd,tcp_current_mss(tx);tcp_rcv_established(rx)
 u16                           gso_segs                read_mostly                             tcp_xmit_size_goal
+u8                            rcv_wnd_scaling_ratio   read_write          read_mostly         tcp_set_rcv_wnd,tcp_can_ingest,tcp_repair_set_window,do_tcp_getsockopt
 __be32                        pred_flags              read_write          read_mostly         tcp_select_window(tx);tcp_rcv_established(rx)
 u64                           bytes_received                              read_write          tcp_rcv_nxt_update(rx)
 u32                           segs_in                                     read_write          tcp_v6_rcv(rx)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6982f10e826b..2ace563d59d6 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -297,6 +297,7 @@ struct tcp_sock {
 		est_ecnfield:2,/* ECN field for AccECN delivered estimates */
 		accecn_opt_demand:2,/* Demand AccECN option for n next ACKs */
 		prev_ecnfield:2; /* ECN bits from the previous segment */
+	u8	rcv_wnd_scaling_ratio; /* 0 if unknown, else tp->rcv_wnd basis */
 	__be32	pred_flags;
 	u64	tcp_clock_cache; /* cache last tcp_clock_ns() (see tcp_mstamp_refresh()) */
 	u64	tcp_mstamp;	/* most recent packet received/sent */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3a0060599afe..6fa7cdb0979e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1741,6 +1741,31 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
 	return __tcp_space_from_win(tcp_sk(sk)->scaling_ratio, win);
 }
 
+static inline bool tcp_wnd_snapshot_valid(u8 scaling_ratio)
+{
+	return scaling_ratio != 0;
+}
+
+static inline bool tcp_space_from_wnd_snapshot(u8 scaling_ratio, int win,
+					       int *space)
+{
+	if (!tcp_wnd_snapshot_valid(scaling_ratio))
+		return false;
+
+	*space = __tcp_space_from_win(scaling_ratio, win);
+	return true;
+}
+
+/* Rebuild hard receive-memory units for data already covered by tp->rcv_wnd if
+ * the advertise-time basis is known.
+ */
+static inline bool tcp_space_from_rcv_wnd(const struct tcp_sock *tp, int win,
+					  int *space)
+{
+	return tcp_space_from_wnd_snapshot(tp->rcv_wnd_scaling_ratio, win,
+					   space);
+}
+
 /* Assume a 50% default for skb->len/skb->truesize ratio.
  * This may be adjusted later in tcp_measure_rcv_mss().
  */
@@ -1748,7 +1773,32 @@ static inline int tcp_space_from_win(const struct sock *sk, int win)
 
 static inline void tcp_scaling_ratio_init(struct sock *sk)
 {
-	tcp_sk(sk)->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tp->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+	tp->rcv_wnd_scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
+}
+
+/* tp->rcv_wnd is paired with the scaling_ratio that was in force when that
+ * window was last advertised. Callers can leave a zero snapshot when the
+ * advertise-time basis is unknown and refresh the pair on the next local
+ * window update.
+ */
+static inline void tcp_set_rcv_wnd_snapshot(struct tcp_sock *tp, u32 win,
+					    u8 scaling_ratio)
+{
+	tp->rcv_wnd = win;
+	tp->rcv_wnd_scaling_ratio = scaling_ratio;
+}
+
+static inline void tcp_set_rcv_wnd(struct tcp_sock *tp, u32 win)
+{
+	tcp_set_rcv_wnd_snapshot(tp, win, tp->scaling_ratio);
+}
+
+static inline void tcp_set_rcv_wnd_unknown(struct tcp_sock *tp, u32 win)
+{
+	tcp_set_rcv_wnd_snapshot(tp, win, 0);
 }
 
 /* TCP receive-side accounting reuses sk_rcvbuf as both a hard memory limit
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 516087c622ad..0383ee8d3b78 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5275,6 +5275,7 @@ static void __init tcp_struct_check(void)
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited);
+	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd_scaling_ratio);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_wnd);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_mwnd_seq);
 	CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, rcv_tstamp);
-- 
2.43.0

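To make the snapshot idea concrete, here is a minimal userspace sketch of the helpers added in this patch. It assumes that `scaling_ratio` is a fraction of 256 (so the 50% default mentioned in the patch context corresponds to 128) and that window-to-memory conversion is `win * 256 / ratio`; the `model_*` names and the struct are hypothetical and not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_RMEM_TO_WIN_SCALE 8 /* assumption: ratio is a fraction of 256 */

/* Hypothetical stand-in for the tcp_sock fields touched by the patch. */
struct model_tp {
	uint32_t rcv_wnd;
	uint8_t scaling_ratio;         /* live ratio, may drift over time */
	uint8_t rcv_wnd_scaling_ratio; /* ratio when rcv_wnd was advertised */
};

/* Receive memory needed to back `win` bytes of window at `ratio`. */
static int model_space_from_win(uint8_t ratio, int win)
{
	assert(ratio != 0);
	return (int)(((uint64_t)win << MODEL_RMEM_TO_WIN_SCALE) / ratio);
}

static void model_set_rcv_wnd(struct model_tp *tp, uint32_t win)
{
	tp->rcv_wnd = win;
	tp->rcv_wnd_scaling_ratio = tp->scaling_ratio;
}

/* A zero snapshot marks the advertise-time basis as unknown. */
static void model_set_rcv_wnd_unknown(struct model_tp *tp, uint32_t win)
{
	tp->rcv_wnd = win;
	tp->rcv_wnd_scaling_ratio = 0;
}

/* Convert using the advertise-time basis; fail if it is unknown. */
static bool model_space_from_rcv_wnd(const struct model_tp *tp, int win,
				     int *space)
{
	if (tp->rcv_wnd_scaling_ratio == 0)
		return false;
	*space = model_space_from_win(tp->rcv_wnd_scaling_ratio, win);
	return true;
}
```

Under these assumptions, a 65536-byte window advertised at ratio 128 maps back to 131072 bytes of receive memory even after the live ratio has dropped to 64, whereas converting with the live ratio would give 262144.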


* [PATCH net-next v2 01/14] tcp: factor receive-memory accounting helpers
From: atwellwea @ 2026-03-14 20:13 UTC
  To: netdev, davem, kuba, pabeni, edumazet, ncardwell
  Cc: linux-kernel, linux-api, linux-doc, linux-kselftest,
	linux-trace-kernel, mptcp, dsahern, horms, kuniyu, andrew+netdev,
	willemdebruijn.kernel, jasowang, skhan, corbet, matttbe,
	martineau, geliang, rostedt, mhiramat, mathieu.desnoyers,
	0x7f454c46
In-Reply-To: <20260314201348.1786972-1-atwellwea@gmail.com>

From: Wesley Atwell <atwellwea@gmail.com>

Factor the core receive-memory byte accounting into small helpers so
window selection, pressure checks, and prune decisions all start from
one set of quantities.

This is preparatory only. Later patches will use the same helpers when
tying sender-visible receive-window state back to hard memory admission.

Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
---
 include/net/tcp.h    | 32 +++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c |  2 +-
 2 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f87bdacb5a69..3a0060599afe 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1751,12 +1751,34 @@ static inline void tcp_scaling_ratio_init(struct sock *sk)
 	tcp_sk(sk)->scaling_ratio = TCP_DEFAULT_SCALING_RATIO;
 }
 
+/* TCP receive-side accounting reuses sk_rcvbuf as both a hard memory limit
+ * and as the source material for the advertised receive window after
+ * scaling_ratio conversion. Keep the byte accounting explicit so admission,
+ * pruning, and rwnd selection all start from the same quantities.
+ */
+static inline int tcp_rmem_used(const struct sock *sk)
+{
+	return atomic_read(&sk->sk_rmem_alloc);
+}
+
+static inline int tcp_rmem_avail(const struct sock *sk)
+{
+	return READ_ONCE(sk->sk_rcvbuf) - tcp_rmem_used(sk);
+}
+
+/* Sender-visible rwnd headroom also reserves bytes already queued on backlog.
+ * Those bytes are not free to advertise again until __release_sock() drains
+ * backlog and clears sk_backlog.len.
+ */
+static inline int tcp_rwnd_avail(const struct sock *sk)
+{
+	return tcp_rmem_avail(sk) - READ_ONCE(sk->sk_backlog.len);
+}
+
 /* Note: caller must be prepared to deal with negative returns */
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
-				  READ_ONCE(sk->sk_backlog.len) -
-				  atomic_read(&sk->sk_rmem_alloc));
+	return tcp_win_from_space(sk, tcp_rwnd_avail(sk));
 }
 
 static inline int tcp_full_space(const struct sock *sk)
@@ -1799,7 +1821,7 @@ static inline bool tcp_rmem_pressure(const struct sock *sk)
 	rcvbuf = READ_ONCE(sk->sk_rcvbuf);
 	threshold = rcvbuf - (rcvbuf >> 3);
 
-	return atomic_read(&sk->sk_rmem_alloc) > threshold;
+	return tcp_rmem_used(sk) > threshold;
 }
 
 static inline bool tcp_epollin_ready(const struct sock *sk, int target)
@@ -1949,7 +1971,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
-	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
+	    tcp_rmem_avail(sk) > 0 &&
 	    !tp->urg_data)
 		tcp_fast_path_on(tp);
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6b2f4be7723..b8e65e31255e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5959,7 +5959,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	/* Do nothing if our queues are empty. */
-	if (!atomic_read(&sk->sk_rmem_alloc))
+	if (!tcp_rmem_used(sk))
 		return -1;
 
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
-- 
2.43.0

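The three quantities factored out by this patch can be sketched in plain C as follows. This is a simplified model: the struct and `model_*` names are hypothetical, and the atomic and `READ_ONCE()` accesses of the real helpers are replaced by plain integer reads.

```c
/* Hypothetical stand-in for the struct sock fields read by the helpers. */
struct model_sk {
	int rcvbuf;      /* models sk->sk_rcvbuf */
	int rmem_alloc;  /* models sk->sk_rmem_alloc */
	int backlog_len; /* models sk->sk_backlog.len */
};

/* Bytes of receive memory currently charged to the socket. */
static int model_rmem_used(const struct model_sk *sk)
{
	return sk->rmem_alloc;
}

/* Hard-memory headroom; negative when the socket is over its limit. */
static int model_rmem_avail(const struct model_sk *sk)
{
	return sk->rcvbuf - model_rmem_used(sk);
}

/* Headroom that may be advertised: backlog bytes are reserved as well. */
static int model_rwnd_avail(const struct model_sk *sk)
{
	return model_rmem_avail(sk) - sk->backlog_len;
}
```

With `rcvbuf = 131072`, `rmem_alloc = 40000`, and `backlog_len = 10000`, the model yields 91072 bytes of memory headroom and 81072 bytes of advertisable headroom. The fast-path test rewritten in the patch, `tcp_rmem_avail(sk) > 0`, corresponds to the original `sk_rmem_alloc < sk_rcvbuf` comparison.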


* [PATCH net-next v2 00/14] tcp: preserve receive-window accounting across ratio drift
From: atwellwea @ 2026-03-14 20:13 UTC
  To: netdev, davem, kuba, pabeni, edumazet, ncardwell
  Cc: linux-kernel, linux-api, linux-doc, linux-kselftest,
	linux-trace-kernel, mptcp, dsahern, horms, kuniyu, andrew+netdev,
	willemdebruijn.kernel, jasowang, skhan, corbet, matttbe,
	martineau, geliang, rostedt, mhiramat, mathieu.desnoyers,
	0x7f454c46

From: Wesley Atwell <atwellwea@gmail.com>

This series keeps sender-visible TCP receive-window accounting tied to the
scaling basis that was in force when the window was advertised, even if
later receive-side truesize inflation lowers scaling_ratio or the live
receive window retracts below the largest right edge already exposed to the
sender.

After the receive-window retraction changes, the receive path needs to keep
track of two related pieces of sender-visible state:

  1. the live advertised receive window
  2. the maximum advertised right edge and the basis it was exposed with

This repost snapshots both, uses them to repair receive-buffer backing when
ratio drift would otherwise strand sender-visible space, extends
TCP_REPAIR_WINDOW so repair/restore can round-trip the new state, and adds
truesize-drift coverage through TUN packetdrill tests and netdevsim-based
selftests.

v2:
- repost to net-next and use the [PATCH net-next v2] prefix
- rebase the receive-window accounting changes on top of the retraction
  model
- split the series more finely
- snapshot both the live rwnd basis and the max advertised-window basis
- extend TCP_REPAIR_WINDOW to preserve legacy, v1, and current layouts
- add TUN RX truesize injection and packetdrill coverage for ratio drift
- split the generic netdevsim PSP extension cleanup into its own final
  patch after the peer RX truesize support
- add the requested ABI/runtime comments at the non-obvious review points

Testing:

- full runtime selftest coverage for netdevsim, tcp_ao, mptcp, and
  packetdrill; all runtime suites completed successfully
- tcp_ao completed 24/24 top-level tests, covering 803 passing checks,
  6 expected failures, 36 skips, and 0 unexpected failures
- mptcp completed 588 passing checks in aggregate, with 28 skips and
  0 unexpected failures
- packetdrill completed 219/219 runtime cases with 0 failures,
  including the new tests
- netdevsim completed 18/18 top-level runtime tests with 0 failures,
  including the peer RX truesize and related netdevsim coverage used by
  this series

Wesley Atwell (14):
  tcp: factor receive-memory accounting helpers
  tcp: snapshot advertise-time scaling for rcv_wnd
  tcp: refresh rcv_wnd snapshots at TCP write sites
  tcp: snapshot the maximum advertised receive window
  tcp: grow rcvbuf to back scaled-window quantization slack
  tcp: regrow rcvbuf when scaling_ratio drops after advertisement
  tcp: honor the maximum advertised window after live retraction
  tcp: extend TCP_REPAIR_WINDOW for live and max-window snapshots
  mptcp: refresh TCP receive-window snapshots on subflows
  tcp: expose rmem and backlog in tcp and mptcp rcvbuf_grow tracepoints
  selftests: tcp_ao: cover legacy, v1, and retracted repair windows
  tun/selftests: add RX truesize injection for TCP window tests
  netdevsim: add peer RX truesize support for selftests
  netdevsim: release pinned PSP ext on drop paths

 .../networking/net_cachelines/tcp_sock.rst    |   2 +
 drivers/net/netdevsim/netdev.c                | 156 ++++++-
 drivers/net/netdevsim/netdevsim.h             |   4 +
 drivers/net/tun.c                             |  65 +++
 include/linux/tcp.h                           |   2 +
 include/net/tcp.h                             | 118 ++++-
 include/trace/events/mptcp.h                  |  11 +-
 include/trace/events/tcp.h                    |  12 +-
 include/uapi/linux/if_tun.h                   |   4 +
 include/uapi/linux/tcp.h                      |   8 +
 net/ipv4/tcp.c                                |  75 ++-
 net/ipv4/tcp_fastopen.c                       |   2 +-
 net/ipv4/tcp_input.c                          | 160 ++++++-
 net/ipv4/tcp_minisocks.c                      |   4 +-
 net/ipv4/tcp_output.c                         |  25 +-
 net/mptcp/options.c                           |  14 +-
 net/mptcp/protocol.h                          |  14 +-
 .../selftests/drivers/net/netdevsim/Makefile  |   1 +
 .../drivers/net/netdevsim/peer-rx-truesize.sh | 426 ++++++++++++++++++
 .../tcp_rcv_neg_window_truesize.pkt           | 143 ++++++
 .../net/packetdrill/tcp_rcv_toobig.pkt        |  35 ++
 .../packetdrill/tcp_rcv_toobig_default.pkt    |  97 ++++
 .../tcp_rcv_toobig_default_truesize.pkt       | 118 +++++
 .../tcp_rcv_wnd_shrink_allowed_truesize.pkt   |  49 ++
 .../testing/selftests/net/tcp_ao/lib/aolib.h  |  83 +++-
 .../testing/selftests/net/tcp_ao/lib/repair.c |  18 +-
 .../selftests/net/tcp_ao/self-connect.c       | 201 ++++++++-
 tools/testing/selftests/net/tun.c             | 140 +++++-
 28 files changed, 1911 insertions(+), 76 deletions(-)
 create mode 100755 tools/testing/selftests/drivers/net/netdevsim/peer-rx-truesize.sh
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_neg_window_truesize.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig_default_truesize.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_wnd_shrink_allowed_truesize.pkt


base-commit: f807b5b9b89eb9220d034115c272c312251cbcac
-- 
2.43.0

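The two pieces of sender-visible state listed in the cover letter can be pictured with a small userspace model. The body of `tcp_update_max_rcv_wnd_seq()` is not shown in these messages, so the update rule below (keep the furthest right edge, compared with wraparound-safe sequence arithmetic) is an assumption, and all `model_*` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the tcp_sock fields involved. */
struct model_tp {
	uint32_t rcv_wup;      /* left edge when the window was last sent */
	uint32_t rcv_wnd;      /* live advertised window */
	uint32_t rcv_mwnd_seq; /* largest right edge exposed so far */
};

/* Wraparound-safe "a is after b" for 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Right edge of the live window; wraps modulo 2^32 like sequence space. */
static uint32_t model_live_right_edge(const struct model_tp *tp)
{
	return tp->rcv_wup + tp->rcv_wnd;
}

/* Assumed rule: the maximum edge only ever moves forward. */
static void model_update_max_right_edge(struct model_tp *tp)
{
	uint32_t edge = model_live_right_edge(tp);

	if (seq_after(edge, tp->rcv_mwnd_seq))
		tp->rcv_mwnd_seq = edge;
}
```

In this model, retracting the live window from a right edge of 6000 to 4000 leaves the maximum edge at 6000, which is the gap between live and maximum state that the series says it has to account for.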


* [PATCH v2 2/2] bootconfig: Add more test samples
From: Masami Hiramatsu (Google) @ 2026-03-14 10:10 UTC
  To: Masami Hiramatsu, Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177348304012.463670.8543295382997674229.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add more test samples for edge cases (empty block, quoted newline,
various error cases) to tools/bootconfig/samples/.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Make the EBNF a separate section.
---
 .../samples/bad-array-comment-delimiter.bconf      |    2 ++
 tools/bootconfig/samples/bad-dot-middle.bconf      |    1 +
 .../bootconfig/samples/bad-invalid-operator.bconf  |    1 +
 tools/bootconfig/samples/bad-key-dot-end.bconf     |    1 +
 tools/bootconfig/samples/bad-unclosed-quote.bconf  |    1 +
 .../samples/bad-unexpected-close-brace.bconf       |    4 ++++
 .../samples/exp-good-dot-with-block.bconf          |    1 +
 .../bootconfig/samples/exp-good-empty-block.bconf  |    1 +
 .../samples/exp-good-empty-value-sep.bconf         |    3 +++
 .../samples/exp-good-quoted-newline.bconf          |    2 ++
 tools/bootconfig/samples/good-dot-with-block.bconf |    3 +++
 tools/bootconfig/samples/good-empty-block.bconf    |    1 +
 .../bootconfig/samples/good-empty-value-sep.bconf  |    3 +++
 tools/bootconfig/samples/good-quoted-newline.bconf |    2 ++
 14 files changed, 26 insertions(+)
 create mode 100644 tools/bootconfig/samples/bad-array-comment-delimiter.bconf
 create mode 100644 tools/bootconfig/samples/bad-dot-middle.bconf
 create mode 100644 tools/bootconfig/samples/bad-invalid-operator.bconf
 create mode 100644 tools/bootconfig/samples/bad-key-dot-end.bconf
 create mode 100644 tools/bootconfig/samples/bad-unclosed-quote.bconf
 create mode 100644 tools/bootconfig/samples/bad-unexpected-close-brace.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-quoted-newline.bconf
 create mode 100644 tools/bootconfig/samples/good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/good-quoted-newline.bconf

diff --git a/tools/bootconfig/samples/bad-array-comment-delimiter.bconf b/tools/bootconfig/samples/bad-array-comment-delimiter.bconf
new file mode 100644
index 000000000000..5300cef82aa3
--- /dev/null
+++ b/tools/bootconfig/samples/bad-array-comment-delimiter.bconf
@@ -0,0 +1,2 @@
+key = 1 # comment
+      , 2 # Error: comment between value and its comma delimiter
diff --git a/tools/bootconfig/samples/bad-dot-middle.bconf b/tools/bootconfig/samples/bad-dot-middle.bconf
new file mode 100644
index 000000000000..b3bd19e3c991
--- /dev/null
+++ b/tools/bootconfig/samples/bad-dot-middle.bconf
@@ -0,0 +1 @@
+key..word = value # Double dots are not allowed
diff --git a/tools/bootconfig/samples/bad-invalid-operator.bconf b/tools/bootconfig/samples/bad-invalid-operator.bconf
new file mode 100644
index 000000000000..ca19895bee8a
--- /dev/null
+++ b/tools/bootconfig/samples/bad-invalid-operator.bconf
@@ -0,0 +1 @@
+key ?= value # Unsupported operator
diff --git a/tools/bootconfig/samples/bad-key-dot-end.bconf b/tools/bootconfig/samples/bad-key-dot-end.bconf
new file mode 100644
index 000000000000..57ae39d36e95
--- /dev/null
+++ b/tools/bootconfig/samples/bad-key-dot-end.bconf
@@ -0,0 +1 @@
+key. = value # Key cannot end with a dot
diff --git a/tools/bootconfig/samples/bad-unclosed-quote.bconf b/tools/bootconfig/samples/bad-unclosed-quote.bconf
new file mode 100644
index 000000000000..9384e68d17f6
--- /dev/null
+++ b/tools/bootconfig/samples/bad-unclosed-quote.bconf
@@ -0,0 +1 @@
+key = "unclosed quote
diff --git a/tools/bootconfig/samples/bad-unexpected-close-brace.bconf b/tools/bootconfig/samples/bad-unexpected-close-brace.bconf
new file mode 100644
index 000000000000..a372be395200
--- /dev/null
+++ b/tools/bootconfig/samples/bad-unexpected-close-brace.bconf
@@ -0,0 +1,4 @@
+key {
+    subkey = value
+}
+} # Extra closing brace
diff --git a/tools/bootconfig/samples/exp-good-dot-with-block.bconf b/tools/bootconfig/samples/exp-good-dot-with-block.bconf
new file mode 100644
index 000000000000..ff563ceec024
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-dot-with-block.bconf
@@ -0,0 +1 @@
+key.subkey.subsubkey = "value";
diff --git a/tools/bootconfig/samples/exp-good-empty-block.bconf b/tools/bootconfig/samples/exp-good-empty-block.bconf
new file mode 100644
index 000000000000..fe460e8e675c
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-empty-block.bconf
@@ -0,0 +1 @@
+key;
diff --git a/tools/bootconfig/samples/exp-good-empty-value-sep.bconf b/tools/bootconfig/samples/exp-good-empty-value-sep.bconf
new file mode 100644
index 000000000000..266851aae8f2
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-empty-value-sep.bconf
@@ -0,0 +1,3 @@
+key1 = "";
+key2 = "";
+key3 = "";
diff --git a/tools/bootconfig/samples/exp-good-quoted-newline.bconf b/tools/bootconfig/samples/exp-good-quoted-newline.bconf
new file mode 100644
index 000000000000..2b5166541df6
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-quoted-newline.bconf
@@ -0,0 +1,2 @@
+key = "value
+that spans multiple lines";
diff --git a/tools/bootconfig/samples/good-dot-with-block.bconf b/tools/bootconfig/samples/good-dot-with-block.bconf
new file mode 100644
index 000000000000..3d9bef7daa2f
--- /dev/null
+++ b/tools/bootconfig/samples/good-dot-with-block.bconf
@@ -0,0 +1,3 @@
+key.subkey {
+    subsubkey = value
+} # Combination of dot-notation and block syntax
diff --git a/tools/bootconfig/samples/good-empty-block.bconf b/tools/bootconfig/samples/good-empty-block.bconf
new file mode 100644
index 000000000000..8c390f37b177
--- /dev/null
+++ b/tools/bootconfig/samples/good-empty-block.bconf
@@ -0,0 +1 @@
+key { } # Empty block should be allowed and ignored
diff --git a/tools/bootconfig/samples/good-empty-value-sep.bconf b/tools/bootconfig/samples/good-empty-value-sep.bconf
new file mode 100644
index 000000000000..fbfb9a17ff99
--- /dev/null
+++ b/tools/bootconfig/samples/good-empty-value-sep.bconf
@@ -0,0 +1,3 @@
+key1 = ;
+key2 = 
+key3 = # comment
diff --git a/tools/bootconfig/samples/good-quoted-newline.bconf b/tools/bootconfig/samples/good-quoted-newline.bconf
new file mode 100644
index 000000000000..8c9cd088579a
--- /dev/null
+++ b/tools/bootconfig/samples/good-quoted-newline.bconf
@@ -0,0 +1,2 @@
+key = "value
+that spans multiple lines" # Quoted values can contain newlines

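The `good-dot-with-block.bconf` and `exp-good-dot-with-block.bconf` pair above suggests that a key inside a block is expected to come out as one flat, dot-joined key with a quoted value. The following sketch shows that flattening for a single leaf; the helper name and signature are hypothetical, it is not taken from tools/bootconfig, and it performs no escaping of the value.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: render one leaf as a flat `key = "value";` line.
 * `prefix` is the dot-joined path of the enclosing blocks, or "" at top
 * level. Returns what snprintf() returns.
 */
static int bconf_format_flat(char *buf, size_t size, const char *prefix,
			     const char *key, const char *value)
{
	if (strlen(prefix) == 0)
		return snprintf(buf, size, "%s = \"%s\";", key, value);
	return snprintf(buf, size, "%s.%s = \"%s\";", prefix, key, value);
}
```

Called with prefix `key.subkey`, key `subsubkey`, and value `value`, it produces the single line found in `exp-good-dot-with-block.bconf`; with an empty prefix and an empty value it produces lines of the shape found in `exp-good-empty-value-sep.bconf`.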


* [PATCH v2 1/2] Documentation: bootconfig: Add EBNF definition of bootconfig
From: Masami Hiramatsu (Google) @ 2026-03-14 10:10 UTC
  To: Masami Hiramatsu, Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177348304012.463670.8543295382997674229.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add the EBNF definition to Documentation/admin-guide/bootconfig.rst
as an additional section to formally define the bootconfig syntax.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Changes in v2:
  - Move the EBNF into a separate section.
---
 Documentation/admin-guide/bootconfig.rst |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index f712758472d5..41bd1ee92395 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -152,6 +152,23 @@ Note that you can NOT put a comment or a newline between value and delimiter
  key = 1 # comment
        ,2
 
+EBNF definition
+===============
+
+The syntax is defined in EBNF as follows::
+
+  Config = { Statement }
+  Statement = [ Key [ Assignment | Block ] | Comment ] ( "\n" | ";" )
+  Assignment = ( "=" | "+=" | ":=" ) ValueList
+  ValueList = [ Value { "," [ { ( Comment | "\n" ) } ] Value } ]
+  Block = "{" { Statement } "}"
+  Key = Word { "." Word }
+  Word = [a-zA-Z0-9_-]+
+  Value = QuotedValue | UnquotedValue
+  QuotedValue = "\"" { any_character_except_double_quote } "\""
+              | "'" { any_character_except_single_quote } "'"
+  UnquotedValue = { any_printable_character_except_delimiters }
+  Comment = "#" { any_character_except_newline }
 
 /proc/bootconfig
 ================

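As a small worked instance of the grammar above, the `Key = Word { "." Word }` and `Word = [a-zA-Z0-9_-]+` rules can be checked with a few lines of C. This is only a sketch of those two productions, not the parser in lib/bootconfig.c, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Word characters per the EBNF: [a-zA-Z0-9_-] */
static bool bconf_is_word_char(char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	       (c >= '0' && c <= '9') || c == '_' || c == '-';
}

/* Returns true if `key` matches Key = Word { "." Word }. */
static bool bconf_key_is_valid(const char *key)
{
	size_t word_len = 0;

	for (; *key; key++) {
		if (*key == '.') {
			/* A dot must follow a non-empty Word. */
			if (word_len == 0)
				return false;
			word_len = 0;
		} else if (bconf_is_word_char(*key)) {
			word_len++;
		} else {
			return false;
		}
	}
	/* The key must also end with a non-empty Word. */
	return word_len > 0;
}
```

Under these rules `key.subkey` is accepted, while `key..word` and `key.` are rejected, which matches the intent of the `bad-dot-middle.bconf` and `bad-key-dot-end.bconf` samples in the companion patch.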


* [PATCH v2 0/2] bootconfig: Add EBNF definition and more tests
From: Masami Hiramatsu (Google) @ 2026-03-14 10:10 UTC
  To: Masami Hiramatsu, Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel

Hi,

Here is the 2nd version of the series, which adds an EBNF definition and
more parser test cases for bootconfig in order to formally define the
bootconfig syntax. In this version, I made the EBNF part an independent
section so that it is easier to refer to.

The previous version is here:

https://lore.kernel.org/all/177347919093.458550.1919253264724868769.stgit@devnote2/

Thanks,

---

Masami Hiramatsu (Google) (2):
      Documentation: bootconfig: Add EBNF definition of bootconfig
      bootconfig: Add more test samples


 Documentation/admin-guide/bootconfig.rst           |   17 +++++++++++++++++
 .../samples/bad-array-comment-delimiter.bconf      |    2 ++
 tools/bootconfig/samples/bad-dot-middle.bconf      |    1 +
 .../bootconfig/samples/bad-invalid-operator.bconf  |    1 +
 tools/bootconfig/samples/bad-key-dot-end.bconf     |    1 +
 tools/bootconfig/samples/bad-unclosed-quote.bconf  |    1 +
 .../samples/bad-unexpected-close-brace.bconf       |    4 ++++
 .../samples/exp-good-dot-with-block.bconf          |    1 +
 .../bootconfig/samples/exp-good-empty-block.bconf  |    1 +
 .../samples/exp-good-empty-value-sep.bconf         |    3 +++
 .../samples/exp-good-quoted-newline.bconf          |    2 ++
 tools/bootconfig/samples/good-dot-with-block.bconf |    3 +++
 tools/bootconfig/samples/good-empty-block.bconf    |    1 +
 .../bootconfig/samples/good-empty-value-sep.bconf  |    3 +++
 tools/bootconfig/samples/good-quoted-newline.bconf |    2 ++
 15 files changed, 43 insertions(+)
 create mode 100644 tools/bootconfig/samples/bad-array-comment-delimiter.bconf
 create mode 100644 tools/bootconfig/samples/bad-dot-middle.bconf
 create mode 100644 tools/bootconfig/samples/bad-invalid-operator.bconf
 create mode 100644 tools/bootconfig/samples/bad-key-dot-end.bconf
 create mode 100644 tools/bootconfig/samples/bad-unclosed-quote.bconf
 create mode 100644 tools/bootconfig/samples/bad-unexpected-close-brace.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-quoted-newline.bconf
 create mode 100644 tools/bootconfig/samples/good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/good-quoted-newline.bconf

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* Re: [PATCH 1/2] Documentation: bootconfig: Add EBNF definition of bootconfig
From: Masami Hiramatsu @ 2026-03-14  9:34 UTC
  To: Masami Hiramatsu (Google)
  Cc: Steven Rostedt, linux-kernel, linux-trace-kernel
In-Reply-To: <177347919991.458550.13051415412509206815.stgit@devnote2>

On Sat, 14 Mar 2026 18:06:40 +0900
"Masami Hiramatsu (Google)" <mhiramat@kernel.org> wrote:

> From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 
> Add the EBNF definition to Documentation/admin-guide/bootconfig.rst
> to formally define the bootconfig syntax.
> 

Wait, on second thought it may be better to make this a separate section
so that it can be referenced easily.
Let me update it.

Thanks,

> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> ---
>  Documentation/admin-guide/bootconfig.rst |   15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
> index f712758472d5..5c5f736ca982 100644
> --- a/Documentation/admin-guide/bootconfig.rst
> +++ b/Documentation/admin-guide/bootconfig.rst
> @@ -22,6 +22,21 @@ The boot config syntax is a simple structured key-value. Each key consists
>  of dot-connected-words, and key and value are connected by ``=``. The value
>  string has to be terminated by the following delimiters described below.
>  
> +The syntax is defined in EBNF as follows::
> +
> +  Config = { Statement }
> +  Statement = [ Key [ Assignment | Block ] | Comment ] ( "\n" | ";" )
> +  Assignment = ( "=" | "+=" | ":=" ) ValueList
> +  ValueList = [ Value { "," [ { ( Comment | "\n" ) } ] Value } ]
> +  Block = "{" { Statement } "}"
> +  Key = Word { "." Word }
> +  Word = [a-zA-Z0-9_-]+
> +  Value = QuotedValue | UnquotedValue
> +  QuotedValue = "\"" { any_character_except_double_quote } "\""
> +              | "'" { any_character_except_single_quote } "'"
> +  UnquotedValue = { any_printable_character_except_delimiters }
> +  Comment = "#" { any_character_except_newline }
> +
>  Each key word must contain only alphabets, numbers, dash (``-``) or underscore
>  (``_``). And each value only contains printable characters or spaces except
>  for delimiters such as semi-colon (``;``), new-line (``\n``), comma (``,``),
> 


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>


* [PATCH 2/2] bootconfig: Add more test samples
From: Masami Hiramatsu (Google) @ 2026-03-14  9:06 UTC (permalink / raw)
  To: Masami Hiramatsu, Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177347919093.458550.1919253264724868769.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add more test samples for edge cases (empty block, quoted newline,
various error cases) to tools/bootconfig/samples/.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 .../samples/bad-array-comment-delimiter.bconf      |    2 ++
 tools/bootconfig/samples/bad-dot-middle.bconf      |    1 +
 .../bootconfig/samples/bad-invalid-operator.bconf  |    1 +
 tools/bootconfig/samples/bad-key-dot-end.bconf     |    1 +
 tools/bootconfig/samples/bad-unclosed-quote.bconf  |    1 +
 .../samples/bad-unexpected-close-brace.bconf       |    4 ++++
 .../samples/exp-good-dot-with-block.bconf          |    1 +
 .../bootconfig/samples/exp-good-empty-block.bconf  |    1 +
 .../samples/exp-good-empty-value-sep.bconf         |    3 +++
 .../samples/exp-good-quoted-newline.bconf          |    2 ++
 tools/bootconfig/samples/good-dot-with-block.bconf |    3 +++
 tools/bootconfig/samples/good-empty-block.bconf    |    1 +
 .../bootconfig/samples/good-empty-value-sep.bconf  |    3 +++
 tools/bootconfig/samples/good-quoted-newline.bconf |    2 ++
 14 files changed, 26 insertions(+)
 create mode 100644 tools/bootconfig/samples/bad-array-comment-delimiter.bconf
 create mode 100644 tools/bootconfig/samples/bad-dot-middle.bconf
 create mode 100644 tools/bootconfig/samples/bad-invalid-operator.bconf
 create mode 100644 tools/bootconfig/samples/bad-key-dot-end.bconf
 create mode 100644 tools/bootconfig/samples/bad-unclosed-quote.bconf
 create mode 100644 tools/bootconfig/samples/bad-unexpected-close-brace.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-quoted-newline.bconf
 create mode 100644 tools/bootconfig/samples/good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/good-quoted-newline.bconf

diff --git a/tools/bootconfig/samples/bad-array-comment-delimiter.bconf b/tools/bootconfig/samples/bad-array-comment-delimiter.bconf
new file mode 100644
index 000000000000..5300cef82aa3
--- /dev/null
+++ b/tools/bootconfig/samples/bad-array-comment-delimiter.bconf
@@ -0,0 +1,2 @@
+key = 1 # comment
+      , 2 # Error: comment between value and its comma delimiter
diff --git a/tools/bootconfig/samples/bad-dot-middle.bconf b/tools/bootconfig/samples/bad-dot-middle.bconf
new file mode 100644
index 000000000000..b3bd19e3c991
--- /dev/null
+++ b/tools/bootconfig/samples/bad-dot-middle.bconf
@@ -0,0 +1 @@
+key..word = value # Double dots are not allowed
diff --git a/tools/bootconfig/samples/bad-invalid-operator.bconf b/tools/bootconfig/samples/bad-invalid-operator.bconf
new file mode 100644
index 000000000000..ca19895bee8a
--- /dev/null
+++ b/tools/bootconfig/samples/bad-invalid-operator.bconf
@@ -0,0 +1 @@
+key ?= value # Unsupported operator
diff --git a/tools/bootconfig/samples/bad-key-dot-end.bconf b/tools/bootconfig/samples/bad-key-dot-end.bconf
new file mode 100644
index 000000000000..57ae39d36e95
--- /dev/null
+++ b/tools/bootconfig/samples/bad-key-dot-end.bconf
@@ -0,0 +1 @@
+key. = value # Key cannot end with a dot
diff --git a/tools/bootconfig/samples/bad-unclosed-quote.bconf b/tools/bootconfig/samples/bad-unclosed-quote.bconf
new file mode 100644
index 000000000000..9384e68d17f6
--- /dev/null
+++ b/tools/bootconfig/samples/bad-unclosed-quote.bconf
@@ -0,0 +1 @@
+key = "unclosed quote
diff --git a/tools/bootconfig/samples/bad-unexpected-close-brace.bconf b/tools/bootconfig/samples/bad-unexpected-close-brace.bconf
new file mode 100644
index 000000000000..a372be395200
--- /dev/null
+++ b/tools/bootconfig/samples/bad-unexpected-close-brace.bconf
@@ -0,0 +1,4 @@
+key {
+    subkey = value
+}
+} # Extra closing brace
diff --git a/tools/bootconfig/samples/exp-good-dot-with-block.bconf b/tools/bootconfig/samples/exp-good-dot-with-block.bconf
new file mode 100644
index 000000000000..ff563ceec024
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-dot-with-block.bconf
@@ -0,0 +1 @@
+key.subkey.subsubkey = "value";
diff --git a/tools/bootconfig/samples/exp-good-empty-block.bconf b/tools/bootconfig/samples/exp-good-empty-block.bconf
new file mode 100644
index 000000000000..fe460e8e675c
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-empty-block.bconf
@@ -0,0 +1 @@
+key;
diff --git a/tools/bootconfig/samples/exp-good-empty-value-sep.bconf b/tools/bootconfig/samples/exp-good-empty-value-sep.bconf
new file mode 100644
index 000000000000..266851aae8f2
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-empty-value-sep.bconf
@@ -0,0 +1,3 @@
+key1 = "";
+key2 = "";
+key3 = "";
diff --git a/tools/bootconfig/samples/exp-good-quoted-newline.bconf b/tools/bootconfig/samples/exp-good-quoted-newline.bconf
new file mode 100644
index 000000000000..2b5166541df6
--- /dev/null
+++ b/tools/bootconfig/samples/exp-good-quoted-newline.bconf
@@ -0,0 +1,2 @@
+key = "value
+that spans multiple lines";
diff --git a/tools/bootconfig/samples/good-dot-with-block.bconf b/tools/bootconfig/samples/good-dot-with-block.bconf
new file mode 100644
index 000000000000..3d9bef7daa2f
--- /dev/null
+++ b/tools/bootconfig/samples/good-dot-with-block.bconf
@@ -0,0 +1,3 @@
+key.subkey {
+    subsubkey = value
+} # Combination of dot-notation and block syntax
diff --git a/tools/bootconfig/samples/good-empty-block.bconf b/tools/bootconfig/samples/good-empty-block.bconf
new file mode 100644
index 000000000000..8c390f37b177
--- /dev/null
+++ b/tools/bootconfig/samples/good-empty-block.bconf
@@ -0,0 +1 @@
+key { } # Empty block should be allowed and ignored
diff --git a/tools/bootconfig/samples/good-empty-value-sep.bconf b/tools/bootconfig/samples/good-empty-value-sep.bconf
new file mode 100644
index 000000000000..fbfb9a17ff99
--- /dev/null
+++ b/tools/bootconfig/samples/good-empty-value-sep.bconf
@@ -0,0 +1,3 @@
+key1 = ;
+key2 = 
+key3 = # comment
diff --git a/tools/bootconfig/samples/good-quoted-newline.bconf b/tools/bootconfig/samples/good-quoted-newline.bconf
new file mode 100644
index 000000000000..8c9cd088579a
--- /dev/null
+++ b/tools/bootconfig/samples/good-quoted-newline.bconf
@@ -0,0 +1,2 @@
+key = "value
+that spans multiple lines" # Quoted values can contain newlines


^ permalink raw reply related

* [PATCH 1/2] Documentation: bootconfig: Add EBNF definition of bootconfig
From: Masami Hiramatsu (Google) @ 2026-03-14  9:06 UTC (permalink / raw)
  To: Masami Hiramatsu, Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel
In-Reply-To: <177347919093.458550.1919253264724868769.stgit@devnote2>

From: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Add the EBNF definition to Documentation/admin-guide/bootconfig.rst
to formally define the bootconfig syntax.

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
---
 Documentation/admin-guide/bootconfig.rst |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index f712758472d5..5c5f736ca982 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -22,6 +22,21 @@ The boot config syntax is a simple structured key-value. Each key consists
 of dot-connected-words, and key and value are connected by ``=``. The value
 string has to be terminated by the following delimiters described below.
 
+The syntax is defined in EBNF as follows::
+
+  Config = { Statement }
+  Statement = [ Key [ Assignment | Block ] | Comment ] ( "\n" | ";" )
+  Assignment = ( "=" | "+=" | ":=" ) ValueList
+  ValueList = [ Value { "," [ { ( Comment | "\n" ) } ] Value } ]
+  Block = "{" { Statement } "}"
+  Key = Word { "." Word }
+  Word = [a-zA-Z0-9_-]+
+  Value = QuotedValue | UnquotedValue
+  QuotedValue = "\"" { any_character_except_double_quote } "\""
+              | "'" { any_character_except_single_quote } "'"
+  UnquotedValue = { any_printable_character_except_delimiters }
+  Comment = "#" { any_character_except_newline }
+
 Each key word must contain only alphabets, numbers, dash (``-``) or underscore
 (``_``). And each value only contains printable characters or spaces except
 for delimiters such as semi-colon (``;``), new-line (``\n``), comma (``,``),


^ permalink raw reply related

* [PATCH 0/2] bootconfig: Add EBNF definition and more tests
From: Masami Hiramatsu (Google) @ 2026-03-14  9:06 UTC (permalink / raw)
  To: Masami Hiramatsu, Steven Rostedt; +Cc: linux-kernel, linux-trace-kernel

Hi,

Here is a pair of patches to add the EBNF definition and more
parser test cases of bootconfig to formally define the bootconfig
syntax.
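
As an illustration of how the grammar in patch 1/2 maps onto the new
bad-*.bconf samples, here is a minimal C sketch that checks only the Key
rule (Key = Word { "." Word }, Word = [a-zA-Z0-9_-]+). key_is_valid() and
is_word_char() are hypothetical helpers written for this example; they are
not taken from lib/bootconfig.c, and the real parser may structure this
differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Word = [a-zA-Z0-9_-]+ : test a single character of a key word. */
static bool is_word_char(char c)
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
	       (c >= '0' && c <= '9') || c == '_' || c == '-';
}

/*
 * Key = Word { "." Word }
 *
 * Hypothetical helper (not the kernel's parser): returns true only if
 * every dot-separated component is a non-empty Word.
 */
static bool key_is_valid(const char *key)
{
	size_t word_len = 0;

	assert(key != NULL);

	for (; *key; key++) {
		if (*key == '.') {
			if (word_len == 0)
				return false;	/* leading dot or ".." */
			word_len = 0;
		} else if (is_word_char(*key)) {
			word_len++;
		} else {
			return false;		/* character outside Word */
		}
	}

	return word_len > 0;			/* rejects "" and a trailing dot */
}
```

Under this sketch, "key..word" (bad-dot-middle.bconf) and "key."
(bad-key-dot-end.bconf) are rejected, while "key.subkey.subsubkey" is
accepted.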

---

Masami Hiramatsu (Google) (2):
      Documentation: bootconfig: Add EBNF definition of bootconfig
      bootconfig: Add more test samples


 Documentation/admin-guide/bootconfig.rst           |   15 +++++++++++++++
 .../samples/bad-array-comment-delimiter.bconf      |    2 ++
 tools/bootconfig/samples/bad-dot-middle.bconf      |    1 +
 .../bootconfig/samples/bad-invalid-operator.bconf  |    1 +
 tools/bootconfig/samples/bad-key-dot-end.bconf     |    1 +
 tools/bootconfig/samples/bad-unclosed-quote.bconf  |    1 +
 .../samples/bad-unexpected-close-brace.bconf       |    4 ++++
 .../samples/exp-good-dot-with-block.bconf          |    1 +
 .../bootconfig/samples/exp-good-empty-block.bconf  |    1 +
 .../samples/exp-good-empty-value-sep.bconf         |    3 +++
 .../samples/exp-good-quoted-newline.bconf          |    2 ++
 tools/bootconfig/samples/good-dot-with-block.bconf |    3 +++
 tools/bootconfig/samples/good-empty-block.bconf    |    1 +
 .../bootconfig/samples/good-empty-value-sep.bconf  |    3 +++
 tools/bootconfig/samples/good-quoted-newline.bconf |    2 ++
 15 files changed, 41 insertions(+)
 create mode 100644 tools/bootconfig/samples/bad-array-comment-delimiter.bconf
 create mode 100644 tools/bootconfig/samples/bad-dot-middle.bconf
 create mode 100644 tools/bootconfig/samples/bad-invalid-operator.bconf
 create mode 100644 tools/bootconfig/samples/bad-key-dot-end.bconf
 create mode 100644 tools/bootconfig/samples/bad-unclosed-quote.bconf
 create mode 100644 tools/bootconfig/samples/bad-unexpected-close-brace.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/exp-good-quoted-newline.bconf
 create mode 100644 tools/bootconfig/samples/good-dot-with-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-block.bconf
 create mode 100644 tools/bootconfig/samples/good-empty-value-sep.bconf
 create mode 100644 tools/bootconfig/samples/good-quoted-newline.bconf

--
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH 01/15] tracepoint: Add trace_invoke_##name() API
From: Keith Busch @ 2026-03-14  0:24 UTC (permalink / raw)
  To: Vineeth Remanan Pillai
  Cc: Peter Zijlstra, Steven Rostedt, Dmitry Ilvokhin, Masami Hiramatsu,
	Mathieu Desnoyers, Ingo Molnar, Jens Axboe, io-uring,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
	Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
	netdev, bpf, linux-sctp, tipc-discussion, dev, Oded Gabbay,
	Koby Elbaz, dri-devel, Rafael J. Wysocki, Viresh Kumar,
	Gautham R. Shenoy, Huang Rui, Mario Limonciello, Len Brown,
	Srinivas Pandruvada, linux-pm, MyungJoo Ham, Kyungmin Park,
	Chanwoo Choi, Christian König, Sumit Semwal, linaro-mm-sig,
	Eddie James, Andrew Jeffery, Joel Stanley, linux-fsi,
	David Airlie, Simona Vetter, Alex Deucher, Danilo Krummrich,
	Matthew Brost, Philipp Stanner, Harry Wentland, Leo Li, amd-gfx,
	Jiri Kosina, Benjamin Tissoires, linux-input, Wolfram Sang,
	linux-i2c, Mark Brown, Michael Hennerich, Nuno Sá, linux-spi,
	James E.J. Bottomley, Martin K. Petersen, linux-scsi, Chris Mason,
	David Sterba, linux-btrfs, linux-trace-kernel, linux-kernel
In-Reply-To: <CAO7JXPiu8-LE_gG001_GQLoGVYakPdzmH2SXLqfzJjEUxbn1Rw@mail.gmail.com>

On Thu, Mar 12, 2026 at 12:05:37PM -0400, Vineeth Remanan Pillai wrote:
> On Thu, Mar 12, 2026 at 11:53 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > That seems like an unreasonable waste of energy. You could've had claude
> > write a Coccinelle script for you and saved a ton of tokens.
> 
> Yeah true, Steve also mentioned this to me offline. Haven't used
> Coccinelle before, but now I know :-)

[+ Chris Mason]

At the risk of creating a distraction...

This discussion got me thinking that, with the right skill loaded, the
AI should implicitly use coccinelle to generate the patchset rather than
do it by hand. You could prompt with simple language for a pattern
substitution rather than explicitly request coccinelle, and it should
generate a patch set using a script rather than spending tokens on the
edits themselves.

I sent such a "skill" to Chris' kernel "review-prompts":

  https://github.com/masoncl/review-prompts/pull/35

I used patch one from this series as the starting point and let the AI
figure the rest out. The result actually found additional patterns that
could take advantage of the optimisation that this series did not
include. The resulting kernel tree that the above github pull request
references cost 2.8k tokens to create with the skill.

^ permalink raw reply

* Re: [PATCH 02/61] btrfs: Prefer IS_ERR_OR_NULL over manual NULL check
From: David Sterba @ 2026-03-13 19:22 UTC (permalink / raw)
  To: Philipp Hahn
  Cc: amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel, dri-devel,
	gfs2, intel-gfx, intel-wired-lan, iommu, kvm, linux-arm-kernel,
	linux-block, linux-bluetooth, linux-btrfs, linux-cifs, linux-clk,
	linux-erofs, linux-ext4, linux-fsdevel, linux-gpio, linux-hyperv,
	linux-input, linux-kernel, linux-leds, linux-media, linux-mips,
	linux-mm, linux-modules, linux-mtd, linux-nfs, linux-omap,
	linux-phy, linux-pm, linux-rockchip, linux-s390, linux-scsi,
	linux-sctp, linux-security-module, linux-sh, linux-sound,
	linux-stm32, linux-trace-kernel, linux-usb, linux-wireless,
	netdev, ntfs3, samba-technical, sched-ext, target-devel,
	tipc-discussion, v9fs, Chris Mason, David Sterba
In-Reply-To: <20260310-b4-is_err_or_null-v1-2-bd63b656022d@avm.de>

On Tue, Mar 10, 2026 at 12:48:28PM +0100, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
> 
> IS_ERR_OR_NULL() already uses likely(!ptr) internally. checkpatch does
> not like nesting it:
> > WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses
> > unlikely() internally
> Remove the explicit use of likely().
> 
> Change generated with coccinelle.
> 
> To: Chris Mason <clm@fb.com>
> To: David Sterba <dsterba@suse.com>
> Cc: linux-btrfs@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>

Added to for-next. We already use IS_ERR_OR_NULL() in a few other
places, so this makes sense for consistency. Thanks.

^ permalink raw reply

* Re: [PATCH v2] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Nathan Chancellor @ 2026-03-13 16:37 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: maz, rostedt, arnd, linux-trace-kernel, kvmarm, kernel-team
In-Reply-To: <20260313105829.1214123-1-vdonnefort@google.com>

On Fri, Mar 13, 2026 at 10:58:29AM +0000, Vincent Donnefort wrote:
> Compiler and tooling-generated symbols are difficult to maintain
> across all supported architectures. Make the allowlist more robust by
> replacing the harcoded list with a mechanism that automatically detects
> these symbols.
> 
> This mechanism generates a C function designed to trigger common
> compiler-inserted symbols.
> 
> Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> 
> ---
> 
> Changes in v2:
> 
>   - Use filechk (Nathan)
>   - Removed deprecated extra-y (Nathan)
>   - Added simple_ring_buffer in allowlist (Nathan)
>   - Added memcpy() to generate more symbols (Nathan)
>   - Added __sancov 
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index beb15936829d..96627a909ecc 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -136,17 +136,42 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
>  # simple_ring_buffer is used by the pKVM hypervisor which does not have access
>  # to all kernel symbols. Fail the build if forbidden symbols are found.
>  #
> -UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind
> -UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov
> -UNDEFINED_ALLOWLIST += .TOC\. __clear_pages_unrolled __memmove copy_page warn_slowpath_fmt
> -UNDEFINED_ALLOWLIST += ftrace_likely_update __hwasan_load __hwasan_store __hwasan_tag_memory
> -UNDEFINED_ALLOWLIST += warn_bogus_irq_restore __stack_chk_guard
> -UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST))
> +# undefsyms_base generates a set of compiler and tooling-generated symbols that can
> +# safely be ignored for simple_ring_buffer.
> +#
> +filechk_undefsyms_base = \
> +	echo '$(pound)include <linux/atomic.h>'; \
> +	echo '$(pound)include <linux/string.h>'; \
> +	echo '$(pound)include <asm/page.h>'; \
> +	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);'; \
> +	echo 'void undefsyms_base(void *p, int n);'; \
> +	echo 'void undefsyms_base(void *p, int n) {'; \
> +	echo '	char buffer[256] = { 0 };'; \
> +	echo '	u32 u = 0;'; \
> +	echo '	memset((char * volatile)page, 8, PAGE_SIZE);'; \
> +	echo '	memset((char * volatile)buffer, 8, sizeof(buffer));'; \
> +	echo '	memcpy((void * volatile)p, buffer, sizeof(buffer));'; \
> +	echo '	cmpxchg((u32 * volatile)&u, 0, 8);'; \
> +	echo '	WARN_ON(n == 0xdeadbeef);'; \
> +	echo '}'
> +
> +$(obj)/undefsyms_base.c: FORCE
> +	$(call filechk,undefsyms_base)
> +
> +clean-files += undefsyms_base.c
> +
> +$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
> +
> +targets += undefsyms_base.o
> +
> +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
> +		      simple_ring_buffer \
> +		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
>  
>  quiet_cmd_check_undefined = NM      $<
> -      cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(UNDEFINED_ALLOWLIST)`"
> +      cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(addprefix -e , $(UNDEFINED_ALLOWLIST))`"
>  
> -$(obj)/%.o.checked: $(obj)/%.o FORCE
> +$(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE
>  	$(call if_changed,check_undefined)
>  
>  always-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o.checked
> 
> base-commit: 33f2e266515717c4b2df585dadefa0525557726c
> -- 
> 2.53.0.851.ga537e3e6e9-goog
> 

Thanks! This is almost perfect in my tests. There is one final thing that I
noticed from my full overnight builds. For ARCH=riscv (and some other
architectures, from a quick grep), there is some logic in their
include/asm/string.h files to avoid FORTIFY_SOURCE when KASAN is
enabled for the entire build but not enabled for the particular file. As
undefsyms_base.o is not linked into vmlinux or modules, it does not
automatically have KASAN enabled.

  $ cat allmod.config
  CONFIG_GCOV_KERNEL=n
  CONFIG_LTO_CLANG_THIN=y
  CONFIG_WERROR=n

  $ make -skj"$(nproc)" ARCH=riscv KCONFIG_ALLCONFIG=1 LLVM=1 mrproper allmodconfig kernel/trace/
  Unexpected symbols in kernel/trace/simple_ring_buffer.o:
                   U __fortify_panic
                   U __write_overflow_field
  ...

This cures that for me.

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 260382f62dbf..55af887a90e2 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -164,6 +164,11 @@ $(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
 
 targets += undefsyms_base.o
 
+# ensure KASAN is enabled to avoid logic that may disable FORTIFY_SOURCE when
+# KASAN is not enabled. undefsyms_base.o does not automatically get KASAN flags
+# because it is not linked into vmlinux.
+KASAN_SANITIZE_undefsyms_base.o := y
+
 UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
 		      simple_ring_buffer \
 		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
--

With that addressed:

Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>

Cheers,
Nathan

^ permalink raw reply related

* Re: [PATCH RFC v3 00/43] guest_memfd: In-place conversion support
From: Sean Christopherson @ 2026-03-13 15:45 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jroedel, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest
In-Reply-To: <20260313-gmem-inplace-conversion-v3-0-5fc12a70ec89@google.com>

On Fri, Mar 13, 2026, Ackerley Tng wrote:
> Hi,
> 
> (Here's the motivation for this series, which I realized was missing from
> the earlier revisions of this series)

...

> I'm intending RFC (v3) as a basis for discussion of flags/content
> modes (name TBD) to allow userspace to request guarantees on how the memory
> contents will look like after setting memory attributes. The last 6 patches
> implement content mode support. These patches will be reordered, and some
> of them could be absorbed into earlier patches, in later revisions.
> 
> Here are the discussion points I can think of (please add on):
> 
> 1. (Might hopefully resolve soon?) Should ZERO be supported on shared to
>    private conversions? Discussion is at [6].

No.  There is no use case.  The entire point of CoCo is that the VMM is untrusted.
Having the guest rely on the VMM to zero memory makes no sense whatsoever.  There
may be a contract between the trusted whatever and the guest, but that's between
those two entities, the VMM is not involved, period.

PRESERVE is different because the intent is to allow the guest to operate on
*untrusted* data.  Operating on untrusted zeros is nonsensical.

ZERO for private=>shared is different because the VMM trusts the host kernel.

> 2. Do we need a CAP for userspace to query the flags/modes supported?

Yes.

>    It seems like there won't be anything dynamic about the flags/modes
>    supported.
> 
>    The userspace code can check what platform it is running on, and then
>    decide ZERO or PRESERVE based on the platform:
> 
>    If the VM is running on TDX,

No.  No, no, no, no.  I have said this over, and over, and over.  The contract
is between userspace and KVM, not between userspace and the underlying CoCo
implementation.  Anything that requires making assumptions based on the VM type
is a non-starter for me.

>    it would want to specify ZERO all the
>    time. If the VM were running on pKVM it would want to specify PRESERVE
>    if it wants to enable in-place sharing, and ZERO if it wants to zero the
>    memory.
> 
>    If someday TDX supports PRESERVE, then there's room for discovery of
>    which algorithm to choose when running the guest. Perhaps that's when
>    the CAP should be introduced?
> 
> 3. What do people think of the structure of how various content modes are
>    checked for support or applied? I used overridable weak functions for
>    architectures that haven't defined support, and defined overrides for
>    x86 to show how I think it would work. For CoCo platforms, I only
>    implemented TDX for illustration purposes and might need help with the
>    other platforms. Should I have used kvm_x86_ops? I tried and found
>    myself defining lots of boilerplate.
> 
> 4. enum for ZERO and PRESERVE?
> 
>    Pros:
> 
>    * No way to define both ZERO and PRESERVE (make impossible states
>      unrepresentable)
>        * e.g. enum kvm_device_type in __u32 type in struct
>          kvm_create_device
>        * But maybe someday some modes can be used together?

Huh?  Oh, you don't mean "enum", you mean "values vs. flags".  Because in C you
can obviously have an enum of flags.

I don't have a strong preference, though I think I'd vote for flags.

Practically speaking, I doubt we'll ever have more than DEFAULT, ZERO, and PRESERVE,
i.e. more than '0', '1', and '2'.  Perhaps I lack imagination, but I can't think
of any other operation that we would want to become ABI.  ZERO is special purely because
various CoCo implementations already zero memory on conversion.  Everything else
fits into PRESERVE, because if the kernel can perform the operation, then userspace
can do the same, and likely more performantly and obviously without needing a
contract with KVM.

The only other option I can think of is if a CoCo implementation wanted to use a
specific value other than '0' to fill a page on conversion.  Given that starting
from '0' is by far the most common state in computing, I just don't see that
happening.  E.g. that'd be like adding k1salloc() in addition to kmalloc() and
kzalloc().

So, we're likely only going to have DEFAULT, ZERO, and PRESERVE, at which point
whether we use flags or values is a wash in terms of how many bits we need: 2.

If we use flags, then we can have a single CAP to enumerate all flags that are
supported by KVM_SET_MEMORY_ATTRIBUTES2.  If we use values, we'd need a separate CAP
for flags and a separate CAP for conversion operations.

Using values would allow providing a dedicated field in kvm_memory_attributes2,
which _might_ make some code more readable.  But for me, that doesn't outweigh the
disadvantage of needing another CAP.
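
To make the flags option concrete, here is a minimal user-space sketch of
validating a request against a single "supported flags" bitmask, which is
roughly what one CAP enumerating all flags would allow.  Everything below is
an assumption for illustration: the EX_ATTR_FLAG_* names, their bit
positions, and ex_flags_valid() are hypothetical and are not part of the
series or of the KVM uAPI.  The mutual-exclusion check models the concern
raised above that, with flags, "both ZERO and PRESERVE" is representable and
has to be rejected explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names and bit positions, for illustration only. */
#define EX_ATTR_FLAG_ZERO	(UINT64_C(1) << 0)
#define EX_ATTR_FLAG_PRESERVE	(UINT64_C(1) << 1)

/*
 * flags:     content-mode flags requested by userspace (0 == DEFAULT).
 * supported: bitmask of flags that a single capability would report.
 *
 * Returns true if the request uses only supported flags and does not
 * combine ZERO with PRESERVE.
 */
static bool ex_flags_valid(uint64_t flags, uint64_t supported)
{
	/* Reject any bit the capability did not advertise. */
	if (flags & ~supported)
		return false;

	/* Assumed here: ZERO and PRESERVE are mutually exclusive. */
	if ((flags & EX_ATTR_FLAG_ZERO) && (flags & EX_ATTR_FLAG_PRESERVE))
		return false;

	return true;
}
```

With values instead of flags, the second check disappears (the invalid
combination cannot be expressed), but the supported set can no longer be
reported by the same bitmask that covers the other flags.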

^ permalink raw reply

* Re: [PATCH v3 1/4] tracing/preemptirq: Optimize preempt_disable/enable() tracepoint overhead
From: Wander Lairson Costa @ 2026-03-13 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton,
	open list:SCHEDULER, open list:TRACING, acme, williams, gmonaco
In-Reply-To: <20260313090404.GK606826@noisy.programming.kicks-ass.net>

On Fri, Mar 13, 2026 at 10:04:04AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 12, 2026 at 02:19:15PM -0300, Wander Lairson Costa wrote:
> 
> > > That's significant bloat, for really very little gain. Realistically
> > > nobody is going to need these.
> > > 
> > 
> > Of course, I can't speak for others, but more than once I debugged issues
> > that those tracepoints had made my life far easier. Those cases convinced
> > me that such a feature would be worth it. But if you don't see
> > value and will reject the patches no matter what, nothing can be done,
> > and I will have to accept defeat.
> 
> If distros are going to enable this, I suppose I'm not going to stop
> this. But I do very much worry about the general bloat of things, there
> are a *LOT* of preempt_{dis,en}able() sites.
> 

We plan to enable these tracepoints in the RHEL kernel-rt to track
extended non-preemptible states that cause high latencies. These
issues occasionally surface in customer OpenShift deployments, where
deploying a custom debug kernel is highly impractical. Having these
tracepoints available in the distribution kernel would be handy for
debugging these production systems. That said, I expect enabling this
feature to be the exception rather than the rule; most distribution
kernels would leave it disabled.


^ permalink raw reply

* Re: [PATCH 00/15] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Vineeth Remanan Pillai @ 2026-03-13 14:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrii Nakryiko, Mathieu Desnoyers, Peter Zijlstra,
	Dmitry Ilvokhin, Masami Hiramatsu, Ingo Molnar, Jens Axboe,
	io-uring, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Marcelo Ricardo Leitner, Xin Long, Jon Maloy, Aaron Conole,
	Eelco Chaudron, Ilya Maximets, netdev, bpf, linux-sctp,
	tipc-discussion, dev, Oded Gabbay, Koby Elbaz, dri-devel,
	Rafael J. Wysocki, Viresh Kumar, Gautham R. Shenoy, Huang Rui,
	Mario Limonciello, Len Brown, Srinivas Pandruvada, linux-pm,
	MyungJoo Ham, Kyungmin Park, Chanwoo Choi, Christian König,
	Sumit Semwal, linaro-mm-sig, Eddie James, Andrew Jeffery,
	Joel Stanley, linux-fsi, David Airlie, Simona Vetter,
	Alex Deucher, Danilo Krummrich, Matthew Brost, Philipp Stanner,
	Harry Wentland, Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires,
	linux-input, Wolfram Sang, linux-i2c, Mark Brown,
	Michael Hennerich, Nuno Sá, linux-spi, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi, Chris Mason, David Sterba,
	linux-btrfs, linux-trace-kernel, linux-kernel
In-Reply-To: <20260312130255.6476e560@gandalf.local.home>

On Thu, Mar 12, 2026 at 1:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 12 Mar 2026 09:54:29 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > > > emit_trace_foo()
> > > > __trace_foo()
> >
> > this seems like the best approach, IMO. double-underscored variants
> > are usually used for some specialized/internal version of a function
> > when we know that some conditions are correct (e.g., lock is already
> > taken, or something like that). Which fits here: trace_xxx() will
> > check if tracepoint is enabled, while __trace_xxx() will not check and
> > just invoke the tracepoint? It's short, it's distinct, and it says "I
> > know what I am doing".
>
> Honestly, I consider double underscore as internal only and not something
> anyone but the subsystem maintainers use.
>
> This, is a normal function where it's just saying: If you have it already
> enabled, then you can use this. Thus, I don't think it qualifies as a "you
> know what you are doing".
>
> Perhaps: call_trace_foo() ?
>
call_trace_foo has one collision: for the tracepoint
sched_update_nr_running, it would clash with the existing function
call_trace_sched_update_nr_running. I had considered this name and later
moved to trace_invoke_foo() because of the collision. But I can rename
call_trace_sched_update_nr_running to something else if call_trace_foo
is the general consensus.
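
For anyone joining the naming discussion late, here is a small user-space
model of what the series is about, whatever the final name ends up being.
It is only a sketch: a plain bool and a counter stand in for the static
key, the probe is a counter increment, and the site_*() functions are
hypothetical call sites. trace_foo_enabled() mirrors the existing
trace_##name##_enabled() helper and trace_invoke_foo() uses the name
proposed in this series.

```c
#include <assert.h>
#include <stdbool.h>

static bool foo_enabled;	/* stands in for the tracepoint's static key */
static int key_checks;		/* how many times the key was evaluated */
static int probe_calls;		/* how many times the probe ran */

/* Models trace_foo_enabled(): evaluates the key. */
static bool trace_foo_enabled(void)
{
	key_checks++;
	return foo_enabled;
}

/* Models the proposed unguarded variant: calls the probe, no key check. */
static void trace_invoke_foo(int arg)
{
	(void)arg;
	probe_calls++;
}

/* Models the regular tracepoint: checks the key, then calls the probe. */
static void trace_foo(int arg)
{
	if (trace_foo_enabled())
		trace_invoke_foo(arg);
}

/* Guarded call site today: the key is evaluated twice when enabled. */
static void site_double_check(int arg)
{
	if (trace_foo_enabled()) {
		/* expensive argument preparation would go here */
		trace_foo(arg);
	}
}

/* Guarded call site with the new API: the key is evaluated once. */
static void site_single_check(int arg)
{
	if (trace_foo_enabled()) {
		/* expensive argument preparation would go here */
		trace_invoke_foo(arg);
	}
}
```

In this model the first call site costs two key evaluations per enabled
event and the second costs one, which is the only difference the rename
needs to convey.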

Thanks,
Vineeth

^ permalink raw reply

* Re: [PATCH v7 05/15] Documentation/rv: Add documentation about hybrid automata
From: Juri Lelli @ 2026-03-13 13:23 UTC (permalink / raw)
  To: gmonaco
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, linux-trace-kernel, linux-doc, Tomas Glozar,
	Clark Williams, John Kacur
In-Reply-To: <4620e92b1c7f4d87f192a017f3026dfc17bcaef6.camel@redhat.com>

On 13/03/26 14:05, gmonaco@redhat.com wrote:
> Hello,
> 
> On Thu, 2026-03-12 at 11:39 +0100, Juri Lelli wrote:
> > Very minor nit, feel free to ignore, but ...
> > 
> > The formal 7-tuple definition includes 'i' (invariant function), but
> > unlike other elements, 'i' isn't stored in the automaton struct -
> > it's implemented as generated code in ha_verify_constraint(), IIUC.
> > Worth a brief note clarifying this design choice so readers don't
> > expect to find an invariants[] member in the struct? Here or below in
> > the example C code section.
> 
> Thanks for the review! I haven't really thought of that.
> At this stage we are not mentioning any struct element (it's purely
> theoretical), so there shouldn't be any expectation from the reader.
> 
> Later I mention "The function verify_constraint checks guards,
> performs resets and starts timers to validate invariants according to
> specification".
> In fact, guards are also not represented as part of 'function'; I may
> mention after that sentence something like: "those cannot easily be
> represented in the automaton struct".
> 
> Not sure if saying more wouldn't make it even more confusing than it
> already is.

Yeah, probably. As mentioned, feel free to ignore, it was just a
thought. :)


^ permalink raw reply

* Re: [PATCH v7 05/15] Documentation/rv: Add documentation about hybrid automata
From: gmonaco @ 2026-03-13 13:05 UTC (permalink / raw)
  To: Juri Lelli
  Cc: linux-kernel, Steven Rostedt, Nam Cao, Juri Lelli,
	Jonathan Corbet, linux-trace-kernel, linux-doc, Tomas Glozar,
	Clark Williams, John Kacur
In-Reply-To: <abKX1XO4vqY74uA7@jlelli-thinkpadt14gen4.remote.csb>

Hello,

On Thu, 2026-03-12 at 11:39 +0100, Juri Lelli wrote:
> Very minor nit, feel free to ignore, but ...
> 
> The formal 7-tuple definition includes 'i' (invariant function), but
> unlike other elements, 'i' isn't stored in the automaton struct -
> it's implemented as generated code in ha_verify_constraint(), IIUC.
> Worth a brief note clarifying this design choice so readers don't
> expect to find an invariants[] member in the struct? Here or below in
> the example C code section.

Thanks for the review! I haven't really thought of that.
At this stage we are not mentioning any struct element (it's purely
theoretical), so there shouldn't be any expectation from the reader.

Later I mention "The function verify_constraint checks guards,
performs resets and starts timers to validate invariants according to
specification".
In fact, guards are also not represented as part of 'function'; I may
mention after that sentence something like: "those cannot easily be
represented in the automaton struct".

Not sure if saying more wouldn't make it even more confusing than it
already is.

Thanks,
Gabriele
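
To illustrate the point discussed above (invariants and guards live in
code, not in the automaton struct), here is a minimal userspace sketch.
Every name in it (toy_automaton, toy_verify_constraint, toy_step, the
states, events and the budget) is hypothetical and is not taken from the
RV sources; the semantics are simplified to "the invariant of the
current state must hold when an event arrives".

```c
#include <stdbool.h>

enum toy_state { TOY_IDLE, TOY_RUNNING, TOY_STATE_MAX };
enum toy_event { TOY_START, TOY_STOP, TOY_EVENT_MAX };
#define TOY_INVALID TOY_STATE_MAX

/* Only names, the transition function and the initial state are data. */
struct toy_automaton {
	const char *state_names[TOY_STATE_MAX];
	const char *event_names[TOY_EVENT_MAX];
	unsigned char function[TOY_STATE_MAX][TOY_EVENT_MAX];
	unsigned char initial_state;
};

static const struct toy_automaton toy = {
	.state_names = { "idle", "running" },
	.event_names = { "start", "stop" },
	.function = {
		[TOY_IDLE]    = { [TOY_START] = TOY_RUNNING, [TOY_STOP] = TOY_INVALID },
		[TOY_RUNNING] = { [TOY_START] = TOY_INVALID, [TOY_STOP] = TOY_IDLE },
	},
	.initial_state = TOY_IDLE,
};

#define TOY_RUNNING_BUDGET_NS 1000ULL	/* assumed value, for illustration */

/*
 * The invariant i(running): clk < budget. There is no invariants[]
 * member to look up; the constraint is expressed as code.
 */
static bool toy_verify_constraint(enum toy_state cur, unsigned long long clk_ns)
{
	switch (cur) {
	case TOY_RUNNING:
		return clk_ns < TOY_RUNNING_BUDGET_NS;
	default:
		return true;	/* no invariant on the other states */
	}
}

/* Returns false on an invalid transition or a violated invariant. */
static bool toy_step(enum toy_state *cur, enum toy_event ev,
		     unsigned long long clk_ns)
{
	unsigned char next = toy.function[*cur][ev];

	if (next == TOY_INVALID || !toy_verify_constraint(*cur, clk_ns))
		return false;
	*cur = (enum toy_state)next;
	return true;
}
```

The transition table can be generated as plain data, whereas a clock
comparison per state is more naturally emitted as a switch statement,
which is one possible reading of "cannot easily be represented in the
automaton struct".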

^ permalink raw reply

* Re: [PATCH 15/15] btrfs: Use trace_invoke_##name() at guarded tracepoint call sites
From: David Sterba @ 2026-03-13 11:57 UTC (permalink / raw)
  To: Vineeth Pillai (Google)
  Cc: Steven Rostedt, Peter Zijlstra, Chris Mason, David Sterba,
	linux-btrfs, linux-kernel, linux-trace-kernel
In-Reply-To: <20260312150523.2054552-16-vineeth@bitbyteword.org>

On Thu, Mar 12, 2026 at 11:05:10AM -0400, Vineeth Pillai (Google) wrote:
> Replace trace_foo() with the new trace_invoke_foo() at sites already
> guarded by trace_foo_enabled(), avoiding a redundant
> static_branch_unlikely() re-evaluation inside the tracepoint.
> trace_invoke_foo() calls the tracepoint callbacks directly without
> utilizing the static branch again.
> 
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> Assisted-by: Claude:claude-sonnet-4-6

Acked-by: David Sterba <dsterba@suse.com>
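
As a rough illustration of the redundancy described in the quoted commit
message, the following userspace model counts how often the "enabled"
key is tested at a guarded call site. It is only a sketch: the static
branch is modelled as a plain bool, and foo_key_checks,
foo_callbacks_run, expensive_arg() and the two call_site_*() helpers are
hypothetical names introduced for the example.

```c
#include <stdbool.h>

static bool foo_enabled_key;	/* stands in for the static branch */
static int foo_key_checks;	/* how many times the key was tested */
static int foo_callbacks_run;	/* how many times callbacks were invoked */

static bool trace_foo_enabled(void)
{
	foo_key_checks++;
	return foo_enabled_key;
}

/* Calls the callbacks directly, without testing the key again. */
static void trace_invoke_foo(int arg)
{
	(void)arg;
	foo_callbacks_run++;
}

/* The regular entry point: tests the key, then invokes. */
static void trace_foo(int arg)
{
	if (trace_foo_enabled())
		trace_invoke_foo(arg);
}

/* Hypothetical costly argument that motivates the outer guard. */
static int expensive_arg(void)
{
	return 42;
}

/* Before: the guard and trace_foo() both test the key. */
static void call_site_old(void)
{
	if (trace_foo_enabled())
		trace_foo(expensive_arg());
}

/* After: the guard is the only test. */
static void call_site_new(void)
{
	if (trace_foo_enabled())
		trace_invoke_foo(expensive_arg());
}
```

With the key enabled, call_site_old() tests it twice and call_site_new()
tests it once; both run the callbacks exactly once.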

^ permalink raw reply

* [PATCH v2] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Vincent Donnefort @ 2026-03-13 10:58 UTC (permalink / raw)
  To: maz
  Cc: rostedt, arnd, nathan, linux-trace-kernel, kvmarm, kernel-team,
	Vincent Donnefort

Compiler and tooling-generated symbols are difficult to maintain
across all supported architectures. Make the allowlist more robust by
replacing the hardcoded list with a mechanism that automatically detects
these symbols.

This mechanism generates a C function designed to trigger common
compiler-inserted symbols.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>

---

Changes in v2:

  - Use filechk (Nathan)
  - Removed deprecated extra-y (Nathan)
  - Added simple_ring_buffer in allowlist (Nathan)
  - Added memcpy() to generate more symbols (Nathan)
  - Added __sancov 

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index beb15936829d..96627a909ecc 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -136,17 +136,42 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
 # simple_ring_buffer is used by the pKVM hypervisor which does not have access
 # to all kernel symbols. Fail the build if forbidden symbols are found.
 #
-UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind
-UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov
-UNDEFINED_ALLOWLIST += .TOC\. __clear_pages_unrolled __memmove copy_page warn_slowpath_fmt
-UNDEFINED_ALLOWLIST += ftrace_likely_update __hwasan_load __hwasan_store __hwasan_tag_memory
-UNDEFINED_ALLOWLIST += warn_bogus_irq_restore __stack_chk_guard
-UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST))
+# undefsyms_base generates a set of compiler and tooling-generated symbols that can
+# safely be ignored for simple_ring_buffer.
+#
+filechk_undefsyms_base = \
+	echo '$(pound)include <linux/atomic.h>'; \
+	echo '$(pound)include <linux/string.h>'; \
+	echo '$(pound)include <asm/page.h>'; \
+	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);'; \
+	echo 'void undefsyms_base(void *p, int n);'; \
+	echo 'void undefsyms_base(void *p, int n) {'; \
+	echo '	char buffer[256] = { 0 };'; \
+	echo '	u32 u = 0;'; \
+	echo '	memset((char * volatile)page, 8, PAGE_SIZE);'; \
+	echo '	memset((char * volatile)buffer, 8, sizeof(buffer));'; \
+	echo '	memcpy((void * volatile)p, buffer, sizeof(buffer));'; \
+	echo '	cmpxchg((u32 * volatile)&u, 0, 8);'; \
+	echo '	WARN_ON(n == 0xdeadbeef);'; \
+	echo '}'
+
+$(obj)/undefsyms_base.c: FORCE
+	$(call filechk,undefsyms_base)
+
+clean-files += undefsyms_base.c
+
+$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
+
+targets += undefsyms_base.o
+
+UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sancov __sanitizer __tsan __ubsan __x86_indirect_thunk \
+		      simple_ring_buffer \
+		      $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
 
 quiet_cmd_check_undefined = NM      $<
-      cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(UNDEFINED_ALLOWLIST)`"
+      cmd_check_undefined = test -z "`$(NM) -u $< | grep -v $(addprefix -e , $(UNDEFINED_ALLOWLIST))`"
 
-$(obj)/%.o.checked: $(obj)/%.o FORCE
+$(obj)/%.o.checked: $(obj)/%.o $(obj)/undefsyms_base.o FORCE
 	$(call if_changed,check_undefined)
 
 always-$(CONFIG_SIMPLE_RING_BUFFER) += simple_ring_buffer.o.checked

base-commit: 33f2e266515717c4b2df585dadefa0525557726c
-- 
2.53.0.851.ga537e3e6e9-goog
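
For readers less familiar with the check, the logic of
cmd_check_undefined in the patch above can be modelled in a few lines of
C. This is a sketch under simplifying assumptions: grep treats each
allowlist entry as a regular expression matched anywhere in the nm
output line, while the model below uses plain substring matching on the
symbol name. count_forbidden() is a hypothetical helper, not part of the
patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns the number of undefined symbols that match no allowlist
 * entry. The Makefile rule succeeds only when this would be zero
 * (test -z on the filtered nm output).
 */
static size_t count_forbidden(const char *const *undef, size_t n_undef,
			      const char *const *allow, size_t n_allow)
{
	size_t forbidden = 0;

	for (size_t i = 0; i < n_undef; i++) {
		int allowed = 0;

		/* Like "grep -e": a match anywhere in the name is enough. */
		for (size_t j = 0; j < n_allow && !allowed; j++)
			allowed = strstr(undef[i], allow[j]) != NULL;
		if (!allowed)
			forbidden++;
	}
	return forbidden;
}
```

In the patch, the allowlist is the union of the fixed prefixes and
whatever undefined symbols undefsyms_base.o itself ends up with, so a
symbol the compiler inserts into both objects is accepted without being
listed by hand.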


^ permalink raw reply related

* Re: [PATCH] tracing: Generate undef symbols allowlist for simple_ring_buffer
From: Vincent Donnefort @ 2026-03-13 10:23 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: maz, rostedt, arnd, linux-trace-kernel, kvmarm, kernel-team
In-Reply-To: <20260312235153.GA1147071@ax162>

On Thu, Mar 12, 2026 at 04:51:53PM -0700, Nathan Chancellor wrote:
> Hi Vincent,
> 
> On Thu, Mar 12, 2026 at 06:20:10PM +0000, Vincent Donnefort wrote:
> > Compiler and tooling-generated symbols are difficult to maintain
> > across all supported architectures. Make the allowlist more robust by
> > replacing the hardcoded list with a mechanism that automatically detects
> > these symbols.
> > 
> > This mechanism generates a C function designed to trigger common
> > compiler-inserted symbols.
> 
> This certainly seems more robust.
> 
> > Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
> > 
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index beb15936829d..3b427b76434a 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -136,17 +136,37 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
> >  # simple_ring_buffer is used by the pKVM hypervisor which does not have access
> >  # to all kernel symbols. Fail the build if forbidden symbols are found.
> >  #
> > -UNDEFINED_ALLOWLIST := memset alt_cb_patch_nops __x86 __ubsan __asan __kasan __gcov __aeabi_unwind
> > -UNDEFINED_ALLOWLIST += __stack_chk_fail stackleak_track_stack __ref_stack __sanitizer llvm_gcda llvm_gcov
> > -UNDEFINED_ALLOWLIST += .TOC\. __clear_pages_unrolled __memmove copy_page warn_slowpath_fmt
> > -UNDEFINED_ALLOWLIST += ftrace_likely_update __hwasan_load __hwasan_store __hwasan_tag_memory
> > -UNDEFINED_ALLOWLIST += warn_bogus_irq_restore __stack_chk_guard
> > -UNDEFINED_ALLOWLIST := $(addprefix -e , $(UNDEFINED_ALLOWLIST))
> > +# undefsyms_base generates a set of compiler and tooling-generated symbols that can
> > +# safely be ignored for simple_ring_buffer.
> > +#
> > +$(obj)/undefsyms_base.c: FORCE
> > +	$(Q)echo '#include <asm/page.h>' > $@
> > +	$(Q)echo '#include <asm/local.h>' >> $@
> > +	$(Q)echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);' >> $@
> > +	$(Q)echo 'void undefsyms_base(int n);' >> $@
> > +	$(Q)echo 'void undefsyms_base(int n) {' >> $@
> > +	$(Q)echo '	char buffer[256] = { 0 };' >> $@
> > +	$(Q)echo '	u32 u = 0;' >> $@
> > +	$(Q)echo '	memset((char * volatile)page, 8, PAGE_SIZE);' >> $@
> > +	$(Q)echo '	memset((char * volatile)buffer, 8, sizeof(buffer));' >> $@
> > +	$(Q)echo '	cmpxchg((u32 * volatile)&u, 0, 8);' >> $@
> > +	$(Q)echo '	WARN_ON(n == 0xdeadbeef);' >> $@
> > +	$(Q)echo '}' >> $@
> 
> This should use filechk, otherwise undefsyms_base.c will be regenerated
> every build, resulting in undefsyms_base.o being rebuilt every time.
> 
>   $ make -skj"$(nproc)" ARCH=x86_64 mrproper allmodconfig kernel/trace/
> 
>   $ make -skj"$(nproc)" ARCH=x86_64 V=2 kernel/trace/
>   ...
>     CC      kernel/trace/undefsyms_base.o - due to: kernel/trace/undefsyms_base.c
>     NM      kernel/trace/simple_ring_buffer.o - due to target missing
> 
> filechk_undefsyms_base = {                                              \
> 	echo '$(pound)include <asm/page.h>';                            \
> 	echo '$(pound)include <asm/local.h>';                           \
> 	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);';       \
> 	echo 'void undefsyms_base(int n);';                             \
> 	echo 'void undefsyms_base(int n) {';                            \
> 	echo '	char buffer[256] = { 0 };';                             \
> 	echo '	u32 u = 0;';                                            \
> 	echo '	memset((char * volatile)page, 8, PAGE_SIZE);';          \
> 	echo '	memset((char * volatile)buffer, 8, sizeof(buffer));';   \
> 	echo '	cmpxchg((u32 * volatile)&u, 0, 8);';                    \
> 	echo '	WARN_ON(n == 0xdeadbeef);';                             \
> 	echo '}';                                                       \
> 	}
> 
> $(obj)/undefsyms_base.c: FORCE
> 	$(call filechk,undefsyms_base)
> 
>   $ make -skj"$(nproc)" ARCH=x86_64 mrproper allmodconfig kernel/trace/
> 
>   $ make -skj"$(nproc)" ARCH=x86_64 V=2 kernel/trace/
>     GEN     Makefile - due to target is PHONY
>     DESCEND objtool
>     CALL    scripts/checksyscalls.sh - due to target is PHONY
>     INSTALL libsubcmd_headers
>     NM      kernel/trace/simple_ring_buffer.o - due to target missing
> 
> > +clean-files += undefsyms_base.c
> > +targets += undefsyms_base.c
> 
> I don't think this targets addition is necessary.
> 
> > +$(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
> > +
> > +extra-y += undefsyms_base.o
> 
> I think this should be
> 
>   targets += undefsyms_base.o
> 
> as extra-y is deprecated per Documentation/kbuild/makefiles.rst.
> 
> > +UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sanitizer __tsan __ubsan __x86_indirect_thunk \
> > +		       $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
> 
> With an allmodconfig + ThinLTO build, I still see:
> 
>   $ cat allmod.config
>   CONFIG_GCOV_KERNEL=n
>   CONFIG_KASAN=n
>   CONFIG_LTO_CLANG_THIN=y
> 
>   $ make -skj"$(nproc)" ARCH=x86_64 KCONFIG_ALLCONFIG=1 LLVM=1 mrproper allmodconfig kernel/trace/
>   Unexpected symbols in kernel/trace/simple_ring_buffer.o:
>                    U __fortify_panic
>                    U __write_overflow_field
>                    U simple_ring_buffer_commit
>                    U simple_ring_buffer_enable_tracing
>                    U simple_ring_buffer_init
>                    U simple_ring_buffer_reserve
>                    U simple_ring_buffer_reset
>                    U simple_ring_buffer_swap_reader_page
>                    U simple_ring_buffer_unload
> 
> Something like:
> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 48c415a0c7e4..0f9a6ce9abd9 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -142,13 +142,15 @@ obj-$(CONFIG_TRACE_REMOTE_TEST) += remote_test.o
>  filechk_undefsyms_base = {                                              \
>  	echo '$(pound)include <asm/page.h>';                            \
>  	echo '$(pound)include <asm/local.h>';                           \
> +	echo '$(pound)include <linux/string.h>';                        \
>  	echo 'static char page[PAGE_SIZE] __aligned(PAGE_SIZE);';       \
> -	echo 'void undefsyms_base(int n);';                             \
> -	echo 'void undefsyms_base(int n) {';                            \
> +	echo 'void undefsyms_base(int n, void *ptr);';                  \
> +	echo 'void undefsyms_base(int n, void *ptr) {';                 \
>  	echo '	char buffer[256] = { 0 };';                             \
>  	echo '	u32 u = 0;';                                            \
>  	echo '	memset((char * volatile)page, 8, PAGE_SIZE);';          \
>  	echo '	memset((char * volatile)buffer, 8, sizeof(buffer));';   \
> +	echo '	memcpy((void* volatile)ptr, buffer, sizeof(buffer));';  \
>  	echo '	cmpxchg((u32 * volatile)&u, 0, 8);';                    \
>  	echo '	WARN_ON(n == 0xdeadbeef);';                             \
>  	echo '}';                                                       \
> --
> 
> cures the first two. The simple_ring_buffer symbols are very odd...
> 
>   $ llvm-nm kernel/trace/simple_ring_buffer.o | grep simple_ring_buffer
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_commit_845
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_enable_tracing_849
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_init_847
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_reserve_841
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_reset_846
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_swap_reader_page_837
>   ---------------- d __UNIQUE_ID_addressable_simple_ring_buffer_unload_848
>   ---------------- t __export_symbol_simple_ring_buffer_commit
>   ---------------- t __export_symbol_simple_ring_buffer_enable_tracing
>   ---------------- t __export_symbol_simple_ring_buffer_init
>   ---------------- t __export_symbol_simple_ring_buffer_reserve
>   ---------------- t __export_symbol_simple_ring_buffer_reset
>   ---------------- t __export_symbol_simple_ring_buffer_swap_reader_page
>   ---------------- t __export_symbol_simple_ring_buffer_unload
>   ---------------- T simple_ring_buffer_commit
>                    U simple_ring_buffer_commit
>   ---------------- T simple_ring_buffer_enable_tracing
>                    U simple_ring_buffer_enable_tracing
>                    U simple_ring_buffer_init
>   ---------------- T simple_ring_buffer_init
>   ---------------- T simple_ring_buffer_init_mm
>   ---------------- T simple_ring_buffer_reserve
>                    U simple_ring_buffer_reserve
>   ---------------- T simple_ring_buffer_reset
>                    U simple_ring_buffer_reset
>                    U simple_ring_buffer_swap_reader_page
>   ---------------- T simple_ring_buffer_swap_reader_page
>                    U simple_ring_buffer_unload
>   ---------------- T simple_ring_buffer_unload
>   ---------------- T simple_ring_buffer_unload_mm
> 
> This is LLVM IR bitcode at this stage, which could be messing things up.
> 
>   $ file kernel/trace/simple_ring_buffer.o
>   kernel/trace/simple_ring_buffer.o: LLVM IR bitcode
> 
> Maybe not worth thinking about too much and just adding it to the
> allowlist manually?

That looks good, thanks for having a look. I'll spin a v2 with your comments.

> 
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index 0f9a6ce9abd9..cb1ec50a8386 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -166,6 +166,7 @@ $(obj)/undefsyms_base.o: $(obj)/undefsyms_base.c
>  targets += undefsyms_base.o
>  
>  UNDEFINED_ALLOWLIST = __asan __gcov __kasan __kcsan __hwasan __sanitizer __tsan __ubsan __x86_indirect_thunk \
> +		       simple_ring_buffer \
>  		       $(shell $(NM) -u $(obj)/undefsyms_base.o 2>/dev/null | awk '{print $$2}')
>  
>  quiet_cmd_check_undefined = NM      $<
> --
> 
> Cheers,
> Nathan
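
The reason filechk avoids the rebuild noted in the review above is that
it replaces the target only when the generated content differs, so the
file's timestamp is left alone on an unchanged regeneration. A minimal
C sketch of that "write if changed" idea follows; write_if_changed() is
a hypothetical helper and not how kbuild implements filechk (which is a
make/shell macro).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 if the file was (re)written, 0 if it already held exactly
 * `content`, and -1 on error.
 */
static int write_if_changed(const char *path, const char *content)
{
	size_t len = strlen(content);
	FILE *f = fopen(path, "rb");

	if (f) {
		char *old = malloc(len + 1);
		size_t got;
		int same;

		if (!old) {
			fclose(f);
			return -1;
		}
		/* Ask for one extra byte so a longer file is detected. */
		got = fread(old, 1, len + 1, f);
		same = got == len && memcmp(old, content, len) == 0;
		free(old);
		fclose(f);
		if (same)
			return 0;	/* leave the file and its mtime untouched */
	}

	f = fopen(path, "wb");
	if (!f)
		return -1;
	if (fwrite(content, 1, len, f) != len) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 1 : -1;
}
```

Because make decides what to rebuild from timestamps, skipping the write
in the unchanged case is what keeps undefsyms_base.o from being
recompiled on every build.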

^ permalink raw reply

* Re: [PATCH v3 1/4] tracing/preemptirq: Optimize preempt_disable/enable() tracepoint overhead
From: Peter Zijlstra @ 2026-03-13  9:04 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Masami Hiramatsu, Mathieu Desnoyers, Andrew Morton,
	open list:SCHEDULER, open list:TRACING, acme, williams, gmonaco
In-Reply-To: <abLzS0T_wEt_SkL6@fedora>

On Thu, Mar 12, 2026 at 02:19:15PM -0300, Wander Lairson Costa wrote:

> > That's significant bloat, for really very little gain. Realistically
> > nobody is going to need these.
> > 
> 
> Of course, I can't speak for others, but more than once I debugged issues
> where those tracepoints made my life far easier. Those cases convinced
> me that such a feature would be worth it. But if you don't see
> value and will reject the patches no matter what, nothing can be done,
> and I will have to accept defeat.

If distros are going to enable this, I suppose I'm not going to stop
this. But I do very much worry about the general bloat of things, there
are a *LOT* of preempt_{dis,en}able() sites.

^ permalink raw reply

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-03-13  8:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-trace-kernel, x86, aik, andrew.jones, binbin.wu, bp,
	brauner, chao.p.peng, chao.p.peng, chenhuacai, corbet,
	dave.hansen, david, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao
In-Reply-To: <abNcEkNseDEBIhop@google.com>

Hi,

On Fri, 13 Mar 2026 at 00:36, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 12, 2026, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> >
> > > On Thu, Mar 12, 2026, Fuad Tabba wrote:
> > >> Hi Ackerley,
> > >>
> > >> Before getting into the UAPI semantics, thank you for all the heavy
> > >> lifting you've done here. Figuring out how to make it all work across
> > >> the different platforms is not easy :)
> > >>
> > >> <snip>
> > >>
> > >> > The policy definitions below provide more details:
> > >
> > > Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
> > > these are not "policy", they are purely properties of the underlying memory.
> > > Userspace will likely use the attributes to implement policy of some kind, but
> > > KVM straight up doesn't care.
> >
> > Policy might have been the wrong word. I think this is a property of the
> > conversion process/request, not a property of the memory like how
> > shared/private is a property of the memory?
> >
> > I'll have to find another word to describe this enum of
>
> Or just don't?  I'm 100% serious, because unless we carve out a field _just_ for
> these two flags, they're eventually going to get mixed with other stuff.  At that
> point, having a precisely named enum container just gets in the way.

I agree. It makes sense to drop the enum wrapper and the "policy"
terminology entirely. Let's go with direct flags passed to the ioctl
representing the requested memory properties upon conversion.

> > I see you dropped any documentation to do with testing.
>
> Yes.
>
> > I meant to document it (at least something about the unspecified case) so it
> > can be relied on in selftests, with the understanding (already specified
> > elsewhere in Documentation/virt/kvm/api.rst) that nothing about
> > KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be changed
> > anytime. What do you think?
>
> KVM_X86_SW_PROTECTED_VM should self-report like all other VM types, and shouldn't
> support anything that isn't documented as possible.  I.e. we shouldn't allow
> ZERO on shared=>private "for testing".
>
> What I do think we should do is scribble memory on conversions without ZERO or
> PRIVATE, probably guarded by a Kconfig or maybe a module param, to do a best
> effort enforcement of the ABI, i.e. to try and prevent userspace from depending
> on uarch/vendor specific behavior.

I strongly agree with scribbling/poisoning the memory on default
conversions. If userspace specifies neither flag, actively destroying
the data in software is the only way to strictly enforce that the ABI
makes no guarantees, preventing the VMM from implicitly relying on
underlying hardware behavior (like TDX automatically zeroing).

Cheers,
/fuad
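
A small userspace model of the scribbling idea, purely for illustration:
SKETCH_ZERO, SKETCH_PRESERVE, SKETCH_POISON and
sketch_convert_contents() are hypothetical names and values, not KVM
UAPI, and the Kconfig or module-parameter guard mentioned above is left
out.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_ZERO	(1u << 0)	/* hypothetical flag values */
#define SKETCH_PRESERVE	(1u << 1)
#define SKETCH_POISON	0xA5		/* arbitrary poison byte */

/*
 * Models what happens to the memory contents on a conversion:
 *  - PRESERVE: contents are kept,
 *  - ZERO: contents are zeroed,
 *  - neither: contents are deliberately destroyed, so that userspace
 *    cannot come to depend on what a given platform happens to do.
 */
static void sketch_convert_contents(unsigned char *mem, size_t len,
				    unsigned int flags)
{
	if (flags & SKETCH_PRESERVE)
		return;
	if (flags & SKETCH_ZERO) {
		memset(mem, 0, len);
		return;
	}
	memset(mem, SKETCH_POISON, len);
}
```

The default branch is the interesting one: a VMM that forgot to pass
either flag would see poisoned data on every platform instead of, say,
zeroes on one and stale data on another.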

^ permalink raw reply

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
From: Fuad Tabba @ 2026-03-13  8:31 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, kvm, linux-doc, linux-kernel,
	linux-kselftest, linux-trace-kernel, x86, aik, andrew.jones,
	binbin.wu, bp, brauner, chao.p.peng, chao.p.peng, chenhuacai,
	corbet, dave.hansen, david, hpa, ira.weiny, jgg, jmattson,
	jroedel, jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao
In-Reply-To: <CAEvNRgFUc+9xCoN9Yo5NThHrvbccWAhPwp9nNM2fvx7QqrcJsg@mail.gmail.com>

Hi Ackerley,

<snip>

> > By default, KVM makes no guarantees about the in-memory values after memory is
> > converted to/from shared/private.  Optionally, userspace may instruct KVM to
> > ensure the contents of memory are zeroed or preserved, e.g. to enable in-place
> > sharing of data, or as an optimization to avoid having to re-zero memory when
> > the trusted entity guarantees the memory will be zeroed after conversion.
> >
>
> How about:
>
> or as an optimization to avoid having to re-zero memory when userspace
> could have relied on the trusted entity to guarantee the memory will be
> zeroed as part of the entire conversion process.
>
> > The behaviors supported by a given KVM instance can be queried via <cap>.  If
>
> I started with some implementation and was questioning the value of a
> CAP. It seems like there won't be anything dynamic about this?

We can drop the CAP for now. Probing via the ioctl and handling
-EOPNOTSUPP is entirely sufficient for the VMM to discover whether
ZERO or PRESERVE are supported for a given architecture and conversion
direction.

> The userspace code can check what platform it is running on, and then
> decide ZERO or PRESERVE based on the platform:
>
> If the VM is running on TDX, it would want to specify ZERO all the
> time. If the VM were running on pKVM it would want to specify PRESERVE
> if it wants to enable in-place sharing, and ZERO if it wants to zero the
> memory.
>
> If someday TDX supports PRESERVE, then there's room for discovery of
> which algorithm to choose when running the guest. Perhaps that's when
> the CAP should be introduced?
>
> > the requested behavior is unsupported, KVM will return -EOPNOTSUPP and
> > reject the conversion request.  Note!  The "ZERO" request is only supported for
> > private to shared conversion!

I think that this makes sense for the UAPI: return -EOPNOTSUPP for
shared-to-private ZERO conversions.

For pKVM's specific use cases where the VMM requires a zeroed page to
be injected into the guest's private space via attribute conversion,
the VMM can simply `memset()` the shared memory to zero in userspace,
and then invoke the ioctl with the
`KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE` flag. This completely offloads
the UAPI from making guarantees on behalf of the trusted entity, while
still satisfying pKVM's functional requirements.

Cheers,
/fuad
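
The two points above (discovery through -EOPNOTSUPP, and zeroing in
userspace before a PRESERVE conversion) can be sketched as a userspace
model. Nothing here is the real ioctl: demo_vm, demo_convert(),
demo_inject_zeroed_page() and the DEMO_* values are hypothetical, and
the per-direction support matrix is whatever the caller fills in. Note
also that the model's probe has no side effects, whereas a real
conversion request would.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum demo_dir { DEMO_SHARED_TO_PRIVATE, DEMO_PRIVATE_TO_SHARED, DEMO_DIR_MAX };

#define DEMO_ZERO	(1u << 0)	/* hypothetical flag values */
#define DEMO_PRESERVE	(1u << 1)

/* Which flags the modelled platform accepts for each direction. */
struct demo_vm {
	unsigned int supported[DEMO_DIR_MAX];
};

/* Stand-in for the conversion request: rejects unsupported flags. */
static int demo_convert(const struct demo_vm *vm, enum demo_dir dir,
			unsigned int flags)
{
	return (flags & ~vm->supported[dir]) ? -EOPNOTSUPP : 0;
}

/*
 * Getting a zeroed page into private space without asking for ZERO on
 * shared-to-private: zero it while it is still shared, then convert
 * with PRESERVE.
 */
static int demo_inject_zeroed_page(const struct demo_vm *vm, void *page,
				   size_t len)
{
	memset(page, 0, len);
	return demo_convert(vm, DEMO_SHARED_TO_PRIVATE, DEMO_PRESERVE);
}
```

A VMM written this way needs no capability query: it issues the request
it wants and falls back (or fails) when it sees -EOPNOTSUPP.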

^ permalink raw reply
