public inbox for netdev@vger.kernel.org
* Re: [PATCH v2] net: mscc: ocelot: add missing lock protection in ocelot_port_xmit()
From: Ziyi Guo @ 2026-02-08 22:16 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Paolo Abeni, Claudiu Manoil, Alexandre Belloni, UNGLinuxDriver,
	Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	netdev, linux-kernel
In-Reply-To: <20260208133526.3ed3qzyyj567ylah@skbuf>

On Sun, Feb 8, 2026 at 7:35 AM Vladimir Oltean
<vladimir.oltean@nxp.com> wrote:
>
> The idea is not bad, but I would move one step further.
>
> Refactor the rew_op handling into an ocelot_xmit_timestamp() function as
> a first preparatory patch. The logic will need to be called from two
> places and it's good not to duplicate it.
>
> Then create two separate ocelot_port_xmit_fdma() and ocelot_port_xmit_inj(),
> as a second preparatory patch.
>
> static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
> {
>         if (static_branch_unlikely(&ocelot_fdma_enabled))
>                 return ocelot_port_xmit_fdma(skb, dev);
>
>         return ocelot_port_xmit_inj(skb, dev);
> }
>
> Now, as the third patch, add the required locking in ocelot_port_xmit_inj().
>
> It's best for the FDMA vs register injection code paths to be as
> separate as possible.


Thanks Vladimir and folks for your time. I'll split it into a
three-patch series and send a v3.

Best,
Ziyi

^ permalink raw reply

* [PATCH v3 0/3] net: mscc: ocelot: fix missing lock in ocelot_port_xmit()
From: Ziyi Guo @ 2026-02-08 22:55 UTC (permalink / raw)
  To: Vladimir Oltean, Claudiu Manoil, Alexandre Belloni
  Cc: UNGLinuxDriver, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel, Ziyi Guo

ocelot_port_xmit() calls ocelot_can_inject() and
ocelot_port_inject_frame() without holding the injection group lock.
Both functions contain a lockdep_assert_held() check for the injection
lock, and the other caller, felix_port_deferred_xmit(), correctly
acquires it via ocelot_lock_inj_grp() before calling these functions.

This v3 splits the fix into a three-patch series to separate the
refactoring from the behavioral change:

  1/3: Extract the PTP timestamp handling into an ocelot_xmit_timestamp()
       helper so the logic isn't duplicated when the function is split.

  2/3: Split ocelot_port_xmit() into ocelot_port_xmit_fdma() and
       ocelot_port_xmit_inj(), keeping the FDMA and register injection
       code paths fully separate.

  3/3: Add ocelot_lock_inj_grp()/ocelot_unlock_inj_grp() in
       ocelot_port_xmit_inj() to fix the missing lock protection.

Patches 1-2 are pure refactors with no behavioral change.
Patch 3 is the actual bug fix.

v3:
 - Split into 3-patch series per Vladimir's review
 - Separate FDMA and register injection paths into distinct functions
v2:
 - Added Fixes tag
v1:
 - Initial submission

Ziyi Guo (3):
  net: mscc: ocelot: extract ocelot_xmit_timestamp() helper
  net: mscc: ocelot: split xmit into FDMA and register injection paths
  net: mscc: ocelot: add missing lock protection in
    ocelot_port_xmit_inj()

 drivers/net/ethernet/mscc/ocelot_net.c | 75 +++++++++++++++++++-------
 1 file changed, 56 insertions(+), 19 deletions(-)

-- 
2.34.1



* [PATCH v3 1/3] net: mscc: ocelot: extract ocelot_xmit_timestamp() helper
From: Ziyi Guo @ 2026-02-08 22:56 UTC (permalink / raw)
  To: Vladimir Oltean, Claudiu Manoil, Alexandre Belloni
  Cc: UNGLinuxDriver, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel, Ziyi Guo
In-Reply-To: <20260208225602.1339325-1-n7l8m4@u.northwestern.edu>

Extract the PTP timestamp handling logic from ocelot_port_xmit() into a
separate ocelot_xmit_timestamp() helper function. This is a pure
refactor with no behavioral change.

The helper returns false if the skb was consumed (freed) due to a
timestamp request failure, and true if the caller should continue with
frame injection. The rew_op value is returned via pointer.

This prepares for splitting ocelot_port_xmit() into separate FDMA and
register injection paths in a subsequent patch.

Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu>
---
 drivers/net/ethernet/mscc/ocelot_net.c | 36 ++++++++++++++++----------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index 469784d3a1a6..ef4a6c768de9 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -551,33 +551,41 @@ static int ocelot_port_stop(struct net_device *dev)
 	return 0;
 }
 
-static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
+static bool ocelot_xmit_timestamp(struct ocelot *ocelot, int port,
+				  struct sk_buff *skb, u32 *rew_op)
 {
-	struct ocelot_port_private *priv = netdev_priv(dev);
-	struct ocelot_port *ocelot_port = &priv->port;
-	struct ocelot *ocelot = ocelot_port->ocelot;
-	int port = priv->port.index;
-	u32 rew_op = 0;
-
-	if (!static_branch_unlikely(&ocelot_fdma_enabled) &&
-	    !ocelot_can_inject(ocelot, 0))
-		return NETDEV_TX_BUSY;
-
-	/* Check if timestamping is needed */
 	if (ocelot->ptp && (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
 		struct sk_buff *clone = NULL;
 
 		if (ocelot_port_txtstamp_request(ocelot, port, skb, &clone)) {
 			kfree_skb(skb);
-			return NETDEV_TX_OK;
+			return false;
 		}
 
 		if (clone)
 			OCELOT_SKB_CB(skb)->clone = clone;
 
-		rew_op = ocelot_ptp_rew_op(skb);
+		*rew_op = ocelot_ptp_rew_op(skb);
 	}
 
+	return true;
+}
+
+static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ocelot_port_private *priv = netdev_priv(dev);
+	struct ocelot_port *ocelot_port = &priv->port;
+	struct ocelot *ocelot = ocelot_port->ocelot;
+	int port = priv->port.index;
+	u32 rew_op = 0;
+
+	if (!static_branch_unlikely(&ocelot_fdma_enabled) &&
+	    !ocelot_can_inject(ocelot, 0))
+		return NETDEV_TX_BUSY;
+
+	if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op))
+		return NETDEV_TX_OK;
+
 	if (static_branch_unlikely(&ocelot_fdma_enabled)) {
 		ocelot_fdma_inject_frame(ocelot, port, rew_op, skb, dev);
 	} else {
-- 
2.34.1



* [PATCH v3 2/3] net: mscc: ocelot: split xmit into FDMA and register injection paths
From: Ziyi Guo @ 2026-02-08 22:56 UTC (permalink / raw)
  To: Vladimir Oltean, Claudiu Manoil, Alexandre Belloni
  Cc: UNGLinuxDriver, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel, Ziyi Guo
In-Reply-To: <20260208225602.1339325-1-n7l8m4@u.northwestern.edu>

Split ocelot_port_xmit() into two separate functions:
- ocelot_port_xmit_fdma(): handles the FDMA injection path
- ocelot_port_xmit_inj(): handles the register-based injection path

The top-level ocelot_port_xmit() now dispatches to the appropriate
function based on the ocelot_fdma_enabled static key.

This is a pure refactor with no behavioral change. Separating the two
code paths makes each one simpler and prepares for adding proper locking
to the register injection path without affecting the FDMA path.

Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu>
---
 drivers/net/ethernet/mscc/ocelot_net.c | 39 ++++++++++++++++++++------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index ef4a6c768de9..d6b0936beca2 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -571,7 +571,25 @@ static bool ocelot_xmit_timestamp(struct ocelot *ocelot, int port,
 	return true;
 }
 
-static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t ocelot_port_xmit_fdma(struct sk_buff *skb,
+					 struct net_device *dev)
+{
+	struct ocelot_port_private *priv = netdev_priv(dev);
+	struct ocelot_port *ocelot_port = &priv->port;
+	struct ocelot *ocelot = ocelot_port->ocelot;
+	int port = priv->port.index;
+	u32 rew_op = 0;
+
+	if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op))
+		return NETDEV_TX_OK;
+
+	ocelot_fdma_inject_frame(ocelot, port, rew_op, skb, dev);
+
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t ocelot_port_xmit_inj(struct sk_buff *skb,
+					struct net_device *dev)
 {
 	struct ocelot_port_private *priv = netdev_priv(dev);
 	struct ocelot_port *ocelot_port = &priv->port;
@@ -579,24 +597,27 @@ static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
 	int port = priv->port.index;
 	u32 rew_op = 0;
 
-	if (!static_branch_unlikely(&ocelot_fdma_enabled) &&
-	    !ocelot_can_inject(ocelot, 0))
+	if (!ocelot_can_inject(ocelot, 0))
 		return NETDEV_TX_BUSY;
 
 	if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op))
 		return NETDEV_TX_OK;
 
-	if (static_branch_unlikely(&ocelot_fdma_enabled)) {
-		ocelot_fdma_inject_frame(ocelot, port, rew_op, skb, dev);
-	} else {
-		ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);
+	ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);
 
-		consume_skb(skb);
-	}
+	consume_skb(skb);
 
 	return NETDEV_TX_OK;
 }
 
+static netdev_tx_t ocelot_port_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	if (static_branch_unlikely(&ocelot_fdma_enabled))
+		return ocelot_port_xmit_fdma(skb, dev);
+
+	return ocelot_port_xmit_inj(skb, dev);
+}
+
 enum ocelot_action_type {
 	OCELOT_MACT_LEARN,
 	OCELOT_MACT_FORGET,
-- 
2.34.1



* [PATCH v3 3/3] net: mscc: ocelot: add missing lock protection in ocelot_port_xmit_inj()
From: Ziyi Guo @ 2026-02-08 22:56 UTC (permalink / raw)
  To: Vladimir Oltean, Claudiu Manoil, Alexandre Belloni
  Cc: UNGLinuxDriver, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel, Ziyi Guo
In-Reply-To: <20260208225602.1339325-1-n7l8m4@u.northwestern.edu>

ocelot_port_xmit_inj() calls ocelot_can_inject() and
ocelot_port_inject_frame() without holding the injection group lock.
Both functions contain a lockdep_assert_held() check for the injection
lock, and the other caller, felix_port_deferred_xmit(), correctly
acquires it via ocelot_lock_inj_grp() before calling these functions.

Add ocelot_lock_inj_grp()/ocelot_unlock_inj_grp() around the register
injection path to fix the missing lock protection. The FDMA path is not
affected as it uses its own locking mechanism.

Fixes: c5e12ac3beb0 ("net: mscc: ocelot: serialize access to the injection/extraction groups")
Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu>
---
 drivers/net/ethernet/mscc/ocelot_net.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index d6b0936beca2..9fd8ac7e875c 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -597,14 +597,22 @@ static netdev_tx_t ocelot_port_xmit_inj(struct sk_buff *skb,
 	int port = priv->port.index;
 	u32 rew_op = 0;
 
-	if (!ocelot_can_inject(ocelot, 0))
+	ocelot_lock_inj_grp(ocelot, 0);
+
+	if (!ocelot_can_inject(ocelot, 0)) {
+		ocelot_unlock_inj_grp(ocelot, 0);
 		return NETDEV_TX_BUSY;
+	}
 
-	if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op))
+	if (!ocelot_xmit_timestamp(ocelot, port, skb, &rew_op)) {
+		ocelot_unlock_inj_grp(ocelot, 0);
 		return NETDEV_TX_OK;
+	}
 
 	ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);
 
+	ocelot_unlock_inj_grp(ocelot, 0);
+
 	consume_skb(skb);
 
 	return NETDEV_TX_OK;
-- 
2.34.1



* [PATCH v4 0/6] landlock: UNIX connect() control by pathname and scope
From: Günther Noack @ 2026-02-08 23:10 UTC (permalink / raw)
  To: Mickaël Salaün, John Johansen, Paul Moore, James Morris,
	Serge E . Hallyn
  Cc: Günther Noack, linux-security-module, Tingmao Wang,
	Justin Suess, Samasth Norway Ananda, Matthieu Buffet,
	Mikhail Ivanov, konstantin.meskhidze, Demi Marie Obenour,
	Alyssa Ross, Jann Horn, Tahera Fahimi, Simon Horman, netdev,
	Alexander Viro, Christian Brauner

Hello!

This patch set introduces a filesystem-based Landlock restriction
mechanism for connecting to UNIX domain sockets (or addressing them
with sendmsg(2)).  It introduces the filesystem access right
LANDLOCK_ACCESS_FS_RESOLVE_UNIX.

For the connection-oriented SOCK_STREAM and SOCK_SEQPACKET type
sockets, the access right makes the connect(2) operation fail with
EACCES, if denied.

SOCK_DGRAM-type UNIX sockets can be used either with connect(2) or by
passing an explicit recipient address with every sendmsg(2)
invocation.  In the latter case, the Landlock check is done when an
explicit recipient address is passed to sendmsg(2) and can make
sendmsg(2) return EACCES.  When UNIX datagram sockets are connected
with connect(2), a fixed recipient address is associated with the
socket and the check happens during connect(2) and may return EACCES.

When LANDLOCK_ACCESS_FS_RESOLVE_UNIX is handled within a Landlock
domain, this domain will only allow connect(2) and sendmsg(2) to
server sockets that were created within the same domain.  Or, to
phrase it the other way around: Unless it is allow-listed with a
LANDLOCK_PATH_BENEATH rule, the newly created domain denies connect(2)
and sendmsg(2) actions that are directed *outwards* of that domain.
In that regard, LANDLOCK_ACCESS_FS_RESOLVE_UNIX has the same semantics
as one of the "scoped" access rights.

== Motivation

Currently, landlocked processes can connect to named UNIX sockets
through the BSD socket API described in unix(7), by invoking socket(2)
followed by connect(2) with a suitable struct sockaddr_un holding the
socket's filename.  This is a surprising gap in Landlock's sandboxing
capabilities for users (e.g. in [1]) and it can be used to escape a
sandbox when a Unix service offers command execution (various such
scenarios were listed by Tingmao Wang in [2]).

The original feature request is at [4].

== Alternatives and Related Work

=== Alternative: Use existing LSM hooks

We have carefully and seriously considered the use of existing LSM
hooks, but still came to the conclusion that a new LSM hook is better
suited in this case:

The existing hooks security_unix_stream_connect(),
security_unix_may_send() and security_socket_connect() do not give
access to the resolved filesystem path.

* Resolving the filesystem path in the struct sockaddr_un again within
  a Landlock hook would produce a TOCTOU race, so this is not an option.
* We would therefore need to wire through the resolved struct path
  from unix_find_bsd() to one of the existing LSM hooks which get
  called later.  This would be a more substantial change to af_unix.c.

The struct path that is available in the listening-side struct sock
can be read through the existing hooks, but using this information is
not an option: as the listening socket may have been bound from
within a different namespace, the path that was used for binding it
is, in the general case, not meaningful for a sandboxed process.  In
particular, it is not possible to use this path (or prefixes thereof)
when constructing a sandbox policy in the client-side process.

Paul Moore also chimed in in support of adding a new hook, with the
rationale that the simplest change to the LSM hook interface has
traditionally proven to be the most robust. [11]

More details are on the Github issue at [6] and on the LKML at [9].

In the discussion of the V2 review, started by Christian Brauner
[10], we further explored the approach of reusing the existing LSM
hooks, but still ended up leaning toward introducing a new hook, with
Paul Moore and me (gnoack) arguing for that option.

Further insights about the LSM hook were shared in the V3 review by
Tingmao Wang [12], who spotted additional requirements due to the two
approaches being merged into one patch set.  The summary of that
discussion is in [13].

=== Related work: Scope Control for Pathname Unix Sockets

The motivation for this patch is the same as in Tingmao Wang's patch
set for "scoped" control for pathname Unix sockets [2], originally
proposed in the Github feature request [5].

In [14], we have settled on the decision to merge the two patch sets
into this one, whose primary way of controlling connect(2) is
LANDLOCK_ACCESS_FS_RESOLVE_UNIX, but where this flag additionally has
the semantics of only restricting this unix(7) IPC *outwards* of the
created Landlock domain, in line with the logic that exists for the
existing "scoped" flags already.

By having LANDLOCK_ACCESS_FS_RESOLVE_UNIX implement "scoping"
semantics, we can avoid introducing two separate interacting flags for
now, but we retain the option of introducing
LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET at a later point in time, should
such a flag be needed to express additional rules.

== Credits

The feature was originally suggested by Jann Horn in [7].

Tingmao Wang and Demi Marie Obenour have taken the initiative to
revive this discussion again in [1], [4] and [5].

Tingmao Wang has sent the patch set for the scoped access control for
pathname Unix sockets [2] and has contributed substantial insights
during the code review, shaping the form of the LSM hook and agreeing
to merge the pathname and scoped-flag patch sets.

Justin Suess has sent the patch for the LSM hook in [8]; it is now
carried as part of this patch set.

Christian Brauner and Paul Moore have contributed to the design of the
new LSM hook, discussing the tradeoffs in [10].

Ryan Sullivan has started on an initial implementation and has brought
up relevant discussion points on the Github issue at [4].

As maintainer of Landlock, Mickaël Salaün has done the main review so
far and particularly pointed out ways in which the UNIX connect()
patch sets interact with each other and what we need to look for with
regards to UAPI consistency as Landlock evolves.

[1] https://lore.kernel.org/landlock/515ff0f4-2ab3-46de-8d1e-5c66a93c6ede@gmail.com/
[2] Tingmao Wang's "Implement scope control for pathname Unix sockets"
    https://lore.kernel.org/all/cover.1767115163.git.m@maowtm.org/
[3] https://lore.kernel.org/all/20251230.bcae69888454@gnoack.org/
[4] Github issue for FS-based control for named Unix sockets:
    https://github.com/landlock-lsm/linux/issues/36
[5] Github issue for scope-based restriction of named Unix sockets:
    https://github.com/landlock-lsm/linux/issues/51
[6] https://github.com/landlock-lsm/linux/issues/36#issuecomment-2950632277
[7] https://lore.kernel.org/linux-security-module/CAG48ez3NvVnonOqKH4oRwRqbSOLO0p9djBqgvxVwn6gtGQBPcw@mail.gmail.com/
[8] Patch for the LSM hook:
    https://lore.kernel.org/all/20251231213314.2979118-1-utilityemal77@gmail.com/
[9] https://lore.kernel.org/all/20260108.64bd7391e1ae@gnoack.org/
[10] https://lore.kernel.org/all/20260113-kerngesund-etage-86de4a21da24@brauner/
[11] https://lore.kernel.org/all/CAHC9VhQHZCe0LMx4xzSo-h1SWY489U4frKYnxu4YVrcJN3x7nA@mail.gmail.com/
[12] https://lore.kernel.org/all/e6b6b069-384c-4c45-a56b-fa54b26bc72a@maowtm.org/
[13] https://lore.kernel.org/all/aYMenaSmBkAsFowd@google.com/
[14] https://lore.kernel.org/all/20260205.Kiech3gupee1@digikod.net/

---

== Older versions of this patch set

V1: https://lore.kernel.org/all/20260101134102.25938-1-gnoack3000@gmail.com/
V2: https://lore.kernel.org/all/20260110143300.71048-2-gnoack3000@gmail.com/
V3: https://lore.kernel.org/all/20260119203457.97676-2-gnoack3000@gmail.com/

Changes in V4:

Since this version, this patch set subsumes the scoping semantics from
Tingmao Wang's "Scope Control" patch set [2], per discussion with
Tingmao Wang and Mickaël Salaün in [14] and in the thread leading up
to it.

Now, LANDLOCK_ACCESS_FS_RESOLVE_UNIX only restricts connect(2) and
sendmsg(2) *outwards* of the domain where it is handled, *with the
same semantics as a "scoped" flag*.

 * Implement a layer-mask based version of domain_is_scoped():
   unmask_scoped_access().  Rationale: domain_is_scoped() returns
   early, which we can't do in the layer masks based variant.  The two
   variants are similar enough.
 * LSM hook: Replace 'type' argument with 'sk' argument,
   per discussion in [12] and [13].
 * Bump ABI version to 9 (pessimistically assuming that we won't make
   it for 7.0)
 * Documentation fixes in header file and in Documentation/
 * selftests: more test variants, now also parameterizing whether the
   server socket gets created within the Landlock domain or before that
 * selftests: use EXPECT_EQ() for test cleanup

Changes in V3:
 * LSM hook: rename it to security_unix_find() (Justin Suess)
   (resolving the previously open question about the LSM hook name)
   Related discussions:
   https://lore.kernel.org/all/20260112.Wufar9coosoo@digikod.net/
   https://lore.kernel.org/all/CAHC9VhSRiHwLEWfFkQdPEwgB4AXKbXzw_+3u=9hPpvUTnu02Bg@mail.gmail.com/
 * Reunite the three UNIX resolving access rights back into one
   (resolving the previously open question about the access right
   structuring) Related discussion:
   https://lore.kernel.org/all/20260112.Wufar9coosoo@digikod.net/)
 * Sample tool: Add new UNIX lookup access rights to ACCESS_FILE

Changes in V2:
 * Send Justin Suess's LSM hook patch together with the Landlock
   implementation
 * LSM hook: Pass type and flags parameters to the hook, to make the
   access right more generally usable across LSMs, per suggestion from
   Paul Moore (Implemented by Justin)
 * Split the access right into the three types of UNIX domain sockets:
   SOCK_STREAM, SOCK_DGRAM and SOCK_SEQPACKET.
 * selftests: More exhaustive tests.
 * Removed a minor commit from V1 which adds a missing close(fd) to a
   test (it is already in the mic-next branch)

Günther Noack (5):
  landlock: Control pathname UNIX domain socket resolution by path
  samples/landlock: Add support for named UNIX domain socket
    restrictions
  landlock/selftests: Test named UNIX domain socket restrictions
  landlock: Document FS access right for pathname UNIX sockets
  landlock: Document design rationale for scoped access rights

Justin Suess (1):
  lsm: Add LSM hook security_unix_find

 Documentation/security/landlock.rst          |  38 ++
 Documentation/userspace-api/landlock.rst     |  16 +-
 include/linux/lsm_hook_defs.h                |   5 +
 include/linux/security.h                     |  11 +
 include/uapi/linux/landlock.h                |  10 +
 net/unix/af_unix.c                           |   9 +
 samples/landlock/sandboxer.c                 |  15 +-
 security/landlock/access.h                   |  11 +-
 security/landlock/audit.c                    |   1 +
 security/landlock/fs.c                       | 107 ++++-
 security/landlock/limits.h                   |   2 +-
 security/landlock/syscalls.c                 |   2 +-
 security/security.c                          |  20 +
 tools/testing/selftests/landlock/base_test.c |   2 +-
 tools/testing/selftests/landlock/fs_test.c   | 386 ++++++++++++++++++-
 15 files changed, 608 insertions(+), 27 deletions(-)

-- 
2.52.0



* [PATCH v4 1/6] lsm: Add LSM hook security_unix_find
From: Günther Noack @ 2026-02-08 23:10 UTC (permalink / raw)
  To: Mickaël Salaün, John Johansen, Paul Moore, James Morris,
	Serge E . Hallyn
  Cc: Günther Noack, Tingmao Wang, Justin Suess,
	linux-security-module, Samasth Norway Ananda, Matthieu Buffet,
	Mikhail Ivanov, konstantin.meskhidze, Demi Marie Obenour,
	Alyssa Ross, Jann Horn, Tahera Fahimi, Simon Horman, netdev,
	Alexander Viro, Christian Brauner
In-Reply-To: <20260208231017.114343-1-gnoack3000@gmail.com>

From: Justin Suess <utilityemal77@gmail.com>

Add an LSM hook, security_unix_find().

This hook is called to check the path of a named unix socket before a
connection is initiated. The peer socket may be inspected as well.

Why existing hooks are unsuitable:

Existing socket hooks, security_unix_stream_connect(),
security_unix_may_send(), and security_socket_connect() don't provide
TOCTOU-free / namespace independent access to the paths of sockets.

(1) We cannot resolve the path from the struct sockaddr in existing
hooks. Doing so would require a second path lookup, and a change in
the path between the two lookups would cause a TOCTOU bug.

(2) We cannot use the struct path from the listening socket, because it
may be bound to a path in a different namespace than the caller,
resulting in a path that cannot be referenced at policy creation time.

Cc: Günther Noack <gnoack3000@gmail.com>
Cc: Tingmao Wang <m@maowtm.org>
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
---
 include/linux/lsm_hook_defs.h |  5 +++++
 include/linux/security.h      | 11 +++++++++++
 net/unix/af_unix.c            |  9 +++++++++
 security/security.c           | 20 ++++++++++++++++++++
 4 files changed, 45 insertions(+)

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 8c42b4bde09c..7a0fd3dbfa29 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -317,6 +317,11 @@ LSM_HOOK(int, 0, post_notification, const struct cred *w_cred,
 LSM_HOOK(int, 0, watch_key, struct key *key)
 #endif /* CONFIG_SECURITY && CONFIG_KEY_NOTIFICATIONS */
 
+#if defined(CONFIG_SECURITY_NETWORK) && defined(CONFIG_SECURITY_PATH)
+LSM_HOOK(int, 0, unix_find, const struct path *path, struct sock *other,
+	 int flags)
+#endif /* CONFIG_SECURITY_NETWORK && CONFIG_SECURITY_PATH */
+
 #ifdef CONFIG_SECURITY_NETWORK
 LSM_HOOK(int, 0, unix_stream_connect, struct sock *sock, struct sock *other,
 	 struct sock *newsk)
diff --git a/include/linux/security.h b/include/linux/security.h
index 83a646d72f6f..99a33d8eb28d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1931,6 +1931,17 @@ static inline int security_mptcp_add_subflow(struct sock *sk, struct sock *ssk)
 }
 #endif	/* CONFIG_SECURITY_NETWORK */
 
+#if defined(CONFIG_SECURITY_NETWORK) && defined(CONFIG_SECURITY_PATH)
+
+int security_unix_find(const struct path *path, struct sock *other, int flags);
+
+#else /* CONFIG_SECURITY_NETWORK && CONFIG_SECURITY_PATH */
+static inline int security_unix_find(const struct path *path, struct sock *other, int flags)
+{
+	return 0;
+}
+#endif /* CONFIG_SECURITY_NETWORK && CONFIG_SECURITY_PATH */
+
 #ifdef CONFIG_SECURITY_INFINIBAND
 int security_ib_pkey_access(void *sec, u64 subnet_prefix, u16 pkey);
 int security_ib_endport_manage_subnet(void *sec, const char *name, u8 port_num);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index d0511225799b..db9d279b3883 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1226,10 +1226,19 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 	if (!S_ISSOCK(inode->i_mode))
 		goto path_put;
 
+	err = -ECONNREFUSED;
 	sk = unix_find_socket_byinode(inode);
 	if (!sk)
 		goto path_put;
 
+	/*
+	 * We call the hook because we know that the inode is a socket
+	 * and we hold a valid reference to it via the path.
+	 */
+	err = security_unix_find(&path, sk, flags);
+	if (err)
+		goto sock_put;
+
 	err = -EPROTOTYPE;
 	if (sk->sk_type == type)
 		touch_atime(&path);
diff --git a/security/security.c b/security/security.c
index 31a688650601..9e9515955098 100644
--- a/security/security.c
+++ b/security/security.c
@@ -4731,6 +4731,26 @@ int security_mptcp_add_subflow(struct sock *sk, struct sock *ssk)
 
 #endif	/* CONFIG_SECURITY_NETWORK */
 
+#if defined(CONFIG_SECURITY_NETWORK) && defined(CONFIG_SECURITY_PATH)
+/**
+ * security_unix_find() - Check if a named AF_UNIX socket can connect
+ * @path: path of the socket being connected to
+ * @other: peer sock
+ * @flags: flags associated with the socket
+ *
+ * This hook is called to check permissions before connecting to a named
+ * AF_UNIX socket.
+ *
+ * Return: Returns 0 if permission is granted.
+ */
+int security_unix_find(const struct path *path, struct sock *other, int flags)
+{
+	return call_int_hook(unix_find, path, other, flags);
+}
+EXPORT_SYMBOL(security_unix_find);
+
+#endif	/* CONFIG_SECURITY_NETWORK && CONFIG_SECURITY_PATH */
+
 #ifdef CONFIG_SECURITY_INFINIBAND
 /**
  * security_ib_pkey_access() - Check if access to an IB pkey is allowed
-- 
2.52.0



* [PATCH] net: stmmac: imx: Disable EEE
From: Laurent Pinchart @ 2026-02-08 23:29 UTC (permalink / raw)
  To: netdev, imx
  Cc: Andrew Lunn, Clark Wang, David S. Miller, Eric Dumazet,
	Fabio Estevam, Fabio Estevam, Francesco Dolcini, Frank Li,
	Heiko Schocher, Jakub Kicinski, Joakim Zhang, Joy Zou,
	Kieran Bingham, Marcel Ziswiler, Marco Felsch, Martyn Welch,
	Mathieu Othacehe, Paolo Abeni, Pengutronix Kernel Team,
	Richard Hu, Russell King (Oracle), Sascha Hauer, Shawn Guo,
	Shenwei Wang, Stefan Klug, Stefano Radaelli, Wei Fang,
	Xiaoliang Yang, linux-arm-kernel

The i.MX8MP suffers from an interrupt storm related to the stmmac and
EEE. A long and tedious analysis ([1]) concluded that the SoC wires the
stmmac lpi_intr_o signal to an OR gate along with the main dwmac
interrupts, which causes an interrupt storm for two reasons.

First, there's a race condition due to the interrupt deassertion being
synchronous to the RX clock domain:

- When the PHY exits LPI mode, it restarts generating the RX clock
  (clk_rx_i input signal to the GMAC).
- The MAC detects exit from LPI, and asserts lpi_intr_o. This triggers
  the ENET_EQOS interrupt.
- Before the CPU has time to process the interrupt, the PHY enters LPI
  mode again, and stops generating the RX clock.
- The CPU processes the interrupt and reads the GMAC4_LPI_CTRL_STATUS
  registers. This does not clear lpi_intr_o as there's no clk_rx_i.

An attempt was made to fix the issue by not stopping RX_CLK in the Rx
LPI state ([2]). This alleviates the symptoms but doesn't fix the root
cause.
Since lpi_intr_o takes four RX_CLK cycles to clear, an interrupt storm
can still occur during that window. In 1000T mode this is harder to
notice, but slower receive clocks cause hundreds to thousands of
spurious interrupts.

Fix the issue by disabling EEE completely on i.MX8MP.

[1] https://lore.kernel.org/all/20251026122905.29028-1-laurent.pinchart@ideasonboard.com/
[2] https://lore.kernel.org/all/20251123053518.8478-1-laurent.pinchart@ideasonboard.com/

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
---
This patch depends on https://lore.kernel.org/r/20251026122905.29028-1-laurent.pinchart@ideasonboard.com
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c
index db288fbd5a4d..e8e6f325d92d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c
@@ -317,8 +317,7 @@ static int imx_dwmac_probe(struct platform_device *pdev)
 		return ret;
 	}
 
-	if (data->flags & STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY)
-		plat_dat->flags |= STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY;
+	plat_dat->flags |= data->flags;
 
 	/* Default TX Q0 to use TSO and rest TXQ for TBS */
 	for (int i = 1; i < plat_dat->tx_queues_to_use; i++)
@@ -355,7 +354,8 @@ static struct imx_dwmac_ops imx8mp_dwmac_data = {
 	.addr_width = 34,
 	.mac_rgmii_txclk_auto_adj = false,
 	.set_intf_mode = imx8mp_set_intf_mode,
-	.flags = STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY,
+	.flags = STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY
+	       | STMMAC_FLAG_EEE_DISABLE,
 };
 
 static struct imx_dwmac_ops imx8dxl_dwmac_data = {

base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
prerequisite-patch-id: a3c3f8b08fd66ee3ccce632aad3f4a3c21c92718
-- 
Regards,

Laurent Pinchart



* Re: [PATCH RFC net-next] net: stmmac: provide flag to disable EEE
From: Laurent Pinchart @ 2026-02-08 23:30 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Ovidiu Panait, Alexandre Torgue, Andrew Lunn, Andrew Lunn,
	Clark Wang, Daniel Scally, David S. Miller, Emanuele Ghidoli,
	Eric Dumazet, Fabio Estevam, Heiner Kallweit, imx, Jakub Kicinski,
	Kieran Bingham, linux-arm-kernel, linux-stm32, Maxime Coquelin,
	netdev, Oleksij Rempel, Paolo Abeni, Pengutronix Kernel Team,
	Rob Herring, Sascha Hauer, Shawn Guo, Stefan Klug, Wei Fang
In-Reply-To: <20260203231832.GB133801@killaraus>

On Wed, Feb 04, 2026 at 01:18:33AM +0200, Laurent Pinchart wrote:
> On Tue, Feb 03, 2026 at 12:17:06AM +0000, Russell King (Oracle) wrote:
> > On Mon, Feb 02, 2026 at 10:23:45PM +0000, Russell King (Oracle) wrote:
> > > On Mon, Feb 02, 2026 at 08:54:52PM +0200, Ovidiu Panait wrote:
> > > > If not, maybe this patch could be merged to add the flag that disables
> > > > EEE and I will just send a patch to disable EEE on our platforms as well.
> > > 
> > > We still need the flag to disable EEE for platforms where lpi_intr_o is
> > > logically OR'd with the other interrupts, so there's no way to ignore
> > > its persistent assertion.
> > 
> > I'll also state that we need both patches, but there's no point pushing
> > my original patch (to allow EEE to be disabled) unless Laurent is going
> > to also submit a patch to make use of the flag - the EEE disable and
> > Laurent's patch needs to be part of a series. We don't merge stuff that
> > adds facilities that have no users, because that's been proven time and
> > time again to be a recipe for accumulating cruft.
> 
> Sorry for having dropped the ball on this. Your patch arrived just when
> I travelled to Japan for a month, and then it got buried in my inbox.
> Now that it has resurfaced, and that an issue that prevented i.MX8MP
> from booting with v6.19-rc has been fixed, I'll rebase my kernel and
> finally fix that problem.

Done. https://lore.kernel.org/r/20260208232931.2272237-1-laurent.pinchart@ideasonboard.com

> Once again, thank you for all your help with this issue, you have been
> extremely helpful and I'm very grateful.
> 
> > So, at the moment, "net: stmmac: provide flag to disable EEE" ain't
> > going anywhere until there's a patch that makes use of the new flag.

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* [ANN] net-next is CLOSED
From: Jakub Kicinski @ 2026-02-08 23:40 UTC (permalink / raw)
  To: netdev

Hi!

Linus tagged final v6.19 so net-next is closed. We will look thru
what's on the list, but new net-next postings have to wait until
Feb 23rd.

^ permalink raw reply

* [syzbot] [kernel?] INFO: task hung in rescuer_thread (3)
From: syzbot @ 2026-02-09  0:19 UTC (permalink / raw)
  To: bp, dave.hansen, hpa, linux-kernel, mingo, netdev, syzkaller-bugs,
	tglx, x86

Hello,

syzbot found the following issue on:

HEAD commit:    9a9424c756fe net: usb: sr9700: remove code to drive nonexi..
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=17040a52580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2c36acc86fd56a9d
dashboard link: https://syzkaller.appspot.com/bug?extid=1d6360305e34d30ab396
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=115d0a5a580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/aca10cd84bc5/disk-9a9424c7.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/65bf8ad42e4f/vmlinux-9a9424c7.xz
kernel image: https://storage.googleapis.com/syzbot-assets/a69a5e1088ef/bzImage-9a9424c7.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+1d6360305e34d30ab396@syzkaller.appspotmail.com

INFO: task kworker/R-wg-cr:5992 blocked for more than 149 seconds.
      Not tainted syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/R-wg-cr state:D stack:25984 pid:5992  tgid:5992  ppid:2      task_flags:0x4208060 flags:0x00080000
Workqueue:  0x0 (wg-crypt-wg1)
Call Trace:
 <TASK>
 context_switch kernel/sched/core.c:5260 [inline]
 __schedule+0x14ea/0x5050 kernel/sched/core.c:6867
 __schedule_loop kernel/sched/core.c:6949 [inline]
 schedule+0x164/0x360 kernel/sched/core.c:6964
 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7021
 __mutex_lock_common kernel/locking/mutex.c:692 [inline]
 __mutex_lock+0x7fe/0x1300 kernel/locking/mutex.c:776
 worker_detach_from_pool kernel/workqueue.c:2730 [inline]
 rescuer_thread+0x593/0x1390 kernel/workqueue.c:3560
 kthread+0x726/0x8b0 kernel/kthread.c:463
 ret_from_fork+0x51b/0xa40 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
 </TASK>

Showing all locks held in the system:
1 lock held by kworker/R-rcu_g/4:
 #0: ffffffff8e3f8c88 (wq_pool_attach_mutex){+.+.}-{4:4}, at: worker_detach_from_pool kernel/workqueue.c:2730 [inline]
 #0: ffffffff8e3f8c88 (wq_pool_attach_mutex){+.+.}-{4:4}, at: rescuer_thread+0x593/0x1390 kernel/workqueue.c:3560


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* Re: [PATCH] selftests/net: add test for IPv4-in-IPv6 tunneling
From: Hangbin Liu @ 2026-02-09  0:56 UTC (permalink / raw)
  To: Linus Heckemann
  Cc: edumazet, davem, eric.dumazet, horms, kuba, morikw2, netdev,
	pabeni, syzbot+d4dda070f833dc5dc89a
In-Reply-To: <20260208144604.1550582-1-git@sphalerite.org>

Hi Linus
On Sun, Feb 08, 2026 at 03:46:04PM +0100, Linus Heckemann wrote:
> 81c734dae203757fb3c9eee6f9896386940776bd was fine in and of itself, but

Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")'

> its backport to 6.12 (and 6.6) broke IPv4-in-IPv6 tunneling, see [1].
> This adds a self-test for basic IPv4-in-IPv6 functionality.
> 
> [1]: https://lore.kernel.org/all/CAA2RiuSnH_2xc+-W6EnFEG00XjS-dszMq61JEvRjcGS31CBw=g@mail.gmail.com/
> ---
>  tools/testing/selftests/net/Makefile      |  1 +
>  tools/testing/selftests/net/ip6_tunnel.sh | 41 +++++++++++++++++++++++
>  2 files changed, 42 insertions(+)
>  create mode 100644 tools/testing/selftests/net/ip6_tunnel.sh
> 
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index 45c4ea381bc36..5037a344ad826 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -43,6 +43,7 @@ TEST_PROGS := \
>  	io_uring_zerocopy_tx.sh \
>  	ioam6.sh \
>  	ip6_gre_headroom.sh \
> +	ip6_tunnel.sh \
>  	ip_defrag.sh \
>  	ip_local_port_range.sh \
>  	ipv6_flowlabel.sh \
> diff --git a/tools/testing/selftests/net/ip6_tunnel.sh b/tools/testing/selftests/net/ip6_tunnel.sh
> new file mode 100644
> index 0000000000000..366f4c06cd6a3
> --- /dev/null
> +++ b/tools/testing/selftests/net/ip6_tunnel.sh
> @@ -0,0 +1,41 @@
> +#!/bin/bash
> +# Test that IPv4-over-IPv6 tunneling works.

Maybe test IPv6-over-IPv6 as well?

> +
> +set -e
> +
> +setup_prepare() {
> +  ip link add transport1 type veth peer name transport2
> +
> +  ip netns add ns1

net lib has a helper setup_ns()

> +  ip link set transport1 netns ns1
> +  ip netns exec ns1 bash <<EOF
> +  set -e
> +  ip address add 2001:db8::1/64 dev transport1 nodad
> +  ip link set transport1 up
> +  ip link add link transport1 name tunnel1 type ip6tnl mode ipip6 local 2001:db8::1 remote 2001:db8::2
> +  ip address add 172.0.0.1/32 peer 172.0.0.2/32 dev tunnel1
> +  ip link set tunnel1 up
> +EOF

You can use ip -n $ns_name address/link rather than putting all the cmds
in an inner bash.


> +
> +  ip netns add ns2
> +  ip link set transport2 netns ns2
> +  ip netns exec ns2 bash <<EOF
> +  set -e
> +  ip address add 2001:db8::2/64 dev transport2 nodad
> +  ip link set transport2 up
> +  ip link add link transport2 name tunnel2 type ip6tnl mode ipip6 local 2001:db8::2 remote 2001:db8::1
> +  ip address add 172.0.0.2/32 peer 172.0.0.1/32 dev tunnel2
> +  ip link set tunnel2 up
> +EOF

Same here

> +}
> +
> +cleanup() {
> +  ip netns delete ns1
> +  ip netns delete ns2

At the same time the cleanup_all_ns() helper could be used.

> +  # in case the namespaces haven't been set up yet
> +  ip link delete transport1
> +}
> +
> +trap cleanup EXIT
> +setup_prepare
> +ip netns exec ns1 ping -W1 -c1 172.0.0.2
> -- 
> 2.52.0
> 

Thanks
Hangbin

^ permalink raw reply

* Re: [PATCH] net: qrtr: Expand control port access to root
From: Jijie Shao @ 2026-02-09  1:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: shaojijie, Vishnu Santhosh, Manivannan Sadhasivam,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	linux-arm-msm, netdev, linux-kernel, bjorn.andersson, chris.lew,
	Deepak Kumar Singh
In-Reply-To: <20260206183436.3291c742@kernel.org>


on 2026/2/7 10:34, Jakub Kicinski wrote:
> On Fri, 6 Feb 2026 11:59:44 +0800 Jijie Shao wrote:
>> Sorry, I noticed that this patch has several check failures.
>> You may want to pay attention to this:
>> https://patchwork.kernel.org/project/netdevbpf/patch/20260205-qrtr-control-port-access-permission-v1-1-e900039e92d5@oss.qualcomm.com/
>>
>> 1.Single patches do not need cover letters; Target tree name not specified in the subject
>> 2.WARNING: line length of 83 exceeds 80 columns
>> 3.AI review found issues
> This is not public CI, the maintainers will point out the issues
> in the code if the failing checks are relevant.
>
> If you'd like to help with code reviews please focus on that,
> reviewing the code. We do not need help with trivial process matters.

ok,


^ permalink raw reply

* Re: RFC: stmmac RSS support
From: Qingfang Deng @ 2026-02-09  1:35 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: netdev, Andrew Lunn, Jose Abreu, Maxime Chevallier,
	Thierry Reding, Paritosh Dixit
In-Reply-To: <aYd4BkAeNW6d0iIC@shell.armlinux.org.uk>

On Sat, 7 Feb 2026 17:36:06 +0000, Russell King (Oracle) wrote:
> Hi,
> 
> While looking at the possibilities of minimising the memory that
> struct plat_stmmacenet_data consumes (880 bytes presently on
> aarch64), I came across the RSS feature in stmmac.
> 
> In commit 76067459c686 ("net: stmmac: Implement RSS and enable it in
> XGMAC core"), support was added for RSS to the core stmmac driver for
> the dwxgmac2 core. I can only find socfpga and tegra as the two
> platform glues that use the dwxgmac2 core.
> 
> RSS support is only enabled when both the core supports it, and the
> platform glue sets priv->plat->rss_en.
> 
> However, the stmmac-related results of grepping for this member do not
> show any platform glues which set this flag:
> 
> $ git grep '\<rss_en\>'
> Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst:        int rss_en;
> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:      if (!priv->dma_cap.rssen || !priv->plat->rss_en) {
> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:      if (priv->dma_cap.rssen && priv->plat->rss_en)
> 
> So, as no one has decided to enable this feature during the intervening
> six years, is there any benefit to having this code in the mainline
> kernel, or should this feature be dropped?
> 
> If a user appears, the code will remain in git history and could be
> restored.
> 
> Thoughts?

As someone who has worked on both DWC GMAC and DWC XGMAC, I think the
whole XGMAC code needs to be uncoupled from the stmmac driver. They're
completely different IP cores.

> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> 

^ permalink raw reply

* Re: [RFC 0/3] vhost-net: netfilter support for RX path
From: Jason Wang @ 2026-02-09  1:46 UTC (permalink / raw)
  To: Cindy Lu; +Cc: mst, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20260208143441.2177372-1-lulu@redhat.com>

On Sun, Feb 8, 2026 at 10:34 PM Cindy Lu <lulu@redhat.com> wrote:
>
> This series adds a minimal vhost-net filter support for RX.
> It introduces a UAPI for VHOST_NET_SET_FILTER and a simple
> SOCK_SEQPACKET message header. The kernel side keeps a filter
> socket reference and routes RX packets to userspace when
> it is enabled.

I wonder if a packet socket is sufficient or is this for macvtap as well?

Thanks

>
> Tested
> - vhost=on  and vhost=off
>
> Cindy Lu (3):
>   uapi: vhost: add vhost-net netfilter offload API
>   vhost/net: add netfilter socket support
>   vhost/net: add RX netfilter offload path
>
>  drivers/vhost/net.c        | 338 +++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vhost.h |  20 +++
>  2 files changed, 358 insertions(+)
>
> --
> 2.52.0
>


^ permalink raw reply

* [PATCH v4] bpf: test_run: Fix the null pointer dereference issue in bpf_lwt_xmit_push_encap
From: Feng Yang @ 2026-02-09  1:51 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms, ast, daniel, andrii
  Cc: bpf, netdev, linux-kernel

From: Feng Yang <yangfeng@kylinos.cn>

The bpf_lwt_xmit_push_encap helper needs to access skb_dst(skb)->dev to
calculate the needed headroom:

	err = skb_cow_head(skb,
			   len + LL_RESERVED_SPACE(skb_dst(skb)->dev));

But skb->_skb_refdst may not be initialized when the skb is set up by
bpf_prog_test_run_skb function. Executing bpf_lwt_push_ip_encap function
in this scenario will trigger null pointer dereference, causing a kernel
crash as Yinhao reported:

[  105.186365] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  105.186382] #PF: supervisor read access in kernel mode
[  105.186388] #PF: error_code(0x0000) - not-present page
[  105.186393] PGD 121d3d067 P4D 121d3d067 PUD 106c83067 PMD 0
[  105.186404] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  105.186412] CPU: 3 PID: 3250 Comm: poc Kdump: loaded Not tainted 6.19.0-rc5 #1
[  105.186423] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[  105.186427] RIP: 0010:bpf_lwt_push_ip_encap+0x1eb/0x520
[  105.186443] Code: 0f 84 de 01 00 00 0f b7 4a 04 66 85 c9 0f 85 47 01 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc 48 8b 73 58 48 83 e6 fe <48> 8b 36 0f b7 be ec 00 00 00 0f b7 b6 e6 00 00 00 01 fe 83 e6 f0
[  105.186449] RSP: 0018:ffffbb0e0387bc50 EFLAGS: 00010246
[  105.186455] RAX: 000000000000004e RBX: ffff94c74e036500 RCX: ffff94c74874da00
[  105.186460] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff94c74e036500
[  105.186463] RBP: 0000000000000001 R08: 0000000000000002 R09: 0000000000000000
[  105.186467] R10: ffffbb0e0387bd50 R11: 0000000000000000 R12: ffffbb0e0387bc98
[  105.186471] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000002
[  105.186484] FS:  00007f166aa4d680(0000) GS:ffff94c8b7780000(0000) knlGS:0000000000000000
[  105.186490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  105.186494] CR2: 0000000000000000 CR3: 000000015eade001 CR4: 0000000000770ee0
[  105.186499] PKRU: 55555554
[  105.186502] Call Trace:
[  105.186507]  <TASK>
[  105.186513]  bpf_lwt_xmit_push_encap+0x2b/0x40
[  105.186522]  bpf_prog_a75eaad51e517912+0x41/0x49
[  105.186536]  ? kvm_clock_get_cycles+0x18/0x30
[  105.186547]  ? ktime_get+0x3c/0xa0
[  105.186554]  bpf_test_run+0x195/0x320
[  105.186563]  ? bpf_test_run+0x10f/0x320
[  105.186579]  bpf_prog_test_run_skb+0x2f5/0x4f0
[  105.186590]  __sys_bpf+0x69c/0xa40
[  105.186603]  __x64_sys_bpf+0x1e/0x30
[  105.186611]  do_syscall_64+0x59/0x110
[  105.186620]  entry_SYSCALL_64_after_hwframe+0x76/0xe0
[  105.186649] RIP: 0033:0x7f166a97455d

Resolve the issue by temporarily setting skb->_skb_refdst before calling
bpf_test_run().

Fixes: 52f278774e79 ("bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap")
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Closes: https://groups.google.com/g/hust-os-kernel-patches/c/8-a0kPpBW2s
Signed-off-by: Yun Lu <luyun@kylinos.cn>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Tested-by: syzbot@syzkaller.appspotmail.com
---
Changes in v4:
- add rcu lock
- Link to v3: https://lore.kernel.org/all/20260206055113.63476-1-yangfeng59949@163.com/
Changes in v3:
- use dst_init
- Link to v2: https://lore.kernel.org/all/20260205092227.126665-1-yangfeng59949@163.com/
Changes in v2:
- Link to v1: https://lore.kernel.org/all/20260127084520.13890-1-luyun_611@163.com/ 
---
 net/bpf/test_run.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 178c4738e63b..79b60ab9e682 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -989,6 +989,7 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 	u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	struct net *net = current->nsproxy->net_ns;
 	struct net_device *dev = net->loopback_dev;
+	struct dst_entry bpf_test_run_lwt_xmit_dst;
 	u32 headroom = NET_SKB_PAD + NET_IP_ALIGN;
 	u32 linear_sz = kattr->test.data_size_in;
 	u32 repeat = kattr->test.repeat;
@@ -1156,6 +1157,14 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 		skb->ip_summed = CHECKSUM_COMPLETE;
 	}
 
+	if (prog->type == BPF_PROG_TYPE_LWT_XMIT) {
+		dst_init(&bpf_test_run_lwt_xmit_dst, NULL, NULL,
+			 DST_OBSOLETE_NONE, DST_NOCOUNT);
+		bpf_test_run_lwt_xmit_dst.dev = dev;
+		rcu_read_lock();
+		skb_dst_set_noref(skb, &bpf_test_run_lwt_xmit_dst);
+		rcu_read_unlock();
+	}
 	ret = bpf_test_run(prog, skb, repeat, &retval, &duration, false);
 	if (ret)
 		goto out;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH v6] virtio_net: add page_pool support for buffer allocation
From: Xuan Zhuo @ 2026-02-09  2:00 UTC (permalink / raw)
  To: Vishwanath Seshagiri
  Cc: Eugenio Pérez, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Wei, Matteo Croce,
	Ilias Apalodimas, netdev, virtualization, linux-kernel,
	kernel-team, Michael S . Tsirkin, Jason Wang
In-Reply-To: <20260208175410.1910001-1-vishs@meta.com>

On Sun, 8 Feb 2026 09:54:10 -0800, Vishwanath Seshagiri <vishs@meta.com> wrote:
> Use page_pool for RX buffer allocation in mergeable and small buffer
> modes to enable page recycling and avoid repeated page allocator calls.
> skb_mark_for_recycle() enables page reuse in the network stack.
>
> Big packets mode is unchanged because it uses page->private for linked
> list chaining of multiple pages per buffer, which conflicts with
> page_pool's internal use of page->private.
>
> Implement conditional DMA premapping using virtqueue_dma_dev():
> - When non-NULL (vhost, virtio-pci): use PP_FLAG_DMA_MAP with page_pool
>   handling DMA mapping, submit via virtqueue_add_inbuf_premapped()
> - When NULL (VDUSE, direct physical): page_pool handles allocation only,
>   submit via virtqueue_add_inbuf_ctx()
>
> This preserves the DMA premapping optimization from commit 31f3cd4e5756b
> ("virtio-net: rq submits premapped per-buffer") while adding page_pool
> support as a prerequisite for future zero-copy features (devmem TCP,
> io_uring ZCRX).
>
> Page pools are created in probe and destroyed in remove (not open/close),
> following existing driver behavior where RX buffers remain in virtqueues
> across interface state changes.
>
> Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
> ---
> Changes in v6:
> - Drop page_pool_frag_offset_add() helper and switch to page_pool_alloc_va();
>   page_pool_alloc_netmem() already handles internal fragmentation internally
>   (Jakub Kicinski)
> - v5:
>   https://lore.kernel.org/virtualization/20260206002715.1885869-1-vishs@meta.com/
>
> Benchmark results:
>
> Configuration: pktgen TX -> tap -> vhost-net | virtio-net RX -> XDP_DROP
>
> Small packets (64 bytes, mrg_rxbuf=off):
>   1Q:  853,493 -> 868,923 pps  (+1.8%)
>   2Q: 1,655,793 -> 1,696,707 pps (+2.5%)
>   4Q: 3,143,375 -> 3,302,511 pps (+5.1%)
>   8Q: 6,082,590 -> 6,156,894 pps (+1.2%)
>
> Mergeable RX (64 bytes):
>   1Q:   766,168 ->   814,493 pps  (+6.3%)
>   2Q: 1,384,871 -> 1,670,639 pps (+20.6%)
>   4Q: 2,773,081 -> 3,080,574 pps (+11.1%)
>   8Q: 5,600,615 -> 6,043,891 pps  (+7.9%)
>
> Mergeable RX (1500 bytes):
>   1Q:   741,579 ->   785,442 pps  (+5.9%)
>   2Q: 1,310,043 -> 1,534,554 pps (+17.1%)
>   4Q: 2,748,700 -> 2,890,582 pps  (+5.2%)
>   8Q: 5,348,589 -> 5,618,664 pps  (+5.0%)
>
>  drivers/net/Kconfig      |   1 +
>  drivers/net/virtio_net.c | 434 +++++++++++++++++++--------------------
>  2 files changed, 217 insertions(+), 218 deletions(-)
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index ac12eaf11755..f1e6b6b0a86f 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -450,6 +450,7 @@ config VIRTIO_NET
>  	depends on VIRTIO
>  	select NET_FAILOVER
>  	select DIMLIB
> +	select PAGE_POOL
>  	help
>  	  This is the virtual network driver for virtio.  It can be used with
>  	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index db88dcaefb20..5055df56e4a7 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <net/netdev_rx_queue.h>
>  #include <net/netdev_queues.h>
>  #include <net/xdp_sock_drv.h>
> +#include <net/page_pool/helpers.h>
>
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -290,14 +291,6 @@ struct virtnet_interrupt_coalesce {
>  	u32 max_usecs;
>  };
>
> -/* The dma information of pages allocated at a time. */
> -struct virtnet_rq_dma {
> -	dma_addr_t addr;
> -	u32 ref;
> -	u16 len;
> -	u16 need_sync;
> -};
> -
>  /* Internal representation of a send virtqueue */
>  struct send_queue {
>  	/* Virtqueue associated with this send _queue */
> @@ -356,8 +349,10 @@ struct receive_queue {
>  	/* Average packet length for mergeable receive buffers. */
>  	struct ewma_pkt_len mrg_avg_pkt_len;
>
> -	/* Page frag for packet buffer allocation. */
> -	struct page_frag alloc_frag;
> +	struct page_pool *page_pool;
> +
> +	/* True if page_pool handles DMA mapping via PP_FLAG_DMA_MAP */
> +	bool use_page_pool_dma;
>
>  	/* RX: fragments + linear part + virtio header */
>  	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> @@ -370,9 +365,6 @@ struct receive_queue {
>
>  	struct xdp_rxq_info xdp_rxq;
>
> -	/* Record the last dma info to free after new pages is allocated. */
> -	struct virtnet_rq_dma *last_dma;
> -
>  	struct xsk_buff_pool *xsk_pool;
>
>  	/* xdp rxq used by xsk */
> @@ -521,11 +513,13 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
>  			       struct virtnet_rq_stats *stats);
>  static void virtnet_receive_done(struct virtnet_info *vi, struct receive_queue *rq,
>  				 struct sk_buff *skb, u8 flags);
> -static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
> +static struct sk_buff *virtnet_skb_append_frag(struct receive_queue *rq,
> +					       struct sk_buff *head_skb,
>  					       struct sk_buff *curr_skb,
>  					       struct page *page, void *buf,
>  					       int len, int truesize);
>  static void virtnet_xsk_completed(struct send_queue *sq, int num);
> +static void free_unused_bufs(struct virtnet_info *vi);
>
>  enum virtnet_xmit_type {
>  	VIRTNET_XMIT_TYPE_SKB,
> @@ -706,15 +700,24 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>
> +static void virtnet_put_page(struct receive_queue *rq, struct page *page,
> +			     bool allow_direct)
> +{
> +	if (page_pool_page_is_pp(page))
> +		page_pool_put_page(rq->page_pool, page, -1, allow_direct);
> +	else
> +		put_page(page);
> +}

Why do we need this?
At each call site, shouldn't we already know which one should be used?


> +
>  static void virtnet_rq_free_buf(struct virtnet_info *vi,
>  				struct receive_queue *rq, void *buf)
>  {
>  	if (vi->mergeable_rx_bufs)
> -		put_page(virt_to_head_page(buf));
> +		virtnet_put_page(rq, virt_to_head_page(buf), false);
>  	else if (vi->big_packets)
>  		give_pages(rq, buf);
>  	else
> -		put_page(virt_to_head_page(buf));
> +		virtnet_put_page(rq, virt_to_head_page(buf), false);
>  }
>
>  static void enable_rx_mode_work(struct virtnet_info *vi)
> @@ -876,10 +879,16 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  		skb = virtnet_build_skb(buf, truesize, p - buf, len);
>  		if (unlikely(!skb))
>  			return NULL;
> +		/* Big packets mode chains pages via page->private, which is
> +		 * incompatible with the way page_pool uses page->private.
> +		 * Currently, big packets mode doesn't use page pools.
> +		 */
> +		if (vi->big_packets && !vi->mergeable_rx_bufs) {
> +			page = (struct page *)page->private;
> +			if (page)
> +				give_pages(rq, page);
> +		}
>
> -		page = (struct page *)page->private;
> -		if (page)
> -			give_pages(rq, page);
>  		goto ok;
>  	}
>
> @@ -925,133 +934,18 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	hdr = skb_vnet_common_hdr(skb);
>  	memcpy(hdr, hdr_p, hdr_len);
>  	if (page_to_free)
> -		put_page(page_to_free);
> +		virtnet_put_page(rq, page_to_free, true);
>
>  	return skb;
>  }
>
> -static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len)
> -{
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> -	struct page *page = virt_to_head_page(buf);
> -	struct virtnet_rq_dma *dma;
> -	void *head;
> -	int offset;
> -
> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> -
> -	head = page_address(page);
> -
> -	dma = head;
> -
> -	--dma->ref;
> -
> -	if (dma->need_sync && len) {
> -		offset = buf - (head + sizeof(*dma));
> -
> -		virtqueue_map_sync_single_range_for_cpu(rq->vq, dma->addr,
> -							offset, len,
> -							DMA_FROM_DEVICE);
> -	}
> -
> -	if (dma->ref)
> -		return;
> -
> -	virtqueue_unmap_single_attrs(rq->vq, dma->addr, dma->len,
> -				     DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
> -	put_page(page);
> -}
> -
>  static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx)
>  {
>  	struct virtnet_info *vi = rq->vq->vdev->priv;
> -	void *buf;
>
>  	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
>
> -	buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
> -	if (buf)
> -		virtnet_rq_unmap(rq, buf, *len);
> -
> -	return buf;
> -}
> -
> -static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> -{
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> -	struct virtnet_rq_dma *dma;
> -	dma_addr_t addr;
> -	u32 offset;
> -	void *head;
> -
> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> -
> -	head = page_address(rq->alloc_frag.page);
> -
> -	offset = buf - head;
> -
> -	dma = head;
> -
> -	addr = dma->addr - sizeof(*dma) + offset;
> -
> -	sg_init_table(rq->sg, 1);
> -	sg_fill_dma(rq->sg, addr, len);
> -}
> -
> -static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> -{
> -	struct page_frag *alloc_frag = &rq->alloc_frag;
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> -	struct virtnet_rq_dma *dma;
> -	void *buf, *head;
> -	dma_addr_t addr;
> -
> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
> -
> -	head = page_address(alloc_frag->page);
> -
> -	dma = head;
> -
> -	/* new pages */
> -	if (!alloc_frag->offset) {
> -		if (rq->last_dma) {
> -			/* Now, the new page is allocated, the last dma
> -			 * will not be used. So the dma can be unmapped
> -			 * if the ref is 0.
> -			 */
> -			virtnet_rq_unmap(rq, rq->last_dma, 0);
> -			rq->last_dma = NULL;
> -		}
> -
> -		dma->len = alloc_frag->size - sizeof(*dma);
> -
> -		addr = virtqueue_map_single_attrs(rq->vq, dma + 1,
> -						  dma->len, DMA_FROM_DEVICE, 0);
> -		if (virtqueue_map_mapping_error(rq->vq, addr))
> -			return NULL;
> -
> -		dma->addr = addr;
> -		dma->need_sync = virtqueue_map_need_sync(rq->vq, addr);
> -
> -		/* Add a reference to dma to prevent the entire dma from
> -		 * being released during error handling. This reference
> -		 * will be freed after the pages are no longer used.
> -		 */
> -		get_page(alloc_frag->page);
> -		dma->ref = 1;
> -		alloc_frag->offset = sizeof(*dma);
> -
> -		rq->last_dma = dma;
> -	}
> -
> -	++dma->ref;
> -
> -	buf = head + alloc_frag->offset;
> -
> -	get_page(alloc_frag->page);
> -	alloc_frag->offset += size;
> -
> -	return buf;
> +	return virtqueue_get_buf_ctx(rq->vq, len, ctx);
>  }
>
>  static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
> @@ -1067,9 +961,6 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
>  		return;
>  	}
>
> -	if (!vi->big_packets || vi->mergeable_rx_bufs)
> -		virtnet_rq_unmap(rq, buf, 0);
> -
>  	virtnet_rq_free_buf(vi, rq, buf);
>  }
>
> @@ -1335,7 +1226,7 @@ static int xsk_append_merge_buffer(struct virtnet_info *vi,
>
>  		truesize = len;
>
> -		curr_skb  = virtnet_skb_append_frag(head_skb, curr_skb, page,
> +		curr_skb  = virtnet_skb_append_frag(rq, head_skb, curr_skb, page,
>  						    buf, len, truesize);
>  		if (!curr_skb) {
>  			put_page(page);
> @@ -1771,7 +1662,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>  	return ret;
>  }
>
> -static void put_xdp_frags(struct xdp_buff *xdp)
> +static void put_xdp_frags(struct receive_queue *rq, struct xdp_buff *xdp)
>  {
>  	struct skb_shared_info *shinfo;
>  	struct page *xdp_page;
> @@ -1781,7 +1672,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
>  		shinfo = xdp_get_shared_info_from_buff(xdp);
>  		for (i = 0; i < shinfo->nr_frags; i++) {
>  			xdp_page = skb_frag_page(&shinfo->frags[i]);
> -			put_page(xdp_page);
> +			virtnet_put_page(rq, xdp_page, true);
>  		}
>  	}
>  }
> @@ -1873,7 +1764,7 @@ static struct page *xdp_linearize_page(struct net_device *dev,
>  	if (page_off + *len + tailroom > PAGE_SIZE)
>  		return NULL;
>
> -	page = alloc_page(GFP_ATOMIC);
> +	page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
>  	if (!page)
>  		return NULL;
>
> @@ -1897,7 +1788,7 @@ static struct page *xdp_linearize_page(struct net_device *dev,
>  		off = buf - page_address(p);
>
>  		if (check_mergeable_len(dev, ctx, buflen)) {
> -			put_page(p);
> +			virtnet_put_page(rq, p, true);
>  			goto err_buf;
>  		}
>
> @@ -1905,21 +1796,21 @@ static struct page *xdp_linearize_page(struct net_device *dev,
>  		 * is sending packet larger than the MTU.
>  		 */
>  		if ((page_off + buflen + tailroom) > PAGE_SIZE) {
> -			put_page(p);
> +			virtnet_put_page(rq, p, true);
>  			goto err_buf;
>  		}
>
>  		memcpy(page_address(page) + page_off,
>  		       page_address(p) + off, buflen);
>  		page_off += buflen;
> -		put_page(p);
> +		virtnet_put_page(rq, p, true);
>  	}
>
>  	/* Headroom does not contribute to packet length */
>  	*len = page_off - XDP_PACKET_HEADROOM;
>  	return page;
>  err_buf:
> -	__free_pages(page, 0);
> +	page_pool_put_page(rq->page_pool, page, -1, true);
>  	return NULL;
>  }
>
> @@ -1969,6 +1860,12 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
>  	unsigned int metasize = 0;
>  	u32 act;
>
> +	if (rq->use_page_pool_dma) {
> +		int off = buf - page_address(page);
> +
> +		page_pool_dma_sync_for_cpu(rq->page_pool, page, off, len);
> +	}
> +
>  	if (unlikely(hdr->hdr.gso_type))
>  		goto err_xdp;
>
> @@ -1996,7 +1893,7 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
>  			goto err_xdp;
>
>  		buf = page_address(xdp_page);
> -		put_page(page);
> +		virtnet_put_page(rq, page, true);
>  		page = xdp_page;
>  	}
>
> @@ -2028,13 +1925,15 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
>  	if (metasize)
>  		skb_metadata_set(skb, metasize);
>
> +	skb_mark_for_recycle(skb);
> +
>  	return skb;
>
>  err_xdp:
>  	u64_stats_inc(&stats->xdp_drops);
>  err:
>  	u64_stats_inc(&stats->drops);
> -	put_page(page);
> +	virtnet_put_page(rq, page, true);
>  xdp_xmit:
>  	return NULL;
>  }
> @@ -2056,6 +1955,12 @@ static struct sk_buff *receive_small(struct net_device *dev,
>  	 */
>  	buf -= VIRTNET_RX_PAD + xdp_headroom;
>
> +	if (rq->use_page_pool_dma) {
> +		int offset = buf - page_address(page);
> +
> +		page_pool_dma_sync_for_cpu(rq->page_pool, page, offset, len);
> +	}
> +
>  	len -= vi->hdr_len;
>  	u64_stats_add(&stats->bytes, len);
>
> @@ -2082,12 +1987,14 @@ static struct sk_buff *receive_small(struct net_device *dev,
>  	}
>
>  	skb = receive_small_build_skb(vi, xdp_headroom, buf, len);
> -	if (likely(skb))
> +	if (likely(skb)) {
> +		skb_mark_for_recycle(skb);
>  		return skb;
> +	}
>
>  err:
>  	u64_stats_inc(&stats->drops);
> -	put_page(page);
> +	virtnet_put_page(rq, page, true);
>  	return NULL;
>  }
>
> @@ -2142,7 +2049,7 @@ static void mergeable_buf_free(struct receive_queue *rq, int num_buf,
>  		}
>  		u64_stats_add(&stats->bytes, len);
>  		page = virt_to_head_page(buf);
> -		put_page(page);
> +		virtnet_put_page(rq, page, true);
>  	}
>  }
>
> @@ -2253,7 +2160,7 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev,
>  		offset = buf - page_address(page);
>
>  		if (check_mergeable_len(dev, ctx, len)) {
> -			put_page(page);
> +			virtnet_put_page(rq, page, true);
>  			goto err;
>  		}
>
> @@ -2272,7 +2179,7 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev,
>  	return 0;
>
>  err:
> -	put_xdp_frags(xdp);
> +	put_xdp_frags(rq, xdp);
>  	return -EINVAL;
>  }
>
> @@ -2337,7 +2244,7 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi,
>  		if (*len + xdp_room > PAGE_SIZE)
>  			return NULL;
>
> -		xdp_page = alloc_page(GFP_ATOMIC);
> +		xdp_page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
>  		if (!xdp_page)
>  			return NULL;
>
> @@ -2347,7 +2254,7 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi,
>
>  	*frame_sz = PAGE_SIZE;
>
> -	put_page(*page);
> +	virtnet_put_page(rq, *page, true);
>
>  	*page = xdp_page;
>
> @@ -2393,6 +2300,8 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
>  		head_skb = build_skb_from_xdp_buff(dev, vi, &xdp, xdp_frags_truesz);
>  		if (unlikely(!head_skb))
>  			break;
> +
> +		skb_mark_for_recycle(head_skb);
>  		return head_skb;
>
>  	case XDP_TX:
> @@ -2403,10 +2312,10 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
>  		break;
>  	}
>
> -	put_xdp_frags(&xdp);
> +	put_xdp_frags(rq, &xdp);
>
>  err_xdp:
> -	put_page(page);
> +	virtnet_put_page(rq, page, true);
>  	mergeable_buf_free(rq, num_buf, dev, stats);
>
>  	u64_stats_inc(&stats->xdp_drops);
> @@ -2414,7 +2323,8 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
>  	return NULL;
>  }
>
> -static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
> +static struct sk_buff *virtnet_skb_append_frag(struct receive_queue *rq,
> +					       struct sk_buff *head_skb,
>  					       struct sk_buff *curr_skb,
>  					       struct page *page, void *buf,
>  					       int len, int truesize)
> @@ -2446,7 +2356,7 @@ static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
>
>  	offset = buf - page_address(page);
>  	if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
> -		put_page(page);
> +		virtnet_put_page(rq, page, true);
>  		skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
>  				     len, truesize);
>  	} else {
> @@ -2475,6 +2385,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  	unsigned int headroom = mergeable_ctx_to_headroom(ctx);
>
>  	head_skb = NULL;
> +
> +	if (rq->use_page_pool_dma)
> +		page_pool_dma_sync_for_cpu(rq->page_pool, page, offset, len);
> +
>  	u64_stats_add(&stats->bytes, len - vi->hdr_len);
>
>  	if (check_mergeable_len(dev, ctx, len))
> @@ -2499,6 +2413,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>
>  	if (unlikely(!curr_skb))
>  		goto err_skb;
> +
> +	skb_mark_for_recycle(head_skb);
>  	while (--num_buf) {
>  		buf = virtnet_rq_get_buf(rq, &len, &ctx);
>  		if (unlikely(!buf)) {
> @@ -2517,7 +2433,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			goto err_skb;
>
>  		truesize = mergeable_ctx_to_truesize(ctx);
> -		curr_skb  = virtnet_skb_append_frag(head_skb, curr_skb, page,
> +		curr_skb  = virtnet_skb_append_frag(rq, head_skb, curr_skb, page,
>  						    buf, len, truesize);
>  		if (!curr_skb)
>  			goto err_skb;
> @@ -2527,7 +2443,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  	return head_skb;
>
>  err_skb:
> -	put_page(page);
> +	virtnet_put_page(rq, page, true);
>  	mergeable_buf_free(rq, num_buf, dev, stats);
>
>  err_buf:
> @@ -2666,32 +2582,41 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
>  static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>  			     gfp_t gfp)
>  {
> -	char *buf;
>  	unsigned int xdp_headroom = virtnet_get_headroom(vi);
>  	void *ctx = (void *)(unsigned long)xdp_headroom;
> -	int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom;
> +	unsigned int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom;
> +	struct page *page;
> +	dma_addr_t addr;
> +	char *buf;
>  	int err;
>
>  	len = SKB_DATA_ALIGN(len) +
>  	      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
> -	if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> -		return -ENOMEM;
> -
> -	buf = virtnet_rq_alloc(rq, len, gfp);
> +	buf = page_pool_alloc_va(rq->page_pool, &len, gfp);
>  	if (unlikely(!buf))
>  		return -ENOMEM;
>
>  	buf += VIRTNET_RX_PAD + xdp_headroom;
>
> -	virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
> +	if (rq->use_page_pool_dma) {
> +		page = virt_to_head_page(buf);
> +		addr = page_pool_get_dma_addr(page) +
> +		       (buf - (char *)page_address(page));
>
> -	err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, gfp);
> -	if (err < 0) {
> -		virtnet_rq_unmap(rq, buf, 0);
> -		put_page(virt_to_head_page(buf));
> +		sg_init_table(rq->sg, 1);
> +		sg_fill_dma(rq->sg, addr, vi->hdr_len + GOOD_PACKET_LEN);
> +		err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1,
> +						    buf, ctx, gfp);
> +	} else {
> +		sg_init_one(rq->sg, buf, vi->hdr_len + GOOD_PACKET_LEN);
> +		err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1,
> +					      buf, ctx, gfp);
>  	}
>
> +	if (err < 0)
> +		page_pool_put_page(rq->page_pool, virt_to_head_page(buf),
> +				   -1, false);
>  	return err;
>  }
>
> @@ -2764,13 +2689,14 @@ static unsigned int get_mergeable_buf_len(struct receive_queue *rq,
>  static int add_recvbuf_mergeable(struct virtnet_info *vi,
>  				 struct receive_queue *rq, gfp_t gfp)
>  {
> -	struct page_frag *alloc_frag = &rq->alloc_frag;
>  	unsigned int headroom = virtnet_get_headroom(vi);
>  	unsigned int tailroom = headroom ? sizeof(struct skb_shared_info) : 0;
>  	unsigned int room = SKB_DATA_ALIGN(headroom + tailroom);
> -	unsigned int len, hole;
> -	void *ctx;
> +	unsigned int len, alloc_len;
> +	struct page *page;
> +	dma_addr_t addr;
>  	char *buf;
> +	void *ctx;
>  	int err;
>
>  	/* Extra tailroom is needed to satisfy XDP's assumption. This
> @@ -2779,39 +2705,36 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
>  	 */
>  	len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>
> -	if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> -		return -ENOMEM;
> -
> -	if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size)
> -		len -= sizeof(struct virtnet_rq_dma);
> -
> -	buf = virtnet_rq_alloc(rq, len + room, gfp);
> +	alloc_len = len + room;
> +	buf = page_pool_alloc_va(rq->page_pool, &alloc_len, gfp);
>  	if (unlikely(!buf))
>  		return -ENOMEM;
>
>  	buf += headroom; /* advance address leaving hole at front of pkt */
> -	hole = alloc_frag->size - alloc_frag->offset;
> -	if (hole < len + room) {
> -		/* To avoid internal fragmentation, if there is very likely not
> -		 * enough space for another buffer, add the remaining space to
> -		 * the current buffer.
> -		 * XDP core assumes that frame_size of xdp_buff and the length
> -		 * of the frag are PAGE_SIZE, so we disable the hole mechanism.
> -		 */
> -		if (!headroom)
> -			len += hole;
> -		alloc_frag->offset += hole;
> -	}
>
> -	virtnet_rq_init_one_sg(rq, buf, len);
> +	if (!headroom)
> +		len = alloc_len - room;
>
>  	ctx = mergeable_len_to_ctx(len + room, headroom);
> -	err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, gfp);
> -	if (err < 0) {
> -		virtnet_rq_unmap(rq, buf, 0);
> -		put_page(virt_to_head_page(buf));
> +
> +	if (rq->use_page_pool_dma) {
> +		page = virt_to_head_page(buf);
> +		addr = page_pool_get_dma_addr(page) +
> +		       (buf - (char *)page_address(page));
> +
> +		sg_init_table(rq->sg, 1);
> +		sg_fill_dma(rq->sg, addr, len);
> +		err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1,
> +						    buf, ctx, gfp);
> +	} else {
> +		sg_init_one(rq->sg, buf, len);
> +		err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1,
> +					      buf, ctx, gfp);
>  	}
>
> +	if (err < 0)
> +		page_pool_put_page(rq->page_pool, virt_to_head_page(buf),
> +				   -1, false);
>  	return err;
>  }
>
> @@ -3128,7 +3051,10 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
>  		return err;
>
>  	err = xdp_rxq_info_reg_mem_model(&vi->rq[qp_index].xdp_rxq,
> -					 MEM_TYPE_PAGE_SHARED, NULL);
> +					 vi->rq[qp_index].page_pool ?
> +						MEM_TYPE_PAGE_POOL :
> +						MEM_TYPE_PAGE_SHARED,
> +					 vi->rq[qp_index].page_pool);
>  	if (err < 0)
>  		goto err_xdp_reg_mem_model;
>
> @@ -3168,6 +3094,81 @@ static void virtnet_update_settings(struct virtnet_info *vi)
>  		vi->duplex = duplex;
>  }
>
> +static int virtnet_create_page_pools(struct virtnet_info *vi)
> +{
> +	int i, err;
> +
> +	if (!vi->mergeable_rx_bufs && vi->big_packets)
> +		return 0;
> +
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
> +		struct receive_queue *rq = &vi->rq[i];
> +		struct page_pool_params pp_params = { 0 };
> +		struct device *dma_dev;
> +
> +		if (rq->page_pool)
> +			continue;
> +
> +		if (rq->xsk_pool)
> +			continue;
> +
> +		pp_params.order = 0;
> +		pp_params.pool_size = virtqueue_get_vring_size(rq->vq);
> +		pp_params.nid = dev_to_node(vi->vdev->dev.parent);
> +		pp_params.netdev = vi->dev;
> +		pp_params.napi = &rq->napi;
> +
> +		/* Check if backend supports DMA API (e.g., vhost, virtio-pci).
> +		 * If so, use page_pool's DMA mapping for premapped buffers.
> +		 * Otherwise (e.g., VDUSE), page_pool only handles allocation.
> +		 */
> +		dma_dev = virtqueue_dma_dev(rq->vq);
> +		if (dma_dev) {
> +			pp_params.dev = dma_dev;
> +			pp_params.flags = PP_FLAG_DMA_MAP;
> +			pp_params.dma_dir = DMA_FROM_DEVICE;
> +			rq->use_page_pool_dma = true;
> +		} else {
> +			pp_params.dev = vi->vdev->dev.parent;
> +			pp_params.flags = 0;
> +			rq->use_page_pool_dma = false;

Can the page pool handle DMA with vi->vdev->dev.parent?

Thanks.

> +		}
> +
> +		rq->page_pool = page_pool_create(&pp_params);
> +		if (IS_ERR(rq->page_pool)) {
> +			err = PTR_ERR(rq->page_pool);
> +			rq->page_pool = NULL;
> +			goto err_cleanup;
> +		}
> +	}
> +	return 0;
> +
> +err_cleanup:
> +	while (--i >= 0) {
> +		struct receive_queue *rq = &vi->rq[i];
> +
> +		if (rq->page_pool) {
> +			page_pool_destroy(rq->page_pool);
> +			rq->page_pool = NULL;
> +		}
> +	}
> +	return err;
> +}
> +
> +static void virtnet_destroy_page_pools(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
> +		struct receive_queue *rq = &vi->rq[i];
> +
> +		if (rq->page_pool) {
> +			page_pool_destroy(rq->page_pool);
> +			rq->page_pool = NULL;
> +		}
> +	}
> +}
> +
>  static int virtnet_open(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> @@ -6287,17 +6288,6 @@ static void free_receive_bufs(struct virtnet_info *vi)
>  	rtnl_unlock();
>  }
>
> -static void free_receive_page_frags(struct virtnet_info *vi)
> -{
> -	int i;
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> -		if (vi->rq[i].alloc_frag.page) {
> -			if (vi->rq[i].last_dma)
> -				virtnet_rq_unmap(&vi->rq[i], vi->rq[i].last_dma, 0);
> -			put_page(vi->rq[i].alloc_frag.page);
> -		}
> -}
> -
>  static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
>  {
>  	struct virtnet_info *vi = vq->vdev->priv;
> @@ -6441,10 +6431,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>  		vi->rq[i].min_buf_len = mergeable_min_buf_len(vi, vi->rq[i].vq);
>  		vi->sq[i].vq = vqs[txq2vq(i)];
>  	}
> -
>  	/* run here: ret == 0. */
>
> -
>  err_find:
>  	kfree(ctx);
>  err_ctx:
> @@ -6945,6 +6933,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>  			goto free;
>  	}
>
> +	/* Create page pools for receive queues.
> +	 * Page pools are created at probe time so they can be used
> +	 * with premapped DMA addresses throughout the device lifetime.
> +	 */
> +	err = virtnet_create_page_pools(vi);
> +	if (err)
> +		goto free_irq_moder;
> +
>  #ifdef CONFIG_SYSFS
>  	if (vi->mergeable_rx_bufs)
>  		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
> @@ -6958,7 +6954,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		vi->failover = net_failover_create(vi->dev);
>  		if (IS_ERR(vi->failover)) {
>  			err = PTR_ERR(vi->failover);
> -			goto free_vqs;
> +			goto free_page_pools;
>  		}
>  	}
>
> @@ -7075,9 +7071,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	unregister_netdev(dev);
>  free_failover:
>  	net_failover_destroy(vi->failover);
> -free_vqs:
> +free_page_pools:
> +	virtnet_destroy_page_pools(vi);
> +free_irq_moder:
> +	virtnet_free_irq_moder(vi);
>  	virtio_reset_device(vdev);
> -	free_receive_page_frags(vi);
>  	virtnet_del_vqs(vi);
>  free:
>  	free_netdev(dev);
> @@ -7102,7 +7100,7 @@ static void remove_vq_common(struct virtnet_info *vi)
>
>  	free_receive_bufs(vi);
>
> -	free_receive_page_frags(vi);
> +	virtnet_destroy_page_pools(vi);
>
>  	virtnet_del_vqs(vi);
>  }
> --
> 2.47.3
>

^ permalink raw reply

* [RFC PATCH net-next] ppp: don't store tx skb in the fastpath
From: Qingfang Deng @ 2026-02-09  2:11 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, linux-ppp, netdev, linux-kernel

Currently, ppp->xmit_pending is used in ppp_send_frame() to pass a skb
to ppp_push(), and holds the skb when a PPP channel cannot immediately
transmit it. This state is redundant because the transmit queue
(ppp->file.xq) can already handle the backlog. Furthermore, during
normal operation, an skb is queued in file.xq only to be immediately
dequeued, causing unnecessary overhead.

Refactor the transmit path to avoid stashing the skb when possible:
- Remove ppp->xmit_pending.
- Rename ppp_send_frame() to ppp_prepare_tx_skb(), and don't call
  ppp_push() in it. It returns 1 if the skb is consumed
  (dropped/handled) or 0 if it can be passed to ppp_push().
- Update ppp_push() to accept the skb. It returns 1 if the skb is
  consumed, or 0 if the channel is busy.
- Optimize __ppp_xmit_process():
  - Fastpath: If the queue is empty, attempt to send the skb directly
    via ppp_push(). If busy, queue it.
  - Slowpath: If the queue is not empty, or fastpath failed, process
    the backlog in file.xq. Split dequeueing loop into a separate
    function ppp_xmit_flush() so ppp_channel_push() uses that directly
    instead of passing a NULL skb to __ppp_xmit_process().

This simplifies the states and reduces locking in the fastpath.

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
---
- Sent as RFC, since net-next is closed.

 drivers/net/ppp/ppp_generic.c | 107 +++++++++++++++++++---------------
 1 file changed, 61 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index f8814d7be6f1..0f7bc3ab4a49 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -134,7 +134,6 @@ struct ppp {
 	int		debug;		/* debug flags 70 */
 	struct slcompress *vj;		/* state for VJ header compression */
 	enum NPmode	npmode[NUM_NP];	/* what to do with each net proto 78 */
-	struct sk_buff	*xmit_pending;	/* a packet ready to go out 88 */
 	struct compressor *xcomp;	/* transmit packet compressor 8c */
 	void		*xc_state;	/* its internal state 90 */
 	struct compressor *rcomp;	/* receive decompressor 94 */
@@ -264,8 +263,8 @@ struct ppp_net {
 static int ppp_unattached_ioctl(struct net *net, struct ppp_file *pf,
 			struct file *file, unsigned int cmd, unsigned long arg);
 static void ppp_xmit_process(struct ppp *ppp, struct sk_buff *skb);
-static void ppp_send_frame(struct ppp *ppp, struct sk_buff *skb);
-static void ppp_push(struct ppp *ppp);
+static int ppp_prepare_tx_skb(struct ppp *ppp, struct sk_buff *skb);
+static int ppp_push(struct ppp *ppp, struct sk_buff *skb);
 static void ppp_channel_push(struct channel *pch);
 static void ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb,
 			      struct channel *pch);
@@ -1651,26 +1650,44 @@ static void ppp_setup(struct net_device *dev)
  */
 
 /* Called to do any work queued up on the transmit side that can now be done */
+static void ppp_xmit_flush(struct ppp *ppp)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&ppp->file.xq))) {
+		if (unlikely(!ppp_push(ppp, skb))) {
+			skb_queue_head(&ppp->file.xq, skb);
+			return;
+		}
+	}
+	/* If there's no work left to do, tell the core net code that we can
+	 * accept some more.
+	 */
+	netif_wake_queue(ppp->dev);
+}
+
 static void __ppp_xmit_process(struct ppp *ppp, struct sk_buff *skb)
 {
 	ppp_xmit_lock(ppp);
-	if (!ppp->closing) {
-		ppp_push(ppp);
-
-		if (skb)
+	if (unlikely(ppp->closing)) {
+		kfree_skb(skb);
+		goto out;
+	}
+	if (unlikely(ppp_prepare_tx_skb(ppp, skb)))
+		goto out;
+	/* Fastpath: No backlog, just send the new skb. */
+	if (likely(skb_queue_empty(&ppp->file.xq))) {
+		if (unlikely(!ppp_push(ppp, skb))) {
 			skb_queue_tail(&ppp->file.xq, skb);
-		while (!ppp->xmit_pending &&
-		       (skb = skb_dequeue(&ppp->file.xq)))
-			ppp_send_frame(ppp, skb);
-		/* If there's no work left to do, tell the core net
-		   code that we can accept some more. */
-		if (!ppp->xmit_pending && !skb_peek(&ppp->file.xq))
-			netif_wake_queue(ppp->dev);
-		else
 			netif_stop_queue(ppp->dev);
-	} else {
-		kfree_skb(skb);
+		}
+		goto out;
 	}
+
+	/* Slowpath: Enqueue the new skb and process backlog */
+	skb_queue_tail(&ppp->file.xq, skb);
+	ppp_xmit_flush(ppp);
+out:
 	ppp_xmit_unlock(ppp);
 }
 
@@ -1757,12 +1774,12 @@ pad_compress_skb(struct ppp *ppp, struct sk_buff *skb)
 }
 
 /*
- * Compress and send a frame.
- * The caller should have locked the xmit path,
- * and xmit_pending should be 0.
+ * Compress and prepare to send a frame.
+ * The caller should have locked the xmit path.
+ * Returns 1 if the skb was consumed, 0 if it can be passed to ppp_push().
  */
-static void
-ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
+static int
+ppp_prepare_tx_skb(struct ppp *ppp, struct sk_buff *skb)
 {
 	int proto = PPP_PROTO(skb);
 	struct sk_buff *new_skb;
@@ -1784,7 +1801,7 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
 					      "PPP: outbound frame "
 					      "not passed\n");
 			kfree_skb(skb);
-			return;
+			return 1;
 		}
 		/* if this packet passes the active filter, record the time */
 		if (!(ppp->active_filter &&
@@ -1869,42 +1886,38 @@ ppp_send_frame(struct ppp *ppp, struct sk_buff *skb)
 			goto drop;
 		skb_queue_tail(&ppp->file.rq, skb);
 		wake_up_interruptible(&ppp->file.rwait);
-		return;
+		return 1;
 	}
 
-	ppp->xmit_pending = skb;
-	ppp_push(ppp);
-	return;
+	return 0;
 
  drop:
 	kfree_skb(skb);
 	++ppp->dev->stats.tx_errors;
+	return 1;
 }
 
 /*
- * Try to send the frame in xmit_pending.
+ * Try to send the frame.
  * The caller should have the xmit path locked.
+ * Returns 1 if the skb was consumed, 0 if not.
  */
-static void
-ppp_push(struct ppp *ppp)
+static int
+ppp_push(struct ppp *ppp, struct sk_buff *skb)
 {
 	struct list_head *list;
 	struct channel *pch;
-	struct sk_buff *skb = ppp->xmit_pending;
-
-	if (!skb)
-		return;
 
 	list = &ppp->channels;
 	if (list_empty(list)) {
 		/* nowhere to send the packet, just drop it */
-		ppp->xmit_pending = NULL;
 		kfree_skb(skb);
-		return;
+		return 1;
 	}
 
 	if ((ppp->flags & SC_MULTILINK) == 0) {
 		struct ppp_channel *chan;
+		int ret;
 		/* not doing multilink: send it down the first channel */
 		list = list->next;
 		pch = list_entry(list, struct channel, clist);
@@ -1916,27 +1929,26 @@ ppp_push(struct ppp *ppp)
 			 * skb but linearization failed
 			 */
 			kfree_skb(skb);
-			ppp->xmit_pending = NULL;
+			ret = 1;
 			goto out;
 		}
 
-		if (chan->ops->start_xmit(chan, skb))
-			ppp->xmit_pending = NULL;
+		ret = chan->ops->start_xmit(chan, skb);
 
 out:
 		spin_unlock(&pch->downl);
-		return;
+		return ret;
 	}
 
 #ifdef CONFIG_PPP_MULTILINK
 	/* Multilink: fragment the packet over as many links
 	   as can take the packet at the moment. */
 	if (!ppp_mp_explode(ppp, skb))
-		return;
+		return 0;
 #endif /* CONFIG_PPP_MULTILINK */
 
-	ppp->xmit_pending = NULL;
 	kfree_skb(skb);
+	return 1;
 }
 
 #ifdef CONFIG_PPP_MULTILINK
@@ -2005,7 +2017,7 @@ static int ppp_mp_explode(struct ppp *ppp, struct sk_buff *skb)
 	 * performance if we have a lot of channels.
 	 */
 	if (nfree == 0 || nfree < navail / 2)
-		return 0; /* can't take now, leave it in xmit_pending */
+		return 0; /* can't take now, leave it in transmit queue */
 
 	/* Do protocol field compression */
 	if (skb_linearize(skb))
@@ -2199,8 +2211,12 @@ static void __ppp_channel_push(struct channel *pch, struct ppp *ppp)
 	spin_unlock(&pch->downl);
 	/* see if there is anything from the attached unit to be sent */
 	if (skb_queue_empty(&pch->file.xq)) {
-		if (ppp)
-			__ppp_xmit_process(ppp, NULL);
+		if (ppp) {
+			ppp_xmit_lock(ppp);
+			if (!ppp->closing)
+				ppp_xmit_flush(ppp);
+			ppp_xmit_unlock(ppp);
+		}
 	}
 }
 
@@ -3460,7 +3476,6 @@ static void ppp_destroy_interface(struct ppp *ppp)
 	}
 #endif /* CONFIG_PPP_FILTER */
 
-	kfree_skb(ppp->xmit_pending);
 	free_percpu(ppp->xmit_recursion);
 
 	free_netdev(ppp->dev);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v2 4/9] selftests: net: move gro to lib for HW vs SW reuse
From: Willem de Bruijn @ 2026-02-09  2:36 UTC (permalink / raw)
  To: Jakub Kicinski, davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, shuah, willemb,
	petrm, donald.hunter, michael.chan, pavan.chebbi, linux-kselftest,
	Jakub Kicinski
In-Reply-To: <20260207003509.3927744-5-kuba@kernel.org>

Jakub Kicinski wrote:
> The gro.c packet sender is used for SW testing but bulk of incoming
> new tests will be HW-specific. So it's better to put them under
> drivers/net/hw/, to avoid tip-toeing around netdevsim. Move gro.c
> to lib so we can reuse it.
> 
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next v2 5/9] selftests: drv-net: give HW stats sync time extra 25% of margin
From: Willem de Bruijn @ 2026-02-09  2:37 UTC (permalink / raw)
  To: Jakub Kicinski, davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, shuah, willemb,
	petrm, donald.hunter, michael.chan, pavan.chebbi, linux-kselftest,
	Jakub Kicinski
In-Reply-To: <20260207003509.3927744-6-kuba@kernel.org>

Jakub Kicinski wrote:
> There are transient failures for devices which update stats
> periodically, especially if it's the FW DMA'ing the stats
> rather than host periodic work querying the FW. Wait 25%
> longer than strictly necessary.
> 
> For devices which don't report stats-block-usecs we retain
> 25 msec as the default wait time (0.025sec == 20,000usec * 1.25).
> 
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH net-next v2 6/9] selftests: drv-net: gro: use SO_TXTIME to schedule packets together
From: Willem de Bruijn @ 2026-02-09  2:39 UTC (permalink / raw)
  To: Jakub Kicinski, davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, shuah, willemb,
	petrm, donald.hunter, michael.chan, pavan.chebbi, linux-kselftest,
	Jakub Kicinski
In-Reply-To: <20260207003509.3927744-7-kuba@kernel.org>

Jakub Kicinski wrote:
> Longer packet sequence tests are quite flaky when the test is run
> over a real network. Try to avoid at least the jitter on the sender
> side by scheduling all the packets to be sent at once using SO_TXTIME.
> Use hardcoded tx time of 5msec in the future. In my test increasing
> this time past 2msec makes no difference so 5msec is plenty of margin.
> Since we now expect more output buffering make sure to raise SNDBUF.
> 
> Experimenting with long sequences I see frequent failures when sending
> 200 packets, only 50-100 packets get coalesced. With this change
> up to 1000 packets get coalesced relatively reliably.
> 
> Reviewed-by: Petr Machata <petrm@nvidia.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Does this require having FQ installed? I don't see any qdisc config
in the GRO test.
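
FWIW, an skb->tstamp set via SO_TXTIME only delays transmission when a
txtime-aware qdisc (fq, or etf for per-queue hardware offload) sits on the
egress device; with the default qdisc the timestamps are silently ignored.
A config fragment along these lines would be needed (interface name is a
placeholder):

```shell
# fq honors per-packet SO_TXTIME transmit times ("eth0" is hypothetical)
tc qdisc replace dev eth0 root fq
```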

^ permalink raw reply

* Re: [PATCH v6] virtio_net: add page_pool support for buffer allocation
From: Vishwanath Seshagiri @ 2026-02-09  2:42 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: Eugenio Pérez, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, David Wei, Matteo Croce,
	Ilias Apalodimas, netdev, virtualization, linux-kernel,
	kernel-team, Michael S . Tsirkin, Jason Wang
In-Reply-To: <1770602432.6451533-1-xuanzhuo@linux.alibaba.com>

On 2/8/26 6:00 PM, Xuan Zhuo wrote:
> On Sun, 8 Feb 2026 09:54:10 -0800, Vishwanath Seshagiri <vishs@meta.com> wrote:
>> Use page_pool for RX buffer allocation in mergeable and small buffer
>> modes to enable page recycling and avoid repeated page allocator calls.
>> skb_mark_for_recycle() enables page reuse in the network stack.
>>
>> Big packets mode is unchanged because it uses page->private for linked
>> list chaining of multiple pages per buffer, which conflicts with
>> page_pool's internal use of page->private.
>>
>> Implement conditional DMA premapping using virtqueue_dma_dev():
>> - When non-NULL (vhost, virtio-pci): use PP_FLAG_DMA_MAP with page_pool
>>    handling DMA mapping, submit via virtqueue_add_inbuf_premapped()
>> - When NULL (VDUSE, direct physical): page_pool handles allocation only,
>>    submit via virtqueue_add_inbuf_ctx()
>>
>> This preserves the DMA premapping optimization from commit 31f3cd4e5756b
>> ("virtio-net: rq submits premapped per-buffer") while adding page_pool
>> support as a prerequisite for future zero-copy features (devmem TCP,
>> io_uring ZCRX).
>>
>> Page pools are created in probe and destroyed in remove (not open/close),
>> following existing driver behavior where RX buffers remain in virtqueues
>> across interface state changes.
>>
>> Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
>> ---
>> Changes in v6:
>> - Drop page_pool_frag_offset_add() helper and switch to page_pool_alloc_va();
>>    page_pool_alloc_netmem() already handles internal fragmentation internally
>>    (Jakub Kicinski)
>> - v5:
>>    https://lore.kernel.org/virtualization/20260206002715.1885869-1-vishs@meta.com/
>>
>> Benchmark results:
>>
>> Configuration: pktgen TX -> tap -> vhost-net | virtio-net RX -> XDP_DROP
>>
>> Small packets (64 bytes, mrg_rxbuf=off):
>>    1Q:  853,493 -> 868,923 pps  (+1.8%)
>>    2Q: 1,655,793 -> 1,696,707 pps (+2.5%)
>>    4Q: 3,143,375 -> 3,302,511 pps (+5.1%)
>>    8Q: 6,082,590 -> 6,156,894 pps (+1.2%)
>>
>> Mergeable RX (64 bytes):
>>    1Q:   766,168 ->   814,493 pps  (+6.3%)
>>    2Q: 1,384,871 -> 1,670,639 pps (+20.6%)
>>    4Q: 2,773,081 -> 3,080,574 pps (+11.1%)
>>    8Q: 5,600,615 -> 6,043,891 pps  (+7.9%)
>>
>> Mergeable RX (1500 bytes):
>>    1Q:   741,579 ->   785,442 pps  (+5.9%)
>>    2Q: 1,310,043 -> 1,534,554 pps (+17.1%)
>>    4Q: 2,748,700 -> 2,890,582 pps  (+5.2%)
>>    8Q: 5,348,589 -> 5,618,664 pps  (+5.0%)
>>
>>   drivers/net/Kconfig      |   1 +
>>   drivers/net/virtio_net.c | 434 +++++++++++++++++++--------------------
>>   2 files changed, 217 insertions(+), 218 deletions(-)
>>
>> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>> index ac12eaf11755..f1e6b6b0a86f 100644
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -450,6 +450,7 @@ config VIRTIO_NET
>>   	depends on VIRTIO
>>   	select NET_FAILOVER
>>   	select DIMLIB
>> +	select PAGE_POOL
>>   	help
>>   	  This is the virtual network driver for virtio.  It can be used with
>>   	  QEMU based VMMs (like KVM or Xen).  Say Y or M.
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index db88dcaefb20..5055df56e4a7 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -26,6 +26,7 @@
>>   #include <net/netdev_rx_queue.h>
>>   #include <net/netdev_queues.h>
>>   #include <net/xdp_sock_drv.h>
>> +#include <net/page_pool/helpers.h>
>>
>>   static int napi_weight = NAPI_POLL_WEIGHT;
>>   module_param(napi_weight, int, 0444);
>> @@ -290,14 +291,6 @@ struct virtnet_interrupt_coalesce {
>>   	u32 max_usecs;
>>   };
>>
>> -/* The dma information of pages allocated at a time. */
>> -struct virtnet_rq_dma {
>> -	dma_addr_t addr;
>> -	u32 ref;
>> -	u16 len;
>> -	u16 need_sync;
>> -};
>> -
>>   /* Internal representation of a send virtqueue */
>>   struct send_queue {
>>   	/* Virtqueue associated with this send _queue */
>> @@ -356,8 +349,10 @@ struct receive_queue {
>>   	/* Average packet length for mergeable receive buffers. */
>>   	struct ewma_pkt_len mrg_avg_pkt_len;
>>
>> -	/* Page frag for packet buffer allocation. */
>> -	struct page_frag alloc_frag;
>> +	struct page_pool *page_pool;
>> +
>> +	/* True if page_pool handles DMA mapping via PP_FLAG_DMA_MAP */
>> +	bool use_page_pool_dma;
>>
>>   	/* RX: fragments + linear part + virtio header */
>>   	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>> @@ -370,9 +365,6 @@ struct receive_queue {
>>
>>   	struct xdp_rxq_info xdp_rxq;
>>
>> -	/* Record the last dma info to free after new pages is allocated. */
>> -	struct virtnet_rq_dma *last_dma;
>> -
>>   	struct xsk_buff_pool *xsk_pool;
>>
>>   	/* xdp rxq used by xsk */
>> @@ -521,11 +513,13 @@ static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
>>   			       struct virtnet_rq_stats *stats);
>>   static void virtnet_receive_done(struct virtnet_info *vi, struct receive_queue *rq,
>>   				 struct sk_buff *skb, u8 flags);
>> -static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
>> +static struct sk_buff *virtnet_skb_append_frag(struct receive_queue *rq,
>> +					       struct sk_buff *head_skb,
>>   					       struct sk_buff *curr_skb,
>>   					       struct page *page, void *buf,
>>   					       int len, int truesize);
>>   static void virtnet_xsk_completed(struct send_queue *sq, int num);
>> +static void free_unused_bufs(struct virtnet_info *vi);
>>
>>   enum virtnet_xmit_type {
>>   	VIRTNET_XMIT_TYPE_SKB,
>> @@ -706,15 +700,24 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>>   	return p;
>>   }
>>
>> +static void virtnet_put_page(struct receive_queue *rq, struct page *page,
>> +			     bool allow_direct)
>> +{
>> +	if (page_pool_page_is_pp(page))
>> +		page_pool_put_page(rq->page_pool, page, -1, allow_direct);
>> +	else
>> +		put_page(page);
>> +}
> 
> Why do we need this?
> Shouldn't the caller already know which one should be used?
> 

This came out of v4 review feedback asking to unify the alloc/free path
checks. But you raise a valid point - callers already know the mode via
virtnet_no_page_pool(). I can simplify this to call page_pool_put_page()
directly, since virtnet_put_page() is only reached from paths that have
already checked we're using page_pool. Would you prefer that?


> 
>> +
>>   static void virtnet_rq_free_buf(struct virtnet_info *vi,
>>   				struct receive_queue *rq, void *buf)
>>   {
>>   	if (vi->mergeable_rx_bufs)
>> -		put_page(virt_to_head_page(buf));
>> +		virtnet_put_page(rq, virt_to_head_page(buf), false);
>>   	else if (vi->big_packets)
>>   		give_pages(rq, buf);
>>   	else
>> -		put_page(virt_to_head_page(buf));
>> +		virtnet_put_page(rq, virt_to_head_page(buf), false);
>>   }
>>
>>   static void enable_rx_mode_work(struct virtnet_info *vi)
>> @@ -876,10 +879,16 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>   		skb = virtnet_build_skb(buf, truesize, p - buf, len);
>>   		if (unlikely(!skb))
>>   			return NULL;
>> +		/* Big packets mode chains pages via page->private, which is
>> +		 * incompatible with the way page_pool uses page->private.
>> +		 * Currently, big packets mode doesn't use page pools.
>> +		 */
>> +		if (vi->big_packets && !vi->mergeable_rx_bufs) {
>> +			page = (struct page *)page->private;
>> +			if (page)
>> +				give_pages(rq, page);
>> +		}
>>
>> -		page = (struct page *)page->private;
>> -		if (page)
>> -			give_pages(rq, page);
>>   		goto ok;
>>   	}
>>
>> @@ -925,133 +934,18 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>   	hdr = skb_vnet_common_hdr(skb);
>>   	memcpy(hdr, hdr_p, hdr_len);
>>   	if (page_to_free)
>> -		put_page(page_to_free);
>> +		virtnet_put_page(rq, page_to_free, true);
>>
>>   	return skb;
>>   }
>>
>> -static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len)
>> -{
>> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>> -	struct page *page = virt_to_head_page(buf);
>> -	struct virtnet_rq_dma *dma;
>> -	void *head;
>> -	int offset;
>> -
>> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
>> -
>> -	head = page_address(page);
>> -
>> -	dma = head;
>> -
>> -	--dma->ref;
>> -
>> -	if (dma->need_sync && len) {
>> -		offset = buf - (head + sizeof(*dma));
>> -
>> -		virtqueue_map_sync_single_range_for_cpu(rq->vq, dma->addr,
>> -							offset, len,
>> -							DMA_FROM_DEVICE);
>> -	}
>> -
>> -	if (dma->ref)
>> -		return;
>> -
>> -	virtqueue_unmap_single_attrs(rq->vq, dma->addr, dma->len,
>> -				     DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
>> -	put_page(page);
>> -}
>> -
>>   static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx)
>>   {
>>   	struct virtnet_info *vi = rq->vq->vdev->priv;
>> -	void *buf;
>>
>>   	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
>>
>> -	buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
>> -	if (buf)
>> -		virtnet_rq_unmap(rq, buf, *len);
>> -
>> -	return buf;
>> -}
>> -
>> -static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
>> -{
>> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>> -	struct virtnet_rq_dma *dma;
>> -	dma_addr_t addr;
>> -	u32 offset;
>> -	void *head;
>> -
>> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
>> -
>> -	head = page_address(rq->alloc_frag.page);
>> -
>> -	offset = buf - head;
>> -
>> -	dma = head;
>> -
>> -	addr = dma->addr - sizeof(*dma) + offset;
>> -
>> -	sg_init_table(rq->sg, 1);
>> -	sg_fill_dma(rq->sg, addr, len);
>> -}
>> -
>> -static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
>> -{
>> -	struct page_frag *alloc_frag = &rq->alloc_frag;
>> -	struct virtnet_info *vi = rq->vq->vdev->priv;
>> -	struct virtnet_rq_dma *dma;
>> -	void *buf, *head;
>> -	dma_addr_t addr;
>> -
>> -	BUG_ON(vi->big_packets && !vi->mergeable_rx_bufs);
>> -
>> -	head = page_address(alloc_frag->page);
>> -
>> -	dma = head;
>> -
>> -	/* new pages */
>> -	if (!alloc_frag->offset) {
>> -		if (rq->last_dma) {
>> -			/* Now, the new page is allocated, the last dma
>> -			 * will not be used. So the dma can be unmapped
>> -			 * if the ref is 0.
>> -			 */
>> -			virtnet_rq_unmap(rq, rq->last_dma, 0);
>> -			rq->last_dma = NULL;
>> -		}
>> -
>> -		dma->len = alloc_frag->size - sizeof(*dma);
>> -
>> -		addr = virtqueue_map_single_attrs(rq->vq, dma + 1,
>> -						  dma->len, DMA_FROM_DEVICE, 0);
>> -		if (virtqueue_map_mapping_error(rq->vq, addr))
>> -			return NULL;
>> -
>> -		dma->addr = addr;
>> -		dma->need_sync = virtqueue_map_need_sync(rq->vq, addr);
>> -
>> -		/* Add a reference to dma to prevent the entire dma from
>> -		 * being released during error handling. This reference
>> -		 * will be freed after the pages are no longer used.
>> -		 */
>> -		get_page(alloc_frag->page);
>> -		dma->ref = 1;
>> -		alloc_frag->offset = sizeof(*dma);
>> -
>> -		rq->last_dma = dma;
>> -	}
>> -
>> -	++dma->ref;
>> -
>> -	buf = head + alloc_frag->offset;
>> -
>> -	get_page(alloc_frag->page);
>> -	alloc_frag->offset += size;
>> -
>> -	return buf;
>> +	return virtqueue_get_buf_ctx(rq->vq, len, ctx);
>>   }
>>
>>   static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
>> @@ -1067,9 +961,6 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, void *buf)
>>   		return;
>>   	}
>>
>> -	if (!vi->big_packets || vi->mergeable_rx_bufs)
>> -		virtnet_rq_unmap(rq, buf, 0);
>> -
>>   	virtnet_rq_free_buf(vi, rq, buf);
>>   }
>>
>> @@ -1335,7 +1226,7 @@ static int xsk_append_merge_buffer(struct virtnet_info *vi,
>>
>>   		truesize = len;
>>
>> -		curr_skb  = virtnet_skb_append_frag(head_skb, curr_skb, page,
>> +		curr_skb  = virtnet_skb_append_frag(rq, head_skb, curr_skb, page,
>>   						    buf, len, truesize);
>>   		if (!curr_skb) {
>>   			put_page(page);
>> @@ -1771,7 +1662,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>>   	return ret;
>>   }
>>
>> -static void put_xdp_frags(struct xdp_buff *xdp)
>> +static void put_xdp_frags(struct receive_queue *rq, struct xdp_buff *xdp)
>>   {
>>   	struct skb_shared_info *shinfo;
>>   	struct page *xdp_page;
>> @@ -1781,7 +1672,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
>>   		shinfo = xdp_get_shared_info_from_buff(xdp);
>>   		for (i = 0; i < shinfo->nr_frags; i++) {
>>   			xdp_page = skb_frag_page(&shinfo->frags[i]);
>> -			put_page(xdp_page);
>> +			virtnet_put_page(rq, xdp_page, true);
>>   		}
>>   	}
>>   }
>> @@ -1873,7 +1764,7 @@ static struct page *xdp_linearize_page(struct net_device *dev,
>>   	if (page_off + *len + tailroom > PAGE_SIZE)
>>   		return NULL;
>>
>> -	page = alloc_page(GFP_ATOMIC);
>> +	page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
>>   	if (!page)
>>   		return NULL;
>>
>> @@ -1897,7 +1788,7 @@ static struct page *xdp_linearize_page(struct net_device *dev,
>>   		off = buf - page_address(p);
>>
>>   		if (check_mergeable_len(dev, ctx, buflen)) {
>> -			put_page(p);
>> +			virtnet_put_page(rq, p, true);
>>   			goto err_buf;
>>   		}
>>
>> @@ -1905,21 +1796,21 @@ static struct page *xdp_linearize_page(struct net_device *dev,
>>   		 * is sending packet larger than the MTU.
>>   		 */
>>   		if ((page_off + buflen + tailroom) > PAGE_SIZE) {
>> -			put_page(p);
>> +			virtnet_put_page(rq, p, true);
>>   			goto err_buf;
>>   		}
>>
>>   		memcpy(page_address(page) + page_off,
>>   		       page_address(p) + off, buflen);
>>   		page_off += buflen;
>> -		put_page(p);
>> +		virtnet_put_page(rq, p, true);
>>   	}
>>
>>   	/* Headroom does not contribute to packet length */
>>   	*len = page_off - XDP_PACKET_HEADROOM;
>>   	return page;
>>   err_buf:
>> -	__free_pages(page, 0);
>> +	page_pool_put_page(rq->page_pool, page, -1, true);
>>   	return NULL;
>>   }
>>
>> @@ -1969,6 +1860,12 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
>>   	unsigned int metasize = 0;
>>   	u32 act;
>>
>> +	if (rq->use_page_pool_dma) {
>> +		int off = buf - page_address(page);
>> +
>> +		page_pool_dma_sync_for_cpu(rq->page_pool, page, off, len);
>> +	}
>> +
>>   	if (unlikely(hdr->hdr.gso_type))
>>   		goto err_xdp;
>>
>> @@ -1996,7 +1893,7 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
>>   			goto err_xdp;
>>
>>   		buf = page_address(xdp_page);
>> -		put_page(page);
>> +		virtnet_put_page(rq, page, true);
>>   		page = xdp_page;
>>   	}
>>
>> @@ -2028,13 +1925,15 @@ static struct sk_buff *receive_small_xdp(struct net_device *dev,
>>   	if (metasize)
>>   		skb_metadata_set(skb, metasize);
>>
>> +	skb_mark_for_recycle(skb);
>> +
>>   	return skb;
>>
>>   err_xdp:
>>   	u64_stats_inc(&stats->xdp_drops);
>>   err:
>>   	u64_stats_inc(&stats->drops);
>> -	put_page(page);
>> +	virtnet_put_page(rq, page, true);
>>   xdp_xmit:
>>   	return NULL;
>>   }
>> @@ -2056,6 +1955,12 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>   	 */
>>   	buf -= VIRTNET_RX_PAD + xdp_headroom;
>>
>> +	if (rq->use_page_pool_dma) {
>> +		int offset = buf - page_address(page);
>> +
>> +		page_pool_dma_sync_for_cpu(rq->page_pool, page, offset, len);
>> +	}
>> +
>>   	len -= vi->hdr_len;
>>   	u64_stats_add(&stats->bytes, len);
>>
>> @@ -2082,12 +1987,14 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>   	}
>>
>>   	skb = receive_small_build_skb(vi, xdp_headroom, buf, len);
>> -	if (likely(skb))
>> +	if (likely(skb)) {
>> +		skb_mark_for_recycle(skb);
>>   		return skb;
>> +	}
>>
>>   err:
>>   	u64_stats_inc(&stats->drops);
>> -	put_page(page);
>> +	virtnet_put_page(rq, page, true);
>>   	return NULL;
>>   }
>>
>> @@ -2142,7 +2049,7 @@ static void mergeable_buf_free(struct receive_queue *rq, int num_buf,
>>   		}
>>   		u64_stats_add(&stats->bytes, len);
>>   		page = virt_to_head_page(buf);
>> -		put_page(page);
>> +		virtnet_put_page(rq, page, true);
>>   	}
>>   }
>>
>> @@ -2253,7 +2160,7 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev,
>>   		offset = buf - page_address(page);
>>
>>   		if (check_mergeable_len(dev, ctx, len)) {
>> -			put_page(page);
>> +			virtnet_put_page(rq, page, true);
>>   			goto err;
>>   		}
>>
>> @@ -2272,7 +2179,7 @@ static int virtnet_build_xdp_buff_mrg(struct net_device *dev,
>>   	return 0;
>>
>>   err:
>> -	put_xdp_frags(xdp);
>> +	put_xdp_frags(rq, xdp);
>>   	return -EINVAL;
>>   }
>>
>> @@ -2337,7 +2244,7 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi,
>>   		if (*len + xdp_room > PAGE_SIZE)
>>   			return NULL;
>>
>> -		xdp_page = alloc_page(GFP_ATOMIC);
>> +		xdp_page = page_pool_alloc_pages(rq->page_pool, GFP_ATOMIC);
>>   		if (!xdp_page)
>>   			return NULL;
>>
>> @@ -2347,7 +2254,7 @@ static void *mergeable_xdp_get_buf(struct virtnet_info *vi,
>>
>>   	*frame_sz = PAGE_SIZE;
>>
>> -	put_page(*page);
>> +	virtnet_put_page(rq, *page, true);
>>
>>   	*page = xdp_page;
>>
>> @@ -2393,6 +2300,8 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
>>   		head_skb = build_skb_from_xdp_buff(dev, vi, &xdp, xdp_frags_truesz);
>>   		if (unlikely(!head_skb))
>>   			break;
>> +
>> +		skb_mark_for_recycle(head_skb);
>>   		return head_skb;
>>
>>   	case XDP_TX:
>> @@ -2403,10 +2312,10 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
>>   		break;
>>   	}
>>
>> -	put_xdp_frags(&xdp);
>> +	put_xdp_frags(rq, &xdp);
>>
>>   err_xdp:
>> -	put_page(page);
>> +	virtnet_put_page(rq, page, true);
>>   	mergeable_buf_free(rq, num_buf, dev, stats);
>>
>>   	u64_stats_inc(&stats->xdp_drops);
>> @@ -2414,7 +2323,8 @@ static struct sk_buff *receive_mergeable_xdp(struct net_device *dev,
>>   	return NULL;
>>   }
>>
>> -static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
>> +static struct sk_buff *virtnet_skb_append_frag(struct receive_queue *rq,
>> +					       struct sk_buff *head_skb,
>>   					       struct sk_buff *curr_skb,
>>   					       struct page *page, void *buf,
>>   					       int len, int truesize)
>> @@ -2446,7 +2356,7 @@ static struct sk_buff *virtnet_skb_append_frag(struct sk_buff *head_skb,
>>
>>   	offset = buf - page_address(page);
>>   	if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
>> -		put_page(page);
>> +		virtnet_put_page(rq, page, true);
>>   		skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
>>   				     len, truesize);
>>   	} else {
>> @@ -2475,6 +2385,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>   	unsigned int headroom = mergeable_ctx_to_headroom(ctx);
>>
>>   	head_skb = NULL;
>> +
>> +	if (rq->use_page_pool_dma)
>> +		page_pool_dma_sync_for_cpu(rq->page_pool, page, offset, len);
>> +
>>   	u64_stats_add(&stats->bytes, len - vi->hdr_len);
>>
>>   	if (check_mergeable_len(dev, ctx, len))
>> @@ -2499,6 +2413,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>
>>   	if (unlikely(!curr_skb))
>>   		goto err_skb;
>> +
>> +	skb_mark_for_recycle(head_skb);
>>   	while (--num_buf) {
>>   		buf = virtnet_rq_get_buf(rq, &len, &ctx);
>>   		if (unlikely(!buf)) {
>> @@ -2517,7 +2433,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>   			goto err_skb;
>>
>>   		truesize = mergeable_ctx_to_truesize(ctx);
>> -		curr_skb  = virtnet_skb_append_frag(head_skb, curr_skb, page,
>> +		curr_skb  = virtnet_skb_append_frag(rq, head_skb, curr_skb, page,
>>   						    buf, len, truesize);
>>   		if (!curr_skb)
>>   			goto err_skb;
>> @@ -2527,7 +2443,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>   	return head_skb;
>>
>>   err_skb:
>> -	put_page(page);
>> +	virtnet_put_page(rq, page, true);
>>   	mergeable_buf_free(rq, num_buf, dev, stats);
>>
>>   err_buf:
>> @@ -2666,32 +2582,41 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
>>   static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
>>   			     gfp_t gfp)
>>   {
>> -	char *buf;
>>   	unsigned int xdp_headroom = virtnet_get_headroom(vi);
>>   	void *ctx = (void *)(unsigned long)xdp_headroom;
>> -	int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom;
>> +	unsigned int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom;
>> +	struct page *page;
>> +	dma_addr_t addr;
>> +	char *buf;
>>   	int err;
>>
>>   	len = SKB_DATA_ALIGN(len) +
>>   	      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>
>> -	if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
>> -		return -ENOMEM;
>> -
>> -	buf = virtnet_rq_alloc(rq, len, gfp);
>> +	buf = page_pool_alloc_va(rq->page_pool, &len, gfp);
>>   	if (unlikely(!buf))
>>   		return -ENOMEM;
>>
>>   	buf += VIRTNET_RX_PAD + xdp_headroom;
>>
>> -	virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
>> +	if (rq->use_page_pool_dma) {
>> +		page = virt_to_head_page(buf);
>> +		addr = page_pool_get_dma_addr(page) +
>> +		       (buf - (char *)page_address(page));
>>
>> -	err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, gfp);
>> -	if (err < 0) {
>> -		virtnet_rq_unmap(rq, buf, 0);
>> -		put_page(virt_to_head_page(buf));
>> +		sg_init_table(rq->sg, 1);
>> +		sg_fill_dma(rq->sg, addr, vi->hdr_len + GOOD_PACKET_LEN);
>> +		err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1,
>> +						    buf, ctx, gfp);
>> +	} else {
>> +		sg_init_one(rq->sg, buf, vi->hdr_len + GOOD_PACKET_LEN);
>> +		err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1,
>> +					      buf, ctx, gfp);
>>   	}
>>
>> +	if (err < 0)
>> +		page_pool_put_page(rq->page_pool, virt_to_head_page(buf),
>> +				   -1, false);
>>   	return err;
>>   }
>>
>> @@ -2764,13 +2689,14 @@ static unsigned int get_mergeable_buf_len(struct receive_queue *rq,
>>   static int add_recvbuf_mergeable(struct virtnet_info *vi,
>>   				 struct receive_queue *rq, gfp_t gfp)
>>   {
>> -	struct page_frag *alloc_frag = &rq->alloc_frag;
>>   	unsigned int headroom = virtnet_get_headroom(vi);
>>   	unsigned int tailroom = headroom ? sizeof(struct skb_shared_info) : 0;
>>   	unsigned int room = SKB_DATA_ALIGN(headroom + tailroom);
>> -	unsigned int len, hole;
>> -	void *ctx;
>> +	unsigned int len, alloc_len;
>> +	struct page *page;
>> +	dma_addr_t addr;
>>   	char *buf;
>> +	void *ctx;
>>   	int err;
>>
>>   	/* Extra tailroom is needed to satisfy XDP's assumption. This
>> @@ -2779,39 +2705,36 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
>>   	 */
>>   	len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>>
>> -	if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
>> -		return -ENOMEM;
>> -
>> -	if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size)
>> -		len -= sizeof(struct virtnet_rq_dma);
>> -
>> -	buf = virtnet_rq_alloc(rq, len + room, gfp);
>> +	alloc_len = len + room;
>> +	buf = page_pool_alloc_va(rq->page_pool, &alloc_len, gfp);
>>   	if (unlikely(!buf))
>>   		return -ENOMEM;
>>
>>   	buf += headroom; /* advance address leaving hole at front of pkt */
>> -	hole = alloc_frag->size - alloc_frag->offset;
>> -	if (hole < len + room) {
>> -		/* To avoid internal fragmentation, if there is very likely not
>> -		 * enough space for another buffer, add the remaining space to
>> -		 * the current buffer.
>> -		 * XDP core assumes that frame_size of xdp_buff and the length
>> -		 * of the frag are PAGE_SIZE, so we disable the hole mechanism.
>> -		 */
>> -		if (!headroom)
>> -			len += hole;
>> -		alloc_frag->offset += hole;
>> -	}
>>
>> -	virtnet_rq_init_one_sg(rq, buf, len);
>> +	if (!headroom)
>> +		len = alloc_len - room;
>>
>>   	ctx = mergeable_len_to_ctx(len + room, headroom);
>> -	err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, gfp);
>> -	if (err < 0) {
>> -		virtnet_rq_unmap(rq, buf, 0);
>> -		put_page(virt_to_head_page(buf));
>> +
>> +	if (rq->use_page_pool_dma) {
>> +		page = virt_to_head_page(buf);
>> +		addr = page_pool_get_dma_addr(page) +
>> +		       (buf - (char *)page_address(page));
>> +
>> +		sg_init_table(rq->sg, 1);
>> +		sg_fill_dma(rq->sg, addr, len);
>> +		err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1,
>> +						    buf, ctx, gfp);
>> +	} else {
>> +		sg_init_one(rq->sg, buf, len);
>> +		err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1,
>> +					      buf, ctx, gfp);
>>   	}
>>
>> +	if (err < 0)
>> +		page_pool_put_page(rq->page_pool, virt_to_head_page(buf),
>> +				   -1, false);
>>   	return err;
>>   }
>>
>> @@ -3128,7 +3051,10 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
>>   		return err;
>>
>>   	err = xdp_rxq_info_reg_mem_model(&vi->rq[qp_index].xdp_rxq,
>> -					 MEM_TYPE_PAGE_SHARED, NULL);
>> +					 vi->rq[qp_index].page_pool ?
>> +						MEM_TYPE_PAGE_POOL :
>> +						MEM_TYPE_PAGE_SHARED,
>> +					 vi->rq[qp_index].page_pool);
>>   	if (err < 0)
>>   		goto err_xdp_reg_mem_model;
>>
>> @@ -3168,6 +3094,81 @@ static void virtnet_update_settings(struct virtnet_info *vi)
>>   		vi->duplex = duplex;
>>   }
>>
>> +static int virtnet_create_page_pools(struct virtnet_info *vi)
>> +{
>> +	int i, err;
>> +
>> +	if (!vi->mergeable_rx_bufs && vi->big_packets)
>> +		return 0;
>> +
>> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>> +		struct receive_queue *rq = &vi->rq[i];
>> +		struct page_pool_params pp_params = { 0 };
>> +		struct device *dma_dev;
>> +
>> +		if (rq->page_pool)
>> +			continue;
>> +
>> +		if (rq->xsk_pool)
>> +			continue;
>> +
>> +		pp_params.order = 0;
>> +		pp_params.pool_size = virtqueue_get_vring_size(rq->vq);
>> +		pp_params.nid = dev_to_node(vi->vdev->dev.parent);
>> +		pp_params.netdev = vi->dev;
>> +		pp_params.napi = &rq->napi;
>> +
>> +		/* Check if backend supports DMA API (e.g., vhost, virtio-pci).
>> +		 * If so, use page_pool's DMA mapping for premapped buffers.
>> +		 * Otherwise (e.g., VDUSE), page_pool only handles allocation.
>> +		 */
>> +		dma_dev = virtqueue_dma_dev(rq->vq);
>> +		if (dma_dev) {
>> +			pp_params.dev = dma_dev;
>> +			pp_params.flags = PP_FLAG_DMA_MAP;
>> +			pp_params.dma_dir = DMA_FROM_DEVICE;
>> +			rq->use_page_pool_dma = true;
>> +		} else {
>> +			pp_params.dev = vi->vdev->dev.parent;
>> +			pp_params.flags = 0;
>> +			rq->use_page_pool_dma = false;
> 
> Can the page pool handle DMA with vi->vdev->dev.parent?

No, we cannot use page_pool DMA with vi->vdev->dev.parent in the VDUSE
case, because VDUSE uses its own address translation: virtqueue_dma_dev()
returns NULL and virtio doesn't use the standard DMA API at all. Now
that I think about it, setting pp_params.dev in this branch is
unnecessary since it is never accessed. I can remove it if you prefer.
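In other words, the parameter setup reduces to this decision (a
simplified stub with made-up struct and function names, not the real
struct page_pool_params):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FAKE_PP_FLAG_DMA_MAP 0x1

/* Simplified stand-in for struct page_pool_params; field names are
 * illustrative only. */
struct fake_pp_params {
	void *dev;
	unsigned int flags;
};

/* Sketch of the decision above: if the transport exposes a DMA-capable
 * device (virtqueue_dma_dev() != NULL), let the pool map pages;
 * otherwise (VDUSE) leave both dev and flags unset, since the pool
 * never dereferences dev when DMA mapping is off. */
static bool setup_pool_params(struct fake_pp_params *pp, void *dma_dev)
{
	pp->dev = NULL;
	pp->flags = 0;
	if (!dma_dev)
		return false;	/* allocation-only pool */
	pp->dev = dma_dev;
	pp->flags = FAKE_PP_FLAG_DMA_MAP;
	return true;		/* premapped DMA addresses usable */
}
```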


> 
> Thanks.
> 
>> +		}
>> +
>> +		rq->page_pool = page_pool_create(&pp_params);
>> +		if (IS_ERR(rq->page_pool)) {
>> +			err = PTR_ERR(rq->page_pool);
>> +			rq->page_pool = NULL;
>> +			goto err_cleanup;
>> +		}
>> +	}
>> +	return 0;
>> +
>> +err_cleanup:
>> +	while (--i >= 0) {
>> +		struct receive_queue *rq = &vi->rq[i];
>> +
>> +		if (rq->page_pool) {
>> +			page_pool_destroy(rq->page_pool);
>> +			rq->page_pool = NULL;
>> +		}
>> +	}
>> +	return err;
>> +}
>> +
>> +static void virtnet_destroy_page_pools(struct virtnet_info *vi)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>> +		struct receive_queue *rq = &vi->rq[i];
>> +
>> +		if (rq->page_pool) {
>> +			page_pool_destroy(rq->page_pool);
>> +			rq->page_pool = NULL;
>> +		}
>> +	}
>> +}
>> +
>>   static int virtnet_open(struct net_device *dev)
>>   {
>>   	struct virtnet_info *vi = netdev_priv(dev);
>> @@ -6287,17 +6288,6 @@ static void free_receive_bufs(struct virtnet_info *vi)
>>   	rtnl_unlock();
>>   }
>>
>> -static void free_receive_page_frags(struct virtnet_info *vi)
>> -{
>> -	int i;
>> -	for (i = 0; i < vi->max_queue_pairs; i++)
>> -		if (vi->rq[i].alloc_frag.page) {
>> -			if (vi->rq[i].last_dma)
>> -				virtnet_rq_unmap(&vi->rq[i], vi->rq[i].last_dma, 0);
>> -			put_page(vi->rq[i].alloc_frag.page);
>> -		}
>> -}
>> -
>>   static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
>>   {
>>   	struct virtnet_info *vi = vq->vdev->priv;
>> @@ -6441,10 +6431,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>>   		vi->rq[i].min_buf_len = mergeable_min_buf_len(vi, vi->rq[i].vq);
>>   		vi->sq[i].vq = vqs[txq2vq(i)];
>>   	}
>> -
>>   	/* run here: ret == 0. */
>>
>> -
>>   err_find:
>>   	kfree(ctx);
>>   err_ctx:
>> @@ -6945,6 +6933,14 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   			goto free;
>>   	}
>>
>> +	/* Create page pools for receive queues.
>> +	 * Page pools are created at probe time so they can be used
>> +	 * with premapped DMA addresses throughout the device lifetime.
>> +	 */
>> +	err = virtnet_create_page_pools(vi);
>> +	if (err)
>> +		goto free_irq_moder;
>> +
>>   #ifdef CONFIG_SYSFS
>>   	if (vi->mergeable_rx_bufs)
>>   		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
>> @@ -6958,7 +6954,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   		vi->failover = net_failover_create(vi->dev);
>>   		if (IS_ERR(vi->failover)) {
>>   			err = PTR_ERR(vi->failover);
>> -			goto free_vqs;
>> +			goto free_page_pools;
>>   		}
>>   	}
>>
>> @@ -7075,9 +7071,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>>   	unregister_netdev(dev);
>>   free_failover:
>>   	net_failover_destroy(vi->failover);
>> -free_vqs:
>> +free_page_pools:
>> +	virtnet_destroy_page_pools(vi);
>> +free_irq_moder:
>> +	virtnet_free_irq_moder(vi);
>>   	virtio_reset_device(vdev);
>> -	free_receive_page_frags(vi);
>>   	virtnet_del_vqs(vi);
>>   free:
>>   	free_netdev(dev);
>> @@ -7102,7 +7100,7 @@ static void remove_vq_common(struct virtnet_info *vi)
>>
>>   	free_receive_bufs(vi);
>>
>> -	free_receive_page_frags(vi);
>> +	virtnet_destroy_page_pools(vi);
>>
>>   	virtnet_del_vqs(vi);
>>   }
>> --
>> 2.47.3
>>


^ permalink raw reply

* [PATCH net-next] net: arcnet: remove code depending on nonexistent config option
From: Ethan Nelson-Moore @ 2026-02-09  2:54 UTC (permalink / raw)
  To: netdev
  Cc: Ethan Nelson-Moore, Michael Grzeschik, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

The CONFIG_SA1100_CT6001 option has never existed in the kernel. Remove
code in arcdevice.h referring to it. This allows the
arcnet_(in|out)(s|)b macros to be simplified by removing the BUS_ALIGN
macro.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
 drivers/net/arcnet/arcdevice.h | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/net/arcnet/arcdevice.h b/drivers/net/arcnet/arcdevice.h
index bee60b377d7c..6a5af0943567 100644
--- a/drivers/net/arcnet/arcdevice.h
+++ b/drivers/net/arcnet/arcdevice.h
@@ -374,24 +374,17 @@ static inline void arcnet_set_addr(struct net_device *dev, u8 addr)
 
 /* I/O equivalents */
 
-#ifdef CONFIG_SA1100_CT6001
-#define BUS_ALIGN  2  /* 8 bit device on a 16 bit bus - needs padding */
-#else
-#define BUS_ALIGN  1
-#endif
-
 /* addr and offset allow register like names to define the actual IO  address.
- * A configuration option multiplies the offset for alignment.
  */
 #define arcnet_inb(addr, offset)					\
-	inb((addr) + BUS_ALIGN * (offset))
+	inb((addr) + (offset))
 #define arcnet_outb(value, addr, offset)				\
-	outb(value, (addr) + BUS_ALIGN * (offset))
+	outb(value, (addr) + (offset))
 
 #define arcnet_insb(addr, offset, buffer, count)			\
-	insb((addr) + BUS_ALIGN * (offset), buffer, count)
+	insb((addr) + (offset), buffer, count)
 #define arcnet_outsb(addr, offset, buffer, count)			\
-	outsb((addr) + BUS_ALIGN * (offset), buffer, count)
+	outsb((addr) + (offset), buffer, count)
 
 #define arcnet_readb(addr, offset)					\
 	readb((addr) + (offset))
-- 
2.43.0


^ permalink raw reply related

* Re: [REGRESSION] Discussion on "xfrm: Duplicate SPI Handling"
From: Yan Yan @ 2026-02-09  3:03 UTC (permalink / raw)
  To: antony.antony, Steffen Klassert, paul
  Cc: netdev, Herbert Xu, David S . Miller, Eric Dumazet,
	Jakub Kicinski, pabeni, horms, saakashkumar, akamluddin,
	Nathan Harold, greg
In-Reply-To: <aYC0OvkVPkOnVU-i@moon.secunet.de>

Thank you all for the replies! Please see my responses inline below.

> This is meant for multicast SAs, we don't support multicast SAs.
I reread the RFC and that makes sense to me. Thanks for the explanation.

> To clarify, how are larval outbound SAs created in Android?
> Are they created with zero or non-zero SPIs? Multiple zero SPIs (acquire states) are allowed.

In Android, larval outbound SAs are created with non-zero SPIs because
Android userspace uses XFRM_MSG_ALLOCSPI to manually reserve a unique
value before the SA is fully finalized. These are not kernel-generated
"acquire" states.

Regarding the default behavior:

I'm fine with defaulting to RFC 4301 for the upstream kernel to align
with the modern standard. As long as a toggle exists to fallback to
the legacy RFC 2401 behavior, Android’s compatibility requirements are
met.




On Mon, Feb 2, 2026 at 10:27 PM Antony Antony <antony.antony@secunet.com> wrote:
>
> Hi Yan,
>
> On Wed, Jan 28, 2026 at 16:43:06 +0800, Yan Yan wrote:
> > Hi all,
> >
> > I am writing because unfortunately commit: 94f39804d891 ("xfrm:
> > Duplicate SPI Handling") has caused a regression in the Android OS, so
> > we would like to gain some context to help determine next steps. The
> > issue is caused by the new requirement for global SPI uniqueness
> > during allocation. Based on our review of RFC 4301 and the previous
> > review history, we would like to highlight a few concerns:
> >
> > 1. Regression on Android:
> > This change breaks production behavior in Android, which uses
> > XFRM_MSG_ALLOCSPI to create larval SAs for both directions. Since RFC
> > 4301 allows duplicate outbound SPIs, and larval SAs are often created
>
> To clarify, how are larval outbound SAs created in Android?
> Are they created with zero or non-zero SPIs? Multiple zero SPIs (acquire states) are allowed. However, I guess there is no user API for managing these acquire states. Only the kernel can create them and handle expiration or deletion via SA updates with acq_req?
>
> A user API (UAPI) for creating and deleting acquire states might be a possible solution.
>
> I haven’t been able to consistently reproduce the issue Marvell reported,
> but I suspect the bug could also affect outbound SAs with non-zero SPIs.
> Also when one peer is behind NAT.
>
> I wonder, wouldn't duplicate SPIs behind NAT cause issues for output SAs?
> Because the triplet is SPI, Protocol, Daddr. There is no dport in it.
>
> > before direction or selectors are finalized, the allocator must remain
> > permissive (at least in our current design).
> > This also aligns with a concern Herbert Xu raised during the initial
> > review regarding compatibility:
> > >    "It's also dangerous to unilaterally do this since existing deployments
> > >    could rely on the old behaviour. You'd need to add a toggle for
> > >    compatibility."
> >
> > 2. Inbound SPI uniqueness should not be a hard requirement:
> > The justification for enforcing global SPI uniqueness often cites the
> > statement in RFC 4301, Section 4.1, that for unicast traffic, the SPI
> > "by itself suffices to specify an SA." However, we don’t think this
> > means inbound SPI uniqueness is a hard requirement because of the two
> > following reasons:
> >
> > – Another statement implies that SPI uniqueness is just an
> > implementation choice:
> > >    "Each entry in the SA Database (SAD) MUST
> > >    indicate whether the SA lookup makes use of the destination IP address, or the
> > >    destination and source IP addresses, in addition to the SPI."
> >
> > – There is a "Longest Match" mandate which makes SPI uniqueness unnecessary:
> > >    "For each inbound, IPsec-protected packet, an implementation MUST
> > >    conduct its search of the SAD such that it finds the entry that
> > >    matches the 'longest' SA identifier. In this context, if two or more
> > >    SAD entries match based on the SPI value, then the entry that also
> > >    matches based on destination address... is the 'longest' match."
> >
> > 3. Further clarification on the specific problem being addressed would
> > be helpful. The "real-world" problem this commit intends to fix
> > remains unclear. The patch mentions:
> > >    "This behavior causes inconsistencies during SPI lookups for inbound packets.
> > >    Since the lookup may return an arbitrary SA among those with the same SPI,
> > >    packet processing can fail, resulting in packet drops."
> >
> > However, Linux kernel lookups using the triplet (SPI, Protocol, Daddr)
> > are deterministic. The lookup will not return an "arbitrary" SA
> > because the destination address is used to disambiguate the state.
> >
> > As Antony suggested, this change may cater to SPI-only hardware
> > indexing. If that is the case, we are concerned about applying such
> > hardware-specific limits to the software stack, especially if the
> > behavior is not opt-in, as it appears to require an overly-narrow
> > reading of the RFC 4301.
>
> I agree with your suggestion of making the behavior opt-in.
> I would prefer the default to be to allow duplicates.
>
> > Given these concerns, would it be possible to discuss a revert or,
> > alternatively, could further context be provided regarding the
> > specific real-world problem this commit was intended to address? Once
> > the underlying issue is clearly defined, we can work together to find
> > a backward-compatible solution that satisfies all requirements.
> >
> > Review threads are attached for easy reference:
> > https://lore.kernel.org/netdev/aDQhZ_ikHEt_pLn_@gondor.apana.org.au/T/#r45c1786651ce5af730f757aca7438474d494a323
> > https://lore.kernel.org/netdev/20250616100621.837472-1-saakashkumar@marvell.com/T/#u
>
> -antony



--
Best,
Yan

^ permalink raw reply

* [PATCH net-next] net: arcnet: com20020: remove misleading references to multicast
From: Ethan Nelson-Moore @ 2026-02-09  3:06 UTC (permalink / raw)
  To: netdev
  Cc: Ethan Nelson-Moore, Michael Grzeschik, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

ARCnet does not support multicast, only unicast and broadcast. In spite
of this, the com20020 driver contains several references to multicast
in a comment and a function name, including a FIXME that it should be
implemented. Adjust the comment to make the lack of multicast support
clear and rename com20020_set_mc_list to com20020_set_rx_mode.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
---
 drivers/net/arcnet/com20020.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/net/arcnet/com20020.c b/drivers/net/arcnet/com20020.c
index b8526805ffac..f2fa26626a06 100644
--- a/drivers/net/arcnet/com20020.c
+++ b/drivers/net/arcnet/com20020.c
@@ -56,7 +56,7 @@ static void com20020_copy_to_card(struct net_device *dev, int bufnum,
 				  int offset, void *buf, int count);
 static void com20020_copy_from_card(struct net_device *dev, int bufnum,
 				    int offset, void *buf, int count);
-static void com20020_set_mc_list(struct net_device *dev);
+static void com20020_set_rx_mode(struct net_device *dev);
 static void com20020_close(struct net_device *);
 
 static void com20020_copy_from_card(struct net_device *dev, int bufnum,
@@ -194,7 +194,7 @@ const struct net_device_ops com20020_netdev_ops = {
 	.ndo_start_xmit = arcnet_send_packet,
 	.ndo_tx_timeout = arcnet_timeout,
 	.ndo_set_mac_address = com20020_set_hwaddr,
-	.ndo_set_rx_mode = com20020_set_mc_list,
+	.ndo_set_rx_mode = com20020_set_rx_mode,
 };
 
 /* Set up the struct net_device associated with this card.  Called after
@@ -362,14 +362,8 @@ static void com20020_close(struct net_device *dev)
 	arcnet_outb(lp->config, ioaddr, COM20020_REG_W_CONFIG);
 }
 
-/* Set or clear the multicast filter for this adaptor.
- * num_addrs == -1    Promiscuous mode, receive all packets
- * num_addrs == 0       Normal mode, clear multicast list
- * num_addrs > 0        Multicast mode, receive normal and MC packets, and do
- *                      best-effort filtering.
- *      FIXME - do multicast stuff, not just promiscuous.
- */
-static void com20020_set_mc_list(struct net_device *dev)
+/* ARCnet does not support multicast, only unicast and broadcast */
+static void com20020_set_rx_mode(struct net_device *dev)
 {
 	struct arcnet_local *lp = netdev_priv(dev);
 	int ioaddr = dev->base_addr;
-- 
2.43.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox