Netdev List
 help / color / mirror / Atom feed
* Re: [net-next RFC PATCH v6 1/4] net: phy: aquantia: move to separate directory
From: Andrew Lunn @ 2023-11-10  0:12 UTC (permalink / raw)
  To: Christian Marangi
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiner Kallweit,
	Russell King, Robert Marko, Vladimir Oltean, netdev, devicetree,
	linux-kernel
In-Reply-To: <20231109123253.3933-1-ansuelsmth@gmail.com>

On Thu, Nov 09, 2023 at 01:32:50PM +0100, Christian Marangi wrote:
> Move aquantia PHY driver to separate driectory in preparation for
> firmware loading support to keep things tidy.
> 
> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [net-next RFC PATCH v6 3/4] net: phy: aquantia: add firmware load support
From: Andrew Lunn @ 2023-11-10  0:16 UTC (permalink / raw)
  To: Christian Marangi
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley, Heiner Kallweit,
	Russell King, Robert Marko, Vladimir Oltean, netdev, devicetree,
	linux-kernel
In-Reply-To: <20231109123253.3933-3-ansuelsmth@gmail.com>

On Thu, Nov 09, 2023 at 01:32:52PM +0100, Christian Marangi wrote:
> From: Robert Marko <robimarko@gmail.com>
> 
> Aquantia PHY-s require firmware to be loaded before they start operating.
> It can be automatically loaded in case when there is a SPI-NOR connected
> to Aquantia PHY-s or can be loaded from the host via MDIO.
> 
> This patch adds support for loading the firmware via MDIO as in most cases
> there is no SPI-NOR being used to save on cost.
> Firmware loading code itself is ported from mainline U-boot with cleanups.
> 
> The firmware has mixed values both in big and little endian.
> PHY core itself is big-endian but it expects values to be in little-endian.
> The firmware is little-endian but CRC-16 value for it is stored at the end
> of firmware in big-endian.
> 
> It seems the PHY does the conversion internally from firmware that is
> little-endian to the PHY that is big-endian on using the mailbox
> but mailbox returns a big-endian CRC-16 to verify the written data
> integrity.
> 
> Co-developed-by: Christian Marangi <ansuelsmth@gmail.com>
> Signed-off-by: Robert Marko <robimarko@gmail.com>
> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [syzbot] [net?] BUG: unable to handle kernel paging request in nsim_bpf
From: Stanislav Fomichev @ 2023-11-10  0:21 UTC (permalink / raw)
  To: syzbot
  Cc: ast, bpf, daniel, davem, edumazet, hawk, john.fastabend, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs
In-Reply-To: <000000000000d078d30609b138ba@google.com>

On 11/08, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:

Looks like we need to have a distinction between dev-bound vs
dev-offloaded. I'll try to poke it.

> HEAD commit:    8de1e7afcc1c Merge branch 'for-next/core' into for-kernelci
> git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-kernelci
> console output: https://syzkaller.appspot.com/x/log.txt?x=158c647b680000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3e6feaeda5dcbc27
> dashboard link: https://syzkaller.appspot.com/bug?extid=44c2416196b7c607f226
> compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> userspace arch: arm64
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=104da6eb680000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14df3787680000
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/0f00907f9764/disk-8de1e7af.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/0502fe78c60d/vmlinux-8de1e7af.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/192135168cc0/Image-8de1e7af.gz.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+44c2416196b7c607f226@syzkaller.appspotmail.com
> 
> netdevsim netdevsim0 netdevsim1: set [1, 0] type 2 family 0 port 6081 - 0
> netdevsim netdevsim0 netdevsim2: set [1, 0] type 2 family 0 port 6081 - 0
> netdevsim netdevsim0 netdevsim3: set [1, 0] type 2 family 0 port 6081 - 0
> Unable to handle kernel paging request at virtual address dfff800000000003
> KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
> Mem abort info:
>   ESR = 0x0000000096000005
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
>   FSC = 0x05: level 1 translation fault
> Data abort info:
>   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [dfff800000000003] address between user and kernel address ranges
> Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 0 PID: 6085 Comm: syz-executor153 Not tainted 6.6.0-rc7-syzkaller-g8de1e7afcc1c #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023
> pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : nsim_setup_prog_hw_checks drivers/net/netdevsim/bpf.c:320 [inline]
> pc : nsim_bpf+0x1e0/0xae0 drivers/net/netdevsim/bpf.c:562
> lr : nsim_bpf+0x8c/0xae0 drivers/net/netdevsim/bpf.c:554
> sp : ffff800096c67790
> x29: ffff800096c677a0 x28: dfff800000000000 x27: ffff700012d8cf00
> x26: dfff800000000000 x25: ffff800096c67a00 x24: 0000000000000008
> x23: ffff800096c67820 x22: 0000000000000018 x21: ffff800096c67820
> x20: ffff0000d3834cc0 x19: ffff0000d3834000 x18: ffff800096c67580
> x17: ffff8000805c1258 x16: ffff80008030c738 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000003
> x11: ffff0000d4ab3780 x10: 00000000000000bc x9 : ffff800085ce8bf0
> x8 : 0000000000000003 x7 : 0000000000000000 x6 : 0000000000000000
> x5 : ffff800092dee000 x4 : 0000000000000000 x3 : ffff80008030c754
> x2 : 0000000000000000 x1 : ffff80009001ef50 x0 : 0000000000000001
> Call trace:
>  nsim_setup_prog_hw_checks drivers/net/netdevsim/bpf.c:320 [inline]
>  nsim_bpf+0x1e0/0xae0 drivers/net/netdevsim/bpf.c:562
>  dev_xdp_install+0x124/0x2f0 net/core/dev.c:9199
>  dev_xdp_attach+0xa4c/0xcc8 net/core/dev.c:9351
>  dev_xdp_attach_link net/core/dev.c:9370 [inline]
>  bpf_xdp_link_attach+0x300/0x710 net/core/dev.c:9540
>  link_create+0x2c0/0x68c kernel/bpf/syscall.c:4954
>  __sys_bpf+0x4d4/0x5dc kernel/bpf/syscall.c:5414
>  __do_sys_bpf kernel/bpf/syscall.c:5448 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5446 [inline]
>  __arm64_sys_bpf+0x80/0x98 kernel/bpf/syscall.c:5446
>  __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
>  invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
>  el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
>  do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
>  el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678
>  el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
>  el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595
> Code: 96b3720d f94002c8 91006116 d343fec8 (387a6908) 
> ---[ end trace 0000000000000000 ]---
> ----------------
> Code disassembly (best guess):
>    0:	96b3720d 	bl	0xfffffffffacdc834
>    4:	f94002c8 	ldr	x8, [x22]
>    8:	91006116 	add	x22, x8, #0x18
>    c:	d343fec8 	lsr	x8, x22, #3
> * 10:	387a6908 	ldrb	w8, [x8, x26] <-- trapping instruction
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> 
> If the report is already addressed, let syzbot know by replying with:
> #syz fix: exact-commit-title
> 
> If you want syzbot to run the reproducer, reply with:
> #syz test: git://repo/address.git branch-or-commit-hash
> If you attach or paste a git patch, syzbot will apply it before testing.
> 
> If you want to overwrite report's subsystems, reply with:
> #syz set subsystems: new-subsystem
> (See the list of subsystem names on the web dashboard)
> 
> If the report is a duplicate of another one, reply with:
> #syz dup: exact-subject-of-another-report
> 
> If you want to undo deduplication, reply with:
> #syz undup

^ permalink raw reply

* [PATCH v2 1/2] tg3: Move the [rt]x_dropped counters to tg3_napi
From: alexey.pakhunov @ 2023-11-10  0:23 UTC (permalink / raw)
  To: mchan
  Cc: vincent.wong2, netdev, linux-kernel, siva.kallam, prashant,
	Alex Pakhunov

From: Alex Pakhunov <alexey.pakhunov@spacex.com>

This change moves [rt]x_dropped counters to tg3_napi so that they can be
updated by a single writer, race-free.

Signed-off-by: Alex Pakhunov <alexey.pakhunov@spacex.com>
Signed-off-by: Vincent Wong <vincent.wong2@spacex.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 39 +++++++++++++++++++++++++----
 drivers/net/ethernet/broadcom/tg3.h |  4 +--
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 14b311196b8f..de74a63a02dd 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -6845,7 +6845,7 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
 				       desc_idx, *post_ptr);
 		drop_it_no_recycle:
 			/* Other statistics kept track of by card. */
-			tp->rx_dropped++;
+			tnapi->rx_dropped++;
 			goto next_pkt;
 		}
 
@@ -8146,7 +8146,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 	dev_kfree_skb_any(skb);
 drop_nofree:
-	tp->tx_dropped++;
+	tnapi->tx_dropped++;
 	return NETDEV_TX_OK;
 }
 
@@ -9325,7 +9325,7 @@ static void __tg3_set_rx_mode(struct net_device *);
 /* tp->lock is held. */
 static int tg3_halt(struct tg3 *tp, int kind, bool silent)
 {
-	int err;
+	int err, i;
 
 	tg3_stop_fw(tp);
 
@@ -9346,6 +9346,14 @@ static int tg3_halt(struct tg3 *tp, int kind, bool silent)
 
 		/* And make sure the next sample is new data */
 		memset(tp->hw_stats, 0, sizeof(struct tg3_hw_stats));
+
+		for (i = 0; i < TG3_IRQ_MAX_VECS; ++i)
+		{
+			struct tg3_napi *tnapi = &tp->napi[i];
+
+			tnapi->rx_dropped = 0;
+			tnapi->tx_dropped = 0;
+		}
 	}
 
 	return err;
@@ -11895,6 +11903,9 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
 {
 	struct rtnl_link_stats64 *old_stats = &tp->net_stats_prev;
 	struct tg3_hw_stats *hw_stats = tp->hw_stats;
+	unsigned long rx_dropped;
+	unsigned long tx_dropped;
+	int i;
 
 	stats->rx_packets = old_stats->rx_packets +
 		get_stat64(&hw_stats->rx_ucast_packets) +
@@ -11941,8 +11952,26 @@ static void tg3_get_nstats(struct tg3 *tp, struct rtnl_link_stats64 *stats)
 	stats->rx_missed_errors = old_stats->rx_missed_errors +
 		get_stat64(&hw_stats->rx_discards);
 
-	stats->rx_dropped = tp->rx_dropped;
-	stats->tx_dropped = tp->tx_dropped;
+	/* Aggregate per-queue counters. The per-queue counters are updated
+	 * by a single writer, race-free. The result computed by this loop
+	 * might not be 100% accurate (counters can be updated in the middle of
+	 * the loop) but the next tg3_get_nstats() will recompute the current
+	 * value so it is acceptable.
+	 *
+	 * Note that these counters wrap around at 4G on 32bit machines.
+	 */
+	rx_dropped = (unsigned long)(old_stats->rx_dropped);
+	tx_dropped = (unsigned long)(old_stats->tx_dropped);
+
+	for (i = 0; i < tp->irq_cnt; i++) {
+		struct tg3_napi *tnapi = &tp->napi[i];
+
+		rx_dropped += tnapi->rx_dropped;
+		tx_dropped += tnapi->tx_dropped;
+	}
+
+	stats->rx_dropped = rx_dropped;
+	stats->tx_dropped = tx_dropped;
 }
 
 static int tg3_get_regs_len(struct net_device *dev)
diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
index 1000c894064f..8d753f8c5b06 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3018,6 +3018,7 @@ struct tg3_napi {
 	u16				*rx_rcb_prod_idx;
 	struct tg3_rx_prodring_set	prodring;
 	struct tg3_rx_buffer_desc	*rx_rcb;
+	unsigned long			rx_dropped;
 
 	u32				tx_prod	____cacheline_aligned;
 	u32				tx_cons;
@@ -3026,6 +3027,7 @@ struct tg3_napi {
 	u32				prodmbox;
 	struct tg3_tx_buffer_desc	*tx_ring;
 	struct tg3_tx_ring_info		*tx_buffers;
+	unsigned long			tx_dropped;
 
 	dma_addr_t			status_mapping;
 	dma_addr_t			rx_rcb_mapping;
@@ -3219,8 +3221,6 @@ struct tg3 {
 
 
 	/* begin "everything else" cacheline(s) section */
-	unsigned long			rx_dropped;
-	unsigned long			tx_dropped;
 	struct rtnl_link_stats64	net_stats_prev;
 	struct tg3_ethtool_stats	estats_prev;
 

base-commit: ffc253263a1375a65fa6c9f62a893e9767fbebfa
-- 
2.39.3


^ permalink raw reply related

* [PATCH v2 2/2] tg3: Increment tx_dropped in tg3_tso_bug()
From: alexey.pakhunov @ 2023-11-10  0:23 UTC (permalink / raw)
  To: mchan
  Cc: vincent.wong2, netdev, linux-kernel, siva.kallam, prashant,
	Alex Pakhunov
In-Reply-To: <20231110002340.3612515-1-alexey.pakhunov@spacex.com>

From: Alex Pakhunov <alexey.pakhunov@spacex.com>

tg3_tso_bug() drops a packet if it cannot be segmented for any reason.
The number of discarded frames should be incremented accordingly.

Signed-off-by: Alex Pakhunov <alexey.pakhunov@spacex.com>
Signed-off-by: Vincent Wong <vincent.wong2@spacex.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index de74a63a02dd..a71df37d78bf 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -7874,8 +7874,10 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
 
 	segs = skb_gso_segment(skb, tp->dev->features &
 				    ~(NETIF_F_TSO | NETIF_F_TSO6));
-	if (IS_ERR(segs) || !segs)
+	if (IS_ERR(segs) || !segs) {
+		tnapi->tx_dropped++;
 		goto tg3_tso_bug_end;
+	}
 
 	skb_list_walk_safe(segs, seg, next) {
 		skb_mark_not_on_list(seg);
-- 
2.39.3


^ permalink raw reply related

* RE: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it to user-mode
From: Long Li @ 2023-11-10  0:43 UTC (permalink / raw)
  To: Jakub Kicinski, longli@linuxonhyperv.com
  Cc: KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	David S. Miller, Eric Dumazet, Paolo Abeni,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20231108181318.5360af18@kernel.org>

> Subject: Re: [PATCH net-next v4] hv_netvsc: Mark VF as slave before exposing it
> to user-mode
> 
> On Wed,  8 Nov 2023 14:56:52 -0800 longli@linuxonhyperv.com wrote:
> > From: Long Li <longli@microsoft.com>
> >
> > When a VF is being exposed form the kernel, it should be marked as "slave"
> > before exposing to the user-mode. The VF is not usable without netvsc
> > running as master. The user-mode should never see a VF without the "slave"
> flag.
> >
> > An example of a user-mode program depending on this flag is cloud-init
> > (https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit
> > hub.com%2Fcanonical%2Fcloud-
> init%2Fblob%2F19.3%2Fcloudinit%2Fnet%2F__i
> >
> nit__.py&data=05%7C01%7Clongli%40microsoft.com%7C5fd05bce17d2471c74c
> 00
> >
> 8dbe0c9728b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63835092
> 80435
> >
> 56592%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMz
> IiLCJB
> >
> TiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=zAL2hc8338ci8Tl5
> ktZjk
> > mWZKCKWMqa%2BGlsE7Ty9g00%3D&reserved=0)
> 
> Quick grep for "flags", "priv" and "slave" doesn't show anything.
> Can you point me to the line of code?

I'm sorry, The URL I put in the commit should be: (I didn't realize the change has not been merged, here is the buggy code)
https://github.com/canonical/cloud-init/blob/3f515387142007fe0992a45486a1e049198a82f2/cloudinit/net/__init__.py#L1094

The code above needs to work with and without netvsc (the possible master device) present. It doesn't work properly with both conditions as of today. The patch series (with Haiyang's patches) fix that.

Because the code is specific to HyperV, we know we could be handling a VF NIC that is possibly a slave device, so checking on "slave" flag is a reliable indication whether the VF should be handled.

The current workflow in the kernel looks like this:
1. VF net device is created and expose to user-mode
2. VF is bonded to NETVSC (if NETVSC exists on the system)

With the current kernel behavior, the user-mode can possibly see the VF after 1, and before 2 when VF is bonded. When this happens, the user-mode doesn't know if the VF will be bonded in the future (it may never happen on systems without NETVSC). In this case, it doesn't know if it should configure the VF or not.

> 
> > When scanning interfaces, it checks on if this interface has a master
> > to decide if it should be configured. There are other user-mode
> > programs perform similar checks.
> >
> > This commit moves the code of setting the slave flag to the time
> > before VF is exposed to user-mode.
> 
> > Change since v3:
> > Change target to net-next.
> 
> You don't consider this a fix? It seems like a race condition.

I will work with Haiyang to get patch sent in a series.

Thanks,

Long

^ permalink raw reply

* Re: [GIT PULL] Networking for v6.7-rc1
From: pr-tracker-bot @ 2023-11-10  1:30 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: torvalds, kuba, davem, netdev, linux-kernel, pabeni
In-Reply-To: <20231109210013.1276858-1-kuba@kernel.org>

The pull request you sent on Thu,  9 Nov 2023 13:00:13 -0800:

> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git net-6.7-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/89cdf9d556016a54ff6ddd62324aa5ec790c05cc

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH net-next v2 14/21] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
From: Xuan Zhuo @ 2023-11-10  1:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109065912-mutt-send-email-mst@kernel.org>

On Thu, 9 Nov 2023 06:59:48 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Nov 09, 2023 at 07:16:08PM +0800, Xuan Zhuo wrote:
> > On Thu, 9 Nov 2023 06:11:49 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Tue, Nov 07, 2023 at 11:12:20AM +0800, Xuan Zhuo wrote:
> > > > virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk
> > > > buffer) by the last bits of the pointer.
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > >  drivers/net/virtio/virtio_net.h | 18 ++++++++++++++++--
> > > >  drivers/net/virtio/xsk.h        |  5 +++++
> > > >  2 files changed, 21 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > > > index a431a2c1ee47..a13d6d301fdb 100644
> > > > --- a/drivers/net/virtio/virtio_net.h
> > > > +++ b/drivers/net/virtio/virtio_net.h
> > > > @@ -225,6 +225,11 @@ struct virtnet_info {
> > > >  	struct failover *failover;
> > > >  };
> > > >
> > > > +static inline bool virtnet_is_skb_ptr(void *ptr)
> > > > +{
> > > > +	return !((unsigned long)ptr & VIRTIO_XMIT_DATA_MASK);
> > > > +}
> > > > +
> > > >  static inline bool virtnet_is_xdp_frame(void *ptr)
> > > >  {
> > > >  	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> > > > @@ -235,6 +240,8 @@ static inline struct xdp_frame *virtnet_ptr_to_xdp(void *ptr)
> > > >  	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> > > >  }
> > > >
> > > > +static inline u32 virtnet_ptr_to_xsk(void *ptr);
> > > > +
> > >
> > > I don't understand why you need this here.
> >
> > The below function virtnet_free_old_xmit needs this.
> >
> > Thanks.
>
> I don't understand why is virtnet_free_old_xmit inline, either.

That is in the header file.


>
> > >
> > >
> > > >  static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> > > >  {
> > > >  	struct virtnet_sq_dma *next, *head;
> > > > @@ -261,11 +268,12 @@ static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
> > > >  static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> > > >  					 u64 *bytes, u64 *packets)
> > > >  {
> > > > +	unsigned int xsknum = 0;
> > > >  	unsigned int len;
> > > >  	void *ptr;
> > > >
> > > >  	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > > > -		if (!virtnet_is_xdp_frame(ptr)) {
> > > > +		if (virtnet_is_skb_ptr(ptr)) {
> > > >  			struct sk_buff *skb;
> > > >
> > > >  			if (sq->do_dma)
> > > > @@ -277,7 +285,7 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> > > >
> > > >  			*bytes += skb->len;
> > > >  			napi_consume_skb(skb, in_napi);
> > > > -		} else {
> > > > +		} else if (virtnet_is_xdp_frame(ptr)) {
> > > >  			struct xdp_frame *frame;
> > > >
> > > >  			if (sq->do_dma)
> > > > @@ -287,9 +295,15 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
> > > >
> > > >  			*bytes += xdp_get_frame_len(frame);
> > > >  			xdp_return_frame(frame);
> > > > +		} else {
> > > > +			*bytes += virtnet_ptr_to_xsk(ptr);
> > > > +			++xsknum;
> > > >  		}
> > > >  		(*packets)++;
> > > >  	}
> > > > +
> > > > +	if (xsknum)
> > > > +		xsk_tx_completed(sq->xsk.pool, xsknum);
> > > >  }
> > > >
> > > >  static inline bool virtnet_is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
> > > > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > > > index 1bd19dcda649..7ebc9bda7aee 100644
> > > > --- a/drivers/net/virtio/xsk.h
> > > > +++ b/drivers/net/virtio/xsk.h
> > > > @@ -14,6 +14,11 @@ static inline void *virtnet_xsk_to_ptr(u32 len)
> > > >  	return (void *)(p | VIRTIO_XSK_FLAG);
> > > >  }
> > > >
> > > > +static inline u32 virtnet_ptr_to_xsk(void *ptr)
> > > > +{
> > > > +	return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
> > > > +}
> > > > +
> > > >  int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > > >  bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > > >  		      int budget);
> > > > --
> > > > 2.32.0.3.g01195cf9f
> > >
> > >
>

^ permalink raw reply

* Re: [PATCH net-next v2 17/21] virtio_net: xsk: rx: skip dma unmap when rq is bind with AF_XDP
From: Xuan Zhuo @ 2023-11-10  1:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109070015-mutt-send-email-mst@kernel.org>

On Thu, 9 Nov 2023 07:00:51 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Nov 09, 2023 at 07:10:02PM +0800, Xuan Zhuo wrote:
> > On Thu, 9 Nov 2023 03:15:03 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Tue, Nov 07, 2023 at 11:12:23AM +0800, Xuan Zhuo wrote:
> > > > When rq is bound with AF_XDP, the buffer dma is managed
> > > > by the AF_XDP APIs. So the buffer got from the virtio core should
> > > > skip the dma unmap operation.
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > >
> > >
> > > I don't get it - is this like a bugfix?
> >
> > I want focus on this. So let it as an independent commit.
> >
> > > And why do we need our own flag and checks?
> > > Doesn't virtio core DTRT?
> >
> >
> > struct vring_virtqueue {
> > 	[....]
> >
> > 	/* Do DMA mapping by driver */
> > 	bool premapped;
> >
> > We can not.
> >
> > So I add own flag.
> >
> > Thanks.
>
> Still don't get it. Why not check the premapped flag?

premapped is in the struct vring_virtqueue.

We can not access it from the driver.


>
> >
> > >
> > > > ---
> > > >  drivers/net/virtio/main.c       | 8 +++++---
> > > >  drivers/net/virtio/virtio_net.h | 3 +++
> > > >  drivers/net/virtio/xsk.c        | 1 +
> > > >  3 files changed, 9 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > > > index 15943a22e17d..a318b2533b94 100644
> > > > --- a/drivers/net/virtio/main.c
> > > > +++ b/drivers/net/virtio/main.c
> > > > @@ -430,7 +430,7 @@ static void *virtnet_rq_get_buf(struct virtnet_rq *rq, u32 *len, void **ctx)
> > > >  	void *buf;
> > > >
> > > >  	buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
> > > > -	if (buf && rq->do_dma)
> > > > +	if (buf && rq->do_dma_unmap)
> > > >  		virtnet_rq_unmap(rq, buf, *len);
> > > >
> > > >  	return buf;
> > > > @@ -561,8 +561,10 @@ static void virtnet_set_premapped(struct virtnet_info *vi)
> > > >
> > > >  		/* disable for big mode */
> > > >  		if (vi->mergeable_rx_bufs || !vi->big_packets) {
> > > > -			if (!virtqueue_set_dma_premapped(vi->rq[i].vq))
> > > > +			if (!virtqueue_set_dma_premapped(vi->rq[i].vq)) {
> > > >  				vi->rq[i].do_dma = true;
> > > > +				vi->rq[i].do_dma_unmap = true;
> > > > +			}
> > > >  		}
> > > >  	}
> > > >  }
> > > > @@ -3944,7 +3946,7 @@ void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
> > > >
> > > >  	rq = &vi->rq[i];
> > > >
> > > > -	if (rq->do_dma)
> > > > +	if (rq->do_dma_unmap)
> > > >  		virtnet_rq_unmap(rq, buf, 0);
> > > >
> > > >  	virtnet_rq_free_buf(vi, rq, buf);
> > > > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > > > index 1242785e311e..2005d0cd22e2 100644
> > > > --- a/drivers/net/virtio/virtio_net.h
> > > > +++ b/drivers/net/virtio/virtio_net.h
> > > > @@ -135,6 +135,9 @@ struct virtnet_rq {
> > > >  	/* Do dma by self */
> > > >  	bool do_dma;
> > > >
> > > > +	/* Do dma unmap after getting buf from virtio core. */
> > > > +	bool do_dma_unmap;
> > > > +
> > > >  	struct {
> > > >  		struct xsk_buff_pool *pool;
> > > >
> > > > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > > > index e737c3353212..b09c473c29fb 100644
> > > > --- a/drivers/net/virtio/xsk.c
> > > > +++ b/drivers/net/virtio/xsk.c
> > > > @@ -210,6 +210,7 @@ static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *
> > > >  		xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
> > > >
> > > >  	rq->xsk.pool = pool;
> > > > +	rq->do_dma_unmap = !pool;
> > > >
> > > >  	virtnet_rx_resume(vi, rq);
> > > >
> > > > --
> > > > 2.32.0.3.g01195cf9f
> > >
> > >
>
>

^ permalink raw reply

* Re: [PATCH net-next v2 12/21] virtio_net: xsk: tx: support tx
From: Xuan Zhuo @ 2023-11-10  1:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109061507-mutt-send-email-mst@kernel.org>

On Thu, 9 Nov 2023 06:58:48 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Nov 09, 2023 at 07:06:23PM +0800, Xuan Zhuo wrote:
> > On Thu, 9 Nov 2023 03:09:00 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Tue, Nov 07, 2023 at 11:12:18AM +0800, Xuan Zhuo wrote:
> > > > The driver's tx napi is very important for XSK. It is responsible for
> > > > obtaining data from the XSK queue and sending it out.
> > > >
> > > > At the beginning, we need to trigger tx napi.
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > >  drivers/net/virtio/main.c       |  12 +++-
> > > >  drivers/net/virtio/virtio_net.h |   3 +-
> > > >  drivers/net/virtio/xsk.c        | 110 ++++++++++++++++++++++++++++++++
> > > >  drivers/net/virtio/xsk.h        |  13 ++++
> > > >  4 files changed, 136 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > > > index 6c608b3ce27d..ff6bc764089d 100644
> > > > --- a/drivers/net/virtio/main.c
> > > > +++ b/drivers/net/virtio/main.c
> > > > @@ -2074,6 +2074,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > > >  	struct virtnet_info *vi = sq->vq->vdev->priv;
> > > >  	unsigned int index = vq2txq(sq->vq);
> > > >  	struct netdev_queue *txq;
> > > > +	int busy = 0;
> > > >  	int opaque;
> > > >  	bool done;
> > > >
> > > > @@ -2086,11 +2087,20 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> > > >  	txq = netdev_get_tx_queue(vi->dev, index);
> > > >  	__netif_tx_lock(txq, raw_smp_processor_id());
> > > >  	virtqueue_disable_cb(sq->vq);
> > > > -	free_old_xmit(sq, true);
> > > > +
> > > > +	if (sq->xsk.pool)
> > > > +		busy |= virtnet_xsk_xmit(sq, sq->xsk.pool, budget);
> > >
> > > You use bitwise or on errno values? What's going on here?
> >
> > virtnet_xsk_xmit() return that it is busy or not. Not the errno.
> > Here just record whether this handler is busy or not.
>
>
> Ah I see it's a bool. So make busy a bool too.
>
>
> > >
> > >
> > > > +	else
> > > > +		free_old_xmit(sq, true);
> > > >
> > > >  	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
> > > >  		netif_tx_wake_queue(txq);
> > > >
> > > > +	if (busy) {
> > > > +		__netif_tx_unlock(txq);
> > > > +		return budget;
> > > > +	}
> > > > +
> > > >  	opaque = virtqueue_enable_cb_prepare(sq->vq);
> > > >
> > > >  	done = napi_complete_done(napi, 0);
> > > > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > > > index 442af4673bf8..1c21af47e13c 100644
> > > > --- a/drivers/net/virtio/virtio_net.h
> > > > +++ b/drivers/net/virtio/virtio_net.h
> > > > @@ -9,7 +9,8 @@
> > > >  #include <net/xdp_sock_drv.h>
> > > >
> > > >  #define VIRTIO_XDP_FLAG	BIT(0)
> > > > -#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
> > > > +#define VIRTIO_XSK_FLAG	BIT(1)
> > > > +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG)
> > > >
> > > >  /* RX packet size EWMA. The average packet size is used to determine the packet
> > > >   * buffer size when refilling RX rings. As the entire RX ring may be refilled
> > > > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > > > index 8b397787603f..caa448308232 100644
> > > > --- a/drivers/net/virtio/xsk.c
> > > > +++ b/drivers/net/virtio/xsk.c
> > > > @@ -4,9 +4,119 @@
> > > >   */
> > > >
> > > >  #include "virtio_net.h"
> > > > +#include "xsk.h"
> > > >
> > > >  static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> > > >
> > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > > +{
> > > > +	sg->dma_address = addr;
> > > > +	sg->length = len;
> > > > +}
> > > > +
> > > > +static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
> > > > +{
> > > > +	struct virtnet_info *vi = sq->vq->vdev->priv;
> > > > +	struct net_device *dev = vi->dev;
> > > > +	int qnum = sq - vi->sq;
> > > > +
> > > > +	/* If it is a raw buffer queue, it does not check whether the status
> > > > +	 * of the queue is stopped when sending. So there is no need to check
> > > > +	 * the situation of the raw buffer queue.
> > > > +	 */
> > > > +	if (virtnet_is_xdp_raw_buffer_queue(vi, qnum))
> > > > +		return;
> > > > +
> > > > +	/* If this sq is not the exclusive queue of the current cpu,
> > > > +	 * then it may be called by start_xmit, so check it running out
> > > > +	 * of space.
> > > > +	 *
> > > > +	 * Stop the queue to avoid getting packets that we are
> > > > +	 * then unable to transmit. Then wait the tx interrupt.
> > > > +	 */
> > > > +	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
> > >
> > > what does MAX_SKB_FRAGS have to do with it? And where's 2 coming from?
> >
> > check_sq_full_and_disable()
> >
> > Thanks.
>
>
> This is one example of duplication I was talking about earlier.


OK. I write this function about two years ago. Let me rethink about this.

Thanks.


>
> > >
> > > > +		netif_stop_subqueue(dev, qnum);
> > > > +}
> > > > +
> > > > +static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
> > > > +				struct xsk_buff_pool *pool,
> > > > +				struct xdp_desc *desc)
> > > > +{
> > > > +	struct virtnet_info *vi;
> > > > +	dma_addr_t addr;
> > > > +
> > > > +	vi = sq->vq->vdev->priv;
> > > > +
> > > > +	addr = xsk_buff_raw_get_dma(pool, desc->addr);
> > > > +	xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
> > > > +
> > > > +	sg_init_table(sq->sg, 2);
> > > > +
> > > > +	sg_fill_dma(sq->sg, sq->xsk.hdr_dma_address, vi->hdr_len);
> > > > +	sg_fill_dma(sq->sg + 1, addr, desc->len);
> > > > +
> > > > +	return virtqueue_add_outbuf(sq->vq, sq->sg, 2,
> > > > +				    virtnet_xsk_to_ptr(desc->len), GFP_ATOMIC);
> > > > +}
> > > > +
> > > > +static int virtnet_xsk_xmit_batch(struct virtnet_sq *sq,
> > > > +				  struct xsk_buff_pool *pool,
> > > > +				  unsigned int budget,
> > > > +				  u64 *kicks)
> > > > +{
> > > > +	struct xdp_desc *descs = pool->tx_descs;
> > > > +	u32 nb_pkts, max_pkts, i;
> > > > +	bool kick = false;
> > > > +	int err;
> > > > +
> > > > +	/* Every xsk tx packet needs two desc(virtnet header and packet). So we
> > > > +	 * use sq->vq->num_free / 2 as the limitation.
> > > > +	 */
> > > > +	max_pkts = min_t(u32, budget, sq->vq->num_free / 2);
> > > > +
> > > > +	nb_pkts = xsk_tx_peek_release_desc_batch(pool, max_pkts);
> > > > +	if (!nb_pkts)
> > > > +		return 0;
> > > > +
> > > > +	for (i = 0; i < nb_pkts; i++) {
> > > > +		err = virtnet_xsk_xmit_one(sq, pool, &descs[i]);
> > > > +		if (unlikely(err))
> > > > +			break;
> > > > +
> > > > +		kick = true;
> > > > +	}
> > > > +
> > > > +	if (kick && virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq))
> > > > +		(*kicks)++;
> > > > +
> > > > +	return i;
> > > > +}
> > > > +
> > > > +bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > > > +		      int budget)
> > > > +{
> > > > +	u64 bytes = 0, packets = 0, kicks = 0;
> > > > +	int sent;
> > > > +
> > > > +	virtnet_free_old_xmit(sq, true, &bytes, &packets);
> > > > +
> > > > +	sent = virtnet_xsk_xmit_batch(sq, pool, budget, &kicks);
> > > > +
> > > > +	virtnet_xsk_check_queue(sq);
> > > > +
> > > > +	u64_stats_update_begin(&sq->stats.syncp);
> > > > +	u64_stats_add(&sq->stats.packets, packets);
> > > > +	u64_stats_add(&sq->stats.bytes, bytes);
> > > > +	u64_stats_add(&sq->stats.kicks, kicks);
> > > > +	u64_stats_add(&sq->stats.xdp_tx,  sent);
> > > > +	u64_stats_update_end(&sq->stats.syncp);
> > > > +
> > > > +	if (xsk_uses_need_wakeup(pool))
> > > > +		xsk_set_tx_need_wakeup(pool);
> > > > +
> > > > +	return sent == budget;
> > > > +}
> > > > +
> > > >  static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *rq,
> > > >  				    struct xsk_buff_pool *pool)
> > > >  {
> > > > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > > > index 1918285c310c..73ca8cd5308b 100644
> > > > --- a/drivers/net/virtio/xsk.h
> > > > +++ b/drivers/net/virtio/xsk.h
> > > > @@ -3,5 +3,18 @@
> > > >  #ifndef __XSK_H__
> > > >  #define __XSK_H__
> > > >
> > > > +#define VIRTIO_XSK_FLAG_OFFSET	4
> > > > +
> > > > +static inline void *virtnet_xsk_to_ptr(u32 len)
> > > > +{
> > > > +	unsigned long p;
> > > > +
> > > > +	p = len << VIRTIO_XSK_FLAG_OFFSET;
> > > > +
> > > > +	return (void *)(p | VIRTIO_XSK_FLAG);
> > > > +}
> > > > +
> > > >  int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> > > > +bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> > > > +		      int budget);
> > > >  #endif
> > > > --
> > > > 2.32.0.3.g01195cf9f
> > >
>
>

^ permalink raw reply

* Re: [PATCH] boning: use a read-write lock in bonding_show_bonds()
From: Haifeng Xu @ 2023-11-10  1:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: j.vosburgh, andy, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <ZUz/AB/kdChj5QHE@nanopsycho>



On 2023/11/9 23:47, Jiri Pirko wrote:
> 
> s/boning/bonding/
> ?

Yes, Eric has pointed out the problem in last email.
Thanks.

^ permalink raw reply

* Re: [PATCH net] virtio_net: fix missing dma unmap for resize
From: Xuan Zhuo @ 2023-11-10  1:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, Jason Wang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, virtualization
In-Reply-To: <20231109070359-mutt-send-email-mst@kernel.org>

On Thu, 9 Nov 2023 07:06:16 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Mon, Nov 06, 2023 at 04:18:32PM +0800, Xuan Zhuo wrote:
> > For rq, we have three cases getting buffers from virtio core:
> >
> > 1. virtqueue_get_buf{,_ctx}
> > 2. virtqueue_detach_unused_buf
> > 3. callback for virtqueue_resize
> >
> > But in commit 295525e29a5b("virtio_net: merge dma operations when
> > filling mergeable buffers"), I missed the dma unmap for the #3 case.
> >
> > That will leak some memory, because I did not release the pages referred
> > by the unused buffers.
> >
> > If we do such script, we will make the system OOM.
> >
> >     while true
> >     do
> >             ethtool -G ens4 rx 128
> >             ethtool -G ens4 rx 256
> >             free -m
> >     done
> >
> > Fixes: 295525e29a5b ("virtio_net: merge dma operations when filling mergeable buffers")
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >  drivers/net/virtio_net.c | 43 ++++++++++++++++++++--------------------
> >  1 file changed, 22 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index d16f592c2061..6423a3a007ce 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -408,6 +408,17 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
> >  	return p;
> >  }
> >
> > +static void virtnet_rq_free_buf(struct virtnet_info *vi,
> > +				struct receive_queue *rq, void *buf)
> > +{
> > +	if (vi->mergeable_rx_bufs)
> > +		put_page(virt_to_head_page(buf));
> > +	else if (vi->big_packets)
> > +		give_pages(rq, buf);
> > +	else
> > +		put_page(virt_to_head_page(buf));
> > +}
> > +
>
> >  static void enable_delayed_refill(struct virtnet_info *vi)
> >  {
> >  	spin_lock_bh(&vi->refill_lock);
> > @@ -634,17 +645,6 @@ static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx)
> >  	return buf;
> >  }
> >
> > -static void *virtnet_rq_detach_unused_buf(struct receive_queue *rq)
> > -{
> > -	void *buf;
> > -
> > -	buf = virtqueue_detach_unused_buf(rq->vq);
> > -	if (buf && rq->do_dma)
> > -		virtnet_rq_unmap(rq, buf, 0);
> > -
> > -	return buf;
> > -}
> > -
> >  static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len)
> >  {
> >  	struct virtnet_rq_dma *dma;
> > @@ -1764,7 +1764,7 @@ static void receive_buf(struct virtnet_info *vi, struct receive_queue *rq,
> >  	if (unlikely(len < vi->hdr_len + ETH_HLEN)) {
> >  		pr_debug("%s: short packet %i\n", dev->name, len);
> >  		DEV_STATS_INC(dev, rx_length_errors);
> > -		virtnet_rq_free_unused_buf(rq->vq, buf);
> > +		virtnet_rq_free_buf(vi, rq, buf);
> >  		return;
> >  	}
> >
> > @@ -4034,14 +4034,15 @@ static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
> >  static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
> >  {
> >  	struct virtnet_info *vi = vq->vdev->priv;
> > +	struct receive_queue *rq;
> >  	int i = vq2rxq(vq);
> >
> > -	if (vi->mergeable_rx_bufs)
> > -		put_page(virt_to_head_page(buf));
> > -	else if (vi->big_packets)
> > -		give_pages(&vi->rq[i], buf);
> > -	else
> > -		put_page(virt_to_head_page(buf));
> > +	rq = &vi->rq[i];
> > +
> > +	if (rq->do_dma)
> > +		virtnet_rq_unmap(rq, buf, 0);
> > +
> > +	virtnet_rq_free_buf(vi, rq, buf);
> >  }
> >
>
> So we have virtnet_rq_free_buf which sounds like it should free any
> buf, and we have virtnet_rq_free_unused_buf which is only for unused.
> Or so it would seem from names but this is not true.
> Better function names?

Sorry. not get it.

virtnet_rq_free_buf() that free the buf passed in. That is called by
virtnet_rq_free_unused_buf or receive_buf to free the buffer. I think
the name is right.

virtnet_rq_free_unused_buf is called by free_unused_bufs() and the
virtqueue_resize() to free the unused bufs. I think this name is right also.

So I do not get your mean.
Are there any details I've overlooked?

Thanks.

>
> >  static void free_unused_bufs(struct virtnet_info *vi)
> > @@ -4057,10 +4058,10 @@ static void free_unused_bufs(struct virtnet_info *vi)
> >  	}
> >
> >  	for (i = 0; i < vi->max_queue_pairs; i++) {
> > -		struct receive_queue *rq = &vi->rq[i];
> > +		struct virtqueue *vq = vi->rq[i].vq;
> >
> > -		while ((buf = virtnet_rq_detach_unused_buf(rq)) != NULL)
> > -			virtnet_rq_free_unused_buf(rq->vq, buf);
> > +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
> > +			virtnet_rq_free_unused_buf(vq, buf);
> >  		cond_resched();
> >  	}
> >  }
> > --
> > 2.32.0.3.g01195cf9f
>

^ permalink raw reply

* Re: [PATCH bpf-next v11 10/13] bpf, net: switch to dynamic registration
From: Martin KaFai Lau @ 2023-11-10  2:19 UTC (permalink / raw)
  To: thinker.li
  Cc: sinquersw, kuifeng, netdev, bpf, ast, song, kernel-team, andrii,
	drosen
In-Reply-To: <20231106201252.1568931-11-thinker.li@gmail.com>

On 11/6/23 12:12 PM, thinker.li@gmail.com wrote:
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 48e97a255945..432c088d4001 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1643,7 +1643,6 @@ struct bpf_struct_ops_desc {
>   #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
>   #define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
>   const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *btf, u32 type_id);
> -void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log);
>   bool bpf_struct_ops_get(const void *kdata);
>   void bpf_struct_ops_put(const void *kdata);
>   int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
> @@ -1689,10 +1688,6 @@ static inline const struct bpf_struct_ops_desc *bpf_struct_ops_find(struct btf *
>   {
>   	return NULL;
>   }
> -static inline void bpf_struct_ops_init(struct btf *btf,
> -				       struct bpf_verifier_log *log)
> -{
> -}
>   static inline bool bpf_try_module_get(const void *data, struct module *owner)
>   {
>   	return try_module_get(owner);
> @@ -3232,6 +3227,8 @@ static inline bool bpf_is_subprog(const struct bpf_prog *prog)
>   }
>   
>   #ifdef CONFIG_BPF_JIT
> +int register_bpf_struct_ops(struct bpf_struct_ops *st_ops);
> +
>   enum bpf_struct_ops_state {
>   	BPF_STRUCT_OPS_STATE_INIT,
>   	BPF_STRUCT_OPS_STATE_INUSE,
> @@ -3243,6 +3240,23 @@ struct bpf_struct_ops_common_value {
>   	refcount_t refcnt;
>   	enum bpf_struct_ops_state state;
>   };
> +
> +/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
> + * the map's value exposed to the userspace and its btf-type-id is
> + * stored at the map->btf_vmlinux_value_type_id.
> + *
> + */
> +#define DEFINE_STRUCT_OPS_VALUE_TYPE(_name)			\
> +extern struct bpf_struct_ops bpf_##_name;			\

Is it still needed?

> +								\
> +struct bpf_struct_ops_##_name {					\
> +	struct bpf_struct_ops_common_value common;		\
> +	struct _name data ____cacheline_aligned_in_smp;		\
> +}
> +
> +extern int bpf_struct_ops_desc_init(struct bpf_struct_ops_desc *st_ops_desc,
> +				    struct btf *btf,
> +				    struct bpf_verifier_log *log);

nit. Remove extern.

>   #endif /* CONFIG_BPF_JIT */
>   
>   #endif /* _LINUX_BPF_H */


^ permalink raw reply

* Re: [RFC v1 3/8] vhost: Add 3 new uapi to support iommufd
From: Jason Wang @ 2023-11-10  2:31 UTC (permalink / raw)
  To: Cindy Lu; +Cc: mst, yi.l.liu, jgg, linux-kernel, virtualization, netdev
In-Reply-To: <CACGkMEvnNXC8PhBNQn_F0ROGRX3CvwmXM6wP2A69aydSuzThYw@mail.gmail.com>

On Wed, Nov 8, 2023 at 3:09 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Nov 8, 2023 at 2:39 PM Cindy Lu <lulu@redhat.com> wrote:
> >
> > On Wed, Nov 8, 2023 at 11:03 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, Nov 7, 2023 at 2:57 PM Cindy Lu <lulu@redhat.com> wrote:
> > > >
> > > > On Mon, Nov 6, 2023 at 3:30 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Sat, Nov 4, 2023 at 1:17 AM Cindy Lu <lulu@redhat.com> wrote:
> > > > > >
> > > > > > VHOST_VDPA_SET_IOMMU_FD: bind the device to iommufd device
> > > > > >
> > > > > > VDPA_DEVICE_ATTACH_IOMMUFD_AS: Attach a vdpa device to an iommufd
> > > > > > address space specified by IOAS id.
> > > > > >
> > > > > > VDPA_DEVICE_DETACH_IOMMUFD_AS: Detach a vdpa device
> > > > > > from the iommufd address space
> > > > > >
> > > > > > Signed-off-by: Cindy Lu <lulu@redhat.com>
> > > > > > ---
> > > > >
> > > > > [...]
> > > > >
> > > > > > diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> > > > > > index f5c48b61ab62..07e1b2c443ca 100644
> > > > > > --- a/include/uapi/linux/vhost.h
> > > > > > +++ b/include/uapi/linux/vhost.h
> > > > > > @@ -219,4 +219,70 @@
> > > > > >   */
> > > > > >  #define VHOST_VDPA_RESUME              _IO(VHOST_VIRTIO, 0x7E)
> > > > > >
> > > > > > +/* vhost_vdpa_set_iommufd
> > > > > > + * Input parameters:
> > > > > > + * @iommufd: file descriptor from /dev/iommu; pass -1 to unset
> > > > > > + * @iommufd_ioasid: IOAS identifier returned from ioctl(IOMMU_IOAS_ALLOC)
> > > > > > + * Output parameters:
> > > > > > + * @out_dev_id: device identifier
> > > > > > + */
> > > > > > +struct vhost_vdpa_set_iommufd {
> > > > > > +       __s32 iommufd;
> > > > > > +       __u32 iommufd_ioasid;
> > > > > > +       __u32 out_dev_id;
> > > > > > +};
> > > > > > +
> > > > > > +#define VHOST_VDPA_SET_IOMMU_FD \
> > > > > > +       _IOW(VHOST_VIRTIO, 0x7F, struct vhost_vdpa_set_iommufd)
> > > > > > +
> > > > > > +/*
> > > > > > + * VDPA_DEVICE_ATTACH_IOMMUFD_AS -
> > > > > > + * _IOW(VHOST_VIRTIO, 0x7f, struct vdpa_device_attach_iommufd_as)
> > > > > > + *
> > > > > > + * Attach a vdpa device to an iommufd address space specified by IOAS
> > > > > > + * id.
> > > > > > + *
> > > > > > + * Available only after a device has been bound to iommufd via
> > > > > > + * VHOST_VDPA_SET_IOMMU_FD
> > > > > > + *
> > > > > > + * Undo by VDPA_DEVICE_DETACH_IOMMUFD_AS or device fd close.
> > > > > > + *
> > > > > > + * @argsz:     user filled size of this data.
> > > > > > + * @flags:     must be 0.
> > > > > > + * @ioas_id:   Input the target id which can represent an ioas
> > > > > > + *             allocated via iommufd subsystem.
> > > > > > + *
> > > > > > + * Return: 0 on success, -errno on failure.
> > > > > > + */
> > > > > > +struct vdpa_device_attach_iommufd_as {
> > > > > > +       __u32 argsz;
> > > > > > +       __u32 flags;
> > > > > > +       __u32 ioas_id;
> > > > > > +};
> > > > >
> > > > > I think we need to map ioas to vDPA AS, so there should be an ASID
> > > > > from the view of vDPA?
> > > > >
> > > > > Thanks
> > > > >
> > > > The qemu will have a structure save and  maintain this information,So
> > > > I didn't add this
> > > >  in kernel,we can add this but maybe only for check?
> > >
> > > I meant for example, a simulator has two AS. How can we attach an ioas
> > > to a specific AS with the above uAPI?
> > >
> > > Thank>
> > this   __u32 ioas_id here is alloc from the iommufd system. maybe I
> > need to change to new name iommuds_asid to
> > make this more clear
> > the process in qemu is
> >
> > 1) qemu want to use AS 0 (for example)
> > 2) checking the existing asid. the asid 0 not used before
> > 3 )alloc new asid from iommufd system, get new ioas_id (maybe 3 for example)
> > qemu will save this relation 3<-->0 in the driver.
> > 4) setting the ioctl VDPA_DEVICE_ATTACH_IOMMUFD_AS to attach new ASID
> > to the kernel
>
> So if we want to map IOMMUFD AS 3 to VDPA AS 0, how can it be done?
>
> For example I didn't see a vDPA AS parameter in the above uAPI.
>
> vhost_vdpa_set_iommufd has iommufd_ioasid which is obviously not the vDPA AS.
>
> And ioas_id of vdpa_device_attach_iommufd_as (as you explained above)
> is not vDPA AS.

For example, the simulator/mlx5e has two ASes. It needs to know the
mapping between vDPA AS and iommufd AS. Otherwise the translation will
be problematic.

Thanks

>
> Thanks
>
>
> > 5) while map the memory, qemu will use ASID 3 to map /umap
> > and use ASID 0 for legacy mode map/umap
> >
> > So kernel here will not maintain the ioas_id from iommufd,
> > and this also make the code strange since there will 2 different asid
> > for the same AS, maybe we can save these information in the kernel
> > Thanks
> > cindy
> > > > Thanks
> > > > Cindy
> > > > > > +
> > > > > > +#define VDPA_DEVICE_ATTACH_IOMMUFD_AS \
> > > > > > +       _IOW(VHOST_VIRTIO, 0x82, struct vdpa_device_attach_iommufd_as)
> > > > > > +
> > > > > > +/*
> > > > > > + * VDPA_DEVICE_DETACH_IOMMUFD_AS
> > > > > > + *
> > > > > > + * Detach a vdpa device from the iommufd address space it has been
> > > > > > + * attached to. After it, device should be in a blocking DMA state.
> > > > > > + *
> > > > > > + * Available only after a device has been bound to iommufd via
> > > > > > + * VHOST_VDPA_SET_IOMMU_FD
> > > > > > + *
> > > > > > + * @argsz:     user filled size of this data.
> > > > > > + * @flags:     must be 0.
> > > > > > + *
> > > > > > + * Return: 0 on success, -errno on failure.
> > > > > > + */
> > > > > > +struct vdpa_device_detach_iommufd_as {
> > > > > > +       __u32 argsz;
> > > > > > +       __u32 flags;
> > > > > > +};
> > > > > > +
> > > > > > +#define VDPA_DEVICE_DETACH_IOMMUFD_AS \
> > > > > > +       _IOW(VHOST_VIRTIO, 0x83, struct vdpa_device_detach_iommufd_as)
> > > > > > +
> > > > > >  #endif
> > > > > > --
> > > > > > 2.34.3
> > > > > >
> > > > >
> > > >
> > >
> >


^ permalink raw reply

* Re: [PATCH] boning: use a read-write lock in bonding_show_bonds()
From: Haifeng Xu @ 2023-11-10  2:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: j.vosburgh, andy, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <20231109095502.5a03bfd5@hermes.local>



On 2023/11/10 01:55, Stephen Hemminger wrote:
> On Wed,  8 Nov 2023 06:46:41 +0000
> Haifeng Xu <haifeng.xu@shopee.com> wrote:
> 
>> call stack:
>> ......
>> PID: 210933  TASK: ffff92424e5ec080  CPU: 13  COMMAND: "kworker/u96:2"
>> [ffffa7a8e96bbac0] __schedule at ffffffffb0719898
>> [ffffa7a8e96bbb48] schedule at ffffffffb0719e9e
>> [ffffa7a8e96bbb68] rwsem_down_write_slowpath at ffffffffafb3167a
>> [ffffa7a8e96bbc00] down_write at ffffffffb071bfc1
>> [ffffa7a8e96bbc18] kernfs_remove_by_name_ns at ffffffffafe3593e
>> [ffffa7a8e96bbc48] sysfs_unmerge_group at ffffffffafe38922
>> [ffffa7a8e96bbc68] dpm_sysfs_remove at ffffffffb021c96a
>> [ffffa7a8e96bbc80] device_del at ffffffffb0209af8
>> [ffffa7a8e96bbcd0] netdev_unregister_kobject at ffffffffb04a6b0e
>> [ffffa7a8e96bbcf8] unregister_netdevice_many at ffffffffb046d3d9
>> [ffffa7a8e96bbd60] default_device_exit_batch at ffffffffb046d8d1
>> [ffffa7a8e96bbdd0] ops_exit_list at ffffffffb045e21d
>> [ffffa7a8e96bbe00] cleanup_net at ffffffffb045ea46
>> [ffffa7a8e96bbe60] process_one_work at ffffffffafad94bb
>> [ffffa7a8e96bbeb0] worker_thread at ffffffffafad96ad
>> [ffffa7a8e96bbf10] kthread at ffffffffafae132a
>> [ffffa7a8e96bbf50] ret_from_fork at ffffffffafa04b92
>>
>> 290858 PID: 278176  TASK: ffff925deb39a040  CPU: 32  COMMAND: "node-exporter"
>> [ffffa7a8d14dbb80] __schedule at ffffffffb0719898
>> [ffffa7a8d14dbc08] schedule at ffffffffb0719e9e
>> [ffffa7a8d14dbc28] schedule_preempt_disabled at ffffffffb071a24e
>> [ffffa7a8d14dbc38] __mutex_lock at ffffffffb071af28
>> [ffffa7a8d14dbcb8] __mutex_lock_slowpath at ffffffffb071b1a3
>> [ffffa7a8d14dbcc8] mutex_lock at ffffffffb071b1e2
>> [ffffa7a8d14dbce0] rtnl_lock at ffffffffb047f4b5
>> [ffffa7a8d14dbcf0] bonding_show_bonds at ffffffffc079b1a1 [bonding]
>> [ffffa7a8d14dbd20] class_attr_show at ffffffffb02117ce
>> [ffffa7a8d14dbd30] sysfs_kf_seq_show at ffffffffafe37ba1
>> [ffffa7a8d14dbd50] kernfs_seq_show at ffffffffafe35c07
>> [ffffa7a8d14dbd60] seq_read_iter at ffffffffafd9fce0
>> [ffffa7a8d14dbdc0] kernfs_fop_read_iter at ffffffffafe36a10
>> [ffffa7a8d14dbe00] new_sync_read at ffffffffafd6de23
>> [ffffa7a8d14dbe90] vfs_read at ffffffffafd6e64e
>> [ffffa7a8d14dbed0] ksys_read at ffffffffafd70977
>> [ffffa7a8d14dbf10] __x64_sys_read at ffffffffafd70a0a
>> [ffffa7a8d14dbf20] do_syscall_64 at ffffffffb070bf1c
>> [ffffa7a8d14dbf50] entry_SYSCALL_64_after_hwframe at ffffffffb080007c
>> ......
>>
>> Problem description:
>>
>> Thread 210933 holds the rtnl_mutex and tries to acquire the kernfs_rwsem,
>> but there are many readers which hold the kernfs_rwsem, so it has to sleep
>> for a long time to wait the readers release the lock. Thread 278176 and any
>> other threads which call bonding_show_bonds() also need to wait because
>> they try to accuire the rtnl_mutex.
>>
>> bonding_show_bonds() uses rtnl_mutex to protect the bond_list traversal.
>> However, the addition and deletion of bond_list are only performed in
>> bond_init()/bond_uninit(), so we can intoduce a separate read-write lock
>> to synchronize bond list mutation.
>>
>> What's the benefits of this change?
>>
>> 1) All threads which call bonding_show_bonds() only wait when the
>> registration or unregistration of bond device happens.
>>
>> 2) There are many other users of rtnl_mutex, so bonding_show_bonds()
>> won't compete with them.
>>
>> In a word, this change reduces the lock contention of rtnl_mutex.
>>
>> Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
>> ---
>>  drivers/net/bonding/bond_main.c  | 4 ++++
>>  drivers/net/bonding/bond_sysfs.c | 6 ++++--
>>  include/net/bonding.h            | 3 +++
>>  3 files changed, 11 insertions(+), 2 deletions(-)
> 
> Reader-writer locks are slower than spin locks and should be discouraged> Would it be possible to use RCU here instead?
 
In most cases,there are many threads which want to iterate over the bond_list,
the registration or unregistration of bond device rarely happens.

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] virtio_net: xsk: rx: introduce add_recvbuf_xsk()
From: Xuan Zhuo @ 2023-11-10  2:38 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Michael S. Tsirkin, netdev, David S.  Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	virtualization, bpf
In-Reply-To: <ZU0IOQQB5WJzJezw@boxer>

On Thu, 9 Nov 2023 17:26:33 +0100, Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:
> On Thu, Nov 09, 2023 at 07:11:46PM +0800, Xuan Zhuo wrote:
> > On Thu, 9 Nov 2023 03:12:27 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Tue, Nov 07, 2023 at 11:12:22AM +0800, Xuan Zhuo wrote:
> > > > Implement the logic of filling rq with XSK buffers.
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > >  drivers/net/virtio/main.c       |  4 ++-
> > > >  drivers/net/virtio/virtio_net.h |  5 ++++
> > > >  drivers/net/virtio/xsk.c        | 49 ++++++++++++++++++++++++++++++++-
> > > >  drivers/net/virtio/xsk.h        |  2 ++
> > > >  4 files changed, 58 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > > > index 6210a6e37396..15943a22e17d 100644
> > > > --- a/drivers/net/virtio/main.c
> > > > +++ b/drivers/net/virtio/main.c
> > > > @@ -1798,7 +1798,9 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
> > > >  	bool oom;
> > > >
> > > >  	do {
> > > > -		if (vi->mergeable_rx_bufs)
> > > > +		if (rq->xsk.pool)
> > > > +			err = virtnet_add_recvbuf_xsk(vi, rq, rq->xsk.pool, gfp);
> > > > +		else if (vi->mergeable_rx_bufs)
> > > >  			err = add_recvbuf_mergeable(vi, rq, gfp);
> > > >  		else if (vi->big_packets)
> > > >  			err = add_recvbuf_big(vi, rq, gfp);
> > >
> > > I'm not sure I understand. How does this handle mergeable flag still being set?
> >
> >
> > You has the same question as Jason.
> >
> > So I think maybe I should put the handle into the
> > add_recvbuf_mergeable and add_recvbuf_small.
> >
> > Let me think about this.
> >
> >
> > >
> > >
> > > > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > > > index a13d6d301fdb..1242785e311e 100644
> > > > --- a/drivers/net/virtio/virtio_net.h
> > > > +++ b/drivers/net/virtio/virtio_net.h
> > > > @@ -140,6 +140,11 @@ struct virtnet_rq {
> > > >
> > > >  		/* xdp rxq used by xsk */
> > > >  		struct xdp_rxq_info xdp_rxq;
> > > > +
> > > > +		struct xdp_buff **xsk_buffs;
> > > > +		u32 nxt_idx;
> > > > +		u32 num;
> > > > +		u32 size;
> > > >  	} xsk;
> > > >  };
> > > >
> > > > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > > > index ea5804ddd44e..e737c3353212 100644
> > > > --- a/drivers/net/virtio/xsk.c
> > > > +++ b/drivers/net/virtio/xsk.c
> > > > @@ -38,6 +38,41 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
> > > >  		netif_stop_subqueue(dev, qnum);
> > > >  }
> > > >
> > > > +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> > > > +			    struct xsk_buff_pool *pool, gfp_t gfp)
> > > > +{
> > > > +	struct xdp_buff **xsk_buffs;
> > > > +	dma_addr_t addr;
> > > > +	u32 len, i;
> > > > +	int err = 0;
> > > > +
> > > > +	xsk_buffs = rq->xsk.xsk_buffs;
> > > > +
> > > > +	if (rq->xsk.nxt_idx >= rq->xsk.num) {
> > > > +		rq->xsk.num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->xsk.size);
> > > > +		if (!rq->xsk.num)
> > > > +			return -ENOMEM;
> > > > +		rq->xsk.nxt_idx = 0;
> > > > +	}
> > >
> > > Another manually rolled linked list implementation.
> > > Please, don't.
> >
> >
> > The array is for speedup.
> >
> > xsk_buff_alloc_batch will return many xsk_buff that will be more efficient than
> > the xsk_buff_alloc.
>
> But your sg list just contains a single entry?
> I think that you have to walk through the xsk_buffs array, retrieve dma
> addrs from there and have sg list sized to the value
> xsk_buff_alloc_batch() returned.
>
> I don't think your logic based on nxt_idx is needed. Please take a look
> how other drivers use xsk_buff_alloc_batch().
>
> I don't see callsites of virtnet_add_recvbuf_xsk() though.


virtnet_add_recvbuf_xsk is called by the above try_fill_recv()
And the loop is in there.

Jason want to reuse the loop of the try_fill_recv().
So in this function I just consume one item.

The nxt_idx is used to cross the try_fill_recv.

If we drop the nxt_idx. This patch will like this:

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 6210a6e37396..88bff83ad0d8 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -1797,6 +1797,15 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
 	int err;
 	bool oom;

+	if (rq->xsk.pool) {
+		err = virtnet_add_recvbuf_xsk(vi, rq, rq->xsk.pool, gfp);
+		oom = err == -ENOMEM;
+		if (err > 0)
+			goto kick;
+
+		return err;
+	}
+
 	do {
 		if (vi->mergeable_rx_bufs)
 			err = add_recvbuf_mergeable(vi, rq, gfp);
@@ -1809,6 +1818,7 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
 		if (err)
 			break;
 	} while (rq->vq->num_free);
+kick:
 	if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
 		unsigned long flags;

diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index a13d6d301fdb..184866014a19 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -140,6 +140,8 @@ struct virtnet_rq {

 		/* xdp rxq used by xsk */
 		struct xdp_rxq_info xdp_rxq;
+
+		struct xdp_buff **xsk_buffs;
 	} xsk;
 };

diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index ea5804ddd44e..73c9323bffd3 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -38,6 +38,46 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
 		netif_stop_subqueue(dev, qnum);
 }

+int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
+			    struct xsk_buff_pool *pool, gfp_t gfp)
+{
+	struct xdp_buff **xsk_buffs;
+	dma_addr_t addr;
+	u32 len, i;
+	int err = 0;
+	int num;
+
+	xsk_buffs = rq->xsk.xsk_buffs;
+
+	num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
+	if (!num)
+		return -ENOMEM;
+
+	for (i = 0; i < num; ++i) {
+		/* use the part of XDP_PACKET_HEADROOM as the virtnet hdr space */
+		addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len;
+		len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
+
+		sg_init_table(rq->sg, 1);
+		sg_fill_dma(rq->sg, addr, len);
+
+		err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
+		if (err)
+			goto err;
+	}
+
+	return num;
+
+err:
+	if (i)
+		err = i;
+
+	for (; i < num; ++i)
+		xsk_buff_free(xsk_buffs[i]);
+
+	return err;
+}
+
 static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
 				struct xsk_buff_pool *pool,
 				struct xdp_desc *desc)
@@ -213,7 +253,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 	struct virtnet_sq *sq;
 	struct device *dma_dev;
 	dma_addr_t hdr_dma;
-	int err;
+	int err, size;

 	/* In big_packets mode, xdp cannot work, so there is no need to
 	 * initialize xsk of rq.
@@ -249,6 +289,12 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 	if (!dma_dev)
 		return -EPERM;

+	size = virtqueue_get_vring_size(rq->vq);
+
+	rq->xsk.xsk_buffs = kcalloc(size, sizeof(*rq->xsk.xsk_buffs), GFP_KERNEL);
+	if (!rq->xsk.xsk_buffs)
+		return -ENOMEM;
+
 	hdr_dma = dma_map_single(dma_dev, &xsk_hdr, vi->hdr_len, DMA_TO_DEVICE);
 	if (dma_mapping_error(dma_dev, hdr_dma))
 		return -ENOMEM;
@@ -307,6 +353,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)

 	dma_unmap_single(dma_dev, sq->xsk.hdr_dma_address, vi->hdr_len, DMA_TO_DEVICE);

+	kfree(rq->xsk.xsk_buffs);
+
 	return err1 | err2;
 }

diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index 7ebc9bda7aee..bef41a3f954e 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -23,4 +23,6 @@ int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
 bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
 		      int budget);
 int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
+int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
+			    struct xsk_buff_pool *pool, gfp_t gfp);
 #endif



^ permalink raw reply related

* Re: [RFC PATCH v3 04/12] netdev: support binding dma-buf to netdevice
From: Mina Almasry @ 2023-11-10  2:59 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, linux-kernel, linux-arch, linux-kselftest, linux-media,
	dri-devel, linaro-mm-sig, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Jesper Dangaard Brouer, Ilias Apalodimas,
	Arnd Bergmann, David Ahern, Willem de Bruijn, Shuah Khan,
	Sumit Semwal, Christian König, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <076fa6505f3e1c79cc8acdf9903809fad6c2fd31.camel@redhat.com>

On Thu, Nov 9, 2023 at 12:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> I'm trying to wrap my head around the whole infra... the above line is
> confusing. Why do you increment dma_addr? it will be re-initialized in
> the next iteration.
>

That is just a mistake, sorry. Will remove this increment.

On Thu, Nov 9, 2023 at 1:29 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:> >>>
> >>> gen_pool_destroy BUG_ON() if it's not empty at the time of destroying.
> >>> Technically that should never happen, because
> >>> __netdev_devmem_binding_free() should only be called when the refcount
> >>> hits 0, so all the chunks have been freed back to the gen_pool. But,
> >>> just in case, I don't want to crash the server just because I'm
> >>> leaking a chunk... this is a bit of defensive programming that is
> >>> typically frowned upon, but the behavior of gen_pool is so severe I
> >>> think the WARN() + check is warranted here.
> >>
> >> It seems it is pretty normal for the above to happen nowadays because of
> >> retransmits timeouts, NAPI defer schemes mentioned below:
> >>
> >> https://lkml.kernel.org/netdev/168269854650.2191653.8465259808498269815.stgit@firesoul/
> >>
> >> And currently page pool core handles that by using a workqueue.
> >
> > Forgive me but I'm not understanding the concern here.
> >
> > __netdev_devmem_binding_free() is called when binding->ref hits 0.
> >
> > binding->ref is incremented when an iov slice of the dma-buf is
> > allocated, and decremented when an iov is freed. So,
> > __netdev_devmem_binding_free() can't really be called unless all the
> > iovs have been freed, and gen_pool_size() == gen_pool_avail(),
> > regardless of what's happening on the page_pool side of things, right?
>
> I seems to misunderstand it. In that case, it seems to be about
> defensive programming like other checking.
>
> By looking at it more closely, it seems napi_frag_unref() call
> page_pool_page_put_many() directly, which means devmem seems to
> be bypassing the napi_safe optimization.
>
> Can napi_frag_unref() reuse napi_pp_put_page() in order to reuse
> the napi_safe optimization?
>

I think it already does. page_pool_page_put_many() is only called if
!recycle or !napi_pp_put_page(). In that case
page_pool_page_put_many() is just a replacement for put_page(),
because this 'page' may be an iov.

-- 
Thanks,
Mina

^ permalink raw reply

* Re: [PATCH net-next v2 16/21] virtio_net: xsk: rx: introduce add_recvbuf_xsk()
From: Xuan Zhuo @ 2023-11-10  3:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, virtualization, bpf
In-Reply-To: <20231109031003-mutt-send-email-mst@kernel.org>

On Thu, 9 Nov 2023 03:12:27 -0500, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Tue, Nov 07, 2023 at 11:12:22AM +0800, Xuan Zhuo wrote:
> > Implement the logic of filling rq with XSK buffers.
> >
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >  drivers/net/virtio/main.c       |  4 ++-
> >  drivers/net/virtio/virtio_net.h |  5 ++++
> >  drivers/net/virtio/xsk.c        | 49 ++++++++++++++++++++++++++++++++-
> >  drivers/net/virtio/xsk.h        |  2 ++
> >  4 files changed, 58 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
> > index 6210a6e37396..15943a22e17d 100644
> > --- a/drivers/net/virtio/main.c
> > +++ b/drivers/net/virtio/main.c
> > @@ -1798,7 +1798,9 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
> >  	bool oom;
> >
> >  	do {
> > -		if (vi->mergeable_rx_bufs)
> > +		if (rq->xsk.pool)
> > +			err = virtnet_add_recvbuf_xsk(vi, rq, rq->xsk.pool, gfp);
> > +		else if (vi->mergeable_rx_bufs)
> >  			err = add_recvbuf_mergeable(vi, rq, gfp);
> >  		else if (vi->big_packets)
> >  			err = add_recvbuf_big(vi, rq, gfp);
>
> I'm not sure I understand. How does this handle mergeable flag still being set?


# xsk with merge

## fill ring

We fill the ring with the buffers from the xdp socket just like we alloc buffer
from the kernel.

## receive buffer

Now the xsk supported the multi-buffer recently. But I want to support that after
this packet set. So if the packet uses more buffers, I drop that.

If the xdp is not bound or the xdp prog does not redirect the packet to xsk socket.
all buffers are used to make skbs, whatever the packet uses one buffer or more.
But the buffers is from the xsk socket. So we must allocat new pages and copy
the data to the new pages. And free the buffers of xsk socket.


Thanks.


>
>
> > diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
> > index a13d6d301fdb..1242785e311e 100644
> > --- a/drivers/net/virtio/virtio_net.h
> > +++ b/drivers/net/virtio/virtio_net.h
> > @@ -140,6 +140,11 @@ struct virtnet_rq {
> >
> >  		/* xdp rxq used by xsk */
> >  		struct xdp_rxq_info xdp_rxq;
> > +
> > +		struct xdp_buff **xsk_buffs;
> > +		u32 nxt_idx;
> > +		u32 num;
> > +		u32 size;
> >  	} xsk;
> >  };
> >
> > diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
> > index ea5804ddd44e..e737c3353212 100644
> > --- a/drivers/net/virtio/xsk.c
> > +++ b/drivers/net/virtio/xsk.c
> > @@ -38,6 +38,41 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
> >  		netif_stop_subqueue(dev, qnum);
> >  }
> >
> > +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> > +			    struct xsk_buff_pool *pool, gfp_t gfp)
> > +{
> > +	struct xdp_buff **xsk_buffs;
> > +	dma_addr_t addr;
> > +	u32 len, i;
> > +	int err = 0;
> > +
> > +	xsk_buffs = rq->xsk.xsk_buffs;
> > +
> > +	if (rq->xsk.nxt_idx >= rq->xsk.num) {
> > +		rq->xsk.num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->xsk.size);
> > +		if (!rq->xsk.num)
> > +			return -ENOMEM;
> > +		rq->xsk.nxt_idx = 0;
> > +	}
>
> Another manually rolled linked list implementation.
> Please, don't.
>
>
> > +
> > +	i = rq->xsk.nxt_idx;
> > +
> > +	/* use the part of XDP_PACKET_HEADROOM as the virtnet hdr space */
> > +	addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len;
> > +	len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
> > +
> > +	sg_init_table(rq->sg, 1);
> > +	sg_fill_dma(rq->sg, addr, len);
> > +
> > +	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
> > +	if (err)
> > +		return err;
> > +
> > +	rq->xsk.nxt_idx++;
> > +
> > +	return 0;
> > +}
> > +
> >  static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
> >  				struct xsk_buff_pool *pool,
> >  				struct xdp_desc *desc)
> > @@ -213,7 +248,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
> >  	struct virtnet_sq *sq;
> >  	struct device *dma_dev;
> >  	dma_addr_t hdr_dma;
> > -	int err;
> > +	int err, size;
> >
> >  	/* In big_packets mode, xdp cannot work, so there is no need to
> >  	 * initialize xsk of rq.
> > @@ -249,6 +284,16 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
> >  	if (!dma_dev)
> >  		return -EPERM;
> >
> > +	size = virtqueue_get_vring_size(rq->vq);
> > +
> > +	rq->xsk.xsk_buffs = kcalloc(size, sizeof(*rq->xsk.xsk_buffs), GFP_KERNEL);
> > +	if (!rq->xsk.xsk_buffs)
> > +		return -ENOMEM;
> > +
> > +	rq->xsk.size = size;
> > +	rq->xsk.nxt_idx = 0;
> > +	rq->xsk.num = 0;
> > +
> >  	hdr_dma = dma_map_single(dma_dev, &xsk_hdr, vi->hdr_len, DMA_TO_DEVICE);
> >  	if (dma_mapping_error(dma_dev, hdr_dma))
> >  		return -ENOMEM;
> > @@ -307,6 +352,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
> >
> >  	dma_unmap_single(dma_dev, sq->xsk.hdr_dma_address, vi->hdr_len, DMA_TO_DEVICE);
> >
> > +	kfree(rq->xsk.xsk_buffs);
> > +
> >  	return err1 | err2;
> >  }
> >
> > diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
> > index 7ebc9bda7aee..bef41a3f954e 100644
> > --- a/drivers/net/virtio/xsk.h
> > +++ b/drivers/net/virtio/xsk.h
> > @@ -23,4 +23,6 @@ int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
> >  bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
> >  		      int budget);
> >  int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
> > +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
> > +			    struct xsk_buff_pool *pool, gfp_t gfp);
> >  #endif
> > --
> > 2.32.0.3.g01195cf9f
>
>

^ permalink raw reply

* Re: [PATCHv3] selftests: bpf: xskxceiver: ksft_print_msg: fix format type error
From: patchwork-bot+netdevbpf @ 2023-11-10  3:30 UTC (permalink / raw)
  To: Anders Roxell
  Cc: bjorn, magnus.karlsson, maciej.fijalkowski, andrii.nakryiko,
	netdev, bpf, linux-kernel
In-Reply-To: <20231109174328.1774571-1-anders.roxell@linaro.org>

Hello:

This patch was applied to bpf/bpf.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Thu,  9 Nov 2023 18:43:28 +0100 you wrote:
> Crossbuilding selftests/bpf for architecture arm64, format specifies
> type error show up like.
> 
> xskxceiver.c:912:34: error: format specifies type 'int' but the argument
> has type '__u64' (aka 'unsigned long long') [-Werror,-Wformat]
>  ksft_print_msg("[%s] expected meta_count [%d], got meta_count [%d]\n",
>                                                                 ~~
>                                                                 %llu
>                 __func__, pkt->pkt_nb, meta->count);
>                                        ^~~~~~~~~~~
> xskxceiver.c:929:55: error: format specifies type 'unsigned long long' but
>  the argument has type 'u64' (aka 'unsigned long') [-Werror,-Wformat]
>  ksft_print_msg("Frag invalid addr: %llx len: %u\n", addr, len);
>                                     ~~~~             ^~~~
> 
> [...]

Here is the summary with links:
  - [PATCHv3] selftests: bpf: xskxceiver: ksft_print_msg: fix format type error
    https://git.kernel.org/bpf/bpf/c/fe69a1b1b6ed

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net] bonding: stop the device in bond_setup_by_slave()
From: Hangbin Liu @ 2023-11-10  3:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Jay Vosburgh,
	Andy Gospodarek, netdev, eric.dumazet, syzbot
In-Reply-To: <20231109180102.4085183-1-edumazet@google.com>

On Thu, Nov 09, 2023 at 06:01:02PM +0000, Eric Dumazet wrote:
> Commit 9eed321cde22 ("net: lapbether: only support ethernet devices")
> has been able to keep syzbot away from net/lapb, until today.
> 
> In the following splat [1], the issue is that a lapbether device has
> been created on a bonding device without members. Then adding a non
> ARPHRD_ETHER member forced the bonding master to change its type.
> 
> The fix is to make sure we call dev_close() in bond_setup_by_slave()
> so that the potential linked lapbether devices (or any other devices
> having assumptions on the physical device) are removed.
> 
> A similar bug has been addressed in commit 40baec225765
> ("bonding: fix panic on non-ARPHRD_ETHER enslave failure")
> 

Do we need also do this if the bond changed to ether device from other dev
type? e.g.

    if (slave_dev->type != ARPHRD_ETHER)
            bond_setup_by_slave(bond_dev, slave_dev);
    else
            bond_ether_setup(bond_dev);

Thanks
Hangbin

^ permalink raw reply

* [PATCH v10 bpf-next 01/17] bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
From: Andrii Nakryiko @ 2023-11-10  3:48 UTC (permalink / raw)
  To: bpf, netdev, paul, brauner
  Cc: linux-fsdevel, linux-security-module, keescook, kernel-team,
	sargun
In-Reply-To: <20231110034838.1295764-1-andrii@kernel.org>

Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
takes over and grants whatever part of BPF syscall is required.

Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
BPF_PROG_LOAD, if the type of BPF program is "network-related" either
CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.

But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
is set:
  - when creating DEVMAP/XDKMAP/CPU_MAP maps;
  - when attaching CGROUP_SKB programs;
  - when handling BPF_PROG_QUERY command.

This patch is changing the latter three cases to follow BPF_PROG_LOAD
model, that is allowing to proceed under either CAP_NET_ADMIN or
CAP_SYS_ADMIN.

This also makes it cleaner in subsequent BPF token patches to switch
wholesomely to a generic bpf_token_capable(int cap) check, that always
falls back to CAP_SYS_ADMIN if requested capability is missing.

Cc: Jakub Kicinski <kuba@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/syscall.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0ed286b8a0f0..ad4d8e433ccc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1096,6 +1096,11 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	return ret;
 }
 
+static bool bpf_net_capable(void)
+{
+	return capable(CAP_NET_ADMIN) || capable(CAP_SYS_ADMIN);
+}
+
 #define BPF_MAP_CREATE_LAST_FIELD map_extra
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
@@ -1199,7 +1204,7 @@ static int map_create(union bpf_attr *attr)
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 	case BPF_MAP_TYPE_XSKMAP:
-		if (!capable(CAP_NET_ADMIN))
+		if (!bpf_net_capable())
 			return -EPERM;
 		break;
 	default:
@@ -2599,7 +2604,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	    !bpf_capable())
 		return -EPERM;
 
-	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
+	if (is_net_admin_prog_type(type) && !bpf_net_capable())
 		return -EPERM;
 	if (is_perfmon_prog_type(type) && !perfmon_capable())
 		return -EPERM;
@@ -3751,7 +3756,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
-		if (!capable(CAP_NET_ADMIN))
+		if (!bpf_net_capable())
 			/* cg-skb progs can be loaded by unpriv user.
 			 * check permissions at attach time.
 			 */
@@ -3954,7 +3959,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 static int bpf_prog_query(const union bpf_attr *attr,
 			  union bpf_attr __user *uattr)
 {
-	if (!capable(CAP_NET_ADMIN))
+	if (!bpf_net_capable())
 		return -EPERM;
 	if (CHECK_ATTR(BPF_PROG_QUERY))
 		return -EINVAL;
-- 
2.34.1


^ permalink raw reply related

* [PATCH v10 bpf-next 00/17] BPF token and BPF FS-based delegation
From: Andrii Nakryiko @ 2023-11-10  3:48 UTC (permalink / raw)
  To: bpf, netdev, paul, brauner
  Cc: linux-fsdevel, linux-security-module, keescook, kernel-team,
	sargun

This patch set introduces an ability to delegate a subset of BPF subsystem
functionality from privileged system-wide daemon (e.g., systemd or any other
container manager) through special mount options for userns-bound BPF FS to
a *trusted* unprivileged application. Trust is the key here. This
functionality is not about allowing unconditional unprivileged BPF usage.
Establishing trust, though, is completely up to the discretion of respective
privileged application that would create and mount a BPF FS instance with
delegation enabled, as different production setups can and do achieve it
through a combination of different means (signing, LSM, code reviews, etc),
and it's undesirable and infeasible for kernel to enforce any particular way
of validating trustworthiness of particular process.

The main motivation for this work is a desire to enable containerized BPF
applications to be used together with user namespaces. This is currently
impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
arbitrary memory, and it's impossible to ensure that they only read memory of
processes belonging to any given namespace. This means that it's impossible to
have a mechanically verifiable namespace-aware CAP_BPF capability, and as such
another mechanism to allow safe usage of BPF functionality is necessary.BPF FS
delegation mount options and BPF token derived from such BPF FS instance is
such a mechanism. Kernel makes no assumption about what "trusted" constitutes
in any particular case, and it's up to specific privileged applications and
their surrounding infrastructure to decide that. What kernel provides is a set
of APIs to setup and mount special BPF FS instanecs and derive BPF tokens from
it. BPF FS and BPF token are both bound to its owning userns and in such a way
are constrained inside intended container. Users can then pass BPF token FD to
privileged bpf() syscall commands, like BPF map creation and BPF program
loading, to perform such operations without having init userns privileged.

This version incorporates feedback and suggestions ([3]) received on v3 of
this patch set, and instead of allowing to create BPF tokens directly assuming
capable(CAP_SYS_ADMIN), we instead enhance BPF FS to accept a few new
delegation mount options. If these options are used and BPF FS itself is
properly created, set up, and mounted inside the user namespaced container,
user application is able to derive a BPF token object from BPF FS instance,
and pass that token to bpf() syscall. As explained in patch #3, BPF token
itself doesn't grant access to BPF functionality, but instead allows kernel to
do namespaced capabilities checks (ns_capable() vs capable()) for CAP_BPF,
CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, as applicable. So it forms one
half of a puzzle and allows container managers and sys admins to have safe and
flexible configuration options: determining which containers get delegation of
BPF functionality through BPF FS, and then which applications within such
containers are allowed to perform bpf() commands, based on namespaces
capabilities.

Previous attempt at addressing this very same problem ([0]) attempted to
utilize authoritative LSM approach, but was conclusively rejected by upstream
LSM maintainers. BPF token concept is not changing anything about LSM
approach, but can be combined with LSM hooks for very fine-grained security
policy. Some ideas about making BPF token more convenient to use with LSM (in
particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
2023 presentation ([1]). E.g., an ability to specify user-provided data
(context), which in combination with BPF LSM would allow implementing a very
dynamic and fine-granular custom security policies on top of BPF token. In the
interest of minimizing API surface area and discussions this was relegated to
follow up patches, as it's not essential to the fundamental concept of
delegatable BPF token.

It should be noted that BPF token is conceptually quite similar to the idea of
/dev/bpf device file, proposed by Song a while ago ([2]). The biggest
difference is the idea of using virtual anon_inode file to hold BPF token and
allowing multiple independent instances of them, each (potentially) with its
own set of restrictions. And also, crucially, BPF token approach is not using
any special stateful task-scoped flags. Instead, bpf() syscall accepts
token_fd parameters explicitly for each relevant BPF command. This addresses
main concerns brought up during the /dev/bpf discussion, and fits better with
overall BPF subsystem design.

This patch set adds a basic minimum of functionality to make BPF token idea
useful and to discuss API and functionality. Currently only low-level libbpf
APIs support creating and passing BPF token around, allowing to test kernel
functionality, but for the most part is not sufficient for real-world
applications, which typically use high-level libbpf APIs based on `struct
bpf_object` type. This was done with the intent to limit the size of patch set
and concentrate on mostly kernel-side changes. All the necessary plumbing for
libbpf will be sent as a separate follow up patch set kernel support makes it
upstream.

Another part that should happen once kernel-side BPF token is established, is
a set of conventions between applications (e.g., systemd), tools (e.g.,
bpftool), and libraries (e.g., libbpf) on exposing delegatable BPF FS
instance(s) at well-defined locations to allow applications take advantage of
this in automatic fashion without explicit code changes on BPF application's
side. But I'd like to postpone this discussion to after BPF token concept
lands.

  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
  [3] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

v9-v10:
  - slight adjustments in LSM parts (Paul);
  - setting delegate_xxx  options require capable(CAP_SYS_ADMIN) (Christian);
  - simplify BPF_TOKEN_CREATE UAPI by accepting BPF FS FD directly (Christian);
v8->v9:
  - fix issue in selftests due to sys/mount.h header (Jiri);
  - fix warning in doc comments in LSM hooks (kernel test robot);
v7->v8:
  - add bpf_token_allow_cmd and bpf_token_capable hooks (Paul);
  - inline bpf_token_alloc() into bpf_token_create() to prevent accidental
    divergence with security_bpf_token_create() hook (Paul);
v6->v7:
  - separate patches to refactor bpf_prog_alloc/bpf_map_alloc LSM hooks, as
    discussed with Paul, and now they also accept struct bpf_token;
  - added bpf_token_create/bpf_token_free to allow LSMs (SELinux,
    specifically) to set up security LSM blob (Paul);
  - last patch also wires bpf_security_struct setup by SELinux, similar to how
    it's done for BPF map/prog, though I'm not sure if that's enough, so worst
    case it's easy to drop this patch if more full fledged SELinux
    implementation will be done separately;
  - small fixes for issues caught by code reviews (Jiri, Hou);
  - fix for test_maps test that doesn't use LIBBPF_OPTS() macro (CI);
v5->v6:
  - fix possible use of uninitialized variable in selftests (CI);
  - don't use anon_inode, instead create one from BPF FS instance (Christian);
  - don't store bpf_token inside struct bpf_map, instead pass it explicitly to
    map_check_btf(). We do store bpf_token inside prog->aux, because it's used
    during verification and even can be checked during attach time for some
    program types;
  - LSM hooks are left intact pending the conclusion of discussion with Paul
    Moore; I'd prefer to do LSM-related changes as a follow up patch set
    anyways;
v4->v5:
  - add pre-patch unifying CAP_NET_ADMIN handling inside kernel/bpf/syscall.c
    (Paul Moore);
  - fix build warnings and errors in selftests and kernel, detected by CI and
    kernel test robot;
v3->v4:
  - add delegation mount options to BPF FS;
  - BPF token is derived from the instance of BPF FS and associates itself
    with BPF FS' owning userns;
  - BPF token doesn't grant BPF functionality directly, it just turns
    capable() checks into ns_capable() checks within BPF FS' owning user;
  - BPF token cannot be pinned;
v2->v3:
  - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
    BPF_OBJ_PIN for BPF token;
v1->v2:
  - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
  - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).

Andrii Nakryiko (17):
  bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
  bpf: add BPF token delegation mount options to BPF FS
  bpf: introduce BPF token object
  bpf: add BPF token support to BPF_MAP_CREATE command
  bpf: add BPF token support to BPF_BTF_LOAD command
  bpf: add BPF token support to BPF_PROG_LOAD command
  bpf: take into account BPF token when fetching helper protos
  bpf: consistently use BPF token throughout BPF verifier logic
  bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
  bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks
  bpf,lsm: add BPF token LSM hooks
  libbpf: add bpf_token_create() API
  libbpf: add BPF token support to bpf_map_create() API
  libbpf: add BPF token support to bpf_btf_load() API
  libbpf: add BPF token support to bpf_prog_load() API
  selftests/bpf: add BPF token-enabled tests
  bpf,selinux: allocate bpf_security_struct per BPF token

 drivers/media/rc/bpf-lirc.c                   |   2 +-
 include/linux/bpf.h                           |  83 ++-
 include/linux/filter.h                        |   2 +-
 include/linux/lsm_hook_defs.h                 |  15 +-
 include/linux/security.h                      |  43 +-
 include/uapi/linux/bpf.h                      |  42 ++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/bpf_lsm.c                          |  15 +-
 kernel/bpf/cgroup.c                           |   6 +-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/helpers.c                          |   6 +-
 kernel/bpf/inode.c                            | 101 ++-
 kernel/bpf/syscall.c                          | 215 ++++--
 kernel/bpf/token.c                            | 249 +++++++
 kernel/bpf/verifier.c                         |  13 +-
 kernel/trace/bpf_trace.c                      |   2 +-
 net/core/filter.c                             |  36 +-
 net/ipv4/bpf_tcp_ca.c                         |   2 +-
 net/netfilter/nf_bpf_link.c                   |   2 +-
 security/security.c                           | 101 ++-
 security/selinux/hooks.c                      |  47 +-
 tools/include/uapi/linux/bpf.h                |  42 ++
 tools/lib/bpf/bpf.c                           |  28 +-
 tools/lib/bpf/bpf.h                           |  35 +-
 tools/lib/bpf/libbpf.map                      |   1 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
 .../testing/selftests/bpf/prog_tests/token.c  | 672 ++++++++++++++++++
 29 files changed, 1610 insertions(+), 167 deletions(-)
 create mode 100644 kernel/bpf/token.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

-- 
2.34.1


^ permalink raw reply

* [PATCH v10 bpf-next 02/17] bpf: add BPF token delegation mount options to BPF FS
From: Andrii Nakryiko @ 2023-11-10  3:48 UTC (permalink / raw)
  To: bpf, netdev, paul, brauner
  Cc: linux-fsdevel, linux-security-module, keescook, kernel-team,
	sargun
In-Reply-To: <20231110034838.1295764-1-andrii@kernel.org>

Add few new mount options to BPF FS that allow to specify that a given
BPF FS instance allows creation of BPF token (added in the next patch),
and what sort of operations are allowed under BPF token. As such, we get
4 new mount options, each is a bit mask
  - `delegate_cmds` allow to specify which bpf() syscall commands are
    allowed with BPF token derived from this BPF FS instance;
  - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
    a set of allowable BPF map types that could be created with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
    a set of allowable BPF program types that could be loaded with BPF token;
  - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
    a set of allowable BPF program attach types that could be loaded with
    BPF token; delegate_progs and delegate_attachs are meant to be used
    together, as full BPF program type is, in general, determined
    through both program type and program attach type.

Currently, these mount options accept the following forms of values:
  - a special value "any", that enables all possible values of a given
  bit set;
  - numeric value (decimal or hexadecimal, determined by kernel
  automatically) that specifies a bit mask value directly;
  - all the values for a given mount option are combined, if specified
  multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
  delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
  mask.

Ideally, more convenient (for humans) symbolic form derived from
corresponding UAPI enums would be accepted (e.g., `-o
delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
it requires a bunch of UAPI header churn, so I postponed it until this
feature lands upstream or at least there is a definite consensus that
this feature is acceptable and is going to make it, just to minimize
amount of wasted effort and not increase amount of non-essential code to
be reviewed.

Attentive reader will notice that BPF FS is now marked as
FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
user namespace as long as the process has sufficient *namespaced*
capabilities within that user namespace. But in reality we still
restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
to allow creating BPF FS context object (i.e., fsopen("bpf")) from
inside unprivileged process inside non-init userns, to capture that
userns as the owning userns. It will still be required to pass this
context object back to privileged process to instantiate and mount it.

This manipulation is important, because capturing non-init userns as the
owning userns of BPF FS instance (super block) allows to use that userns
to constraint BPF token to that userns later on (see next patch). So
creating BPF FS with delegation inside unprivileged userns will restrict
derived BPF token objects to only "work" inside that intended userns,
making it scoped to a intended "container". Also, setting these
delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
process cannot set this up without involvement of a privileged process.

There is a set of selftests at the end of the patch set that simulates
this sequence of steps and validates that everything works as intended.
But careful review is requested to make sure there are no missed gaps in
the implementation and testing.

This somewhat subtle set of aspects is the result of previous
discussions ([0]) about various user namespace implications and
interactions with BPF token functionality and is necessary to contain
BPF token inside intended user namespace.

  [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h | 10 +++++
 kernel/bpf/inode.c  | 91 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 91 insertions(+), 10 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4001d11be151..aeffd71cda3c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1564,6 +1564,16 @@ struct bpf_link_primer {
 	u32 id;
 };
 
+struct bpf_mount_opts {
+	umode_t mode;
+
+	/* BPF token-related delegation options */
+	u64 delegate_cmds;
+	u64 delegate_maps;
+	u64 delegate_progs;
+	u64 delegate_attachs;
+};
+
 struct bpf_struct_ops_value;
 struct btf_member;
 
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 1aafb2ff2e95..53313a95fdc6 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -20,6 +20,7 @@
 #include <linux/filter.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
+#include <linux/kstrtox.h>
 #include "preload/bpf_preload.h"
 
 enum bpf_type {
@@ -599,10 +600,31 @@ EXPORT_SYMBOL(bpf_prog_get_type_path);
  */
 static int bpf_show_options(struct seq_file *m, struct dentry *root)
 {
+	struct bpf_mount_opts *opts = root->d_sb->s_fs_info;
 	umode_t mode = d_inode(root)->i_mode & S_IALLUGO & ~S_ISVTX;
 
 	if (mode != S_IRWXUGO)
 		seq_printf(m, ",mode=%o", mode);
+
+	if (opts->delegate_cmds == ~0ULL)
+		seq_printf(m, ",delegate_cmds=any");
+	else if (opts->delegate_cmds)
+		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
+
+	if (opts->delegate_maps == ~0ULL)
+		seq_printf(m, ",delegate_maps=any");
+	else if (opts->delegate_maps)
+		seq_printf(m, ",delegate_maps=0x%llx", opts->delegate_maps);
+
+	if (opts->delegate_progs == ~0ULL)
+		seq_printf(m, ",delegate_progs=any");
+	else if (opts->delegate_progs)
+		seq_printf(m, ",delegate_progs=0x%llx", opts->delegate_progs);
+
+	if (opts->delegate_attachs == ~0ULL)
+		seq_printf(m, ",delegate_attachs=any");
+	else if (opts->delegate_attachs)
+		seq_printf(m, ",delegate_attachs=0x%llx", opts->delegate_attachs);
 	return 0;
 }
 
@@ -626,22 +648,27 @@ static const struct super_operations bpf_super_ops = {
 
 enum {
 	OPT_MODE,
+	OPT_DELEGATE_CMDS,
+	OPT_DELEGATE_MAPS,
+	OPT_DELEGATE_PROGS,
+	OPT_DELEGATE_ATTACHS,
 };
 
 static const struct fs_parameter_spec bpf_fs_parameters[] = {
 	fsparam_u32oct	("mode",			OPT_MODE),
+	fsparam_string	("delegate_cmds",		OPT_DELEGATE_CMDS),
+	fsparam_string	("delegate_maps",		OPT_DELEGATE_MAPS),
+	fsparam_string	("delegate_progs",		OPT_DELEGATE_PROGS),
+	fsparam_string	("delegate_attachs",		OPT_DELEGATE_ATTACHS),
 	{}
 };
 
-struct bpf_mount_opts {
-	umode_t mode;
-};
-
 static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
-	struct bpf_mount_opts *opts = fc->fs_private;
+	struct bpf_mount_opts *opts = fc->s_fs_info;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
+	u64 msk;
 
 	opt = fs_parse(fc, bpf_fs_parameters, param, &result);
 	if (opt < 0) {
@@ -665,6 +692,28 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case OPT_MODE:
 		opts->mode = result.uint_32 & S_IALLUGO;
 		break;
+	case OPT_DELEGATE_CMDS:
+	case OPT_DELEGATE_MAPS:
+	case OPT_DELEGATE_PROGS:
+	case OPT_DELEGATE_ATTACHS:
+		if (strcmp(param->string, "any") == 0) {
+			msk = ~0ULL;
+		} else {
+			err = kstrtou64(param->string, 0, &msk);
+			if (err)
+				return err;
+		}
+		/* Setting delegation mount options requires privileges */
+		if (msk && !capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		switch (opt) {
+		case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break;
+		case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break;
+		case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break;
+		case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break;
+		default: return -EINVAL;
+		}
+		break;
 	}
 
 	return 0;
@@ -739,10 +788,14 @@ static int populate_bpffs(struct dentry *parent)
 static int bpf_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	static const struct tree_descr bpf_rfiles[] = { { "" } };
-	struct bpf_mount_opts *opts = fc->fs_private;
+	struct bpf_mount_opts *opts = sb->s_fs_info;
 	struct inode *inode;
 	int ret;
 
+	/* Mounting an instance of BPF FS requires privileges */
+	if (fc->user_ns != &init_user_ns && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	ret = simple_fill_super(sb, BPF_FS_MAGIC, bpf_rfiles);
 	if (ret)
 		return ret;
@@ -764,7 +817,10 @@ static int bpf_get_tree(struct fs_context *fc)
 
 static void bpf_free_fc(struct fs_context *fc)
 {
-	kfree(fc->fs_private);
+	struct bpf_mount_opts *opts = fc->s_fs_info;
+
+	if (opts)
+		kfree(opts);
 }
 
 static const struct fs_context_operations bpf_context_ops = {
@@ -786,17 +842,32 @@ static int bpf_init_fs_context(struct fs_context *fc)
 
 	opts->mode = S_IRWXUGO;
 
-	fc->fs_private = opts;
+	/* start out with no BPF token delegation enabled */
+	opts->delegate_cmds = 0;
+	opts->delegate_maps = 0;
+	opts->delegate_progs = 0;
+	opts->delegate_attachs = 0;
+
+	fc->s_fs_info = opts;
 	fc->ops = &bpf_context_ops;
 	return 0;
 }
 
+static void bpf_kill_super(struct super_block *sb)
+{
+	struct bpf_mount_opts *opts = sb->s_fs_info;
+
+	kill_litter_super(sb);
+	kfree(opts);
+}
+
 static struct file_system_type bpf_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "bpf",
 	.init_fs_context = bpf_init_fs_context,
 	.parameters	= bpf_fs_parameters,
-	.kill_sb	= kill_litter_super,
+	.kill_sb	= bpf_kill_super,
+	.fs_flags	= FS_USERNS_MOUNT,
 };
 
 static int __init bpf_init(void)
-- 
2.34.1


^ permalink raw reply related

* [PATCH v10 bpf-next 04/17] bpf: add BPF token support to BPF_MAP_CREATE command
From: Andrii Nakryiko @ 2023-11-10  3:48 UTC (permalink / raw)
  To: bpf, netdev, paul, brauner
  Cc: linux-fsdevel, linux-security-module, keescook, kernel-team,
	sargun
In-Reply-To: <20231110034838.1295764-1-andrii@kernel.org>

Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
BPF map creation from unprivileged process through delegated BPF token.

Wire through a set of allowed BPF map types to BPF token, derived from
BPF FS at BPF token creation time. This, in combination with allowed_cmds
allows to create a narrowly-focused BPF token (controlled by privileged
agent) with a restrictive set of BPF maps that application can attempt
to create.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h                           |  2 +
 include/uapi/linux/bpf.h                      |  2 +
 kernel/bpf/inode.c                            |  3 +-
 kernel/bpf/syscall.c                          | 52 ++++++++++++++-----
 kernel/bpf/token.c                            | 16 ++++++
 tools/include/uapi/linux/bpf.h                |  2 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |  2 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |  3 ++
 8 files changed, 67 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fc4b5856bbde..61e3fd819e50 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1583,6 +1583,7 @@ struct bpf_token {
 	atomic64_t refcnt;
 	struct user_namespace *userns;
 	u64 allowed_cmds;
+	u64 allowed_maps;
 };
 
 struct bpf_struct_ops_value;
@@ -2219,6 +2220,7 @@ int bpf_token_create(union bpf_attr *attr);
 struct bpf_token *bpf_token_get_from_fd(u32 ufd);
 
 bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
+bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type);
 
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9e62ef957c4f..1ed71ea78030 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -983,6 +983,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	__MAX_BPF_MAP_TYPE
 };
 
 /* Note that tracing related programs such as
@@ -1430,6 +1431,7 @@ union bpf_attr {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	map_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 6ce3f9696e72..9c7865d1c53d 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -613,7 +613,8 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
 	else if (opts->delegate_cmds)
 		seq_printf(m, ",delegate_cmds=0x%llx", opts->delegate_cmds);
 
-	if (opts->delegate_maps == ~0ULL)
+	mask = (1ULL << __MAX_BPF_MAP_TYPE) - 1;
+	if ((opts->delegate_maps & mask) == mask)
 		seq_printf(m, ",delegate_maps=any");
 	else if (opts->delegate_maps)
 		seq_printf(m, ",delegate_maps=0x%llx", opts->delegate_maps);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a7bf4322f51c..6e6c27b89b71 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -984,8 +984,8 @@ int map_check_no_btf(const struct bpf_map *map,
 	return -ENOTSUPP;
 }
 
-static int map_check_btf(struct bpf_map *map, const struct btf *btf,
-			 u32 btf_key_id, u32 btf_value_id)
+static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
+			 const struct btf *btf, u32 btf_key_id, u32 btf_value_id)
 {
 	const struct btf_type *key_type, *value_type;
 	u32 key_size, value_size;
@@ -1013,7 +1013,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 	if (!IS_ERR_OR_NULL(map->record)) {
 		int i;
 
-		if (!bpf_capable()) {
+		if (!bpf_token_capable(token, CAP_BPF)) {
 			ret = -EPERM;
 			goto free_map_tab;
 		}
@@ -1101,11 +1101,12 @@ static bool bpf_net_capable(void)
 	return capable(CAP_NET_ADMIN) || capable(CAP_SYS_ADMIN);
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD map_extra
+#define BPF_MAP_CREATE_LAST_FIELD map_token_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
 	const struct bpf_map_ops *ops;
+	struct bpf_token *token = NULL;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 map_type = attr->map_type;
 	struct bpf_map *map;
@@ -1156,14 +1157,32 @@ static int map_create(union bpf_attr *attr)
 	if (!ops->map_mem_usage)
 		return -EINVAL;
 
+	if (attr->map_token_fd) {
+		token = bpf_token_get_from_fd(attr->map_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+
+		/* if current token doesn't grant map creation permissions,
+		 * then we can't use this token, so ignore it and rely on
+		 * system-wide capabilities checks
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_MAP_CREATE) ||
+		    !bpf_token_allow_map_type(token, attr->map_type)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	err = -EPERM;
+
 	/* Intent here is for unprivileged_bpf_disabled to block BPF map
 	 * creation for unprivileged users; other actions depend
 	 * on fd availability and access to bpffs, so are dependent on
 	 * object creation success. Even with unprivileged BPF disabled,
 	 * capability checks are still carried out.
 	 */
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
-		return -EPERM;
+	if (sysctl_unprivileged_bpf_disabled && !bpf_token_capable(token, CAP_BPF))
+		goto put_token;
 
 	/* check privileged map type permissions */
 	switch (map_type) {
@@ -1196,25 +1215,27 @@ static int map_create(union bpf_attr *attr)
 	case BPF_MAP_TYPE_LRU_PERCPU_HASH:
 	case BPF_MAP_TYPE_STRUCT_OPS:
 	case BPF_MAP_TYPE_CPUMAP:
-		if (!bpf_capable())
-			return -EPERM;
+		if (!bpf_token_capable(token, CAP_BPF))
+			goto put_token;
 		break;
 	case BPF_MAP_TYPE_SOCKMAP:
 	case BPF_MAP_TYPE_SOCKHASH:
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 	case BPF_MAP_TYPE_XSKMAP:
-		if (!bpf_net_capable())
-			return -EPERM;
+		if (!bpf_token_capable(token, CAP_NET_ADMIN))
+			goto put_token;
 		break;
 	default:
 		WARN(1, "unsupported map type %d", map_type);
-		return -EPERM;
+		goto put_token;
 	}
 
 	map = ops->map_alloc(attr);
-	if (IS_ERR(map))
-		return PTR_ERR(map);
+	if (IS_ERR(map)) {
+		err = PTR_ERR(map);
+		goto put_token;
+	}
 	map->ops = ops;
 	map->map_type = map_type;
 
@@ -1251,7 +1272,7 @@ static int map_create(union bpf_attr *attr)
 		map->btf = btf;
 
 		if (attr->btf_value_type_id) {
-			err = map_check_btf(map, btf, attr->btf_key_type_id,
+			err = map_check_btf(map, token, btf, attr->btf_key_type_id,
 					    attr->btf_value_type_id);
 			if (err)
 				goto free_map;
@@ -1272,6 +1293,7 @@ static int map_create(union bpf_attr *attr)
 		goto free_map_sec;
 
 	bpf_map_save_memcg(map);
+	bpf_token_put(token);
 
 	err = bpf_map_new_fd(map, f_flags);
 	if (err < 0) {
@@ -1292,6 +1314,8 @@ static int map_create(union bpf_attr *attr)
 free_map:
 	btf_put(map->btf);
 	map->ops->map_free(map);
+put_token:
+	bpf_token_put(token);
 	return err;
 }
 
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index 0d5cb87fecf6..6495b269418b 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -70,6 +70,13 @@ static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
 		seq_printf(m, "allowed_cmds:\tany\n");
 	else
 		seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds);
+
+	BUILD_BUG_ON(__MAX_BPF_MAP_TYPE >= 64);
+	mask = (1ULL << __MAX_BPF_MAP_TYPE) - 1;
+	if ((token->allowed_maps & mask) == mask)
+		seq_printf(m, "allowed_maps:\tany\n");
+	else
+		seq_printf(m, "allowed_maps:\t0x%llx\n", token->allowed_maps);
 }
 
 #define BPF_TOKEN_INODE_NAME "bpf-token"
@@ -150,6 +157,7 @@ int bpf_token_create(union bpf_attr *attr)
 
 	mnt_opts = path.dentry->d_sb->s_fs_info;
 	token->allowed_cmds = mnt_opts->delegate_cmds;
+	token->allowed_maps = mnt_opts->delegate_maps;
 
 	fd = get_unused_fd_flags(O_CLOEXEC);
 	if (fd < 0) {
@@ -198,3 +206,11 @@ bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
 
 	return token->allowed_cmds & (1ULL << cmd);
 }
+
+bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type)
+{
+	if (!token || type >= __MAX_BPF_MAP_TYPE)
+		return false;
+
+	return token->allowed_maps & (1ULL << type);
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9e62ef957c4f..1ed71ea78030 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -983,6 +983,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_BLOOM_FILTER,
 	BPF_MAP_TYPE_USER_RINGBUF,
 	BPF_MAP_TYPE_CGRP_STORAGE,
+	__MAX_BPF_MAP_TYPE
 };
 
 /* Note that tracing related programs such as
@@ -1430,6 +1431,7 @@ union bpf_attr {
 		 * to using 5 hash functions).
 		 */
 		__u64	map_extra;
+		__u32	map_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
index 9f766ddd946a..573249a2814d 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
@@ -68,6 +68,8 @@ void test_libbpf_probe_map_types(void)
 
 		if (map_type == BPF_MAP_TYPE_UNSPEC)
 			continue;
+		if (strcmp(map_type_name, "__MAX_BPF_MAP_TYPE") == 0)
+			continue;
 
 		if (!test__start_subtest(map_type_name))
 			continue;
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
index c440ea3311ed..2a0633f43c73 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
@@ -132,6 +132,9 @@ static void test_libbpf_bpf_map_type_str(void)
 		const char *map_type_str;
 		char buf[256];
 
+		if (map_type == __MAX_BPF_MAP_TYPE)
+			continue;
+
 		map_type_name = btf__str_by_offset(btf, e->name_off);
 		map_type_str = libbpf_bpf_map_type_str(map_type);
 		ASSERT_OK_PTR(map_type_str, map_type_name);
-- 
2.34.1


^ permalink raw reply related

* [PATCH v10 bpf-next 06/17] bpf: add BPF token support to BPF_PROG_LOAD command
From: Andrii Nakryiko @ 2023-11-10  3:48 UTC (permalink / raw)
  To: bpf, netdev, paul, brauner
  Cc: linux-fsdevel, linux-security-module, keescook, kernel-team,
	sargun
In-Reply-To: <20231110034838.1295764-1-andrii@kernel.org>

Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
allowed BPF program types and attach types, derived from BPF FS at BPF
token creation time. Then make sure we perform bpf_token_capable()
checks everywhere where it's relevant.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/bpf.h                           |  6 ++
 include/uapi/linux/bpf.h                      |  2 +
 kernel/bpf/core.c                             |  1 +
 kernel/bpf/inode.c                            |  6 +-
 kernel/bpf/syscall.c                          | 87 ++++++++++++++-----
 kernel/bpf/token.c                            | 27 ++++++
 tools/include/uapi/linux/bpf.h                |  2 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |  2 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |  3 +
 9 files changed, 110 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 61e3fd819e50..4c9a93a6df2a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1444,6 +1444,7 @@ struct bpf_prog_aux {
 #ifdef CONFIG_SECURITY
 	void *security;
 #endif
+	struct bpf_token *token;
 	struct bpf_prog_offload *offload;
 	struct btf *btf;
 	struct bpf_func_info *func_info;
@@ -1584,6 +1585,8 @@ struct bpf_token {
 	struct user_namespace *userns;
 	u64 allowed_cmds;
 	u64 allowed_maps;
+	u64 allowed_progs;
+	u64 allowed_attachs;
 };
 
 struct bpf_struct_ops_value;
@@ -2221,6 +2224,9 @@ struct bpf_token *bpf_token_get_from_fd(u32 ufd);
 
 bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd);
 bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type);
+bool bpf_token_allow_prog_type(const struct bpf_token *token,
+			       enum bpf_prog_type prog_type,
+			       enum bpf_attach_type attach_type);
 
 int bpf_obj_pin_user(u32 ufd, int path_fd, const char __user *pathname);
 int bpf_obj_get_user(int path_fd, const char __user *pathname, int flags);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 61ad6828d8d4..1894e1709070 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1028,6 +1028,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	__MAX_BPF_PROG_TYPE
 };
 
 enum bpf_attach_type {
@@ -1501,6 +1502,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		log_true_size;
+		__u32		prog_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 08626b519ce2..fc8de25b7948 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2747,6 +2747,7 @@ void bpf_prog_free(struct bpf_prog *fp)
 
 	if (aux->dst_prog)
 		bpf_prog_put(aux->dst_prog);
+	bpf_token_put(aux->token);
 	INIT_WORK(&aux->work, bpf_prog_free_deferred);
 	schedule_work(&aux->work);
 }
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 9c7865d1c53d..5359a0929c35 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -619,12 +619,14 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root)
 	else if (opts->delegate_maps)
 		seq_printf(m, ",delegate_maps=0x%llx", opts->delegate_maps);
 
-	if (opts->delegate_progs == ~0ULL)
+	mask = (1ULL << __MAX_BPF_PROG_TYPE) - 1;
+	if ((opts->delegate_progs & mask) == mask)
 		seq_printf(m, ",delegate_progs=any");
 	else if (opts->delegate_progs)
 		seq_printf(m, ",delegate_progs=0x%llx", opts->delegate_progs);
 
-	if (opts->delegate_attachs == ~0ULL)
+	mask = (1ULL << __MAX_BPF_ATTACH_TYPE) - 1;
+	if ((opts->delegate_attachs & mask) == mask)
 		seq_printf(m, ",delegate_attachs=any");
 	else if (opts->delegate_attachs)
 		seq_printf(m, ",delegate_attachs=0x%llx", opts->delegate_attachs);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 28ac4bdc3f32..615b5ad18c61 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2583,13 +2583,15 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#define	BPF_PROG_LOAD_LAST_FIELD log_true_size
+#define BPF_PROG_LOAD_LAST_FIELD prog_token_fd
 
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 {
 	enum bpf_prog_type type = attr->prog_type;
 	struct bpf_prog *prog, *dst_prog = NULL;
 	struct btf *attach_btf = NULL;
+	struct bpf_token *token = NULL;
+	bool bpf_cap;
 	int err;
 	char license[128];
 
@@ -2605,10 +2607,31 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 				 BPF_F_XDP_DEV_BOUND_ONLY))
 		return -EINVAL;
 
+	bpf_prog_load_fixup_attach_type(attr);
+
+	if (attr->prog_token_fd) {
+		token = bpf_token_get_from_fd(attr->prog_token_fd);
+		if (IS_ERR(token))
+			return PTR_ERR(token);
+		/* if current token doesn't grant prog loading permissions,
+		 * then we can't use this token, so ignore it and rely on
+		 * system-wide capabilities checks
+		 */
+		if (!bpf_token_allow_cmd(token, BPF_PROG_LOAD) ||
+		    !bpf_token_allow_prog_type(token, attr->prog_type,
+					       attr->expected_attach_type)) {
+			bpf_token_put(token);
+			token = NULL;
+		}
+	}
+
+	bpf_cap = bpf_token_capable(token, CAP_BPF);
+	err = -EPERM;
+
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
 	    (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
-	    !bpf_capable())
-		return -EPERM;
+	    !bpf_cap)
+		goto put_token;
 
 	/* Intent here is for unprivileged_bpf_disabled to block BPF program
 	 * creation for unprivileged users; other actions depend
@@ -2617,21 +2640,23 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	 * capability checks are still carried out for these
 	 * and other operations.
 	 */
-	if (sysctl_unprivileged_bpf_disabled && !bpf_capable())
-		return -EPERM;
+	if (sysctl_unprivileged_bpf_disabled && !bpf_cap)
+		goto put_token;
 
 	if (attr->insn_cnt == 0 ||
-	    attr->insn_cnt > (bpf_capable() ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS))
-		return -E2BIG;
+	    attr->insn_cnt > (bpf_cap ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS)) {
+		err = -E2BIG;
+		goto put_token;
+	}
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
-	    !bpf_capable())
-		return -EPERM;
+	    !bpf_cap)
+		goto put_token;
 
-	if (is_net_admin_prog_type(type) && !bpf_net_capable())
-		return -EPERM;
-	if (is_perfmon_prog_type(type) && !perfmon_capable())
-		return -EPERM;
+	if (is_net_admin_prog_type(type) && !bpf_token_capable(token, CAP_NET_ADMIN))
+		goto put_token;
+	if (is_perfmon_prog_type(type) && !bpf_token_capable(token, CAP_PERFMON))
+		goto put_token;
 
 	/* attach_prog_fd/attach_btf_obj_fd can specify fd of either bpf_prog
 	 * or btf, we need to check which one it is
@@ -2641,27 +2666,33 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 		if (IS_ERR(dst_prog)) {
 			dst_prog = NULL;
 			attach_btf = btf_get_by_fd(attr->attach_btf_obj_fd);
-			if (IS_ERR(attach_btf))
-				return -EINVAL;
+			if (IS_ERR(attach_btf)) {
+				err = -EINVAL;
+				goto put_token;
+			}
 			if (!btf_is_kernel(attach_btf)) {
 				/* attaching through specifying bpf_prog's BTF
 				 * objects directly might be supported eventually
 				 */
 				btf_put(attach_btf);
-				return -ENOTSUPP;
+				err = -ENOTSUPP;
+				goto put_token;
 			}
 		}
 	} else if (attr->attach_btf_id) {
 		/* fall back to vmlinux BTF, if BTF type ID is specified */
 		attach_btf = bpf_get_btf_vmlinux();
-		if (IS_ERR(attach_btf))
-			return PTR_ERR(attach_btf);
-		if (!attach_btf)
-			return -EINVAL;
+		if (IS_ERR(attach_btf)) {
+			err = PTR_ERR(attach_btf);
+			goto put_token;
+		}
+		if (!attach_btf) {
+			err = -EINVAL;
+			goto put_token;
+		}
 		btf_get(attach_btf);
 	}
 
-	bpf_prog_load_fixup_attach_type(attr);
 	if (bpf_prog_load_check_attach(type, attr->expected_attach_type,
 				       attach_btf, attr->attach_btf_id,
 				       dst_prog)) {
@@ -2669,7 +2700,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			bpf_prog_put(dst_prog);
 		if (attach_btf)
 			btf_put(attach_btf);
-		return -EINVAL;
+		err = -EINVAL;
+		goto put_token;
 	}
 
 	/* plain bpf_prog allocation */
@@ -2679,7 +2711,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 			bpf_prog_put(dst_prog);
 		if (attach_btf)
 			btf_put(attach_btf);
-		return -ENOMEM;
+		err = -EINVAL;
+		goto put_token;
 	}
 
 	prog->expected_attach_type = attr->expected_attach_type;
@@ -2690,6 +2723,10 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
+	/* move token into prog->aux, reuse taken refcnt */
+	prog->aux->token = token;
+	token = NULL;
+
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
 		goto free_prog;
@@ -2791,6 +2828,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr, u32 uattr_size)
 	if (prog->aux->attach_btf)
 		btf_put(prog->aux->attach_btf);
 	bpf_prog_free(prog);
+put_token:
+	bpf_token_put(token);
 	return err;
 }
 
@@ -3780,7 +3819,7 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_SK_LOOKUP:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
 	case BPF_PROG_TYPE_CGROUP_SKB:
-		if (!bpf_net_capable())
+		if (!bpf_token_capable(prog->aux->token, CAP_NET_ADMIN))
 			/* cg-skb progs can be loaded by unpriv user.
 			 * check permissions at attach time.
 			 */
diff --git a/kernel/bpf/token.c b/kernel/bpf/token.c
index 6495b269418b..7a5183578a69 100644
--- a/kernel/bpf/token.c
+++ b/kernel/bpf/token.c
@@ -77,6 +77,20 @@ static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
 		seq_printf(m, "allowed_maps:\tany\n");
 	else
 		seq_printf(m, "allowed_maps:\t0x%llx\n", token->allowed_maps);
+
+	BUILD_BUG_ON(__MAX_BPF_PROG_TYPE >= 64);
+	mask = (1ULL << __MAX_BPF_PROG_TYPE) - 1;
+	if ((token->allowed_progs & mask) == mask)
+		seq_printf(m, "allowed_progs:\tany\n");
+	else
+		seq_printf(m, "allowed_progs:\t0x%llx\n", token->allowed_progs);
+
+	BUILD_BUG_ON(__MAX_BPF_ATTACH_TYPE >= 64);
+	mask = (1ULL << __MAX_BPF_ATTACH_TYPE) - 1;
+	if ((token->allowed_attachs & mask) == mask)
+		seq_printf(m, "allowed_attachs:\tany\n");
+	else
+		seq_printf(m, "allowed_attachs:\t0x%llx\n", token->allowed_attachs);
 }
 
 #define BPF_TOKEN_INODE_NAME "bpf-token"
@@ -158,6 +172,8 @@ int bpf_token_create(union bpf_attr *attr)
 	mnt_opts = path.dentry->d_sb->s_fs_info;
 	token->allowed_cmds = mnt_opts->delegate_cmds;
 	token->allowed_maps = mnt_opts->delegate_maps;
+	token->allowed_progs = mnt_opts->delegate_progs;
+	token->allowed_attachs = mnt_opts->delegate_attachs;
 
 	fd = get_unused_fd_flags(O_CLOEXEC);
 	if (fd < 0) {
@@ -214,3 +230,14 @@ bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type t
 
 	return token->allowed_maps & (1ULL << type);
 }
+
+bool bpf_token_allow_prog_type(const struct bpf_token *token,
+			       enum bpf_prog_type prog_type,
+			       enum bpf_attach_type attach_type)
+{
+	if (!token || prog_type >= __MAX_BPF_PROG_TYPE || attach_type >= __MAX_BPF_ATTACH_TYPE)
+		return false;
+
+	return (token->allowed_progs & (1ULL << prog_type)) &&
+	       (token->allowed_attachs & (1ULL << attach_type));
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 61ad6828d8d4..1894e1709070 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1028,6 +1028,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
 	BPF_PROG_TYPE_NETFILTER,
+	__MAX_BPF_PROG_TYPE
 };
 
 enum bpf_attach_type {
@@ -1501,6 +1502,7 @@ union bpf_attr {
 		 * truncated), or smaller (if log buffer wasn't filled completely).
 		 */
 		__u32		log_true_size;
+		__u32		prog_token_fd;
 	};
 
 	struct { /* anonymous struct used by BPF_OBJ_* commands */
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
index 573249a2814d..4ed46ed58a7b 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_probes.c
@@ -30,6 +30,8 @@ void test_libbpf_probe_prog_types(void)
 
 		if (prog_type == BPF_PROG_TYPE_UNSPEC)
 			continue;
+		if (strcmp(prog_type_name, "__MAX_BPF_PROG_TYPE") == 0)
+			continue;
 
 		if (!test__start_subtest(prog_type_name))
 			continue;
diff --git a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
index 2a0633f43c73..384bc1f7a65e 100644
--- a/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
+++ b/tools/testing/selftests/bpf/prog_tests/libbpf_str.c
@@ -189,6 +189,9 @@ static void test_libbpf_bpf_prog_type_str(void)
 		const char *prog_type_str;
 		char buf[256];
 
+		if (prog_type == __MAX_BPF_PROG_TYPE)
+			continue;
+
 		prog_type_name = btf__str_by_offset(btf, e->name_off);
 		prog_type_str = libbpf_bpf_prog_type_str(prog_type);
 		ASSERT_OK_PTR(prog_type_str, prog_type_name);
-- 
2.34.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox