Netdev List
 help / color / mirror / Atom feed
* [PATCH memcg v2] net: set proper memcg for net_init hooks allocations
From: Vasily Averin @ 2022-04-23 15:38 UTC (permalink / raw)
  To: Vlastimil Babka, Shakeel Butt
  Cc: kernel, Florian Westphal, linux-kernel, Roman Gushchin,
	Michal Hocko, cgroups, netdev, David S. Miller, Jakub Kicinski,
	Paolo Abeni
In-Reply-To: <202204231806.8O86U791-lkp@intel.com>

__register_pernet_operations() executes init hook of registered
pernet_operation structure in all existing net namespaces.

Typically, these hooks are called by a process associated with
the specified net namespace, and all __GFP_ACCOUNTING marked
allocation are accounted for corresponding container/memcg.

However __register_pernet_operations() calls the hooks in the same
context, and as a result all marked allocations are accounted
to one memcg for all processed net namespaces.

This patch adjusts active memcg for each net namespace and helps
to account memory allocated inside ops_init() into the proper memcg.

Signed-off-by: Vasily Averin <vvs@openvz.org>
---
v2: introduced get/put_net_memcg(),
    new functions are moved under CONFIG_MEMCG_KMEM
    to fix compilation issues reported by Intel's kernel test robot

v1: introduced get_mem_cgroup_from_kmem(), which takes the refcount
    for the found memcg, suggested by Shakeel
---
 include/linux/memcontrol.h | 35 +++++++++++++++++++++++++++++++++++
 net/core/net_namespace.c   |  7 +++++++
 2 files changed, 42 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..5230d3c5585a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1714,6 +1714,33 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 
 struct mem_cgroup *mem_cgroup_from_obj(void *p);
 
+static inline struct mem_cgroup *get_mem_cgroup_from_kmem(void *p)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	do {
+		memcg = mem_cgroup_from_obj(p);
+	} while (memcg && !css_tryget(&memcg->css));
+	rcu_read_unlock();
+	return memcg;
+}
+
+static inline struct mem_cgroup *get_net_memcg(void *p)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = get_mem_cgroup_from_kmem(p);
+
+	if (!memcg)
+		memcg = root_mem_cgroup;
+
+	return memcg;
+}
+static inline void put_net_memcg(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
 #else
 static inline bool mem_cgroup_kmem_disabled(void)
 {
@@ -1766,6 +1793,14 @@ static inline struct mem_cgroup *mem_cgroup_from_obj(void *p)
        return NULL;
 }
 
+static inline struct mem_cgroup *get_net_memcg(void *p)
+{
+	return NULL;
+}
+
+static inline void put_net_memcg(struct mem_cgroup *memcg)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index a5b5bb99c644..bf88360b8377 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -26,6 +26,7 @@
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
+#include <linux/sched/mm.h>
 /*
  *	Our network namespace constructor/destructor lists
  */
@@ -1147,7 +1148,13 @@ static int __register_pernet_operations(struct list_head *list,
 		 * setup_net() and cleanup_net() are not possible.
 		 */
 		for_each_net(net) {
+			struct mem_cgroup *old, *memcg;
+
+			memcg = get_net_memcg(net);
+			old = set_active_memcg(memcg);
 			error = ops_init(ops, net);
+			set_active_memcg(old);
+			put_net_memcg(memcg);
 			if (error)
 				goto out_undo;
 			list_add_tail(&net->exit_list, &net_exit_list);
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 15:35 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <ff95db6e-5a0a-7e63-080f-c719ac434c34@blackwall.org>

On 23/04/2022 18:23, Nikolay Aleksandrov wrote:
> On 23/04/2022 18:01, Xuan Zhuo wrote:
>> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
[snip]
>>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>>                                 return head_skb;
>>>>>>>                         }
>>>>>>>                         break;
>>>>>>
>>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>>
>>>>>
>>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>>
>>>> Yes, you are right.
>>>>
>>>>>
>>>>> The above change guarantees that buf and p will be in the same page
>>>>
>>>>
>>>> How can this be guaranteed?
>>>>
>>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>>    from the allocation.
>>>> 2. set xdp
>>>> 3. alloc for new buffer
>>>>
>>>
>>> p = page_address(page) + offset;
>>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>>
>>> => p and buf are always in the same page, no?
>>
>> I don't think it is, it's entirely possible to split on two pages.
>>
>>>
>>>> The buffer allocated in the third step must be unaligned, and it is entirely
>>>> possible that p and buf are not on the same page.
>>>>
>>>> I fixed my previous patch.
>>>>
>>>> Thanks.
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..d95e82255b94 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                          * xdp.data_meta were adjusted
>>>>                          */
>>>>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>>> +
>>>> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>>>
>>> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.
>>
>> Yes, you are right. For the case of xdp_linearize_page() , we should change it.
>>
>>    headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);
>>
>> Thanks.
>>
> 
> That is equal to offset:
>                        offset = xdp.data - page_address(xdp_page) -
>                                  vi->hdr_len - metasize;
> 
> So I do agree that it will work, it is effectively what I also suggested in the
> other email and it will be equal to just doing buf = page_address() in the xdp_linearize
> case because p = page_address + offset, and now we do buf = p - headroom where headroom also
> equals offset, so we get buf = page_address().
> 

All of the above is equivalent to:
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..12e88980e4b3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1012,8 +1012,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      offset);
                                return head_skb;
+                       } else {
+                               headroom = xdp.data - (buf - headroom) - vi->hdr_len - metasize;
                        }
                        break;
                case XDP_TX:

I agree with that, it is also equivalent to my proposal in the other email. It adjusts the new
headroom after the xdp prog which is ok. I'll wait (and test it in the meantime) for other
feedback and if all agree I'll post v2.

>>
>>>
>>>>                         /* We can only create skb based on xdp_page. */
>>>>                         if (unlikely(xdp_page != page)) {
>>>>                                 rcu_read_unlock();
>>>> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      headroom);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
> 


^ permalink raw reply related

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Handle single-chip-address OF property
From: Marek Behún @ 2022-04-23 15:16 UTC (permalink / raw)
  To: Nathan Rossi
  Cc: Andrew Lunn, netdev, linux-kernel, Vivien Didelot,
	Florian Fainelli, Vladimir Oltean, David S. Miller,
	Jakub Kicinski, Paolo Abeni
In-Reply-To: <CA+aJhH3EtAxAKy8orC-SU8UnagBCibF3dHXrp78zfjuAzj4vUg@mail.gmail.com>

>     switch0: switch0@16 {
>         compatible = "marvell,mv88e6141", "marvell,mv88e6085";

Not relevant to your problem, but the node name should be switch@16,
not switch0@16. The 0 is redundant and should not be there.

Marek

^ permalink raw reply

* [PATCH iproute2-next 2/3] libbpf: Remove use of bpf_program__set_priv and bpf_program__priv
From: David Ahern @ 2022-04-23 15:22 UTC (permalink / raw)
  To: netdev; +Cc: stephen, toke, haliu, David Ahern
In-Reply-To: <20220423152300.16201-1-dsahern@kernel.org>

bpf_program__set_priv and bpf_program__priv are deprecated as of
libbpf v0.7+. Rather than store the map as priv on the program,
change find_legacy_tail_calls to take an argument to return a reference
to the map.

find_legacy_tail_calls is invoked twice from load_bpf_object - the
first time to check for programs that should be loaded. In this case
a reference to the map is not needed, but it does validate the map
exists. The second is invoked from update_legacy_tail_call_maps where
the map pointer is needed.

Signed-off-by: David Ahern <dsahern@kernel.org>
---
 lib/bpf_libbpf.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/lib/bpf_libbpf.c b/lib/bpf_libbpf.c
index f723f6310c28..7dd1faf536f4 100644
--- a/lib/bpf_libbpf.c
+++ b/lib/bpf_libbpf.c
@@ -151,7 +151,8 @@ handle_legacy_map_in_map(struct bpf_object *obj, struct bpf_map *inner_map,
 	return ret;
 }
 
-static int find_legacy_tail_calls(struct bpf_program *prog, struct bpf_object *obj)
+static int find_legacy_tail_calls(struct bpf_program *prog, struct bpf_object *obj,
+				  struct bpf_map **pmap)
 {
 	unsigned int map_id, key_id;
 	const char *sec_name;
@@ -175,8 +176,8 @@ static int find_legacy_tail_calls(struct bpf_program *prog, struct bpf_object *o
 	if (!map)
 		return -1;
 
-	/* Save the map here for later updating */
-	bpf_program__set_priv(prog, map, NULL);
+	if (pmap)
+		*pmap = map;
 
 	return 0;
 }
@@ -190,8 +191,10 @@ static int update_legacy_tail_call_maps(struct bpf_object *obj)
 	struct bpf_map *map;
 
 	bpf_object__for_each_program(prog, obj) {
-		map = bpf_program__priv(prog);
-		if (!map)
+		/* load_bpf_object has already verified find_legacy_tail_calls
+		 * succeeds when it should
+		 */
+		if (find_legacy_tail_calls(prog, obj, &map) < 0)
 			continue;
 
 		prog_fd = bpf_program__fd(prog);
@@ -275,7 +278,8 @@ static int load_bpf_object(struct bpf_cfg_in *cfg)
 
 		/* Only load the programs that will either be subsequently
 		 * attached or inserted into a tail call map */
-		if (find_legacy_tail_calls(p, obj) < 0 && !prog_to_attach) {
+		if (find_legacy_tail_calls(p, obj, NULL) < 0 &&
+		    !prog_to_attach) {
 			ret = bpf_program__set_autoload(p, false);
 			if (ret)
 				return -EINVAL;
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 15:23 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <1650726113.2334588-1-xuanzhuo@linux.alibaba.com>

On 23/04/2022 18:01, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:36, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>>>> mode with virtio_net after updating to newer kernels. After
>>>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>>>> calculates the build_skb address wrong because the offset can become less
>>>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>>>> depending on how lower offset is):
>>>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>>>
>>>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>>>> via an xdp prog. The calculations done are:
>>>>>>>  receive_mergeable():
>>>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>>>           vi->hdr_len - metasize;
>>>>>>>
>>>>>>>  page_to_skb():
>>>>>>>  p = page_address(page) + offset;
>>>>>>>  ...
>>>>>>>  buf = p - headroom;
>>>>>>>
>>>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>>>> first version of the patch that was applied as:
>>>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>>>> was actually correct because it calculated the page starting address
>>>>>>> without relying on offset or headroom, but then the second version that
>>>>>>> was applied as:
>>>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>>>> was wrong and added the above calculation.
>>>>>>> An example xdp prog[3] is below.
>>>>>>>
>>>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>>>
>>>>>>> [2] Two of the many traces:
>>>>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>>>>>  [   41.321387] Call Trace:
>>>>>>>  [   41.321819]  <TASK>
>>>>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>>>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>>>>>  [   41.445592]  ? 0xffffffffa3000000
>>>>>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>>>>>  [   41.479314]  ? 0xffffffffa3000000
>>>>>>>  [   41.496937]  vfs_read+0x138/0x190
>>>>>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>>>  [   41.568050] RIP: 0033:0x48765b
>>>>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>>>>>  [   41.744254]  </TASK>
>>>>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>>>
>>>>>>>  and
>>>>>>>
>>>>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>>>>>  [   33.532471] page dumped because: nonzero mapcount
>>>>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>>>  [   33.532484] Call Trace:
>>>>>>>  [   33.532496]  <TASK>
>>>>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>>>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>>>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>>>>>
>>>>>>> [3] XDP program to reproduce(xdp_pass.c):
>>>>>>>  #include <linux/bpf.h>
>>>>>>>  #include <bpf/bpf_helpers.h>
>>>>>>>
>>>>>>>  SEC("xdp_pass")
>>>>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>>>>>  {
>>>>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>>>>>           return XDP_PASS;
>>>>>>>  }
>>>>>>>
>>>>>>>  char _license[] SEC("license") = "GPL";
>>>>>>>
>>>>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>>>>>
>>>>>>> CC: stable@vger.kernel.org
>>>>>>> CC: Jason Wang <jasowang@redhat.com>
>>>>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>> CC: virtualization@lists.linux-foundation.org
>>>>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>>>> ---
>>>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>>>  	 */
>>>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>>>> +	if (headroom) {
>>>>>>> +		truesize = PAGE_SIZE;
>>>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>>>
>>>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>>>> buf is probably not page-aligned.
>>>>>>
>>>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>>>> each allocation is page-aligned .
>>>>>>
>>>>>> The problem here is that the value of headroom is wrong, the package is
>>>>>> structured like this:
>>>>>>
>>>>>> from device    | headroom          | virtio-net hdr | data |
>>>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>>>
>>>>> You're free to push data back (not necessarily through meta).
>>>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>>>
>>>>>>
>>>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>>>> virtio-net hdr.
>>>>>>
>>>>>> So I think it might be better to change it this way.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>> index 87838cbe38cf..086ae835ec86 100644
>>>>>> --- a/drivers/net/virtio_net.c
>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>>>                                                        len, PAGE_SIZE, false,
>>>>>>                                                        metasize,
>>>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>>>                                 return head_skb;
>>>>>>                         }
>>>>>>                         break;
>>>>>
>>>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>>>
>>>>
>>>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>>>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>>
>>> Yes, you are right.
>>>
>>>>
>>>> The above change guarantees that buf and p will be in the same page
>>>
>>>
>>> How can this be guaranteed?
>>>
>>> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>>>    from the allocation.
>>> 2. set xdp
>>> 3. alloc for new buffer
>>>
>>
>> p = page_address(page) + offset;
>> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>>
>> => p and buf are always in the same page, no?
> 
> I don't think it is, it's entirely possible to split on two pages.
> 
>>
>>> The buffer allocated in the third step must be unaligned, and it is entirely
>>> possible that p and buf are not on the same page.
>>>
>>> I fixed my previous patch.
>>>
>>> Thanks.
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..d95e82255b94 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                          * xdp.data_meta were adjusted
>>>                          */
>>>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
>>> +
>>> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>>
>> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.
> 
> Yes, you are right. For the case of xdp_linearize_page() , we should change it.
> 
>    headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);
> 
> Thanks.
> 

That is equal to offset:
                       offset = xdp.data - page_address(xdp_page) -
                                 vi->hdr_len - metasize;

So I do agree that it will work, it is effectively what I also suggested in the
other email and it will be equal to just doing buf = page_address() in the xdp_linearize
case because p = page_address + offset, and now we do buf = p - headroom where headroom also
equals offset, so we get buf = page_address().

> 
>>
>>>                         /* We can only create skb based on xdp_page. */
>>>                         if (unlikely(xdp_page != page)) {
>>>                                 rcu_read_unlock();
>>> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>                                                        len, PAGE_SIZE, false,
>>>                                                        metasize,
>>> -                                                      VIRTIO_XDP_HEADROOM);
>>> +                                                      headroom);
>>>                                 return head_skb;
>>>                         }
>>>                         break;
>>


^ permalink raw reply

* [PATCH iproute2-next 3/3] libbpf: Remove use of bpf_map_is_offload_neutral
From: David Ahern @ 2022-04-23 15:23 UTC (permalink / raw)
  To: netdev; +Cc: stephen, toke, haliu, David Ahern
In-Reply-To: <20220423152300.16201-1-dsahern@kernel.org>

bpf_map_is_offload_neutral is deprecated as of v0.8+;
import definition to maintain backwards compatibility.

Signed-off-by: David Ahern <dsahern@kernel.org>
---
 lib/bpf_libbpf.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/lib/bpf_libbpf.c b/lib/bpf_libbpf.c
index 7dd1faf536f4..7b16ee71589d 100644
--- a/lib/bpf_libbpf.c
+++ b/lib/bpf_libbpf.c
@@ -249,6 +249,11 @@ static int handle_legacy_maps(struct bpf_object *obj)
 	return ret;
 }
 
+static bool bpf_map_is_offload_neutral(const struct bpf_map *map)
+{
+	return bpf_map__type(map) == BPF_MAP_TYPE_PERF_EVENT_ARRAY;
+}
+
 static int load_bpf_object(struct bpf_cfg_in *cfg)
 {
 	struct bpf_program *p, *prog = NULL;
@@ -294,7 +299,7 @@ static int load_bpf_object(struct bpf_cfg_in *cfg)
 	}
 
 	bpf_object__for_each_map(map, obj) {
-		if (!bpf_map__is_offload_neutral(map))
+		if (!bpf_map_is_offload_neutral(map))
 			bpf_map__set_ifindex(map, cfg->ifindex);
 	}
 
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related

* [PATCH iproute2-next 0/3] Address more libbpf deprecations
From: David Ahern @ 2022-04-23 15:22 UTC (permalink / raw)
  To: netdev; +Cc: stephen, toke, haliu, David Ahern

Another round of changes to handle libbpf deprecations. Compiles are
clean as of libbpf commit 533c7666eb72 ("Fix downloads formats").

David Ahern (3):
  libbpf: Use bpf_object__load instead of bpf_object__load_xattr
  libbpf: Remove use of bpf_program__set_priv and bpf_program__priv
  libbpf: Remove use of bpf_map_is_offload_neutral

 lib/bpf_libbpf.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply

* [PATCH iproute2-next 1/3] libbpf: Use bpf_object__load instead of bpf_object__load_xattr
From: David Ahern @ 2022-04-23 15:22 UTC (permalink / raw)
  To: netdev; +Cc: stephen, toke, haliu, David Ahern
In-Reply-To: <20220423152300.16201-1-dsahern@kernel.org>

bpf_object__load_xattr is deprecated as of v0.8+; remove it
in favor of bpf_object__load.

Signed-off-by: David Ahern <dsahern@kernel.org>
---
 lib/bpf_libbpf.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/lib/bpf_libbpf.c b/lib/bpf_libbpf.c
index f4f98caa1e58..f723f6310c28 100644
--- a/lib/bpf_libbpf.c
+++ b/lib/bpf_libbpf.c
@@ -248,7 +248,6 @@ static int handle_legacy_maps(struct bpf_object *obj)
 
 static int load_bpf_object(struct bpf_cfg_in *cfg)
 {
-	struct bpf_object_load_attr attr = {};
 	struct bpf_program *p, *prog = NULL;
 	struct bpf_object *obj;
 	char root_path[PATH_MAX];
@@ -305,11 +304,7 @@ static int load_bpf_object(struct bpf_cfg_in *cfg)
 	if (ret)
 		goto unload_obj;
 
-	attr.obj = obj;
-	if (cfg->verbose)
-		attr.log_level = 2;
-
-	ret = bpf_object__load_xattr(&attr);
+	ret = bpf_object__load(obj);
 	if (ret)
 		goto unload_obj;
 
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply related

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Handle single-chip-address OF property
From: Marek Behún @ 2022-04-23 15:18 UTC (permalink / raw)
  To: Nathan Rossi
  Cc: Andrew Lunn, netdev, linux-kernel, Vivien Didelot,
	Florian Fainelli, Vladimir Oltean, David S. Miller,
	Jakub Kicinski, Paolo Abeni
In-Reply-To: <CA+aJhH3EtAxAKy8orC-SU8UnagBCibF3dHXrp78zfjuAzj4vUg@mail.gmail.com>

On Sun, 24 Apr 2022 00:41:22 +1000
Nathan Rossi <nathan@nathanrossi.com> wrote:

> On Sun, 24 Apr 2022 at 00:07, Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > On Sat, Apr 23, 2022 at 01:14:27PM +0000, Nathan Rossi wrote:  
> > > Handle the parsing and use of single chip addressing when the switch has
> > > the single-chip-address property defined. This allows for specifying the
> > > switch as using single chip addressing even when mdio address 0 is used
> > > by another device on the bus. This is a feature of some switches (e.g.
> > > the MV88E6341/MV88E6141) where the switch shares the bus only responding
> > > to the higher 16 addresses.  
> >
> > Hi Nathan
> >
> > I think i'm missing something in this explanation:
> >
> > smi.c says:
> >
> > /* The switch ADDR[4:1] configuration pins define the chip SMI device address
> >  * (ADDR[0] is always zero, thus only even SMI addresses can be strapped).
> >  *
> >  * When ADDR is all zero, the chip uses Single-chip Addressing Mode, assuming it
> >  * is the only device connected to the SMI master. In this mode it responds to
> >  * all 32 possible SMI addresses, and thus maps directly the internal devices.
> >  *
> >  * When ADDR is non-zero, the chip uses Multi-chip Addressing Mode, allowing
> >  * multiple devices to share the SMI interface. In this mode it responds to only
> >  * 2 registers, used to indirectly access the internal SMI devices.
> >  *
> >  * Some chips use a different scheme: Only the ADDR4 pin is used for
> >  * configuration, and the device responds to 16 of the 32 SMI
> >  * addresses, allowing two to coexist on the same SMI interface.
> >  */
> >
> > So if ADDR = 0, it takes up the whole bus. And in this case reg = 0.
> > If ADDR != 0, it is in multi chip mode, and DT reg = ADDR.
> >
> > int mv88e6xxx_smi_init(struct mv88e6xxx_chip *chip,
> >                        struct mii_bus *bus, int sw_addr)
> > {
> >         if (chip->info->dual_chip)
> >                 chip->smi_ops = &mv88e6xxx_smi_dual_direct_ops;
> >         else if (sw_addr == 0)
> >                 chip->smi_ops = &mv88e6xxx_smi_direct_ops;
> >         else if (chip->info->multi_chip)
> >                 chip->smi_ops = &mv88e6xxx_smi_indirect_ops;
> >         else
> >                 return -EINVAL;
> >
> > This seems to implement what is above. smi_direct_ops == whole bus,
> > smi_indirect_ops == multi-chip mode.
> >
> > In what situation do you see this not working? What device are you
> > using, what does you DT look like, and what at the ADDR value?  
> 
> The device I am using is the MV88E6141, it follows the second scheme
> such that it only responds to the upper 16 of the 32 SMI addresses in
> single chip addressing mode. I am able to define the switch at address
> 0, and everything works. However in the device I am using (Netgate
> SG-3100) the ethernet phys for the non switch ethernet interfaces are
> also on the same mdio bus as the switch. One of those phys is
> configured with address 0. Defining both the ethernet-phy and switch
> as address 0 does not work.

This makes the need of new property reasonable. You can add my

Acked-by: Marek Behún <kabel@kernel.org>

^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 15:10 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <89a628a2-f31c-7740-fdf1-1bc8023636cd@blackwall.org>

On 23/04/2022 17:46, Nikolay Aleksandrov wrote:
> On 23/04/2022 17:30, Nikolay Aleksandrov wrote:
>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>> mode with virtio_net after updating to newer kernels. After
>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>> calculates the build_skb address wrong because the offset can become less
>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>> depending on how lower offset is):
>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>
>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>> via an xdp prog. The calculations done are:
>>>>>  receive_mergeable():
>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>           vi->hdr_len - metasize;
>>>>>
>>>>>  page_to_skb():
>>>>>  p = page_address(page) + offset;
>>>>>  ...
>>>>>  buf = p - headroom;
>>>>>
>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>> first version of the patch that was applied as:
>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>> was actually correct because it calculated the page starting address
>>>>> without relying on offset or headroom, but then the second version that
>>>>> was applied as:
>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> was wrong and added the above calculation.
>>>>> An example xdp prog[3] is below.
>>>>>
>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>
>>>>> [2] Two of the many traces:
> [snip]
>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>  	 */
>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>> +	if (headroom) {
>>>>> +		truesize = PAGE_SIZE;
>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>
>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>> buf is probably not page-aligned.
>>>>
>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>> each allocation is page-aligned .
>>>>
>>>> The problem here is that the value of headroom is wrong, the package is
>>>> structured like this:
>>>>
>>>> from device    | headroom          | virtio-net hdr | data |
>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>
>>> You're free to push data back (not necessarily through meta).
>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>
>>>>
>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>> virtio-net hdr.
>>>>
>>>> So I think it might be better to change it this way.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..086ae835ec86 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>
>>
>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
>>
>> The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
>> make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
>> Unfortunately the new headroom will not be correct if it is a frag, it will be longer.
>>
>>
> 
> Completely untested alternative could be based on the offset size, that is if it has
> eaten into the headroom and is smaller then we swap them (that means we start at page
> boundary since we have headroom guaranteed space):
>  buf = page_address(page) + (offset > headroom ? offset - headroom : 0);
> 
> or perhaps in current code terms:
>  buf = p - (offset > headroom ? headroom : offset);
> 

Actually looking at add_recvbuf_mergeable() I take that back. We should look into a
different solution. That seems wrong as well.

If headroom can reside in 2 pages it is more difficult to get the correct address.

> That means offset is somewhere inside the headroom of the buf and, the buf itself
> starts at page boundary (when offset < headroom). I think this preserves the correct
> headroom for the new skb. WDYT?
> 
> Cheers,
>  Nik
> 
> 
> 


^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Xuan Zhuo @ 2022-04-23 15:01 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <c206c147-ad7e-b615-2e96-572482335563@blackwall.org>

On Sat, 23 Apr 2022 17:58:05 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 17:36, Xuan Zhuo wrote:
> > On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> >>> On 23/04/2022 16:31, Xuan Zhuo wrote:
> >>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
> >>>>> mode with virtio_net after updating to newer kernels. After
> >>>>> investigating the reason it turned out that when using mergeable bufs
> >>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >>>>> calculates the build_skb address wrong because the offset can become less
> >>>>> than the headroom so it gets the address of the previous page (-X bytes
> >>>>> depending on how lower offset is):
> >>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>>>>
> >>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >>>>> shows offset that is less than headroom by adding 4 bytes of metadata
> >>>>> via an xdp prog. The calculations done are:
> >>>>>  receive_mergeable():
> >>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>>>>  offset = xdp.data - page_address(xdp_page) -
> >>>>>           vi->hdr_len - metasize;
> >>>>>
> >>>>>  page_to_skb():
> >>>>>  p = page_address(page) + offset;
> >>>>>  ...
> >>>>>  buf = p - headroom;
> >>>>>
> >>>>> Now buf goes -4 bytes from the page's starting address as can be seen
> >>>>> above which is set as skb->head and skb->data by build_skb later. Depending
> >>>>> on what's done with the skb (when it's freed most often) we get all kinds
> >>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >>>>> commit is interesting because the patch was sent and applied twice (it
> >>>>> seems the first one got lost during merge back in 5.13 window). The
> >>>>> first version of the patch that was applied as:
> >>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >>>>> was actually correct because it calculated the page starting address
> >>>>> without relying on offset or headroom, but then the second version that
> >>>>> was applied as:
> >>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>>>> was wrong and added the above calculation.
> >>>>> An example xdp prog[3] is below.
> >>>>>
> >>>>> [1] https://github.com/cilium/cilium/issues/19453
> >>>>>
> >>>>> [2] Two of the many traces:
> >>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>>>>  [   41.321387] Call Trace:
> >>>>>  [   41.321819]  <TASK>
> >>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>>>>  [   41.322902]  __kfree_skb+0x20/0x30
> >>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>>>>  [   41.445592]  ? 0xffffffffa3000000
> >>>>>  [   41.462442]  new_sync_read+0x148/0x160
> >>>>>  [   41.479314]  ? 0xffffffffa3000000
> >>>>>  [   41.496937]  vfs_read+0x138/0x190
> >>>>>  [   41.517198]  ksys_read+0x87/0xc0
> >>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>>>  [   41.568050] RIP: 0033:0x48765b
> >>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>>>>  [   41.744254]  </TASK>
> >>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>>>
> >>>>>  and
> >>>>>
> >>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>>>>  [   33.532471] page dumped because: nonzero mapcount
> >>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>>>  [   33.532484] Call Trace:
> >>>>>  [   33.532496]  <TASK>
> >>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>>>>  [   33.532506]  bad_page.cold+0x63/0x94
> >>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>>>>  [   33.532515]  free_unref_page+0x1b/0x100
> >>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>>>>
> >>>>> [3] XDP program to reproduce(xdp_pass.c):
> >>>>>  #include <linux/bpf.h>
> >>>>>  #include <bpf/bpf_helpers.h>
> >>>>>
> >>>>>  SEC("xdp_pass")
> >>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>>>>  {
> >>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>>>>           return XDP_PASS;
> >>>>>  }
> >>>>>
> >>>>>  char _license[] SEC("license") = "GPL";
> >>>>>
> >>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>>>>
> >>>>> CC: stable@vger.kernel.org
> >>>>> CC: Jason Wang <jasowang@redhat.com>
> >>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
> >>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>> CC: virtualization@lists.linux-foundation.org
> >>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >>>>> ---
> >>>>>  drivers/net/virtio_net.c | 8 ++++++--
> >>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>>> index 87838cbe38cf..0687dd88e97f 100644
> >>>>> --- a/drivers/net/virtio_net.c
> >>>>> +++ b/drivers/net/virtio_net.c
> >>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>>>>  	 */
> >>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
> >>>>> +	if (headroom) {
> >>>>> +		truesize = PAGE_SIZE;
> >>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >>>>
> >>>> The reason for not doing this is that buf and p may not be on the same page, and
> >>>> buf is probably not page-aligned.
> >>>>
> >>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> >>>> allocates a large block of memory at one time, and allocates from it each time.
> >>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
> >>>> each allocation is page-aligned .
> >>>>
> >>>> The problem here is that the value of headroom is wrong, the package is
> >>>> structured like this:
> >>>>
> >>>> from device    | headroom          | virtio-net hdr | data |
> >>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
> >>>
> >>> You're free to push data back (not necessarily through meta).
> >>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> >>>
> >>>>
> >>>> The page_address(page) + offset we pass to page_to_skb() points to the
> >>>> virtio-net hdr.
> >>>>
> >>>> So I think it might be better to change it this way.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>> index 87838cbe38cf..086ae835ec86 100644
> >>>> --- a/drivers/net/virtio_net.c
> >>>> +++ b/drivers/net/virtio_net.c
> >>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>>>                                                        len, PAGE_SIZE, false,
> >>>>                                                        metasize,
> >>>> -                                                      VIRTIO_XDP_HEADROOM);
> >>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >>>>                                 return head_skb;
> >>>>                         }
> >>>>                         break;
> >>>
> >>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> >>> So just doing that would take care of the meta, but won't take care of moving data.
> >>>
> >>
> >> Also it doesn't take care of the case where page_to_skb() is called with the original page
> >> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> >
> > Yes, you are right.
> >
> >>
> >> The above change guarantees that buf and p will be in the same page
> >
> >
> > How can this be guaranteed?
> >
> > 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
> >    from the allocation.
> > 2. set xdp
> > 3. alloc for new buffer
> >
>
> p = page_address(page) + offset;
> buf = p & PAGE_MASK; // whatever page p lands in is where buf is set
>
> => p and buf are always in the same page, no?

I don't think it is, it's entirely possible to split on two pages.

>
> > The buffer allocated in the third step must be unaligned, and it is entirely
> > possible that p and buf are not on the same page.
> >
> > I fixed my previous patch.
> >
> > Thanks.
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 87838cbe38cf..d95e82255b94 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                          * xdp.data_meta were adjusted
> >                          */
> >                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> > +
> > +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
>
> This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.

Yes, you are right. For the case of xdp_linearize_page() , we should change it.

   headroom = xdp.data - vi->hdr_len - metasize - page_address(xdp_page);

Thanks.


>
> >                         /* We can only create skb based on xdp_page. */
> >                         if (unlikely(xdp_page != page)) {
> >                                 rcu_read_unlock();
> > @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >                                                        len, PAGE_SIZE, false,
> >                                                        metasize,
> > -                                                      VIRTIO_XDP_HEADROOM);
> > +                                                      headroom);
> >                                 return head_skb;
> >                         }
> >                         break;
>

^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 14:58 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <1650724608.256497-2-xuanzhuo@linux.alibaba.com>

On 23/04/2022 17:36, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>>> mode with virtio_net after updating to newer kernels. After
>>>>> investigating the reason it turned out that when using mergeable bufs
>>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>>> calculates the build_skb address wrong because the offset can become less
>>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>>> depending on how lower offset is):
>>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>>
>>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>>> via an xdp prog. The calculations done are:
>>>>>  receive_mergeable():
>>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>>           vi->hdr_len - metasize;
>>>>>
>>>>>  page_to_skb():
>>>>>  p = page_address(page) + offset;
>>>>>  ...
>>>>>  buf = p - headroom;
>>>>>
>>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>>> commit is interesting because the patch was sent and applied twice (it
>>>>> seems the first one got lost during merge back in 5.13 window). The
>>>>> first version of the patch that was applied as:
>>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>>> was actually correct because it calculated the page starting address
>>>>> without relying on offset or headroom, but then the second version that
>>>>> was applied as:
>>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> was wrong and added the above calculation.
>>>>> An example xdp prog[3] is below.
>>>>>
>>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>>
>>>>> [2] Two of the many traces:
>>>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>>>  [   41.321387] Call Trace:
>>>>>  [   41.321819]  <TASK>
>>>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>>>  [   41.445592]  ? 0xffffffffa3000000
>>>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>>>  [   41.479314]  ? 0xffffffffa3000000
>>>>>  [   41.496937]  vfs_read+0x138/0x190
>>>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>  [   41.568050] RIP: 0033:0x48765b
>>>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>>>  [   41.744254]  </TASK>
>>>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>
>>>>>  and
>>>>>
>>>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>>>  [   33.532471] page dumped because: nonzero mapcount
>>>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>>>  [   33.532484] Call Trace:
>>>>>  [   33.532496]  <TASK>
>>>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>>>
>>>>> [3] XDP program to reproduce(xdp_pass.c):
>>>>>  #include <linux/bpf.h>
>>>>>  #include <bpf/bpf_helpers.h>
>>>>>
>>>>>  SEC("xdp_pass")
>>>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>>>  {
>>>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>>>           return XDP_PASS;
>>>>>  }
>>>>>
>>>>>  char _license[] SEC("license") = "GPL";
>>>>>
>>>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>>>
>>>>> CC: stable@vger.kernel.org
>>>>> CC: Jason Wang <jasowang@redhat.com>
>>>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>>>> CC: virtualization@lists.linux-foundation.org
>>>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>> ---
>>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>>  	 */
>>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>>> +	if (headroom) {
>>>>> +		truesize = PAGE_SIZE;
>>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>>
>>>> The reason for not doing this is that buf and p may not be on the same page, and
>>>> buf is probably not page-aligned.
>>>>
>>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>>> allocates a large block of memory at one time, and allocates from it each time.
>>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>>> each allocation is page-aligned .
>>>>
>>>> The problem here is that the value of headroom is wrong, the package is
>>>> structured like this:
>>>>
>>>> from device    | headroom          | virtio-net hdr | data |
>>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>>
>>> You're free to push data back (not necessarily through meta).
>>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>>
>>>>
>>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>>> virtio-net hdr.
>>>>
>>>> So I think it might be better to change it this way.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..086ae835ec86 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>>                                                        len, PAGE_SIZE, false,
>>>>                                                        metasize,
>>>> -                                                      VIRTIO_XDP_HEADROOM);
>>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>>                                 return head_skb;
>>>>                         }
>>>>                         break;
>>>
>>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>>> So just doing that would take care of the meta, but won't take care of moving data.
>>>
>>
>> Also it doesn't take care of the case where page_to_skb() is called with the original page
>> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> 
> Yes, you are right.
> 
>>
>> The above change guarantees that buf and p will be in the same page
> 
> 
> How can this be guaranteed?
> 
> 1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
>    from the allocation.
> 2. set xdp
> 3. alloc for new buffer
> 

p = page_address(page) + offset;
buf = p & PAGE_MASK; // whatever page p lands in is where buf is set

=> p and buf are always in the same page, no?

> The buffer allocated in the third step must be unaligned, and it is entirely
> possible that p and buf are not on the same page.
> 
> I fixed my previous patch.
> 
> Thanks.
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..d95e82255b94 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                          * xdp.data_meta were adjusted
>                          */
>                         len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
> +
> +                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);

This is wrong, xdp.data isn't related to buf in the xdp_linearize_page() case.

>                         /* We can only create skb based on xdp_page. */
>                         if (unlikely(xdp_page != page)) {
>                                 rcu_read_unlock();
> @@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>                                                        len, PAGE_SIZE, false,
>                                                        metasize,
> -                                                      VIRTIO_XDP_HEADROOM);
> +                                                      headroom);
>                                 return head_skb;
>                         }
>                         break;


^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 14:46 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <a58bfd2c-4f83-11fe-08d1-19c1d6497fc2@blackwall.org>

On 23/04/2022 17:30, Nikolay Aleksandrov wrote:
> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
>> On 23/04/2022 16:31, Xuan Zhuo wrote:
>>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>>> mode with virtio_net after updating to newer kernels. After
>>>> investigating the reason it turned out that when using mergeable bufs
>>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>>> calculates the build_skb address wrong because the offset can become less
>>>> than the headroom so it gets the address of the previous page (-X bytes
>>>> depending on how lower offset is):
>>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>>
>>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>>> via an xdp prog. The calculations done are:
>>>>  receive_mergeable():
>>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>>  offset = xdp.data - page_address(xdp_page) -
>>>>           vi->hdr_len - metasize;
>>>>
>>>>  page_to_skb():
>>>>  p = page_address(page) + offset;
>>>>  ...
>>>>  buf = p - headroom;
>>>>
>>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>>> on what's done with the skb (when it's freed most often) we get all kinds
>>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>>> commit is interesting because the patch was sent and applied twice (it
>>>> seems the first one got lost during merge back in 5.13 window). The
>>>> first version of the patch that was applied as:
>>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>>> was actually correct because it calculated the page starting address
>>>> without relying on offset or headroom, but then the second version that
>>>> was applied as:
>>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>>> was wrong and added the above calculation.
>>>> An example xdp prog[3] is below.
>>>>
>>>> [1] https://github.com/cilium/cilium/issues/19453
>>>>
>>>> [2] Two of the many traces:
[snip]
>>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index 87838cbe38cf..0687dd88e97f 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>>  	 */
>>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>>> +	if (headroom) {
>>>> +		truesize = PAGE_SIZE;
>>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>>
>>> The reason for not doing this is that buf and p may not be on the same page, and
>>> buf is probably not page-aligned.
>>>
>>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>>> allocates a large block of memory at one time, and allocates from it each time.
>>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>>> each allocation is page-aligned .
>>>
>>> The problem here is that the value of headroom is wrong, the package is
>>> structured like this:
>>>
>>> from device    | headroom          | virtio-net hdr | data |
>>> after xdp      | headroom  |  virtio-net hdr | meta | data |
>>
>> You're free to push data back (not necessarily through meta).
>> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
>>
>>>
>>> The page_address(page) + offset we pass to page_to_skb() points to the
>>> virtio-net hdr.
>>>
>>> So I think it might be better to change it this way.
>>>
>>> Thanks.
>>>
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..086ae835ec86 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>>                                                        len, PAGE_SIZE, false,
>>>                                                        metasize,
>>> -                                                      VIRTIO_XDP_HEADROOM);
>>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>>                                 return head_skb;
>>>                         }
>>>                         break;
>>
>> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
>> So just doing that would take care of the meta, but won't take care of moving data.
>>
> 
> Also it doesn't take care of the case where page_to_skb() is called with the original page
> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).
> 
> The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
> make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
> Unfortunately the new headroom will not be correct if it is a frag, it will be longer.
> 
> 

Completely untested alternative could be based on the offset size, that is if it has
eaten into the headroom and is smaller then we swap them (that means we start at page
boundary since we have headroom guaranteed space):
 buf = page_address(page) + (offset > headroom ? offset - headroom : 0);

or perhaps in current code terms:
 buf = p - (offset > headroom ? headroom : offset);

That means offset is somewhere inside the headroom of the buf and, the buf itself
starts at page boundary (when offset < headroom). I think this preserves the correct
headroom for the new skb. WDYT?

Cheers,
 Nik




^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Xuan Zhuo @ 2022-04-23 14:36 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <a58bfd2c-4f83-11fe-08d1-19c1d6497fc2@blackwall.org>

On Sat, 23 Apr 2022 17:30:11 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> > On 23/04/2022 16:31, Xuan Zhuo wrote:
> >> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >>> We received a report[1] of kernel crashes when Cilium is used in XDP
> >>> mode with virtio_net after updating to newer kernels. After
> >>> investigating the reason it turned out that when using mergeable bufs
> >>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >>> calculates the build_skb address wrong because the offset can become less
> >>> than the headroom so it gets the address of the previous page (-X bytes
> >>> depending on how lower offset is):
> >>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>>
> >>> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >>> shows offset that is less than headroom by adding 4 bytes of metadata
> >>> via an xdp prog. The calculations done are:
> >>>  receive_mergeable():
> >>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>>  offset = xdp.data - page_address(xdp_page) -
> >>>           vi->hdr_len - metasize;
> >>>
> >>>  page_to_skb():
> >>>  p = page_address(page) + offset;
> >>>  ...
> >>>  buf = p - headroom;
> >>>
> >>> Now buf goes -4 bytes from the page's starting address as can be seen
> >>> above which is set as skb->head and skb->data by build_skb later. Depending
> >>> on what's done with the skb (when it's freed most often) we get all kinds
> >>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >>> commit is interesting because the patch was sent and applied twice (it
> >>> seems the first one got lost during merge back in 5.13 window). The
> >>> first version of the patch that was applied as:
> >>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >>> was actually correct because it calculated the page starting address
> >>> without relying on offset or headroom, but then the second version that
> >>> was applied as:
> >>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>> was wrong and added the above calculation.
> >>> An example xdp prog[3] is below.
> >>>
> >>> [1] https://github.com/cilium/cilium/issues/19453
> >>>
> >>> [2] Two of the many traces:
> >>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>>  [   41.321387] Call Trace:
> >>>  [   41.321819]  <TASK>
> >>>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>>  [   41.322902]  __kfree_skb+0x20/0x30
> >>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>>  [   41.445592]  ? 0xffffffffa3000000
> >>>  [   41.462442]  new_sync_read+0x148/0x160
> >>>  [   41.479314]  ? 0xffffffffa3000000
> >>>  [   41.496937]  vfs_read+0x138/0x190
> >>>  [   41.517198]  ksys_read+0x87/0xc0
> >>>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>  [   41.568050] RIP: 0033:0x48765b
> >>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>>  [   41.744254]  </TASK>
> >>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>
> >>>  and
> >>>
> >>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>>  [   33.532471] page dumped because: nonzero mapcount
> >>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>>  [   33.532484] Call Trace:
> >>>  [   33.532496]  <TASK>
> >>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>>  [   33.532506]  bad_page.cold+0x63/0x94
> >>>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>>  [   33.532515]  free_unref_page+0x1b/0x100
> >>>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>>
> >>> [3] XDP program to reproduce(xdp_pass.c):
> >>>  #include <linux/bpf.h>
> >>>  #include <bpf/bpf_helpers.h>
> >>>
> >>>  SEC("xdp_pass")
> >>>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>>  {
> >>>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>>           return XDP_PASS;
> >>>  }
> >>>
> >>>  char _license[] SEC("license") = "GPL";
> >>>
> >>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>>
> >>> CC: stable@vger.kernel.org
> >>> CC: Jason Wang <jasowang@redhat.com>
> >>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>> CC: Daniel Borkmann <daniel@iogearbox.net>
> >>> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >>> CC: virtualization@lists.linux-foundation.org
> >>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >>> ---
> >>>  drivers/net/virtio_net.c | 8 ++++++--
> >>>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>> index 87838cbe38cf..0687dd88e97f 100644
> >>> --- a/drivers/net/virtio_net.c
> >>> +++ b/drivers/net/virtio_net.c
> >>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>>  	 */
> >>> -	truesize = headroom ? PAGE_SIZE : truesize;
> >>> +	if (headroom) {
> >>> +		truesize = PAGE_SIZE;
> >>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >>
> >> The reason for not doing this is that buf and p may not be on the same page, and
> >> buf is probably not page-aligned.
> >>
> >> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> >> allocates a large block of memory at one time, and allocates from it each time.
> >> Although in xdp mode, each allocation is page_size, it does not guarantee that
> >> each allocation is page-aligned .
> >>
> >> The problem here is that the value of headroom is wrong, the package is
> >> structured like this:
> >>
> >> from device    | headroom          | virtio-net hdr | data |
> >> after xdp      | headroom  |  virtio-net hdr | meta | data |
> >
> > You're free to push data back (not necessarily through meta).
> > You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> >
> >>
> >> The page_address(page) + offset we pass to page_to_skb() points to the
> >> virtio-net hdr.
> >>
> >> So I think it might be better to change it this way.
> >>
> >> Thanks.
> >>
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..086ae835ec86 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >>                                                        len, PAGE_SIZE, false,
> >>                                                        metasize,
> >> -                                                      VIRTIO_XDP_HEADROOM);
> >> +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >>                                 return head_skb;
> >>                         }
> >>                         break;
> >
> > That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> > So just doing that would take care of the meta, but won't take care of moving data.
> >
>
> Also it doesn't take care of the case where page_to_skb() is called with the original page
> i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).

Yes, you are right.

>
> The above change guarantees that buf and p will be in the same page


How can this be guaranteed?

1. For example, we applied for a 32k buffer first, and took away 1500 + hdr_len
   from the allocation.
2. set xdp
3. alloc for new buffer

The buffer allocated in the third step must be unaligned, and it is entirely
possible that p and buf are not on the same page.

I fixed my previous patch.

Thanks.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 87838cbe38cf..d95e82255b94 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1005,6 +1005,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                         * xdp.data_meta were adjusted
                         */
                        len = xdp.data_end - xdp.data + vi->hdr_len + metasize;
+
+                       headroom = xdp.data - vi->hdr_len - metasize - (buf - headroom);
                        /* We can only create skb based on xdp_page. */
                        if (unlikely(xdp_page != page)) {
                                rcu_read_unlock();
@@ -1012,7 +1014,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
                                head_skb = page_to_skb(vi, rq, xdp_page, offset,
                                                       len, PAGE_SIZE, false,
                                                       metasize,
-                                                      VIRTIO_XDP_HEADROOM);
+                                                      headroom);
                                return head_skb;
                        }
                        break;

^ permalink raw reply related

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Handle single-chip-address OF property
From: Nathan Rossi @ 2022-04-23 14:41 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux-kernel, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, David S. Miller, Jakub Kicinski, Paolo Abeni
In-Reply-To: <YmQIHWL4iTS5qVIz@lunn.ch>

On Sun, 24 Apr 2022 at 00:07, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Sat, Apr 23, 2022 at 01:14:27PM +0000, Nathan Rossi wrote:
> > Handle the parsing and use of single chip addressing when the switch has
> > the single-chip-address property defined. This allows for specifying the
> > switch as using single chip addressing even when mdio address 0 is used
> > by another device on the bus. This is a feature of some switches (e.g.
> > the MV88E6341/MV88E6141) where the switch shares the bus only responding
> > to the higher 16 addresses.
>
> Hi Nathan
>
> I think i'm missing something in this explanation:
>
> smi.c says:
>
> /* The switch ADDR[4:1] configuration pins define the chip SMI device address
>  * (ADDR[0] is always zero, thus only even SMI addresses can be strapped).
>  *
>  * When ADDR is all zero, the chip uses Single-chip Addressing Mode, assuming it
>  * is the only device connected to the SMI master. In this mode it responds to
>  * all 32 possible SMI addresses, and thus maps directly the internal devices.
>  *
>  * When ADDR is non-zero, the chip uses Multi-chip Addressing Mode, allowing
>  * multiple devices to share the SMI interface. In this mode it responds to only
>  * 2 registers, used to indirectly access the internal SMI devices.
>  *
>  * Some chips use a different scheme: Only the ADDR4 pin is used for
>  * configuration, and the device responds to 16 of the 32 SMI
>  * addresses, allowing two to coexist on the same SMI interface.
>  */
>
> So if ADDR = 0, it takes up the whole bus. And in this case reg = 0.
> If ADDR != 0, it is in multi chip mode, and DT reg = ADDR.
>
> int mv88e6xxx_smi_init(struct mv88e6xxx_chip *chip,
>                        struct mii_bus *bus, int sw_addr)
> {
>         if (chip->info->dual_chip)
>                 chip->smi_ops = &mv88e6xxx_smi_dual_direct_ops;
>         else if (sw_addr == 0)
>                 chip->smi_ops = &mv88e6xxx_smi_direct_ops;
>         else if (chip->info->multi_chip)
>                 chip->smi_ops = &mv88e6xxx_smi_indirect_ops;
>         else
>                 return -EINVAL;
>
> This seems to implement what is above. smi_direct_ops == whole bus,
> smi_indirect_ops == multi-chip mode.
>
> In what situation do you see this not working? What device are you
> using, what does you DT look like, and what at the ADDR value?

The device I am using is the MV88E6141, it follows the second scheme
such that it only responds to the upper 16 of the 32 SMI addresses in
single chip addressing mode. I am able to define the switch at address
0, and everything works. However in the device I am using (Netgate
SG-3100) the ethernet phys for the non switch ethernet interfaces are
also on the same mdio bus as the switch. One of those phys is
configured with address 0. Defining both the ethernet-phy and switch
as address 0 does not work.

The device tree I have looks like:

&mdio {
    status = "okay";
    pinctrl-0 = <&mdio_pins>;
    pinctrl-names = "default";

    phy0: ethernet-phy@0 {
        status = "okay";
        reg = <0>;
    };

    phy1: ethernet-phy@1 {
        status = "okay";
        reg = <1>;
    };

    switch0: switch0@16 {
        compatible = "marvell,mv88e6141", "marvell,mv88e6085";
        single-chip-address;
        reg = <16>; /* first port in single address mode */
        dsa,member = <0 0>;
        status = "okay";

... ports/mdio nodes ...
    };
};

Thanks,
Nathan

>
> Thanks
>         Andrew

^ permalink raw reply

* Re: [PATCH net-next] 1588 support on bcm54210pe
From: Florian Fainelli @ 2022-04-23 14:40 UTC (permalink / raw)
  To: Richard Cochran, Andrew Lunn
  Cc: Lasse Johnsen, netdev, Gordon Hollingworth, Ahmad Byagowi,
	Heiner Kallweit, Russell King, bcm-kernel-feedback-list,
	Florian Fainelli
In-Reply-To: <20220422194810.GA9325@hoboy.vegasvil.org>



On 4/22/2022 12:48 PM, Richard Cochran wrote:
> On Fri, Apr 22, 2022 at 05:00:07PM +0200, Andrew Lunn wrote:
> 
>>> I am confident that this code is relevant exclusively to the
>>> BCM54210PE.
> 
> Not true.
> 
>> It will not even work with the BCM54210, BCM54210S and
>>> BCM54210SE PHYs.
> 
> The registers you used are also present in the BCM541xx devices.
> Pretty sure your code would work on those devices (after adjusting
> register offsets).
> 
>> Florian can probably tell us more, but often hardware like this is
>> shared by multiple devices. If it is, you might want to use a more
>> generic prefix.
> 
> My understanding is that there are two implementions, gen1 and gen2.
> Your bcm542xx and the bcm541xx are both gen1, and both support inband
> Rx time stamping.

That is correct. Lasse for your future submission please address the 
following:

- conform to the usual patch submission style and break up your changes 
between bcmgenet.c (although I doubt you need to change it), broadcom.c 
and bcm-phy-lib.[ch]

- do not create a PHY device entry specifically for BCM54210PE, use the 
existing BCM54210 entry and add checks using the revision field of 
phydev->phy_id where necessary. There are already many entries in this 
driver, adding more does not help maintaining it. Also, I went through 
several months of work fixing bugs and adding decent power management 
features to this driver that all PHYs should leverage, adding a new 
entry means we need to verify whether all code paths are hit or not

- move generic code, such as all of the PTP code into bcm-phy-lib.[ch] 
where it can easily be re-used across multiple PHY device driver entries 
(54810, 54210 etc.)

Thanks!

> 
> Because the registers are all the same (just the offsets are
> different), I'd like to see a common module that can be used by all
> gen1 devices.  The module could be named bcm-ptp-gen1.c for example.

I would prefer that we just stick to adding that code to 
bcm-phy-lib.[ch] which all Broadcom PHY drivers can use and we can 
decide whether we want to add a Kconfig option specifically for PTP.

Cheers
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Xuan Zhuo @ 2022-04-23 14:20 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <8d511a16-8d69-82b1-48a1-24de3a592aef@blackwall.org>

On Sat, 23 Apr 2022 17:16:53 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> On 23/04/2022 16:31, Xuan Zhuo wrote:
> > On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
> >> We received a report[1] of kernel crashes when Cilium is used in XDP
> >> mode with virtio_net after updating to newer kernels. After
> >> investigating the reason it turned out that when using mergeable bufs
> >> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
> >> calculates the build_skb address wrong because the offset can become less
> >> than the headroom so it gets the address of the previous page (-X bytes
> >> depending on how lower offset is):
> >>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
> >>
> >> This is a pr_err() I added in the beginning of page_to_skb which clearly
> >> shows offset that is less than headroom by adding 4 bytes of metadata
> >> via an xdp prog. The calculations done are:
> >>  receive_mergeable():
> >>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
> >>  offset = xdp.data - page_address(xdp_page) -
> >>           vi->hdr_len - metasize;
> >>
> >>  page_to_skb():
> >>  p = page_address(page) + offset;
> >>  ...
> >>  buf = p - headroom;
> >>
> >> Now buf goes -4 bytes from the page's starting address as can be seen
> >> above which is set as skb->head and skb->data by build_skb later. Depending
> >> on what's done with the skb (when it's freed most often) we get all kinds
> >> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
> >> commit is interesting because the patch was sent and applied twice (it
> >> seems the first one got lost during merge back in 5.13 window). The
> >> first version of the patch that was applied as:
> >>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
> >> was actually correct because it calculated the page starting address
> >> without relying on offset or headroom, but then the second version that
> >> was applied as:
> >>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> was wrong and added the above calculation.
> >> An example xdp prog[3] is below.
> >>
> >> [1] https://github.com/cilium/cilium/issues/19453
> >>
> >> [2] Two of the many traces:
> >>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
> >>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
> >>  [   41.300891] kernel BUG at include/linux/mm.h:720!
> >>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> >>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
> >>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
> >>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
> >>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
> >>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
> >>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
> >>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
> >>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
> >>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
> >>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
> >>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
> >>  [   41.321387] Call Trace:
> >>  [   41.321819]  <TASK>
> >>  [   41.322193]  skb_release_data+0x13f/0x1c0
> >>  [   41.322902]  __kfree_skb+0x20/0x30
> >>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
> >>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
> >>  [   41.384102]  inet_recvmsg+0x42/0x100
> >>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
> >>  [   41.428201]  sock_read_iter+0x84/0xd0
> >>  [   41.445592]  ? 0xffffffffa3000000
> >>  [   41.462442]  new_sync_read+0x148/0x160
> >>  [   41.479314]  ? 0xffffffffa3000000
> >>  [   41.496937]  vfs_read+0x138/0x190
> >>  [   41.517198]  ksys_read+0x87/0xc0
> >>  [   41.535336]  do_syscall_64+0x3b/0x90
> >>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>  [   41.568050] RIP: 0033:0x48765b
> >>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> >>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
> >>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
> >>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
> >>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
> >>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
> >>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
> >>  [   41.744254]  </TASK>
> >>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>
> >>  and
> >>
> >>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
> >>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
> >>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
> >>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> >>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
> >>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
> >>  [   33.532471] page dumped because: nonzero mapcount
> >>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
> >>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
> >>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
> >>  [   33.532484] Call Trace:
> >>  [   33.532496]  <TASK>
> >>  [   33.532500]  dump_stack_lvl+0x45/0x5a
> >>  [   33.532506]  bad_page.cold+0x63/0x94
> >>  [   33.532510]  free_pcp_prepare+0x290/0x420
> >>  [   33.532515]  free_unref_page+0x1b/0x100
> >>  [   33.532518]  skb_release_data+0x13f/0x1c0
> >>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
> >>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
> >>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
> >>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
> >>
> >> [3] XDP program to reproduce(xdp_pass.c):
> >>  #include <linux/bpf.h>
> >>  #include <bpf/bpf_helpers.h>
> >>
> >>  SEC("xdp_pass")
> >>  int xdp_pkt_pass(struct xdp_md *ctx)
> >>  {
> >>           bpf_xdp_adjust_head(ctx, -(int)32);
> >>           return XDP_PASS;
> >>  }
> >>
> >>  char _license[] SEC("license") = "GPL";
> >>
> >>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
> >>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
> >>
> >> CC: stable@vger.kernel.org
> >> CC: Jason Wang <jasowang@redhat.com>
> >> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >> CC: Daniel Borkmann <daniel@iogearbox.net>
> >> CC: "Michael S. Tsirkin" <mst@redhat.com>
> >> CC: virtualization@lists.linux-foundation.org
> >> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
> >> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
> >> ---
> >>  drivers/net/virtio_net.c | 8 ++++++--
> >>  1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 87838cbe38cf..0687dd88e97f 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
> >>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
> >>  	 */
> >> -	truesize = headroom ? PAGE_SIZE : truesize;
> >> +	if (headroom) {
> >> +		truesize = PAGE_SIZE;
> >> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> >
> > The reason for not doing this is that buf and p may not be on the same page, and
> > buf is probably not page-aligned.
> >
> > The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> > allocates a large block of memory at one time, and allocates from it each time.
> > Although in xdp mode, each allocation is page_size, it does not guarantee that
> > each allocation is page-aligned .
> >
> > The problem here is that the value of headroom is wrong, the package is
> > structured like this:
> >
> > from device    | headroom          | virtio-net hdr | data |
> > after xdp      | headroom  |  virtio-net hdr | meta | data |
>
> You're free to push data back (not necessarily through meta).

Oh, is that possible? Thanks, that should take this into account.

> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).

Yes, indeed hdr_vaild is false. But the virtio-net hdr space is included,
although it is not legal.

			offset = xdp.data - page_address(xdp_page) -
				 vi->hdr_len - metasize;

Thanks.

>
> >
> > The page_address(page) + offset we pass to page_to_skb() points to the
> > virtio-net hdr.
> >
> > So I think it might be better to change it this way.
> >
> > Thanks.
> >
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 87838cbe38cf..086ae835ec86 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> >                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
> >                                                        len, PAGE_SIZE, false,
> >                                                        metasize,
> > -                                                      VIRTIO_XDP_HEADROOM);
> > +                                                      VIRTIO_XDP_HEADROOM - metazie);
> >                                 return head_skb;
> >                         }
> >                         break;
>
> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> So just doing that would take care of the meta, but won't take care of moving data.
>




^ permalink raw reply

* [PATCH bpf-next] selftests/bpf: Fix incorrect TRUNNER_BINARY name output
From: Yuntao Wang @ 2022-04-23 14:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Shuah Khan, netdev, bpf, linux-kernel, linux-kselftest,
	Yuntao Wang

Currently, when we run 'make test_progs', the output is:

  CLNG-BPF [test_maps] atomic_bounds.o
  ...
  GEN-SKEL [test_progs] atomic_bounds.skel.h
  ...
  TEST-OBJ [test_progs] align.test.o
  ...
  TEST-HDR [test_progs] tests.h
  EXT-OBJ  [test_progs] test_progs.o
  ...
  BINARY   test_progs

As you can see, the TRUNNER_BINARY name in the CLNG-BPF part is test_maps,
which is incorrect.

Similarly, when we run 'make test_maps', the output is:

  CLNG-BPF [test_maps] atomic_bounds.o
  ...
  GEN-SKEL [test_progs] atomic_bounds.skel.h
  ...
  TEST-OBJ [test_maps] array_map_batch_ops.test.o
  ...
  TEST-HDR [test_maps] tests.h
  EXT-OBJ  [test_maps] test_maps.o
  ...
  BINARY   test_maps

At this time, the TRUNNER_BINARY name in the GEN-SKEL part is wrong.

Again, if we run 'make /full/path/to/selftests/bpf/test_vmlinux.skel.h',
the output is:

  CLNG-BPF [test_maps] test_vmlinux.o
  GEN-SKEL [test_progs] test_vmlinux.skel.h

Here, the TRUNNER_BINARY names are inappropriate and meaningless, they
should be removed.

This patch fixes these and all other similar issues.

With the patch applied, the output becomes:

  $ make test_progs

  CLNG-BPF [test_progs] atomic_bounds.o
  ...
  GEN-SKEL [test_progs] atomic_bounds.skel.h
  ...
  TEST-OBJ [test_progs] align.test.o
  ...
  TEST-HDR [test_progs] tests.h
  EXT-OBJ  [test_progs] test_progs.o
  ...
  BINARY   test_progs

  $ make test_maps

  CLNG-BPF [test_maps] atomic_bounds.o
  ...
  GEN-SKEL [test_maps] atomic_bounds.skel.h
  ...
  TEST-OBJ [test_maps] array_map_batch_ops.test.o
  ...
  TEST-HDR [test_maps] tests.h
  EXT-OBJ  [test_maps] test_maps.o
  ...
  BINARY   test_maps

  $ make /full/path/to/selftests/bpf/test_vmlinux.skel.h

  CLNG-BPF test_vmlinux.o
  GEN-SKEL test_vmlinux.skel.h

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
---
 tools/testing/selftests/bpf/Makefile | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index bafdc5373a13..3cf444cb20af 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -413,7 +413,7 @@ $(TRUNNER_BPF_OBJS): $(TRUNNER_OUTPUT)/%.o:				\
 					  $(TRUNNER_BPF_CFLAGS))
 
 $(TRUNNER_BPF_SKELS): %.skel.h: %.o $(BPFTOOL) | $(TRUNNER_OUTPUT)
-	$$(call msg,GEN-SKEL,$(TRUNNER_BINARY),$$@)
+	$$(call msg,GEN-SKEL,$$(TRUNNER_BINARY),$$@)
 	$(Q)$$(BPFTOOL) gen object $$(<:.o=.linked1.o) $$<
 	$(Q)$$(BPFTOOL) gen object $$(<:.o=.linked2.o) $$(<:.o=.linked1.o)
 	$(Q)$$(BPFTOOL) gen object $$(<:.o=.linked3.o) $$(<:.o=.linked2.o)
@@ -422,7 +422,7 @@ $(TRUNNER_BPF_SKELS): %.skel.h: %.o $(BPFTOOL) | $(TRUNNER_OUTPUT)
 	$(Q)$$(BPFTOOL) gen subskeleton $$(<:.o=.linked3.o) name $$(notdir $$(<:.o=)) > $$(@:.skel.h=.subskel.h)
 
 $(TRUNNER_BPF_LSKELS): %.lskel.h: %.o $(BPFTOOL) | $(TRUNNER_OUTPUT)
-	$$(call msg,GEN-SKEL,$(TRUNNER_BINARY),$$@)
+	$$(call msg,GEN-SKEL,$$(TRUNNER_BINARY),$$@)
 	$(Q)$$(BPFTOOL) gen object $$(<:.o=.linked1.o) $$<
 	$(Q)$$(BPFTOOL) gen object $$(<:.o=.linked2.o) $$(<:.o=.linked1.o)
 	$(Q)$$(BPFTOOL) gen object $$(<:.o=.linked3.o) $$(<:.o=.linked2.o)
@@ -430,12 +430,12 @@ $(TRUNNER_BPF_LSKELS): %.lskel.h: %.o $(BPFTOOL) | $(TRUNNER_OUTPUT)
 	$(Q)$$(BPFTOOL) gen skeleton -L $$(<:.o=.linked3.o) name $$(notdir $$(<:.o=_lskel)) > $$@
 
 $(TRUNNER_BPF_SKELS_LINKED): $(TRUNNER_BPF_OBJS) $(BPFTOOL) | $(TRUNNER_OUTPUT)
-	$$(call msg,LINK-BPF,$(TRUNNER_BINARY),$$(@:.skel.h=.o))
+	$$(call msg,LINK-BPF,$$(TRUNNER_BINARY),$$(@:.skel.h=.o))
 	$(Q)$$(BPFTOOL) gen object $$(@:.skel.h=.linked1.o) $$(addprefix $(TRUNNER_OUTPUT)/,$$($$(@F)-deps))
 	$(Q)$$(BPFTOOL) gen object $$(@:.skel.h=.linked2.o) $$(@:.skel.h=.linked1.o)
 	$(Q)$$(BPFTOOL) gen object $$(@:.skel.h=.linked3.o) $$(@:.skel.h=.linked2.o)
 	$(Q)diff $$(@:.skel.h=.linked2.o) $$(@:.skel.h=.linked3.o)
-	$$(call msg,GEN-SKEL,$(TRUNNER_BINARY),$$@)
+	$$(call msg,GEN-SKEL,$$(TRUNNER_BINARY),$$@)
 	$(Q)$$(BPFTOOL) gen skeleton $$(@:.skel.h=.linked3.o) name $$(notdir $$(@:.skel.h=)) > $$@
 	$(Q)$$(BPFTOOL) gen subskeleton $$(@:.skel.h=.linked3.o) name $$(notdir $$(@:.skel.h=)) > $$(@:.skel.h=.subskel.h)
 endif
@@ -444,7 +444,7 @@ endif
 ifeq ($($(TRUNNER_TESTS_DIR)-tests-hdr),)
 $(TRUNNER_TESTS_DIR)-tests-hdr := y
 $(TRUNNER_TESTS_HDR): $(TRUNNER_TESTS_DIR)/*.c
-	$$(call msg,TEST-HDR,$(TRUNNER_BINARY),$$@)
+	$$(call msg,TEST-HDR,$$(TRUNNER_BINARY),$$@)
 	$$(shell (echo '/* Generated header, do not edit */';					\
 		  sed -n -E 's/^void (serial_)?test_([a-zA-Z0-9_]+)\((void)?\).*/DEFINE_TEST(\2)/p'	\
 			$(TRUNNER_TESTS_DIR)/*.c | sort ;	\
@@ -461,7 +461,7 @@ $(TRUNNER_TEST_OBJS): $(TRUNNER_OUTPUT)/%.test.o:			\
 		      $(TRUNNER_BPF_LSKELS)				\
 		      $(TRUNNER_BPF_SKELS_LINKED)			\
 		      $$(BPFOBJ) | $(TRUNNER_OUTPUT)
-	$$(call msg,TEST-OBJ,$(TRUNNER_BINARY),$$@)
+	$$(call msg,TEST-OBJ,$$(TRUNNER_BINARY),$$@)
 	$(Q)cd $$(@D) && $$(CC) -I. $$(CFLAGS) -c $(CURDIR)/$$< $$(LDLIBS) -o $$(@F)
 
 $(TRUNNER_EXTRA_OBJS): $(TRUNNER_OUTPUT)/%.o:				\
@@ -469,17 +469,19 @@ $(TRUNNER_EXTRA_OBJS): $(TRUNNER_OUTPUT)/%.o:				\
 		       $(TRUNNER_EXTRA_HDRS)				\
 		       $(TRUNNER_TESTS_HDR)				\
 		       $$(BPFOBJ) | $(TRUNNER_OUTPUT)
-	$$(call msg,EXT-OBJ,$(TRUNNER_BINARY),$$@)
+	$$(call msg,EXT-OBJ,$$(TRUNNER_BINARY),$$@)
 	$(Q)$$(CC) $$(CFLAGS) -c $$< $$(LDLIBS) -o $$@
 
 # non-flavored in-srctree builds receive special treatment, in particular, we
 # do not need to copy extra resources (see e.g. test_btf_dump_case())
 $(TRUNNER_BINARY)-extras: $(TRUNNER_EXTRA_FILES) | $(TRUNNER_OUTPUT)
 ifneq ($2:$(OUTPUT),:$(shell pwd))
-	$$(call msg,EXT-COPY,$(TRUNNER_BINARY),$(TRUNNER_EXTRA_FILES))
+	$$(call msg,EXT-COPY,$$(TRUNNER_BINARY),$(TRUNNER_EXTRA_FILES))
 	$(Q)rsync -aq $$^ $(TRUNNER_OUTPUT)/
 endif
 
+$(OUTPUT)/$(TRUNNER_BINARY): TRUNNER_BINARY = $(TRUNNER_BINARY)
+
 $(OUTPUT)/$(TRUNNER_BINARY): $(TRUNNER_TEST_OBJS)			\
 			     $(TRUNNER_EXTRA_OBJS) $$(BPFOBJ)		\
 			     $(RESOLVE_BTFIDS)				\
@@ -489,6 +491,8 @@ $(OUTPUT)/$(TRUNNER_BINARY): $(TRUNNER_TEST_OBJS)			\
 	$(Q)$(RESOLVE_BTFIDS) --btf $(TRUNNER_OUTPUT)/btf_data.o $$@
 	$(Q)ln -sf $(if $2,..,.)/tools/build/bpftool/bootstrap/bpftool $(if $2,$2/)bpftool
 
+TRUNNER_BINARY =
+
 endef
 
 # Define test_progs test runner.
-- 
2.35.3


^ permalink raw reply related

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 14:30 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <8d511a16-8d69-82b1-48a1-24de3a592aef@blackwall.org>

On 23/04/2022 17:16, Nikolay Aleksandrov wrote:
> On 23/04/2022 16:31, Xuan Zhuo wrote:
>> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>>> We received a report[1] of kernel crashes when Cilium is used in XDP
>>> mode with virtio_net after updating to newer kernels. After
>>> investigating the reason it turned out that when using mergeable bufs
>>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>>> calculates the build_skb address wrong because the offset can become less
>>> than the headroom so it gets the address of the previous page (-X bytes
>>> depending on how lower offset is):
>>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>>
>>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>>> shows offset that is less than headroom by adding 4 bytes of metadata
>>> via an xdp prog. The calculations done are:
>>>  receive_mergeable():
>>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>>  offset = xdp.data - page_address(xdp_page) -
>>>           vi->hdr_len - metasize;
>>>
>>>  page_to_skb():
>>>  p = page_address(page) + offset;
>>>  ...
>>>  buf = p - headroom;
>>>
>>> Now buf goes -4 bytes from the page's starting address as can be seen
>>> above which is set as skb->head and skb->data by build_skb later. Depending
>>> on what's done with the skb (when it's freed most often) we get all kinds
>>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>>> commit is interesting because the patch was sent and applied twice (it
>>> seems the first one got lost during merge back in 5.13 window). The
>>> first version of the patch that was applied as:
>>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>>> was actually correct because it calculated the page starting address
>>> without relying on offset or headroom, but then the second version that
>>> was applied as:
>>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>> was wrong and added the above calculation.
>>> An example xdp prog[3] is below.
>>>
>>> [1] https://github.com/cilium/cilium/issues/19453
>>>
>>> [2] Two of the many traces:
>>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>>  [   41.321387] Call Trace:
>>>  [   41.321819]  <TASK>
>>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>>  [   41.322902]  __kfree_skb+0x20/0x30
>>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>>  [   41.445592]  ? 0xffffffffa3000000
>>>  [   41.462442]  new_sync_read+0x148/0x160
>>>  [   41.479314]  ? 0xffffffffa3000000
>>>  [   41.496937]  vfs_read+0x138/0x190
>>>  [   41.517198]  ksys_read+0x87/0xc0
>>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>  [   41.568050] RIP: 0033:0x48765b
>>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>>  [   41.744254]  </TASK>
>>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>
>>>  and
>>>
>>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>>  [   33.532471] page dumped because: nonzero mapcount
>>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>>  [   33.532484] Call Trace:
>>>  [   33.532496]  <TASK>
>>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>>  [   33.532506]  bad_page.cold+0x63/0x94
>>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>>  [   33.532515]  free_unref_page+0x1b/0x100
>>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>>
>>> [3] XDP program to reproduce(xdp_pass.c):
>>>  #include <linux/bpf.h>
>>>  #include <bpf/bpf_helpers.h>
>>>
>>>  SEC("xdp_pass")
>>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>>  {
>>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>>           return XDP_PASS;
>>>  }
>>>
>>>  char _license[] SEC("license") = "GPL";
>>>
>>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>>
>>> CC: stable@vger.kernel.org
>>> CC: Jason Wang <jasowang@redhat.com>
>>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> CC: Daniel Borkmann <daniel@iogearbox.net>
>>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>>> CC: virtualization@lists.linux-foundation.org
>>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>>> ---
>>>  drivers/net/virtio_net.c | 8 ++++++--
>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 87838cbe38cf..0687dd88e97f 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>>  	 */
>>> -	truesize = headroom ? PAGE_SIZE : truesize;
>>> +	if (headroom) {
>>> +		truesize = PAGE_SIZE;
>>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
>>
>> The reason for not doing this is that buf and p may not be on the same page, and
>> buf is probably not page-aligned.
>>
>> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
>> allocates a large block of memory at one time, and allocates from it each time.
>> Although in xdp mode, each allocation is page_size, it does not guarantee that
>> each allocation is page-aligned .
>>
>> The problem here is that the value of headroom is wrong, the package is
>> structured like this:
>>
>> from device    | headroom          | virtio-net hdr | data |
>> after xdp      | headroom  |  virtio-net hdr | meta | data |
> 
> You're free to push data back (not necessarily through meta).
> You don't have virtio-net hdr for the xdp case (hdr_valid is false there).
> 
>>
>> The page_address(page) + offset we pass to page_to_skb() points to the
>> virtio-net hdr.
>>
>> So I think it might be better to change it this way.
>>
>> Thanks.
>>
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..086ae835ec86 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>>                                                        len, PAGE_SIZE, false,
>>                                                        metasize,
>> -                                                      VIRTIO_XDP_HEADROOM);
>> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>>                                 return head_skb;
>>                         }
>>                         break;
> 
> That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
> So just doing that would take care of the meta, but won't take care of moving data.
> 

Also it doesn't take care of the case where page_to_skb() is called with the original page
i.e. when we already have headroom, so we hit the next/standard page_to_skb() call (xdp_page == page).

The above change guarantees that buf and p will be in the same page and the skb_reserve() call will
make skb->data point to p - buf, i.e. to the beginning of the valid data in that page.
Unfortunately the new headroom will not be correct if it is a frag, it will be longer.



^ permalink raw reply

* Re: [PATCH net] virtio_net: fix wrong buf address calculation when using xdp
From: Nikolay Aleksandrov @ 2022-04-23 14:16 UTC (permalink / raw)
  To: Xuan Zhuo
  Cc: kuba, davem, stable, Jason Wang, Daniel Borkmann,
	Michael S. Tsirkin, virtualization, netdev
In-Reply-To: <1650720683.8168066-1-xuanzhuo@linux.alibaba.com>

On 23/04/2022 16:31, Xuan Zhuo wrote:
> On Sat, 23 Apr 2022 14:26:12 +0300, Nikolay Aleksandrov <razor@blackwall.org> wrote:
>> We received a report[1] of kernel crashes when Cilium is used in XDP
>> mode with virtio_net after updating to newer kernels. After
>> investigating the reason it turned out that when using mergeable bufs
>> with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf()
>> calculates the build_skb address wrong because the offset can become less
>> than the headroom so it gets the address of the previous page (-X bytes
>> depending on how lower offset is):
>>  page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256
>>
>> This is a pr_err() I added in the beginning of page_to_skb which clearly
>> shows offset that is less than headroom by adding 4 bytes of metadata
>> via an xdp prog. The calculations done are:
>>  receive_mergeable():
>>  headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes
>>  offset = xdp.data - page_address(xdp_page) -
>>           vi->hdr_len - metasize;
>>
>>  page_to_skb():
>>  p = page_address(page) + offset;
>>  ...
>>  buf = p - headroom;
>>
>> Now buf goes -4 bytes from the page's starting address as can be seen
>> above which is set as skb->head and skb->data by build_skb later. Depending
>> on what's done with the skb (when it's freed most often) we get all kinds
>> of corruptions and BUG_ON() triggers in mm[2]. The story of the faulty
>> commit is interesting because the patch was sent and applied twice (it
>> seems the first one got lost during merge back in 5.13 window). The
>> first version of the patch that was applied as:
>>  commit 7bf64460e3b2 ("virtio-net: get build_skb() buf by data ptr")
>> was actually correct because it calculated the page starting address
>> without relying on offset or headroom, but then the second version that
>> was applied as:
>>  commit 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> was wrong and added the above calculation.
>> An example xdp prog[3] is below.
>>
>> [1] https://github.com/cilium/cilium/issues/19453
>>
>> [2] Two of the many traces:
>>  [   40.437400] BUG: Bad page state in process swapper/0  pfn:14940
>>  [   40.916726] BUG: Bad page state in process systemd-resolve  pfn:053b7
>>  [   41.300891] kernel BUG at include/linux/mm.h:720!
>>  [   41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>  [   41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G    B   W         5.18.0-rc1+ #37
>>  [   41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   41.306018] RIP: 0010:page_frag_free+0x79/0xe0
>>  [   41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6
>>  [   41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292
>>  [   41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000
>>  [   41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff
>>  [   41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff
>>  [   41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600
>>  [   41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c
>>  [   41.317700] FS:  00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000
>>  [   41.319150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  [   41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0
>>  [   41.321387] Call Trace:
>>  [   41.321819]  <TASK>
>>  [   41.322193]  skb_release_data+0x13f/0x1c0
>>  [   41.322902]  __kfree_skb+0x20/0x30
>>  [   41.343870]  tcp_recvmsg_locked+0x671/0x880
>>  [   41.363764]  tcp_recvmsg+0x5e/0x1c0
>>  [   41.384102]  inet_recvmsg+0x42/0x100
>>  [   41.406783]  ? sock_recvmsg+0x1d/0x70
>>  [   41.428201]  sock_read_iter+0x84/0xd0
>>  [   41.445592]  ? 0xffffffffa3000000
>>  [   41.462442]  new_sync_read+0x148/0x160
>>  [   41.479314]  ? 0xffffffffa3000000
>>  [   41.496937]  vfs_read+0x138/0x190
>>  [   41.517198]  ksys_read+0x87/0xc0
>>  [   41.535336]  do_syscall_64+0x3b/0x90
>>  [   41.551637]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  [   41.568050] RIP: 0033:0x48765b
>>  [   41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
>>  [   41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
>>  [   41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b
>>  [   41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016
>>  [   41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4
>>  [   41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9
>>  [   41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff
>>  [   41.744254]  </TASK>
>>  [   41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>
>>  and
>>
>>  [   33.524802] BUG: Bad page state in process systemd-network  pfn:11e60
>>  [   33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e
>>  [   33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60
>>  [   33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
>>  [   33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000
>>  [   33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000
>>  [   33.532471] page dumped because: nonzero mapcount
>>  [   33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net
>>  [   33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37
>>  [   33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014
>>  [   33.532484] Call Trace:
>>  [   33.532496]  <TASK>
>>  [   33.532500]  dump_stack_lvl+0x45/0x5a
>>  [   33.532506]  bad_page.cold+0x63/0x94
>>  [   33.532510]  free_pcp_prepare+0x290/0x420
>>  [   33.532515]  free_unref_page+0x1b/0x100
>>  [   33.532518]  skb_release_data+0x13f/0x1c0
>>  [   33.532524]  kfree_skb_reason+0x3e/0xc0
>>  [   33.532527]  ip6_mc_input+0x23c/0x2b0
>>  [   33.532531]  ip6_sublist_rcv_finish+0x83/0x90
>>  [   33.532534]  ip6_sublist_rcv+0x22b/0x2b0
>>
>> [3] XDP program to reproduce(xdp_pass.c):
>>  #include <linux/bpf.h>
>>  #include <bpf/bpf_helpers.h>
>>
>>  SEC("xdp_pass")
>>  int xdp_pkt_pass(struct xdp_md *ctx)
>>  {
>>           bpf_xdp_adjust_head(ctx, -(int)32);
>>           return XDP_PASS;
>>  }
>>
>>  char _license[] SEC("license") = "GPL";
>>
>>  compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o
>>  load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass
>>
>> CC: stable@vger.kernel.org
>> CC: Jason Wang <jasowang@redhat.com>
>> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> CC: Daniel Borkmann <daniel@iogearbox.net>
>> CC: "Michael S. Tsirkin" <mst@redhat.com>
>> CC: virtualization@lists.linux-foundation.org
>> Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr")
>> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
>> ---
>>  drivers/net/virtio_net.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 87838cbe38cf..0687dd88e97f 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -434,9 +434,13 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>>  	 * Buffers with headroom use PAGE_SIZE as alloc size, see
>>  	 * add_recvbuf_mergeable() + get_mergeable_buf_len()
>>  	 */
>> -	truesize = headroom ? PAGE_SIZE : truesize;
>> +	if (headroom) {
>> +		truesize = PAGE_SIZE;
>> +		buf = (char *)((unsigned long)p & PAGE_MASK);
> 
> The reason for not doing this is that buf and p may not be on the same page, and
> buf is probably not page-aligned.
> 
> The implementation of virtio-net merge is add_recvbuf_mergeable(), which
> allocates a large block of memory at one time, and allocates from it each time.
> Although in xdp mode, each allocation is page_size, it does not guarantee that
> each allocation is page-aligned .
> 
> The problem here is that the value of headroom is wrong, the package is
> structured like this:
> 
> from device    | headroom          | virtio-net hdr | data |
> after xdp      | headroom  |  virtio-net hdr | meta | data |

You're free to push data back (not necessarily through meta).
You don't have virtio-net hdr for the xdp case (hdr_valid is false there).

> 
> The page_address(page) + offset we pass to page_to_skb() points to the
> virtio-net hdr.
> 
> So I think it might be better to change it this way.
> 
> Thanks.
> 
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 87838cbe38cf..086ae835ec86 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1012,7 +1012,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>                                 head_skb = page_to_skb(vi, rq, xdp_page, offset,
>                                                        len, PAGE_SIZE, false,
>                                                        metasize,
> -                                                      VIRTIO_XDP_HEADROOM);
> +                                                      VIRTIO_XDP_HEADROOM - metazie);
>                                 return head_skb;
>                         }
>                         break;

That patch doesn't fix it, as I said with xdp you can move both data and data_meta.
So just doing that would take care of the meta, but won't take care of moving data.


^ permalink raw reply

* Re: [PATCH 2/2] net: dsa: mv88e6xxx: Handle single-chip-address OF property
From: Andrew Lunn @ 2022-04-23 14:07 UTC (permalink / raw)
  To: Nathan Rossi
  Cc: netdev, linux-kernel, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, David S. Miller, Jakub Kicinski, Paolo Abeni
In-Reply-To: <20220423131427.237160-2-nathan@nathanrossi.com>

On Sat, Apr 23, 2022 at 01:14:27PM +0000, Nathan Rossi wrote:
> Handle the parsing and use of single chip addressing when the switch has
> the single-chip-address property defined. This allows for specifying the
> switch as using single chip addressing even when mdio address 0 is used
> by another device on the bus. This is a feature of some switches (e.g.
> the MV88E6341/MV88E6141) where the switch shares the bus only responding
> to the higher 16 addresses.

Hi Nathan

I think i'm missing something in this explanation:

smi.c says:

/* The switch ADDR[4:1] configuration pins define the chip SMI device address
 * (ADDR[0] is always zero, thus only even SMI addresses can be strapped).
 *
 * When ADDR is all zero, the chip uses Single-chip Addressing Mode, assuming it
 * is the only device connected to the SMI master. In this mode it responds to
 * all 32 possible SMI addresses, and thus maps directly the internal devices.
 *
 * When ADDR is non-zero, the chip uses Multi-chip Addressing Mode, allowing
 * multiple devices to share the SMI interface. In this mode it responds to only
 * 2 registers, used to indirectly access the internal SMI devices.
 *
 * Some chips use a different scheme: Only the ADDR4 pin is used for
 * configuration, and the device responds to 16 of the 32 SMI
 * addresses, allowing two to coexist on the same SMI interface.
 */

So if ADDR = 0, it takes up the whole bus. And in this case reg = 0.
If ADDR != 0, it is in multi chip mode, and DT reg = ADDR.

int mv88e6xxx_smi_init(struct mv88e6xxx_chip *chip,
                       struct mii_bus *bus, int sw_addr)
{
        if (chip->info->dual_chip)
                chip->smi_ops = &mv88e6xxx_smi_dual_direct_ops;
        else if (sw_addr == 0)
                chip->smi_ops = &mv88e6xxx_smi_direct_ops;
        else if (chip->info->multi_chip)
                chip->smi_ops = &mv88e6xxx_smi_indirect_ops;
        else
                return -EINVAL;

This seems to implement what is above. smi_direct_ops == whole bus,
smi_indirect_ops == multi-chip mode.

In what situation do you see this not working? What device are you
using, what does you DT look like, and what at the ADDR value?

Thanks
	Andrew

^ permalink raw reply

* [PATCH bpf-next 4/4] bpftool: Generate helpers for pinning prog through bpf object skeleton
From: Yafang Shao @ 2022-04-23 14:00 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh
  Cc: netdev, bpf, Yafang Shao
In-Reply-To: <20220423140058.54414-1-laoar.shao@gmail.com>

After this change, with command 'bpftool gen skeleton XXX.bpf.o', the
helpers for pinning BPF prog will be generated in BPF object skeleton. It
could simplify userspace code which wants to pin bpf progs in bpffs.

The new helpers are named with __{pin, unpin}_prog, because it only pins
bpf progs. If the user also wants to pin bpf maps in bpffs, he can use
LIBBPF_PIN_BY_NAME.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/bpf/bpftool/gen.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c
index 8f76d8d9996c..1d06ebde723b 100644
--- a/tools/bpf/bpftool/gen.c
+++ b/tools/bpf/bpftool/gen.c
@@ -1087,6 +1087,8 @@ static int do_skeleton(int argc, char **argv)
 			static inline int load(struct %1$s *skel);	    \n\
 			static inline int attach(struct %1$s *skel);	    \n\
 			static inline void detach(struct %1$s *skel);	    \n\
+			static inline int pin_prog(struct %1$s *skel, const char *path);\n\
+			static inline void unpin_prog(struct %1$s *skel);   \n\
 			static inline void destroy(struct %1$s *skel);	    \n\
 			static inline const void *elf_bytes(size_t *sz);    \n\
 		#endif /* __cplusplus */				    \n\
@@ -1172,6 +1174,18 @@ static int do_skeleton(int argc, char **argv)
 		%1$s__detach(struct %1$s *obj)				    \n\
 		{							    \n\
 			bpf_object__detach_skeleton(obj->skeleton);	    \n\
+		}							    \n\
+									    \n\
+		static inline int					    \n\
+		%1$s__pin_prog(struct %1$s *obj, const char *path)	    \n\
+		{							    \n\
+			return bpf_object__pin_skeleton_prog(obj->skeleton, path);\n\
+		}							    \n\
+									    \n\
+		static inline void					    \n\
+		%1$s__unpin_prog(struct %1$s *obj)			    \n\
+		{							    \n\
+			bpf_object__unpin_skeleton_prog(obj->skeleton);	    \n\
 		}							    \n\
 		",
 		obj_name
@@ -1237,6 +1251,8 @@ static int do_skeleton(int argc, char **argv)
 		int %1$s::load(struct %1$s *skel) { return %1$s__load(skel); }		\n\
 		int %1$s::attach(struct %1$s *skel) { return %1$s__attach(skel); }	\n\
 		void %1$s::detach(struct %1$s *skel) { %1$s__detach(skel); }		\n\
+		int %1$s::pin_prog(struct %1$s *skel, const char *path) { return %1$s__pin_prog(skel, path); }\n\
+		void %1$s::unpin_prog(struct %1$s *skel) { %1$s__unpin_prog(skel); }	\n\
 		void %1$s::destroy(struct %1$s *skel) { %1$s__destroy(skel); }		\n\
 		const void *%1$s::elf_bytes(size_t *sz) { return %1$s__elf_bytes(sz); } \n\
 		#endif /* __cplusplus */				    \n\
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 1/4] libbpf: Define DEFAULT_BPFFS
From: Yafang Shao @ 2022-04-23 14:00 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh
  Cc: netdev, bpf, Yafang Shao
In-Reply-To: <20220423140058.54414-1-laoar.shao@gmail.com>

Let's use a macro DEFAULT_BPFFS instead of the hard-coded "/sys/fs/bpf".

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/lib/bpf/bpf_helpers.h | 2 +-
 tools/lib/bpf/libbpf.c      | 2 +-
 tools/lib/bpf/libbpf.h      | 6 ++++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 44df982d2a5c..9161ebcd3466 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -137,7 +137,7 @@ struct bpf_map_def {
 
 enum libbpf_pin_type {
 	LIBBPF_PIN_NONE,
-	/* PIN_BY_NAME: pin maps by name (in /sys/fs/bpf by default) */
+	/* PIN_BY_NAME: pin maps by name (in DEFAULT_BPFFS by default) */
 	LIBBPF_PIN_BY_NAME,
 };
 
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 9a213aaaac8a..13fcf91e9e0e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2180,7 +2180,7 @@ static int build_map_pin_path(struct bpf_map *map, const char *path)
 	int len;
 
 	if (!path)
-		path = "/sys/fs/bpf";
+		path = DEFAULT_BPFFS;
 
 	len = snprintf(buf, PATH_MAX, "%s/%s", path, bpf_map__name(map));
 	if (len < 0)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index cdbfee60ea3e..3784867811a4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -28,6 +28,8 @@ LIBBPF_API __u32 libbpf_major_version(void);
 LIBBPF_API __u32 libbpf_minor_version(void);
 LIBBPF_API const char *libbpf_version_string(void);
 
+#define DEFAULT_BPFFS "/sys/fs/bpf"
+
 enum libbpf_errno {
 	__LIBBPF_ERRNO__START = 4000,
 
@@ -91,7 +93,7 @@ struct bpf_object_open_opts {
 	bool relaxed_core_relocs;
 	/* maps that set the 'pinning' attribute in their definition will have
 	 * their pin_path attribute set to a file in this directory, and be
-	 * auto-pinned to that path on load; defaults to "/sys/fs/bpf".
+	 * auto-pinned to that path on load; defaults to DEFAULT_BPFFS.
 	 */
 	const char *pin_root_path;
 
@@ -190,7 +192,7 @@ bpf_object__open_xattr(struct bpf_object_open_attr *attr);
 
 enum libbpf_pin_type {
 	LIBBPF_PIN_NONE,
-	/* PIN_BY_NAME: pin maps by name (in /sys/fs/bpf by default) */
+	/* PIN_BY_NAME: pin maps by name (in DEFAULT_BPFFS by default) */
 	LIBBPF_PIN_BY_NAME,
 };
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 2/4] libbpf: Add helpers for pinning bpf prog through bpf object skeleton
From: Yafang Shao @ 2022-04-23 14:00 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh
  Cc: netdev, bpf, Yafang Shao
In-Reply-To: <20220423140058.54414-1-laoar.shao@gmail.com>

Currently there're helpers for allowing to open/load/attach BPF object
through BPF object skeleton. Let's also add helpers for pinning through
BPF object skeleton. It could simplify BPF userspace code which wants to
pin the progs into bpffs.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/lib/bpf/libbpf.c   | 59 ++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   |  4 +++
 tools/lib/bpf/libbpf.map |  2 ++
 3 files changed, 65 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 13fcf91e9e0e..e7ed6c53c525 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -12726,6 +12726,65 @@ void bpf_object__detach_skeleton(struct bpf_object_skeleton *s)
 	}
 }
 
+int bpf_object__pin_skeleton_prog(struct bpf_object_skeleton *s,
+				  const char *path)
+{
+	struct bpf_link *link;
+	int err;
+	int i;
+
+	if (!s->prog_cnt)
+		return libbpf_err(-EINVAL);
+
+	if (!path)
+		path = DEFAULT_BPFFS;
+
+	for (i = 0; i < s->prog_cnt; i++) {
+		char buf[PATH_MAX];
+		int len;
+
+		len = snprintf(buf, PATH_MAX, "%s/%s", path, s->progs[i].name);
+		if (len < 0) {
+			err = -EINVAL;
+			goto err_unpin_prog;
+		} else if (len >= PATH_MAX) {
+			err = -ENAMETOOLONG;
+			goto err_unpin_prog;
+		}
+
+		link = *s->progs[i].link;
+		if (!link) {
+			err = -EINVAL;
+			goto err_unpin_prog;
+		}
+
+		err = bpf_link__pin(link, buf);
+		if (err)
+			goto err_unpin_prog;
+	}
+
+	return 0;
+
+err_unpin_prog:
+	bpf_object__unpin_skeleton_prog(s);
+
+	return libbpf_err(err);
+}
+
+void bpf_object__unpin_skeleton_prog(struct bpf_object_skeleton *s)
+{
+	struct bpf_link *link;
+	int i;
+
+	for (i = 0; i < s->prog_cnt; i++) {
+		link = *s->progs[i].link;
+		if (!link || !link->pin_path)
+			continue;
+
+		bpf_link__unpin(link);
+	}
+}
+
 void bpf_object__destroy_skeleton(struct bpf_object_skeleton *s)
 {
 	if (!s)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 3784867811a4..af44b0968cca 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1427,6 +1427,10 @@ bpf_object__open_skeleton(struct bpf_object_skeleton *s,
 LIBBPF_API int bpf_object__load_skeleton(struct bpf_object_skeleton *s);
 LIBBPF_API int bpf_object__attach_skeleton(struct bpf_object_skeleton *s);
 LIBBPF_API void bpf_object__detach_skeleton(struct bpf_object_skeleton *s);
+LIBBPF_API int
+bpf_object__pin_skeleton_prog(struct bpf_object_skeleton *s, const char *path);
+LIBBPF_API void
+bpf_object__unpin_skeleton_prog(struct bpf_object_skeleton *s);
 LIBBPF_API void bpf_object__destroy_skeleton(struct bpf_object_skeleton *s);
 
 struct bpf_var_skeleton {
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 82f6d62176dd..4e3e37b84b3a 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -55,6 +55,8 @@ LIBBPF_0.0.1 {
 		bpf_object__unload;
 		bpf_object__unpin_maps;
 		bpf_object__unpin_programs;
+		bpf_object__pin_skeleton_prog;
+		bpf_object__unpin_skeleton_prog;
 		bpf_perf_event_read_simple;
 		bpf_prog_attach;
 		bpf_prog_detach;
-- 
2.17.1


^ permalink raw reply related

* [PATCH bpf-next 3/4] bpftool: Fix incorrect return in generated detach helper
From: Yafang Shao @ 2022-04-23 14:00 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh
  Cc: netdev, bpf, Yafang Shao
In-Reply-To: <20220423140058.54414-1-laoar.shao@gmail.com>

There is no return value of bpf_object__detach_skeleton(), so we'd
better not return it, that is formal.

Fixes: 5dc7a8b21144 ("bpftool, selftests/bpf: Embed object file inside skeleton")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/bpf/bpftool/gen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c
index 7678af364793..8f76d8d9996c 100644
--- a/tools/bpf/bpftool/gen.c
+++ b/tools/bpf/bpftool/gen.c
@@ -1171,7 +1171,7 @@ static int do_skeleton(int argc, char **argv)
 		static inline void					    \n\
 		%1$s__detach(struct %1$s *obj)				    \n\
 		{							    \n\
-			return bpf_object__detach_skeleton(obj->skeleton);  \n\
+			bpf_object__detach_skeleton(obj->skeleton);	    \n\
 		}							    \n\
 		",
 		obj_name
-- 
2.17.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox