Netdev List


* Re: [PATCH -next] macvlan: Fix memleak in macvlan_changelink_sources
From: David Miller @ 2020-06-20  3:15 UTC (permalink / raw)
  To: zhengbin13; +Cc: kuba, michael-dev, netdev, linux-kernel, yi.zhang
In-Reply-To: <20200618132629.659977-1-zhengbin13@huawei.com>

From: Zheng Bin <zhengbin13@huawei.com>
Date: Thu, 18 Jun 2020 21:26:29 +0800

> macvlan_changelink_sources
>   if (addr)
>     ret = macvlan_hash_add_source(vlan, addr)
>   nla_for_each_attr(nla, head, len, rem)
>     ret = macvlan_hash_add_source(vlan, addr)
>     -->If fail, need to free previous malloc memory
> 
> Fixes: 79cf79abce71 ("macvlan: add source mode")
> Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
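
The cleanup described in the quoted outline can be sketched in plain C. This is only an illustration under stated assumptions: struct src_table, table_add(), table_unwind() and table_add_all() are hypothetical names, not the macvlan structures or the submitted fix, and the fail_at parameter exists only to simulate a failed allocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-device source list; not macvlan code. */
struct src_entry {
	unsigned char addr[6];
	struct src_entry *next;
};

struct src_table {
	struct src_entry *head;
	size_t count;
};

/* Add one entry; returns 0 on success, -1 if the allocation fails. */
static int table_add(struct src_table *t, const unsigned char addr[6])
{
	struct src_entry *e;

	assert(t != NULL);
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	memcpy(e->addr, addr, 6);
	e->next = t->head;
	t->head = e;
	t->count++;
	return 0;
}

/* Free the n most recently added entries. */
static void table_unwind(struct src_table *t, size_t n)
{
	while (n-- && t->head) {
		struct src_entry *e = t->head;

		t->head = e->next;
		free(e);
		t->count--;
	}
}

/*
 * Add n addresses as one unit. fail_at simulates a failure at that index
 * (pass n or larger for no simulated failure). On any failure, every entry
 * added by this call is freed again before returning.
 */
static int table_add_all(struct src_table *t, const unsigned char (*addrs)[6],
			 size_t n, size_t fail_at)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == fail_at || table_add(t, addrs[i])) {
			table_unwind(t, i);
			return -1;
		}
	}
	return 0;
}
```

The point of the sketch is only that a failure in the middle of the loop leaves the table as it was before the call.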

Bug fixes should never be submitted against net-next.

They should instead be submitted against 'net'.

Thank you.

^ permalink raw reply

* [PATCH net-next 2/2] ipv6: fib6: avoid indirect calls from fib6_rule_lookup
From: Brian Vazquez @ 2020-06-20  3:14 UTC (permalink / raw)
  To: Brian Vazquez, Brian Vazquez, Eric Dumazet, David S . Miller
  Cc: linux-kernel, netdev, Luigi Rizzo
In-Reply-To: <20200620031419.219106-1-brianvv@google.com>

It was reported that a considerable number of cycles was spent on the
expensive indirect calls in fib6_rule_lookup. This patch introduces an
inline helper called pol_lookup_func that uses the indirect_call_wrappers
to avoid the indirect calls.

This patch saves around 50ns per call.

Performance was measured on the receiver by checking the number of
syncookies that the server was able to generate under a synflood load.

Traffic was generated using trafgen[1], which was pushing around 1Mpps on
a single queue. The receiver was using only one rx queue, which helps to
create a bottleneck and makes the experiment rx-bound.

These are the syncookies generated over 10s from the different runs:

Without the patch:
TcpExtSyncookiesSent            3553749            0.0
TcpExtSyncookiesSent            3550895            0.0
TcpExtSyncookiesSent            3553845            0.0
TcpExtSyncookiesSent            3541050            0.0
TcpExtSyncookiesSent            3539921            0.0
TcpExtSyncookiesSent            3557659            0.0
TcpExtSyncookiesSent            3526812            0.0
TcpExtSyncookiesSent            3536121            0.0
TcpExtSyncookiesSent            3529963            0.0
TcpExtSyncookiesSent            3536319            0.0

With the patch:
TcpExtSyncookiesSent            3611786            0.0
TcpExtSyncookiesSent            3596682            0.0
TcpExtSyncookiesSent            3606878            0.0
TcpExtSyncookiesSent            3599564            0.0
TcpExtSyncookiesSent            3601304            0.0
TcpExtSyncookiesSent            3609249            0.0
TcpExtSyncookiesSent            3617437            0.0
TcpExtSyncookiesSent            3608765            0.0
TcpExtSyncookiesSent            3620205            0.0
TcpExtSyncookiesSent            3601895            0.0

Without the patch the average is 354263 pkt/s or 2822 ns/pkt and with
the patch the average is 360738 pkt/s or 2772 ns/pkt which gives an
estimate of 50 ns per packet.
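
For reference, the arithmetic behind these averages can be reproduced with a small C sketch. The helper names avg_pps() and ns_per_pkt() are made up for this illustration; the counter values are the ones listed above, each covering a 10 s run.

```c
#include <assert.h>
#include <stddef.h>

/* Average packets per second, given per-run totals that each span secs seconds. */
static double avg_pps(const unsigned long *runs, size_t n, double secs)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0 && secs > 0.0);
	for (i = 0; i < n; i++)
		sum += (double)runs[i];
	return sum / (double)n / secs;
}

/* Mean time spent per packet, in nanoseconds, at a given packet rate. */
static double ns_per_pkt(double pps)
{
	return 1e9 / pps;
}

/* TcpExtSyncookiesSent over 10 s, copied from the runs listed above. */
static const unsigned long without_patch[10] = {
	3553749, 3550895, 3553845, 3541050, 3539921,
	3557659, 3526812, 3536121, 3529963, 3536319,
};

static const unsigned long with_patch[10] = {
	3611786, 3596682, 3606878, 3599564, 3601304,
	3609249, 3617437, 3608765, 3620205, 3601895,
};
```

Calling ns_per_pkt(avg_pps(without_patch, 10, 10.0)) and subtracting the same expression for with_patch gives the per-packet saving, which comes out at roughly 50 ns.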

[1] http://netsniff-ng.org/

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Cc: Luigi Rizzo <lrizzo@google.com>
---
 include/net/ip6_fib.h | 36 ++++++++++++++++++++++++++++++++++++
 net/ipv6/fib6_rules.c |  9 ++++++---
 net/ipv6/ip6_fib.c    |  3 ++-
 net/ipv6/route.c      |  8 ++++----
 4 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 3f615a29766e..0ff7e97216d4 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -19,6 +19,7 @@
 #include <net/netlink.h>
 #include <net/inetpeer.h>
 #include <net/fib_notifier.h>
+#include <linux/indirect_call_wrapper.h>
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 #define FIB6_TABLE_HASHSZ 256
@@ -552,6 +553,41 @@ struct bpf_iter__ipv6_route {
 };
 #endif
 
+INDIRECT_CALLABLE_DECLARE(struct rt6_info *ip6_pol_route_output(struct net *net,
+					     struct fib6_table *table,
+					     struct flowi6 *fl6,
+					     const struct sk_buff *skb,
+					     int flags));
+INDIRECT_CALLABLE_DECLARE(struct rt6_info *ip6_pol_route_input(struct net *net,
+					     struct fib6_table *table,
+					     struct flowi6 *fl6,
+					     const struct sk_buff *skb,
+					     int flags));
+INDIRECT_CALLABLE_DECLARE(struct rt6_info *__ip6_route_redirect(struct net *net,
+					     struct fib6_table *table,
+					     struct flowi6 *fl6,
+					     const struct sk_buff *skb,
+					     int flags));
+INDIRECT_CALLABLE_DECLARE(struct rt6_info *ip6_pol_route_lookup(struct net *net,
+					     struct fib6_table *table,
+					     struct flowi6 *fl6,
+					     const struct sk_buff *skb,
+					     int flags));
+static inline struct rt6_info *pol_lookup_func(pol_lookup_t lookup,
+						struct net *net,
+						struct fib6_table *table,
+						struct flowi6 *fl6,
+						const struct sk_buff *skb,
+						int flags)
+{
+	return INDIRECT_CALL_4(lookup,
+			       ip6_pol_route_lookup,
+			       ip6_pol_route_output,
+			       ip6_pol_route_input,
+			       __ip6_route_redirect,
+			       net, table, fl6, skb, flags);
+}
+
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 static inline bool fib6_has_custom_rules(const struct net *net)
 {
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index fafe556d21e0..6053ef851555 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -111,11 +111,13 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
 	} else {
 		struct rt6_info *rt;
 
-		rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
+		rt = pol_lookup_func(lookup,
+			     net, net->ipv6.fib6_local_tbl, fl6, skb, flags);
 		if (rt != net->ipv6.ip6_null_entry && rt->dst.error != -EAGAIN)
 			return &rt->dst;
 		ip6_rt_put_flags(rt, flags);
-		rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
+		rt = pol_lookup_func(lookup,
+			     net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
 		if (rt->dst.error != -EAGAIN)
 			return &rt->dst;
 		ip6_rt_put_flags(rt, flags);
@@ -226,7 +228,8 @@ static int __fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
 		goto out;
 	}
 
-	rt = lookup(net, table, flp6, arg->lookup_data, flags);
+	rt = pol_lookup_func(lookup,
+			     net, table, flp6, arg->lookup_data, flags);
 	if (rt != net->ipv6.ip6_null_entry) {
 		err = fib6_rule_saddr(net, rule, flags, flp6,
 				      ip6_dst_idev(&rt->dst)->dev);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 49ee89bbcba0..25a90f3f705c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -314,7 +314,8 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
 {
 	struct rt6_info *rt;
 
-	rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
+	rt = pol_lookup_func(lookup,
+			net, net->ipv6.fib6_main_tbl, fl6, skb, flags);
 	if (rt->dst.error == -EAGAIN) {
 		ip6_rt_put_flags(rt, flags);
 		rt = net->ipv6.ip6_null_entry;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 82cbb46a2a4f..5852039ca9cf 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1207,7 +1207,7 @@ static struct rt6_info *ip6_create_rt_rcu(const struct fib6_result *res)
 	return nrt;
 }
 
-static struct rt6_info *ip6_pol_route_lookup(struct net *net,
+INDIRECT_CALLABLE_SCOPE struct rt6_info *ip6_pol_route_lookup(struct net *net,
 					     struct fib6_table *table,
 					     struct flowi6 *fl6,
 					     const struct sk_buff *skb,
@@ -2274,7 +2274,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 }
 EXPORT_SYMBOL_GPL(ip6_pol_route);
 
-static struct rt6_info *ip6_pol_route_input(struct net *net,
+INDIRECT_CALLABLE_SCOPE struct rt6_info *ip6_pol_route_input(struct net *net,
 					    struct fib6_table *table,
 					    struct flowi6 *fl6,
 					    const struct sk_buff *skb,
@@ -2465,7 +2465,7 @@ void ip6_route_input(struct sk_buff *skb)
 						      &fl6, skb, flags));
 }
 
-static struct rt6_info *ip6_pol_route_output(struct net *net,
+INDIRECT_CALLABLE_SCOPE struct rt6_info *ip6_pol_route_output(struct net *net,
 					     struct fib6_table *table,
 					     struct flowi6 *fl6,
 					     const struct sk_buff *skb,
@@ -2912,7 +2912,7 @@ struct ip6rd_flowi {
 	struct in6_addr gateway;
 };
 
-static struct rt6_info *__ip6_route_redirect(struct net *net,
+INDIRECT_CALLABLE_SCOPE struct rt6_info *__ip6_route_redirect(struct net *net,
 					     struct fib6_table *table,
 					     struct flowi6 *fl6,
 					     const struct sk_buff *skb,
-- 
2.27.0.111.gc72c7da667-goog


^ permalink raw reply related

* [PATCH net-next 1/2] indirect_call_wrapper: extend indirect wrapper to support up to 4 calls
From: Brian Vazquez @ 2020-06-20  3:14 UTC (permalink / raw)
  To: Brian Vazquez, Brian Vazquez, Eric Dumazet, David S . Miller
  Cc: linux-kernel, netdev

There are many places where 2 annotations are not enough. This patch
adds INDIRECT_CALL_3 and INDIRECT_CALL_4 to cover such cases.
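
As a rough userspace illustration of how such chained wrappers behave (a sketch only, not the kernel header): the CALL_n macros, the op_* functions and run_op() below are hypothetical names, and likely() is emulated with the GCC/Clang builtin __builtin_expect.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's likely() hint (GCC/Clang builtin). */
#define likely(x) __builtin_expect(!!(x), 1)

/*
 * Compare the pointer against each known target in turn and call the match
 * directly; if nothing matches, fall back to the plain indirect call.
 */
#define CALL_1(f, f1, ...) \
	(likely((f) == (f1)) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))
#define CALL_2(f, f2, f1, ...) \
	(likely((f) == (f2)) ? f2(__VA_ARGS__) : CALL_1(f, f1, __VA_ARGS__))
#define CALL_3(f, f3, f2, f1, ...) \
	(likely((f) == (f3)) ? f3(__VA_ARGS__) : CALL_2(f, f2, f1, __VA_ARGS__))
#define CALL_4(f, f4, f3, f2, f1, ...) \
	(likely((f) == (f4)) ? f4(__VA_ARGS__) : CALL_3(f, f3, f2, f1, __VA_ARGS__))

typedef int (*op_fn)(int);

static int op_inc(int x) { return x + 1; }
static int op_dec(int x) { return x - 1; }
static int op_dbl(int x) { return x * 2; }
static int op_neg(int x) { return -x; }
static int op_sqr(int x) { return x * x; }	/* not listed below: uses the fallback */

/* Dispatch through the four-way wrapper; the result is the same as f(x). */
static int run_op(op_fn f, int x)
{
	assert(f != NULL);
	return CALL_4(f, op_inc, op_dec, op_dbl, op_neg, x);
}
```

The wrapper never changes the result; it only lets the compiler emit direct calls for the listed targets, which is where the saving comes from when indirect calls are expensive.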

Signed-off-by: Brian Vazquez <brianvv@google.com>
---
 include/linux/indirect_call_wrapper.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 00d7e8e919c6..54c02c84906a 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -23,6 +23,16 @@
 		likely(f == f2) ? f2(__VA_ARGS__) :			\
 				  INDIRECT_CALL_1(f, f1, __VA_ARGS__);	\
 	})
+#define INDIRECT_CALL_3(f, f3, f2, f1, ...)					\
+	({									\
+		likely(f == f3) ? f3(__VA_ARGS__) :				\
+				  INDIRECT_CALL_2(f, f2, f1, __VA_ARGS__);	\
+	})
+#define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...)					\
+	({									\
+		likely(f == f4) ? f4(__VA_ARGS__) :				\
+				  INDIRECT_CALL_3(f, f3, f2, f1, __VA_ARGS__);	\
+	})
 
 #define INDIRECT_CALLABLE_DECLARE(f)	f
 #define INDIRECT_CALLABLE_SCOPE
@@ -30,6 +40,8 @@
 #else
 #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__)
 #define INDIRECT_CALL_2(f, f2, f1, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_3(f, f3, f2, f1, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__)
 #define INDIRECT_CALLABLE_DECLARE(f)
 #define INDIRECT_CALLABLE_SCOPE		static
 #endif
-- 
2.27.0.111.gc72c7da667-goog


^ permalink raw reply related

* Re: [PATCH net v5 0/4] several fixes for indirect flow_blocks offload
From: David Miller @ 2020-06-20  3:13 UTC (permalink / raw)
  To: wenxu; +Cc: netdev, pablo, vladbu, simon.horman
In-Reply-To: <1592484551-16188-1-git-send-email-wenxu@ucloud.cn>

From: wenxu@ucloud.cn
Date: Thu, 18 Jun 2020 20:49:07 +0800

> From: wenxu <wenxu@ucloud.cn>
> 
> v2:
> patch2: store the cb_priv of representor to the flow_block_cb->indr.cb_priv
> in the driver. And make the correct check with the statement
> this->indr.cb_priv == cb_priv
> 
> patch4: del the driver list only in the indirect cleanup callbacks
> 
> v3:
> add the cover letter and changelogs.
> 
> v4:
> collapsed 1/4, 2/4, 4/4 in v3 to one fix
> Add the prepare patch 1 and 2
> 
> v5:
> patch1: place flow_indr_block_cb_alloc() right before
> flow_indr_dev_setup_offload() to avoid moving flow_block_indr_init()
> 
> This series fixes commit 1fac52da5942 ("net: flow_offload: consolidate
> indirect flow_block infrastructure") that revisits the flow_block
> infrastructure.
> 
> patch #1 #2: prepare for fix patch #3
> add and use flow_indr_block_cb_alloc/remove function
> 
> patch #3: fix flow_indr_dev_unregister path
> If the representor is removed, then identify the indirect flow_blocks
> that need to be removed by the release callback and the port representor
> structure. To identify the port representor structure, a new 
> indr.cb_priv field needs to be introduced. The flow_block also needs to
> be removed from the driver list from the cleanup path
> 
> 
> patch #4: fix the block->nooffloaddevcnt warning in the dmesg log.
> When an indr device is added and the offload succeeds,
> block->nooffloaddevcnt is 0. If the representor then goes away and the
> indr device goes away afterwards, the flow_block UNBIND operation fails
> with -EOPNOTSUPP, which leads to the warning in the dmesg log.
> block->nooffloaddevcnt should always be counted for an indr block, even
> if the indr block offload was successful: the representor may have gone
> away, and the ingress qdisc can still work in software mode.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net] geneve: allow changing DF behavior after creation
From: David Miller @ 2020-06-20  3:07 UTC (permalink / raw)
  To: sd; +Cc: netdev, sbrivio
In-Reply-To: <3b72fc01841507f8439a90f618ef6f6240b9463f.1592473442.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Thu, 18 Jun 2020 12:13:22 +0200

> Currently, trying to change the DF parameter of a geneve device does
> nothing:
> 
>     # ip -d link show geneve1
>     14: geneve1: <snip>
>         link/ether <snip>
>         geneve id 1 remote 10.0.0.1 ttl auto df set dstport 6081 <snip>
>     # ip link set geneve1 type geneve id 1 df unset
>     # ip -d link show geneve1
>     14: geneve1: <snip>
>         link/ether <snip>
>         geneve id 1 remote 10.0.0.1 ttl auto df set dstport 6081 <snip>
> 
> We just need to update the value in geneve_changelink.
> 
> Fixes: a025fb5f49ad ("geneve: Allow configuration of DF behaviour")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] enetc: Fix HW_VLAN_CTAG_TX|RX toggling
From: David Miller @ 2020-06-20  3:01 UTC (permalink / raw)
  To: claudiu.manoil; +Cc: netdev
In-Reply-To: <1592471812-13035-1-git-send-email-claudiu.manoil@nxp.com>

From: Claudiu Manoil <claudiu.manoil@nxp.com>
Date: Thu, 18 Jun 2020 12:16:52 +0300

> VLAN tag insertion/extraction offload is correctly
> activated at probe time but deactivation of this feature
> (i.e. via ethtool) is broken.  Toggling works only for
> Tx/Rx ring 0 of a PF, and is ignored for the other rings,
> including the VF rings.
> To fix this, the existing VLAN offload toggling code
> was extended to all the rings assigned to a netdevice,
> instead of the default ring 0 (likely a leftover from the
> early validation days of this feature).  And the code was
> moved to the common set_features() function to fix toggling
> for the VF driver too.
> 
> Fixes: d4fd0404c1c9 ("enetc: Introduce basic PF and VF ENETC ethernet drivers")
> Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Cong Wang @ 2020-06-20  3:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <20200620011409.GG237539@carbon.dhcp.thefacebook.com>

On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > I think so, though I'm not familiar with the bpf cgroup code.
> >
> > > If so, we might wanna fix it in a different way,
> > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > like in cgroup_put(). It feels more reliable to me.
> > >
> >
> > Yeah I also have this idea in my mind.
>
> I wonder if the following patch will fix the issue?

Interesting, AFAIU, this refcnt is for bpf programs attached
to the cgroup. By this suggestion, do you mean the root
cgroup does not need to refcnt the bpf programs attached
to it? This seems odd, as I don't see how root is different
from others in terms of bpf programs which can be attached
and detached in the same way.

I certainly understand the root cgroup is never gone, but this
does not mean the same is true of the bpf programs attached to it.

What am I missing?

Thanks.

^ permalink raw reply

* Re: [PATCH v2] net: macb: undo operations in case of failure
From: David Miller @ 2020-06-20  2:59 UTC (permalink / raw)
  To: claudiu.beznea
  Cc: nicolas.ferre, kuba, linux, antoine.tenart, netdev, linux-kernel
In-Reply-To: <1592469460-17825-1-git-send-email-claudiu.beznea@microchip.com>

From: Claudiu Beznea <claudiu.beznea@microchip.com>
Date: Thu, 18 Jun 2020 11:37:40 +0300

> Undo previously done operation in case macb_phylink_connect()
> fails. Since macb_reset_hw() is the 1st undo operation the
> napi_exit label was renamed to reset_hw.
> 
> Fixes: 7897b071ac3b ("net: macb: convert to phylink")
> Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
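
The undo pattern being described is the usual goto-based unwinding. A minimal sketch, assuming a made-up device with three setup steps (struct fake_dev, connect_phy() and fake_open() are hypothetical and do not mirror the macb driver):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device state, used only for this illustration. */
struct fake_dev {
	int hw_ready;
	int napi_on;
	int phy_up;
};

/* Last setup step; should_fail lets the caller simulate an error. */
static int connect_phy(struct fake_dev *d, int should_fail)
{
	if (should_fail)
		return -1;
	d->phy_up = 1;
	return 0;
}

static int fake_open(struct fake_dev *d, int phy_fails)
{
	int err;

	assert(d != NULL);
	d->hw_ready = 1;			/* step 1: initialise hardware */
	d->napi_on = 1;				/* step 2: enable polling */

	err = connect_phy(d, phy_fails);	/* step 3: may fail */
	if (err)
		goto reset_hw;
	return 0;

reset_hw:
	/* Undo the earlier steps in reverse order of setup. */
	d->napi_on = 0;
	d->hw_ready = 0;
	return err;
}
```

Naming the label after the first undo step, as the quoted patch does with reset_hw, keeps the label accurate if more steps are added to the error path later.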

Applied and queued up for -stable, thank you.

^ permalink raw reply

* Re: [PATCH net 0/3] rxrpc: Performance drop fix and other fixes
From: David Miller @ 2020-06-20  2:57 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-afs, linux-kernel
In-Reply-To: <159246661514.1229328.4419873299996950969.stgit@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Thu, 18 Jun 2020 08:50:15 +0100

> 
> Here are three fixes for rxrpc:
> 
>  (1) Fix a trace symbol mapping.  It doesn't seem to let you map to "".
> 
>  (2) Fix the handling of the remote receive window size when it increases
>      beyond the size we can support for our transmit window.
> 
>  (3) Fix a performance drop caused by retransmitted packets being
>      accidentally marked as already ACK'd.
> 
> The patches are tagged here:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
> 	rxrpc-fixes-20200618

Pulled, thanks David.

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Zefan Li @ 2020-06-20  2:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <20200620011409.GG237539@carbon.dhcp.thefacebook.com>

>>> If so, we might wanna fix it in a different way,
>>> just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
>>> like in cgroup_put(). It feels more reliable to me.
>>>
>>
>> Yeah I also have this idea in my mind.
> 
> I wonder if the following patch will fix the issue?
> 

I guess so, but it's better we have someone who reported this bug to
test it.

> --
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4598e4da6b1b..7eb51137d896 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -942,12 +942,14 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
>  #ifdef CONFIG_CGROUP_BPF
>  static inline void cgroup_bpf_get(struct cgroup *cgrp)
>  {
> -       percpu_ref_get(&cgrp->bpf.refcnt);
> +       if (!(cgrp->self.flags & CSS_NO_REF))
> +               percpu_ref_get(&cgrp->bpf.refcnt);
>  }
>  
>  static inline void cgroup_bpf_put(struct cgroup *cgrp)
>  {
> -       percpu_ref_put(&cgrp->bpf.refcnt);
> +       if (!(cgrp->self.flags & CSS_NO_REF))
> +               percpu_ref_put(&cgrp->bpf.refcnt);
>  }
>  
>  #else /* CONFIG_CGROUP_BPF */
> 
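
To make the effect of the quoted change concrete, here is a small standalone model of flag-guarded refcounting. It is a sketch under assumptions: struct obj, OBJ_NO_REF, obj_get() and obj_put() are hypothetical names modelled on the idea of CSS_NO_REF, and a plain counter stands in for the percpu_ref.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_NO_REF 0x1u	/* hypothetical flag, modelled on the idea of CSS_NO_REF */

struct obj {
	unsigned int flags;
	long refcnt;		/* plain counter standing in for a percpu_ref */
};

/* Take a reference, unless the object is marked as never refcounted. */
static void obj_get(struct obj *o)
{
	assert(o != NULL);
	if (!(o->flags & OBJ_NO_REF))
		o->refcnt++;
}

/* Drop a reference, unless the object is marked as never refcounted. */
static void obj_put(struct obj *o)
{
	assert(o != NULL);
	if (!(o->flags & OBJ_NO_REF))
		o->refcnt--;
}
```

With the flag set, an unbalanced sequence of get and put calls cannot move the counter, which is the property the quoted patch relies on for the root cgroup. Whether that is the right fix is the question being discussed in this thread.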


^ permalink raw reply

* Re: [PATCH 0/3] Add Marvell 88E1340S, 88E1548P support
From: David Miller @ 2020-06-20  2:47 UTC (permalink / raw)
  To: fido_max; +Cc: netdev, f.fainelli, hkallweit1, linux, kuba, andrew
In-Reply-To: <1592602289.1450270@f540.i.mail.ru>

From: Кочетков Максим <fido_max@inbox.ru>
Date: Sat, 20 Jun 2020 00:31:29 +0300

> It is based on 5.7.0

You need to post your patches against the tree onto which they
will be applied, which in this case is net-next.

^ permalink raw reply

* [PATCH resend] net: cxgb4: fix return error value in t4_prep_fw
From: Li Heng @ 2020-06-20  2:49 UTC (permalink / raw)
  To: vishal, davem, kuba, hariprasad; +Cc: netdev, linux-kernel, liheng40

t4_prep_fw() jumps to the bye label with a positive return value when
something goes wrong, so adap_init0() does not free its resources.
Fix it to return a negative value.
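
Why the sign matters can be shown with a short sketch. The functions below are hypothetical stand-ins, not the cxgb4 code; the only assumption is a caller that treats negative values as failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Two hypothetical variants of a preparation step that can fail. */
static int prep_positive(int broken) { return broken ? EINVAL : 0; }
static int prep_negative(int broken) { return broken ? -EINVAL : 0; }

/*
 * Caller that follows the "negative means failure" convention.
 * Returns 1 if its cleanup path ran, 0 if it carried on as if all was well.
 */
static int cleanup_runs(int (*prep)(int), int broken)
{
	int ret;

	assert(prep != NULL);
	ret = prep(broken);
	if (ret < 0)
		return 1;	/* error path: resources would be released here */
	return 0;		/* treated as success */
}
```

With the positive variant the failure is silently treated as success and the cleanup path is skipped; with the negative variant it is taken.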

Fixes: 16e47624e76b ("cxgb4: Add new scheme to update T4/T5 firmware")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Li Heng <liheng40@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---

resent with netdev cced

---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 2a3480f..9121cef 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -3493,7 +3493,7 @@ int t4_prep_fw(struct adapter *adap, struct fw_info *fw_info,
 	drv_fw = &fw_info->fw_hdr;

 	/* Read the header of the firmware on the card */
-	ret = -t4_read_flash(adap, FLASH_FW_START,
+	ret = t4_read_flash(adap, FLASH_FW_START,
 			    sizeof(*card_fw) / sizeof(uint32_t),
 			    (uint32_t *)card_fw, 1);
 	if (ret == 0) {
@@ -3522,8 +3522,8 @@ int t4_prep_fw(struct adapter *adap, struct fw_info *fw_info,
 		   should_install_fs_fw(adap, card_fw_usable,
 					be32_to_cpu(fs_fw->fw_ver),
 					be32_to_cpu(card_fw->fw_ver))) {
-		ret = -t4_fw_upgrade(adap, adap->mbox, fw_data,
-				     fw_size, 0);
+		ret = t4_fw_upgrade(adap, adap->mbox, fw_data,
+				    fw_size, 0);
 		if (ret != 0) {
 			dev_err(adap->pdev_dev,
 				"failed to install firmware: %d\n", ret);
@@ -3554,7 +3554,7 @@ int t4_prep_fw(struct adapter *adap, struct fw_info *fw_info,
 			FW_HDR_FW_VER_MICRO_G(c), FW_HDR_FW_VER_BUILD_G(c),
 			FW_HDR_FW_VER_MAJOR_G(k), FW_HDR_FW_VER_MINOR_G(k),
 			FW_HDR_FW_VER_MICRO_G(k), FW_HDR_FW_VER_BUILD_G(k));
-		ret = EINVAL;
+		ret = -EINVAL;
 		goto bye;
 	}

--
2.7.4


^ permalink raw reply related

* Re: [PATCH 0/6] Add Microchip MCP25XXFD CAN driver
From: Manivannan Sadhasivam @ 2020-06-20  2:43 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: wg, kernel, linux-can, netdev, linux-kernel
In-Reply-To: <20200618085533.GA26093@mani>

On Thu, Jun 18, 2020 at 02:25:33PM +0530, Manivannan Sadhasivam wrote:
> Hi,
> 
> On 0611, Marc Kleine-Budde wrote:
> > On 6/10/20 9:44 AM, Manivannan Sadhasivam wrote:
> > > Hello,
> > > 
> > > This series adds CAN network driver support for Microchip MCP25XXFD CAN
> > > Controller with MCP2517FD as the target controller version. This series is
> > > mostly inspired (or taken) from the previous iterations posted by Martin Sperl.
> > > I've trimmed down the parts which are not necessary for the initial version
> > > to ease review. Still the series is relatively huge but I hope to get some
> > > reviews (post -rcX ofc!).
> > > 
> > > Link to the original series posted by Martin:
> > > https://www.spinics.net/lists/devicetree/msg284462.html
> > > 
> > > I've not changed the functionality much but done some considerable amount of
> > > cleanups and also preserved the authorship of Martin for all the patches he has
> > > posted earlier. This series has been tested on 96Boards RB3 platform by myself
> > > and Martin has tested the previous version on Rpi3 with external MCP2517FD
> > > controller.
> > 
> > I initially started looking at Martin's driver and it was not using several
> > modern CAN driver infrastructures. I then posted some cleanup patches but Martin
> > was not working on the driver any more. Then I decided to rewrite the driver,
> > that is the one I'm hoping to mainline soon.
> > 
> 
> So how should we proceed from here? It is okay for me to work on adding some
> features and also fixing the issues you've reported so far. But I want to reach
> a consensus before moving forward.
> 
> If you think that it makes sense to go with your set of patches, then I need an
> estimate on when you'll post the first revision.
> 

Ping!

> > Can you give it a try?
> > 
> > https://github.com/marckleinebudde/linux/commits/v5.6-rpi/mcp25xxfd-20200607-41
> > 
> 
> Sure thing. Will do.
> 
> Thanks,
> Mani
> 
> > Marc
> > 
> > -- 
> > Pengutronix e.K.                 | Marc Kleine-Budde           |
> > Embedded Linux                   | https://www.pengutronix.de  |
> > Vertretung West/Dortmund         | Phone: +49-231-2826-924     |
> > Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

^ permalink raw reply

* Re: [PATCH net-next 1/3] net/sched: Introduce action hash
From: Ariel Levkovich @ 2020-06-20  1:13 UTC (permalink / raw)
  To: Davide Caratti, netdev
  Cc: jiri, kuba, jhs, xiyou.wangcong, ast, daniel, Jiri Pirko
In-Reply-To: <000266053809204e05e2dba71d62fab734cf6c97.camel@redhat.com>

On 6/19/20 12:13 PM, Davide Caratti wrote:
> hello Ariel,
>
> (I'm doing a resend because I suspect that my original reply was dropped
> somewhere).
>
> Thanks for your patch! some comments/questions below:
>
> On Fri, 2020-06-19 at 01:15 +0300, Ariel Levkovich wrote:
>> Allow setting a hash value to a packet for a future match.
>>
>> The action will determine the packet's hash result according to
>> the selected hash type.
>> The first option is to select a basic asymmetric l4 hash calculation
>> on the packet headers which will either use the skb->hash value as
>> if such was already calculated and set by the device driver, or it
>> will perform the kernel jenkins hash function on the packet which will
>> generate the result otherwise.
> If I understand correctly, this new tc action is going to change the skb
> metadata based on some operation done on the packet. Linux has already a
> tc module that does this job, it's act_skbedit.
>
> Wouldn't it be possible to extend act_skbedit instead of adding a new tc
> action? that would save us from some bugs we already encountered in the
> past (maybe I spotted a couple of them below), and we would also leverage on
> the existing tests.
>
>> The other option is for user to provide a BPF program which is
>> dedicated to calculate the hash. In such case the program is loaded
>> and used by tc to perform the hash calculation and provide it to
>> the hash action to be stored in skb->hash field.
>>
>> The BPF option can be useful for future HW offload support of the hash
>> calculation by emulating the HW hash function when it's different than
>> the kernel's but yet we want to maintain consistency between the SW and
>> the HW.
> Like Daniel noticed, this can be done by act_bpf. Using 'jump'
> or 'goto_chain' control actions, it should be possible to get to the same
> result combining act_skbedit and act_bpf. WDYT?


Hi Davide and Daniel,

First of all, thanks for your review and comments.


I'll try to address both of your comments regarding existing alternatives
to this new action here so that we can have a single thread about it.

Act_bpf indeed can provide a similar functionality. Furthermore, there are
already existing BPF helpers to allow user to change skb->hash within the
BPF program, so there's no need to perform act_skbedit after act_bpf.


However, since we are trying to offer the user multiple methods to
calculate the hash, and not only using a BPF program, act_bpf on its own
is not enough.

If we are looking at HW offload (as Daniel mentioned), like I mentioned in
the cover letter, it is important that SW will be able to get the same
hash result as in HW for a certain packet.

When certain HW is not able to forward TC the hash result, using a BPF
program that mimics the HW hash function is useful to maintain consistency
but there are cases where the HW can and does forward the hash value via
the received packet's metadata and the vendor driver already fills in the
skb->hash with this value. In such cases BPF program usage can be avoided.

So to sum it up, this api is offering user both ways to calculate the hash:

1. Use the value that is already there (If the vendor driver already set
   it. If not, calculate using Linux jhash).

2. Use a given BPF program to calculate the hash and to set skb->hash with
   it.
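
The two options can be modelled in a few lines of standalone C. This is only a sketch under assumptions: struct pkt, sw_hash(), prog_key_xor() and pick_hash() are hypothetical names, sw_hash() is a simple mixer standing in for jhash, and a function pointer stands in for the BPF program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet model; not struct sk_buff. */
struct pkt {
	uint32_t hash;		/* models skb->hash */
	int hash_valid;		/* non-zero if the driver already filled in hash */
	uint32_t flow_key;	/* stand-in for the packet's flow headers */
};

/* Function pointer standing in for a user-supplied hash program. */
typedef uint32_t (*hash_prog)(const struct pkt *);

/* Simple integer mixer standing in for the kernel's jhash. */
static uint32_t sw_hash(uint32_t key)
{
	key ^= key >> 16;
	key *= 0x45d9f3bu;
	key ^= key >> 16;
	return key;
}

/* Example "program": any deterministic function of the packet will do. */
static uint32_t prog_key_xor(const struct pkt *p)
{
	return p->flow_key ^ 0x5a5a5a5au;
}

/*
 * Option 2: if a program is supplied, its result replaces the hash.
 * Option 1: otherwise keep the driver-provided hash, or compute one in
 * software when the driver did not set it.
 */
static uint32_t pick_hash(struct pkt *p, hash_prog prog)
{
	assert(p != NULL);
	if (prog)
		p->hash = prog(p);
	else if (!p->hash_valid)
		p->hash = sw_hash(p->flow_key);
	p->hash_valid = 1;
	return p->hash;
}
```

In this model the first option never touches a hash that the driver already provided, which is the case where writing a BPF program can be avoided.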


It's true, you can cover both cases with BPF - meaning, always use BPF
even if HW/driver can provide hash to TC in other means but we thought
about giving an option to avoid writing and using BPF when it's not
necessary.

Appreciate your further comments and thoughts about this and of course,
the code comments will be addressed and fixed.


Ariel


>
>> Usage is as follows:
>>
>> $ tc filter add dev ens1f0_0 ingress \
>> prio 1 chain 0 proto ip \
>> flower ip_proto tcp \
>> action hash bpf object-file <bpf file> \
>> action goto chain 2
> [...]
>
>> diff --git a/include/net/act_api.h b/include/net/act_api.h
>> index 8c3934880670..b7e5d060bd2f 100644
>> --- a/include/net/act_api.h
>> +++ b/include/net/act_api.h
>> @@ -12,6 +12,8 @@
>>   #include <net/net_namespace.h>
>>   #include <net/netns/generic.h>
>>   
>> +#define ACT_BPF_NAME_LEN	256
>> +
> (BTW, line above seems to be a leftover. Correct?)
>
>>   struct tcf_idrinfo {
>>   	struct mutex	lock;
>>   	struct idr	action_idr;
>>
> [...]
>
>> new file mode 100644
>> index 000000000000..40a5c34f8745
>> --- /dev/null
>> +++ b/net/sched/act_hash.c
>> @@ -0,0 +1,376 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/* -
>> + * net/sched/act_hash.c  Hash calculation action
>> + *
>> + * Author:   Ariel Levkovich <lariel@mellanox.com>
>> + */
>> +
>> +#include <linux/init.h>
>> +#include <linux/kernel.h>
>> +#include <linux/module.h>
>> +#include <linux/rtnetlink.h>
>> +#include <linux/skbuff.h>
>> +#include <linux/filter.h>
>> +#include <net/netlink.h>
>> +#include <net/pkt_sched.h>
>> +#include <net/pkt_cls.h>
>> +#include <linux/tc_act/tc_hash.h>
>> +#include <net/tc_act/tc_hash.h>
>> +
>> +#define ACT_HASH_BPF_NAME_LEN	256
>> +
>> +static unsigned int hash_net_id;
>> +static struct tc_action_ops act_hash_ops;
>> +
>> +static int tcf_hash_act(struct sk_buff *skb, const struct tc_action *a,
>> +			struct tcf_result *res)
>> +{
>> +	struct tcf_hash *h = to_hash(a);
>> +	struct tcf_hash_params *p;
>> +	int action;
>> +	u32 hash;
>> +
>> +	tcf_lastuse_update(&h->tcf_tm);
>> +	tcf_action_update_bstats(&h->common, skb);
>> +
>> +	p = rcu_dereference_bh(h->hash_p);
>> +
>> +	action = READ_ONCE(h->tcf_action);
>> +
>> +	switch (p->alg) {
>> +	case TCA_HASH_ALG_L4:
>> +		hash = skb_get_hash(skb);
>> +		/* If a hash basis was provided, add it into
>> +		 * hash calculation here and re-set skb->hash
>> +		 * to the new result with sw_hash indication
>> +		 * and keeping the l4 hash indication.
>> +		 */
>> +		hash = jhash_1word(hash, p->basis);
>> +		__skb_set_sw_hash(skb, hash, skb->l4_hash);
> can you consider moving the above line to the data path of act_skbedit, and
> extend the control plane accordingly?
>
>> +		break;
>> +	case TCA_HASH_ALG_BPF:
> here the code is assuming that the action is at tc ingress. But
> theoretically we could install this action also on egress, nobody is
> forbidding that.  wouldn't it be better to add proper checks (or using
> directly act_bpf with appropriate control action, that already does this
> job)?
>
>> +		__skb_push(skb, skb->mac_len);
>> +		bpf_compute_data_pointers(skb);
>> +		hash = BPF_PROG_RUN(p->prog, skb);
>> +		__skb_pull(skb, skb->mac_len);
>> +		/* The BPF program hash function type is
>> +		 * unknown so only the sw hash bit is set.
>> +		 */
>> +		__skb_set_sw_hash(skb, hash, false);
>> +		break;
>> +	}
>> +	return action;
>> +}
>> +
>> +static const struct nla_policy hash_policy[TCA_HASH_MAX + 1] = {
>> +	[TCA_HASH_PARMS]	= { .type = NLA_EXACT_LEN, .len = sizeof(struct tc_hash) },
>> +	[TCA_HASH_ALG]		= { .type = NLA_U32 },
>> +	[TCA_HASH_BASIS]	= { .type = NLA_U32 },
>> +	[TCA_HASH_BPF_FD]	= { .type = NLA_U32 },
>> +	[TCA_HASH_BPF_NAME]	= { .type = NLA_NUL_STRING,
>> +				    .len = ACT_HASH_BPF_NAME_LEN },
>> +};
>> +
>> +static int tcf_hash_bpf_init(struct nlattr **tb, struct tcf_hash_params *params)
>> +{
>> +	struct bpf_prog *fp;
>> +	char *name = NULL;
>> +	u32 bpf_fd;
>> +
>> +	bpf_fd = nla_get_u32(tb[TCA_HASH_BPF_FD]);
> shouldn't we check for non-NULL tb[TCA_HASH_BPF_FD] to avoid a kernel crash here?
> please note, act_bpf does it:
>
> https://elixir.bootlin.com/linux/v5.8-rc1/source/net/sched/act_bpf.c#L337
>
> [...]
>
>> +static int tcf_hash_init(struct net *net, struct nlattr *nla,
>> +			 struct nlattr *est, struct tc_action **a,
>> +			 int replace, int bind, bool rtnl_held,
>> +			 struct tcf_proto *tp, u32 flags,
>> +			 struct netlink_ext_ack *extack)
>> +{
> [...]
>
>> +
>> +	if (!tb[TCA_HASH_ALG]) {
>> +		NL_SET_ERR_MSG_MOD(extack, "Missing hash algorithm selection");
>> +		err = -EINVAL;
>> +		goto cleanup;
>> +	}
>> +
>> +	p->alg = nla_get_u32(tb[TCA_HASH_ALG]);
> I don't understand why 'p->alg' is assigned and then validated. Wouldn't
> it be better to validate it earlier, and assign only when we know it's a
> good value? this would also avoid the spinlock imbalance below:
>
>> +	spin_lock_bh(&h->tcf_lock);
>> +
>> +	switch (p->alg) {
>> +	case TCA_HASH_ALG_L4:
>> +		break;
>> +	case TCA_HASH_ALG_BPF:
>> +		if (res != ACT_P_CREATED) {
>> +			params = rcu_dereference_protected(h->hash_p, 1);
>> +			old.prog = params->prog;
>> +			old.bpf_name = params->bpf_name;
>> +		}
>> +
>> +		err = tcf_hash_bpf_init(tb, p);
>> +		if (err)
>> +			goto cleanup;
> shouldn't we spin_unlock_bh() here?
>> +
>> +		break;
>> +	default:
>> +		NL_SET_ERR_MSG_MOD(extack, "Hash type not supported");
>> +		err = -EINVAL;
>> +		goto cleanup;
> shouldn't we spin_unlock_bh() here?
>
>> +	}
>> +	if (tb[TCA_HASH_BASIS])
>> +		p->basis = nla_get_u32(tb[TCA_HASH_BASIS]);
>> +
>> +	goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch);
>> +	p = rcu_replace_pointer(h->hash_p, p,
>> +				lockdep_is_held(&h->tcf_lock));
>> +	spin_unlock_bh(&h->tcf_lock);
>> +
>> +	if (goto_ch)
>> +		tcf_chain_put_by_act(goto_ch);
>> +	if (p)
>> +		kfree_rcu(p, rcu);
>> +
>> +	if (res == ACT_P_CREATED) {
>> +		tcf_idr_insert(tn, *a);
>> +	} else {
>> +		synchronize_rcu();
>> +		tcf_hash_bpf_cleanup(&old);
>> +	}
>> +
>> +	return res;
>> +
>> +cleanup:
>> +	if (goto_ch)
>> +		tcf_chain_put_by_act(goto_ch);
>> +	kfree(p);
>> +
>> +release_idr:
>> +	tcf_idr_release(*a, bind);
>> +	return err;
>> +}
> thank you in advance for any feedback!
>
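
A minimal user-space sketch of the validate-before-lock ordering suggested in the review above. Everything here is a hypothetical stand-in (struct fake_lock, struct hash_action, hash_action_update()), not the real tcf_lock or tcf_hash_init(). The point it illustrates: when the algorithm is checked before the lock is taken, the error path has nothing to unlock.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel API. */
enum hash_alg { HASH_ALG_L4 = 1, HASH_ALG_BPF = 2 };

/* A counter that models lock balance instead of a real spinlock. */
struct fake_lock { int depth; };

static void fake_lock_acquire(struct fake_lock *l)
{
	l->depth++;
}

static void fake_lock_release(struct fake_lock *l)
{
	assert(l->depth > 0); /* releasing an unheld lock is a bug */
	l->depth--;
}

struct hash_action {
	struct fake_lock lock;
	uint32_t alg;
	uint32_t basis;
};

static int hash_alg_is_valid(uint32_t alg)
{
	return alg == HASH_ALG_L4 || alg == HASH_ALG_BPF;
}

/* Validate first, then lock, assign, unlock. */
static int hash_action_update(struct hash_action *act, uint32_t alg,
			      uint32_t basis)
{
	if (!hash_alg_is_valid(alg))
		return -EINVAL; /* lock was never taken, nothing to undo */

	fake_lock_acquire(&act->lock);
	act->alg = alg;
	act->basis = basis;
	fake_lock_release(&act->lock);
	return 0;
}
```

With this ordering a rejected request also leaves the previously stored alg and basis untouched.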


^ permalink raw reply

* Re: [PATCH] mt76: mt76x2: fix pci suspend
From: kernel test robot @ 2020-06-20  2:31 UTC (permalink / raw)
  To: Lorenzo Bianconi, Oleksandr Natalenko
  Cc: kbuild-all, Lorenzo Bianconi, Felix Fietkau, Ryder Lee,
	Kalle Valo, Jakub Kicinski, Matthias Brugger, linux-wireless,
	netdev, linux-mediatek
In-Reply-To: <20200618111859.GC698688@lore-desk.lan>

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

Hi Lorenzo,

I love your patch! Yet something to improve:

[auto build test ERROR on wireless-drivers-next/master]
[also build test ERROR on v5.8-rc1 next-20200618]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Lorenzo-Bianconi/mt76-mt76x2-fix-pci-suspend/20200618-192056
base:   https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git master
config: x86_64-randconfig-a013-20200619 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
reproduce (this is a W=1 build):
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "mt76x02_dma_reset" [drivers/net/wireless/mediatek/mt76/mt76x2/mt76x2e.ko] undefined!

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37134 bytes --]

^ permalink raw reply

* [PATCH] net Add MODULE_DESCRIPTION entries to network modules
From: Rob Gill @ 2020-06-20  2:08 UTC (permalink / raw)
  To: netdev; +Cc: Rob Gill

The user tool modinfo is used to get information on kernel modules, including a
description where it is available.

This patch adds a brief MODULE_DESCRIPTION to the following modules:

9p
drop_monitor
esp4_offload
esp6_offload
fou
fou6
ila
sch_fq
sch_fq_codel
sch_hhf

Signed-off-by: Rob Gill <rrobgill@protonmail.com>
---
 net/9p/mod.c             | 1 +
 net/core/drop_monitor.c  | 1 +
 net/ipv4/esp4_offload.c  | 1 +
 net/ipv4/fou.c           | 1 +
 net/ipv6/esp6_offload.c  | 1 +
 net/ipv6/fou6.c          | 1 +
 net/ipv6/ila/ila_main.c  | 1 +
 net/sched/sch_fq.c       | 1 +
 net/sched/sch_fq_codel.c | 1 +
 net/sched/sch_hhf.c      | 1 +
 10 files changed, 10 insertions(+)

diff --git a/net/9p/mod.c b/net/9p/mod.c
index c1b62428d..512656685 100644
--- a/net/9p/mod.c
+++ b/net/9p/mod.c
@@ -189,3 +189,4 @@ MODULE_AUTHOR("Latchesar Ionkov <lucho@ionkov.net>");
 MODULE_AUTHOR("Eric Van Hensbergen <ericvh@gmail.com>");
 MODULE_AUTHOR("Ron Minnich <rminnich@lanl.gov>");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Plan 9 Resource Sharing Support (9P2000)");
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index 2ee7bc4c9..b09bebead 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -1721,3 +1721,4 @@ module_exit(exit_net_drop_monitor);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Neil Horman <nhorman@tuxdriver.com>");
 MODULE_ALIAS_GENL_FAMILY("NET_DM");
+MODULE_DESCRIPTION("Monitoring code for network dropped packet alerts");
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index d14133eac..5bda5aeda 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -361,3 +361,4 @@ module_exit(esp4_offload_exit);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
 MODULE_ALIAS_XFRM_OFFLOAD_TYPE(AF_INET, XFRM_PROTO_ESP);
+MODULE_DESCRIPTION("IPV4 GSO/GRO offload support");
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index dcc79ff54..abd083415 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -1304,3 +1304,4 @@ module_init(fou_init);
 module_exit(fou_fini);
 MODULE_AUTHOR("Tom Herbert <therbert@google.com>");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Foo over UDP");
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 55addea19..1ca516fb3 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -395,3 +395,4 @@ module_exit(esp6_offload_exit);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Steffen Klassert <steffen.klassert@secunet.com>");
 MODULE_ALIAS_XFRM_OFFLOAD_TYPE(AF_INET6, XFRM_PROTO_ESP);
+MODULE_DESCRIPTION("IPV6 GSO/GRO offload support");
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
index 091f94184..430518ae2 100644
--- a/net/ipv6/fou6.c
+++ b/net/ipv6/fou6.c
@@ -224,3 +224,4 @@ module_init(fou6_init);
 module_exit(fou6_fini);
 MODULE_AUTHOR("Tom Herbert <therbert@google.com>");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Foo over UDP (IPv6)");
diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c
index 257d2b681..36c58aa25 100644
--- a/net/ipv6/ila/ila_main.c
+++ b/net/ipv6/ila/ila_main.c
@@ -120,3 +120,4 @@ module_init(ila_init);
 module_exit(ila_fini);
 MODULE_AUTHOR("Tom Herbert <tom@herbertland.com>");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("IPv6: Identifier Locator Addressing (ILA)");
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 8f06a808c..2fb76fc0c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -1075,3 +1075,4 @@ module_init(fq_module_init)
 module_exit(fq_module_exit)
 MODULE_AUTHOR("Eric Dumazet");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Fair Queue Packet Scheduler");
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 436160be9..459a78405 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -721,3 +721,4 @@ module_init(fq_codel_module_init)
 module_exit(fq_codel_module_exit)
 MODULE_AUTHOR("Eric Dumazet");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Fair Queue CoDel discipline");
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index be35f03b6..420ede875 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -721,3 +721,4 @@ module_exit(hhf_module_exit)
 MODULE_AUTHOR("Terry Lam");
 MODULE_AUTHOR("Nandita Dukkipati");
 MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Heavy-Hitter Filter (HHF)");
-- 
2.17.1
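
For context on what modinfo reads: MODULE_DESCRIPTION() stores a "description=..." string in the module's .modinfo section, next to the other key=value entries, each NUL-terminated. The sketch below is a user-space illustration of looking up one key in such a buffer. The helper name modinfo_get() and the sample buffer are assumptions made for this example; this is not the code modinfo itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up "key=value" in a buffer of NUL-terminated strings packed back to
 * back, the layout used for module metadata in the .modinfo section.
 * Returns a pointer to the value, or NULL if the key is absent.
 * (Hypothetical helper for illustration only.)
 */
static const char *modinfo_get(const char *buf, size_t len, const char *key)
{
	size_t klen, off = 0;

	assert(buf != NULL && key != NULL);
	klen = strlen(key);

	while (off < len) {
		const char *entry = buf + off;
		const char *end = memchr(entry, '\0', len - off);
		size_t elen;

		if (!end)
			break; /* unterminated tail: stop rather than overrun */
		elen = (size_t)(end - entry);
		/* entry must be longer than the key and have '=' right after it */
		if (elen > klen && entry[klen] == '=' &&
		    strncmp(entry, key, klen) == 0)
			return entry + klen + 1;
		off += elen + 1; /* skip the entry and its terminating NUL */
	}
	return NULL;
}
```

A module without MODULE_DESCRIPTION() simply has no "description" entry, so a lookup like this one returns NULL.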



^ permalink raw reply related

* Re: [PATCH v2 net-next] ipv6: icmp6: avoid indirect call for icmpv6_send()
From: kernel test robot @ 2020-06-20  1:53 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: kbuild-all, netdev, Eric Dumazet, Jakub Kicinski
In-Reply-To: <20200619190259.170189-1-edumazet@google.com>

[-- Attachment #1: Type: text/plain, Size: 8538 bytes --]

Hi Eric,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/ipv6-icmp6-avoid-indirect-call-for-icmpv6_send/20200620-030444
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 0fb9fbab405351aa0c18973881c4103e4da886b6
config: nds32-randconfig-r002-20200619 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

In file included from ./arch/nds32/include/generated/asm/bug.h:1,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from include/asm-generic/preempt.h:5,
from ./arch/nds32/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:78,
from include/linux/spinlock.h:51,
from include/linux/seqlock.h:36,
from include/linux/time.h:6,
from include/linux/stat.h:19,
from include/linux/module.h:13,
from net/ipv6/icmp.c:30:
include/linux/dma-mapping.h: In function 'dma_map_resource':
arch/nds32/include/asm/memory.h:82:32: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
82 | #define pfn_valid(pfn)  ((pfn) >= PHYS_PFN_OFFSET && (pfn) < (PHYS_PFN_OFFSET + max_mapnr))
|                                ^~
include/asm-generic/bug.h:144:27: note: in definition of macro 'WARN_ON_ONCE'
144 |  int __ret_warn_once = !!(condition);
|                           ^~~~~~~~~
include/linux/dma-mapping.h:352:19: note: in expansion of macro 'pfn_valid'
352 |  if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
|                   ^~~~~~~~~
net/ipv6/icmp.c: At top level:
>> net/ipv6/icmp.c:442:6: warning: no previous prototype for 'icmp6_send' [-Wmissing-prototypes]
442 | void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
|      ^~~~~~~~~~
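
For readers unfamiliar with this diagnostic: -Wmissing-prototypes fires when a non-static function is defined without a declaration having been seen first. The sketch below shows the usual shape of the fix in plain C. The function clamp_reply_len() is a hypothetical example, not part of net/ipv6/icmp.c, and the declaration would normally sit in a header shared with the callers.

```c
#include <assert.h>

/*
 * Declaration that would normally live in a shared header. Seeing it before
 * the definition is what satisfies -Wmissing-prototypes.
 */
int clamp_reply_len(int len, int max_len);

/* Non-static definition; without the declaration above, the warning fires. */
int clamp_reply_len(int len, int max_len)
{
	assert(max_len >= 0);
	if (len < 0)
		return -1; /* reject negative lengths */
	return len < max_len ? len : max_len;
}
```

If a function is only used inside one file, marking it static is the other common way to silence the warning.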

vim +/icmp6_send +442 net/ipv6/icmp.c

   438	
   439	/*
   440	 *	Send an ICMP message in response to a packet in error
   441	 */
 > 442	void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
   443			const struct in6_addr *force_saddr)
   444	{
   445		struct inet6_dev *idev = NULL;
   446		struct ipv6hdr *hdr = ipv6_hdr(skb);
   447		struct sock *sk;
   448		struct net *net;
   449		struct ipv6_pinfo *np;
   450		const struct in6_addr *saddr = NULL;
   451		struct dst_entry *dst;
   452		struct icmp6hdr tmp_hdr;
   453		struct flowi6 fl6;
   454		struct icmpv6_msg msg;
   455		struct ipcm6_cookie ipc6;
   456		int iif = 0;
   457		int addr_type = 0;
   458		int len;
   459		u32 mark;
   460	
   461		if ((u8 *)hdr < skb->head ||
   462		    (skb_network_header(skb) + sizeof(*hdr)) > skb_tail_pointer(skb))
   463			return;
   464	
   465		if (!skb->dev)
   466			return;
   467		net = dev_net(skb->dev);
   468		mark = IP6_REPLY_MARK(net, skb->mark);
   469		/*
   470		 *	Make sure we respect the rules
   471		 *	i.e. RFC 1885 2.4(e)
   472		 *	Rule (e.1) is enforced by not using icmp6_send
   473		 *	in any code that processes icmp errors.
   474		 */
   475		addr_type = ipv6_addr_type(&hdr->daddr);
   476	
   477		if (ipv6_chk_addr(net, &hdr->daddr, skb->dev, 0) ||
   478		    ipv6_chk_acast_addr_src(net, skb->dev, &hdr->daddr))
   479			saddr = &hdr->daddr;
   480	
   481		/*
   482		 *	Dest addr check
   483		 */
   484	
   485		if (addr_type & IPV6_ADDR_MULTICAST || skb->pkt_type != PACKET_HOST) {
   486			if (type != ICMPV6_PKT_TOOBIG &&
   487			    !(type == ICMPV6_PARAMPROB &&
   488			      code == ICMPV6_UNK_OPTION &&
   489			      (opt_unrec(skb, info))))
   490				return;
   491	
   492			saddr = NULL;
   493		}
   494	
   495		addr_type = ipv6_addr_type(&hdr->saddr);
   496	
   497		/*
   498		 *	Source addr check
   499		 */
   500	
   501		if (__ipv6_addr_needs_scope_id(addr_type)) {
   502			iif = icmp6_iif(skb);
   503		} else {
   504			dst = skb_dst(skb);
   505			iif = l3mdev_master_ifindex(dst ? dst->dev : skb->dev);
   506		}
   507	
   508		/*
   509		 *	Must not send error if the source does not uniquely
   510		 *	identify a single node (RFC2463 Section 2.4).
   511		 *	We check unspecified / multicast addresses here,
   512		 *	and anycast addresses will be checked later.
   513		 */
   514		if ((addr_type == IPV6_ADDR_ANY) || (addr_type & IPV6_ADDR_MULTICAST)) {
   515			net_dbg_ratelimited("icmp6_send: addr_any/mcast source [%pI6c > %pI6c]\n",
   516					    &hdr->saddr, &hdr->daddr);
   517			return;
   518		}
   519	
   520		/*
   521		 *	Never answer to a ICMP packet.
   522		 */
   523		if (is_ineligible(skb)) {
   524			net_dbg_ratelimited("icmp6_send: no reply to icmp error [%pI6c > %pI6c]\n",
   525					    &hdr->saddr, &hdr->daddr);
   526			return;
   527		}
   528	
   529		/* Needed by both icmp_global_allow and icmpv6_xmit_lock */
   530		local_bh_disable();
   531	
   532		/* Check global sysctl_icmp_msgs_per_sec ratelimit */
   533		if (!(skb->dev->flags & IFF_LOOPBACK) && !icmpv6_global_allow(net, type))
   534			goto out_bh_enable;
   535	
   536		mip6_addr_swap(skb);
   537	
   538		sk = icmpv6_xmit_lock(net);
   539		if (!sk)
   540			goto out_bh_enable;
   541	
   542		memset(&fl6, 0, sizeof(fl6));
   543		fl6.flowi6_proto = IPPROTO_ICMPV6;
   544		fl6.daddr = hdr->saddr;
   545		if (force_saddr)
   546			saddr = force_saddr;
   547		if (saddr) {
   548			fl6.saddr = *saddr;
   549		} else if (!icmpv6_rt_has_prefsrc(sk, type, &fl6)) {
   550			/* select a more meaningful saddr from input if */
   551			struct net_device *in_netdev;
   552	
   553			in_netdev = dev_get_by_index(net, IP6CB(skb)->iif);
   554			if (in_netdev) {
   555				ipv6_dev_get_saddr(net, in_netdev, &fl6.daddr,
   556						   inet6_sk(sk)->srcprefs,
   557						   &fl6.saddr);
   558				dev_put(in_netdev);
   559			}
   560		}
   561		fl6.flowi6_mark = mark;
   562		fl6.flowi6_oif = iif;
   563		fl6.fl6_icmp_type = type;
   564		fl6.fl6_icmp_code = code;
   565		fl6.flowi6_uid = sock_net_uid(net, NULL);
   566		fl6.mp_hash = rt6_multipath_hash(net, &fl6, skb, NULL);
   567		security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
   568	
   569		sk->sk_mark = mark;
   570		np = inet6_sk(sk);
   571	
   572		if (!icmpv6_xrlim_allow(sk, type, &fl6))
   573			goto out;
   574	
   575		tmp_hdr.icmp6_type = type;
   576		tmp_hdr.icmp6_code = code;
   577		tmp_hdr.icmp6_cksum = 0;
   578		tmp_hdr.icmp6_pointer = htonl(info);
   579	
   580		if (!fl6.flowi6_oif && ipv6_addr_is_multicast(&fl6.daddr))
   581			fl6.flowi6_oif = np->mcast_oif;
   582		else if (!fl6.flowi6_oif)
   583			fl6.flowi6_oif = np->ucast_oif;
   584	
   585		ipcm6_init_sk(&ipc6, np);
   586		fl6.flowlabel = ip6_make_flowinfo(ipc6.tclass, fl6.flowlabel);
   587	
   588		dst = icmpv6_route_lookup(net, skb, sk, &fl6);
   589		if (IS_ERR(dst))
   590			goto out;
   591	
   592		ipc6.hlimit = ip6_sk_dst_hoplimit(np, &fl6, dst);
   593	
   594		msg.skb = skb;
   595		msg.offset = skb_network_offset(skb);
   596		msg.type = type;
   597	
   598		len = skb->len - msg.offset;
   599		len = min_t(unsigned int, len, IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr));
   600		if (len < 0) {
   601			net_dbg_ratelimited("icmp: len problem [%pI6c > %pI6c]\n",
   602					    &hdr->saddr, &hdr->daddr);
   603			goto out_dst_release;
   604		}
   605	
   606		rcu_read_lock();
   607		idev = __in6_dev_get(skb->dev);
   608	
   609		if (ip6_append_data(sk, icmpv6_getfrag, &msg,
   610				    len + sizeof(struct icmp6hdr),
   611				    sizeof(struct icmp6hdr),
   612				    &ipc6, &fl6, (struct rt6_info *)dst,
   613				    MSG_DONTWAIT)) {
   614			ICMP6_INC_STATS(net, idev, ICMP6_MIB_OUTERRORS);
   615			ip6_flush_pending_frames(sk);
   616		} else {
   617			icmpv6_push_pending_frames(sk, &fl6, &tmp_hdr,
   618						   len + sizeof(struct icmp6hdr));
   619		}
   620		rcu_read_unlock();
   621	out_dst_release:
   622		dst_release(dst);
   623	out:
   624		icmpv6_xmit_unlock(sk);
   625	out_bh_enable:
   626		local_bh_enable();
   627	}
   628	EXPORT_SYMBOL(icmp6_send);
   629	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28782 bytes --]

^ permalink raw reply

* Re: [PATCH v3 bpf-next 9/9] tools/bpftool: add documentation and sample output for process info
From: Quentin Monnet @ 2020-06-20  1:47 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Hao Luo, Arnaldo Carvalho de Melo,
	Song Liu
In-Reply-To: <20200619231703.738941-10-andriin@fb.com>

2020-06-19 16:17 UTC-0700 ~ Andrii Nakryiko <andriin@fb.com>
> Add statements about bpftool being able to discover process info, holding
> reference to BPF map, prog, link, or BTF. Show example output as well.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Reviewed-by: Quentin Monnet <quentin@isovalent.com>

Thanks!


^ permalink raw reply

* Re: [PATCH v3 bpf-next 8/9] tools/bpftool: show info for processes holding BPF map/prog/link/btf FDs
From: Quentin Monnet @ 2020-06-20  1:46 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Hao Luo, Arnaldo Carvalho de Melo,
	Song Liu
In-Reply-To: <20200619231703.738941-9-andriin@fb.com>

2020-06-19 16:17 UTC-0700 ~ Andrii Nakryiko <andriin@fb.com>
> Add bpf_iter-based way to find all the processes that hold open FDs against
> BPF object (map, prog, link, btf). bpftool always attempts to discover this,
> but will silently give up if kernel doesn't yet support bpf_iter BPF programs.
> Process name and PID are emitted for each process (task group).
> 
> Sample output for each of 4 BPF objects:
> 
> $ sudo ./bpftool prog show
> 2694: cgroup_device  tag 8c42dee26e8cd4c2  gpl
>         loaded_at 2020-06-16T15:34:32-0700  uid 0
>         xlated 648B  jited 409B  memlock 4096B
>         pids systemd(1)
> 2907: cgroup_skb  name egress  tag 9ad187367cf2b9e8  gpl
>         loaded_at 2020-06-16T18:06:54-0700  uid 0
>         xlated 48B  jited 59B  memlock 4096B  map_ids 2436
>         btf_id 1202
>         pids test_progs(2238417), test_progs(2238445)
> 
> $ sudo ./bpftool map show
> 2436: array  name test_cgr.bss  flags 0x400
>         key 4B  value 8B  max_entries 1  memlock 8192B
>         btf_id 1202
>         pids test_progs(2238417), test_progs(2238445)
> 2445: array  name pid_iter.rodata  flags 0x480
>         key 4B  value 4B  max_entries 1  memlock 8192B
>         btf_id 1214  frozen
>         pids bpftool(2239612)
> 
> $ sudo ./bpftool link show
> 61: cgroup  prog 2908
>         cgroup_id 375301  attach_type egress
>         pids test_progs(2238417), test_progs(2238445)
> 62: cgroup  prog 2908
>         cgroup_id 375344  attach_type egress
>         pids test_progs(2238417), test_progs(2238445)
> 
> $ sudo ./bpftool btf show
> 1202: size 1527B  prog_ids 2908,2907  map_ids 2436
>         pids test_progs(2238417), test_progs(2238445)
> 1242: size 34684B
>         pids bpftool(2258892)
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Reviewed-by: Quentin Monnet <quentin@isovalent.com>


^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Roman Gushchin @ 2020-06-20  1:14 UTC (permalink / raw)
  To: Zefan Li
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <f80878fe-bf2d-605a-50e4-bda97a1390c2@huawei.com>

On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> On 2020/6/20 8:51, Roman Gushchin wrote:
> > On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> >> On 2020/6/19 5:09, Cong Wang wrote:
> >>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>>>
> >>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>>>
> >>>>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>>>
> >>>>>> Thanks for fixing this.
> >>>>>>
> >>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>
> >>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>> to make it more readable.
> >>>>>>>
> >>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>>>
> >>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>> with systemd involved.
> >>>>>>
> >>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>
> >>>>> Good point.
> >>>>>
> >>>>> I took a deeper look, it looks like commit d979a39d7242e06
> >>>>> is the one to blame, because it is the first commit that began to
> >>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>
> >>>> I agree, it seems that the issue is not related to bpf and probably
> >>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>> seems closer to the origin.
> >>>
> >>> Yeah, I will update the Fixes tag and send V2.
> >>>
> >>
> >> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> >> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> >> but this is fine, because when the socket is to be freed:
> >>
> >>  sk_prot_free()
> >>    cgroup_sk_free()
> >>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>
> >> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> >>
> >> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt was not incremented
> >> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> >> that needs to be fixed.
> > 
> > Hm, does it mean that the problem always happens with the root cgroup?
> > 
> > From the stacktrace provided by Peter it looks like the problem
> > is bpf-related, but the original patch says nothing about it.
> > 
> > So from the test above it sounds like the problem is that we're trying
> > to release root's cgroup_bpf, which is a bad idea, I totally agree.
> > Is this the problem?
> 
> I think so, though I'm not familiar with the bpf cgroup code.
> 
> > If so, we might wanna fix it in a different way,
> > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > like in cgroup_put(). It feels more reliable to me.
> > 
> 
> Yeah I also have this idea in my mind.

I wonder if the following patch will fix the issue?

--

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..7eb51137d896 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -942,12 +942,14 @@ static inline bool cgroup_task_frozen(struct task_struct *task)
 #ifdef CONFIG_CGROUP_BPF
 static inline void cgroup_bpf_get(struct cgroup *cgrp)
 {
-       percpu_ref_get(&cgrp->bpf.refcnt);
+       if (!(cgrp->self.flags & CSS_NO_REF))
+               percpu_ref_get(&cgrp->bpf.refcnt);
 }
 
 static inline void cgroup_bpf_put(struct cgroup *cgrp)
 {
-       percpu_ref_put(&cgrp->bpf.refcnt);
+       if (!(cgrp->self.flags & CSS_NO_REF))
+               percpu_ref_put(&cgrp->bpf.refcnt);
 }
 
 #else /* CONFIG_CGROUP_BPF */
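
A small user-space sketch of the guard pattern the diff above proposes, assuming a hypothetical struct node with a NODE_NO_REF flag modelled on CSS_NO_REF. It is not the kernel's percpu_ref API. It only shows why an unpaired put on a no-ref object becomes harmless once both get and put check the flag.

```c
#include <assert.h>

#define NODE_NO_REF 0x1u /* hypothetical flag, modelled on CSS_NO_REF */

struct node {
	unsigned int flags;
	long refcnt;
};

/* Take a reference unless the object is marked as not reference counted. */
static void node_get(struct node *n)
{
	if (!(n->flags & NODE_NO_REF))
		n->refcnt++;
}

/* Drop a reference; a no-op for objects marked NODE_NO_REF. */
static void node_put(struct node *n)
{
	if (n->flags & NODE_NO_REF)
		return;
	assert(n->refcnt > 0); /* an unbalanced put would underflow */
	n->refcnt--;
}
```

The guard has to be applied on both sides. Checking the flag only in the put path would let gets accumulate on the root object without ever being dropped.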

^ permalink raw reply related

* Re: [PATCH v3 bpf-next 3/9] selftests/bpf: add __ksym extern selftest
From: Hao Luo @ 2020-06-20  1:09 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Arnaldo Carvalho de Melo, Song Liu,
	Quentin Monnet
In-Reply-To: <20200619231703.738941-4-andriin@fb.com>

Reviewed-by: Hao Luo <haoluo@google.com>


On Fri, Jun 19, 2020 at 4:19 PM Andrii Nakryiko <andriin@fb.com> wrote:
>
> Validate libbpf is able to handle weak and strong kernel symbol externs in BPF
> code correctly.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  .../testing/selftests/bpf/prog_tests/ksyms.c  | 71 +++++++++++++++++++
>  .../testing/selftests/bpf/progs/test_ksyms.c  | 32 +++++++++
>  2 files changed, 103 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/ksyms.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_ksyms.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/ksyms.c b/tools/testing/selftests/bpf/prog_tests/ksyms.c
> new file mode 100644
> index 000000000000..e3d6777226a8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/ksyms.c
> @@ -0,0 +1,71 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook */
> +
> +#include <test_progs.h>
> +#include "test_ksyms.skel.h"
> +#include <sys/stat.h>
> +
> +static int duration;
> +
> +static __u64 kallsyms_find(const char *sym)
> +{
> +       char type, name[500];
> +       __u64 addr, res = 0;
> +       FILE *f;
> +
> +       f = fopen("/proc/kallsyms", "r");
> +       if (CHECK(!f, "kallsyms_fopen", "failed to open: %d\n", errno))
> +               return 0;
> +
> +       while (fscanf(f, "%llx %c %499s%*[^\n]\n", &addr, &type, name) > 0) {
> +               if (strcmp(name, sym) == 0) {
> +                       res = addr;
> +                       goto out;
> +               }
> +       }
> +
> +       CHECK(false, "not_found", "symbol %s not found\n", sym);
> +out:
> +       fclose(f);
> +       return res;
> +}
> +
> +void test_ksyms(void)
> +{
> +       __u64 link_fops_addr = kallsyms_find("bpf_link_fops");
> +       const char *btf_path = "/sys/kernel/btf/vmlinux";
> +       struct test_ksyms *skel;
> +       struct test_ksyms__data *data;
> +       struct stat st;
> +       __u64 btf_size;
> +       int err;
> +
> +       if (CHECK(stat(btf_path, &st), "stat_btf", "err %d\n", errno))
> +               return;
> +       btf_size = st.st_size;
> +
> +       skel = test_ksyms__open_and_load();
> +       if (CHECK(!skel, "skel_open", "failed to open and load skeleton\n"))
> +               return;
> +
> +       err = test_ksyms__attach(skel);
> +       if (CHECK(err, "skel_attach", "skeleton attach failed: %d\n", err))
> +               goto cleanup;
> +
> +       /* trigger tracepoint */
> +       usleep(1);
> +
> +       data = skel->data;
> +       CHECK(data->out__bpf_link_fops != link_fops_addr, "bpf_link_fops",
> +             "got 0x%llx, exp 0x%llx\n",
> +             data->out__bpf_link_fops, link_fops_addr);
> +       CHECK(data->out__bpf_link_fops1 != 0, "bpf_link_fops1",
> +             "got %llu, exp %llu\n", data->out__bpf_link_fops1, (__u64)0);
> +       CHECK(data->out__btf_size != btf_size, "btf_size",
> +             "got %llu, exp %llu\n", data->out__btf_size, btf_size);
> +       CHECK(data->out__per_cpu_start != 0, "__per_cpu_start",
> +             "got %llu, exp %llu\n", data->out__per_cpu_start, (__u64)0);
> +
> +cleanup:
> +       test_ksyms__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_ksyms.c b/tools/testing/selftests/bpf/progs/test_ksyms.c
> new file mode 100644
> index 000000000000..6c9cbb5a3bdf
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_ksyms.c
> @@ -0,0 +1,32 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2019 Facebook */
> +
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +__u64 out__bpf_link_fops = -1;
> +__u64 out__bpf_link_fops1 = -1;
> +__u64 out__btf_size = -1;
> +__u64 out__per_cpu_start = -1;
> +
> +extern const void bpf_link_fops __ksym;
> +extern const void __start_BTF __ksym;
> +extern const void __stop_BTF __ksym;
> +extern const void __per_cpu_start __ksym;
> +/* non-existing symbol, weak, default to zero */
> +extern const void bpf_link_fops1 __ksym __weak;
> +
> +SEC("raw_tp/sys_enter")
> +int handler(const void *ctx)
> +{
> +       out__bpf_link_fops = (__u64)&bpf_link_fops;
> +       out__btf_size = (__u64)(&__stop_BTF - &__start_BTF);
> +       out__per_cpu_start = (__u64)&__per_cpu_start;
> +
> +       out__bpf_link_fops1 = (__u64)&bpf_link_fops1;
> +
> +       return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.24.1
>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 2/9] libbpf: add support for extracting kernel symbol addresses
From: Hao Luo @ 2020-06-20  1:08 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team, Arnaldo Carvalho de Melo, Song Liu,
	Quentin Monnet
In-Reply-To: <20200619231703.738941-3-andriin@fb.com>

Reviewed-by: Hao Luo <haoluo@google.com>


On Fri, Jun 19, 2020 at 4:19 PM Andrii Nakryiko <andriin@fb.com> wrote:
>
> Add support for another (in addition to existing Kconfig) special kind of
> externs in BPF code, kernel symbol externs. Such externs allow BPF code to
> "know" kernel symbol address and either use it for comparisons with kernel
> data structures (e.g., struct file's f_op pointer, to distinguish different
> kinds of file), or, with the help of bpf_probe_read_kernel(), to follow
> pointers and read data from global variables. Kernel symbol addresses are
> found through /proc/kallsyms, which should be present in the system.
>
> Currently, such kernel symbol variables are typeless: they have to be defined
> as `extern const void <symbol>` and the only operation you can do (in C code)
> with them is to take their address. Such an extern should reside in a special
> section '.ksyms'. bpf_helpers.h header provides __ksym macro for this. Strong
> vs weak semantics stays the same as with Kconfig externs. If symbol is not
> found in /proc/kallsyms, this will be a failure for strong (non-weak) extern,
> but will be defaulted to 0 for weak externs.
>
> If the same symbol is defined multiple times in /proc/kallsyms, then it is
> an error if any of the associated addresses differ. In that case, the address
> is ambiguous, so libbpf errs on the side of caution, rather than confusing
> the user with a randomly chosen address.
>
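
The resolution rules described so far (strong vs weak externs, and rejecting a symbol that shows up with conflicting addresses) can be sketched in user-space C as follows. This is an illustration under assumptions, not the libbpf implementation: the helper names ksym_resolve() and ksym_resolve_extern() are hypothetical, the input is an in-memory string in /proc/kallsyms format rather than the file itself, and the error codes are chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Resolve `sym` in kallsyms-formatted text ("<hex addr> <type> <name>" per
 * line). Returns 0 and stores the address on success, -ENOENT if the symbol
 * is absent, -EINVAL if it appears with conflicting addresses.
 */
static int ksym_resolve(const char *text, const char *sym,
			unsigned long long *addr)
{
	unsigned long long cur, found = 0;
	char type, name[500];
	int seen = 0;

	assert(text && sym && addr);

	while (*text) {
		const char *eol = strchr(text, '\n');

		if (sscanf(text, "%llx %c %499s", &cur, &type, name) == 3 &&
		    strcmp(name, sym) == 0) {
			if (seen && cur != found)
				return -EINVAL; /* ambiguous: addresses differ */
			found = cur;
			seen = 1;
		}
		if (!eol)
			break;
		text = eol + 1;
	}
	if (!seen)
		return -ENOENT;
	*addr = found;
	return 0;
}

/* Weak externs default to address 0 when the symbol is missing. */
static int ksym_resolve_extern(const char *text, const char *sym, int is_weak,
			       unsigned long long *addr)
{
	int err = ksym_resolve(text, sym, addr);

	if (err == -ENOENT && is_weak) {
		*addr = 0;
		return 0;
	}
	return err;
}
```

Repeated entries that agree on the address are accepted. Only a genuine conflict is treated as an error, and only a missing symbol is softened by the weak flag.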
> In the future, once the kernel is extended with BTF information for variables, such
> ksym externs will be supported in a typed version, which will allow BPF
> program to read variable's contents directly, similarly to how it's done for
> fentry/fexit input arguments.
>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
>  tools/lib/bpf/bpf_helpers.h |   1 +
>  tools/lib/bpf/btf.h         |   5 ++
>  tools/lib/bpf/libbpf.c      | 144 ++++++++++++++++++++++++++++++++++--
>  3 files changed, 144 insertions(+), 6 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index f67dce2af802..a510d8ed716f 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -75,5 +75,6 @@ enum libbpf_tristate {
>  };
>
>  #define __kconfig __attribute__((section(".kconfig")))
> +#define __ksym __attribute__((section(".ksyms")))
>
>  #endif
> diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
> index 70c1b7ec2bd0..06cd1731c154 100644
> --- a/tools/lib/bpf/btf.h
> +++ b/tools/lib/bpf/btf.h
> @@ -168,6 +168,11 @@ static inline bool btf_kflag(const struct btf_type *t)
>         return BTF_INFO_KFLAG(t->info);
>  }
>
> +static inline bool btf_is_void(const struct btf_type *t)
> +{
> +       return btf_kind(t) == BTF_KIND_UNKN;
> +}
> +
>  static inline bool btf_is_int(const struct btf_type *t)
>  {
>         return btf_kind(t) == BTF_KIND_INT;
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 4b021cb94e48..3fabc530290f 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -285,6 +285,7 @@ struct bpf_struct_ops {
>  #define BSS_SEC ".bss"
>  #define RODATA_SEC ".rodata"
>  #define KCONFIG_SEC ".kconfig"
> +#define KSYMS_SEC ".ksyms"
>  #define STRUCT_OPS_SEC ".struct_ops"
>
>  enum libbpf_map_type {
> @@ -330,6 +331,7 @@ struct bpf_map {
>  enum extern_type {
>         EXT_UNKNOWN,
>         EXT_KCFG,
> +       EXT_KSYM,
>  };
>
>  enum kcfg_type {
> @@ -357,6 +359,9 @@ struct extern_desc {
>                         int data_off;
>                         bool is_signed;
>                 } kcfg;
> +               struct {
> +                       unsigned long long addr;
> +               } ksym;
>         };
>  };
>
> @@ -2812,9 +2817,25 @@ static int cmp_externs(const void *_a, const void *_b)
>         return strcmp(a->name, b->name);
>  }
>
> +static int find_int_btf_id(const struct btf *btf)
> +{
> +       const struct btf_type *t;
> +       int i, n;
> +
> +       n = btf__get_nr_types(btf);
> +       for (i = 1; i <= n; i++) {
> +               t = btf__type_by_id(btf, i);
> +
> +               if (btf_is_int(t) && btf_int_bits(t) == 32)
> +                       return i;
> +       }
> +
> +       return 0;
> +}
> +
>  static int bpf_object__collect_externs(struct bpf_object *obj)
>  {
> -       struct btf_type *sec, *kcfg_sec = NULL;
> +       struct btf_type *sec, *kcfg_sec = NULL, *ksym_sec = NULL;
>         const struct btf_type *t;
>         struct extern_desc *ext;
>         int i, n, off;
> @@ -2895,6 +2916,17 @@ static int bpf_object__collect_externs(struct bpf_object *obj)
>                                 pr_warn("extern (kcfg) '%s' type is unsupported\n", ext_name);
>                                 return -ENOTSUP;
>                         }
> +               } else if (strcmp(sec_name, KSYMS_SEC) == 0) {
> +                       const struct btf_type *vt;
> +
> +                       ksym_sec = sec;
> +                       ext->type = EXT_KSYM;
> +
> +                       vt = skip_mods_and_typedefs(obj->btf, t->type, NULL);
> +                       if (!btf_is_void(vt)) {
> +                               pr_warn("extern (ksym) '%s' is not typeless (void)\n", ext_name);
> +                               return -ENOTSUP;
> +                       }
>                 } else {
>                         pr_warn("unrecognized extern section '%s'\n", sec_name);
>                         return -ENOTSUP;
> @@ -2908,6 +2940,46 @@ static int bpf_object__collect_externs(struct bpf_object *obj)
>         /* sort externs by type, for kcfg ones also by (align, size, name) */
>         qsort(obj->externs, obj->nr_extern, sizeof(*ext), cmp_externs);
>
> +       /* for .ksyms section, we need to turn all externs into allocated
> +        * variables in BTF to pass kernel verification; we do this by
> +        * pretending that each extern is a 4-byte variable
> +        */
> +       if (ksym_sec) {
> +               /* find existing 4-byte integer type in BTF to use for fake
> +                * extern variables in DATASEC
> +                */
> +               int int_btf_id = find_int_btf_id(obj->btf);
> +
> +               for (i = 0; i < obj->nr_extern; i++) {
> +                       ext = &obj->externs[i];
> +                       if (ext->type != EXT_KSYM)
> +                               continue;
> +                       pr_debug("extern (ksym) #%d: symbol %d, name %s\n",
> +                                i, ext->sym_idx, ext->name);
> +               }
> +
> +               sec = ksym_sec;
> +               n = btf_vlen(sec);
> +               for (i = 0, off = 0; i < n; i++, off += sizeof(int)) {
> +                       struct btf_var_secinfo *vs = btf_var_secinfos(sec) + i;
> +                       struct btf_type *vt;
> +
> +                       vt = (void *)btf__type_by_id(obj->btf, vs->type);
> +                       ext_name = btf__name_by_offset(obj->btf, vt->name_off);
> +                       ext = find_extern_by_name(obj, ext_name);
> +                       if (!ext) {
> +                               pr_warn("failed to find extern definition for BTF var '%s'\n",
> +                                       ext_name);
> +                               return -ESRCH;
> +                       }
> +                       btf_var(vt)->linkage = BTF_VAR_GLOBAL_ALLOCATED;
> +                       vt->type = int_btf_id;
> +                       vs->offset = off;
> +                       vs->size = sizeof(int);
> +               }
> +               sec->size = off;
> +       }
> +
>         if (kcfg_sec) {
>                 sec = kcfg_sec;
>                 /* for kcfg externs calculate their offsets within a .kconfig map */
> @@ -2919,7 +2991,7 @@ static int bpf_object__collect_externs(struct bpf_object *obj)
>
>                         ext->kcfg.data_off = roundup(off, ext->kcfg.align);
>                         off = ext->kcfg.data_off + ext->kcfg.sz;
> -                       pr_debug("extern #%d (kcfg): symbol %d, off %u, name %s\n",
> +                       pr_debug("extern (kcfg) #%d: symbol %d, off %u, name %s\n",
>                                  i, ext->sym_idx, ext->kcfg.data_off, ext->name);
>                 }
>                 sec->size = off;
> @@ -5009,9 +5081,14 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
>                         break;
>                 case RELO_EXTERN:
>                         ext = &obj->externs[relo->sym_off];
> -                       insn[0].src_reg = BPF_PSEUDO_MAP_VALUE;
> -                       insn[0].imm = obj->maps[obj->kconfig_map_idx].fd;
> -                       insn[1].imm = ext->kcfg.data_off;
> +                       if (ext->type == EXT_KCFG) {
> +                               insn[0].src_reg = BPF_PSEUDO_MAP_VALUE;
> +                               insn[0].imm = obj->maps[obj->kconfig_map_idx].fd;
> +                               insn[1].imm = ext->kcfg.data_off;
> +                       } else /* EXT_KSYM */ {
> +                               insn[0].imm = (__u32)ext->ksym.addr;
> +                               insn[1].imm = ext->ksym.addr >> 32;
> +                       }
>                         break;
>                 case RELO_CALL:
>                         err = bpf_program__reloc_text(prog, obj, relo);
> @@ -5630,10 +5707,58 @@ static int bpf_object__sanitize_maps(struct bpf_object *obj)
>         return 0;
>  }
>
> +static int bpf_object__read_kallsyms_file(struct bpf_object *obj)
> +{
> +       char sym_type, sym_name[500];
> +       unsigned long long sym_addr;
> +       struct extern_desc *ext;
> +       int ret, err = 0;
> +       FILE *f;
> +
> +       f = fopen("/proc/kallsyms", "r");
> +       if (!f) {
> +               err = -errno;
> +               pr_warn("failed to open /proc/kallsyms: %d\n", err);
> +               return err;
> +       }
> +
> +       while (true) {
> +               ret = fscanf(f, "%llx %c %499s%*[^\n]\n",
> +                            &sym_addr, &sym_type, sym_name);
> +               if (ret == EOF && feof(f))
> +                       break;
> +               if (ret != 3) {
> +                       pr_warn("failed to read kallsyms entry: %d\n", ret);
> +                       err = -EINVAL;
> +                       goto out;
> +               }
> +
> +               ext = find_extern_by_name(obj, sym_name);
> +               if (!ext || ext->type != EXT_KSYM)
> +                       continue;
> +
> +               if (ext->is_set && ext->ksym.addr != sym_addr) {
> +                       pr_warn("extern (ksym) '%s' resolution is ambiguous: 0x%llx or 0x%llx\n",
> +                               sym_name, ext->ksym.addr, sym_addr);
> +                       err = -EINVAL;
> +                       goto out;
> +               }
> +               if (!ext->is_set) {
> +                       ext->is_set = true;
> +                       ext->ksym.addr = sym_addr;
> +                       pr_debug("extern (ksym) %s=0x%llx\n", sym_name, sym_addr);
> +               }
> +       }
> +
> +out:
> +       fclose(f);
> +       return err;
> +}
> +
>  static int bpf_object__resolve_externs(struct bpf_object *obj,
>                                        const char *extra_kconfig)
>  {
> -       bool need_config = false;
> +       bool need_config = false, need_kallsyms = false;
>         struct extern_desc *ext;
>         void *kcfg_data = NULL;
>         int err, i;
> @@ -5663,6 +5788,8 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
>                 } else if (ext->type == EXT_KCFG &&
>                            strncmp(ext->name, "CONFIG_", 7) == 0) {
>                         need_config = true;
> +               } else if (ext->type == EXT_KSYM) {
> +                       need_kallsyms = true;
>                 } else {
>                         pr_warn("unrecognized extern '%s'\n", ext->name);
>                         return -EINVAL;
> @@ -5686,6 +5813,11 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
>                 if (err)
>                         return -EINVAL;
>         }
> +       if (need_kallsyms) {
> +               err = bpf_object__read_kallsyms_file(obj);
> +               if (err)
> +                       return -EINVAL;
> +       }
>         for (i = 0; i < obj->nr_extern; i++) {
>                 ext = &obj->externs[i];
>
> --
> 2.24.1
>
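
For illustration, the resolution rule described in the commit message above
(look the name up in kallsyms-formatted text, treat differing addresses for the
same name as ambiguous, then split the 64-bit address across the two 32-bit
immediates of the instruction pair) can be sketched in plain user-space C. This
is a minimal sketch, not the libbpf code from the patch: resolve_ksym(),
split_addr() and struct ksym_result are hypothetical names, and the input is an
in-memory string rather than the real /proc/kallsyms file.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the per-extern state; not libbpf's extern_desc. */
struct ksym_result {
	bool is_set;
	unsigned long long addr;
};

/*
 * Scan kallsyms-formatted text ("<hex addr> <type> <name> [module]" per line)
 * for `name`. Returns 0 on success (res->is_set tells whether it was found),
 * -1 on a malformed or overlong line, -2 if the symbol is listed with
 * different addresses.
 */
static int resolve_ksym(const char *text, const char *name,
			struct ksym_result *res)
{
	char sym_type, sym_name[500], buf[600];
	unsigned long long sym_addr;
	const char *line = text;

	res->is_set = false;
	res->addr = 0;

	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len >= sizeof(buf))
			return -1;
		/* copy one line so sscanf() cannot run into the next one */
		memcpy(buf, line, len);
		buf[len] = '\0';
		line += len + (eol ? 1 : 0);

		if (len == 0)
			continue; /* skip empty lines */
		if (sscanf(buf, "%llx %c %499s",
			   &sym_addr, &sym_type, sym_name) != 3)
			return -1;
		if (strcmp(sym_name, name) != 0)
			continue;
		/* same name, different address: refuse to pick one */
		if (res->is_set && res->addr != sym_addr)
			return -2;
		res->is_set = true;
		res->addr = sym_addr;
	}
	return 0;
}

/* Split a 64-bit address into the low and high 32-bit halves. */
static void split_addr(unsigned long long addr, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)addr;
	*hi = (uint32_t)(addr >> 32);
}
```

A caller would pass the file contents and a symbol name to resolve_ksym(),
check the return value and is_set, and only then call split_addr() on the
resolved address. A repeated entry with an identical address is accepted; only
a differing address is reported as ambiguous.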

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Zefan Li @ 2020-06-20  1:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <20200620005115.GE237539@carbon.dhcp.thefacebook.com>

On 2020/6/20 8:51, Roman Gushchin wrote:
> On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
>> On 2020/6/19 5:09, Cong Wang wrote:
>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>
>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>
>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>
>>>>>> Thanks for fixing this.
>>>>>>
>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>
>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>> to make it more readable.
>>>>>>>
>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>
>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>> with systemd involved.
>>>>>>
>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>
>>>>> Good point.
>>>>>
>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>> is the one to blame, because it is the first commit that began to
>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>
>>>> I agree, it seems that the issue is not related to bpf and probably
>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>> seems closer to the origin.
>>>
>>> Yeah, I will update the Fixes tag and send V2.
>>>
>>
>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>> but this is fine, because when the socket is to be freed:
>>
>>  sk_prot_free()
>>    cgroup_sk_free()
>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>
>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>
>> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt was not incremented
>> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
>> that needs to be fixed.
> 
> Hm, does it mean that the problem always happens with the root cgroup?
> 
> From the stacktrace provided by Peter it looks like the problem
> is bpf-related, but the original patch says nothing about it.
> 
> So from the test above it sounds like the problem is that we're trying
> to release root's cgroup_bpf, which is a bad idea, I totally agree.
> Is this the problem?

I think so, though I'm not familiar with the bpf cgroup code.

> If so, we might wanna fix it in a different way,
> just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> like in cgroup_put(). It feels more reliable to me.
> 

Yeah I also have this idea in my mind.
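
A toy user-space sketch of that idea, for illustration only: the put side
skips the decrement for an object flagged as not refcounted, which is the role
the root cgroup plays in the discussion above. All names here (TOY_NO_REF,
struct toy_cgroup, toy_get(), toy_put(), toy_put_unchecked()) are hypothetical,
and this is not the kernel's cgroup or cgroup_bpf code.

```c
#include <stdbool.h>

/* Hypothetical flag: the object is never refcounted (like a root object). */
#define TOY_NO_REF 0x1u

struct toy_cgroup {
	unsigned int flags;
	long refcnt;
};

/* Take a reference unless the object is flagged as not refcounted. */
static void toy_get(struct toy_cgroup *cg)
{
	if (!(cg->flags & TOY_NO_REF))
		cg->refcnt++;
}

/* Drop a reference, with the same flag check as the get side. */
static void toy_put(struct toy_cgroup *cg)
{
	if (!(cg->flags & TOY_NO_REF))
		cg->refcnt--;
}

/* Drop a reference without the check; models the unbalanced release. */
static void toy_put_unchecked(struct toy_cgroup *cg)
{
	cg->refcnt--;
}
```

With the check in place, a put on the flagged object leaves its count
untouched even when no matching get happened, while the unchecked variant
drives the count down.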

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Zefan Li @ 2020-06-20  0:51 UTC (permalink / raw)
  To: Cong Wang
  Cc: Roman Gushchin, Linux Kernel Network Developers,
	Cameron Berkenpas, Peter Geis, Lu Fengqi, Daniël Sonck,
	Daniel Borkmann, Tejun Heo
In-Reply-To: <459be87d-0272-9ea9-839a-823b01e354b6@huawei.com>

On 2020/6/20 8:45, Zefan Li wrote:
> On 2020/6/20 3:51, Cong Wang wrote:
>> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li <lizefan@huawei.com> wrote:
>>>
>>> On 2020/6/19 5:09, Cong Wang wrote:
>>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
>>>>>
>>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
>>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>>>>>>>
>>>>>>> Cc: Roman Gushchin <guro@fb.com>
>>>>>>>
>>>>>>> Thanks for fixing this.
>>>>>>>
>>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
>>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
>>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
>>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
>>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
>>>>>>>> even when cgroup_sk_alloc is disabled.
>>>>>>>>
>>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
>>>>>>>> would terminate this function if called there. And for sk_alloc()
>>>>>>>> skcd->val is always zero. So it's safe to factor out the code
>>>>>>>> to make it more readable.
>>>>>>>>
>>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>>>>>>>
>>>>>>> but I don't think the bug was introduced by this commit, because there
>>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
>>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
>>>>>>> classid in cgroupfs. This commit just made it much easier to happen
>>>>>>> with systemd involved.
>>>>>>>
>>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
>>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
>>>>>>
>>>>>> Good point.
>>>>>>
>>>>>> I take a deeper look, it looks like commit d979a39d7242e06
>>>>>> is the one to blame, because it is the first commit that began to
>>>>>> hold cgroup refcnt in cgroup_sk_alloc().
>>>>>
>>>>> I agree, it seems that the issue is not related to bpf and probably
>>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
>>>>> seems closer to the origin.
>>>>
>>>> Yeah, I will update the Fixes tag and send V2.
>>>>
>>>
>>> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
>>> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
>>> but this is fine, because when the socket is to be freed:
>>>
>>>  sk_prot_free()
>>>    cgroup_sk_free()
>>>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>>>
>>> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
>>
>> But skcd->val can be a pointer to a non-root cgroup:
> 
> It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug happens
> when cgroup_sk_alloc is disabled.
> 

And please read those recent bug reports, they all happened when bpf cgroup was in use,
and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Roman Gushchin @ 2020-06-20  0:51 UTC (permalink / raw)
  To: Zefan Li
  Cc: Cong Wang, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <4f17229e-1843-5bfc-ea2f-67ebaa9056da@huawei.com>

On Fri, Jun 19, 2020 at 02:40:19PM +0800, Zefan Li wrote:
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin <guro@fb.com> wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >>>>
> >>>> Cc: Roman Gushchin <guro@fb.com>
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd involved.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I take a deeper look, it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, it seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> > 
> > Yeah, I will update the Fixes tag and send V2.
> > 
> 
> Commit d979a39d7242e06 looks innocent to me. With this commit when cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be incremented,
> but this is fine, because when the socket is to be freed:
> 
>  sk_prot_free()
>    cgroup_sk_free()
>      cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> 
> cgroup_put() does nothing for the default root cgroup, so nothing bad will happen.
> 
> but cgroup_bpf_put() will decrement the bpf refcnt while this refcnt was not incremented
> as cgroup_sk_alloc has already been disabled. That's why I think it's 4bfc0bb2c60e2f4c
> that needs to be fixed.

Hm, does it mean that the problem always happens with the root cgroup?

From the stacktrace provided by Peter it looks like the problem
is bpf-related, but the original patch says nothing about it.

So from the test above it sounds like the problem is that we're trying
to release root's cgroup_bpf, which is a bad idea, I totally agree.
Is this the problem? If so, we might wanna fix it in a different way,
just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
like in cgroup_put(). It feels more reliable to me.

Thanks!


^ permalink raw reply
