netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry
@ 2024-07-08 13:31 Daniel Borkmann
  2024-07-08 13:31 ` [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release Daniel Borkmann
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Daniel Borkmann @ 2024-07-08 13:31 UTC (permalink / raw)
  To: martin.lau
  Cc: bpf, netdev, Daniel Borkmann, Pedro Pinto, Hyunwoo Kim, Wongi Lee

Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported
an issue that the tcx_entry can be released too early leading to a use
after free (UAF) when an active old-style ingress or clsact qdisc with a
shared tc block is later replaced by another ingress or clsact instance.

Essentially, the sequence to trigger the UAF (one example) can be as follows:

  1. A network namespace is created
  2. An ingress qdisc is created. This allocates a tcx_entry, and
     &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the
     same time, a tcf block with index 1 is created.
  3. chain0 is attached to the tcf block. chain0 must be connected to
     the block linked to the ingress qdisc to later reach the function
     tcf_chain0_head_change_cb_del() which triggers the UAF.
  4. Create and graft a clsact qdisc. This causes the ingress qdisc
     created in step 1 to be removed, thus freeing the previously linked
     tcx_entry:

     rtnetlink_rcv_msg()
       => tc_modify_qdisc()
         => qdisc_create()
           => clsact_init() [a]
         => qdisc_graft()
           => qdisc_destroy()
             => __qdisc_destroy()
               => ingress_destroy() [b]
                 => tcx_entry_free()
                   => kfree_rcu() // tcx_entry freed

  5. Finally, the network namespace is closed. This registers the
     cleanup_net worker, and during the process of releasing the
     remaining clsact qdisc, it accesses the tcx_entry that was
     already freed in step 4, causing the UAF to occur:

     cleanup_net()
       => ops_exit_list()
         => default_device_exit_batch()
           => unregister_netdevice_many()
             => unregister_netdevice_many_notify()
               => dev_shutdown()
                 => qdisc_put()
                   => clsact_destroy() [c]
                     => tcf_block_put_ext()
                       => tcf_chain0_head_change_cb_del()
                         => tcf_chain_head_change_item()
                           => clsact_chain_head_change()
                             => mini_qdisc_pair_swap() // UAF

There are also other variants, the gist is to add an ingress (or clsact)
qdisc with a specific shared block, then to replace that qdisc, waiting
for the tcx_entry kfree_rcu() to be executed and subsequently accessing
the current active qdisc's miniq one way or another.

The correct fix is to turn the miniq_active boolean into a counter. What
can be observed, at step 2 above, the counter transitions from 0->1, at
step [a] from 1->2 (in order for the miniq object to remain active during
the replacement), then in [b] from 2->1 and finally [c] 1->0 with the
eventual release. The reference counter in general ranges from [0,2] and
it does not need to be atomic since all access to the counter is protected
by the rtnl mutex. With this in place, there is no longer a UAF happening
and the tcx_entry is freed at the correct time.

Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Pedro Pinto <xten@osec.io>
Co-developed-by: Pedro Pinto <xten@osec.io>
Signed-off-by: Pedro Pinto <xten@osec.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Hyunwoo Kim <v4bel@theori.io>
Cc: Wongi Lee <qwerty@theori.io>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
---
 [ Apparently both parties found this issue as part of Google KernelCTF.
   Pedro reported the issue to us several days earlier than Hyunwoo &
   Wongi. I would have loved to place all parties into the Reported-by
   tag in addition to the informal mention above, but apparently this
   causes issues wrt the KernelCTF organisers despite that the commit
   message mentions the report timing. Hence in the tag only first come
   first serve this time. Thank you all in any case for reporting! ]

 include/net/tcx.h       | 13 +++++++++----
 net/sched/sch_ingress.c | 12 ++++++------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/net/tcx.h b/include/net/tcx.h
index 72a3e75e539f..5ce0ce9e0c02 100644
--- a/include/net/tcx.h
+++ b/include/net/tcx.h
@@ -13,7 +13,7 @@ struct mini_Qdisc;
 struct tcx_entry {
 	struct mini_Qdisc __rcu *miniq;
 	struct bpf_mprog_bundle bundle;
-	bool miniq_active;
+	u32 miniq_active;
 	struct rcu_head rcu;
 };
 
@@ -125,11 +125,16 @@ static inline void tcx_skeys_dec(bool ingress)
 	tcx_dec();
 }
 
-static inline void tcx_miniq_set_active(struct bpf_mprog_entry *entry,
-					const bool active)
+static inline void tcx_miniq_inc(struct bpf_mprog_entry *entry)
 {
 	ASSERT_RTNL();
-	tcx_entry(entry)->miniq_active = active;
+	tcx_entry(entry)->miniq_active++;
+}
+
+static inline void tcx_miniq_dec(struct bpf_mprog_entry *entry)
+{
+	ASSERT_RTNL();
+	tcx_entry(entry)->miniq_active--;
 }
 
 static inline bool tcx_entry_is_active(struct bpf_mprog_entry *entry)
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index c2ef9dcf91d2..cc6051d4f2ef 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -91,7 +91,7 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 	entry = tcx_entry_fetch_or_create(dev, true, &created);
 	if (!entry)
 		return -ENOMEM;
-	tcx_miniq_set_active(entry, true);
+	tcx_miniq_inc(entry);
 	mini_qdisc_pair_init(&q->miniqp, sch, &tcx_entry(entry)->miniq);
 	if (created)
 		tcx_entry_update(dev, entry, true);
@@ -121,7 +121,7 @@ static void ingress_destroy(struct Qdisc *sch)
 	tcf_block_put_ext(q->block, sch, &q->block_info);
 
 	if (entry) {
-		tcx_miniq_set_active(entry, false);
+		tcx_miniq_dec(entry);
 		if (!tcx_entry_is_active(entry)) {
 			tcx_entry_update(dev, NULL, true);
 			tcx_entry_free(entry);
@@ -257,7 +257,7 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 	entry = tcx_entry_fetch_or_create(dev, true, &created);
 	if (!entry)
 		return -ENOMEM;
-	tcx_miniq_set_active(entry, true);
+	tcx_miniq_inc(entry);
 	mini_qdisc_pair_init(&q->miniqp_ingress, sch, &tcx_entry(entry)->miniq);
 	if (created)
 		tcx_entry_update(dev, entry, true);
@@ -276,7 +276,7 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 	entry = tcx_entry_fetch_or_create(dev, false, &created);
 	if (!entry)
 		return -ENOMEM;
-	tcx_miniq_set_active(entry, true);
+	tcx_miniq_inc(entry);
 	mini_qdisc_pair_init(&q->miniqp_egress, sch, &tcx_entry(entry)->miniq);
 	if (created)
 		tcx_entry_update(dev, entry, false);
@@ -302,7 +302,7 @@ static void clsact_destroy(struct Qdisc *sch)
 	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
 
 	if (ingress_entry) {
-		tcx_miniq_set_active(ingress_entry, false);
+		tcx_miniq_dec(ingress_entry);
 		if (!tcx_entry_is_active(ingress_entry)) {
 			tcx_entry_update(dev, NULL, true);
 			tcx_entry_free(ingress_entry);
@@ -310,7 +310,7 @@ static void clsact_destroy(struct Qdisc *sch)
 	}
 
 	if (egress_entry) {
-		tcx_miniq_set_active(egress_entry, false);
+		tcx_miniq_dec(egress_entry);
 		if (!tcx_entry_is_active(egress_entry)) {
 			tcx_entry_update(dev, NULL, false);
 			tcx_entry_free(egress_entry);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release
  2024-07-08 13:31 [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry Daniel Borkmann
@ 2024-07-08 13:31 ` Daniel Borkmann
  2024-07-08 22:34   ` Martin KaFai Lau
  2024-07-08 22:30 ` [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry patchwork-bot+netdevbpf
  2024-07-09  0:21 ` John Fastabend
  2 siblings, 1 reply; 6+ messages in thread
From: Daniel Borkmann @ 2024-07-08 13:31 UTC (permalink / raw)
  To: martin.lau; +Cc: bpf, netdev, Daniel Borkmann

Add a test case which replaces an active ingress qdisc while keeping the
miniq in-tact during the transition period to the new clsact qdisc.

  # ./vmtest.sh -- ./test_progs -t tc_link
  [...]
  ./test_progs -t tc_link
  [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
  [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #332     tc_links_after:OK
  #333     tc_links_append:OK
  #334     tc_links_basic:OK
  #335     tc_links_before:OK
  #336     tc_links_chain_classic:OK
  #337     tc_links_chain_mixed:OK
  #338     tc_links_dev_chain0:OK
  #339     tc_links_dev_cleanup:OK
  #340     tc_links_dev_mixed:OK
  #341     tc_links_ingress:OK
  #342     tc_links_invalid:OK
  #343     tc_links_prepend:OK
  #344     tc_links_replace:OK
  #345     tc_links_revision:OK
  Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
---
 tools/testing/selftests/bpf/config            |  3 +
 .../selftests/bpf/prog_tests/tc_links.c       | 61 +++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index c0ef168eb637..c95f6f1ab74a 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -61,9 +61,12 @@ CONFIG_MPLS=y
 CONFIG_MPLS_IPTUNNEL=y
 CONFIG_MPLS_ROUTING=y
 CONFIG_MPTCP=y
+CONFIG_NET_ACT_SKBMOD=y
+CONFIG_NET_CLS=y
 CONFIG_NET_CLS_ACT=y
 CONFIG_NET_CLS_BPF=y
 CONFIG_NET_CLS_FLOWER=y
+CONFIG_NET_CLS_MATCHALL=y
 CONFIG_NET_FOU=y
 CONFIG_NET_FOU_IP_TUNNELS=y
 CONFIG_NET_IPGRE=y
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_links.c b/tools/testing/selftests/bpf/prog_tests/tc_links.c
index bc9841144685..1af9ec1149aa 100644
--- a/tools/testing/selftests/bpf/prog_tests/tc_links.c
+++ b/tools/testing/selftests/bpf/prog_tests/tc_links.c
@@ -9,6 +9,8 @@
 #define ping_cmd "ping -q -c1 -w1 127.0.0.1 > /dev/null"
 
 #include "test_tc_link.skel.h"
+
+#include "netlink_helpers.h"
 #include "tc_helpers.h"
 
 void serial_test_tc_links_basic(void)
@@ -1787,6 +1789,65 @@ void serial_test_tc_links_ingress(void)
 	test_tc_links_ingress(BPF_TCX_INGRESS, false, false);
 }
 
+struct qdisc_req {
+	struct nlmsghdr  n;
+	struct tcmsg     t;
+	char             buf[1024];
+};
+
+static int qdisc_replace(int ifindex, const char *kind, bool block)
+{
+	struct rtnl_handle rth = { .fd = -1 };
+	struct qdisc_req req;
+	int err;
+
+	err = rtnl_open(&rth, 0);
+	if (!ASSERT_OK(err, "open_rtnetlink"))
+		return err;
+
+	memset(&req, 0, sizeof(req));
+	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg));
+	req.n.nlmsg_flags = NLM_F_CREATE | NLM_F_REPLACE | NLM_F_REQUEST;
+	req.n.nlmsg_type = RTM_NEWQDISC;
+	req.t.tcm_family = AF_UNSPEC;
+	req.t.tcm_ifindex = ifindex;
+	req.t.tcm_parent = 0xfffffff1;
+
+	addattr_l(&req.n, sizeof(req), TCA_KIND, kind, strlen(kind) + 1);
+	if (block)
+		addattr32(&req.n, sizeof(req), TCA_INGRESS_BLOCK, 1);
+
+	err = rtnl_talk(&rth, &req.n, NULL);
+	ASSERT_OK(err, "talk_rtnetlink");
+	rtnl_close(&rth);
+	return err;
+}
+
+void serial_test_tc_links_dev_chain0(void)
+{
+	int err, ifindex;
+
+	ASSERT_OK(system("ip link add dev foo type veth peer name bar"), "add veth");
+	ifindex = if_nametoindex("foo");
+	ASSERT_NEQ(ifindex, 0, "non_zero_ifindex");
+	err = qdisc_replace(ifindex, "ingress", true);
+	if (!ASSERT_OK(err, "attaching ingress"))
+		goto cleanup;
+	ASSERT_OK(system("tc filter add block 1 matchall action skbmod swap mac"), "add block");
+	err = qdisc_replace(ifindex, "clsact", false);
+	if (!ASSERT_OK(err, "attaching clsact"))
+		goto cleanup;
+	/* Heuristic: kern_sync_rcu() alone does not work; a wait-time of ~5s
+	 * triggered the issue without the fix reliably 100% of the time.
+	 */
+	sleep(5);
+	ASSERT_OK(system("tc filter add dev foo ingress matchall action skbmod swap mac"), "add filter");
+cleanup:
+	ASSERT_OK(system("ip link del dev foo"), "del veth");
+	ASSERT_EQ(if_nametoindex("foo"), 0, "foo removed");
+	ASSERT_EQ(if_nametoindex("bar"), 0, "bar removed");
+}
+
 static void test_tc_links_dev_mixed(int target)
 {
 	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry
  2024-07-08 13:31 [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry Daniel Borkmann
  2024-07-08 13:31 ` [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release Daniel Borkmann
@ 2024-07-08 22:30 ` patchwork-bot+netdevbpf
  2024-07-09  0:21 ` John Fastabend
  2 siblings, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-07-08 22:30 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: martin.lau, bpf, netdev, xten, v4bel, qwerty

Hello:

This series was applied to bpf/bpf.git (master)
by Martin KaFai Lau <martin.lau@kernel.org>:

On Mon,  8 Jul 2024 15:31:29 +0200 you wrote:
> Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported
> an issue that the tcx_entry can be released too early leading to a use
> after free (UAF) when an active old-style ingress or clsact qdisc with a
> shared tc block is later replaced by another ingress or clsact instance.
> 
> Essentially, the sequence to trigger the UAF (one example) can be as follows:
> 
> [...]

Here is the summary with links:
  - [bpf,1/2] bpf: Fix too early release of tcx_entry
    https://git.kernel.org/bpf/bpf/c/1cb6f0bae504
  - [bpf,2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release
    https://git.kernel.org/bpf/bpf/c/5f1d18de7918

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release
  2024-07-08 13:31 ` [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release Daniel Borkmann
@ 2024-07-08 22:34   ` Martin KaFai Lau
  2024-07-09 19:48     ` Daniel Borkmann
  0 siblings, 1 reply; 6+ messages in thread
From: Martin KaFai Lau @ 2024-07-08 22:34 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: martin.lau, bpf, netdev

On 7/8/24 6:31 AM, Daniel Borkmann wrote:
> Add a test case which replaces an active ingress qdisc while keeping the
> miniq in-tact during the transition period to the new clsact qdisc.

Thanks for the explanation in patch 1 fix and the test. Applied.

> 
>    # ./vmtest.sh -- ./test_progs -t tc_link
>    [...]
>    ./test_progs -t tc_link
>    [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
>    [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
>    #332     tc_links_after:OK
>    #333     tc_links_append:OK
>    #334     tc_links_basic:OK
>    #335     tc_links_before:OK
>    #336     tc_links_chain_classic:OK
>    #337     tc_links_chain_mixed:OK
>    #338     tc_links_dev_chain0:OK
>    #339     tc_links_dev_cleanup:OK
>    #340     tc_links_dev_mixed:OK
>    #341     tc_links_ingress:OK
>    #342     tc_links_invalid:OK
>    #343     tc_links_prepend:OK
>    #344     tc_links_replace:OK
>    #345     tc_links_revision:OK
>    Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Martin KaFai Lau <martin.lau@kernel.org>
> ---
>   tools/testing/selftests/bpf/config            |  3 +
>   .../selftests/bpf/prog_tests/tc_links.c       | 61 +++++++++++++++++++
>   2 files changed, 64 insertions(+)
> 
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index c0ef168eb637..c95f6f1ab74a 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -61,9 +61,12 @@ CONFIG_MPLS=y
>   CONFIG_MPLS_IPTUNNEL=y
>   CONFIG_MPLS_ROUTING=y
>   CONFIG_MPTCP=y
> +CONFIG_NET_ACT_SKBMOD=y
> +CONFIG_NET_CLS=y
>   CONFIG_NET_CLS_ACT=y
>   CONFIG_NET_CLS_BPF=y
>   CONFIG_NET_CLS_FLOWER=y
> +CONFIG_NET_CLS_MATCHALL=y
>   CONFIG_NET_FOU=y
>   CONFIG_NET_FOU_IP_TUNNELS=y
>   CONFIG_NET_IPGRE=y
> diff --git a/tools/testing/selftests/bpf/prog_tests/tc_links.c b/tools/testing/selftests/bpf/prog_tests/tc_links.c
> index bc9841144685..1af9ec1149aa 100644
> --- a/tools/testing/selftests/bpf/prog_tests/tc_links.c
> +++ b/tools/testing/selftests/bpf/prog_tests/tc_links.c
> @@ -9,6 +9,8 @@
>   #define ping_cmd "ping -q -c1 -w1 127.0.0.1 > /dev/null"
>   
>   #include "test_tc_link.skel.h"
> +
> +#include "netlink_helpers.h"
>   #include "tc_helpers.h"
>   
>   void serial_test_tc_links_basic(void)
> @@ -1787,6 +1789,65 @@ void serial_test_tc_links_ingress(void)
>   	test_tc_links_ingress(BPF_TCX_INGRESS, false, false);
>   }
>   
> +struct qdisc_req {
> +	struct nlmsghdr  n;
> +	struct tcmsg     t;
> +	char             buf[1024];
> +};
> +
> +static int qdisc_replace(int ifindex, const char *kind, bool block)
> +{
> +	struct rtnl_handle rth = { .fd = -1 };
> +	struct qdisc_req req;
> +	int err;
> +
> +	err = rtnl_open(&rth, 0);
> +	if (!ASSERT_OK(err, "open_rtnetlink"))
> +		return err;
> +
> +	memset(&req, 0, sizeof(req));
> +	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg));
> +	req.n.nlmsg_flags = NLM_F_CREATE | NLM_F_REPLACE | NLM_F_REQUEST;
> +	req.n.nlmsg_type = RTM_NEWQDISC;
> +	req.t.tcm_family = AF_UNSPEC;
> +	req.t.tcm_ifindex = ifindex;
> +	req.t.tcm_parent = 0xfffffff1;
> +
> +	addattr_l(&req.n, sizeof(req), TCA_KIND, kind, strlen(kind) + 1);
> +	if (block)
> +		addattr32(&req.n, sizeof(req), TCA_INGRESS_BLOCK, 1);
> +
> +	err = rtnl_talk(&rth, &req.n, NULL);
> +	ASSERT_OK(err, "talk_rtnetlink");
> +	rtnl_close(&rth);
> +	return err;
> +}
> +
> +void serial_test_tc_links_dev_chain0(void)
> +{
> +	int err, ifindex;
> +
> +	ASSERT_OK(system("ip link add dev foo type veth peer name bar"), "add veth");
> +	ifindex = if_nametoindex("foo");
> +	ASSERT_NEQ(ifindex, 0, "non_zero_ifindex");
> +	err = qdisc_replace(ifindex, "ingress", true);
> +	if (!ASSERT_OK(err, "attaching ingress"))
> +		goto cleanup;
> +	ASSERT_OK(system("tc filter add block 1 matchall action skbmod swap mac"), "add block");
> +	err = qdisc_replace(ifindex, "clsact", false);
> +	if (!ASSERT_OK(err, "attaching clsact"))
> +		goto cleanup;
> +	/* Heuristic: kern_sync_rcu() alone does not work; a wait-time of ~5s
> +	 * triggered the issue without the fix reliably 100% of the time.
> +	 */
> +	sleep(5);

It may be handy to be able to trigger rcu_barrier() to wait for the pending rcu 
callbacks to finish. Not sure if there is an existing way to do that without 
introducing a kfunc in bpf_testmod. Probably something to think about as an 
optimization.

Thanks for the fix and the test. Applied.


> +	ASSERT_OK(system("tc filter add dev foo ingress matchall action skbmod swap mac"), "add filter");
> +cleanup:
> +	ASSERT_OK(system("ip link del dev foo"), "del veth");
> +	ASSERT_EQ(if_nametoindex("foo"), 0, "foo removed");
> +	ASSERT_EQ(if_nametoindex("bar"), 0, "bar removed");
> +}
> +
>   static void test_tc_links_dev_mixed(int target)
>   {
>   	LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry
  2024-07-08 13:31 [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry Daniel Borkmann
  2024-07-08 13:31 ` [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release Daniel Borkmann
  2024-07-08 22:30 ` [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry patchwork-bot+netdevbpf
@ 2024-07-09  0:21 ` John Fastabend
  2 siblings, 0 replies; 6+ messages in thread
From: John Fastabend @ 2024-07-09  0:21 UTC (permalink / raw)
  To: Daniel Borkmann, martin.lau
  Cc: bpf, netdev, Daniel Borkmann, Pedro Pinto, Hyunwoo Kim, Wongi Lee

Daniel Borkmann wrote:
> Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported
> an issue that the tcx_entry can be released too early leading to a use
> after free (UAF) when an active old-style ingress or clsact qdisc with a
> shared tc block is later replaced by another ingress or clsact instance.
> 
> Essentially, the sequence to trigger the UAF (one example) can be as follows:
> 
>   1. A network namespace is created
>   2. An ingress qdisc is created. This allocates a tcx_entry, and
>      &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the
>      same time, a tcf block with index 1 is created.
>   3. chain0 is attached to the tcf block. chain0 must be connected to
>      the block linked to the ingress qdisc to later reach the function
>      tcf_chain0_head_change_cb_del() which triggers the UAF.
>   4. Create and graft a clsact qdisc. This causes the ingress qdisc
>      created in step 1 to be removed, thus freeing the previously linked
>      tcx_entry:
> 
>      rtnetlink_rcv_msg()
>        => tc_modify_qdisc()
>          => qdisc_create()
>            => clsact_init() [a]
>          => qdisc_graft()
>            => qdisc_destroy()
>              => __qdisc_destroy()
>                => ingress_destroy() [b]
>                  => tcx_entry_free()
>                    => kfree_rcu() // tcx_entry freed
> 
>   5. Finally, the network namespace is closed. This registers the
>      cleanup_net worker, and during the process of releasing the
>      remaining clsact qdisc, it accesses the tcx_entry that was
>      already freed in step 4, causing the UAF to occur:
> 
>      cleanup_net()
>        => ops_exit_list()
>          => default_device_exit_batch()
>            => unregister_netdevice_many()
>              => unregister_netdevice_many_notify()
>                => dev_shutdown()
>                  => qdisc_put()
>                    => clsact_destroy() [c]
>                      => tcf_block_put_ext()
>                        => tcf_chain0_head_change_cb_del()
>                          => tcf_chain_head_change_item()
>                            => clsact_chain_head_change()
>                              => mini_qdisc_pair_swap() // UAF
> 
> There are also other variants, the gist is to add an ingress (or clsact)
> qdisc with a specific shared block, then to replace that qdisc, waiting
> for the tcx_entry kfree_rcu() to be executed and subsequently accessing
> the current active qdisc's miniq one way or another.
> 
> The correct fix is to turn the miniq_active boolean into a counter. What
> can be observed, at step 2 above, the counter transitions from 0->1, at
> step [a] from 1->2 (in order for the miniq object to remain active during
> the replacement), then in [b] from 2->1 and finally [c] 1->0 with the
> eventual release. The reference counter in general ranges from [0,2] and
> it does not need to be atomic since all access to the counter is protected
> by the rtnl mutex. With this in place, there is no longer a UAF happening
> and the tcx_entry is freed at the correct time.
> 
> Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
> Reported-by: Pedro Pinto <xten@osec.io>
> Co-developed-by: Pedro Pinto <xten@osec.io>
> Signed-off-by: Pedro Pinto <xten@osec.io>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Hyunwoo Kim <v4bel@theori.io>
> Cc: Wongi Lee <qwerty@theori.io>
> Cc: Martin KaFai Lau <martin.lau@kernel.org>
> ---

Acked-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release
  2024-07-08 22:34   ` Martin KaFai Lau
@ 2024-07-09 19:48     ` Daniel Borkmann
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel Borkmann @ 2024-07-09 19:48 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: martin.lau, bpf, netdev

On 7/9/24 12:34 AM, Martin KaFai Lau wrote:
[...]
> It may be handy to be able to trigger rcu_barrier() to wait for the pending rcu callbacks to finish. Not sure if there is an existing way to do that without introducing a kfunc in bpf_testmod. Probably something to think about as an optimization.
> 
> Thanks for the fix and the test. Applied.

Thanks, I'll take a look if this is or can be cleanly exposed somehow.

>> +    ASSERT_OK(system("tc filter add dev foo ingress matchall action skbmod swap mac"), "add filter");
>> +cleanup:
>> +    ASSERT_OK(system("ip link del dev foo"), "del veth");
>> +    ASSERT_EQ(if_nametoindex("foo"), 0, "foo removed");
>> +    ASSERT_EQ(if_nametoindex("bar"), 0, "bar removed");
>> +}
>> +
>>   static void test_tc_links_dev_mixed(int target)
>>   {
>>       LIBBPF_OPTS(bpf_tc_opts, tc_opts, .handle = 1, .priority = 1);
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-07-09 19:48 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-08 13:31 [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry Daniel Borkmann
2024-07-08 13:31 ` [PATCH bpf 2/2] selftests/bpf: Extend tcx tests to cover late tcx_entry release Daniel Borkmann
2024-07-08 22:34   ` Martin KaFai Lau
2024-07-09 19:48     ` Daniel Borkmann
2024-07-08 22:30 ` [PATCH bpf 1/2] bpf: Fix too early release of tcx_entry patchwork-bot+netdevbpf
2024-07-09  0:21 ` John Fastabend

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).