* [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS
2023-05-06 0:09 ` [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
@ 2023-05-06 0:12 ` Peilin Ye
2023-05-08 11:22 ` Jamal Hadi Salim
2023-05-06 0:13 ` [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
` (4 subsequent siblings)
5 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-06 0:12 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
Peilin Ye
ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
Similar to mq_init(), return -EOPNOTSUPP if 'parent' is not
TC_H_INGRESS.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
net/sched/sch_ingress.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 84838128b9c5..3d71f7a3b4ad 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
struct net_device *dev = qdisc_dev(sch);
int err;
+ if (sch->parent != TC_H_INGRESS)
+ return -EOPNOTSUPP;
+
net_inc_ingress_queue();
mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
--
2.20.1
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS
2023-05-06 0:12 ` [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
@ 2023-05-08 11:22 ` Jamal Hadi Salim
0 siblings, 0 replies; 51+ messages in thread
From: Jamal Hadi Salim @ 2023-05-08 11:22 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Fri, May 5, 2023 at 8:12 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
> Similar to mq_init(), return -EOPNOTSUPP if 'parent' is not
> TC_H_INGRESS.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> ---
> net/sched/sch_ingress.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..3d71f7a3b4ad 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
> struct net_device *dev = qdisc_dev(sch);
> int err;
>
> + if (sch->parent != TC_H_INGRESS)
> + return -EOPNOTSUPP;
> +
> net_inc_ingress_queue();
>
> mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT
2023-05-06 0:09 ` [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
2023-05-06 0:12 ` [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
@ 2023-05-06 0:13 ` Peilin Ye
2023-05-08 11:23 ` Jamal Hadi Salim
2023-05-06 0:14 ` [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
` (3 subsequent siblings)
5 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-06 0:13 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
Peilin Ye
clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
equals TC_H_INGRESS). Return -EOPNOTSUPP if 'parent' is not
TC_H_CLSACT.
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
net/sched/sch_ingress.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 3d71f7a3b4ad..13218a1fe4a5 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -222,6 +222,9 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
struct net_device *dev = qdisc_dev(sch);
int err;
+ if (sch->parent != TC_H_CLSACT)
+ return -EOPNOTSUPP;
+
net_inc_ingress_queue();
net_inc_egress_queue();
--
2.20.1
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT
2023-05-06 0:13 ` [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
@ 2023-05-08 11:23 ` Jamal Hadi Salim
0 siblings, 0 replies; 51+ messages in thread
From: Jamal Hadi Salim @ 2023-05-08 11:23 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Fri, May 5, 2023 at 8:13 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
> equals TC_H_INGRESS). Return -EOPNOTSUPP if 'parent' is not
> TC_H_CLSACT.
>
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> ---
> net/sched/sch_ingress.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 3d71f7a3b4ad..13218a1fe4a5 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -222,6 +222,9 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
> struct net_device *dev = qdisc_dev(sch);
> int err;
>
> + if (sch->parent != TC_H_CLSACT)
> + return -EOPNOTSUPP;
> +
> net_inc_ingress_queue();
> net_inc_egress_queue();
>
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
2023-05-06 0:09 ` [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
2023-05-06 0:12 ` [PATCH net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
2023-05-06 0:13 ` [PATCH net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
@ 2023-05-06 0:14 ` Peilin Ye
2023-05-08 11:23 ` Jamal Hadi Salim
2023-05-06 0:14 ` [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
` (2 subsequent siblings)
5 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-06 0:14 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
Peilin Ye
Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
(TC_H_INGRESS, TC_H_CLSACT):
$ ip link add name ifb0 type ifb
$ tc qdisc add dev ifb0 parent ffff:fff1 htb
$ tc qdisc add dev ifb0 clsact
Error: Exclusivity flag on, cannot modify.
$ drgn
...
>>> ifb0 = netdev_get_by_name(prog, "ifb0")
>>> qdisc = ifb0.ingress_queue.qdisc_sleeping
>>> print(qdisc.ops.id.string_().decode())
htb
>>> qdisc.flags.value_() # TCQ_F_INGRESS
2
Only allow ingress and clsact Qdiscs under ffff:fff1. Return -EINVAL
for everything else. Make TCQ_F_INGRESS a static flag of ingress and
clsact Qdiscs.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
net/sched/sch_api.c | 7 ++++++-
net/sched/sch_ingress.c | 4 ++--
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index fdb8f429333d..383195955b7d 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1252,7 +1252,12 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
sch->parent = parent;
if (handle == TC_H_INGRESS) {
- sch->flags |= TCQ_F_INGRESS;
+ if (!(sch->flags & TCQ_F_INGRESS)) {
+ NL_SET_ERR_MSG(extack,
+ "Specified parent ID is reserved for ingress and clsact Qdiscs");
+ err = -EINVAL;
+ goto err_out3;
+ }
handle = TC_H_MAKE(TC_H_INGRESS, 0);
} else {
if (handle == 0) {
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 13218a1fe4a5..caea51e0d4e9 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -137,7 +137,7 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
.cl_ops = &ingress_class_ops,
.id = "ingress",
.priv_size = sizeof(struct ingress_sched_data),
- .static_flags = TCQ_F_CPUSTATS,
+ .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
.init = ingress_init,
.destroy = ingress_destroy,
.dump = ingress_dump,
@@ -275,7 +275,7 @@ static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
.cl_ops = &clsact_class_ops,
.id = "clsact",
.priv_size = sizeof(struct clsact_sched_data),
- .static_flags = TCQ_F_CPUSTATS,
+ .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
.init = clsact_init,
.destroy = clsact_destroy,
.dump = ingress_dump,
--
2.20.1
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
2023-05-06 0:14 ` [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
@ 2023-05-08 11:23 ` Jamal Hadi Salim
0 siblings, 0 replies; 51+ messages in thread
From: Jamal Hadi Salim @ 2023-05-08 11:23 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Fri, May 5, 2023 at 8:14 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
> (TC_H_INGRESS, TC_H_CLSACT):
>
> $ ip link add name ifb0 type ifb
> $ tc qdisc add dev ifb0 parent ffff:fff1 htb
> $ tc qdisc add dev ifb0 clsact
> Error: Exclusivity flag on, cannot modify.
> $ drgn
> ...
> >>> ifb0 = netdev_get_by_name(prog, "ifb0")
> >>> qdisc = ifb0.ingress_queue.qdisc_sleeping
> >>> print(qdisc.ops.id.string_().decode())
> htb
> >>> qdisc.flags.value_() # TCQ_F_INGRESS
> 2
>
> Only allow ingress and clsact Qdiscs under ffff:fff1. Return -EINVAL
> for everything else. Make TCQ_F_INGRESS a static flag of ingress and
> clsact Qdiscs.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> ---
> net/sched/sch_api.c | 7 ++++++-
> net/sched/sch_ingress.c | 4 ++--
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index fdb8f429333d..383195955b7d 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1252,7 +1252,12 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
> sch->parent = parent;
>
> if (handle == TC_H_INGRESS) {
> - sch->flags |= TCQ_F_INGRESS;
> + if (!(sch->flags & TCQ_F_INGRESS)) {
> + NL_SET_ERR_MSG(extack,
> + "Specified parent ID is reserved for ingress and clsact Qdiscs");
> + err = -EINVAL;
> + goto err_out3;
> + }
> handle = TC_H_MAKE(TC_H_INGRESS, 0);
> } else {
> if (handle == 0) {
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 13218a1fe4a5..caea51e0d4e9 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -137,7 +137,7 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
> .cl_ops = &ingress_class_ops,
> .id = "ingress",
> .priv_size = sizeof(struct ingress_sched_data),
> - .static_flags = TCQ_F_CPUSTATS,
> + .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
> .init = ingress_init,
> .destroy = ingress_destroy,
> .dump = ingress_dump,
> @@ -275,7 +275,7 @@ static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
> .cl_ops = &clsact_class_ops,
> .id = "clsact",
> .priv_size = sizeof(struct clsact_sched_data),
> - .static_flags = TCQ_F_CPUSTATS,
> + .static_flags = TCQ_F_INGRESS | TCQ_F_CPUSTATS,
> .init = clsact_init,
> .destroy = clsact_destroy,
> .dump = ingress_dump,
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs
2023-05-06 0:09 ` [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
` (2 preceding siblings ...)
2023-05-06 0:14 ` [PATCH net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
@ 2023-05-06 0:14 ` Peilin Ye
2023-05-08 11:24 ` Jamal Hadi Salim
2023-05-06 0:15 ` [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
2023-05-06 0:16 ` [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
5 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-06 0:14 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
Peilin Ye
Currently, after creating an ingress (or clsact) Qdisc and grafting it
under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
e.g. a TBF Qdisc:
$ ip link add ifb0 type ifb
$ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
$ tc qdisc add dev ifb0 clsact
$ tc qdisc link dev ifb0 handle ffff: parent 1:1
$ tc qdisc show dev ifb0
qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
qdisc clsact ffff: parent ffff:fff1 refcnt 2
^^^^^^^^
clsact's refcount has increased: it is now grafted under both
TC_H_CLSACT and 1:1.
ingress and clsact Qdiscs should only be used under TC_H_INGRESS
(TC_H_CLSACT). Prohibit regrafting them.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
net/sched/sch_api.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 383195955b7d..49b9c1bbfdd9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1596,6 +1596,11 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
NL_SET_ERR_MSG(extack, "Invalid qdisc name");
return -EINVAL;
}
+ if (q->flags & TCQ_F_INGRESS) {
+ NL_SET_ERR_MSG(extack,
+ "Cannot regraft ingress or clsact Qdiscs");
+ return -EINVAL;
+ }
if (q == p ||
(p && check_loop(q, p, 0))) {
NL_SET_ERR_MSG(extack, "Qdisc parent/child loop detected");
--
2.20.1
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs
2023-05-06 0:14 ` [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
@ 2023-05-08 11:24 ` Jamal Hadi Salim
0 siblings, 0 replies; 51+ messages in thread
From: Jamal Hadi Salim @ 2023-05-08 11:24 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Fri, May 5, 2023 at 8:14 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> Currently, after creating an ingress (or clsact) Qdisc and grafting it
> under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
> e.g. a TBF Qdisc:
>
> $ ip link add ifb0 type ifb
> $ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
> $ tc qdisc add dev ifb0 clsact
> $ tc qdisc link dev ifb0 handle ffff: parent 1:1
> $ tc qdisc show dev ifb0
> qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
> qdisc clsact ffff: parent ffff:fff1 refcnt 2
> ^^^^^^^^
>
> clsact's refcount has increased: it is now grafted under both
> TC_H_CLSACT and 1:1.
>
> ingress and clsact Qdiscs should only be used under TC_H_INGRESS
> (TC_H_CLSACT). Prohibit regrafting them.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> ---
> net/sched/sch_api.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 383195955b7d..49b9c1bbfdd9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1596,6 +1596,11 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
> NL_SET_ERR_MSG(extack, "Invalid qdisc name");
> return -EINVAL;
> }
> + if (q->flags & TCQ_F_INGRESS) {
> + NL_SET_ERR_MSG(extack,
> + "Cannot regraft ingress or clsact Qdiscs");
> + return -EINVAL;
> + }
> if (q == p ||
> (p && check_loop(q, p, 0))) {
> NL_SET_ERR_MSG(extack, "Qdisc parent/child loop detected");
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
2023-05-06 0:09 ` [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
` (3 preceding siblings ...)
2023-05-06 0:14 ` [PATCH net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
@ 2023-05-06 0:15 ` Peilin Ye
2023-05-08 11:29 ` Jamal Hadi Salim
2023-05-08 14:11 ` Pedro Tammela
2023-05-06 0:16 ` [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
5 siblings, 2 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-06 0:15 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
Peilin Ye
Grafting ingress and clsact Qdiscs does not need a for-loop in
qdisc_graft(). Refactor it. No functional changes intended.
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
net/sched/sch_api.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 49b9c1bbfdd9..f72a581666a2 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
if (parent == NULL) {
unsigned int i, num_q, ingress;
+ struct netdev_queue *dev_queue;
ingress = 0;
num_q = dev->num_tx_queues;
if ((q && q->flags & TCQ_F_INGRESS) ||
(new && new->flags & TCQ_F_INGRESS)) {
- num_q = 1;
ingress = 1;
if (!dev_ingress_queue(dev)) {
NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
@@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
if (new && new->ops->attach && !ingress)
goto skip;
- for (i = 0; i < num_q; i++) {
- struct netdev_queue *dev_queue = dev_ingress_queue(dev);
-
- if (!ingress)
+ if (!ingress) {
+ for (i = 0; i < num_q; i++) {
dev_queue = netdev_get_tx_queue(dev, i);
+ old = dev_graft_qdisc(dev_queue, new);
- old = dev_graft_qdisc(dev_queue, new);
- if (new && i > 0)
- qdisc_refcount_inc(new);
-
- if (!ingress)
+ if (new && i > 0)
+ qdisc_refcount_inc(new);
qdisc_put(old);
+ }
+ } else {
+ dev_queue = dev_ingress_queue(dev);
+ old = dev_graft_qdisc(dev_queue, new);
}
skip:
--
2.20.1
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
2023-05-06 0:15 ` [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
@ 2023-05-08 11:29 ` Jamal Hadi Salim
2023-05-08 22:24 ` Peilin Ye
2023-05-08 14:11 ` Pedro Tammela
1 sibling, 1 reply; 51+ messages in thread
From: Jamal Hadi Salim @ 2023-05-08 11:29 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
John Fastabend
On Fri, May 5, 2023 at 8:15 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> Grafting ingress and clsact Qdiscs does not need a for-loop in
> qdisc_graft(). Refactor it. No functional changes intended.
>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Fixed John's email address.
This one i am not so sure; num_q = 1 implies it will run on the for
loop only once. I am not sure it improves readability either. Anyways
for the effort you put into it i am tossing a coin and saying:
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> net/sched/sch_api.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 49b9c1bbfdd9..f72a581666a2 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (parent == NULL) {
> unsigned int i, num_q, ingress;
> + struct netdev_queue *dev_queue;
>
> ingress = 0;
> num_q = dev->num_tx_queues;
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> - num_q = 1;
> ingress = 1;
> if (!dev_ingress_queue(dev)) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> @@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if (new && new->ops->attach && !ingress)
> goto skip;
>
> - for (i = 0; i < num_q; i++) {
> - struct netdev_queue *dev_queue = dev_ingress_queue(dev);
> -
> - if (!ingress)
> + if (!ingress) {
> + for (i = 0; i < num_q; i++) {
> dev_queue = netdev_get_tx_queue(dev, i);
> + old = dev_graft_qdisc(dev_queue, new);
>
> - old = dev_graft_qdisc(dev_queue, new);
> - if (new && i > 0)
> - qdisc_refcount_inc(new);
> -
> - if (!ingress)
> + if (new && i > 0)
> + qdisc_refcount_inc(new);
> qdisc_put(old);
> + }
> + } else {
> + dev_queue = dev_ingress_queue(dev);
> + old = dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
2023-05-08 11:29 ` Jamal Hadi Salim
@ 2023-05-08 22:24 ` Peilin Ye
0 siblings, 0 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-08 22:24 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
John Fastabend
On Mon, May 08, 2023 at 07:29:26AM -0400, Jamal Hadi Salim wrote:
> On Fri, May 5, 2023 at 8:15 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >
> > Grafting ingress and clsact Qdiscs does not need a for-loop in
> > qdisc_graft(). Refactor it. No functional changes intended.
>
> This one i am not so sure; num_q = 1 implies it will run on the for
> loop only once. I am not sure it improves readability either. Anyways
> for the effort you put into it i am tossing a coin and saying:
> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Yeah, it doesn't make much difference itself. I'm just afraid that,
without [5/6], [6/6] would look like:
for (i = 0; i < num_q; i++) {
if (!ingress) {
dev_queue = netdev_get_tx_queue(dev, i);
old = dev_graft_qdisc(dev_queue, new);
else {
old = dev_graft_qdisc(dev_queue, NULL);
}
if (new && i > 0)
qdisc_refcount_inc(new);
if (!ingress) {
qdisc_put(old);
} else {
/* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
* unprotected concurrent accesses to net_device::miniq_{in,e}gress
* pointer(s) in mini_qdisc_pair_swap().
*/
qdisc_notify(net, skb, n, classid, old, new, extack);
qdisc_destroy(old);
}
if (ingress)
dev_graft_qdisc(dev_queue, new);
}
The "!ingress" path doesn't share a single line with "ingress", which
looks a bit weird to me. Personally I'd like to keep [5/6].
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
2023-05-06 0:15 ` [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
2023-05-08 11:29 ` Jamal Hadi Salim
@ 2023-05-08 14:11 ` Pedro Tammela
1 sibling, 0 replies; 51+ messages in thread
From: Pedro Tammela @ 2023-05-08 14:11 UTC (permalink / raw)
To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Hillf Danton, netdev, linux-kernel, Cong Wang
On 05/05/2023 21:15, Peilin Ye wrote:
> Grafting ingress and clsact Qdiscs does not need a for-loop in
> qdisc_graft(). Refactor it. No functional changes intended.
>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Just a FYI:
If you decide to keep this refactoring, it will need to be back ported
together with the subsequent fix.
I would personally leave it to net-next.
Thanks for chasing this!
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
> ---
> net/sched/sch_api.c | 20 ++++++++++----------
> 1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 49b9c1bbfdd9..f72a581666a2 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (parent == NULL) {
> unsigned int i, num_q, ingress;
> + struct netdev_queue *dev_queue;
>
> ingress = 0;
> num_q = dev->num_tx_queues;
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> - num_q = 1;
> ingress = 1;
> if (!dev_ingress_queue(dev)) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> @@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if (new && new->ops->attach && !ingress)
> goto skip;
>
> - for (i = 0; i < num_q; i++) {
> - struct netdev_queue *dev_queue = dev_ingress_queue(dev);
> -
> - if (!ingress)
> + if (!ingress) {
> + for (i = 0; i < num_q; i++) {
> dev_queue = netdev_get_tx_queue(dev, i);
> + old = dev_graft_qdisc(dev_queue, new);
>
> - old = dev_graft_qdisc(dev_queue, new);
> - if (new && i > 0)
> - qdisc_refcount_inc(new);
> -
> - if (!ingress)
> + if (new && i > 0)
> + qdisc_refcount_inc(new);
> qdisc_put(old);
> + }
> + } else {
> + dev_queue = dev_ingress_queue(dev);
> + old = dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-06 0:09 ` [PATCH net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
` (4 preceding siblings ...)
2023-05-06 0:15 ` [PATCH net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
@ 2023-05-06 0:16 ` Peilin Ye
2023-05-08 11:32 ` Jamal Hadi Salim
` (3 more replies)
5 siblings, 4 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-06 0:16 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
Peilin Ye
mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
clsact Qdiscs and miniq_egress.
Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
access the same miniq_{in,e}gress pointer(s) concurrently with the new
Qdisc, causing race conditions [1] including a use-after-free in
mini_qdisc_pair_swap() reported by syzbot:
BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
print_report mm/kasan/report.c:430 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:536
mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...
The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
finished.
To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
qdisc_destroy() for "old" before grafting "new".
Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.
[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
adds a flower filter X to A.
Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
b2) to replace A, then adds a flower filter Y to B.
Thread 1 A's refcnt Thread 2
RTM_NEWQDISC (A, RTNL-locked)
qdisc_create(A) 1
qdisc_graft(A) 9
RTM_NEWTFILTER (X, RTNL-lockless)
__tcf_qdisc_find(A) 10
tcf_chain0_head_change(A)
mini_qdisc_pair_swap(A) (1st)
|
| RTM_NEWQDISC (B, RTNL-locked)
RCU 2 qdisc_graft(B)
| 1 notify_and_destroy(A)
|
tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
qdisc_destroy(A) tcf_chain0_head_change(B)
tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
mini_qdisc_pair_swap(A) (3rd) |
... ...
Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().
This is just one of the possible consequences of concurrently accessing
net_device::miniq_{in,e}gress pointers. The point is clear, however:
B's first call to mini_qdisc_pair_swap() should take place after A's
last call, in qdisc_destroy().
Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
include/net/sch_generic.h | 8 ++++++++
net/sched/sch_api.c | 26 +++++++++++++++++++++-----
net/sched/sch_generic.c | 14 +++++++++++---
3 files changed, 40 insertions(+), 8 deletions(-)
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
refcount_inc(&qdisc->refcnt);
}
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+ if (qdisc->flags & TCQ_F_BUILTIN)
+ return true;
+ return refcount_dec_if_one(&qdisc->refcnt);
+}
+
/* Intended to be used by unlocked users, when concurrent qdisc release is
* possible.
*/
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
struct Qdisc *qdisc);
void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
void qdisc_put(struct Qdisc *qdisc);
void qdisc_put_unlocked(struct Qdisc *qdisc);
void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..a2d07bc8ded6 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
if ((q && q->flags & TCQ_F_INGRESS) ||
(new && new->flags & TCQ_F_INGRESS)) {
ingress = 1;
- if (!dev_ingress_queue(dev)) {
+ dev_queue = dev_ingress_queue(dev);
+ if (!dev_queue) {
NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
return -ENOENT;
}
+
+ /* This is the counterpart of that qdisc_refcount_inc_nz() call in
+ * __tcf_qdisc_find() for RTNL-lockless filter requests.
+ */
+ if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+ NL_SET_ERR_MSG(extack,
+ "Current ingress or clsact Qdisc has ongoing filter request(s)");
+ return -EBUSY;
+ }
}
if (dev->flags & IFF_UP)
@@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
qdisc_put(old);
}
} else {
- dev_queue = dev_ingress_queue(dev);
- old = dev_graft_qdisc(dev_queue, new);
+ old = dev_graft_qdisc(dev_queue, NULL);
+
+ /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
+ * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+ * pointer(s) in mini_qdisc_pair_swap().
+ */
+ qdisc_notify(net, skb, n, classid, old, new, extack);
+ qdisc_destroy(old);
+
+ dev_graft_qdisc(dev_queue, new);
}
skip:
@@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
if (new && new->ops->attach)
new->ops->attach(new);
- } else {
- notify_and_destroy(net, skb, n, classid, old, new, extack);
}
if (dev->flags & IFF_UP)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
qdisc_free(q);
}
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
{
const struct Qdisc_ops *ops = qdisc->ops;
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
call_rcu(&qdisc->rcu, qdisc_free_cb);
}
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+ if (qdisc->flags & TCQ_F_BUILTIN)
+ return;
+
+ __qdisc_destroy(qdisc);
+}
+
void qdisc_put(struct Qdisc *qdisc)
{
if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
!refcount_dec_and_test(&qdisc->refcnt))
return;
- qdisc_destroy(qdisc);
+ __qdisc_destroy(qdisc);
}
EXPORT_SYMBOL(qdisc_put);
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
!refcount_dec_and_rtnl_lock(&qdisc->refcnt))
return;
- qdisc_destroy(qdisc);
+ __qdisc_destroy(qdisc);
rtnl_unlock();
}
EXPORT_SYMBOL(qdisc_put_unlocked);
--
2.20.1
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-06 0:16 ` [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
@ 2023-05-08 11:32 ` Jamal Hadi Salim
2023-05-08 21:58 ` Peilin Ye
2023-05-08 14:12 ` Pedro Tammela
` (2 subsequent siblings)
3 siblings, 1 reply; 51+ messages in thread
From: Jamal Hadi Salim @ 2023-05-08 11:32 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
John Fastabend
On Fri, May 5, 2023 at 8:16 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
> requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
> access the same miniq_{in,e}gress pointer(s) concurrently with the new
> Qdisc, causing race conditions [1] including a use-after-free in
> mini_qdisc_pair_swap() reported by syzbot:
>
> BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> print_report mm/kasan/report.c:430 [inline]
> kasan_report+0x11c/0x130 mm/kasan/report.c:536
> mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
> after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
> finished.
>
> To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
> clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
> qdisc_destroy() for "old" before grafting "new".
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
> Cc: Hillf Danton <hdanton@sina.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Thanks for the excellent analysis Peilin and for chasing this to the
end. I have no doubt it was a lot of fun! ;->
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
cheers,
jamal
> ---
> include/net/sch_generic.h | 8 ++++++++
> net/sched/sch_api.c | 26 +++++++++++++++++++++-----
> net/sched/sch_generic.c | 14 +++++++++++---
> 3 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
> refcount_inc(&qdisc->refcnt);
> }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return true;
> + return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
> /* Intended to be used by unlocked users, when concurrent qdisc release is
> * possible.
> */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
> struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
> struct Qdisc *qdisc);
> void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
> void qdisc_put(struct Qdisc *qdisc);
> void qdisc_put_unlocked(struct Qdisc *qdisc);
> void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> ingress = 1;
> - if (!dev_ingress_queue(dev)) {
> + dev_queue = dev_ingress_queue(dev);
> + if (!dev_queue) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> return -ENOENT;
> }
> +
> + /* This is the counterpart of that qdisc_refcount_inc_nz() call in
> + * __tcf_qdisc_find() for RTNL-lockless filter requests.
> + */
> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> + NL_SET_ERR_MSG(extack,
> + "Current ingress or clsact Qdisc has ongoing filter request(s)");
> + return -EBUSY;
> + }
> }
>
> if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> qdisc_put(old);
> }
> } else {
> - dev_queue = dev_ingress_queue(dev);
> - old = dev_graft_qdisc(dev_queue, new);
> + old = dev_graft_qdisc(dev_queue, NULL);
> +
> + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> + * pointer(s) in mini_qdisc_pair_swap().
> + */
> + qdisc_notify(net, skb, n, classid, old, new, extack);
> + qdisc_destroy(old);
> +
> + dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (new && new->ops->attach)
> new->ops->attach(new);
> - } else {
> - notify_and_destroy(net, skb, n, classid, old, new, extack);
> }
>
> if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
> qdisc_free(q);
> }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
> {
> const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
> call_rcu(&qdisc->rcu, qdisc_free_cb);
> }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return;
> +
> + __qdisc_destroy(qdisc);
> +}
> +
> void qdisc_put(struct Qdisc *qdisc)
> {
> if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
> !refcount_dec_and_test(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> }
> EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
> !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> rtnl_unlock();
> }
> EXPORT_SYMBOL(qdisc_put_unlocked);
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-08 11:32 ` Jamal Hadi Salim
@ 2023-05-08 21:58 ` Peilin Ye
0 siblings, 0 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-08 21:58 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, Vlad Buslov,
Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
John Fastabend
On Mon, May 08, 2023 at 07:32:01AM -0400, Jamal Hadi Salim wrote:
> Thanks for the excellent analysis Peilin and for chasing this to the
> end. I have no doubt it was a lot of fun! ;->
Of course ;-)
> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Thanks for reviewing this!
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-06 0:16 ` [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
2023-05-08 11:32 ` Jamal Hadi Salim
@ 2023-05-08 14:12 ` Pedro Tammela
2023-05-08 22:01 ` Peilin Ye
2023-05-09 1:33 ` Jakub Kicinski
2023-05-17 18:48 ` Jakub Kicinski
3 siblings, 1 reply; 51+ messages in thread
From: Pedro Tammela @ 2023-05-08 14:12 UTC (permalink / raw)
To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
Hillf Danton, netdev, linux-kernel, Cong Wang
On 05/05/2023 21:16, Peilin Ye wrote:
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress. ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap(). Similar for
> clsact Qdiscs and miniq_egress.
>
> Unfortunately, after introducing RTNL-lockless RTM_{NEW,DEL,GET}TFILTER
> requests, when e.g. replacing ingress (clsact) Qdiscs, the old Qdisc could
> access the same miniq_{in,e}gress pointer(s) concurrently with the new
> Qdisc, causing race conditions [1] including a use-after-free in
> mini_qdisc_pair_swap() reported by syzbot:
>
> BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> print_report mm/kasan/report.c:430 [inline]
> kasan_report+0x11c/0x130 mm/kasan/report.c:536
> mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
>
> The new (ingress or clsact) Qdisc should only call mini_qdisc_pair_swap()
> after the old Qdisc's last call (in {ingress,clsact}_destroy()) has
> finished.
>
> To achieve this, in qdisc_graft(), return -EBUSY if the old (ingress or
> clsact) Qdisc has ongoing RTNL-lockless filter requests, and call
> qdisc_destroy() for "old" before grafting "new".
>
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-lockless filter requests. Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
>
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
>
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
>
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Link: https://lore.kernel.org/netdev/0000000000006cf87705f79acf1a@google.com
> Cc: Hillf Danton <hdanton@sina.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Thanks for chasing this!
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
> ---
> include/net/sch_generic.h | 8 ++++++++
> net/sched/sch_api.c | 26 +++++++++++++++++++++-----
> net/sched/sch_generic.c | 14 +++++++++++---
> 3 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
> refcount_inc(&qdisc->refcnt);
> }
>
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return true;
> + return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
> /* Intended to be used by unlocked users, when concurrent qdisc release is
> * possible.
> */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
> struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
> struct Qdisc *qdisc);
> void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
> void qdisc_put(struct Qdisc *qdisc);
> void qdisc_put_unlocked(struct Qdisc *qdisc);
> void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..a2d07bc8ded6 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,20 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> if ((q && q->flags & TCQ_F_INGRESS) ||
> (new && new->flags & TCQ_F_INGRESS)) {
> ingress = 1;
> - if (!dev_ingress_queue(dev)) {
> + dev_queue = dev_ingress_queue(dev);
> + if (!dev_queue) {
> NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
> return -ENOENT;
> }
> +
> + /* This is the counterpart of that qdisc_refcount_inc_nz() call in
> + * __tcf_qdisc_find() for RTNL-lockless filter requests.
> + */
> + if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> + NL_SET_ERR_MSG(extack,
> + "Current ingress or clsact Qdisc has ongoing filter request(s)");
> + return -EBUSY;
> + }
> }
>
> if (dev->flags & IFF_UP)
> @@ -1104,8 +1114,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> qdisc_put(old);
> }
> } else {
> - dev_queue = dev_ingress_queue(dev);
> - old = dev_graft_qdisc(dev_queue, new);
> + old = dev_graft_qdisc(dev_queue, NULL);
> +
> + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> + * pointer(s) in mini_qdisc_pair_swap().
> + */
> + qdisc_notify(net, skb, n, classid, old, new, extack);
> + qdisc_destroy(old);
> +
> + dev_graft_qdisc(dev_queue, new);
> }
>
> skip:
> @@ -1119,8 +1137,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>
> if (new && new->ops->attach)
> new->ops->attach(new);
> - } else {
> - notify_and_destroy(net, skb, n, classid, old, new, extack);
> }
>
> if (dev->flags & IFF_UP)
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
> qdisc_free(q);
> }
>
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
> {
> const struct Qdisc_ops *ops = qdisc->ops;
>
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
> call_rcu(&qdisc->rcu, qdisc_free_cb);
> }
>
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> + if (qdisc->flags & TCQ_F_BUILTIN)
> + return;
> +
> + __qdisc_destroy(qdisc);
> +}
> +
> void qdisc_put(struct Qdisc *qdisc)
> {
> if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
> !refcount_dec_and_test(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> }
> EXPORT_SYMBOL(qdisc_put);
>
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
> !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
> return;
>
> - qdisc_destroy(qdisc);
> + __qdisc_destroy(qdisc);
> rtnl_unlock();
> }
> EXPORT_SYMBOL(qdisc_put_unlocked);
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-08 14:12 ` Pedro Tammela
@ 2023-05-08 22:01 ` Peilin Ye
0 siblings, 0 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-08 22:01 UTC (permalink / raw)
To: Pedro Tammela
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jamal Hadi Salim, Cong Wang, Jiri Pirko, Peilin Ye,
Daniel Borkmann, John Fastabend, Vlad Buslov, Hillf Danton,
netdev, linux-kernel, Cong Wang
On Mon, May 08, 2023 at 11:12:24AM -0300, Pedro Tammela wrote:
> Thanks for chasing this!
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Thanks for testing, Pedro!
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-06 0:16 ` [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
2023-05-08 11:32 ` Jamal Hadi Salim
2023-05-08 14:12 ` Pedro Tammela
@ 2023-05-09 1:33 ` Jakub Kicinski
2023-05-10 20:11 ` Peilin Ye
2023-05-17 18:48 ` Jakub Kicinski
3 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-09 1:33 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
>
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1. Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
>
> This is just one of the possible consequences of concurrently accessing
> net_device::miniq_{in,e}gress pointers. The point is clear, however:
> B's first call to mini_qdisc_pair_swap() should take place after A's
> last call, in qdisc_destroy().
Great analysis, thanks for squashing this bug.
Have you considered creating a fix more localized to the miniq
implementation? It seems that having per-device miniq pointers is
incompatible with using reference counted objects. So miniq is
a more natural place to solve the problem. Otherwise workarounds
in the core keep piling up (here qdisc_graft()).
Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
If active qdisc is neither a1 nor a2 we should leave the dev state
alone.
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-09 1:33 ` Jakub Kicinski
@ 2023-05-10 20:11 ` Peilin Ye
2023-05-10 23:15 ` Jakub Kicinski
0 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-10 20:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Mon, May 08, 2023 at 06:33:24PM -0700, Jakub Kicinski wrote:
> Great analysis, thanks for squashing this bug.
Thanks, happy to help!
> Have you considered creating a fix more localized to the miniq
> implementation? It seems that having per-device miniq pointers is
> incompatible with using reference counted objects. So miniq is
> a more natural place to solve the problem. Otherwise workarounds
> in the core keep piling up (here qdisc_graft()).
>
> Can we replace the rcu_assign_pointer in (3rd) with a cmpxchg()?
> If active qdisc is neither a1 nor a2 we should leave the dev state
> alone.
Yes, I have tried fixing this in mini_qdisc_pair_swap(), but I am afraid
it is hard:
(3rd) is called from ->destroy(), so currently it uses RCU_INIT_POINTER()
to set dev->miniq_ingress to NULL. It will need a logic like:
I am A. Set dev->miniq_ingress to NULL, if and only if it is a1 or a2,
and do it atomically.
We need more than a cmpxchg() to implement this "set NULL iff a1 or a2".
Additionally:
On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> adds a flower filter X to A.
>
> Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> b2) to replace A, then adds a flower filter Y to B.
>
> Thread 1 A's refcnt Thread 2
> RTM_NEWQDISC (A, RTNL-locked)
> qdisc_create(A) 1
> qdisc_graft(A) 9
>
> RTM_NEWTFILTER (X, RTNL-lockless)
> __tcf_qdisc_find(A) 10
> tcf_chain0_head_change(A)
> mini_qdisc_pair_swap(A) (1st)
> |
> | RTM_NEWQDISC (B, RTNL-locked)
> RCU 2 qdisc_graft(B)
> | 1 notify_and_destroy(A)
> |
> tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> qdisc_destroy(A) tcf_chain0_head_change(B)
> tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> mini_qdisc_pair_swap(A) (3rd) |
> ... ...
Looking at the code, I think there is no guarantee that (1st) cannot
happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
handlers get preempted?
If (1st) happens later than (2nd), we will need to make (1st) no-op, by
detecting that we are the "old" Qdisc. I am not sure there is any
(clean) way to do it. I even thought about:
(1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
(2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
means we are the "old" Qdisc (have been replaced), and should do
nothing.
However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
egress one, so I can't container_of() during step (1) ...
Eventually I created [5,6/6]. It is a workaround indeed, in the sense
that it changes sch_api.c to avoid a mini Qdisc issue. However I think it
makes the code correct in a relatively understandable way, without slowing
down mini_qdisc_pair_swap() or sch_handle_*gress().
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-10 20:11 ` Peilin Ye
@ 2023-05-10 23:15 ` Jakub Kicinski
2023-05-11 20:40 ` Peilin Ye
0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-10 23:15 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> > adds a flower filter X to A.
> >
> > Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> > b2) to replace A, then adds a flower filter Y to B.
> >
> > Thread 1 A's refcnt Thread 2
> > RTM_NEWQDISC (A, RTNL-locked)
> > qdisc_create(A) 1
> > qdisc_graft(A) 9
> >
> > RTM_NEWTFILTER (X, RTNL-lockless)
> > __tcf_qdisc_find(A) 10
> > tcf_chain0_head_change(A)
> > mini_qdisc_pair_swap(A) (1st)
> > |
> > | RTM_NEWQDISC (B, RTNL-locked)
> > RCU 2 qdisc_graft(B)
> > | 1 notify_and_destroy(A)
> > |
> > tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> > qdisc_destroy(A) tcf_chain0_head_change(B)
> > tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> > mini_qdisc_pair_swap(A) (3rd) |
> > ... ...
>
> Looking at the code, I think there is no guarantee that (1st) cannot
> happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> handlers get preempted?
Right, we need qdisc_graft(B) to update the appropriate dev pointer
to point to b1. With that the ordering should not matter. Probably
using the ->attach() callback?
> If (1st) happens later than (2nd), we will need to make (1st) no-op, by
> detecting that we are the "old" Qdisc. I am not sure there is any
> (clean) way to do it. I even thought about:
>
> (1) Get the containing Qdisc of "miniqp" we are working on, "qdisc";
> (2) Test if "qdisc == qdisc->dev_queue->qdisc_sleeping". If false, it
> means we are the "old" Qdisc (have been replaced), and should do
> nothing.
>
> However, for clsact Qdiscs I don't know if "miniqp" is the ingress or
> egress one, so I can't container_of() during step (1) ...
And we can't be using multiple pieces of information to make
the decision since AFAIU mini_qdisc_pair_swap() can race with
qdisc_graft().
My thinking was to make sure that dev->miniq_* pointers always point
to one of the miniqs of the currently attached qdisc. Right now, on
a quick look, those pointers are not initialized during initial graft,
only when first filter is added, and may be cleared when filters are
removed. But I don't think that's strictly required, miniq with no
filters should be fine.
> Eventually I created [5,6/6]. It is a workaround indeed, in the sense
> that it changes sch_api.c to avoid a mini Qdisc issue. However I think it
> makes the code correct in a relatively understandable way,
What's your benchmark for being understandable?
> without slowing down mini_qdisc_pair_swap() or sch_handle_*gress().
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-10 23:15 ` Jakub Kicinski
@ 2023-05-11 20:40 ` Peilin Ye
2023-05-11 23:20 ` Jakub Kicinski
0 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-11 20:40 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> My thinking was to make sure that dev->miniq_* pointers always point
> to one of the miniqs of the currently attached qdisc. Right now, on
> a quick look, those pointers are not initialized during initial graft,
> only when first filter is added, and may be cleared when filters are
> removed. But I don't think that's strictly required, miniq with no
> filters should be fine.
Ah, I see, thanks for explaining, I didn't think of that. Looking at
sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
(mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
Should I keep this behavior by:
diff --git a/net/core/dev.c b/net/core/dev.c
index 735096d42c1d..9016481377e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
* that are not configured with an ingress qdisc will bail
* out here.
*/
- if (!miniq)
+ if (!miniq || !miniq->filter_list)
return skb;
if (*pt_prev) {
On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> > > Thread 1 A's refcnt Thread 2
> > > RTM_NEWQDISC (A, RTNL-locked)
> > > qdisc_create(A) 1
> > > qdisc_graft(A) 9
> > >
> > > RTM_NEWTFILTER (X, RTNL-lockless)
> > > __tcf_qdisc_find(A) 10
> > > tcf_chain0_head_change(A)
> > > mini_qdisc_pair_swap(A) (1st)
> > > |
> > > | RTM_NEWQDISC (B, RTNL-locked)
> > > RCU 2 qdisc_graft(B)
> > > | 1 notify_and_destroy(A)
> > > |
> > > tcf_block_release(A) 0 RTM_NEWTFILTER (Y, RTNL-lockless)
> > > qdisc_destroy(A) tcf_chain0_head_change(B)
> > > tcf_chain0_head_change_cb_del(A) mini_qdisc_pair_swap(B) (2nd)
> > > mini_qdisc_pair_swap(A) (3rd) |
> > > ... ...
> >
> > Looking at the code, I think there is no guarantee that (1st) cannot
> > happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> > handlers get preempted?
>
> Right, we need qdisc_graft(B) to update the appropriate dev pointer
> to point to b1. With that the ordering should not matter. Probably
> using the ->attach() callback?
->attach() is later than dev_graft_qdisc(). We should get ready for
concurrent filter requests (i.e. have dev pointer pointing to b1) before
grafting (publishing) B. Also currently qdisc_graft() doesn't call
->attach() for ingress and clsact Qdiscs.
But I see your point, thanks for the suggestion! I'll try ->init() and
create v2.
Thanks,
Peilin Ye
^ permalink raw reply related [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-11 20:40 ` Peilin Ye
@ 2023-05-11 23:20 ` Jakub Kicinski
2023-05-11 23:46 ` Peilin Ye
0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-11 23:20 UTC (permalink / raw)
To: Peilin Ye, Jiri Pirko, Daniel Borkmann
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Peilin Ye, John Fastabend, Vlad Buslov, Pedro Tammela,
Hillf Danton, netdev, linux-kernel, Cong Wang
On Thu, 11 May 2023 13:40:10 -0700 Peilin Ye wrote:
> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > My thinking was to make sure that dev->miniq_* pointers always point
> > to one of the miniqs of the currently attached qdisc. Right now, on
> > a quick look, those pointers are not initialized during initial graft,
> > only when first filter is added, and may be cleared when filters are
> > removed. But I don't think that's strictly required, miniq with no
> > filters should be fine.
>
> Ah, I see, thanks for explaining, I didn't think of that. Looking at
> sch_handle_ingress() BTW, currently mini Qdisc stats aren't being updated
> (mini_qdisc_bstats_cpu_update()) if there's no filters, is this intended?
> Should I keep this behavior by:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 735096d42c1d..9016481377e0 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5116,7 +5116,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
> * that are not configured with an ingress qdisc will bail
> * out here.
> */
> - if (!miniq)
> + if (!miniq || !miniq->filter_list)
> return skb;
>
> if (*pt_prev) {
Good question, maybe Jiri or Daniel can answer?
> On Wed, May 10, 2023 at 04:15:59PM -0700, Jakub Kicinski wrote:
> > On Wed, 10 May 2023 13:11:19 -0700 Peilin Ye wrote:
> > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> > > handlers get preempted?
> >
> > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > to point to b1. With that the ordering should not matter. Probably
> > using the ->attach() callback?
>
> ->attach() is later than dev_graft_qdisc(). We should get ready for
> concurrent filter requests (i.e. have dev pointer pointing to b1) before
> grafting (publishing) B.
I thought even for "unlocked" filter operations the start of it is
under the lock, but the lock gets dropped after qdisc/block are found.
I could be misremembering, I haven't looked at the code.
> Also currently qdisc_graft() doesn't call
> ->attach() for ingress and clsact Qdiscs.
>
> But I see your point, thanks for the suggestion! I'll try ->init() and
> create v2.
->init() may be too early, aren't there any error points which could
prevent the Qdisc from binding after ->init() was called?
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-11 23:20 ` Jakub Kicinski
@ 2023-05-11 23:46 ` Peilin Ye
2023-05-12 0:11 ` Peilin Ye
0 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-11 23:46 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jiri Pirko, Daniel Borkmann, David S. Miller, Eric Dumazet,
Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Vlad Buslov, Pedro Tammela, Hillf Danton, netdev,
linux-kernel, Cong Wang
On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > But I see your point, thanks for the suggestion! I'll try ->init() and
> > create v2.
>
> ->init() may be too early, aren't there any error points which could
> prevent the Qdisc from binding after ->init() was called?
You're right, it's in qdisc_create(), argh...
On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > > Looking at the code, I think there is no guarantee that (1st) cannot
> > > > happen after (2nd), although unlikely? Can RTNL-lockless RTM_NEWTFILTER
> > > > handlers get preempted?
> > >
> > > Right, we need qdisc_graft(B) to update the appropriate dev pointer
> > > to point to b1. With that the ordering should not matter. Probably
> > > using the ->attach() callback?
> >
> > ->attach() is later than dev_graft_qdisc(). We should get ready for
> > concurrent filter requests (i.e. have dev pointer pointing to b1) before
> > grafting (publishing) B.
>
> I thought even for "unlocked" filter operations the start of it is
> under the lock, but the lock gets dropped after qdisc/block are found.
> I could be misremembering, I haven't looked at the code.
No, f.e. RTM_NEWTFILTER is registered as RTNL_FLAG_DOIT_UNLOCKED, so
tc_new_tfilter() starts and calls __tcf_qdisc_find() without RTNL mutex,
at least in latest code.
Thinking,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-11 23:46 ` Peilin Ye
@ 2023-05-12 0:11 ` Peilin Ye
2023-05-15 22:45 ` Peilin Ye
0 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-12 0:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jiri Pirko, Daniel Borkmann, David S. Miller, Eric Dumazet,
Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Vlad Buslov, Pedro Tammela, Hillf Danton, netdev,
linux-kernel, Cong Wang
On Thu, May 11, 2023 at 04:46:33PM -0700, Peilin Ye wrote:
> On Thu, May 11, 2023 at 04:20:23PM -0700, Jakub Kicinski wrote:
> > > But I see your point, thanks for the suggestion! I'll try ->init() and
> > > create v2.
> >
> > ->init() may be too early, aren't there any error points which could
> > prevent the Qdisc from binding after ->init() was called?
>
> You're right, it's in qdisc_create(), argh...
->destroy() is called for all error points between ->init() and
dev_graft_qdisc(). I'll try handling it in ->destroy().
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-12 0:11 ` Peilin Ye
@ 2023-05-15 22:45 ` Peilin Ye
2023-05-16 19:22 ` Jakub Kicinski
0 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-15 22:45 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Jiri Pirko, Daniel Borkmann, David S. Miller, Eric Dumazet,
Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Vlad Buslov, Pedro Tammela, Hillf Danton, netdev,
linux-kernel, Cong Wang
On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > ->init() may be too early, aren't there any error points which could
> > > prevent the Qdisc from binding after ->init() was called?
> >
> > You're right, it's in qdisc_create(), argh...
>
> ->destroy() is called for all error points between ->init() and
> dev_graft_qdisc(). I'll try handling it in ->destroy().
Sorry for any confusion: there is no point at all undoing "setting dev
pointer to b1" in ->destroy() because datapath has already been affected.
To summarize, grafting B mustn't fail after setting dev pointer to b1, so
->init() is too early, because e.g. if user requested [1] to create a rate
estimator, gen_new_estimator() could fail after ->init() in
qdisc_create().
On the other hand, ->attach() is too late because it's later than
dev_graft_qdisc(), so concurrent filter requests might see uninitialized
dev pointer in theory.
Please suggest; is adding another callback (or calling ->attach()) right
before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
fix?
[1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-15 22:45 ` Peilin Ye
@ 2023-05-16 19:22 ` Jakub Kicinski
2023-05-16 19:35 ` Vlad Buslov
0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-16 19:22 UTC (permalink / raw)
To: Vlad Buslov
Cc: Peilin Ye, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
> > > You're right, it's in qdisc_create(), argh...
> >
> > ->destroy() is called for all error points between ->init() and
> > dev_graft_qdisc(). I'll try handling it in ->destroy().
>
> Sorry for any confusion: there is no point at all undoing "setting dev
> pointer to b1" in ->destroy() because datapath has already been affected.
>
> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
> ->init() is too early, because e.g. if user requested [1] to create a rate
> estimator, gen_new_estimator() could fail after ->init() in
> qdisc_create().
>
> On the other hand, ->attach() is too late because it's later than
> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
> dev pointer in theory.
>
> Please suggest; is adding another callback (or calling ->attach()) right
> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
> fix?
>
> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact
Vlad, could you please clarify how you expect the unlocked filter
operations to work when the qdisc has global state?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-16 19:22 ` Jakub Kicinski
@ 2023-05-16 19:35 ` Vlad Buslov
2023-05-16 21:50 ` Jakub Kicinski
0 siblings, 1 reply; 51+ messages in thread
From: Vlad Buslov @ 2023-05-16 19:35 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Peilin Ye, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Tue 16 May 2023 at 12:22, Jakub Kicinski <kuba@kernel.org> wrote:
> On Mon, 15 May 2023 15:45:15 -0700 Peilin Ye wrote:
>> On Thu, May 11, 2023 at 05:11:23PM -0700, Peilin Ye wrote:
>> > > You're right, it's in qdisc_create(), argh...
>> >
>> > ->destroy() is called for all error points between ->init() and
>> > dev_graft_qdisc(). I'll try handling it in ->destroy().
>>
>> Sorry for any confusion: there is no point at all undoing "setting dev
>> pointer to b1" in ->destroy() because datapath has already been affected.
>>
>> To summarize, grafting B mustn't fail after setting dev pointer to b1, so
>> ->init() is too early, because e.g. if user requested [1] to create a rate
>> estimator, gen_new_estimator() could fail after ->init() in
>> qdisc_create().
>>
>> On the other hand, ->attach() is too late because it's later than
>> dev_graft_qdisc(), so concurrent filter requests might see uninitialized
>> dev pointer in theory.
>>
>> Please suggest; is adding another callback (or calling ->attach()) right
>> before dev_graft_qdisc() for ingress (clsact) Qdiscs too much for this
>> fix?
>>
>> [1] e.g. $ tc qdisc add dev eth0 estimator 1s 8s clsact
>
> Vlad, could you please clarify how you expect the unlocked filter
> operations to work when the qdisc has global state?
Jakub, I didn't account for per-net_device pointer usage by miniqp code
hence this bug. I didn't comment on the fix because I was away from my
PC last week but Peilin's approach seems reasonable to me. When Peilin
brought up the issue initially I also tried to come up with some trick
to contain the changes to miniqp code instead of changing core but
couldn't think of anything workable due to the limitations already
discussed in this thread. I'm open to explore alternative approaches to
solving this issue, if that is what you suggest.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-16 19:35 ` Vlad Buslov
@ 2023-05-16 21:50 ` Jakub Kicinski
2023-05-16 22:58 ` Peilin Ye
0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-16 21:50 UTC (permalink / raw)
To: Vlad Buslov
Cc: Peilin Ye, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Tue, 16 May 2023 22:35:51 +0300 Vlad Buslov wrote:
> > Vlad, could you please clarify how you expect the unlocked filter
> > operations to work when the qdisc has global state?
>
> Jakub, I didn't account for per-net_device pointer usage by miniqp code
> hence this bug. I didn't comment on the fix because I was away from my
> PC last week but Peilin's approach seems reasonable to me. When Peilin
> brought up the issue initially I also tried to come up with some trick
> to contain the changes to miniqp code instead of changing core but
> couldn't think of anything workable due to the limitations already
> discussed in this thread. I'm open to explore alternative approaches to
> solving this issue, if that is what you suggest.
Given Peilin's investigation I think fix without changing core may
indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
is elevated will be appreciated by the users, do we already have
similar behavior in other parts of TC?
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-16 21:50 ` Jakub Kicinski
@ 2023-05-16 22:58 ` Peilin Ye
2023-05-17 0:39 ` Jakub Kicinski
0 siblings, 1 reply; 51+ messages in thread
From: Peilin Ye @ 2023-05-16 22:58 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Vlad Buslov, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Tue, May 16, 2023 at 02:50:10PM -0700, Jakub Kicinski wrote:
> > > Vlad, could you please clarify how you expect the unlocked filter
> > > operations to work when the qdisc has global state?
> >
> > Jakub, I didn't account for per-net_device pointer usage by miniqp code
> > hence this bug. I didn't comment on the fix because I was away from my
> > PC last week but Peilin's approach seems reasonable to me. When Peilin
> > brought up the issue initially I also tried to come up with some trick
> > to contain the changes to miniqp code instead of changing core but
> > couldn't think of anything workable due to the limitations already
> > discussed in this thread. I'm open to explore alternative approaches to
> > solving this issue, if that is what you suggest.
>
> Given Peilin's investigation I think fix without changing core may
> indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> is elevated will be appreciated by the users, do we already have
> similar behavior in other parts of TC?
Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY:
net/sched/cls_u32.c:
if (ht->refcnt == 1) {
u32_destroy_hnode(tp, ht, extack);
} else {
NL_SET_ERR_MSG_MOD(extack, "Can not delete in-use filter");
return -EBUSY;
}
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-16 22:58 ` Peilin Ye
@ 2023-05-17 0:39 ` Jakub Kicinski
2023-05-17 8:49 ` Vlad Buslov
0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-17 0:39 UTC (permalink / raw)
To: Peilin Ye
Cc: Vlad Buslov, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> > Given Peilin's investigation I think fix without changing core may
> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
> > is elevated will be appreciated by the users, do we already have
> > similar behavior in other parts of TC?
>
> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
I meant -EBUSY due to a race (another operation being in flight).
I think that's different.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-17 0:39 ` Jakub Kicinski
@ 2023-05-17 8:49 ` Vlad Buslov
2023-05-17 18:48 ` Jakub Kicinski
0 siblings, 1 reply; 51+ messages in thread
From: Vlad Buslov @ 2023-05-17 8:49 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Peilin Ye, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Tue 16 May 2023 at 17:39, Jakub Kicinski <kuba@kernel.org> wrote:
> On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
>> > Given Peilin's investigation I think fix without changing core may
>> > indeed be hard. I'm not sure if returning -EBUSY when qdisc refcnt
>> > is elevated will be appreciated by the users, do we already have
>> > similar behavior in other parts of TC?
>>
>> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
>
> I meant -EBUSY due to a race (another operation being in flight).
> I think that's different.
I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
functionality instead of returning error to the user would be a better
approach? Currently the function is replayed when qdisc_create() returns
EAGAIN. It should be trivial to do the same for qdisc_graft() result.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-17 8:49 ` Vlad Buslov
@ 2023-05-17 18:48 ` Jakub Kicinski
2023-05-17 22:20 ` Peilin Ye
0 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-17 18:48 UTC (permalink / raw)
To: Vlad Buslov
Cc: Peilin Ye, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> On Tue 16 May 2023 at 17:39, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 16 May 2023 15:58:46 -0700 Peilin Ye wrote:
> >>
> >> Seems like trying to delete an "in-use" cls_u32 filter returns -EBUSY
> >
> > I meant -EBUSY due to a race (another operation being in flight).
> > I think that's different.
>
> I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> functionality instead of returning error to the user would be a better
> approach? Currently the function is replayed when qdisc_create() returns
> EAGAIN. It should be trivial to do the same for qdisc_graft() result.
Sounds better than returning -EBUSY to the user and expecting them
to retry, yes.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-17 18:48 ` Jakub Kicinski
@ 2023-05-17 22:20 ` Peilin Ye
0 siblings, 0 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-17 22:20 UTC (permalink / raw)
To: Vlad Buslov
Cc: Jakub Kicinski, Jiri Pirko, Daniel Borkmann, David S. Miller,
Eric Dumazet, Paolo Abeni, Jamal Hadi Salim, Cong Wang, Peilin Ye,
John Fastabend, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Wed, May 17, 2023 at 11:48:05AM -0700, Jakub Kicinski wrote:
> On Wed, 17 May 2023 11:49:10 +0300 Vlad Buslov wrote:
> > I wonder if somehow leveraging existing tc_modify_qdisc() 'replay'
> > functionality instead of returning error to the user would be a better
> > approach? Currently the function is replayed when qdisc_create() returns
> > EAGAIN. It should be trivial to do the same for qdisc_graft() result.
>
> Sounds better than returning -EBUSY to the user and expecting them
> to retry, yes.
Thanks for the suggestion, Vlad! I'll try this in tc_modify_qdisc() and
tc_get_qdisc() (for Qdisc deletion) in v2.
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-06 0:16 ` [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
` (2 preceding siblings ...)
2023-05-09 1:33 ` Jakub Kicinski
@ 2023-05-17 18:48 ` Jakub Kicinski
2023-05-17 21:18 ` Peilin Ye
3 siblings, 1 reply; 51+ messages in thread
From: Jakub Kicinski @ 2023-05-17 18:48 UTC (permalink / raw)
To: Peilin Ye
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Fri, 5 May 2023 17:16:10 -0700 Peilin Ye wrote:
> } else {
> - dev_queue = dev_ingress_queue(dev);
> - old = dev_graft_qdisc(dev_queue, new);
> + old = dev_graft_qdisc(dev_queue, NULL);
> +
> + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> + * pointer(s) in mini_qdisc_pair_swap().
> + */
> + qdisc_notify(net, skb, n, classid, old, new, extack);
> + qdisc_destroy(old);
> +
> + dev_graft_qdisc(dev_queue, new);
BTW can't @old be NULL here?
^ permalink raw reply [flat|nested] 51+ messages in thread* Re: [PATCH net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
2023-05-17 18:48 ` Jakub Kicinski
@ 2023-05-17 21:18 ` Peilin Ye
0 siblings, 0 replies; 51+ messages in thread
From: Peilin Ye @ 2023-05-17 21:18 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann, John Fastabend,
Vlad Buslov, Pedro Tammela, Hillf Danton, netdev, linux-kernel,
Cong Wang
On Wed, May 17, 2023 at 11:48:25AM -0700, Jakub Kicinski wrote:
> > } else {
> > - dev_queue = dev_ingress_queue(dev);
> > - old = dev_graft_qdisc(dev_queue, new);
> > + old = dev_graft_qdisc(dev_queue, NULL);
> > +
> > + /* {ingress,clsact}_destroy() "old" before grafting "new" to avoid
> > + * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> > + * pointer(s) in mini_qdisc_pair_swap().
> > + */
> > + qdisc_notify(net, skb, n, classid, old, new, extack);
> > + qdisc_destroy(old);
> > +
> > + dev_graft_qdisc(dev_queue, new);
>
> BTW can't @old be NULL here?
ingress_queue->qdisc_sleeping is initialized to &noop_qdisc (placeholder)
in dev_ingress_queue_create(), and dev_graft_qdisc() also grafts
&noop_qdisc to represent "there's no Qdisc":
/* ... and graft new one */
if (qdisc == NULL)
qdisc = &noop_qdisc;
dev_queue->qdisc_sleeping = qdisc;
So @old can't be NULL here.
Thanks,
Peilin Ye
^ permalink raw reply [flat|nested] 51+ messages in thread