From: Vlad Buslov <vladbu@nvidia.com>
To: Peilin Ye <yepeilin.cs@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>,
Jakub Kicinski <kuba@kernel.org>,
Pedro Tammela <pctammela@mojatatu.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Cong Wang <xiyou.wangcong@gmail.com>,
Jiri Pirko <jiri@resnulli.us>,
Peilin Ye <peilin.ye@bytedance.com>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
"Hillf Danton" <hdanton@sina.com>, <netdev@vger.kernel.org>,
Cong Wang <cong.wang@bytedance.com>
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Date: Thu, 8 Jun 2023 12:17:27 +0300 [thread overview]
Message-ID: <87pm66wbgo.fsf@nvidia.com> (raw)
In-Reply-To: <ZIEjUobtdPCu648e@C02FL77VMD6R.googleapis.com>

On Wed 07 Jun 2023 at 17:39, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> >> >> If livelock with concurrent filter insertion is an issue, then it can
>> >> >> be remedied by setting a new Qdisc->flags bit
>> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >> >
>> >> > Thanks for the suggestion! I'll try this approach.
>> >> >
>> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt
>> >> > of the "being-deleted" Qdisc. I'll try forcing "late" requests (those
>> >> > that arrive after the Qdisc has been flagged as being-deleted) to sync on
>> >> > the RTNL lock without (i.e. before) taking the Qdisc refcnt (otherwise I
>> >> > think Task 1 will replay for even longer?).
>> >>
>> >> Yeah, I see what you mean. Looking at the code, __tcf_qdisc_find()
>> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> >> -EINVAL from that function when the "DELETED-REJECT-NEW-FILTERS" flag
>> >> is set is also fine? That would be much easier to implement than moving
>> >> rtnl_lock there.
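
(For the record, here is roughly the kind of check I have in mind. It is
just a sketch, not actual patch code: TCQ_F_DESTROYING below is a made-up
placeholder name for the "DELETED-REJECT-NEW-FILTERS" bit, and the exact
error path inside __tcf_qdisc_find() may differ.)

    /* Placeholder flag bit, set under rtnl_lock by the qdisc delete/replace
     * path before it starts waiting for the refcount to drop.
     */
    #define TCQ_F_DESTROYING 0x1000

    /* In __tcf_qdisc_find(), next to the existing q->refcnt check: */
    if (q->flags & TCQ_F_DESTROYING) {
            NL_SET_ERR_MSG(extack, "Qdisc is being deleted");
            return -EINVAL;         /* reject just like the refcnt == 0 case */
    }
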
>> >
>> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
>> > 16G, CONFIG_NR_CPUS=8). I tried deleting the ingress Qdisc (let's call it
>> > "request A") while it has a lot of ongoing filter requests, and here's the
>> > result:
>> >
>> >              #1         #2         #3         #4
>> > ------------------------------------------------------
>> > a. refcnt    89         93         230        571
>> > b. replayed  167,568    196,450    336,291    878,027
>> > c. time real 0m2.478s   0m2.746s   0m3.693s   0m9.461s
>> >         user 0m0.000s   0m0.000s   0m0.000s   0m0.000s
>> >         sys  0m0.623s   0m0.681s   0m1.119s   0m2.770s
>> >
>> > a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>> > b. is the number of times A has been replayed;
>> > c. is the time(1) output for A.
>> >
>> > a. and b. are collected from printk() output. This is better than before,
>> > but A could still be replayed hundreds of thousands of times and hang for
>> > a few seconds.
>>
>> I don't get where the few-seconds waiting time comes from. I'm probably
>> missing something obvious here, but the waiting time should be the
>> maximum op latency of a new/get/del filter request that is already
>> in-flight (i.e. one that has already passed the qdisc_is_destroying()
>> check), and that should take several orders of magnitude less time.
>
> Yeah I agree, here's what I did:
>
> In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
> loop:
>
> $ echo "1 1 32" > /sys/bus/netdevsim/new_device
> $ tc qdisc add dev eth1 ingress
> $ for (( i=1; i<=3000; i++ ))
> > do
> > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
> > done
>
> When the loop is running, I delete the Qdisc in Terminal 2:
>
> $ time tc qdisc delete dev eth1 ingress
>
> That took a few seconds on average. However, if I specify a unique "prio"
> when adding filters in that loop, e.g.:
>
> $ for (( i=1; i<=3000; i++ ))
> > do
> > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
> > done                                     ^^^^^^^
>
> Then deleting the Qdisc in Terminal 2 becomes a lot faster:
>
> real 0m0.712s
> user 0m0.000s
> sys 0m0.152s
>
> In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
> yet another test [1], which looks a lot better.
That makes sense, thanks for explaining.
>
> When I didn't specify "prio", sometimes the
> rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returned
> -EEXIST. Is that because concurrent add-filter requests auto-allocated
> the same "prio" number and so collided with each other? Do you think
> this is related to why it's slow?
It is slow because when you create a filter without providing a priority,
you are basically measuring the latency of creating a whole flower
classifier instance (multiple memory allocations, initialization of all
kinds of idrs, hash tables and locks, updating the tp list in the chain,
etc.), not just a single filter, so significantly higher latency is
expected.

But my point still stands: with the latest version of your fix, the maximum
time spent 'spinning' in sch_api is bounded by the latency of the slowest
concurrent tcf_{new|del|get}_tfilter op that has already obtained the
qdisc, and any concurrent filter API message arriving after the qdisc->flags
"DELETED-REJECT-NEW-FILTERS" bit has been set will fail, so it can't
livelock the concurrent qdisc delete/replace.
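
(To make that concrete, the delete/replace side could look something like
the sketch below. Again, this is not the actual patch: it reuses the
hypothetical TCQ_F_DESTROYING bit from the sketch above, and
refcount_dec_if_one() stands in for whatever refcount helper the real code
uses.)

    /* In qdisc_graft() for ingress/clsact, under rtnl_lock: */
    q->flags |= TCQ_F_DESTROYING;   /* new filter requests now get -EINVAL */

    /* Only tear the qdisc down once we hold the last reference; otherwise
     * let the netlink handler replay this delete/replace request.  Because
     * of the flag above, only filter ops that already passed
     * __tcf_qdisc_find() can still hold references, so the number of
     * replays is bounded by the ops that were in flight when the flag was
     * set.
     */
    if (!refcount_dec_if_one(&q->refcnt))
            return -EAGAIN;         /* caller replays under rtnl_lock */

    qdisc_destroy(q);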
>
> Thanks,
> Peilin Ye
>
> [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
> in -batch mode, each repeatedly adding a unique filter (with "prio" and
> "handle" specified) and then deleting it. Again, while they were running I
> deleted the ingress Qdisc, and here's the result:
>
>              #1         #2         #3         #4
> ------------------------------------------------------
> a. refcnt    64         63         64         64
> b. replayed  169        5,630      887        3,442
> c. time real 0m0.171s   0m0.147s   0m0.186s   0m0.111s
>         user 0m0.000s   0m0.009s   0m0.001s   0m0.000s
>         sys  0m0.112s   0m0.108s   0m0.115s   0m0.104s