From: Vlad Buslov <vladbu@nvidia.com>
To: Peilin Ye <yepeilin.cs@gmail.com>, Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
Pedro Tammela <pctammela@mojatatu.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Paolo Abeni <pabeni@redhat.com>,
Cong Wang <xiyou.wangcong@gmail.com>,
Jiri Pirko <jiri@resnulli.us>,
Peilin Ye <peilin.ye@bytedance.com>,
Daniel Borkmann <daniel@iogearbox.net>,
"John Fastabend" <john.fastabend@gmail.com>,
Hillf Danton <hdanton@sina.com>, <netdev@vger.kernel.org>,
Cong Wang <cong.wang@bytedance.com>
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Date: Mon, 29 May 2023 15:58:50 +0300
Message-ID: <87fs7fxov6.fsf@nvidia.com>
In-Reply-To: <87jzwrxrz8.fsf@nvidia.com>
On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@nvidia.com> wrote:
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>>
>>> Hi Jakub and all,
>>>
>>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>>> > > Thanks a lot, I'll get right on it.
>>> >
>>> > Any insights? Is it just a live-lock inherent to the retry scheme
>>> > or we actually forget to release the lock/refcnt?
>>>
>>> I think it's just a thread holding the RTNL mutex for too long (replaying
>>> too many times). We could replay for arbitrary times in
>>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>>> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
> Task 1 graft Qdisc          Task 2 new filter
>          +                          +
>          |                          |
>          v                          v
>     rtnl_lock()              take q->refcnt
>          +                          +
>          |                          |
>          v                          v
> Spin while q->refcnt!=1     Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>
Followup: I considered two approaches for preventing the deadlock:

- Refactor cls_api to always obtain the lock before taking a reference
  to Qdisc. I started implementing a PoC that moves the rtnl_lock() call
  in tc_new_tfilter() before __tcf_qdisc_find() and decided it is not
  feasible, because cls_api will still try to obtain rtnl_lock when
  offloading a filter to a device with a non-unlocked driver, or after
  releasing the lock when loading a classifier module.

- Account for such cls_api behavior in sch_api by dropping and
  re-taking the lock before replaying. This actually seems to be quite
  straightforward, since the 'replay' functionality that we are reusing
  for this is designed for similar behavior: it releases the rtnl lock
  before loading a sch module, takes the lock again and safely replays
  the function by re-obtaining all the necessary data.
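
To make the difference between "replay while holding the lock" and
"drop the lock, then replay" concrete, here is a toy single-threaded
model of the two tasks from the diagram. It is only a sketch: every
name in it (struct model, graft_step(), filter_step(),
both_tasks_finish()) is made up for illustration and none of it is
kernel code. It also assumes that the blocked filter task gets the
lock as soon as it is released, which a real mutex does not promise on
every attempt.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, not taken from sch_api.c or cls_api.c. */
enum owner { NOBODY, GRAFT_TASK, FILTER_TASK };

struct model {
	enum owner rtnl;	/* who holds the modelled RTNL mutex */
	int refcnt;		/* modelled q->refcnt */
	bool graft_done;
	bool filter_done;
};

/* One scheduling step of the task that grafts the Qdisc. */
static void graft_step(struct model *m, bool drop_lock_on_eagain)
{
	if (m->graft_done)
		return;
	if (m->rtnl == NOBODY)
		m->rtnl = GRAFT_TASK;	/* (re)take the lock before replaying */
	if (m->rtnl != GRAFT_TASK)
		return;			/* blocked on the lock */
	if (m->refcnt == 1) {
		m->refcnt = 0;		/* "dec if one" succeeded, old Qdisc can go */
		m->rtnl = NOBODY;
		m->graft_done = true;
		return;
	}
	/* -EAGAIN path: optionally let waiters in before the next replay. */
	if (drop_lock_on_eagain)
		m->rtnl = NOBODY;
}

/* One scheduling step of the task that already holds a Qdisc reference. */
static void filter_step(struct model *m)
{
	if (m->filter_done || m->rtnl != NOBODY)
		return;			/* finished, or blocked on the lock */
	/* Lock, locked work and unlock are collapsed into this single step. */
	m->refcnt--;			/* drop the reference taken earlier */
	m->filter_done = true;
}

/* Round-robin the two tasks; report whether both finish within max_steps. */
static bool both_tasks_finish(bool drop_lock_on_eagain, int max_steps)
{
	/* Graft task holds the lock; refcnt = Qdisc's own ref + filter task's ref. */
	struct model m = { .rtnl = GRAFT_TASK, .refcnt = 2 };

	for (int i = 0; i < max_steps && !(m.graft_done && m.filter_done); i++) {
		graft_step(&m, drop_lock_on_eagain);
		filter_step(&m);
	}
	return m.graft_done && m.filter_done;
}
```

In this model both_tasks_finish(false, N) never succeeds for any N,
because the graft task keeps the lock while waiting for a reference
that can only be dropped under that lock, whereas
both_tasks_finish(true, N) completes after a couple of steps.
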
If livelock with concurrent filter insertion is an issue, then it can
be remedied by setting a new Qdisc->flags bit
"DELETED-REJECT-NEW-FILTERS" and checking for it together with
QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
insertion coming after the flag is set to synchronize on rtnl lock.
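
Roughly, the check I have in mind could look like the sketch below. The
flag name DEMO_QDISC_F_REJECT_NEW_FILTERS, both numeric values and the
helper filter_request_needs_rtnl() are placeholders invented for this
example and are not copied from the kernel headers; only the idea of
testing the new bit alongside QDISC_CLASS_OPS_DOIT_UNLOCKED comes from
the paragraph above.

```c
#include <stdbool.h>

/* Stand-in values for illustration only; not copied from kernel headers. */
#define QDISC_CLASS_OPS_DOIT_UNLOCKED	0x1u	/* class ops may run without RTNL */
#define DEMO_QDISC_F_REJECT_NEW_FILTERS	0x1u	/* hypothetical "being deleted" bit */

/*
 * Decide whether a filter request must synchronize on the rtnl lock.
 * cl_ops_flags models the class ops flags, qdisc_flags models Qdisc->flags.
 */
static bool filter_request_needs_rtnl(unsigned int cl_ops_flags,
				      unsigned int qdisc_flags)
{
	/* Class ops that are not marked unlocked always need the lock. */
	if (!(cl_ops_flags & QDISC_CLASS_OPS_DOIT_UNLOCKED))
		return true;
	/* Qdisc is being deleted: force new filter requests onto the lock. */
	if (qdisc_flags & DEMO_QDISC_F_REJECT_NEW_FILTERS)
		return true;
	return false;
}
```

With such a check, a request that arrives after the bit is set would
queue up on the rtnl lock instead of taking the unlocked path while the
old Qdisc is being torn down.
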
Thoughts?
>>>
>>> I tested the new reproducer Pedro posted, on:
>>>
>>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>>
>>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>> any issues (in about 30 minutes).
>>>
>>> 3. All 6 v5 patches, plus this diff:
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 286b7c58f5b9..988718ba5abe 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>>  			 * RTNL-unlocked filter request(s).  This is the counterpart of that
>>>  			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>>  			 */
>>> -			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>>> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>>> +				rtnl_unlock();
>>> +				rtnl_lock();
>>>  				return -EAGAIN;
>>> +			}
>>>  		}
>>>
>>>  		if (dev->flags & IFF_UP)
>>>
>>> Did not trigger any issues (in about 30 minutes) either.
>>>
>>> What would you suggest?
>>
>>
>> I am more worried it is a whack-a-mole situation. We fixed the first
>> reproducer with essentially patches 1-4 but we opened a new one which
>> the second reproducer catches. One thing the current reproducer does
>> is create a lot of rtnl contention in the beginning by creating all those
>> devices and then after it is just creating/deleting qdisc and doing
>> update with flower where such contention is reduced. i.e. it may just
>> take longer for the mole to pop up.
>>
>> Why don't we push the V1 patch in and then worry about getting clever
>> with EAGAIN after? Can you test the V1 version with the repro Pedro
>> posted? It shouldn't have these issues. Also it would be interesting to
>> see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.