From: Vlad Buslov <vladbu@mellanox.com>
To: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Vlad Buslov <vladbu@mellanox.com>,
Edward Cree <ecree@solarflare.com>,
Linux Kernel Network Developers <netdev@vger.kernel.org>,
David Miller <davem@davemloft.net>,
Jamal Hadi Salim <jhs@mojatatu.com>,
Jiri Pirko <jiri@resnulli.us>,
Davide Caratti <dcaratti@redhat.com>,
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>,
Jakub Kicinski <kuba@kernel.org>
Subject: Re: [PATCH net-next v2 0/4] Implement classifier-action terse dump mode
Date: Mon, 25 May 2020 14:38:49 +0300 [thread overview]
Message-ID: <vbfa71wz1km.fsf@mellanox.com> (raw)
In-Reply-To: <CAM_iQpVJ-MT0t1qtvWBp2=twPq6GWsn5-sAW6=QVf4Gc97Mmeg@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 6249 bytes --]
On Fri 22 May 2020 at 22:33, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, May 20, 2020 at 12:24 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >> I considered that approach initially but decided against it for
>> >> following reasons:
>> >>
>> >> - Generic data is covered by current terse dump implementation.
>> >> Everything else will be act or cls specific which would result long
>> >> list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST,
>> >> TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ...,
>> >> TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of
>> >> these would require a lot of dedicated logic in act and cls dump
>> >> callbacks. Also, it would be quite a challenge to test all possible
>> >> combinations.
>> >
>> > Well, if you consider netlink dump as a database query, what Edward
>> > proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather
>> > than "select * from cls_db".
>> >
>> > No one said it is easy to implement, it is just more elegant than you
>> > select a hardcoded set of columns for the user.
>>
>> As I explained to Edward, having denser netlink packets with more
>> filters per packet is only part of optimization. Another part is not
>> executing some code at all. Consider fl_dump_key() which is 200 lines
>> function with bunch of conditionals like that:
>>
>> static int fl_dump_key(struct sk_buff *skb, struct net *net,
>> struct fl_flow_key *key, struct fl_flow_key *mask)
>> {
>> if (mask->meta.ingress_ifindex) {
>> struct net_device *dev;
>>
>> dev = __dev_get_by_index(net, key->meta.ingress_ifindex);
>> if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name))
>> goto nla_put_failure;
>> }
>>
>> if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
>> mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
>> sizeof(key->eth.dst)) ||
>> fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC,
>> mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK,
>> sizeof(key->eth.src)) ||
>> fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE,
>> &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
>> sizeof(key->basic.n_proto)))
>> goto nla_put_failure;
>>
>> if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls))
>> goto nla_put_failure;
>>
>> if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID,
>> TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan))
>> goto nla_put_failure;
>> ...
>>
>>
>> Now imagine all of these are extended with additional if (flags &
>> TCA_DUMP_XXX). All gains from not outputting some other minor stuff into
>> netlink packet will be negated by it.
>
> Interesting, are you saying a bit test is as expensive as appending
> an actual netlink attribution to the dumping? I am surprised.
It is not just adding a clause to all those conditionals. Some functions
are not called at all with current terse dump design. In the case of
fl_dump_key() it is just a bunch of conditionals (and maybe price of
cache misses to access struct fl_flow_key in a first place). In case of
tc_action_ops->dump() it is also obtaining a spinlock, some atomic ops,
etc. But I agree, "negated" is too strong of a word, "significantly
impacted" is more correct.
>
>
>>
>>
>> >
>> > Think about it, what if another user wants a less terse dump but still
>> > not a full dump? Would you implement ops->terse_dump2()? Or
>> > what if people still think your terse dump is not as terse as she wants?
>> > ops->mini_dump()? How many ops's we would end having?
>>
>> User can discard whatever he doesn't need in user land code. The goal of
>> this change is performance optimization, not designing a generic
>> kernel-space data filtering mechanism.
>
> You optimize the performance by reducing the dump size, which is
> already effectively a data filtering. This doesn't have to be your goal,
> you are implementing it anyway.
>
>
>>
>> >
>> >
>> >>
>> >> - It is hard to come up with proper validation for such implementation.
>> >> In case of terse dump I just return an error if classifier doesn't
>> >> implement the callback (and since current implementation only outputs
>> >> generic action info, it doesn't even require support from
>> >> action-specific dump callbacks). But, for example, how do we validate
>> >> a case where user sets some flower and tunnel_key act dump flags from
>> >> previous paragraph, but Qdisc contains some other classifier? Or
>> >> flower classifier points to other types of actions? Or when flower
>> >> classifier has and tunnel_key actions but also mirred? Should the
>> >
>> > Each action should be able to dump selectively too. If you think it
>> > as a database, it is just a different table with different schemas.
>>
>> How is designing custom SQL-like query language (according to your
>> example at the beginning of the mail) for filter dump is going to
>> improve performance? If there is a way to do it in fast a generic manner
>> with BPF, then I'm very interested to hear the details. But adding
>> hundred more hardcoded conditionals is just not a solution considering
>> main motivations for this change is performance.
>
> I still wonder how a bit test is as expensive as you claim, it does
> not look like you actually measure it. This of course depends on the
> size of the dump, but if you look at other netlink dump in kernel,
> not just tc filters, we already dump a lot of attributes per record.
>
> Thanks.
I agree that I didn't specify which parts of the change constitute what
fraction of the dump rate increase. Lets stage a simple test to verify
the cost of calling just two functions (fl_dump_key() and
tc_act_ops->dump() callback) and instantly discarding their results from
packet (patch attached).
[-- Attachment #2: Test patch --]
[-- Type: text/plain, Size: 1771 bytes --]
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 8ac7eb0a8309..267ee76d3ddb 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -771,6 +771,9 @@ tcf_action_dump_terse(struct sk_buff *skb, struct tc_action *a)
{
unsigned char *b = skb_tail_pointer(skb);
struct tc_cookie *cookie;
+ unsigned char *c;
+ struct nlattr *nest;
+ int err;
if (nla_put_string(skb, TCA_KIND, a->ops->kind))
goto nla_put_failure;
@@ -787,6 +790,16 @@ tcf_action_dump_terse(struct sk_buff *skb, struct tc_action *a)
}
rcu_read_unlock();
+ c = skb_tail_pointer(skb);
+ nest = nla_nest_start_noflag(skb, TCA_OPTIONS);
+ if (nest == NULL)
+ goto nla_put_failure;
+ err = tcf_action_dump_old(skb, a, 0, 0);
+ if (err > 0) {
+ nla_nest_end(skb, nest);
+ }
+ nlmsg_trim(skb, c);
+
return 0;
nla_put_failure:
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 0c574700da75..1bc6294c5c9b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -2771,8 +2771,10 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
static int fl_terse_dump(struct net *net, struct tcf_proto *tp, void *fh,
struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
{
+ struct fl_flow_key *key, *mask;
struct cls_fl_filter *f = fh;
struct nlattr *nest;
+ unsigned char *b;
bool skip_hw;
if (!f)
@@ -2786,8 +2788,15 @@ static int fl_terse_dump(struct net *net, struct tcf_proto *tp, void *fh,
spin_lock(&tp->lock);
+ key = &f->key;
+ mask = &f->mask->key;
skip_hw = tc_skip_hw(f->flags);
+ b = skb_tail_pointer(skb);
+ if (fl_dump_key(skb, net, key, mask))
+ goto nla_put_failure_locked;
+ nlmsg_trim(skb, b);
+
if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags))
goto nla_put_failure_locked;
[-- Attachment #3: Type: text/plain, Size: 947 bytes --]
Result for terse dumping 1m simple rules (flower with L2 key + gact
drop) on current net-next:
$ time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null
real 0m3.445s
user 0m2.087s
sys 0m1.298s
With patch applied:
$ time sudo tc -s filter show terse dev ens1f0_0 ingress >/dev/null
real 0m5.035s
user 0m3.289s
sys 0m1.687s
As we can see this leads to 30% overhead in kernel space execution time.
Now with more complex rules (flower 5tuple + act tunnel key + act
mirred) on current net-next:
$ time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null
real 0m4.052s
user 0m2.065s
sys 0m1.937s
Same rules with patch applied:
$ time sudo tc -s filter show terse dev ens1f0_0 ingress >/dev/null
real 0m6.346s
user 0m3.166s
sys 0m3.108s
With more complex rules performance impact on kernel space execution
time get more severe (60%). Overall, this looks like significant
degradation.
next prev parent reply other threads:[~2020-05-25 11:38 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-05-15 11:40 [PATCH net-next v2 0/4] Implement classifier-action terse dump mode Vlad Buslov
2020-05-15 11:40 ` [PATCH net-next v2 1/4] net: sched: introduce terse dump flag Vlad Buslov
2020-05-15 11:51 ` Jiri Pirko
2020-05-15 11:40 ` [PATCH net-next v2 2/4] net: sched: implement terse dump support in act Vlad Buslov
2020-05-15 11:51 ` Jiri Pirko
2020-05-15 11:40 ` [PATCH net-next v2 3/4] net: sched: cls_flower: implement terse dump support Vlad Buslov
2020-05-15 11:40 ` [PATCH net-next v2 4/4] selftests: implement flower classifier terse dump tests Vlad Buslov
2020-05-15 17:04 ` [PATCH net-next v2 0/4] Implement classifier-action terse dump mode Jakub Kicinski
2020-05-18 6:46 ` Vlad Buslov
2020-05-15 17:25 ` David Miller
2020-05-18 6:50 ` Vlad Buslov
2020-05-17 19:13 ` Cong Wang
2020-05-18 6:44 ` Vlad Buslov
2020-05-18 18:46 ` Cong Wang
2020-05-19 9:10 ` Vlad Buslov
2020-05-19 18:39 ` Cong Wang
2020-05-20 7:33 ` Vlad Buslov
2020-05-18 15:37 ` Edward Cree
2020-05-18 18:50 ` Cong Wang
2020-05-19 9:04 ` Vlad Buslov
2020-05-19 14:30 ` Edward Cree
2020-05-19 15:17 ` Vlad Buslov
2020-05-19 18:58 ` Cong Wang
2020-05-20 7:24 ` Vlad Buslov
2020-05-22 19:33 ` Cong Wang
2020-05-25 11:38 ` Vlad Buslov [this message]
2020-05-21 14:36 ` Vlad Buslov
2020-05-21 17:02 ` Jakub Kicinski
2020-05-22 16:16 ` Vlad Buslov
2020-05-23 11:06 ` Jamal Hadi Salim
2020-05-22 19:41 ` Cong Wang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=vbfa71wz1km.fsf@mellanox.com \
--to=vladbu@mellanox.com \
--cc=davem@davemloft.net \
--cc=dcaratti@redhat.com \
--cc=ecree@solarflare.com \
--cc=jhs@mojatatu.com \
--cc=jiri@resnulli.us \
--cc=kuba@kernel.org \
--cc=marcelo.leitner@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.