* tc tp creation performance degradation since kernel 5.1
@ 2019-06-12 12:03 Jiri Pirko
2019-06-12 12:30 ` Paolo Abeni
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jiri Pirko @ 2019-06-12 12:03 UTC (permalink / raw)
To: netdev; +Cc: vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk
Hi.
I came across a serious performance degradation when adding many tps. I'm
using the following script:
------------------------------------------------------------------------
#!/bin/bash
dev=testdummy
ip link add name $dev type dummy
ip link set dev $dev up
tc qdisc add dev $dev ingress
tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
pref_id=1
while [ $pref_id -lt 20000 ]
do
echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
((pref_id++))
done
start=$(date +"%s")
tc -b $tmp_file_name
stop=$(date +"%s")
echo "Insertion duration: $(($stop - $start)) sec"
rm -f $tmp_file_name
ip link del dev $dev
------------------------------------------------------------------------
On my testing VM, the result on the 5.1 kernel is:
Insertion duration: 3 sec
On net-next this is:
Insertion duration: 54 sec
I did simple profiling using perf. Output on 5.1 kernel:
77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
0.60% tc_pref_scale.s libc-2.28.so [.] malloc
0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
0.37% tc [kernel.kallsyms] [k] idr_get_free
Output on net-next:
39.26% tc [kernel.vmlinux] [k] lock_is_held_type
33.99% tc [kernel.vmlinux] [k] tcf_chain_tp_find
12.77% tc [kernel.vmlinux] [k] __asan_load4_noabort
1.90% tc [kernel.vmlinux] [k] __asan_load8_noabort
1.08% tc [kernel.vmlinux] [k] lock_acquire
0.94% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled
0.61% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled.part.5
0.51% tc [kernel.vmlinux] [k] unwind_next_frame
0.50% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.47% tc_pref_scale.s [kernel.vmlinux] [k] lock_acquire
0.47% tc [kernel.vmlinux] [k] lock_release
I didn't investigate this any further yet. I fear that this might be
related to Vlad's changes in this area. Any ideas?
Thanks!
Jiri
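(The profiles above look like `perf report --stdio` output. For a quick side-by-side comparison of two such saved profiles, a small helper like the following could be used -- this is not part of the original workflow; the helper name and file layout are my assumptions:)

```shell
# Hypothetical helper: print the hottest symbol from a saved
# `perf report --stdio`-style dump, where each data line ends with the
# symbol name (percent, command, object, symbol).
top_symbol() {
    # The symbol is the last whitespace-separated field of the first
    # non-empty line (perf sorts by overhead, hottest first).
    awk 'NF { print $NF; exit }' "$1"
}
```

Running it against a dump of the 5.1 profile above would print `tcf_chain_tp_find`.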
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: tc tp creation performance degradation since kernel 5.1
  2019-06-12 12:03 tc tp creation performance degradation since kernel 5.1 Jiri Pirko
@ 2019-06-12 12:30 ` Paolo Abeni
  2019-06-13  4:50   ` Jiri Pirko
  2019-06-12 12:34 ` Vlad Buslov
  2019-06-13  8:11 ` Jiri Pirko
  2 siblings, 1 reply; 10+ messages in thread
From: Paolo Abeni @ 2019-06-12 12:30 UTC (permalink / raw)
To: Jiri Pirko, netdev; +Cc: vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk

Hi,

On Wed, 2019-06-12 at 14:03 +0200, Jiri Pirko wrote:
> I did simple profiling using perf. Output on 5.1 kernel:
> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find
> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64
> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc
> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner
> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal
> 0.38% tc_pref_scale.s libc-2.28.so [.] _int_free
> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
> 0.37% tc [kernel.kallsyms] [k] idr_get_free
>
> Output on net-next:
> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type

It looks like you have lockdep enabled here, but not on the 5.1 build.

That would explain such a large perf difference.

Can you please double check?

thanks,

Paolo
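(Paolo's suspicion can be checked directly against the two build configs. A sketch -- the helper name is mine, and the option list and config path are the usual ones, adjust as needed:)

```shell
# Sketch: list heavyweight debug options enabled in a kernel config file.
# Any of these being set on only one of the two builds would skew a
# performance comparison.
debug_opts_enabled() {
    grep -E '^CONFIG_(PROVE_LOCKING|DEBUG_LOCK_ALLOC|KASAN)=y' "$1"
}

# On a running system one would typically point it at the boot config:
#   debug_opts_enabled "/boot/config-$(uname -r)"
```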
* Re: tc tp creation performance degratation since kernel 5.1 2019-06-12 12:30 ` Paolo Abeni @ 2019-06-13 4:50 ` Jiri Pirko 0 siblings, 0 replies; 10+ messages in thread From: Jiri Pirko @ 2019-06-13 4:50 UTC (permalink / raw) To: Paolo Abeni; +Cc: netdev, vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk Wed, Jun 12, 2019 at 02:30:37PM CEST, pabeni@redhat.com wrote: >Hi, > >On Wed, 2019-06-12 at 14:03 +0200, Jiri Pirko wrote: >> I did simple prifiling using perf. Output on 5.1 kernel: >> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find >> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64 >> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc >> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner >> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal >> 0.38% tc_pref_scale.s libc-2.28.so [.] _int_free >> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2 >> 0.37% tc [kernel.kallsyms] [k] idr_get_free >> >> Output on net-next: >> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type > >It looks like you have lockdep enabled here, but not on the 5.1 build. > >That would explain such a large perf difference. > >Can you please double check? Will do. > >thanks, > >Paolo > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: tc tp creation performance degratation since kernel 5.1 2019-06-12 12:03 tc tp creation performance degratation since kernel 5.1 Jiri Pirko 2019-06-12 12:30 ` Paolo Abeni @ 2019-06-12 12:34 ` Vlad Buslov 2019-06-13 5:49 ` Jiri Pirko 2019-06-13 8:11 ` Jiri Pirko 2 siblings, 1 reply; 10+ messages in thread From: Vlad Buslov @ 2019-06-12 12:34 UTC (permalink / raw) To: Jiri Pirko Cc: netdev@vger.kernel.org, Vlad Buslov, pablo@netfilter.org, xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov On Wed 12 Jun 2019 at 15:03, Jiri Pirko <jiri@resnulli.us> wrote: > Hi. > > I came across serious performance degradation when adding many tps. I'm > using following script: > > ------------------------------------------------------------------------ > #!/bin/bash > > dev=testdummy > ip link add name $dev type dummy > ip link set dev $dev up > tc qdisc add dev $dev ingress > > tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") > pref_id=1 > > while [ $pref_id -lt 20000 ] > do > echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name > ((pref_id++)) > done > > start=$(date +"%s") > tc -b $tmp_file_name > stop=$(date +"%s") > echo "Insertion duration: $(($stop - $start)) sec" > rm -f $tmp_file_name > > ip link del dev $dev > ------------------------------------------------------------------------ > > On my testing vm, result on 5.1 kernel is: > Insertion duration: 3 sec > On net-next this is: > Insertion duration: 54 sec > > I did simple prifiling using perf. Output on 5.1 kernel: > 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find > 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore > 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64 > 0.60% tc_pref_scale.s libc-2.28.so [.] malloc > 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner > 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms > 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal > 0.38% tc_pref_scale.s libc-2.28.so [.] 
_int_free
> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
> 0.37% tc [kernel.kallsyms] [k] idr_get_free

Are these results for the same config? Here I don't see any lockdep or
KASAN. However, in the next trace...

>
> Output on net-next:
> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type
> 33.99% tc [kernel.vmlinux] [k] tcf_chain_tp_find
> 12.77% tc [kernel.vmlinux] [k] __asan_load4_noabort
> 1.90% tc [kernel.vmlinux] [k] __asan_load8_noabort
> 1.08% tc [kernel.vmlinux] [k] lock_acquire
> 0.94% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled
> 0.61% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled.part.5
> 0.51% tc [kernel.vmlinux] [k] unwind_next_frame
> 0.50% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
> 0.47% tc_pref_scale.s [kernel.vmlinux] [k] lock_acquire
> 0.47% tc [kernel.vmlinux] [k] lock_release

... both lockdep and KASAN consume most of the CPU time.

BTW it takes 5 sec to execute your script on my system with net-next
(debug options disabled).

>
> I didn't investigate this any further now. I fear that this might be
> related to Vlad's changes in the area. Any ideas?
>
> Thanks!
>
> Jiri
* Re: tc tp creation performance degratation since kernel 5.1 2019-06-12 12:34 ` Vlad Buslov @ 2019-06-13 5:49 ` Jiri Pirko 0 siblings, 0 replies; 10+ messages in thread From: Jiri Pirko @ 2019-06-13 5:49 UTC (permalink / raw) To: Vlad Buslov Cc: netdev@vger.kernel.org, pablo@netfilter.org, xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov Wed, Jun 12, 2019 at 02:34:02PM CEST, vladbu@mellanox.com wrote: > >On Wed 12 Jun 2019 at 15:03, Jiri Pirko <jiri@resnulli.us> wrote: >> Hi. >> >> I came across serious performance degradation when adding many tps. I'm >> using following script: >> >> ------------------------------------------------------------------------ >> #!/bin/bash >> >> dev=testdummy >> ip link add name $dev type dummy >> ip link set dev $dev up >> tc qdisc add dev $dev ingress >> >> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") >> pref_id=1 >> >> while [ $pref_id -lt 20000 ] >> do >> echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name >> ((pref_id++)) >> done >> >> start=$(date +"%s") >> tc -b $tmp_file_name >> stop=$(date +"%s") >> echo "Insertion duration: $(($stop - $start)) sec" >> rm -f $tmp_file_name >> >> ip link del dev $dev >> ------------------------------------------------------------------------ >> >> On my testing vm, result on 5.1 kernel is: >> Insertion duration: 3 sec >> On net-next this is: >> Insertion duration: 54 sec >> >> I did simple prifiling using perf. Output on 5.1 kernel: >> 77.85% tc [kernel.kallsyms] [k] tcf_chain_tp_find >> 3.30% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >> 1.33% tc_pref_scale.s [kernel.kallsyms] [k] do_syscall_64 >> 0.60% tc_pref_scale.s libc-2.28.so [.] malloc >> 0.55% tc [kernel.kallsyms] [k] mutex_spin_on_owner >> 0.51% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >> 0.40% tc_pref_scale.s libc-2.28.so [.] __gconv_transform_utf8_internal >> 0.38% tc_pref_scale.s libc-2.28.so [.] 
_int_free
>> 0.37% tc_pref_scale.s libc-2.28.so [.] __GI___strlen_sse2
>> 0.37% tc [kernel.kallsyms] [k] idr_get_free
>
>Are these results for same config? Here I don't see any lockdep or
>KASAN. However in next trace...
>
>>
>> Output on net-next:
>> 39.26% tc [kernel.vmlinux] [k] lock_is_held_type
>> 33.99% tc [kernel.vmlinux] [k] tcf_chain_tp_find
>> 12.77% tc [kernel.vmlinux] [k] __asan_load4_noabort
>> 1.90% tc [kernel.vmlinux] [k] __asan_load8_noabort
>> 1.08% tc [kernel.vmlinux] [k] lock_acquire
>> 0.94% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled
>> 0.61% tc [kernel.vmlinux] [k] debug_lockdep_rcu_enabled.part.5
>> 0.51% tc [kernel.vmlinux] [k] unwind_next_frame
>> 0.50% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
>> 0.47% tc_pref_scale.s [kernel.vmlinux] [k] lock_acquire
>> 0.47% tc [kernel.vmlinux] [k] lock_release
>
>... both lockdep and kasan consume most of CPU time.
>
>BTW it takes 5 sec to execute your script on my system with net-next
>(debug options disabled).

You are right, my bad. Sorry for the fuzz.

>
>>
>> I didn't investigate this any further now. I fear that this might be
>> related to Vlad's changes in the area. Any ideas?
>>
>> Thanks!
>>
>> Jiri
* Re: tc tp creation performance degradation since kernel 5.1
  2019-06-12 12:03 tc tp creation performance degradation since kernel 5.1 Jiri Pirko
  2019-06-12 12:30 ` Paolo Abeni
  2019-06-12 12:34 ` Vlad Buslov
@ 2019-06-13  8:11 ` Jiri Pirko
  2019-06-13 10:09   ` Vlad Buslov
  2 siblings, 1 reply; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 8:11 UTC (permalink / raw)
To: netdev; +Cc: vladbu, pablo, xiyou.wangcong, jhs, mlxsw, alexanderk, pabeni

I made a mistake during measurements, sorry about that.

This is the correct script:
-----------------------------------------------------------------------
#!/bin/bash

dev=testdummy
ip link add name $dev type dummy
ip link set dev $dev up
tc qdisc add dev $dev ingress

tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
pref_id=1

while [ $pref_id -lt 20000 ]
do
	echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
	#echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
	((pref_id++))
done

start=$(date +"%s")
tc -b $tmp_file_name
stop=$(date +"%s")
echo "Insertion duration: $(($stop - $start)) sec"
rm -f $tmp_file_name

ip link del dev $dev
-----------------------------------------------------------------------

Note the commented-out matchall. I don't see the regression with
matchall. However, I see it with flower:
kernel 5.1
Insertion duration: 4 sec
kernel 5.2
Insertion duration: 163 sec

I don't see any significant difference in perf:
kernel 5.1
 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find
  1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner
  1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
  0.93% tc [kernel.vmlinux] [k] idr_get_free
  0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64
  0.69% tc [kernel.vmlinux] [k] finish_task_switch
  0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
  0.49% tc [kernel.vmlinux] [k] __memset
  0.36% tc_pref_scale_o libc-2.28.so [.] malloc
  0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
  0.24% tc [kernel.vmlinux] [k] __memcpy
  0.23% tc [cls_flower] [k] fl_change
  0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
  0.22% tc [kernel.vmlinux] [k] __slab_alloc

kernel 5.2
 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
  2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
  1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
  0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
  0.86% ip [kernel.kallsyms] [k] finish_task_switch
  0.67% tc [kernel.kallsyms] [k] memset
  0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
  0.52% tc_pref_scale_o libc-2.28.so [.] malloc
  0.48% tc [kernel.kallsyms] [k] idr_get_free
  0.46% tc [kernel.kallsyms] [k] fl_change
  0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
  0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
  0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
  0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted

Any ideas?
* Re: tc tp creation performance degratation since kernel 5.1 2019-06-13 8:11 ` Jiri Pirko @ 2019-06-13 10:09 ` Vlad Buslov 2019-06-13 11:11 ` Jiri Pirko 0 siblings, 1 reply; 10+ messages in thread From: Vlad Buslov @ 2019-06-13 10:09 UTC (permalink / raw) To: Jiri Pirko Cc: netdev@vger.kernel.org, Vlad Buslov, pablo@netfilter.org, xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov, pabeni@redhat.com On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote: > I made a mistake during measurements, sorry about that. > > This is the correct script: > ----------------------------------------------------------------------- > #!/bin/bash > > dev=testdummy > ip link add name $dev type dummy > ip link set dev $dev up > tc qdisc add dev $dev ingress > > tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") > pref_id=1 > > while [ $pref_id -lt 20000 ] > do > echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name > #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name > ((pref_id++)) > done > > start=$(date +"%s") > tc -b $tmp_file_name > stop=$(date +"%s") > echo "Insertion duration: $(($stop - $start)) sec" > rm -f $tmp_file_name > > ip link del dev $dev > ----------------------------------------------------------------------- > > Note the commented out matchall. I don't see the regression with > matchall. However, I see that with flower: > kernel 5.1 > Insertion duration: 4 sec > kernel 5.2 > Insertion duration: 163 sec > > I don't see any significant difference in perf: > kernel 5.1 > 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find > 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner > 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore > 0.93% tc [kernel.vmlinux] [k] idr_get_free > 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64 > 0.69% tc [kernel.vmlinux] [k] finish_task_switch > 0.53% tc libc-2.28.so [.] 
__memset_sse2_unaligned_erms
> 0.49% tc [kernel.vmlinux] [k] __memset
> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc
> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free
> 0.24% tc [kernel.vmlinux] [k] __memcpy
> 0.23% tc [cls_flower] [k] fl_change
> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse
> 0.22% tc [kernel.vmlinux] [k] __slab_alloc
>
>
> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find
> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64
> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms
> 0.86% ip [kernel.kallsyms] [k] finish_task_switch
> 0.67% tc [kernel.kallsyms] [k] memset
> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner
> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc
> 0.48% tc [kernel.kallsyms] [k] idr_get_free
> 0.46% tc [kernel.kallsyms] [k] fl_change
> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free
> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2
> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc
> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted
>
> Any ideas?

Thanks for providing the reproduction script!

I've investigated the problem and found the root cause. First of all, I
noticed that CPU utilization during the problematic tc run is quite low
(<10%), so I decided to investigate why tc sleeps so much. I've used bcc
and obtained the following off-CPU trace (uninteresting traces are
omitted for brevity):

~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
...
    finish_task_switch
    __sched_text_start
    schedule
    schedule_timeout
    wait_for_completion
    __wait_rcu_gp
    synchronize_rcu
    fl_change
    tc_new_tfilter
    rtnetlink_rcv_msg
    netlink_rcv_skb
    netlink_unicast
    netlink_sendmsg
    sock_sendmsg
    ___sys_sendmsg
    __sys_sendmsg
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    -                tc (2069)
        142284953

As you can see, 142 seconds are spent sleeping in synchronize_rcu(). The
code is in the fl_create_new_mask() function:

	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
				      &newmask->ht_node, mask_ht_params);
	if (err)
		goto errout_destroy;

	/* Wait until any potential concurrent users of mask are finished */
	synchronize_rcu();

The justification for this is described in a comment in
fl_check_assign_mask() (the user of fl_create_new_mask()):

	/* Insert mask as temporary node to prevent concurrent creation of mask
	 * with same key. Any concurrent lookups with same key will return
	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
	 * stack-allocated 'mask' to masks hash table because we call
	 * synchronize_rcu() before returning from this function (either in case
	 * of error or after replacing it with heap-allocated mask in
	 * fl_create_new_mask()).
	 */
	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
						       &mask->ht_node,
						       mask_ht_params);

The offending commit is part of my series that implements unlocked
flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
insertion")

The justification presented in it is no longer relevant since Ivan
Vecera changed the mask to be dynamically allocated in commit
2cddd2014782 ("net/sched: cls_flower: allocate mask dynamically in
fl_change()"). With this, we can just change fl_change() to deallocate
the temporary mask with an rcu grace period and remove the offending
synchronize_rcu() call.

Any other suggestions?
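(A back-of-the-envelope check of the numbers in the trace -- the per-insertion estimate below is my own arithmetic, not from the mail:)

```shell
# ~142 s of off-CPU time spread over the 19999 insertions done by the
# reproduction script works out to roughly 7 ms of synchronize_rcu()
# sleep per filter, consistent with one RCU grace period per insertion.
total_us=142284953   # off-CPU time reported by offcputime, in microseconds
filters=19999        # filters inserted by the reproduction script
echo "avg sleep per insertion: $(( total_us / filters )) us"
# prints "avg sleep per insertion: 7114 us"
```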
* Re: tc tp creation performance degratation since kernel 5.1 2019-06-13 10:09 ` Vlad Buslov @ 2019-06-13 11:11 ` Jiri Pirko 2019-06-13 11:26 ` Vlad Buslov 0 siblings, 1 reply; 10+ messages in thread From: Jiri Pirko @ 2019-06-13 11:11 UTC (permalink / raw) To: Vlad Buslov Cc: netdev@vger.kernel.org, pablo@netfilter.org, xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov, pabeni@redhat.com Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote: >On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote: >> I made a mistake during measurements, sorry about that. >> >> This is the correct script: >> ----------------------------------------------------------------------- >> #!/bin/bash >> >> dev=testdummy >> ip link add name $dev type dummy >> ip link set dev $dev up >> tc qdisc add dev $dev ingress >> >> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") >> pref_id=1 >> >> while [ $pref_id -lt 20000 ] >> do >> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name >> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name >> ((pref_id++)) >> done >> >> start=$(date +"%s") >> tc -b $tmp_file_name >> stop=$(date +"%s") >> echo "Insertion duration: $(($stop - $start)) sec" >> rm -f $tmp_file_name >> >> ip link del dev $dev >> ----------------------------------------------------------------------- >> >> Note the commented out matchall. I don't see the regression with >> matchall. 
However, I see that with flower: >> kernel 5.1 >> Insertion duration: 4 sec >> kernel 5.2 >> Insertion duration: 163 sec >> >> I don't see any significant difference in perf: >> kernel 5.1 >> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find >> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner >> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore >> 0.93% tc [kernel.vmlinux] [k] idr_get_free >> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64 >> 0.69% tc [kernel.vmlinux] [k] finish_task_switch >> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >> 0.49% tc [kernel.vmlinux] [k] __memset >> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc >> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free >> 0.24% tc [kernel.vmlinux] [k] __memcpy >> 0.23% tc [cls_flower] [k] fl_change >> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse >> 0.22% tc [kernel.vmlinux] [k] __slab_alloc >> >> >> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find >> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64 >> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >> 0.86% ip [kernel.kallsyms] [k] finish_task_switch >> 0.67% tc [kernel.kallsyms] [k] memset >> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner >> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc >> 0.48% tc [kernel.kallsyms] [k] idr_get_free >> 0.46% tc [kernel.kallsyms] [k] fl_change >> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free >> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2 >> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc >> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted >> >> Any ideas? > >Thanks for providing reproduction script! > >I've investigate the problem and found the root cause. First of all I >noticed that CPU utilization during problematic tc run is quite low >(<10%), so I decided to investigate why tc sleeps so much. 
I've used bcc >and obtained following off-CPU trace (uninteresting traces are omitted >for brevity): > >~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc` >Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end. >... > finish_task_switch > __sched_text_start > schedule > schedule_timeout > wait_for_completion > __wait_rcu_gp > synchronize_rcu > fl_change > tc_new_tfilter > rtnetlink_rcv_msg > netlink_rcv_skb > netlink_unicast > netlink_sendmsg > sock_sendmsg > ___sys_sendmsg > __sys_sendmsg > do_syscall_64 > entry_SYSCALL_64_after_hwframe > - tc (2069) > 142284953 > >As you can see 142 seconds are spent sleeping in synchronize_rcu(). The >code is in fl_create_new_mask() function: > > err = rhashtable_replace_fast(&head->ht, &mask->ht_node, > &newmask->ht_node, mask_ht_params); > if (err) > goto errout_destroy; > > /* Wait until any potential concurrent users of mask are finished */ > synchronize_rcu(); > >The justification for this is described in comment in >fl_check_assign_mask() (user of fl_create_new_mask()): > > /* Insert mask as temporary node to prevent concurrent creation of mask > * with same key. Any concurrent lookups with same key will return > * -EAGAIN because mask's refcnt is zero. It is safe to insert > * stack-allocated 'mask' to masks hash table because we call > * synchronize_rcu() before returning from this function (either in case > * of error or after replacing it with heap-allocated mask in > * fl_create_new_mask()). > */ > fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht, > &mask->ht_node, > mask_ht_params); > >The offending commit is part of my series that implements unlocked >flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask >insertion") > >The justification presented in it is no longer relevant since Ivan >Vecera changed mask to be dynamically allocated in commit 2cddd2014782 >("net/sched: cls_flower: allocate mask dynamically in fl_change()"). 
>With this we can just change fl_change() to deallocate temporary mask
>with rcu grace period and remove offending synchronize_rcu() call.
>
>Any other suggestions?

So basically you just change synchronize_rcu() to kfree_rcu(mask),
correct?
* Re: tc tp creation performance degratation since kernel 5.1 2019-06-13 11:11 ` Jiri Pirko @ 2019-06-13 11:26 ` Vlad Buslov 2019-06-13 14:18 ` Jiri Pirko 0 siblings, 1 reply; 10+ messages in thread From: Vlad Buslov @ 2019-06-13 11:26 UTC (permalink / raw) To: Jiri Pirko Cc: Vlad Buslov, netdev@vger.kernel.org, pablo@netfilter.org, xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw, Alex Kushnarov, pabeni@redhat.com On Thu 13 Jun 2019 at 14:11, Jiri Pirko <jiri@resnulli.us> wrote: > Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote: >>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote: >>> I made a mistake during measurements, sorry about that. >>> >>> This is the correct script: >>> ----------------------------------------------------------------------- >>> #!/bin/bash >>> >>> dev=testdummy >>> ip link add name $dev type dummy >>> ip link set dev $dev up >>> tc qdisc add dev $dev ingress >>> >>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") >>> pref_id=1 >>> >>> while [ $pref_id -lt 20000 ] >>> do >>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name >>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name >>> ((pref_id++)) >>> done >>> >>> start=$(date +"%s") >>> tc -b $tmp_file_name >>> stop=$(date +"%s") >>> echo "Insertion duration: $(($stop - $start)) sec" >>> rm -f $tmp_file_name >>> >>> ip link del dev $dev >>> ----------------------------------------------------------------------- >>> >>> Note the commented out matchall. I don't see the regression with >>> matchall. 
However, I see that with flower: >>> kernel 5.1 >>> Insertion duration: 4 sec >>> kernel 5.2 >>> Insertion duration: 163 sec >>> >>> I don't see any significant difference in perf: >>> kernel 5.1 >>> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find >>> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner >>> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore >>> 0.93% tc [kernel.vmlinux] [k] idr_get_free >>> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64 >>> 0.69% tc [kernel.vmlinux] [k] finish_task_switch >>> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >>> 0.49% tc [kernel.vmlinux] [k] __memset >>> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc >>> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free >>> 0.24% tc [kernel.vmlinux] [k] __memcpy >>> 0.23% tc [cls_flower] [k] fl_change >>> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse >>> 0.22% tc [kernel.vmlinux] [k] __slab_alloc >>> >>> >>> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find >>> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >>> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64 >>> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >>> 0.86% ip [kernel.kallsyms] [k] finish_task_switch >>> 0.67% tc [kernel.kallsyms] [k] memset >>> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner >>> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc >>> 0.48% tc [kernel.kallsyms] [k] idr_get_free >>> 0.46% tc [kernel.kallsyms] [k] fl_change >>> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free >>> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2 >>> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc >>> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted >>> >>> Any ideas? >> >>Thanks for providing reproduction script! >> >>I've investigate the problem and found the root cause. First of all I >>noticed that CPU utilization during problematic tc run is quite low >>(<10%), so I decided to investigate why tc sleeps so much. 
I've used bcc >>and obtained following off-CPU trace (uninteresting traces are omitted >>for brevity): >> >>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc` >>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end. >>... >> finish_task_switch >> __sched_text_start >> schedule >> schedule_timeout >> wait_for_completion >> __wait_rcu_gp >> synchronize_rcu >> fl_change >> tc_new_tfilter >> rtnetlink_rcv_msg >> netlink_rcv_skb >> netlink_unicast >> netlink_sendmsg >> sock_sendmsg >> ___sys_sendmsg >> __sys_sendmsg >> do_syscall_64 >> entry_SYSCALL_64_after_hwframe >> - tc (2069) >> 142284953 >> >>As you can see 142 seconds are spent sleeping in synchronize_rcu(). The >>code is in fl_create_new_mask() function: >> >> err = rhashtable_replace_fast(&head->ht, &mask->ht_node, >> &newmask->ht_node, mask_ht_params); >> if (err) >> goto errout_destroy; >> >> /* Wait until any potential concurrent users of mask are finished */ >> synchronize_rcu(); >> >>The justification for this is described in comment in >>fl_check_assign_mask() (user of fl_create_new_mask()): >> >> /* Insert mask as temporary node to prevent concurrent creation of mask >> * with same key. Any concurrent lookups with same key will return >> * -EAGAIN because mask's refcnt is zero. It is safe to insert >> * stack-allocated 'mask' to masks hash table because we call >> * synchronize_rcu() before returning from this function (either in case >> * of error or after replacing it with heap-allocated mask in >> * fl_create_new_mask()). 
>> */
>> fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>>                                                &mask->ht_node,
>>                                                mask_ht_params);
>>
>>The offending commit is part of my series that implements unlocked
>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>insertion")
>>
>>The justification presented in it is no longer relevant since Ivan
>>Vecera changed mask to be dynamically allocated in commit 2cddd2014782
>>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>>With this we can just change fl_change() to deallocate temporary mask
>>with rcu grace period and remove offending synchronize_rcu() call.
>>
>>Any other suggestions?
>
> So basically you just change synchronize_rcu() to kfree_rcu(mask),
> correct?

Not really. I remove synchronize_rcu() and change all kfree(mask) calls
in fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work), which
uses queue_rcu_work() internally. This would allow us to deallocate
fl_flow_mask in the same manner on all code paths and doesn't require
any extensions to the fl_flow_mask struct (kfree_rcu would require
extending it with an rcu_head).
* Re: tc tp creation performance degratation since kernel 5.1
  2019-06-13 11:26 ` Vlad Buslov
@ 2019-06-13 14:18   ` Jiri Pirko
  0 siblings, 0 replies; 10+ messages in thread
From: Jiri Pirko @ 2019-06-13 14:18 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: netdev@vger.kernel.org, pablo@netfilter.org,
	xiyou.wangcong@gmail.com, jhs@mojatatu.com, mlxsw,
	Alex Kushnarov, pabeni@redhat.com

Thu, Jun 13, 2019 at 01:26:17PM CEST, vladbu@mellanox.com wrote:
>
>On Thu 13 Jun 2019 at 14:11, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote:
>>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> I made a mistake during measurements, sorry about that.
>>>>
>>>> This is the correct script:
>>>> -----------------------------------------------------------------------
>>>> #!/bin/bash
>>>>
>>>> dev=testdummy
>>>> ip link add name $dev type dummy
>>>> ip link set dev $dev up
>>>> tc qdisc add dev $dev ingress
>>>>
>>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>>>> pref_id=1
>>>>
>>>> while [ $pref_id -lt 20000 ]
>>>> do
>>>>         echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>>>>         #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>>>>         ((pref_id++))
>>>> done
>>>>
>>>> start=$(date +"%s")
>>>> tc -b $tmp_file_name
>>>> stop=$(date +"%s")
>>>> echo "Insertion duration: $(($stop - $start)) sec"
>>>> rm -f $tmp_file_name
>>>>
>>>> ip link del dev $dev
>>>> -----------------------------------------------------------------------
>>>>
>>>> Note the commented out matchall. I don't see the regression with
>>>> matchall. However, I see that with flower:
>>>> kernel 5.1
>>>> Insertion duration: 4 sec
>>>> kernel 5.2
>>>> Insertion duration: 163 sec
>>>>
>>>> I don't see any significant difference in perf:
>>>> kernel 5.1
>>>>  77.24%  tc               [kernel.vmlinux]   [k] tcf_chain_tp_find
>>>>   1.67%  tc               [kernel.vmlinux]   [k] mutex_spin_on_owner
>>>>   1.44%  tc               [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
>>>>   0.93%  tc               [kernel.vmlinux]   [k] idr_get_free
>>>>   0.79%  tc_pref_scale_o  [kernel.vmlinux]   [k] do_syscall_64
>>>>   0.69%  tc               [kernel.vmlinux]   [k] finish_task_switch
>>>>   0.53%  tc               libc-2.28.so       [.] __memset_sse2_unaligned_erms
>>>>   0.49%  tc               [kernel.vmlinux]   [k] __memset
>>>>   0.36%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>>>   0.30%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>>>   0.24%  tc               [kernel.vmlinux]   [k] __memcpy
>>>>   0.23%  tc               [cls_flower]       [k] fl_change
>>>>   0.23%  tc               [kernel.vmlinux]   [k] __nla_validate_parse
>>>>   0.22%  tc               [kernel.vmlinux]   [k] __slab_alloc
>>>>
>>>>
>>>>  75.57%  tc               [kernel.kallsyms]  [k] tcf_chain_tp_find
>>>>   2.70%  tc               [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>>>   1.13%  tc_pref_scale_o  [kernel.kallsyms]  [k] do_syscall_64
>>>>   0.87%  tc               libc-2.28.so       [.] __memset_sse2_unaligned_erms
>>>>   0.86%  ip               [kernel.kallsyms]  [k] finish_task_switch
>>>>   0.67%  tc               [kernel.kallsyms]  [k] memset
>>>>   0.63%  tc               [kernel.kallsyms]  [k] mutex_spin_on_owner
>>>>   0.52%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>>>   0.48%  tc               [kernel.kallsyms]  [k] idr_get_free
>>>>   0.46%  tc               [kernel.kallsyms]  [k] fl_change
>>>>   0.42%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>>>   0.35%  tc_pref_scale_o  libc-2.28.so       [.] __GI___strlen_sse2
>>>>   0.35%  tc_pref_scale_o  libc-2.28.so       [.] __mbrtowc
>>>>   0.34%  tc_pref_scale_o  libc-2.28.so       [.] __fcntl64_nocancel_adjusted
>>>>
>>>> Any ideas?
>>>
>>>Thanks for providing reproduction script!
>>>
>>>I've investigated the problem and found the root cause.
First of all I
>>>noticed that CPU utilization during problematic tc run is quite low
>>>(<10%), so I decided to investigate why tc sleeps so much. I've used bcc
>>>and obtained following off-CPU trace (uninteresting traces are omitted
>>>for brevity):
>>>
>>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>>>...
>>>    finish_task_switch
>>>    __sched_text_start
>>>    schedule
>>>    schedule_timeout
>>>    wait_for_completion
>>>    __wait_rcu_gp
>>>    synchronize_rcu
>>>    fl_change
>>>    tc_new_tfilter
>>>    rtnetlink_rcv_msg
>>>    netlink_rcv_skb
>>>    netlink_unicast
>>>    netlink_sendmsg
>>>    sock_sendmsg
>>>    ___sys_sendmsg
>>>    __sys_sendmsg
>>>    do_syscall_64
>>>    entry_SYSCALL_64_after_hwframe
>>>    -                tc (2069)
>>>        142284953
>>>
>>>As you can see 142 seconds are spent sleeping in synchronize_rcu(). The
>>>code is in fl_create_new_mask() function:
>>>
>>>	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>>>				      &newmask->ht_node, mask_ht_params);
>>>	if (err)
>>>		goto errout_destroy;
>>>
>>>	/* Wait until any potential concurrent users of mask are finished */
>>>	synchronize_rcu();
>>>
>>>The justification for this is described in comment in
>>>fl_check_assign_mask() (user of fl_create_new_mask()):
>>>
>>>	/* Insert mask as temporary node to prevent concurrent creation of mask
>>>	 * with same key. Any concurrent lookups with same key will return
>>>	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
>>>	 * stack-allocated 'mask' to masks hash table because we call
>>>	 * synchronize_rcu() before returning from this function (either in case
>>>	 * of error or after replacing it with heap-allocated mask in
>>>	 * fl_create_new_mask()).
>>>	 */
>>>	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>>>						       &mask->ht_node,
>>>						       mask_ht_params);
>>>
>>>The offending commit is part of my series that implements unlocked
>>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>>insertion")
>>>
>>>The justification presented in it is no longer relevant since Ivan
>>>Vecera changed mask to be dynamically allocated in commit 2cddd2014782
>>>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>>>With this we can just change fl_change() to deallocate temporary mask
>>>with rcu grace period and remove offending synchronize_rcu() call.
>>>
>>>Any other suggestions?
>>
>> So basically you just change synchronize_rcu() to kfree_rcu(mask),
>> correct?
>
>Not really. I remove synchronize_rcu() and change all kfree(mask) in
>fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work) which
>uses queue_rcu_work() internally. This would allow us to deallocate
>fl_flow_mask in the same manner on all code paths and doesn't require any
>extensions in fl_flow_mask struct (kfree_rcu would require extending it
>with rcu_head).

Got it. Makes sense to me. Thanks!

^ permalink raw reply	[flat|nested] 10+ messages in thread
end of thread, other threads:[~2019-06-13 16:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-06-12 12:03 tc tp creation performance degratation since kernel 5.1 Jiri Pirko
2019-06-12 12:30 ` Paolo Abeni
2019-06-13  4:50   ` Jiri Pirko
2019-06-12 12:34 ` Vlad Buslov
2019-06-13  5:49   ` Jiri Pirko
2019-06-13  8:11     ` Jiri Pirko
2019-06-13 10:09       ` Vlad Buslov
2019-06-13 11:11         ` Jiri Pirko
2019-06-13 11:26           ` Vlad Buslov
2019-06-13 14:18             ` Jiri Pirko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).