From mboxrd@z Thu Jan 1 00:00:00 1970
From: Vlad Buslov
Subject: Re: [PATCH net-next v2 01/17] net: sched: refactor mini_qdisc_pair_swap() to use workqueue
Date: Mon, 14 Jan 2019 19:00:48 +0000
Message-ID: 
References: <1544523120-4566-1-git-send-email-vladbu@mellanox.com>
 <1544523120-4566-2-git-send-email-vladbu@mellanox.com>
 <20181217102248.GB2096@nanopsycho>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Cc: Jiri Pirko , Linux Kernel Network Developers ,
 Jamal Hadi Salim , David Miller ,
 Alexei Starovoitov , Daniel Borkmann
To: Cong Wang , Jiri Pirko
Return-path: 
Received: from mail-eopbgr130053.outbound.protection.outlook.com
 ([40.107.13.53]:18496 "EHLO EUR01-HE1-obe.outbound.protection.outlook.com"
 rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
 id S1726643AbfANTAx (ORCPT ); Mon, 14 Jan 2019 14:00:53 -0500
In-Reply-To: 
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID: 

On Thu 20 Dec 2018 at 13:55, Vlad Buslov wrote:
> On Wed 19 Dec 2018 at 04:27, Cong Wang wrote:
>> On Mon, Dec 17, 2018 at 2:30 AM Jiri Pirko wrote:
>>>
>>> Sun, Dec 16, 2018 at 07:52:18PM CET, xiyou.wangcong@gmail.com wrote:
>>> >On Sun, Dec 16, 2018 at 8:32 AM Vlad Buslov wrote:
>>> >>
>>> >> On Thu 13 Dec 2018 at 23:32, Cong Wang wrote:
>>> >> > On Tue, Dec 11, 2018 at 2:19 AM Vlad Buslov wrote:
>>> >> >>
>>> >> >> As part of the effort to remove the dependency on rtnl lock, cls API is
>>> >> >> being converted to use fine-grained locking mechanisms instead of the
>>> >> >> global rtnl lock. However, the chain_head_change callback for ingress
>>> >> >> Qdisc is a sleeping function and cannot be executed while holding a
>>> >> >> spinlock.
>>> >> >
>>> >> >
>>> >> > Why does it have to be a spinlock, not a mutex?
>>> >> >
>>> >> > I've read your cover letter and this changelog, and I don't find any
>>> >> > answer.
>>> >>
>>> >> My initial implementation used a mutex. However, it was changed to a
>>> >> spinlock at Jiri's request during internal review.
>>> >>
>>> >
>>> >So what's the answer to my question? :)
>>>
>>> Yeah, my concern against mutexes was that one would be needed per
>>> block and per chain. I find that quite heavy and I believe it is
>>> better to use spinlocks in those cases. This patch is a side effect
>>> of that. Do you think it would be better to have mutexes instead of
>>> spinlocks?
>>
>> My only concern with a spinlock is that we have to go async, as we
>> can't block. This is almost always error-prone, especially
>> when dealing with tcf block. I had to give up on a spinlock
>> for idrinfo->lock, please take a look at:
>>
>> commit 95278ddaa15cfa23e4a06ee9ed7b6ee0197c500b
>> Author: Cong Wang
>> Date: Tue Oct 2 12:50:19 2018 -0700
>>
>>     net_sched: convert idrinfo->lock from spinlock to a mutex
>>
>>
>> There are indeed some cases in the kernel where we take multiple
>> mutexes, for example:
>>
>> /*
>>  * bd_mutex locking:
>>  *
>>  *  mutex_lock(part->bd_mutex)
>>  *    mutex_lock_nested(whole->bd_mutex, 1)
>>  */
>>
>> So, how heavy are they compared with spinlocks?
>>
>> Thanks.
>
> Hi Cong,
>
> I did a quick-and-dirty mutex-based implementation of my cls API patchset
> (removed the async miniqp refactoring, changed the block and chain lock
> types to mutex) and performed some benchmarks to determine the exact mutex
> impact on cls API performance (both rule and chains API).
>
> 1. Parallel flower insertion rate (create 5 million flower filters on the
> same chain/prio with 10 tc instances) - same performance:
>
> SPINLOCK
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 0m12.183s
> user 0m35.013s
> sys 1m24.947s
>
> MUTEX
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 0m12.246s
> user 0m35.417s
> sys 1m25.330s
>
> 2. Parallel flower insertion rate (create 100k flower filters, each on a
> new instance of chain/tp, 10 tc instances) - mutex implementation is
> ~17% slower:
>
> SPINLOCK:
> $ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 2m19.285s
> user 0m1.863s
> sys 5m41.157s
>
> MUTEX:
> multichain$ time ls add_sw* | xargs -n 1 -P 100 sudo tc -force -b
>
> real 2m43.495s
> user 0m1.664s
> sys 8m32.483s
>
> 3. Chains insertion rate with cls chain API (create 100k empty chains, 1
> tc instance) - ~3% difference:
>
> SPINLOCK:
> $ time sudo tc -b add_chains
>
> real 0m46.822s
> user 0m0.239s
> sys 0m46.437s
>
> MUTEX:
> $ time sudo tc -b add_chains
>
> real 0m48.600s
> user 0m0.230s
> sys 0m48.227s
>
>
> The only case where performance is significantly impacted by the mutex is
> the second test. This happens because chain lookup is a linear search that
> is performed while holding the highly contended block lock. The perf
> profile for the mutex version confirms this:
>
> + 91.83% 0.00% tc [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
> + 91.83% 0.00% tc [kernel.vmlinux] [k] do_syscall_64
> + 91.80% 0.00% tc libc-2.25.so     [.] __libc_sendmsg
> + 91.78% 0.00% tc [kernel.vmlinux] [k] __sys_sendmsg
> + 91.77% 0.00% tc [kernel.vmlinux] [k] ___sys_sendmsg
> + 91.77% 0.00% tc [kernel.vmlinux] [k] sock_sendmsg
> + 91.77% 0.00% tc [kernel.vmlinux] [k] netlink_sendmsg
> + 91.77% 0.01% tc [kernel.vmlinux] [k] netlink_unicast
> + 91.77% 0.00% tc [kernel.vmlinux] [k] netlink_rcv_skb
> + 91.71% 0.01% tc [kernel.vmlinux] [k] rtnetlink_rcv_msg
> + 91.69% 0.03% tc [kernel.vmlinux] [k] tc_new_tfilter
> + 90.96% 26.45% tc [kernel.vmlinux] [k] __tcf_chain_get
> + 64.30% 0.01% tc [kernel.vmlinux] [k] __mutex_lock.isra.7
> + 39.36% 39.18% tc [kernel.vmlinux] [k] osq_lock
> + 24.92% 24.81% tc [kernel.vmlinux] [k] mutex_spin_on_owner
>
> Based on these results we can conclude that use cases with a significant
> number of chains on a single block will be affected by using a mutex in
> cls API.
>
> Regards,
> Vlad

Hi Cong,

Any comments regarding the benchmark results? Are you okay with the
spinlock-based implementation?

Thanks,
Vlad