From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 71905C31E45 for ; Thu, 13 Jun 2019 15:04:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3F22A21743 for ; Thu, 13 Jun 2019 15:04:56 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=resnulli-us.20150623.gappssmtp.com header.i=@resnulli-us.20150623.gappssmtp.com header.b="nR8UmPPo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733059AbfFMPEz (ORCPT ); Thu, 13 Jun 2019 11:04:55 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:36481 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732466AbfFMOSZ (ORCPT ); Thu, 13 Jun 2019 10:18:25 -0400 Received: by mail-wm1-f66.google.com with SMTP id u8so10331698wmm.1 for ; Thu, 13 Jun 2019 07:18:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=resnulli-us.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=hnqPodsnXRhtyXTS5/jOUXyEiqHytL9QSOkl9ggAjmY=; b=nR8UmPPoUkZZ6g1DZ61ugUFJtTPKsgWnARYsoI1J4oQksoFoqxbXhxg/ZMlIwxkxRm /QMPAwZYoHKmycwG4BANl7BnwjI9A8LAAI2PNDy8NsTJ/lgmNo6zY7v3110Foayuue2d ZA3anvfR96GxXA0wbm1boKsKm6NpdircJloDzOsMCHAprqJJJShMUKNCeSDs696cSrLk ZyKBMl+W75zVkjlu9YznHdbvDtQHUbomKuG8OtmZnK2MBaBWOYBttDo4qHaFh7qK8lEJ DnF6VxsHlOybEfb+C+8v1TBARqMglNZw/CfNVvrGIRx4kb/piR67OFOAktzM+kVvNzaG lEEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=hnqPodsnXRhtyXTS5/jOUXyEiqHytL9QSOkl9ggAjmY=; b=b++j0uAR4iC1dIz4qKsHWzJElk+Hsbivxnichg2pKtTVFbwdhTW4/aTvqQoOF0gaTr Stn9m/QN+sqti32Osr9qb5DdB1LrMExgXpPqI2rVjQxhalkcgFpTYHgjKrSS2FMCApx2 8e4/BieU3wNnDMLcHgR2fWRe74vVnkQYUKs9/2XhiW+xTs8T1rK++LNMWM7STuz1rF5h ueDJ3rCNCNRJ0pCZwRIyO/tS8p87fylw+Z2CDeVtuPfXVsZPflCogKqcJsWziYCMw1iX yyxx+AYsxy+CZlg7pJClLNlzvZpahb5PhByleWOlNDxhpJfrDOZduW5l/hq40Z3h3WKh Lzaw== X-Gm-Message-State: APjAAAWlP3nNKCnMZWUaVLwjf3ROkAZKLwNG06X5cG8cEoZS3iAsGG/8 0LxdOhlJiHEgJLJEeFom+ND4/w== X-Google-Smtp-Source: APXvYqy8h7XKowpxYKZNFQ5VYa23RZr3ogY+2ZlUiYJKSBZ8HB8IO6ZxcxPS6glWJiOF0ob+wCI8lg== X-Received: by 2002:a1c:5f09:: with SMTP id t9mr4263284wmb.112.1560435502883; Thu, 13 Jun 2019 07:18:22 -0700 (PDT) Received: from localhost (ip-78-45-163-56.net.upcbroadband.cz. [78.45.163.56]) by smtp.gmail.com with ESMTPSA id w14sm3726194wrt.59.2019.06.13.07.18.22 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Thu, 13 Jun 2019 07:18:22 -0700 (PDT) Date: Thu, 13 Jun 2019 16:18:21 +0200 From: Jiri Pirko To: Vlad Buslov Cc: "netdev@vger.kernel.org" , "pablo@netfilter.org" , "xiyou.wangcong@gmail.com" , "jhs@mojatatu.com" , mlxsw , Alex Kushnarov , "pabeni@redhat.com" Subject: Re: tc tp creation performance degratation since kernel 5.1 Message-ID: <20190613141821.GA2242@nanopsycho> References: <20190612120341.GA2207@nanopsycho> <20190613081152.GC2254@nanopsycho.orion> <20190613111135.GA2201@nanopsycho.orion> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.11.3 (2019-02-01) Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Thu, Jun 13, 2019 at 01:26:17PM CEST, vladbu@mellanox.com wrote: > >On Thu 13 Jun 2019 at 14:11, Jiri Pirko wrote: >> Thu, Jun 13, 2019 at 12:09:32PM CEST, vladbu@mellanox.com wrote: >>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko wrote: >>>> I made a mistake during measurements, sorry about that. >>>> >>>> This is the correct script: >>>> ----------------------------------------------------------------------- >>>> #!/bin/bash >>>> >>>> dev=testdummy >>>> ip link add name $dev type dummy >>>> ip link set dev $dev up >>>> tc qdisc add dev $dev ingress >>>> >>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") >>>> pref_id=1 >>>> >>>> while [ $pref_id -lt 20000 ] >>>> do >>>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name >>>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name >>>> ((pref_id++)) >>>> done >>>> >>>> start=$(date +"%s") >>>> tc -b $tmp_file_name >>>> stop=$(date +"%s") >>>> echo "Insertion duration: $(($stop - $start)) sec" >>>> rm -f $tmp_file_name >>>> >>>> ip link del dev $dev >>>> ----------------------------------------------------------------------- >>>> >>>> Note the commented out matchall. I don't see the regression with >>>> matchall. However, I see that with flower: >>>> kernel 5.1 >>>> Insertion duration: 4 sec >>>> kernel 5.2 >>>> Insertion duration: 163 sec >>>> >>>> I don't see any significant difference in perf: >>>> kernel 5.1 >>>> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find >>>> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner >>>> 1.44% tc [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore >>>> 0.93% tc [kernel.vmlinux] [k] idr_get_free >>>> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64 >>>> 0.69% tc [kernel.vmlinux] [k] finish_task_switch >>>> 0.53% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >>>> 0.49% tc [kernel.vmlinux] [k] __memset >>>> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc >>>> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free >>>> 0.24% tc [kernel.vmlinux] [k] __memcpy >>>> 0.23% tc [cls_flower] [k] fl_change >>>> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse >>>> 0.22% tc [kernel.vmlinux] [k] __slab_alloc >>>> >>>> >>>> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find >>>> 2.70% tc [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore >>>> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64 >>>> 0.87% tc libc-2.28.so [.] __memset_sse2_unaligned_erms >>>> 0.86% ip [kernel.kallsyms] [k] finish_task_switch >>>> 0.67% tc [kernel.kallsyms] [k] memset >>>> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner >>>> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc >>>> 0.48% tc [kernel.kallsyms] [k] idr_get_free >>>> 0.46% tc [kernel.kallsyms] [k] fl_change >>>> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free >>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2 >>>> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc >>>> 0.34% tc_pref_scale_o libc-2.28.so [.] __fcntl64_nocancel_adjusted >>>> >>>> Any ideas? >>> >>>Thanks for providing reproduction script! >>> >>>I've investigate the problem and found the root cause. First of all I >>>noticed that CPU utilization during problematic tc run is quite low >>>(<10%), so I decided to investigate why tc sleeps so much. I've used bcc >>>and obtained following off-CPU trace (uninteresting traces are omitted >>>for brevity): >>> >>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc` >>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end. >>>... >>> finish_task_switch >>> __sched_text_start >>> schedule >>> schedule_timeout >>> wait_for_completion >>> __wait_rcu_gp >>> synchronize_rcu >>> fl_change >>> tc_new_tfilter >>> rtnetlink_rcv_msg >>> netlink_rcv_skb >>> netlink_unicast >>> netlink_sendmsg >>> sock_sendmsg >>> ___sys_sendmsg >>> __sys_sendmsg >>> do_syscall_64 >>> entry_SYSCALL_64_after_hwframe >>> - tc (2069) >>> 142284953 >>> >>>As you can see 142 seconds are spent sleeping in synchronize_rcu(). The >>>code is in fl_create_new_mask() function: >>> >>> err = rhashtable_replace_fast(&head->ht, &mask->ht_node, >>> &newmask->ht_node, mask_ht_params); >>> if (err) >>> goto errout_destroy; >>> >>> /* Wait until any potential concurrent users of mask are finished */ >>> synchronize_rcu(); >>> >>>The justification for this is described in comment in >>>fl_check_assign_mask() (user of fl_create_new_mask()): >>> >>> /* Insert mask as temporary node to prevent concurrent creation of mask >>> * with same key. Any concurrent lookups with same key will return >>> * -EAGAIN because mask's refcnt is zero. It is safe to insert >>> * stack-allocated 'mask' to masks hash table because we call >>> * synchronize_rcu() before returning from this function (either in case >>> * of error or after replacing it with heap-allocated mask in >>> * fl_create_new_mask()). >>> */ >>> fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht, >>> &mask->ht_node, >>> mask_ht_params); >>> >>>The offending commit is part of my series that implements unlocked >>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask >>>insertion") >>> >>>The justification presented in it is no longer relevant since Ivan >>>Vecera changed mask to be dynamically allocated in commit 2cddd2014782 >>>("net/sched: cls_flower: allocate mask dynamically in fl_change()"). >>>With this we can just change fl_change() to deallocate temporary mask >>>with rcu grace period and remove offending syncrhonize_rcu() call. >>> >>>Any other suggestions? >> >> So basically you just change synchronize_rcu() to kfree_rcu(mask), >> correct? > >Not really. I remove synchronize_rcu() and change all kfree(mask) in >fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work) which >uses queue_rcu_work() internally. This would allow us to deallocate >fl_flow_mask in same manner on all code paths and doesn't require any >extensions in fl_flow_mask struct (kfree_rcu would require extending it >with rcu_head). Got it. Makes sense to me. Thanks!