From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jakub Kicinski
Subject: Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC actions
Date: Wed, 29 Aug 2018 20:12:11 +0200
Message-ID: <20180829201211.6c58b827@cakuba.netronome.com>
References: <20180809150118.5275.63824.stgit@wsfd-netdev20.ntdv.lab.eng.bos.redhat.com>
 <20180809202608.6b816326@cakuba.netronome.com>
 <20180811.120627.662252154567814394.davem@davemloft.net>
 <4E288F34-0559-4C8A-8B3B-4410364791AA@redhat.com>
 <20180817042722.0e534ce0@cakuba>
 <229BA7FA-916B-47EA-8FD4-3F0B8BDDD145@redhat.com>
 <20180823201446.3802e84b@cakuba.netronome.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: "David Miller" , netdev@vger.kernel.org, jhs@mojatatu.com,
 xiyou.wangcong@gmail.com, jiri@resnulli.us, simon.horman@netronome.com,
 "Marcelo Ricardo Leitner" , louis.peens@netronome.com
To: "Eelco Chaudron"
Return-path:
Received: from mail-pf1-f195.google.com ([209.85.210.195]:38216 "EHLO
 mail-pf1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with
 ESMTP id S1727399AbeH2WK0 (ORCPT ); Wed, 29 Aug 2018 18:10:26 -0400
Received: by mail-pf1-f195.google.com with SMTP id x17-v6so2638521pfh.5
 for ; Wed, 29 Aug 2018 11:12:21 -0700 (PDT)
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote:
> On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
>
> > On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
> >> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
> >>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
> >>>> On 11 Aug 2018, at 21:06, David Miller wrote:
> >>>>
> >>>>> From: Jakub Kicinski
> >>>>> Date: Thu, 9 Aug 2018 20:26:08 -0700
> >>>>>
> >>>>>> It is not immediately clear why this is needed. The memory and
> >>>>>> updating two sets of counters won't come for free, so perhaps a
> >>>>>> stronger justification than troubleshooting is due?
> >>>>>> :S
> >>>>>>
> >>>>>> Netdev has counters for fallback vs forwarded traffic, so you'd
> >>>>>> know that traffic hits the SW datapath, plus the rules which are
> >>>>>> in_hw will most likely not match as of today for flower (assuming
> >>>>>> correctness).
> >>>>
> >>>> I strongly believe that these counters are a requirement for a
> >>>> mixed software/hardware (flow) based forwarding environment. The
> >>>> global counters will not help much here as you might have chosen to
> >>>> have certain traffic forwarded by software.
> >>>>
> >>>> These counters are probably the only option you have to figure out
> >>>> why forwarding is not as fast as expected, when you want to blame
> >>>> the TC offload NIC.
> >>>
> >>> The suggested debugging flow would be:
> >>> (1) check that the global counters for fallback are incrementing;
> >>> (2) find a flow with high stats but no in_hw flag set.
> >>>
> >>> The in_hw indication should be sufficient in most cases (unless
> >>> there are shared blocks between netdevs of different ASICs...).
> >>
> >> I guess the aim is to find misbehaving hardware, i.e. having the
> >> in_hw flag set, but flows still coming to the kernel.
> >
> > For misbehaving hardware in_hw will not work indeed. Whether we need
> > these extra always-on stats for such a use case could be debated :)
> >
> >>>>>> I'm slightly concerned about potential performance impact, would
> >>>>>> you be able to share some numbers for a non-trivial number of
> >>>>>> flows (100k active?)?
> >>>>>
> >>>>> Agreed, features used for diagnostics cannot have a harmful
> >>>>> penalty for fast path performance.
> >>>>
> >>>> Fast path performance is not affected as these counters are not
> >>>> incremented there. They are only incremented by the NIC driver
> >>>> when it gathers its statistics from hardware.
> >>>
> >>> Not by much, you are adding state to performance-critical
> >>> structures, though, for what is effectively debugging purposes.
> >>>
> >>> I was mostly talking about the HW offload stat updates (sorry for
> >>> not being clear).
> >>>
> >>> We can have some hundreds of thousands of active offloaded flows,
> >>> each of them can have multiple actions, and stats have to be
> >>> updated multiple times per second and dumped probably around once a
> >>> second, too. On a busy system the stats will get evicted from cache
> >>> between each round.
> >>>
> >>> But I'm speculating; let's see if I can get some numbers on it (if
> >>> you could get some too, that would be great!).
> >>
> >> I'll try to measure some of this later this week/early next week.
> >
> > I asked Louis to run some tests while I'm travelling, and he reports
> > that my worry about reporting the extra stats was unfounded. The
> > update function does not show up in traces at all. It seems under
> > stress (generated with stress-ng) the thread dumping the stats in
> > userspace (in OvS it would be the revalidator) actually consumes less
> > CPU in __gnet_stats_copy_basic (0.4% less for ~2.0% total).
> >
> > Would this match with your results? I'm not sure why dumping would be
> > faster with your change..
>
> Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC
> rules installed in HW.
>
> For __gnet_stats_copy_basic() being faster I have (had) a theory. Now
> this function is called twice, and I assumed the first call would warm
> the cache and the second call would be faster.
>
> Sampling a lot of perf data, I get an average of 1115ns with the base
> kernel and 954ns with the fix applied, so about 14% faster.
>
> I then thought I would perf tcf_action_copy_stats(), as it is the
> place updating the additional counter.
> But even in this case, I see better performance with the patch
> applied.
>
> On average 13581ns with the fix, vs the base kernel at 13910ns, so
> about 2.3% faster.
>
> I guess the changes to the tc_action structure got better cache
> alignment.

Interesting that you could reproduce the speed-up too! +1 for the guess.

Seems like my caution about slowing down SW paths to support HW offload
landed on a very unfortunate patch :)