netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Petr Machata <petrm@nvidia.com>
To: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	<netdev@vger.kernel.org>
Cc: Ido Schimmel <idosch@nvidia.com>, Petr Machata <petrm@nvidia.com>,
	"David Ahern" <dsahern@kernel.org>,
	Simon Horman <horms@kernel.org>, <mlxsw@nvidia.com>
Subject: [PATCH net-next v4 0/7] Support for nexthop group statistics
Date: Wed, 6 Mar 2024 13:49:14 +0100	[thread overview]
Message-ID: <cover.1709727981.git.petrm@nvidia.com> (raw)

ECMP is a fundamental component in L3 designs. However, it's fragile. Many
factors influence whether an ECMP group will operate as intended: hash
policy (i.e. the set of fields that contribute to ECMP hash calculation),
neighbor validity, hash seed (which might lead to polarization) or the type
of ECMP group used (hash-threshold or resilient).

At the same time, collection of statistics that would help an operator
determine that the group performs as desired, is difficult.

A solution that we present in this patchset is to add counters to next hop
group entries. For SW-datapath deployments, this will on its own allow
collection and evaluation of relevant statistics. For HW-datapath
deployments, we further add a way to request that HW counters be installed
for a given group, in-kernel interfaces to collect the HW statistics, and
netlink interfaces to query them.

For example:

    # ip nexthop replace id 4000 group 4001/4002 hw_stats on

    # ip -s -d nexthop show id 4000
    id 4000 group 4001/4002 scope global proto unspec offload hw_stats on used on
      stats:
        id 4001 packets 5002 packets_hw 5000
        id 4002 packets 4999 packets_hw 4999

The point of the patchset is visibility of ECMP balance, and that is
influenced by packet headers, not their payload. Correspondingly, we only
include packet counters in the statistics, not byte counters.

We also decided to model HW statistics as a nexthop group attribute, not an
arbitrary nexthop one. The latter would count any traffic going through a
given nexthop, regardless of which ECMP group it is in, or any at all. The
reason is again hat the point of the patchset is ECMP balance visibility,
not arbitrary inspection of how busy a particular nexthop is.
Implementation of individual-nexthop statistics is certainly possible, and
could well follow the general approach we are taking in this patchset.
For resilient groups, per-bucket statistics could be done in a similar
manner as well.

This patchset contains the core code. mlxsw support will be sent in a
follow-up patch set.

This patchset progresses as follows:

- Patches #1 and #2 add support for a new next-hop object attribute,
  NHA_OP_FLAGS. That is meant to carry various op-specific signaling, in
  particular whether SW- and HW-collected nexthop stats should be part of
  the get or dump response. The idea is to avoid wasting message space, and
  time for collection of HW statistics, when the values are not needed.

- Patches #3 and #4 add SW-datapath stats and corresponding UAPI.

- Patches #5, #6 and #7 add support fro HW-datapath stats and UAPI.
  Individual drivers still need to contribute the appropriate HW-specific
  support code.

v4:
- Patch #2:
    - s/nla_get_bitfield32/nla_get_u32/ in __nh_valid_dump_req().

v3:
- Patch #3:
    - Convert to u64_stats_t
- Patch #4:
    - Give a symbolic name to the set of all valid dump flags
      for the NHA_OP_FLAGS attribute.
    - Convert to u64_stats_t
- Patch #6:
    - Use a named constant for the NHA_HW_STATS_ENABLE policy.

v2:
- Patch #2:
    - Change OP_FLAGS to u32, enforce through NLA_POLICY_MASK
- Patch #3:
    - Set err on nexthop_create_group() error path
- Patch #4:
    - Use uint to encode NHA_GROUP_STATS_ENTRY_PACKETS
    - Rename jump target in nla_put_nh_group_stats() to avoid
      having to rename further in the patchset.
- Patch #7:
    - Use uint to encode NHA_GROUP_STATS_ENTRY_PACKETS_HW
    - Do not cancel outside of nesting in nla_put_nh_group_stats()

Ido Schimmel (5):
  net: nexthop: Add nexthop group entry stats
  net: nexthop: Expose nexthop group stats to user space
  net: nexthop: Add hardware statistics notifications
  net: nexthop: Add ability to enable / disable hardware statistics
  net: nexthop: Expose nexthop group HW stats to user space

Petr Machata (2):
  net: nexthop: Adjust netlink policy parsing for a new attribute
  net: nexthop: Add NHA_OP_FLAGS

 include/net/nexthop.h        |  29 +++
 include/uapi/linux/nexthop.h |  45 +++++
 net/ipv4/nexthop.c           | 329 ++++++++++++++++++++++++++++++-----
 3 files changed, 363 insertions(+), 40 deletions(-)

-- 
2.43.0


             reply	other threads:[~2024-03-06 12:50 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-06 12:49 Petr Machata [this message]
2024-03-06 12:49 ` [PATCH net-next v4 1/7] net: nexthop: Adjust netlink policy parsing for a new attribute Petr Machata
2024-03-09 15:02   ` Eric Dumazet
2024-03-09 17:50     ` David Ahern
2024-03-06 12:49 ` [PATCH net-next v4 2/7] net: nexthop: Add NHA_OP_FLAGS Petr Machata
2024-03-06 17:22   ` Ido Schimmel
2024-03-06 12:49 ` [PATCH net-next v4 3/7] net: nexthop: Add nexthop group entry stats Petr Machata
2024-03-06 12:49 ` [PATCH net-next v4 4/7] net: nexthop: Expose nexthop group stats to user space Petr Machata
2024-03-06 12:49 ` [PATCH net-next v4 5/7] net: nexthop: Add hardware statistics notifications Petr Machata
2024-03-06 12:49 ` [PATCH net-next v4 6/7] net: nexthop: Add ability to enable / disable hardware statistics Petr Machata
2024-03-06 12:49 ` [PATCH net-next v4 7/7] net: nexthop: Expose nexthop group HW stats to user space Petr Machata
2024-03-08 10:40 ` [PATCH net-next v4 0/7] Support for nexthop group statistics patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1709727981.git.petrm@nvidia.com \
    --to=petrm@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=idosch@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=mlxsw@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).