From: Shay Drory <shayd@nvidia.com>
To: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>
Cc: jiri@nvidia.com, saeedm@nvidia.com, parav@nvidia.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Shay Drory <shayd@nvidia.com>
Subject: [PATCH net-next 0/4] devlink: Introduce cpu_affinity command
Date: Tue, 22 Feb 2022 12:58:08 +0200
Message-ID: <20220222105812.18668-1-shayd@nvidia.com>

Currently a user can only configure the IRQ CPU affinity of a device
via the global /proc/irq/../smp_affinity interface; however, this
interface changes the affinity globally, across all subsystems
connected to the device.
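
For reference, the legacy flow looks roughly like the following (the
IRQ number 50 is just a placeholder; the real numbers come from
/proc/interrupts):

$ grep mlx5 /proc/interrupts            # find the device's IRQ numbers
$ echo 3 > /proc/irq/50/smp_affinity    # pin IRQ 50 to CPUs 0-1 (mask 0x3)
$ cat /proc/irq/50/smp_affinity_list    # verify; prints e.g. "0-1"

Every queue of every subsystem multiplexed on that IRQ is affected at
once.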

Historically, this API has been useful for single-function devices
since, generally speaking, the queue structure created on top of the
IRQ vectors is predictable enough that this control is usable.

However, with complex multi-subsystem devices like mlx5, queues are
assigned at every layer of the software stack, and multiple queues,
each with a different use, are served by the same IRQ. Hence, simply
fiddling with the base IRQ's affinity is no longer effective.

As an example, mlx5 SFs can share MSI-X IRQs between themselves, which
means the user currently has no control over which CPU set a given SF
uses. Hence, an application and its IRQ can end up running on different
CPUs, which leads to lower performance, as shown in the table below
(a sketch for reproducing the measurement follows the table).

application=netperf,    SF-IRQ     channel affinity   latency (usec)
                                                      (lower is better)
cpu=0 (numa=0)           cpu={0}   cpu={0}            14.417
cpu=8 (numa=0)           cpu={0}   cpu={0}            15.114 (+5%)
cpu=1 (numa=1)           cpu={0}   cpu={0}            17.784 (+30%)
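
The rows above could be approximated with a CPU-pinned netperf latency
test, sketched below. The exact netperf invocation and the SF/channel
affinity setup behind the table are not given in this cover letter, and
$SERVER is a placeholder, so treat this purely as an illustration:

$ taskset -c 0 netperf -H $SERVER -t TCP_RR   # row 1: app on cpu 0 (numa 0)
$ taskset -c 8 netperf -H $SERVER -t TCP_RR   # row 2: app on cpu 8 (numa 0)
$ taskset -c 1 netperf -H $SERVER -t TCP_RR   # row 3: app on cpu 1 (numa 1)

TCP_RR reports a transaction rate; with the default single outstanding
transaction, latency in usec is roughly 10^6 divided by that rate.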

This series is a start at resolving this problem by inverting the
control of the affinities. Instead of having the user go behind the
driver's back and adjust the IRQs the driver already created, the user
tells the software layer which CPUs to use and the software layer
manages the rest. The suggested command then trickles down to the PCI
driver, which creates/shares MSI-X IRQs and resources to achieve it.
In the mlx5 SF example the involved software components would be
devlink, rdma, vdpa and netdev.

This series introduces a devlink control that assigns a CPU set to the
cross-subsystem mlx5_core PCI function device. It can be used on a PF,
VF or SF and restricts all the software layers above it to the given
CPU set.

For each specified CPU, the SF either reuses an existing IRQ already
affiliated with that CPU or takes a new IRQ available from the device.
For example, if the user gives an affinity of 3 (binary 11, i.e. CPUs
0 and 1), the SF creates the completion EQs the driver needs internally
and attaches them to those specific CPUs' IRQs.
If the SF is already fully probed, a devlink reload is required for
cpu_affinity to take effect.

The following is the devlink command structure for setting the affinity
of an mlx5 PF/VF/SF (a concrete example follows the template):
$ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity value \
          [cpu_bitmask] cmode driverinit
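
For instance, restricting the SF above to CPUs 0 and 1 (bitmask 0x3)
and making the setting take effect could look like the sketch below.
The exact value encoding accepted by the new bitfield parameter is
defined by the patches, so the value shown here is illustrative; dev
reload and dev param show are existing devlink subcommands:

$ devlink dev param set auxiliary/mlx5_core.sf.4 name cpu_affinity \
          value 0x3 cmode driverinit
$ devlink dev reload auxiliary/mlx5_core.sf.4
$ devlink dev param show auxiliary/mlx5_core.sf.4 name cpu_affinity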

Applications that want to restrict SF or VF hardware to a CPU set, for
instance container workloads, can use this API to achieve that easily,
as illustrated below.
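
For example (an assumed pairing, not something the series mandates), a
container pinned to CPUs 0-1 with the standard --cpuset-cpus option
could run on top of an SF whose cpu_affinity is set to the matching 0x3
bitmask as shown above; "my-workload" is a placeholder image name:

$ docker run --cpuset-cpus=0,1 my-workload   # workload confined to CPUs 0-1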

Shay Drory (4):
  net netlink: Introduce NLA_BITFIELD type
  devlink: Add support for NLA_BITFIELD for devlink param
  devlink: Add new cpu_affinity generic device param
  net/mlx5: Support cpu_affinity devlink dev param

 .../networking/devlink/devlink-params.rst     |   5 +
 Documentation/networking/devlink/mlx5.rst     |   3 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c | 123 +++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/devlink.h |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  39 +++++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   2 +
 .../ethernet/mellanox/mlx5/core/mlx5_irq.h    |   5 +-
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c |  85 +++++++++-
 include/net/devlink.h                         |  22 +++
 include/net/netlink.h                         |  30 ++++
 include/uapi/linux/netlink.h                  |  10 ++
 lib/nlattr.c                                  | 145 +++++++++++++++++-
 net/core/devlink.c                            | 143 +++++++++++++++--
 net/netlink/policy.c                          |   4 +
 14 files changed, 594 insertions(+), 24 deletions(-)

-- 
2.21.3

