* Re: [PATCH v2] net: macb: undo operations in case of failure
From: Nicolas Ferre @ 2020-06-18 15:53 UTC (permalink / raw)
To: Claudiu Beznea, davem, kuba, linux; +Cc: antoine.tenart, netdev, linux-kernel
In-Reply-To: <1592469460-17825-1-git-send-email-claudiu.beznea@microchip.com>
On 18/06/2020 at 10:37, Claudiu Beznea wrote:
> Undo previously done operation in case macb_phylink_connect()
> fails. Since macb_reset_hw() is the 1st undo operation the
> napi_exit label was renamed to reset_hw.
>
> Fixes: 7897b071ac3b ("net: macb: convert to phylink")
> Signed-off-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Thanks Claudiu.
Regards,
Nicolas
> ---
>
> Changes in v2:
> - corrected fixes SHA1
>
> drivers/net/ethernet/cadence/macb_main.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
> index 67933079aeea..257c4920cb88 100644
> --- a/drivers/net/ethernet/cadence/macb_main.c
> +++ b/drivers/net/ethernet/cadence/macb_main.c
> @@ -2558,7 +2558,7 @@ static int macb_open(struct net_device *dev)
>
> err = macb_phylink_connect(bp);
> if (err)
> - goto napi_exit;
> + goto reset_hw;
>
> netif_tx_start_all_queues(dev);
>
> @@ -2567,9 +2567,11 @@ static int macb_open(struct net_device *dev)
>
> return 0;
>
> -napi_exit:
> +reset_hw:
> + macb_reset_hw(bp);
> for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue)
> napi_disable(&queue->napi);
> + macb_free_consistent(bp);
> pm_exit:
> pm_runtime_put_sync(&bp->pdev->dev);
> return err;
>
--
Nicolas Ferre
^ permalink raw reply
* Re: [PATCH net] selftests/net: report etf errors correctly
From: Jakub Kicinski @ 2020-06-18 15:54 UTC (permalink / raw)
To: Willem de Bruijn; +Cc: netdev, davem, Willem de Bruijn
In-Reply-To: <20200618145549.37937-1-willemdebruijn.kernel@gmail.com>
On Thu, 18 Jun 2020 10:55:49 -0400 Willem de Bruijn wrote:
> + switch (err->ee_errno) {
> + case ECANCELED:
> + if (err->ee_code != SO_EE_CODE_TXTIME_MISSED)
> + error(1, 0, "errqueue: unknown ECANCELED %u\n",
> + err->ee_code);
> + reason = "missed txtime";
> + break;
> + case EINVAL:
> + if (err->ee_code != SO_EE_CODE_TXTIME_INVALID_PARAM)
> + error(1, 0, "errqueue: unknown EINVAL %u\n",
> + err->ee_code);
> + reason = "invalid txtime";
> + break;
> + default:
> + error(1, 0, "errqueue: errno %u code %u\n",
> + err->ee_errno, err->ee_code);
> + };
>
> tstamp = ((int64_t) err->ee_data) << 32 | err->ee_info;
> tstamp -= (int64_t) glob_tstart;
> tstamp /= 1000 * 1000;
> - fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped\n",
> - data[ret - 1], tstamp);
> + fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> + data[ret - 1], tstamp, reason);
Hi Willem! Checkpatch is grumpy about some misalignment here:
CHECK: Alignment should match open parenthesis
#67: FILE: tools/testing/selftests/net/so_txtime.c:187:
+ error(1, 0, "errqueue: unknown ECANCELED %u\n",
+ err->ee_code);
CHECK: Alignment should match open parenthesis
#73: FILE: tools/testing/selftests/net/so_txtime.c:193:
+ error(1, 0, "errqueue: unknown EINVAL %u\n",
+ err->ee_code);
CHECK: Alignment should match open parenthesis
#87: FILE: tools/testing/selftests/net/so_txtime.c:205:
+ fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
+ data[ret - 1], tstamp, reason);
^ permalink raw reply
* Re: [PATCH net] ibmveth: Fix max MTU limit
From: Jakub Kicinski @ 2020-06-18 15:57 UTC (permalink / raw)
To: Thomas Falcon; +Cc: netdev, linuxppc-dev
In-Reply-To: <1592495026-27202-1-git-send-email-tlfalcon@linux.ibm.com>
On Thu, 18 Jun 2020 10:43:46 -0500 Thomas Falcon wrote:
> The max MTU limit defined for ibmveth is not accounting for
> virtual ethernet buffer overhead, which is twenty-two additional
> bytes set aside for the ethernet header and eight additional bytes
> of an opaque handle reserved for use by the hypervisor. Update the
> max MTU to reflect this overhead.
>
> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
How about
Fixes: d894be57ca92 ("ethernet: use net core MTU range checking in more drivers")
Fixes: 110447f8269a ("ethernet: fix min/max MTU typos")
?
^ permalink raw reply
* Re: [PATCH v1 2/3] net/fsl: acpize xgmac_mdio
From: Andy Shevchenko @ 2020-06-18 16:00 UTC (permalink / raw)
To: Jeremy Linton
Cc: Andrew Lunn, Calvin Johnson, Russell King - ARM Linux admin, Jon,
Cristi Sovaiala, Ioana Ciornei, Florian Fainelli, Madalin Bucur,
netdev, linux.cj
In-Reply-To: <a1ae8926-9082-74ca-298a-853d297c84e7@arm.com>
On Thu, Jun 18, 2020 at 6:46 PM Jeremy Linton <jeremy.linton@arm.com> wrote:
> On 6/17/20 12:34 PM, Andrew Lunn wrote:
> > On Wed, Jun 17, 2020 at 10:45:34PM +0530, Calvin Johnson wrote:
> >> From: Jeremy Linton <jeremy.linton@arm.com>
> >
> >> +static const struct acpi_device_id xgmac_acpi_match[] = {
> >> + { "NXP0006", (kernel_ulong_t)NULL },
> >
> > Hi Jeremy
> >
> > What exactly does NXP0006 represent? An XGMAC MDIO bus master? Some
> > NXP MDIO bus master? An XGMAC Ethernet controller which has an NXP
> > MDIO bus master? A cluster of Ethernet controllers?
>
> Strictly speaking its a NXP defined (they own the "NXP" prefix per
> https://uefi.org/pnp_id_list) id. So, they have tied it to a specific
> bit of hardware. In this case it appears to be a shared MDIO master
> which isn't directly contained in an Ethernet controller. Its somewhat
> similar to a "nxp,xxxxx" compatible id, depending on how they are using
> it to identify an ACPI device object (_HID()/_CID()).
>
> So AFAIK, this is all valid ACPI usage as long as the ID maps to a
> unique device/object.
>
> >
> > Is this documented somewhere? In the DT world we have a clear
> > documentation for all the compatible strings. Is there anything
> > similar in the ACPI world for these magic numbers?
>
> Sadly not fully. The mentioned PNP and ACPI
> (https://uefi.org/acpi_id_list) ids lists are requested and registered
> to a given organization. But, once the prefix is owned, it becomes the
> responsibility of that organization to assign & manage the ID's with
> their prefix. There are various individuals/etc which have collected
> lists, though like PCI ids, there aren't any formal publishing requirements.
And here is the question, do we have (in form of email or other means)
an official response from NXP about above mentioned ID?
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH v5 3/3] net: phy: mscc: handle the clkout control on some phy variants
From: Heiko Stübner @ 2020-06-18 16:01 UTC (permalink / raw)
To: Russell King - ARM Linux admin
Cc: Andrew Lunn, davem, kuba, robh+dt, f.fainelli, hkallweit1, netdev,
devicetree, linux-kernel, christoph.muellner
In-Reply-To: <20200618154748.GE1551@shell.armlinux.org.uk>
Am Donnerstag, 18. Juni 2020, 17:47:48 CEST schrieb Russell King - ARM Linux admin:
> On Thu, Jun 18, 2020 at 05:41:54PM +0200, Heiko Stübner wrote:
> > Am Donnerstag, 18. Juni 2020, 15:41:02 CEST schrieb Russell King - ARM Linux admin:
> > > On Thu, Jun 18, 2020 at 03:28:22PM +0200, Andrew Lunn wrote:
> > > > On Thu, Jun 18, 2020 at 02:11:39PM +0200, Heiko Stuebner wrote:
> > > > > From: Heiko Stuebner <heiko.stuebner@theobroma-systems.com>
> > > > >
> > > > > At least VSC8530/8531/8540/8541 contain a clock output that can emit
> > > > > a predefined rate of 25, 50 or 125MHz.
> > > > >
> > > > > This may then feed back into the network interface as source clock.
> > > > > So expose a clock-provider from the phy using the common clock framework
> > > > > to allow setting the rate.
> > > > >
> > > > > Signed-off-by: Heiko Stuebner <heiko.stuebner@theobroma-systems.com>
> > > > > ---
> > > > > drivers/net/phy/mscc/mscc.h | 13 +++
> > > > > drivers/net/phy/mscc/mscc_main.c | 182 +++++++++++++++++++++++++++++--
> > > > > 2 files changed, 187 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/phy/mscc/mscc.h b/drivers/net/phy/mscc/mscc.h
> > > > > index fbcee5fce7b2..94883dab5cc1 100644
> > > > > --- a/drivers/net/phy/mscc/mscc.h
> > > > > +++ b/drivers/net/phy/mscc/mscc.h
> > > > > @@ -218,6 +218,13 @@ enum rgmii_clock_delay {
> > > > > #define INT_MEM_DATA_M 0x00ff
> > > > > #define INT_MEM_DATA(x) (INT_MEM_DATA_M & (x))
> > > > >
> > > > > +#define MSCC_CLKOUT_CNTL 13
> > > > > +#define CLKOUT_ENABLE BIT(15)
> > > > > +#define CLKOUT_FREQ_MASK GENMASK(14, 13)
> > > > > +#define CLKOUT_FREQ_25M (0x0 << 13)
> > > > > +#define CLKOUT_FREQ_50M (0x1 << 13)
> > > > > +#define CLKOUT_FREQ_125M (0x2 << 13)
> > > > > +
> > > > > #define MSCC_PHY_PROC_CMD 18
> > > > > #define PROC_CMD_NCOMPLETED 0x8000
> > > > > #define PROC_CMD_FAILED 0x4000
> > > > > @@ -360,6 +367,12 @@ struct vsc8531_private {
> > > > > */
> > > > > unsigned int base_addr;
> > > > >
> > > > > +#ifdef CONFIG_COMMON_CLK
> > > > > + struct clk_hw clkout_hw;
> > > > > +#endif
> > > > > + u32 clkout_rate;
> > > > > + int clkout_enabled;
> > > > > +
> > > > > #if IS_ENABLED(CONFIG_MACSEC)
> > > > > /* MACsec fields:
> > > > > * - One SecY per device (enforced at the s/w implementation level)
> > > > > diff --git a/drivers/net/phy/mscc/mscc_main.c b/drivers/net/phy/mscc/mscc_main.c
> > > > > index 5d2777522fb4..727a9dd58403 100644
> > > > > --- a/drivers/net/phy/mscc/mscc_main.c
> > > > > +++ b/drivers/net/phy/mscc/mscc_main.c
> > > > > @@ -7,6 +7,7 @@
> > > > > * Copyright (c) 2016 Microsemi Corporation
> > > > > */
> > > > >
> > > > > +#include <linux/clk-provider.h>
> > > > > #include <linux/firmware.h>
> > > > > #include <linux/jiffies.h>
> > > > > #include <linux/kernel.h>
> > > > > @@ -431,7 +432,6 @@ static int vsc85xx_dt_led_mode_get(struct phy_device *phydev,
> > > > >
> > > > > return led_mode;
> > > > > }
> > > > > -
> > > > > #else
> > > > > static int vsc85xx_edge_rate_magic_get(struct phy_device *phydev)
> > > > > {
> > > > > @@ -1508,6 +1508,43 @@ static int vsc85xx_config_init(struct phy_device *phydev)
> > > > > return 0;
> > > > > }
> > > > >
> > > > > +static int vsc8531_config_init(struct phy_device *phydev)
> > > > > +{
> > > > > + struct vsc8531_private *vsc8531 = phydev->priv;
> > > > > + u16 val;
> > > > > + int rc;
> > > > > +
> > > > > + rc = vsc85xx_config_init(phydev);
> > > > > + if (rc)
> > > > > + return rc;
> > > > > +
> > > > > +#ifdef CONFIG_COMMON_CLK
> > > > > + switch (vsc8531->clkout_rate) {
> > > > > + case 25000000:
> > > > > + val = CLKOUT_FREQ_25M;
> > > > > + break;
> > > > > + case 50000000:
> > > > > + val = CLKOUT_FREQ_50M;
> > > > > + break;
> > > > > + case 125000000:
> > > > > + val = CLKOUT_FREQ_125M;
> > > > > + break;
> > > > > + default:
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + if (vsc8531->clkout_enabled)
> > > > > + val |= CLKOUT_ENABLE;
> > > > > +
> > > > > + rc = phy_write_paged(phydev, MSCC_PHY_PAGE_EXTENDED_GPIO,
> > > > > + MSCC_CLKOUT_CNTL, val);
> > > > > + if (rc)
> > > > > + return rc;
> > > > > +#endif
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > >
> > > > > +static int vsc8531_clkout_prepare(struct clk_hw *hw)
> > > > > +{
> > > > > + struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
> > > > > +
> > > > > + vsc8531->clkout_enabled = true;
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void vsc8531_clkout_unprepare(struct clk_hw *hw)
> > > > > +{
> > > > > + struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
> > > > > +
> > > > > + vsc8531->clkout_enabled = false;
> > > > > +}
> > > > > +
> > > >
> > > > > +static const struct clk_ops vsc8531_clkout_ops = {
> > > > > + .prepare = vsc8531_clkout_prepare,
> > > > > + .unprepare = vsc8531_clkout_unprepare,
> > > > > + .is_prepared = vsc8531_clkout_is_prepared,
> > > > > + .recalc_rate = vsc8531_clkout_recalc_rate,
> > > > > + .round_rate = vsc8531_clkout_round_rate,
> > > > > + .set_rate = vsc8531_clkout_set_rate,
> > > >
> > > > I'm not sure this is the expected behaviour. The clk itself should
> > > > only start ticking when the enable callback is called. But this code
> > > > will enable the clock when config_init() is called. I think you should
> > > > implement the enable and disable methods.
> > >
> > > That is actually incorrect. The whole "prepare" vs "enable" difference
> > > is that prepare can schedule, enable isn't permitted. So, if you need
> > > to sleep to enable the clock, then enabling the clock in the prepare
> > > callback is the right thing to do.
> > >
> > > However, the above driver just sets a flag, which only gets used when
> > > the PHY's config_init method is called; that really doesn't seem to be
> > > sane - the clock is available from the point that the PHY has been
> > > probed, and it'll be expected that once the clock is published, it can
> > > be made functional.
> >
> > Though I'm not sure how this fits in the whole bringup of ethernet phys.
> > Like the phy is dependent on the underlying ethernet controller to
> > actually turn it on.
> >
> > I guess we should check the phy-state and if it's not accessible, just
> > keep the values and if it's in a suitable state do the configuration.
> >
> > Calling a vsc8531_config_clkout() from both the vsc8531_config_init()
> > as well as the clk_(un-)prepare and clk_set_rate functions and being
> > protected by a check against phy_is_started() ?
>
> It sounds like it doesn't actually fit the clk API paradym then. I
> see that Rob suggested it, and from the DT point of view, it makes
> complete sense, but then if the hardware can't actually be used in
> the way the clk API expects it to be used, then there's a semantic
> problem.
>
> What is this clock used for?
It provides a source for the mac-clk for the actual transfers, here to
provide the 125MHz clock needed for the RGMII interface .
So right now the old rk3368-lion devicetree just declares a stub
fixed-clock and instructs the soc's clock controller to use it [0] .
And in the cover-letter here, I show the update variant with using
the clock defined here.
I've added the idea from my previous mail like shown below [1].
which would take into account the phy-state.
But I guess I'll wait for more input before spamming people with v6.
Thanks
Heiko
[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/rockchip/rk3368-lion.dtsi#n150
[1]
@@ -1508,6 +1508,157 @@ static int vsc85xx_config_init(struct phy_device *phydev)
return 0;
}
+#ifdef CONFIG_COMMON_CLK
+#define clkout_hw_to_vsc8531(_hw) container_of(_hw, struct vsc8531_private, clkout_hw)
+
+static int clkout_rates[] = {
+ 125000000,
+ 50000000,
+ 25000000,
+};
+
+static int vsc8531_config_clkout(struct phy_device *phydev)
+{
+ struct vsc8531_private *vsc8531 = phydev->priv;
+ u16 val;
+
+ /* when called from clk functions, make sure phy is running */
+ if (phy_is_started(phydev))
+ return 0;
+
+ switch (vsc8531->clkout_rate) {
+ case 25000000:
+ val = CLKOUT_FREQ_25M;
+ break;
+ case 50000000:
+ val = CLKOUT_FREQ_50M;
+ break;
+ case 125000000:
+ val = CLKOUT_FREQ_125M;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (vsc8531->clkout_enabled)
+ val |= CLKOUT_ENABLE;
+
+ return phy_write_paged(phydev, MSCC_PHY_PAGE_EXTENDED_GPIO,
+ MSCC_CLKOUT_CNTL, val);
+}
+
+static unsigned long vsc8531_clkout_recalc_rate(struct clk_hw *hw,
+ unsigned long parent_rate)
+{
+ struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
+
+ return vsc8531->clkout_rate;
+}
+
+static long vsc8531_clkout_round_rate(struct clk_hw *hw, unsigned long rate,
+ unsigned long *prate)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(clkout_rates); i++)
+ if (clkout_rates[i] <= rate)
+ return clkout_rates[i];
+ return 0;
+}
+
+static int vsc8531_clkout_set_rate(struct clk_hw *hw, unsigned long rate,
+ unsigned long parent_rate)
+{
+ struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
+ struct phy_device *phydev = vsc8531->phydev;
+
+ vsc8531->clkout_rate = rate;
+ return vsc8531_config_clkout(phydev);
+}
+
+static int vsc8531_clkout_prepare(struct clk_hw *hw)
+{
+ struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
+ struct phy_device *phydev = vsc8531->phydev;
+
+ vsc8531->clkout_enabled = true;
+ return vsc8531_config_clkout(phydev);
+}
+
+static void vsc8531_clkout_unprepare(struct clk_hw *hw)
+{
+ struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
+ struct phy_device *phydev = vsc8531->phydev;
+
+ vsc8531->clkout_enabled = false;
+ vsc8531_config_clkout(phydev);
+}
+
+static int vsc8531_clkout_is_prepared(struct clk_hw *hw)
+{
+ struct vsc8531_private *vsc8531 = clkout_hw_to_vsc8531(hw);
+
+ return vsc8531->clkout_enabled;
+}
+
+static const struct clk_ops vsc8531_clkout_ops = {
+ .prepare = vsc8531_clkout_prepare,
+ .unprepare = vsc8531_clkout_unprepare,
+ .is_prepared = vsc8531_clkout_is_prepared,
+ .recalc_rate = vsc8531_clkout_recalc_rate,
+ .round_rate = vsc8531_clkout_round_rate,
+ .set_rate = vsc8531_clkout_set_rate,
+};
+
+static int vsc8531_register_clkout(struct phy_device *phydev)
+{
+ struct vsc8531_private *vsc8531 = phydev->priv;
+ struct device *dev = &phydev->mdio.dev;
+ struct device_node *of_node = dev->of_node;
+ struct clk_init_data init;
+ int ret;
+
+ init.name = "vsc8531-clkout";
+ init.ops = &vsc8531_clkout_ops;
+ init.flags = 0;
+ init.parent_names = NULL;
+ init.num_parents = 0;
+ vsc8531->clkout_hw.init = &init;
+
+ /* optional override of the clockname */
+ of_property_read_string(of_node, "clock-output-names", &init.name);
+
+ /* register the clock */
+ ret = devm_clk_hw_register(dev, &vsc8531->clkout_hw);
+ if (!ret)
+ ret = devm_of_clk_add_hw_provider(dev, of_clk_hw_simple_get,
+ &vsc8531->clkout_hw);
+
+ return ret;
+}
+#else
+static int vsc8531_register_clkout(struct phy_device *phydev)
+{
+ return 0;
+}
+
+static int vsc8531_config_clkout(struct phy_device *phydev)
+{
+ return 0;
+}
+#endif
+
+static int vsc8531_config_init(struct phy_device *phydev)
+{
+ int rc;
+
+ rc = vsc85xx_config_init(phydev);
+ if (rc)
+ return rc;
+
+ return vsc8531_config_clkout(phydev);
+}
+
^ permalink raw reply
* [RFC PATCH 12/21] mlx5: hook up the netgpu channel functions
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Hook up all the netgpu plumbing, except the enable/disable calls.
Those will be added after the netgpu module itself.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
.../mellanox/mlx5/core/en/netgpu/setup.c | 2 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 35 +++++++++++++
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 52 +++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/en_txrx.c | 15 +++++-
4 files changed, 97 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
index f0578c41951d..76df316611fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
@@ -78,7 +78,7 @@ mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count)
* doesn't consider any_cache_count.
*/
return ctx->napi_cache_count >= count ||
- sq_cons_ready(&ctx->fill) >= (count - ctx->napi_cache_count);
+ sq_cons_avail(&ctx->fill, count - ctx->napi_cache_count);
}
void mlx5e_netgpu_taken(struct mlx5e_rq *rq)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 01d234369df6..c791578be5ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -62,6 +62,7 @@
#include "en/xsk/setup.h"
#include "en/xsk/rx.h"
#include "en/xsk/tx.h"
+#include "en/netgpu/setup.h"
#include "en/hv_vhca_stats.h"
#include "en/devlink.h"
#include "lib/mlx5.h"
@@ -1955,6 +1956,24 @@ mlx5e_xsk_optional_open(struct mlx5e_priv *priv, int ix,
return err;
}
+static int
+mlx5e_netgpu_optional_open(struct mlx5e_priv *priv, int ix,
+ struct mlx5e_params *params,
+ struct mlx5e_channel_param *cparam,
+ struct mlx5e_channel *c)
+{
+ struct netgpu_ctx *ctx;
+ int err = 0;
+
+ ctx = mlx5e_netgpu_get_ctx(params, params->xsk, ix);
+
+ if (ctx)
+ err = mlx5e_open_netgpu(priv, params, ctx, c);
+
+ return err;
+}
+
+
static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
struct mlx5e_params *params,
struct mlx5e_channel_param *cparam,
@@ -2002,6 +2021,13 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
goto err_close_queues;
}
+ /* This opens a second set of shadow queues for netgpu */
+ if (params->hd_split) {
+ err = mlx5e_netgpu_optional_open(priv, ix, params, cparam, c);
+ if (unlikely(err))
+ goto err_close_queues;
+ }
+
*cp = c;
return 0;
@@ -2037,6 +2063,9 @@ static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
mlx5e_deactivate_xsk(c);
+ if (test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+ mlx5e_deactivate_netgpu(c);
+
mlx5e_deactivate_rq(&c->rq);
mlx5e_deactivate_icosq(&c->icosq);
for (tc = 0; tc < c->num_tc; tc++)
@@ -2047,6 +2076,10 @@ static void mlx5e_close_channel(struct mlx5e_channel *c)
{
if (test_bit(MLX5E_CHANNEL_STATE_XSK, c->state))
mlx5e_close_xsk(c);
+
+ if (test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+ mlx5e_close_netgpu(c);
+
mlx5e_close_queues(c);
netif_napi_del(&c->napi);
@@ -3012,11 +3045,13 @@ void mlx5e_activate_priv_channels(struct mlx5e_priv *priv)
mlx5e_redirect_rqts_to_channels(priv, &priv->channels);
mlx5e_xsk_redirect_rqts_to_channels(priv, &priv->channels);
+ mlx5e_netgpu_redirect_rqts_to_channels(priv, &priv->channels);
}
void mlx5e_deactivate_priv_channels(struct mlx5e_priv *priv)
{
mlx5e_xsk_redirect_rqts_to_drop(priv, &priv->channels);
+ mlx5e_netgpu_redirect_rqts_to_drop(priv, &priv->channels);
mlx5e_redirect_rqts_to_drop(priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index dbb1c6323967..1edc157696f2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -50,6 +50,9 @@
#include "en/xdp.h"
#include "en/xsk/rx.h"
#include "en/health.h"
+#include "en/netgpu/setup.h"
+
+#include <net/netgpu.h>
static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
{
@@ -266,8 +269,11 @@ static inline int mlx5e_page_alloc(struct mlx5e_rq *rq,
{
if (rq->umem)
return mlx5e_xsk_page_alloc_umem(rq, dma_info);
- else
- return mlx5e_page_alloc_pool(rq, dma_info);
+
+ if (dma_info->netgpu_source)
+ return mlx5e_netgpu_get_page(rq, dma_info);
+
+ return mlx5e_page_alloc_pool(rq, dma_info);
}
void mlx5e_page_dma_unmap(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
@@ -279,6 +285,9 @@ void mlx5e_page_release_dynamic(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info,
bool recycle)
{
+ if (dma_info->netgpu_source)
+ return mlx5e_netgpu_put_page(rq, dma_info, recycle);
+
if (likely(recycle)) {
if (mlx5e_rx_cache_put(rq, dma_info))
return;
@@ -394,6 +403,9 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
return -ENOMEM;
}
+ if (rq->netgpu && !mlx5e_netgpu_avail(rq, wqe_bulk))
+ return -ENOMEM;
+
for (i = 0; i < wqe_bulk; i++) {
struct mlx5e_rx_wqe_cyc *wqe = mlx5_wq_cyc_get_wqe(wq, ix + i);
@@ -402,6 +414,9 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk)
goto free_wqes;
}
+ if (rq->netgpu)
+ mlx5e_netgpu_taken(rq);
+
return 0;
free_wqes:
@@ -416,12 +431,17 @@ mlx5e_add_skb_frag(struct mlx5e_rq *rq, struct sk_buff *skb,
struct mlx5e_dma_info *di, u32 frag_offset, u32 len,
unsigned int truesize)
{
+ /* XXX skip this if netgpu_source... */
dma_sync_single_for_cpu(rq->pdev,
di->addr + frag_offset,
len, DMA_FROM_DEVICE);
- page_ref_inc(di->page);
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
di->page, frag_offset, len, truesize);
+
+ if (skb->zc_netgpu)
+ di->page = NULL;
+ else
+ page_ref_inc(di->page);
}
static inline void
@@ -1109,16 +1129,26 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
{
struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
struct mlx5e_wqe_frag_info *head_wi = wi;
- u16 headlen = min_t(u32, MLX5E_RX_MAX_HEAD, cqe_bcnt);
+ bool hd_split = rq->netgpu;
+ u16 header_len = hd_split ? TOTAL_HEADERS : MLX5E_RX_MAX_HEAD;
+ u16 headlen = min_t(u32, header_len, cqe_bcnt);
u16 frag_headlen = headlen;
u16 byte_cnt = cqe_bcnt - headlen;
struct sk_buff *skb;
+ /* RST packets may have short headers (74) and no payload */
+ if (hd_split && headlen != TOTAL_HEADERS && byte_cnt) {
+ /* XXX add drop counter */
+ pr_warn_once("BAD hd_split: headlen %d != %d\n",
+ headlen, TOTAL_HEADERS);
+ return NULL;
+ }
+
/* XDP is not supported in this configuration, as incoming packets
* might spread among multiple pages.
*/
skb = napi_alloc_skb(rq->cq.napi,
- ALIGN(MLX5E_RX_MAX_HEAD, sizeof(long)));
+ ALIGN(header_len, sizeof(long)));
if (unlikely(!skb)) {
rq->stats->buff_alloc_err++;
return NULL;
@@ -1126,6 +1156,18 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
prefetchw(skb->data);
+ if (hd_split) {
+ /* first frag is only headers, should skip this frag and
+ * assume that all of the headers already copied to the skb
+ * inline data.
+ */
+ frag_info++;
+ frag_headlen = 0;
+ wi++;
+
+ skb->zc_netgpu = 1;
+ }
+
while (byte_cnt) {
u16 frag_consumed_bytes =
min_t(u16, frag_info->frag_size - frag_headlen, byte_cnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 8480278f2ee2..1c646a6dc29a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -122,6 +122,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
struct mlx5e_rq *xskrq = &c->xskrq;
struct mlx5e_rq *rq = &c->rq;
bool xsk_open = test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
+ bool netgpu_open = test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
bool aff_change = false;
bool busy_xsk = false;
bool busy = false;
@@ -139,7 +140,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
if (likely(budget)) { /* budget=0 means: don't poll rx rings */
- if (xsk_open)
+ if (xsk_open || netgpu_open)
work_done = mlx5e_poll_rx_cq(&xskrq->cq, budget);
if (likely(budget - work_done))
@@ -154,6 +155,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
mlx5e_post_rx_mpwqes,
mlx5e_post_rx_wqes,
rq);
+
+ if (netgpu_open) {
+ mlx5e_poll_ico_cq(&c->xskicosq.cq);
+ busy_xsk |= xskrq->post_wqes(xskrq);
+ }
+
if (xsk_open) {
if (mlx5e_poll_ico_cq(&c->xskicosq.cq))
/* Don't clear the flag if nothing was polled to prevent
@@ -191,6 +198,12 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
mlx5e_cq_arm(&c->icosq.cq);
mlx5e_cq_arm(&c->xdpsq.cq);
+ if (netgpu_open) {
+ mlx5e_handle_rx_dim(xskrq);
+ mlx5e_cq_arm(&c->xskicosq.cq);
+ mlx5e_cq_arm(&xskrq->cq);
+ }
+
if (xsk_open) {
mlx5e_handle_rx_dim(xskrq);
mlx5e_cq_arm(&c->xskicosq.cq);
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 13/21] netdevice: add SETUP_NETGPU to the netdev_bpf structure
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
This command will be used to setup/tear down netgpu queues.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/netdevice.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6fc613ed8eae..ea3c15ef0f29 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -880,6 +880,7 @@ enum bpf_netdev_command {
BPF_OFFLOAD_MAP_ALLOC,
BPF_OFFLOAD_MAP_FREE,
XDP_SETUP_XSK_UMEM,
+ XDP_SETUP_NETGPU,
};
struct bpf_prog_offload_ops;
@@ -911,6 +912,11 @@ struct netdev_bpf {
struct xdp_umem *umem;
u16 queue_id;
} xsk;
+ /* XDP_SETUP_NETGPU */
+ struct {
+ struct netgpu_ctx *ctx;
+ u16 queue_id;
+ } netgpu;
};
};
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 07/21] mlx5: remove the umem parameter from mlx5e_open_channel
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Instead of obtaining the umem parameter from the channel parameters
and passing it to the function, push this down into the function itself.
Move xsk open logic into its own function, in preparation for the
upcoming netgpu commit.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
.../net/ethernet/mellanox/mlx5/core/en_main.c | 35 +++++++++++++------
1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index cc8d30aa8a33..01d234369df6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1935,15 +1935,33 @@ static u8 mlx5e_enumerate_lag_port(struct mlx5_core_dev *mdev, int ix)
return (ix + port_aff_bias) % mlx5e_get_num_lag_ports(mdev);
}
+static int
+mlx5e_xsk_optional_open(struct mlx5e_priv *priv, int ix,
+ struct mlx5e_params *params,
+ struct mlx5e_channel_param *cparam,
+ struct mlx5e_channel *c)
+{
+ struct mlx5e_xsk_param xsk;
+ struct xdp_umem *umem;
+ int err = 0;
+
+ umem = mlx5e_xsk_get_umem(params, params->xsk, ix);
+
+ if (umem) {
+ mlx5e_build_xsk_param(umem, &xsk);
+ err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+ }
+
+ return err;
+}
+
static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
struct mlx5e_params *params,
struct mlx5e_channel_param *cparam,
- struct xdp_umem *umem,
struct mlx5e_channel **cp)
{
int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
struct net_device *netdev = priv->netdev;
- struct mlx5e_xsk_param xsk;
struct mlx5e_channel *c;
unsigned int irq;
int err;
@@ -1977,9 +1995,9 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
if (unlikely(err))
goto err_napi_del;
- if (umem) {
- mlx5e_build_xsk_param(umem, &xsk);
- err = mlx5e_open_xsk(priv, params, &xsk, umem, c);
+ /* This opens a second set of shadow queues for xsk */
+ if (params->xdp_prog) {
+ err = mlx5e_xsk_optional_open(priv, ix, params, cparam, c);
if (unlikely(err))
goto err_close_queues;
}
@@ -2345,12 +2363,7 @@ int mlx5e_open_channels(struct mlx5e_priv *priv,
mlx5e_build_channel_param(priv, &chs->params, cparam);
for (i = 0; i < chs->num; i++) {
- struct xdp_umem *umem = NULL;
-
- if (chs->params.xdp_prog)
- umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i);
-
- err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]);
+ err = mlx5e_open_channel(priv, i, &chs->params, cparam, &chs->c[i]);
if (err)
goto err_close_channels;
}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 19/21] core: add page recycling logic for netgpu pages
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
netgpu pages will always have a refcount of at least one (held by
the netgpu module). This logic and the codepath obviously needs
work, but suffices for a proof-of-concept.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
net/core/skbuff.c | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a391042be53..2b4176cab578 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -69,6 +69,7 @@
#include <net/xfrm.h>
#include <net/mpls.h>
#include <net/mptcp.h>
+#include <net/netgpu.h>
#include <linux/uaccess.h>
#include <trace/events/skb.h>
@@ -590,6 +591,24 @@ static void skb_free_head(struct sk_buff *skb)
kfree(head);
}
+static void skb_netgpu_unref(struct skb_shared_info *shinfo)
+{
+ struct page *page;
+ int count;
+ int i;
+
+ /* pages attached for skbs for TX shouldn't come here, since
+ * the skb is not marked as "zc_netgpu". (only RX skbs have this).
+ * dummy page does come here, but always has elevated refc.
+ */
+ for (i = 0; i < shinfo->nr_frags; i++) {
+ page = skb_frag_page(&shinfo->frags[i]);
+ count = page_ref_dec_return(page);
+ if (count <= 2)
+ __netgpu_put_page(g_ctx, page, false);
+ }
+}
+
static void skb_release_data(struct sk_buff *skb)
{
struct skb_shared_info *shinfo = skb_shinfo(skb);
@@ -600,8 +619,12 @@ static void skb_release_data(struct sk_buff *skb)
&shinfo->dataref))
return;
- for (i = 0; i < shinfo->nr_frags; i++)
- __skb_frag_unref(&shinfo->frags[i]);
+ if (skb->zc_netgpu && shinfo->nr_frags) {
+ skb_netgpu_unref(shinfo);
+ } else {
+ for (i = 0; i < shinfo->nr_frags; i++)
+ __skb_frag_unref(&shinfo->frags[i]);
+ }
if (shinfo->frag_list)
kfree_skb_list(shinfo->frag_list);
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 17/21] net/core: add the SO_REGISTER_DMA socket option
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
This option says that the socket will be performing zero copy sends
and receives through the netgpu module.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/uapi/asm-generic/socket.h | 2 ++
net/core/sock.c | 26 ++++++++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 77f7c1638eb1..5a8577c90e2a 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -119,6 +119,8 @@
#define SO_DETACH_REUSEPORT_BPF 68
+#define SO_REGISTER_DMA 69
+
#if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c4acf1f0220..c9e93ee675d6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -828,6 +828,25 @@ void sock_set_rcvbuf(struct sock *sk, int val)
}
EXPORT_SYMBOL(sock_set_rcvbuf);
+extern int netgpu_register_dma(struct sock *sk, char __user *optval, unsigned int optlen);
+
+static int
+sock_register_dma(struct sock *sk, char __user *optval, unsigned int optlen)
+{
+ int rc;
+ int (*fn)(struct sock *sk, char __user *optval, unsigned int optlen);
+
+ fn = symbol_get(netgpu_register_dma);
+ if (!fn)
+ return -EINVAL;
+
+ rc = fn(sk, optval, optlen);
+
+ symbol_put(netgpu_register_dma);
+
+ return rc;
+}
+
/*
* This is meant for all protocols to use and covers goings on
* at the socket level. Everything here is generic.
@@ -1232,6 +1251,13 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
}
break;
+ case SO_REGISTER_DMA:
+ if (!sk->sk_bound_dev_if)
+ ret = -EINVAL;
+ else
+ ret = sock_register_dma(sk, optval, optlen);
+ break;
+
case SO_TXTIME:
if (optlen != sizeof(struct sock_txtime)) {
ret = -EINVAL;
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 10/21] mlx5: add netgpu queue functions
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Add the netgpu setup/teardown functions, which are not hooked up yet.
The driver also handles netgpu module loading and unloading.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
.../net/ethernet/mellanox/mlx5/core/Makefile | 3 +-
.../mellanox/mlx5/core/en/netgpu/setup.c | 475 ++++++++++++++++++
.../mellanox/mlx5/core/en/netgpu/setup.h | 42 ++
3 files changed, 519 insertions(+), 1 deletion(-)
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index b61e47bc16e8..27983bd074e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -25,7 +25,8 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \
en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \
en_selftest.o en/port.o en/monitor_stats.o en/health.o \
en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/umem.o \
- en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o
+ en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o \
+ en/netgpu/setup.o
#
# Netdev extra
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
new file mode 100644
index 000000000000..f0578c41951d
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
@@ -0,0 +1,475 @@
+#include <linux/prefetch.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/indirect_call_wrapper.h>
+#include <net/ip6_checksum.h>
+#include <net/page_pool.h>
+#include <net/inet_ecn.h>
+#include "en.h"
+#include "en_tc.h"
+#include "lib/clock.h"
+#include "en/xdp.h"
+#include "en/params.h"
+#include "en/netgpu/setup.h"
+
+#include <net/netgpu.h>
+#include <uapi/misc/shqueue.h>
+
+int (*fn_netgpu_get_page)(struct netgpu_ctx *ctx,
+ struct page **page, dma_addr_t *dma);
+void (*fn_netgpu_put_page)(struct netgpu_ctx *, struct page *, bool);
+int (*fn_netgpu_get_pages)(struct sock *, struct page **,
+ unsigned long, int);
+struct netgpu_ctx *g_ctx;
+
+static void
+netgpu_fn_unload(void)
+{
+ if (fn_netgpu_get_page)
+ symbol_put(netgpu_get_page);
+ if (fn_netgpu_put_page)
+ symbol_put(netgpu_put_page);
+ if (fn_netgpu_get_pages)
+ symbol_put(netgpu_get_pages);
+
+ fn_netgpu_get_page = NULL;
+ fn_netgpu_put_page = NULL;
+ fn_netgpu_get_pages = NULL;
+}
+
+static int
+netgpu_fn_load(void)
+{
+ fn_netgpu_get_page = symbol_get(netgpu_get_page);
+ fn_netgpu_put_page = symbol_get(netgpu_put_page);
+ fn_netgpu_get_pages = symbol_get(netgpu_get_pages);
+
+ if (fn_netgpu_get_page &&
+ fn_netgpu_put_page &&
+ fn_netgpu_get_pages)
+ return 0;
+
+ netgpu_fn_unload();
+
+ return -EFAULT;
+}
+
+void
+mlx5e_netgpu_put_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+ bool recycle)
+{
+ struct netgpu_ctx *ctx = rq->netgpu;
+ struct page *page = dma_info->page;
+
+ if (page) {
+ put_page(page);
+ __netgpu_put_page(ctx, page, recycle);
+ }
+}
+
+bool
+mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count)
+{
+ struct netgpu_ctx *ctx = rq->netgpu;
+
+ /* XXX
+ * napi_cache_count is not a total count, and this also
+ * doesn't consider any_cache_count.
+ */
+ return ctx->napi_cache_count >= count ||
+ sq_cons_ready(&ctx->fill) >= (count - ctx->napi_cache_count);
+}
+
+void mlx5e_netgpu_taken(struct mlx5e_rq *rq)
+{
+ struct netgpu_ctx *ctx = rq->netgpu;
+
+ sq_cons_complete(&ctx->fill);
+}
+
+int
+mlx5e_netgpu_get_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info)
+{
+ struct netgpu_ctx *ctx = rq->netgpu;
+
+ return __netgpu_get_page(ctx, &dma_info->page, &dma_info->addr);
+}
+
+struct netgpu_ctx *
+mlx5e_netgpu_get_ctx(struct mlx5e_params *params, struct mlx5e_xsk *xsk,
+ u16 ix)
+{
+ if (!xsk || !xsk->ctx_tbl)
+ return NULL;
+
+ if (unlikely(ix >= params->num_channels))
+ return NULL;
+
+ if (unlikely(!xsk->is_netgpu))
+ return NULL;
+
+ return xsk->ctx_tbl[ix];
+}
+
+static int mlx5e_netgpu_get_tbl(struct mlx5e_xsk *xsk)
+{
+ if (!xsk->ctx_tbl) {
+ xsk->ctx_tbl = kcalloc(MLX5E_MAX_NUM_CHANNELS,
+ sizeof(*xsk->ctx_tbl), GFP_KERNEL);
+ if (unlikely(!xsk->ctx_tbl))
+ return -ENOMEM;
+ xsk->is_netgpu = true;
+ }
+ if (!xsk->is_netgpu)
+ return -EINVAL;
+
+ xsk->refcnt++;
+ xsk->ever_used = true;
+
+ return 0;
+}
+
+static void mlx5e_netgpu_put_tbl(struct mlx5e_xsk *xsk)
+{
+ if (!--xsk->refcnt) {
+ kfree(xsk->ctx_tbl);
+ xsk->ctx_tbl = NULL;
+ }
+}
+
+static void mlx5e_netgpu_remove_ctx(struct mlx5e_xsk *xsk, u16 ix)
+{
+ xsk->ctx_tbl[ix] = NULL;
+
+ mlx5e_netgpu_put_tbl(xsk);
+}
+
+static int mlx5e_netgpu_add_ctx(struct mlx5e_xsk *xsk, struct netgpu_ctx *ctx,
+ u16 ix)
+{
+ int err;
+
+ err = mlx5e_netgpu_get_tbl(xsk);
+ if (unlikely(err))
+ return err;
+
+ xsk->ctx_tbl[ix] = ctx;
+
+ return 0;
+}
+
+static int mlx5e_netgpu_enable_locked(struct mlx5e_priv *priv,
+ struct netgpu_ctx *ctx, u16 ix)
+{
+ struct mlx5e_params *params = &priv->channels.params;
+ struct mlx5e_channel *c;
+ int err;
+
+ if (unlikely(mlx5e_netgpu_get_ctx(&priv->channels.params,
+ &priv->xsk, ix)))
+ return -EBUSY;
+
+ err = mlx5e_netgpu_add_ctx(&priv->xsk, ctx, ix);
+ if (unlikely(err))
+ return err;
+
+ if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) {
+ /* XSK objects will be created on open. */
+ goto validate_closed;
+ }
+
+ if (!params->hd_split) {
+ /* XSK objects will be created when header split is set,
+ * and the channels are reopened.
+ */
+ goto validate_closed;
+ }
+
+ c = priv->channels.c[ix];
+
+ err = mlx5e_open_netgpu(priv, params, ctx, c);
+ if (unlikely(err))
+ goto err_remove_ctx;
+
+ mlx5e_activate_netgpu(c);
+
+ /* Don't wait for WQEs, because the newer xdpsock sample doesn't provide
+ * any Fill Ring entries at the setup stage.
+ */
+
+ err = mlx5e_netgpu_redirect_rqt_to_channel(priv, priv->channels.c[ix]);
+ if (unlikely(err))
+ goto err_deactivate;
+
+ return 0;
+
+err_deactivate:
+ mlx5e_deactivate_netgpu(c);
+ mlx5e_close_netgpu(c);
+
+err_remove_ctx:
+ mlx5e_netgpu_remove_ctx(&priv->xsk, ix);
+
+ return err;
+
+validate_closed:
+ return 0;
+}
+
+static int mlx5e_netgpu_disable_locked(struct mlx5e_priv *priv, u16 ix)
+{
+ struct mlx5e_channel *c;
+ struct netgpu_ctx *ctx;
+
+ ctx = mlx5e_netgpu_get_ctx(&priv->channels.params, &priv->xsk, ix);
+
+ if (unlikely(!ctx))
+ return -EINVAL;
+
+ if (!test_bit(MLX5E_STATE_OPENED, &priv->state))
+ goto remove_ctx;
+
+ /* NETGPU RQ is only created if header split is set. */
+ if (!priv->channels.params.hd_split)
+ goto remove_ctx;
+
+ c = priv->channels.c[ix];
+ mlx5e_netgpu_redirect_rqt_to_drop(priv, ix);
+ mlx5e_deactivate_netgpu(c);
+ mlx5e_close_netgpu(c);
+
+remove_ctx:
+ mlx5e_netgpu_remove_ctx(&priv->xsk, ix);
+
+ return 0;
+}
+
+static int mlx5e_netgpu_enable_ctx(struct mlx5e_priv *priv,
+ struct netgpu_ctx *ctx, u16 ix)
+{
+ int err;
+
+ mutex_lock(&priv->state_lock);
+ err = netgpu_fn_load();
+ if (!err)
+ err = mlx5e_netgpu_enable_locked(priv, ctx, ix);
+ g_ctx = ctx;
+ mutex_unlock(&priv->state_lock);
+
+ return err;
+}
+
+static int mlx5e_netgpu_disable_ctx(struct mlx5e_priv *priv, u16 ix)
+{
+ int err;
+
+ mutex_lock(&priv->state_lock);
+ err = mlx5e_netgpu_disable_locked(priv, ix);
+ netgpu_fn_unload();
+ g_ctx = NULL;
+ mutex_unlock(&priv->state_lock);
+
+ return err;
+}
+
+int
+mlx5e_netgpu_setup_ctx(struct net_device *dev, struct netgpu_ctx *ctx, u16 qid)
+{
+ struct mlx5e_priv *priv = netdev_priv(dev);
+ struct mlx5e_params *params = &priv->channels.params;
+ u16 ix;
+
+ if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid,
+ MLX5E_RQ_GROUP_XSK, &ix)))
+ return -EINVAL;
+
+ return ctx ? mlx5e_netgpu_enable_ctx(priv, ctx, ix) :
+ mlx5e_netgpu_disable_ctx(priv, ix);
+}
+
+static void mlx5e_build_netgpuicosq_param(struct mlx5e_priv *priv,
+ u8 log_wq_size,
+ struct mlx5e_sq_param *param)
+{
+ void *sqc = param->sqc;
+ void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+ mlx5e_build_sq_param_common(priv, param);
+
+ MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
+}
+
+static void mlx5e_build_netgpu_cparam(struct mlx5e_priv *priv,
+ struct mlx5e_params *params,
+ struct mlx5e_channel_param *cparam)
+{
+ const u8 icosq_size = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
+ struct mlx5e_xsk_param *xsk = (void *)0x1;
+
+ mlx5e_build_rq_param(priv, params, xsk, &cparam->rq);
+ mlx5e_build_rx_cq_param(priv, params, NULL, &cparam->rx_cq);
+
+ mlx5e_build_netgpuicosq_param(priv, icosq_size, &cparam->icosq);
+ mlx5e_build_ico_cq_param(priv, icosq_size, &cparam->icosq_cq);
+}
+
+int mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+ struct netgpu_ctx *ctx, struct mlx5e_channel *c)
+{
+ struct mlx5e_channel_param *cparam;
+ struct dim_cq_moder icocq_moder = {};
+ struct xdp_umem *umem = (void *)0x1;
+ int err;
+
+ cparam = kvzalloc(sizeof(*cparam), GFP_KERNEL);
+ if (!cparam)
+ return -ENOMEM;
+
+ mlx5e_build_netgpu_cparam(priv, params, cparam);
+
+ err = mlx5e_open_cq(c, params->rx_cq_moderation, &cparam->rx_cq,
+ &c->xskrq.cq);
+ if (unlikely(err))
+ goto err_free_cparam;
+
+ err = mlx5e_open_rq(c, params, &cparam->rq, NULL, umem, &c->xskrq);
+ if (unlikely(err))
+ goto err_close_rx_cq;
+ c->xskrq.netgpu = ctx;
+
+ err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->xskicosq.cq);
+ if (unlikely(err))
+ goto err_close_rq;
+
+ /* Create a dedicated SQ for posting NOPs whenever we need an IRQ to be
+ * triggered and NAPI to be called on the correct CPU.
+ */
+ err = mlx5e_open_icosq(c, params, &cparam->icosq, &c->xskicosq);
+ if (unlikely(err))
+ goto err_close_icocq;
+
+ kvfree(cparam);
+
+ spin_lock_init(&c->xskicosq_lock);
+
+ set_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
+
+ return 0;
+
+err_close_icocq:
+ mlx5e_close_cq(&c->xskicosq.cq);
+
+err_close_rq:
+ mlx5e_close_rq(&c->xskrq);
+
+err_close_rx_cq:
+ mlx5e_close_cq(&c->xskrq.cq);
+
+err_free_cparam:
+ kvfree(cparam);
+
+ return err;
+}
+
+void mlx5e_close_netgpu(struct mlx5e_channel *c)
+{
+ clear_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state);
+ napi_synchronize(&c->napi);
+ synchronize_rcu(); /* Sync with the XSK wakeup. */
+
+ mlx5e_close_rq(&c->xskrq);
+ mlx5e_close_cq(&c->xskrq.cq);
+ mlx5e_close_icosq(&c->xskicosq);
+ mlx5e_close_cq(&c->xskicosq.cq);
+
+ /* zero these out - so the next open has a clean slate. */
+ memset(&c->xskrq, 0, sizeof(c->xskrq));
+ memset(&c->xsksq, 0, sizeof(c->xsksq));
+ memset(&c->xskicosq, 0, sizeof(c->xskicosq));
+}
+
+void mlx5e_activate_netgpu(struct mlx5e_channel *c)
+{
+ mlx5e_activate_icosq(&c->xskicosq);
+ set_bit(MLX5E_RQ_STATE_ENABLED, &c->xskrq.state);
+ /* TX queue is created active. */
+
+ spin_lock(&c->xskicosq_lock);
+ mlx5e_trigger_irq(&c->xskicosq);
+ spin_unlock(&c->xskicosq_lock);
+}
+
+void mlx5e_deactivate_netgpu(struct mlx5e_channel *c)
+{
+ mlx5e_deactivate_rq(&c->xskrq);
+ /* TX queue is disabled on close. */
+ mlx5e_deactivate_icosq(&c->xskicosq);
+}
+
+static int mlx5e_redirect_netgpu_rqt(struct mlx5e_priv *priv, u16 ix, u32 rqn)
+{
+ struct mlx5e_redirect_rqt_param direct_rrp = {
+ .is_rss = false,
+ {
+ .rqn = rqn,
+ },
+ };
+
+ u32 rqtn = priv->xsk_tir[ix].rqt.rqtn;
+
+ return mlx5e_redirect_rqt(priv, rqtn, 1, direct_rrp);
+}
+
+int mlx5e_netgpu_redirect_rqt_to_channel(struct mlx5e_priv *priv,
+ struct mlx5e_channel *c)
+{
+ return mlx5e_redirect_netgpu_rqt(priv, c->ix, c->xskrq.rqn);
+}
+
+int mlx5e_netgpu_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix)
+{
+ return mlx5e_redirect_netgpu_rqt(priv, ix, priv->drop_rq.rqn);
+}
+
+int mlx5e_netgpu_redirect_rqts_to_channels(struct mlx5e_priv *priv,
+ struct mlx5e_channels *chs)
+{
+ int err, i;
+
+ for (i = 0; i < chs->num; i++) {
+ struct mlx5e_channel *c = chs->c[i];
+
+ if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, c->state))
+ continue;
+
+ err = mlx5e_netgpu_redirect_rqt_to_channel(priv, c);
+ if (unlikely(err))
+ goto err_stop;
+ }
+
+ return 0;
+
+err_stop:
+ for (i--; i >= 0; i--) {
+ if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, chs->c[i]->state))
+ continue;
+
+ mlx5e_netgpu_redirect_rqt_to_drop(priv, i);
+ }
+
+ return err;
+}
+
+void mlx5e_netgpu_redirect_rqts_to_drop(struct mlx5e_priv *priv,
+ struct mlx5e_channels *chs)
+{
+ int i;
+
+ for (i = 0; i < chs->num; i++) {
+ if (!test_bit(MLX5E_CHANNEL_STATE_NETGPU, chs->c[i]->state))
+ continue;
+
+ mlx5e_netgpu_redirect_rqt_to_drop(priv, i);
+ }
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
new file mode 100644
index 000000000000..37fde92ef89d
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
@@ -0,0 +1,42 @@
+#pragma once
+
+struct netgpu_ctx *
+mlx5e_netgpu_get_ctx(struct mlx5e_params *params, struct mlx5e_xsk *xsk,
+ u16 ix);
+
+int
+mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+ struct netgpu_ctx *ctx, struct mlx5e_channel *c);
+
+bool mlx5e_netgpu_avail(struct mlx5e_rq *rq, u8 count);
+void mlx5e_netgpu_taken(struct mlx5e_rq *rq);
+
+int
+mlx5e_netgpu_setup_ctx(struct net_device *dev, struct netgpu_ctx *ctx, u16 qid);
+
+int
+mlx5e_netgpu_get_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info);
+
+void
+mlx5e_netgpu_put_page(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+ bool recycle);
+
+int mlx5e_open_netgpu(struct mlx5e_priv *priv, struct mlx5e_params *params,
+ struct netgpu_ctx *ctx, struct mlx5e_channel *c);
+
+void mlx5e_close_netgpu(struct mlx5e_channel *c);
+
+void mlx5e_activate_netgpu(struct mlx5e_channel *c);
+
+void mlx5e_deactivate_netgpu(struct mlx5e_channel *c);
+
+int mlx5e_netgpu_redirect_rqt_to_channel(struct mlx5e_priv *priv,
+ struct mlx5e_channel *c);
+
+int mlx5e_netgpu_redirect_rqt_to_drop(struct mlx5e_priv *priv, u16 ix);
+
+int mlx5e_netgpu_redirect_rqts_to_channels(struct mlx5e_priv *priv,
+ struct mlx5e_channels *chs);
+
+void mlx5e_netgpu_redirect_rqts_to_drop(struct mlx5e_priv *priv,
+ struct mlx5e_channels *chs);
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 01/21] mm: add {add|release}_memory_pages
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
This allows creation of system pages at a specific physical address,
which is useful for creating dummy backing pages which correspond to
unaddressable external memory at specific locations.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/memory_hotplug.h | 4 +++
mm/memory_hotplug.c | 65 ++++++++++++++++++++++++++++++++--
2 files changed, 67 insertions(+), 2 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd8..05e012e1a203 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -138,6 +138,10 @@ extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct mhp_params *params);
+struct resource *add_memory_pages(int nid, u64 start, u64 size,
+ struct mhp_params *params);
+void release_memory_pages(struct resource *res);
+
#ifndef CONFIG_ARCH_HAS_ADD_PAGES
static inline int add_pages(int nid, unsigned long start_pfn,
unsigned long nr_pages, struct mhp_params *params)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9b34e03e730a..926cd4a2f81f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -125,8 +125,8 @@ static struct resource *register_memory_resource(u64 start, u64 size,
resource_name, flags);
if (!res) {
- pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
- start, start + size);
+ pr_debug("Unable to reserve %s region: %016llx->%016llx\n",
+ resource_name, start, start + size);
return ERR_PTR(-EEXIST);
}
return res;
@@ -1109,6 +1109,67 @@ int add_memory(int nid, u64 start, u64 size)
}
EXPORT_SYMBOL_GPL(add_memory);
+static int __ref add_memory_section(int nid, struct resource *res,
+ struct mhp_params *params)
+{
+ u64 start, end, section_size;
+ int ret;
+
+ /* must align start/end with memory block size */
+ end = res->start + resource_size(res);
+ section_size = memory_block_size_bytes();
+ start = round_down(res->start, section_size);
+ end = round_up(end, section_size);
+
+ mem_hotplug_begin();
+ ret = __add_pages(nid,
+ PHYS_PFN(start), PHYS_PFN(end - start), params);
+ mem_hotplug_done();
+
+ return ret;
+}
+
+/* requires device_hotplug_lock, see add_memory_resource() */
+static struct resource * __ref __add_memory_pages(int nid, u64 start, u64 size,
+ struct mhp_params *params)
+{
+ struct resource *res;
+ int ret;
+
+ res = register_memory_resource(start, size, "Private RAM");
+ if (IS_ERR(res))
+ return res;
+
+ ret = add_memory_section(nid, res, params);
+ if (ret < 0) {
+ release_memory_resource(res);
+ return ERR_PTR(ret);
+ }
+
+ return res;
+}
+
+struct resource *add_memory_pages(int nid, u64 start, u64 size,
+ struct mhp_params *params)
+{
+ struct resource *res;
+
+ lock_device_hotplug();
+ res = __add_memory_pages(nid, start, size, params);
+ unlock_device_hotplug();
+
+ return res;
+}
+EXPORT_SYMBOL_GPL(add_memory_pages);
+
+void release_memory_pages(struct resource *res)
+{
+ lock_device_hotplug();
+ release_memory_resource(res);
+ unlock_device_hotplug();
+}
+EXPORT_SYMBOL_GPL(release_memory_pages);
+
/*
* Add special, driver-managed memory to the system as system RAM. Such
* memory is not exposed via the raw firmware-provided memmap as system
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 02/21] mm: Allow DMA mapping of pages which are not online
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Change the system RAM check from 'valid' to 'online', so dummy
pages which refer to external DMA resources can be mapped.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/dma-mapping.h | 4 ++--
include/linux/mmzone.h | 7 +++++++
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 78f677cf45ab..fb142a01d1ba 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -348,8 +348,8 @@ static inline dma_addr_t dma_map_resource(struct device *dev,
BUG_ON(!valid_dma_direction(dir));
- /* Don't allow RAM to be mapped */
- if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
+ /* Don't allow online RAM to be mapped */
+ if (WARN_ON_ONCE(pfn_online(PHYS_PFN(phys_addr))))
return DMA_MAPPING_ERROR;
if (dma_is_direct(ops))
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4c37fd12104..9a9fe5704f97 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1348,6 +1348,13 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
return -1;
}
+static inline int pfn_online(unsigned long pfn)
+{
+ if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+ return 0;
+ return online_section(__nr_to_section(pfn_to_section_nr(pfn)));
+}
+
/*
* These are _only_ used during initialisation, therefore they
* can use __initdata ... They could have names to indicate
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 21/21] mlx5: add XDP_SETUP_NETGPU hook
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Add the hook which enables and disables the zero copy queues.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c791578be5ea..05f93f78ebbc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4598,6 +4598,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
case XDP_SETUP_XSK_UMEM:
return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem,
xdp->xsk.queue_id);
+ case XDP_SETUP_NETGPU:
+ return mlx5e_netgpu_setup_ctx(dev, xdp->netgpu.ctx,
+ xdp->netgpu.queue_id);
default:
return -EINVAL;
}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 04/21] mlx5: add definitions for header split and netgpu
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Add definitions for fixed-length header splitting at TOTAL_HEADERS,
and pointers for the upcoming netdma work. This reuses the same
structures as xsk, so both cannot be run simultaneously.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 22 +++++++++++++++++--
.../net/ethernet/mellanox/mlx5/core/en/txrx.h | 3 +++
2 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 842db20493df..0483cc815340 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -58,6 +58,12 @@
extern const struct net_device_ops mlx5e_netdev_ops;
struct page_pool;
+#define TCP_HDRS_LEN (20 + 20) /* headers + options */
+#define IP6_HDRS_LEN (40)
+#define MAC_HDR_LEN (14)
+#define TOTAL_HEADERS (TCP_HDRS_LEN + IP6_HDRS_LEN + MAC_HDR_LEN)
+#define HD_SPLIT_DEFAULT_FRAG_SIZE (4096)
+#define MLX5E_HD_SPLIT(params) (params->hd_split)
#define MLX5E_METADATA_ETHER_TYPE (0x8CE4)
#define MLX5E_METADATA_ETHER_LEN 8
@@ -228,6 +234,7 @@ enum mlx5e_priv_flag {
MLX5E_PFLAG_RX_STRIDING_RQ,
MLX5E_PFLAG_RX_NO_CSUM_COMPLETE,
MLX5E_PFLAG_XDP_TX_MPWQE,
+ MLX5E_PFLAG_RX_HD_SPLIT,
MLX5E_NUM_PFLAGS, /* Keep last */
};
@@ -263,6 +270,7 @@ struct mlx5e_params {
struct mlx5e_xsk *xsk;
unsigned int sw_mtu;
int hard_mtu;
+ bool hd_split;
};
enum {
@@ -299,7 +307,8 @@ struct mlx5e_cq_decomp {
enum mlx5e_dma_map_type {
MLX5E_DMA_MAP_SINGLE,
- MLX5E_DMA_MAP_PAGE
+ MLX5E_DMA_MAP_PAGE,
+ MLX5E_DMA_MAP_FIXED
};
struct mlx5e_sq_dma {
@@ -367,6 +376,7 @@ struct mlx5e_dma_info {
struct page *page;
struct xdp_buff *xsk;
};
+ bool netgpu_source;
};
/* XDP packets can be transmitted in different ways. On completion, we need to
@@ -545,6 +555,7 @@ enum mlx5e_rq_flag {
struct mlx5e_rq_frag_info {
int frag_size;
int frag_stride;
+ int frag_source;
};
struct mlx5e_rq_frags_info {
@@ -554,6 +565,7 @@ struct mlx5e_rq_frags_info {
u8 wqe_bulk;
};
+struct netgpu_ctx;
struct mlx5e_rq {
/* data path */
union {
@@ -611,6 +623,7 @@ struct mlx5e_rq {
/* AF_XDP zero-copy */
struct xdp_umem *umem;
+ struct netgpu_ctx *netgpu;
struct work_struct recover_work;
@@ -628,6 +641,7 @@ struct mlx5e_rq {
enum mlx5e_channel_state {
MLX5E_CHANNEL_STATE_XSK,
+ MLX5E_CHANNEL_STATE_NETGPU,
MLX5E_CHANNEL_NUM_STATES
};
@@ -736,9 +750,13 @@ struct mlx5e_xsk {
* but it doesn't distinguish between zero-copy and non-zero-copy UMEMs,
* so rely on our mechanism.
*/
- struct xdp_umem **umems;
+ union {
+ struct xdp_umem **umems;
+ struct netgpu_ctx **ctx_tbl;
+ };
u16 refcnt;
bool ever_used;
+ bool is_netgpu;
};
/* Temporary storage for variables that are allocated when struct mlx5e_priv is
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index bfd3e1161bc6..dfd20c30de02 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -238,6 +238,9 @@ mlx5e_tx_dma_unmap(struct device *pdev, struct mlx5e_sq_dma *dma)
case MLX5E_DMA_MAP_PAGE:
dma_unmap_page(pdev, dma->addr, dma->size, DMA_TO_DEVICE);
break;
+ case MLX5E_DMA_MAP_FIXED:
+ /* DMA mappings are fixed, or managed elsewhere. */
+ break;
default:
WARN_ONCE(true, "mlx5e_tx_dma_unmap unknown DMA type!\n");
}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 03/21] tcp: Pad TCP options out to a fixed size
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
The "header splitting" feature used by netgpu doesn't actually parse
the incoming packet header. Instead, it splits the packet at a fixed
offset. In order for this to work, the sender needs to send packets
with a fixed header size.
(Obviously not for upstream committing, just for prototyping)
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
net/ipv4/tcp_output.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a50e1990a845..afc996ef2d4e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -438,6 +438,7 @@ struct tcp_out_options {
u8 ws; /* window scale, 0 to disable */
u8 num_sack_blocks; /* number of SACK blocks to include */
u8 hash_size; /* bytes in hash_location */
+ u8 pad_size; /* additional nops for padding */
__u8 *hash_location; /* temporary pointer, overloaded */
__u32 tsval, tsecr; /* need to include OPTION_TS */
struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
@@ -562,6 +563,15 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
smc_options_write(ptr, &options);
mptcp_options_write(ptr, opts);
+
+ /* pad out options for netgpu */
+ if (opts->pad_size) {
+ int len = opts->pad_size;
+ u8 *p = (u8 *)ptr;
+
+ while (len--)
+ *p++ = TCPOPT_NOP;
+ }
}
static void smc_set_option(const struct tcp_sock *tp,
@@ -824,6 +834,12 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
}
+ /* force padding for netgpu */
+ if (size < 20) {
+ opts->pad_size = 20 - size;
+ size += opts->pad_size;
+ }
+
return size;
}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 08/21] misc: add shqueue.h for prototyping
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Shared queues between user and kernel use their own private structures
for accessing a shared data area, but they need to use the same queue
functions.
Rather than doing the 'right' thing and duplicating the file for
each domain, temporary cheat for prototyping and use a single shared
file.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/uapi/misc/shqueue.h | 205 ++++++++++++++++++++++++++++++++++++
1 file changed, 205 insertions(+)
create mode 100644 include/uapi/misc/shqueue.h
diff --git a/include/uapi/misc/shqueue.h b/include/uapi/misc/shqueue.h
new file mode 100644
index 000000000000..258b9db35dbd
--- /dev/null
+++ b/include/uapi/misc/shqueue.h
@@ -0,0 +1,205 @@
+#pragma once
+
+/* XXX
+ * This is not a user api, but placed here for prototyping, in order to
+ * avoid two nigh identical copies for user and kernel space.
+ */
+
+/* kernel only */
+struct shared_queue_map {
+ unsigned prod ____cacheline_aligned_in_smp;
+ unsigned cons ____cacheline_aligned_in_smp;
+ char data[] ____cacheline_aligned_in_smp;
+};
+
+/* user and kernel private copy - identical in order to share sq* fcns */
+struct shared_queue {
+ unsigned *prod;
+ unsigned *cons;
+ char *data;
+ unsigned elt_sz;
+ unsigned mask;
+ unsigned cached_prod;
+ unsigned cached_cons;
+ unsigned entries;
+
+ unsigned map_sz;
+ void *map_ptr;
+};
+
+/*
+ * see documenation in tools/include/linux/ring_buffer.h
+ * using explicit smp_/_ONCE is an optimization over smp_{store|load}
+ */
+
+static inline void __sq_load_acquire_cons(struct shared_queue *q)
+{
+ /* Refresh the local tail pointer */
+ q->cached_cons = READ_ONCE(*q->cons);
+ /* A, matches D */
+}
+
+static inline void __sq_store_release_cons(struct shared_queue *q)
+{
+ smp_mb(); /* D, matches A */
+ WRITE_ONCE(*q->cons, q->cached_cons);
+}
+
+static inline void __sq_load_acquire_prod(struct shared_queue *q)
+{
+ /* Refresh the local pointer */
+ q->cached_prod = READ_ONCE(*q->prod);
+ smp_rmb(); /* C, matches B */
+}
+
+static inline void __sq_store_release_prod(struct shared_queue *q)
+{
+ smp_wmb(); /* B, matches C */
+ WRITE_ONCE(*q->prod, q->cached_prod);
+}
+
+static inline void sq_cons_refresh(struct shared_queue *q)
+{
+ __sq_store_release_cons(q);
+ __sq_load_acquire_prod(q);
+}
+
+static inline bool sq_empty(struct shared_queue *q)
+{
+ return READ_ONCE(*q->prod) == READ_ONCE(*q->cons);
+}
+
+static inline bool sq_cons_empty(struct shared_queue *q)
+{
+ return q->cached_prod == q->cached_cons;
+}
+
+static inline unsigned __sq_cons_ready(struct shared_queue *q)
+{
+ return q->cached_prod - q->cached_cons;
+}
+
+static inline unsigned sq_cons_ready(struct shared_queue *q)
+{
+ if (q->cached_prod == q->cached_cons)
+ __sq_load_acquire_prod(q);
+
+ return q->cached_prod - q->cached_cons;
+}
+
+static inline bool sq_cons_avail(struct shared_queue *q, unsigned count)
+{
+ if (count <= __sq_cons_ready(q))
+ return true;
+ __sq_load_acquire_prod(q);
+ return count <= __sq_cons_ready(q);
+}
+
+static inline void *sq_get_ptr(struct shared_queue *q, unsigned idx)
+{
+ return q->data + (idx & q->mask) * q->elt_sz;
+}
+
+static inline void sq_cons_complete(struct shared_queue *q)
+{
+ __sq_store_release_cons(q);
+}
+
+static inline void *sq_cons_peek(struct shared_queue *q)
+{
+ if (sq_cons_empty(q)) {
+ sq_cons_refresh(q);
+ if (sq_cons_empty(q))
+ return NULL;
+ }
+ return sq_get_ptr(q, q->cached_cons);
+}
+
+static inline unsigned
+sq_peek_batch(struct shared_queue *q, void **ptr, unsigned count)
+{
+ unsigned i, idx, ready;
+
+ ready = sq_cons_ready(q);
+ if (!ready)
+ return 0;
+
+ count = count > ready ? ready : count;
+
+ idx = q->cached_cons;
+ for (i = 0; i < count; i++)
+ ptr[i] = sq_get_ptr(q, idx++);
+
+ q->cached_cons += count;
+
+ return count;
+}
+
+static inline unsigned
+sq_cons_batch(struct shared_queue *q, void **ptr, unsigned count)
+{
+ unsigned i, idx, ready;
+
+ ready = sq_cons_ready(q);
+ if (!ready)
+ return 0;
+
+ count = count > ready ? ready : count;
+
+ idx = q->cached_cons;
+ for (i = 0; i < count; i++)
+ ptr[i] = sq_get_ptr(q, idx++);
+
+ q->cached_cons += count;
+ sq_cons_complete(q);
+
+ return count;
+}
+
+static inline void sq_cons_advance(struct shared_queue *q)
+{
+ q->cached_cons++;
+}
+
+static inline unsigned __sq_prod_space(struct shared_queue *q)
+{
+ return q->entries - (q->cached_prod - q->cached_cons);
+}
+
+static inline unsigned sq_prod_space(struct shared_queue *q)
+{
+ unsigned space;
+
+ space = __sq_prod_space(q);
+ if (!space) {
+ __sq_load_acquire_cons(q);
+ space = __sq_prod_space(q);
+ }
+ return space;
+}
+
+static inline bool sq_prod_avail(struct shared_queue *q, unsigned count)
+{
+ if (count <= __sq_prod_space(q))
+ return true;
+ __sq_load_acquire_cons(q);
+ return count <= __sq_prod_space(q);
+}
+
+static inline void *sq_prod_get_ptr(struct shared_queue *q)
+{
+ return sq_get_ptr(q, q->cached_prod++);
+}
+
+static inline void *sq_prod_reserve(struct shared_queue *q)
+{
+ if (!sq_prod_space(q))
+ return NULL;
+
+ return sq_prod_get_ptr(q);
+}
+
+static inline void sq_prod_submit(struct shared_queue *q)
+{
+ __sq_store_release_prod(q);
+}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 16/21] lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
If a sock is marked as sending zc data, have the iterator
retrieve the correct zc pages from the netgpu module.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/uio.h | 4 ++++
lib/iov_iter.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
net/core/datagram.c | 6 +++++-
3 files changed, 54 insertions(+), 1 deletion(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9576fd8158d7..d4c15205a248 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -227,6 +227,10 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
+struct sock;
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct sock *sk,
+ size_t maxsize, struct page **pages, unsigned maxpages,
+ size_t *pgoff);
const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index bf538c2bec77..a50fa3999de3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -10,6 +10,9 @@
#include <linux/scatterlist.h>
#include <linux/instrumented.h>
+#include <net/netgpu.h>
+#include <net/sock.h>
+
#define PIPE_PARANOIA /* for now */
#define iterate_iovec(i, n, __v, __p, skip, STEP) { \
@@ -1349,6 +1352,48 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
}
EXPORT_SYMBOL(iov_iter_get_pages);
+ssize_t iov_iter_sk_get_pages(struct iov_iter *i, struct sock *sk,
+ size_t maxsize, struct page **pages, unsigned maxpages,
+ size_t *pgoff)
+{
+ const struct iovec *iov;
+ unsigned long addr;
+ struct iovec v;
+ size_t len;
+ unsigned n;
+ int ret;
+
+ if (!sk->sk_user_data)
+ return iov_iter_get_pages(i, pages, maxsize, maxpages, pgoff);
+
+ if (maxsize > i->count)
+ maxsize = i->count;
+
+ if (!iter_is_iovec(i))
+ return -EFAULT;
+
+ if (iov_iter_rw(i) != WRITE)
+ return -EFAULT;
+
+ iterate_iovec(i, maxsize, v, iov, i->iov_offset, ({
+ addr = (unsigned long)v.iov_base;
+ *pgoff = addr & (PAGE_SIZE - 1);
+ len = v.iov_len + *pgoff;
+
+ if (len > maxpages * PAGE_SIZE)
+ len = maxpages * PAGE_SIZE;
+
+ n = DIV_ROUND_UP(len, PAGE_SIZE);
+
+ ret = __netgpu_get_pages(sk, pages, addr, n);
+ if (ret > 0)
+ ret = (ret == n ? len : ret * PAGE_SIZE) - *pgoff;
+ return ret;
+ 0;}));
+ return 0;
+}
+EXPORT_SYMBOL(iov_iter_sk_get_pages);
+
static struct page **get_pages_array(size_t n)
{
return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 639745d4f3b9..7dd8814c222a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -530,6 +530,10 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
struct iov_iter *to, int len)
{
trace_skb_copy_datagram_iovec(skb, len);
+ if (skb->zc_netgpu) {
+ pr_err("skb netgpu datagram on !netgpu sk\n");
+ return -EFAULT;
+ }
return __skb_datagram_iter(skb, offset, to, len, false,
simple_copy_to_iter, NULL);
}
@@ -631,7 +635,7 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
if (frag == MAX_SKB_FRAGS)
return -EMSGSIZE;
- copied = iov_iter_get_pages(from, pages, length,
+ copied = iov_iter_sk_get_pages(from, sk, length, pages,
MAX_SKB_FRAGS - frag, &start);
if (copied < 0)
return -EFAULT;
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 11/21] skbuff: add a zc_netgpu bitflag
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
This could likely be moved elsewhere. The presence of the flag on
the skb indicates that one of the fragments may contain zerocopy
data (where the data is not accessible to the cpu).
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/skbuff.h | 3 ++-
net/core/skbuff.c | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0c0377fc00c2..ba41d1a383f8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -782,7 +782,8 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
- pfmemalloc:1;
+ pfmemalloc:1,
+ zc_netgpu:1;
#ifdef CONFIG_SKB_EXTENSIONS
__u8 active_extensions;
#endif
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b8afefe6f6b6..2a391042be53 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -992,6 +992,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
n->cloned = 1;
n->nohdr = 0;
n->peeked = 0;
+ C(zc_netgpu);
C(pfmemalloc);
n->destructor = NULL;
C(tail);
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 05/21] mlx5/xsk: check that xsk does not conflict with netgpu
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
netgpu will use the same data structures as xsk, so make sure that
they are not conflicting.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c | 3 +++
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h | 3 +++
2 files changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
index 7b17fcd0a56d..f3d3569816cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
@@ -27,7 +27,10 @@ static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk)
sizeof(*xsk->umems), GFP_KERNEL);
if (unlikely(!xsk->umems))
return -ENOMEM;
+ xsk->is_netgpu = false;
}
+ if (xsk->is_netgpu)
+ return -EINVAL;
xsk->refcnt++;
xsk->ever_used = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
index 25b4cbe58b54..c7eff534d28a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
@@ -15,6 +15,9 @@ static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params,
if (unlikely(ix >= params->num_channels))
return NULL;
+ if (unlikely(xsk->is_netgpu))
+ return NULL;
+
return xsk->umems[ix];
}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 00/21] netgpu: networking between NIC and GPU/CPU.
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
This series is a working RFC proof-of-concept that implements DMA
zero-copy between the NIC and a GPU device for the data path, while
keeping the protocol processing on the host CPU.
This also works for zero-copy send/recv to host (CPU) memory.
Current limitations:
- mlx5 only, header splitting is at a fixed offset.
- currently only TCP protocol delivery is performed.
- not optimized (hey, it works!)
- TX completion notification is planned, but not in this patchset.
- one socket per device
- not compatible with xsk (re-uses same datastructures)
- not compatible with bpf payload inspection
- x86 !iommu only; liberties are taken with PA addresses.
The next section provides a brief overview of how things work, for this
phase 0 proof of concept.
A transport context is created on a device, which sets up the datapath,
and the device queues. Only specialized RX queues are needed, the
standard TX queues are used for packet transmission.
Memory areas which participate in zero-copy transmission are registered
with the context. These areas can be used as either RX packet buffers
or TX data areas (or both). The memory can come from either malloc/mmap
or cudaMalloc(). The latter call provides a handle to the userspace
application, but the memory region is only accessible to the GPU.
A socket is created and registered with the context, which sets
SOCK_ZEROCOPY, and is bound to the device with SO_BINDTODEVICE.
Asymmetrical data paths are possible (zc TX, normal RX), and vice versa,
but the curreent PoC sets things up for symmetrical transport. The
application needs to provide the RX buffers to the receive queue,
similar to AF_XDP.
Once things are set up, data is sent to the network with sendmsg(). The
iovecs provided contain an address in the region previously registered.
The normal protocol stack processing constructs the packet, but the data
is not touched by the stack. In this phase, the application is not
notified when the protocol processing is complete and the data area is
safe to modify again.
For RX, packets undergo the usual protocol processing and are delivered
up to the socket receive queue. At this point, the skb data fragments
are delivered to the application as iovecs through an AF_XDP style
queue. The application can poll for readability, but does not use
read() to receive the data.
The initial application used is iperf3, a modified version with the
userspace library is available at:
https://github.com/jlemon/iperf
https://github.com/jlemon/netgpu
Running "iperf3 -s -z --dport 8888" (host memory) on a 12Gbps link:
11.3 Gbit/sec receive
10.8 Gbit/sec tramsmit
Running "iperf3 -s -z --dport 8888 --gpu" on a 25Gbps link:
22.5 Gbit/sec receive
12.6 Gbit/sec transmit (!!!)
For the GPU runs, the Intel PCI monitoring tools were used to confirm
that the host PCI bus was mostly idle. The TX performance needs further
investigation.
Comments welcome. The next phase of the work will clean up the
interface, adding completion notifications, and a flexible queue
creation mechanism.
--
Jonathan
Jonathan Lemon (21):
mm: add {add|release}_memory_pages
mm: Allow DMA mapping of pages which are not online
tcp: Pad TCP options out to a fixed size
mlx5: add definitions for header split and netgpu
mlx5/xsk: check that xsk does not conflict with netgpu
mlx5: add header_split flag
mlx5: remove the umem parameter from mlx5e_open_channel
misc: add shqueue.h for prototyping
include: add definitions for netgpu
mlx5: add netgpu queue functions
skbuff: add a zc_netgpu bitflag
mlx5: hook up the netgpu channel functions
netdevice: add SETUP_NETGPU to the netdev_bpf structure
kernel: export free_uid
netgpu: add network/gpu dma module
lib: have __zerocopy_sg_from_iter get netgpu pages for a sk
net/core: add the SO_REGISTER_DMA socket option
tcp: add MSG_NETDMA flag for sendmsg()
core: add page recycling logic for netgpu pages
core/skbuff: use skb_zdata for testing whether skb is zerocopy
mlx5: add XDP_SETUP_NETGPU hook
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/netgpu/Kconfig | 10 +
drivers/misc/netgpu/Makefile | 11 +
drivers/misc/netgpu/nvidia.c | 1516 +++++++++++++++++
.../net/ethernet/mellanox/mlx5/core/Makefile | 3 +-
drivers/net/ethernet/mellanox/mlx5/core/en.h | 22 +-
.../mellanox/mlx5/core/en/netgpu/setup.c | 475 ++++++
.../mellanox/mlx5/core/en/netgpu/setup.h | 42 +
.../net/ethernet/mellanox/mlx5/core/en/txrx.h | 3 +
.../ethernet/mellanox/mlx5/core/en/xsk/umem.c | 3 +
.../ethernet/mellanox/mlx5/core/en/xsk/umem.h | 3 +
.../ethernet/mellanox/mlx5/core/en_ethtool.c | 15 +
.../net/ethernet/mellanox/mlx5/core/en_main.c | 118 +-
.../net/ethernet/mellanox/mlx5/core/en_rx.c | 52 +-
.../net/ethernet/mellanox/mlx5/core/en_txrx.c | 15 +-
include/linux/dma-mapping.h | 4 +-
include/linux/memory_hotplug.h | 4 +
include/linux/mmzone.h | 7 +
include/linux/netdevice.h | 6 +
include/linux/skbuff.h | 27 +-
include/linux/socket.h | 1 +
include/linux/uio.h | 4 +
include/net/netgpu.h | 65 +
include/uapi/asm-generic/socket.h | 2 +
include/uapi/misc/netgpu.h | 43 +
include/uapi/misc/shqueue.h | 205 +++
kernel/user.c | 1 +
lib/iov_iter.c | 45 +
mm/memory_hotplug.c | 65 +-
net/core/datagram.c | 6 +-
net/core/skbuff.c | 44 +-
net/core/sock.c | 26 +
net/ipv4/tcp.c | 8 +
net/ipv4/tcp_output.c | 16 +
35 files changed, 2828 insertions(+), 41 deletions(-)
create mode 100644 drivers/misc/netgpu/Kconfig
create mode 100644 drivers/misc/netgpu/Makefile
create mode 100644 drivers/misc/netgpu/nvidia.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.c
create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/netgpu/setup.h
create mode 100644 include/net/netgpu.h
create mode 100644 include/uapi/misc/netgpu.h
create mode 100644 include/uapi/misc/shqueue.h
--
2.24.1
^ permalink raw reply
* [RFC PATCH 06/21] mlx5: add header_split flag
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
Adds a "rx_hd_split" private flag parameter to ethtool.
This enables header splitting, and sets up the fragment mappings.
The feature is currently only enabled for netgpu channels.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
.../ethernet/mellanox/mlx5/core/en_ethtool.c | 15 +++++++
.../net/ethernet/mellanox/mlx5/core/en_main.c | 45 +++++++++++++++----
2 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index ec5658bbe3c5..a1b5d8b33b0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1905,6 +1905,20 @@ static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
return err;
}
+static int set_pflag_rx_hd_split(struct net_device *netdev, bool enable)
+{
+ struct mlx5e_priv *priv = netdev_priv(netdev);
+ int err;
+
+ priv->channels.params.hd_split = enable;
+ err = mlx5e_safe_reopen_channels(priv);
+ if (err)
+ netdev_err(priv->netdev,
+ "%s failed to reopen channels, err(%d).\n",
+ __func__, err);
+ return err;
+}
+
static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
{ "rx_cqe_moder", set_pflag_rx_cqe_based_moder },
{ "tx_cqe_moder", set_pflag_tx_cqe_based_moder },
@@ -1912,6 +1926,7 @@ static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
{ "rx_striding_rq", set_pflag_rx_striding_rq },
{ "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
{ "xdp_tx_mpwqe", set_pflag_xdp_tx_mpwqe },
+ { "rx_hd_split", set_pflag_rx_hd_split },
};
static int mlx5e_handle_pflag(struct net_device *netdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a836a02a2116..cc8d30aa8a33 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -123,7 +123,8 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
{
- params->rq_wq_type = mlx5e_striding_rq_possible(mdev, params) &&
+ params->rq_wq_type = MLX5E_HD_SPLIT(params) ? MLX5_WQ_TYPE_CYCLIC :
+ mlx5e_striding_rq_possible(mdev, params) &&
MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ) ?
MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
MLX5_WQ_TYPE_CYCLIC;
@@ -323,6 +324,8 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
if (prev)
prev->last_in_page = true;
}
+ next_frag.di->netgpu_source =
+ !!frag_info[f].frag_source;
*frag = next_frag;
/* prepare next */
@@ -373,6 +376,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
struct mlx5_core_dev *mdev = c->mdev;
void *rqc = rqp->rqc;
void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
+ bool hd_split = MLX5E_HD_SPLIT(params) && (umem == (void *)0x1);
+ u32 num_xsk_frames = 0;
u32 rq_xdp_ix;
u32 pool_size;
int wq_sz;
@@ -391,9 +396,10 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->mdev = mdev;
rq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
rq->xdpsq = &c->rq_xdpsq;
- rq->umem = umem;
+ if (xsk)
+ rq->umem = umem;
- if (rq->umem)
+ if (umem)
rq->stats = &c->priv->channel_stats[c->ix].xskrq;
else
rq->stats = &c->priv->channel_stats[c->ix].rq;
@@ -404,14 +410,18 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->xdp_prog = params->xdp_prog;
rq_xdp_ix = rq->ix;
- if (xsk)
+ if (umem)
rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK;
err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix);
if (err < 0)
goto err_rq_wq_destroy;
+ if (umem == (void *)0x1)
+ rq->buff.headroom = 0;
+ else
+ rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
+
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
- rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
pool_size = 1 << params->log_rq_mtu_frames;
switch (rq->wq_type) {
@@ -509,6 +519,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
rq->wqe.skb_from_cqe = xsk ?
mlx5e_xsk_skb_from_cqe_linear :
+ hd_split ? mlx5e_skb_from_cqe_nonlinear :
mlx5e_rx_is_linear_skb(params, NULL) ?
mlx5e_skb_from_cqe_linear :
mlx5e_skb_from_cqe_nonlinear;
@@ -2035,13 +2046,19 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
int frag_size_max = DEFAULT_FRAG_SIZE;
u32 buf_size = 0;
int i;
+ bool hd_split = MLX5E_HD_SPLIT(params) && xsk;
+
+ if (hd_split)
+ frag_size_max = HD_SPLIT_DEFAULT_FRAG_SIZE;
+ else
+ frag_size_max = DEFAULT_FRAG_SIZE;
#ifdef CONFIG_MLX5_EN_IPSEC
if (MLX5_IPSEC_DEV(mdev))
byte_count += MLX5E_METADATA_ETHER_LEN;
#endif
- if (mlx5e_rx_is_linear_skb(params, xsk)) {
+ if (!hd_split && mlx5e_rx_is_linear_skb(params, xsk)) {
int frag_stride;
frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
@@ -2059,6 +2076,16 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
frag_size_max = PAGE_SIZE;
i = 0;
+
+ if (hd_split) {
+ // Start with one fragment for all headers (implementing HDS)
+ info->arr[0].frag_size = TOTAL_HEADERS;
+ info->arr[0].frag_stride = roundup_pow_of_two(PAGE_SIZE);
+ buf_size += TOTAL_HEADERS;
+ // Now, continue with the payload frags.
+ i = 1;
+ }
+
while (buf_size < byte_count) {
int frag_size = byte_count - buf_size;
@@ -2066,8 +2093,10 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
frag_size = min(frag_size, frag_size_max);
info->arr[i].frag_size = frag_size;
- info->arr[i].frag_stride = roundup_pow_of_two(frag_size);
-
+ info->arr[i].frag_stride = roundup_pow_of_two(hd_split ?
+ PAGE_SIZE :
+ frag_size);
+ info->arr[i].frag_source = hd_split;
buf_size += frag_size;
i++;
}
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 18/21] tcp: add MSG_NETDMA flag for sendmsg()
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
This flag indicates that the attached data is a zero-copy send,
and the pages should be retrieved from the netgpu module. The
socket must have been marked as SOCK_ZEROCOPY, and also registered
with netgpu via SO_REGISTER_DMA.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/socket.h | 1 +
net/ipv4/tcp.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 04d2bc97f497..63816cc25dee 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -310,6 +310,7 @@ struct ucred {
*/
#define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */
+#define MSG_NETDMA 0x8000000
#define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000 /* Set close_on_exec for file
descriptor received through
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 810cc164f795..219670152f77 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1209,6 +1209,14 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
uarg->zerocopy = 0;
}
+ if (flags & MSG_NETDMA && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+ zc = sk->sk_route_caps & NETIF_F_SG;
+ if (!zc) {
+ err = -EFAULT;
+ goto out_err;
+ }
+ }
+
if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
!tp->repair) {
err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size, uarg);
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 14/21] kernel: export free_uid
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
get_uid is a static inline which can be called from a module, so
free_uid should also be callable.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
kernel/user.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/user.c b/kernel/user.c
index b1635d94a1f2..1e015abf0a2b 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -171,6 +171,7 @@ void free_uid(struct user_struct *up)
if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
free_user(up, flags);
}
+EXPORT_SYMBOL_GPL(free_uid);
struct user_struct *alloc_uid(kuid_t uid)
{
--
2.24.1
^ permalink raw reply related
* [RFC PATCH 20/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>
skb_zcopy() flag indicates whether the skb has a zerocopy ubuf.
netgpu does not use ubufs, so skb_zdata() indicates whether the
skb is carrying zero copy data, and should be left alone, while
skb_zcopy() indicates whhether there is an attached ubuf.
Also, when a write() on a zero-copy socket returns EWOULDBLOCK,
this is not synchronized with select(), which will only look at
the send buffer, and return writability if there is tcp space.
This appears to be caused by some ubuf logic, leading to iperf
spending 70% of its time in select() for ZC transmits. With this
change, the time spent drops to 20%.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
include/linux/skbuff.h | 24 +++++++++++++++++++++++-
net/core/skbuff.c | 16 ++++++++++++----
2 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ba41d1a383f8..3c2efd45655b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -443,8 +443,12 @@ enum {
/* generate software time stamp when entering packet scheduling */
SKBTX_SCHED_TSTAMP = 1 << 6,
+
+ /* fragments are accessed only via DMA */
+ SKBTX_DEV_NETDMA = 1 << 7,
};
+#define SKBTX_ZERODATA_FRAG (SKBTX_DEV_ZEROCOPY | SKBTX_DEV_NETDMA)
#define SKBTX_ZEROCOPY_FRAG (SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
#define SKBTX_ANY_SW_TSTAMP (SKBTX_SW_TSTAMP | \
SKBTX_SCHED_TSTAMP)
@@ -1416,6 +1420,24 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
return &skb_shinfo(skb)->hwtstamps;
}
+static inline bool skb_netdma(struct sk_buff *skb)
+{
+ return skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_NETDMA;
+}
+
+static inline bool skb_zdata(struct sk_buff *skb)
+{
+ return skb && skb_shinfo(skb)->tx_flags & SKBTX_ZERODATA_FRAG;
+}
+
+static inline void skb_netdma_set(struct sk_buff *skb, bool netdma)
+{
+ if (skb && netdma) {
+ skb_shinfo(skb)->tx_flags |= SKBTX_DEV_NETDMA;
+ skb_shinfo(skb)->destructor_arg = NULL;
+ }
+}
+
static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
{
bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
@@ -3260,7 +3282,7 @@ static inline int skb_add_data(struct sk_buff *skb,
static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
const struct page *page, int off)
{
- if (skb_zcopy(skb))
+ if (skb_zdata(skb))
return false;
if (i) {
const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2b4176cab578..67a421257a27 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1323,6 +1323,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
}
skb_zcopy_set(skb, uarg, NULL);
+ skb_netdma_set(skb, sk->sk_user_data);
+
return skb->len - orig_len;
}
EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1330,8 +1332,8 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
gfp_t gfp_mask)
{
- if (skb_zcopy(orig)) {
- if (skb_zcopy(nskb)) {
+ if (skb_zdata(orig)) {
+ if (skb_zdata(nskb)) {
/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
if (!gfp_mask) {
WARN_ON_ONCE(1);
@@ -1343,6 +1345,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
return -EIO;
}
skb_zcopy_set(nskb, skb_uarg(orig), NULL);
+ skb_netdma_set(nskb, skb_netdma(orig));
}
return 0;
}
@@ -1372,6 +1375,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
return -EINVAL;
+ if (!skb_has_shared_frag(skb))
+ return -EINVAL;
+
if (!num_frags)
goto release;
@@ -2078,6 +2084,8 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
*/
int i, k, eat = (skb->tail + delta) - skb->end;
+ BUG_ON(skb_netdma(skb));
+
if (eat > 0 || skb_cloned(skb)) {
if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
GFP_ATOMIC))
@@ -3328,7 +3336,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
if (skb_headlen(skb))
return 0;
- if (skb_zcopy(tgt) || skb_zcopy(skb))
+ if (skb_zdata(tgt) || skb_zdata(skb))
return 0;
todo = shiftlen;
@@ -5171,7 +5179,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
from_shinfo = skb_shinfo(from);
if (to_shinfo->frag_list || from_shinfo->frag_list)
return false;
- if (skb_zcopy(to) || skb_zcopy(from))
+ if (skb_zdata(to) || skb_zdata(from))
return false;
if (skb_headlen(from) != 0) {
--
2.24.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox