Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] net: qrtr: add MODULE_ALIAS_NETPROTO macro
From: David Miller @ 2018-04-17 13:58 UTC (permalink / raw)
  To: nicolas.dechesne; +Cc: bjorn.andersson, netdev, linux-kernel
In-Reply-To: <20180417120326.11022-1-nicolas.dechesne@linaro.org>

From: Nicolas Dechesne <nicolas.dechesne@linaro.org>
Date: Tue, 17 Apr 2018 14:03:26 +0200

> To ensure that qrtr can be loaded automatically, when needed, if it is compiled
> as module.
> 
> Signed-off-by: Nicolas Dechesne <nicolas.dechesne@linaro.org>

Applied.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2018-04-17 13:58 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Samudrala, Sridhar, David Miller, Anjali Singhai Jain,
	Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180417133028.GI33938@C02RW35GFVH8.dhcp.broadcom.net>

On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek
<andrew.gospodarek@broadcom.com> wrote:
> On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
>>
>> On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
>> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
>> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
>> > > <sridhar.samudrala@intel.com> wrote:
>> > >
>> > > > I meant between PFs on 2 compute nodes.
>> > > If the PF serves as uplink rep, it functions as  a switch port -- applications
>> > > don't run on switch ports. One way to get apps to run on the host in switchdev
>> > > mode is probe one of the VFs there.
>> > >
>> > >
>> > >
>> So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
>> seen on the host, no more PF netdev.
>
> That is not the functionality I would propose.  The PF netdev will still be there.

Andy,

Basically LGTM, so even in smartnic configs, the PF @ the host is
still privileged to
create/destroy VFs or provision MACs for them even if it is not the
e-switch manager
anymore?

Actually AFAIK this  can also work somehow otherwise, e.g a smartnic FW
"pushes" the VFs into the host w.o them being under a host admin directive.

^ permalink raw reply

* Re: [PATCH net-next] liquidio: Enhanced ethtool stats
From: David Miller @ 2018-04-17 13:57 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	intiyaz.basha
In-Reply-To: <20180417063052.GA3331@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Mon, 16 Apr 2018 23:30:53 -0700

> From: Intiyaz Basha <intiyaz.basha@cavium.com>
> 
> 1. Added red_drops stats. Inbound packets dropped by RED, buffer exhaustion
> 2. Included fcs_err, jabber_err, l2_err and frame_err errors under
>    rx_errors
> 3. Included fifo_err, dmac_drop, red_drops, fw_err_pko, fw_err_link and
>    fw_err_drop under rx_dropped
> 4. Included max_collision_fail, max_deferral_fail, total_collisions,
>    fw_err_pko, fw_err_link, fw_err_drop and fw_err_pki under tx_dropped
> 5. Counting dma mapping errors
> 6. Added some firmware stats description and removed for some
> 
> Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
> Acked-by: Derek Chickles <derek.chickles@cavium.com>
> Acked-by: Satanand Burla <satananda.burla@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied, thank you.

^ permalink raw reply

* Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: Alexei Starovoitov @ 2018-04-17 13:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Network Development, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

On Tue, Apr 17, 2018 at 5:58 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Resubmit V10 against net-next, as it contains NIC driver changes.
>
> This patchset works towards supporting different XDP RX-ring memory
> allocators.  As this will be needed by the AF_XDP zero-copy mode.
>
> The patchset uses mlx5 as the sample driver, which gets implemented
> XDP_REDIRECT RX-mode, but not ndo_xdp_xmit (as this API is subject to
> change thought the patchset).
>
> A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
> And both ndo_xdp_xmit and the new xdp_return_frame end-up using this.
>
> Support for a driver supplied allocator is implemented, and a
> refurbished version of page_pool is the first return allocator type
> introduced.  This will be a integration point for AF_XDP zero-copy.
>
> The mlx5 driver evolve into using the page_pool, and see a performance
> increase (with ndo_xdp_xmit out ixgbe driver) from 6Mpps to 12Mpps.
>
>
> The patchset stop at 16 patches (one over limit), but more API changes
> are planned.  Specifically extending ndo_xdp_xmit and xdp_return_frame
> APIs to support bulking.  As this will address some known limits.
>
> V2: Updated according to Tariq's feedback
> V3: Updated based on feedback from Jason Wang and Alex Duyck
> V4: Updated based on feedback from Tariq and Jason
> V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
> V6: Updated based on feedback from Eric Dumazet and Alex Duyck
> V7: Adapt to i40e that got XDP_REDIRECT support in-between
> V8:
>  Updated based on feedback kbuild test robot, and adjust for mlx5 changes
>  page_pool only compiled into kernel when drivers Kconfig 'select' feature
> V9:
>  Remove some inline statements, let compiler decide what to inline
>  Fix return value in virtio_net driver
>  Adjust for mlx5 changes in-between submissions
> V10:
>  Minor adjust for mlx5 requested by Tariq
>  Resubmit against net-next

looks like you forgot to include extra patch to fixup xdp_adjust_head()
helper. Otherwise reused xdp_frame in the top of that packet is leaking
kernel pointers into bpf program.
Could you please respin with that change included?

^ permalink raw reply

* Re: [PATCH 10/10] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
From: Geert Uytterhoeven @ 2018-04-17 13:53 UTC (permalink / raw)
  To: Michael Schmitz; +Cc: netdev, Linux/m68k, Michael.Karcher, Michael Karcher
In-Reply-To: <1523916285-6057-11-git-send-email-schmitzmic@gmail.com>

Hi Michael,

Thanks for your patch!

On Tue, Apr 17, 2018 at 12:04 AM, Michael Schmitz <schmitzmic@gmail.com> wrote:
> Add platform device driver to populate the ax88796 platform data from
> information provided by the XSurf100 zorro device driver.
> This driver will have to be loaded before loading the ax88796 module,
> or compiled as built-in.

Is that really true? The platform device should be probed when both the
device and driver have been registered, but order shouldn't matter.

> Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>

Missing "From: Michael Karcher ..."?

> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>

> --- a/drivers/net/ethernet/8390/Kconfig
> +++ b/drivers/net/ethernet/8390/Kconfig
> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>
>  config AX88796
>         tristate "ASIX AX88796 NE2000 clone support"
> -       depends on (ARM || MIPS || SUPERH)
> +       depends on (ARM || MIPS || SUPERH || AMIGA)

s/AMIGA/ZORRO/, for consistency with the below.

>         select CRC32
>         select PHYLIB
>         select MDIO_BITBANG
> @@ -45,6 +45,18 @@ config AX88796_93CX6
>         ---help---
>           Select this if your platform comes with an external 93CX6 eeprom.
>
> +config XSURF100
> +       tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
> +       depends on ZORRO
> +       depends on AX88796

It's a bit unfortunate the user has to enable _two_ config options to enable
this driver.

I see two solutions for that:

1) Hide the XSURF100 symbol, so it gets enabled automatically if AX88796 is
   enabled on a Zorro bus system:

    config XSURF100
            tristate
            depends on ZORRO
            default AX88796

2) Hide the AX88796 symbol, and let it be selected by XSURF100:

    config AX88796
            tristate "ASIX AX88796 NE2000 clone support" if !ZORRO
            depends on ARM || MIPS || SUPERH || ZORRO
            ...

    config XSURF100
            tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
            depends on ZORRO
            select AX88796

> --- /dev/null
> +++ b/drivers/net/ethernet/8390/xsurf100.c
> @@ -0,0 +1,411 @@
> +#include <linux/module.h>
> +#include <linux/netdevice.h>
> +#include <linux/platform_device.h>
> +#include <linux/zorro.h>
> +#include <net/ax88796.h>
> +#include <asm/amigaints.h>
> +
> +#define ZORRO_PROD_INDIVIDUAL_COMPUTERS_X_SURF100 \
> +               ZORRO_ID(INDIVIDUAL_COMPUTERS, 0x64, 0)

Another long define to get rid of? ;-)

> +/* Hard reset the card. This used to pause for the same period that a
> + * 8390 reset command required, but that shouldn't be necessary.
> + */
> +static void ax_reset_8390(struct net_device *dev)
> +{
> +       struct ei_device *ei_local = netdev_priv(dev);
> +       unsigned long reset_start_time = jiffies;
> +       void __iomem *addr = (void __iomem *)dev->base_addr;
> +
> +       netif_dbg(ei_local, hw, dev, "resetting the 8390 t=%ld...\n", jiffies);
> +
> +       ei_outb(ei_inb(addr + NE_RESET), addr + NE_RESET);
> +
> +       ei_local->txing = 0;
> +       ei_local->dmaing = 0;
> +
> +       /* This check _should_not_ be necessary, omit eventually. */
> +       while ((ei_inb(addr + EN0_ISR) & ENISR_RESET) == 0) {
> +               if (time_after(jiffies, reset_start_time + 2 * HZ / 100)) {
> +                       netdev_warn(dev, "%s: did not complete.\n", __func__);
> +                       break;
> +               }

cpu_relax()?

How long does this usually take? If > 1 ms, you can use e.g. msleep(1)
instead of cpu_relax().


> +       }
> +
> +       ei_outb(ENISR_RESET, addr + EN0_ISR);   /* Ack intr. */
> +}

> +       if (ei_local->dmaing) {
> +               netdev_err(dev,
> +                          "DMAing conflict in %s "
> +                          "[DMAstat:%d][irqlock:%d].\n",

Please don't split error messages, as that makes it more difficult to
grep for them.

> +                          __func__,
> +                          ei_local->dmaing, ei_local->irqlock);
> +               return;

> +static int xsurf100_probe(struct zorro_dev *zdev,
> +                         const struct zorro_device_id *ent)
> +{

> +       /* error handling for ioremap regs */
> +       if (!ax88796_data.base_regs) {
> +               dev_err(&zdev->dev, "Cannot ioremap area %p (registers)\n",
> +                       (void *)zdev->resource.start);

Please use %pR to format struct resource.
Documentation/core-api/printk-formats.rst

> +       /* error handling for ioremap data */
> +       if (!ax88796_data.data_area) {
> +               dev_err(&zdev->dev, "Cannot ioremap area %p (32-bit access)\n",
> +                       (void *)zdev->resource.start + XS100_8390_DATA32_BASE);

%pR

> +static void xsurf100_remove(struct zorro_dev *zdev)
> +{
> +       struct platform_device *pdev;
> +       struct xsurf100_ax_plat_data *xs100;
> +
> +       pdev = zorro_get_drvdata(zdev);
> +       xs100 = dev_get_platdata(&pdev->dev);

struct platform_device *pdev = pdev = zorro_get_drvdata(zdev);
struct xsurf100_ax_plat_data *xs100 = dev_get_platdata(&pdev->dev);

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 0/3] Receive Side Coalescing for macb driver
From: David Miller @ 2018-04-17 13:53 UTC (permalink / raw)
  To: rafalo; +Cc: nicolas.ferre, netdev, linux-kernel, piotrs, ltyrala
In-Reply-To: <BN3PR0701MB1122B2FF2A1C54CD363B16A7C9B70@BN3PR0701MB1122.namprd07.prod.outlook.com>

From: Rafal Ozieblo <rafalo@cadence.com>
Date: Tue, 17 Apr 2018 08:59:35 +0000

> If IP supports RSC and skb has 2B reserved for alignment we end up
> with none packets receive correctly (2B missing in the each skb).
> We can either leave few customers without support in Linux driver or
> let them use the driver with decrease performance.

You probably want to find a way to memmove() the headers to be aligned
in this case, and then point the data afterwards into SKB page frags.

^ permalink raw reply

* Re: One question about __tcp_select_window()
From: Wang Jian @ 2018-04-17 13:53 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAP4sYWWtXeux8EH0dSkxN-aZ5b7MvLWTgkhejeHz+d+2=qFQ2A@mail.gmail.com>

I test the fix with 4.17.0-rc1+ and it seems work.

1. iperf -c IP -i 20 -t 60 -w 1K
 with-fix vs without-fix : 1.15Gbits/sec vs 1.05Gbits/sec
I also try other windows and have similar results.

2. Use tcp probe trace snd_wind.
with-fix vs without-fix: 1245568 vs 1042816

3. I don't see extra retransmit/drops.

On Sun, Apr 15, 2018 at 8:50 PM, Wang Jian <jianjian.wang1@gmail.com> wrote:
> Hi all,
>
> While I read __tcp_select_window() code, I find that it maybe return a
> smaller window.
> Below is one scenario I thought, may be not right:
> In function __tcp_select_window(), assume:
> full_space is 6mss, free_space is 2mss, tp->rcv_wnd is 3MSS.
> And assume disable window scaling, then
> window = tp->rcv_wnd > free_space && window > free_space
> then it will round down free_space and return it.
>
> Is this expected behavior? The comment is also saying
> "Get the largest window that is a nice multiple of mss."
>
> Should we do something like below ? Or I miss something?
> I don't know how to verify it now.
>
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2680,9 +2680,9 @@ u32 __tcp_select_window(struct sock *sk)
>                  * We also don't do any window rounding when the free space
>                  * is too small.
>                  */
> -               if (window <= free_space - mss || window > free_space)
> +               if (window <= free_space - mss)
>                         window = rounddown(free_space, mss);
> -               else if (mss == full_space &&
> +               else if (window <= free_space && mss == full_space &&
>                          free_space > window + (full_space >> 1))
>                         window = free_space;
>         }
>
> Thanks.

^ permalink raw reply

* Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
From: David Miller @ 2018-04-17 13:51 UTC (permalink / raw)
  To: geert
  Cc: schmitzmic, glaubitz, netdev, andrew, linux-m68k, Michael.Karcher,
	kernel
In-Reply-To: <CAMuHMdUZa31pbmyieoqyh-RnsRtFKPOVbrzC9DveV8h0sZh_qw@mail.gmail.com>

From: Geert Uytterhoeven <geert@linux-m68k.org>
Date: Tue, 17 Apr 2018 10:20:25 +0200

> BTW, I have a git alias for that:
> 
> $ git help fixes
> `git fixes' is aliased to `show --format='Fixes: %h ("%s")' -s'
> $ git fixes 82533ad9a1c
> Fixes: 82533ad9a1c ("net: ethernet: ax88796: don't call free_irq
> without request_irq first")

Thanks for sharing :)

^ permalink raw reply

* Re: [PATCH] VSOCK: make af_vsock.ko removable again
From: David Miller @ 2018-04-17 13:45 UTC (permalink / raw)
  To: stefanha; +Cc: netdev, xiyou.wangcong, jhansen
In-Reply-To: <20180417062558.18018-1-stefanha@redhat.com>

From: Stefan Hajnoczi <stefanha@redhat.com>
Date: Tue, 17 Apr 2018 14:25:58 +0800

> Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> vsock_init_tables()") introduced a module_init() function without a
> corresponding module_exit() function.
> 
> Modules with an init function can only be removed if they also have an
> exit function.  Therefore the vsock module was considered "permanent"
> and could not be removed.
> 
> This patch adds an empty module_exit() function so that "rmmod vsock"
> works.  No explicit cleanup is required because:
> 
> 1. Transports call vsock_core_exit() upon exit and cannot be removed
>    while sockets are still alive.
> 2. vsock_diag.ko does not perform any action that requires cleanup by
>    vsock.ko.
> 
> Reported-by: Xiumei Mu <xmu@redhat.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Jorgen Hansen <jhansen@vmware.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Applied, but please provide a proper Fixes: tag next time.  I added it
for you this time.

^ permalink raw reply

* Re: [PATCH] net: change the comment of dev_mc_init
From: David Miller @ 2018-04-17 13:37 UTC (permalink / raw)
  To: sunlw.fnst; +Cc: netdev
In-Reply-To: <e893355c-0a4c-8db5-c812-68e8edd2ded4@cn.fujitsu.com>

From: sunlianwen <sunlw.fnst@cn.fujitsu.com>
Date: Tue, 17 Apr 2018 15:27:01 +0800

> the comment of dev_mc_init() is wrong. which use dev_mc_flush
> instead of dev_mc_init.
> 
> 
> Signed-off-by:Lianwen Sun <sunlw.fnst@cn.fujitsu.com>

Please don't put so many empty lines, one is enough to separate your
commit log message from the signoff.

> diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> index e3e6a3e2ca22..d884d8f5f0e5 100644
> --- a/net/core/dev_addr_lists.c
> +++ b/net/core/dev_addr_lists.c
> @@ -839,7 +839,7 @@ void dev_mc_flush(struct net_device *dev)
>  EXPORT_SYMBOL(dev_mc_flush);
> 
>  /**
> - *     dev_mc_flush - Init multicast address list
> + *     dev_mc_init - Init multicast address list
>   *     @dev: device
>   *
>   *     Init multicast address list.
> -- 
> 2.17.0

This patch was corrupted by your email client, it transformed TAB
characters into sequences of SPACEs.

^ permalink raw reply

* Re: tcp hang when socket fills up ?
From: Florian Westphal @ 2018-04-17 13:29 UTC (permalink / raw)
  To: Michal Kubecek, netdev, Florian Westphal, Marcelo Ricardo Leitner,
	Eric Dumazet
  Cc: Jozsef Kadlecsik
In-Reply-To: <20180417123437.GA19885@nautica>

Dominique Martinet <asmadeus@codewreck.org> wrote:

[ CC Jozsef ]

> Could it have something to do with the way I setup the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK
> 
> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688882 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear only on side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

Yes, this looks correct. Jozsef, can you please have a look?

Problem seems to be that conntrack believes that ACK packet
re-initializes the connection:

 595                 /*
 596                  * RFC 793: "if a TCP is reinitialized ... then it need
 597                  * not wait at all; it must only be sure to use sequence
 598                  * numbers larger than those recently used."
 599                  */
 600                 sender->td_end =
 601                 sender->td_maxend = end;
 602                 sender->td_maxwin = (win == 0 ? 1 : win);
 603 
 604                 tcp_options(skb, dataoff, tcph, sender);

and last line clears the scale value (no wscale option in data packet).


Transitions are:
 server > client SYN          sNO -> sSS
 client > server SYN          sSS -> sS2
 server > client SYNACK       sS2 -> sSR /* here */
 client > server ACK          sSR -> sES

SYN/ACK was observed in original direction so we hit
state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
when we see the ack packet and end up in the 'TCP is reinitialized' branch.

AFAICS, without this, connection would move to sES just fine,
as the data ack is in window.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-17 13:30 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Andy Gospodarek, Or Gerlitz, David Miller, Anjali Singhai Jain,
	Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <f6a731b1-d8a5-d9f4-f758-4f36070fd02c@intel.com>

On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
> 
> On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
> > > <sridhar.samudrala@intel.com> wrote:
> > > 
> > > > I meant between PFs on 2 compute nodes.
> > > If the PF serves as uplink rep, it functions as  a switch port -- applications
> > > don't run on switch ports. One way to get apps to run on the host in switchdev
> > > mode is probe one of the VFs there.
> > > 
> > > 
> > > 
> So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
> seen on the host, no more PF netdev.

That is not the functionality I would propose.  The PF netdev will still be there.

> Are you going to expose another way to change sriov_num_vfs when the device is in
> 'switchdev' mode OR do we need to switch to 'legacy' mode to increase/decrease the number of
> VFs?

Since the PF netdev will not disappear, the standard ways to configure number
of VF, etc is still available.

> Even in switchdev mode, i guess it will be possible for host apps to use the IP configured
> on the uplink rep to talk externally.
> 
> In case of multiple uplinks, are you exposing one uplink-rep netdev per uplink?

^ permalink raw reply

* Re: [PATCH v2 8/8] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
From: Andrew Lunn @ 2018-04-17 13:26 UTC (permalink / raw)
  To: Michael Schmitz; +Cc: netdev, linux-m68k, Michael.Karcher, Michael Karcher
In-Reply-To: <1523930895-6973-9-git-send-email-schmitzmic@gmail.com>

On Tue, Apr 17, 2018 at 02:08:15PM +1200, Michael Schmitz wrote:
> Add platform device driver to populate the ax88796 platform data from
> information provided by the XSurf100 zorro device driver.
> This driver will have to be loaded before loading the ax88796 module,
> or compiled as built-in.
> 
> Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
> ---
>  drivers/net/ethernet/8390/Kconfig    |   14 +-
>  drivers/net/ethernet/8390/Makefile   |    1 +
>  drivers/net/ethernet/8390/xsurf100.c |  411 ++++++++++++++++++++++++++++++++++
>  3 files changed, 425 insertions(+), 1 deletions(-)
>  create mode 100644 drivers/net/ethernet/8390/xsurf100.c
> 
> diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
> index fdc6734..0cadd45 100644
> --- a/drivers/net/ethernet/8390/Kconfig
> +++ b/drivers/net/ethernet/8390/Kconfig
> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>  
>  config AX88796
>  	tristate "ASIX AX88796 NE2000 clone support"
> -	depends on (ARM || MIPS || SUPERH)
> +	depends on (ARM || MIPS || SUPERH || AMIGA)

Hi Michael

Will it compile on other platforms? If so, it is a good idea to add
COMPILE_TEST as well.

	     Andrew

^ permalink raw reply

* Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
From: Or Gerlitz @ 2018-04-17 13:23 UTC (permalink / raw)
  To: Jiri Pirko, Jakub Kicinski
  Cc: Linux Netdev List, David Miller, Ido Schimmel, mlxsw, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, Michael Chan, Ganesh Goudar,
	Saeed Mahameed, Simon Horman, Pieter Jansen van Vuuren,
	John Hurley, Dirk van der Merwe, Alexander Duyck, Or Gerlitz,
	David Ahern, vijaya.guvva,
	"Burla, Satananda" <satananda
In-Reply-To: <20180322105522.8186-1-jiri@resnulli.us>

On Thu, Mar 22, 2018 at 1:55 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> This patchset resolves 2 issues we have right now:
> 1) There are many netdevices / ports in the system, for port, pf, vf
>    represenatation but the user has no way to see which is which
> 2) The ndo_get_phys_port_name is implemented in each driver separatelly,
>    which may lead to inconsistent names between drivers.
>
> This patchset introduces port flavours which should address the first
> problem. I'm testing this with Netronome nfp hardware. When the user
> has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:

J/J (Jiri/Jakub) --

re "2 physical ports, 1 pf, and 4 vfs" --- does NFP exposes one PF for
both physical ports?

FWIW note that in mlx5 and AFAIK any other device except for mlx4 (...)
folks have FPP (Function Per Port) scheme.

[..]

> The desired output should look like this:
> # devlink port
> pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
> pci/0000:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
> pci/0000:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
> pci/0000:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
> pci/0000:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
> pci/0000:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
> pci/0000:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3
> As you can see, the netdev names are generated according to the flavour
> and port number. In case the port is split, the split subnumber is also included.

What is the purpose/role in getting dev link ports here? is it such
that @ the end
of the day the driver would do a devlink_port_get_phys_port_name() call in their
get phys port name ndo? or we buy more advantages out of doing so?

Or.

^ permalink raw reply

* Re: [PATCH v2 2/8] net: ax88796: Attach MII bus only when open
From: Andrew Lunn @ 2018-04-17 13:19 UTC (permalink / raw)
  To: Michael Schmitz
  Cc: netdev, linux-m68k, Michael.Karcher, Michael Karcher,
	Michael Karcher
In-Reply-To: <1523930895-6973-3-git-send-email-schmitzmic@gmail.com>

On Tue, Apr 17, 2018 at 02:08:09PM +1200, Michael Schmitz wrote:
> From: Michael Karcher <debian@mkarcher.dialup.fu-berlin.de>
> 
> Call ax_mii_init in ax_open(), and unregister/remove mdiobus resources
> in ax_close().
> 
> This is needed to be able to unload the module, as the module is busy
> while the MII bus is attached.
> 
> Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
From: Geert Uytterhoeven @ 2018-04-17 13:11 UTC (permalink / raw)
  To: Simon Horman
  Cc: Sergei Shtylyov, Magnus Damm, netdev, Linux-Renesas, Wolfram Sang,
	Masaru Nagai, Kazuya Mizuguchi
In-Reply-To: <20180417085030.32650-2-horms+renesas@verge.net.au>

Hi Simon,

On Tue, Apr 17, 2018 at 10:50 AM, Simon Horman
<horms+renesas@verge.net.au> wrote:
> From: Masaru Nagai <masaru.nagai.vx@renesas.com>
>
> [   58.490829] =================================
> [   58.495205] [ INFO: inconsistent lock state ]
> [   58.499583] 4.9.0-yocto-standard-00007-g2ef7caf #57 Not tainted

Can this be triggered on contemporary kernels, too?

> [   58.505529] ---------------------------------
> [   58.509904] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 08/10] net: ax88796: Make reset more robust on AX88796B
From: Andrew Lunn @ 2018-04-17 13:01 UTC (permalink / raw)
  To: Michael Karcher, Florian Fainelli
  Cc: Michael Schmitz, netdev, Linux/m68k, John Paul Adrian Glaubitz,
	Michael Karcher
In-Reply-To: <58605.92.192.176.127.1523942290.webmail@webmail.zedat.fu-berlin.de>

On Tue, Apr 17, 2018 at 07:18:10AM +0200, Michael Karcher wrote:
> [Andrew, sorry for the dup. I did hit reply-to-auhor instead of
> reply-to-all first.]
> 
> Andrew Lunn schrieb:
> >> > This should really be fixed in the PHY driver, not the MAC.
> >>
> >> OK - do you want this separate, or as part of this series? Might have
> >> a few side effects on more commonly used hardware, perhaps?
> >
> > Hi Michael
> >
> > What PHY driver is used?
> The ax88796b comes with its own integrated (buggy) PHY needing this
> workaround. This PHY has its own ID which is not known by Linux, so it is
> using the genphy driver as fallback.
> 
> > In the driver you can implement a .soft_reset
> > function which first does the dummy write, and then uses
> > genphy_soft_reset() to do the actual reset.
> We could do that - but I dont't see the point in creating a PHY driver
> that is only ever used by this MAC driver, just to add a single line to
> the genphy driver. If the same PHY might be used with a different MAC,
> you definitely would have a point there, though.


Hi Michael

We try to keep the core code clean, and put all workarounds for buggy
hardware in drivers specific to them. It just helps keep the core code
maintainable.

I would prefer a driver specific to this PHY with the workaround. But
lets see what Florian says.

     Andrew

^ permalink raw reply

* [net-next V10 PATCH 16/16] xdp: transition into using xdp_frame for ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-04-17 13:00 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   30 ++++++++++++++-----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   24 +++++++++++---------
 drivers/net/tun.c                             |   19 ++++++++++------
 drivers/net/virtio_net.c                      |   24 ++++++++++++--------
 include/linux/netdevice.h                     |    4 ++-
 net/core/filter.c                             |   17 +++++++++++++-
 7 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c8bf4d35fdea..87fb27ab9c24 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2203,9 +2203,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 #define I40E_XDP_CONSUMED 1
 #define I40E_XDP_TX 2
 
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring);
 
+static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+				 struct i40e_ring *xdp_ring)
+{
+	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!xdpf))
+		return I40E_XDP_CONSUMED;
+
+	return i40e_xmit_xdp_ring(xdpf, xdp_ring);
+}
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
@@ -2233,7 +2244,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 		break;
 	case XDP_TX:
 		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-		result = i40e_xmit_xdp_ring(xdp, xdp_ring);
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
@@ -3480,21 +3491,14 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
  * @xdp: data to transmit
  * @xdp_ring: XDP Tx ring
  **/
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring)
 {
 	u16 i = xdp_ring->next_to_use;
 	struct i40e_tx_buffer *tx_bi;
 	struct i40e_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
+	u32 size = xdpf->len;
 	dma_addr_t dma;
-	u32 size;
-
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return I40E_XDP_CONSUMED;
-
-	size = xdpf->len;
 
 	if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
 		xdp_ring->tx_stats.tx_busy++;
@@ -3684,7 +3688,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
  *
  * Returns Zero if sent, else an error code
  **/
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct i40e_netdev_priv *np = netdev_priv(dev);
 	unsigned int queue_index = smp_processor_id();
@@ -3697,7 +3701,7 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
-	err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
 	if (err != I40E_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 857b1d743c8d..4bf318b8be85 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -511,7 +511,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
 
 /**
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4f2864165723..51e7d82a5860 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2262,7 +2262,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp);
+			       struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
@@ -2270,6 +2270,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 {
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	u32 act;
 
 	rcu_read_lock();
@@ -2278,12 +2279,19 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	if (!xdp_prog)
 		goto xdp_out;
 
+	prefetchw(xdp->data_hard_start); /* xdp_frame write */
+
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		result = ixgbe_xmit_xdp_ring(adapter, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf)) {
+			result = IXGBE_XDP_CONSUMED;
+			break;
+		}
+		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -2386,7 +2394,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
-			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -8344,20 +8351,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
@@ -10010,7 +10012,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_ring *ring;
@@ -10026,7 +10028,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (unlikely(!ring))
 		return -ENXIO;
 
-	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	err = ixgbe_xmit_xdp_ring(adapter, xdpf);
 	if (err != IXGBE_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bec130cdbd9d..1e58be152d5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1301,18 +1301,13 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	frame = convert_to_xdp_frame(xdp);
-	if (unlikely(!frame))
-		return -EOVERFLOW;
-
 	rcu_read_lock();
 
 	numqueues = READ_ONCE(tun->numqueues);
@@ -1336,6 +1331,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	return ret;
 }
 
+static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return tun_xdp_xmit(dev, frame);
+}
+
 static void tun_xdp_flush(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1683,7 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		case XDP_TX:
 			get_page(alloc_frag->page);
 			alloc_frag->offset += buflen;
-			if (tun_xdp_xmit(tun->dev, &xdp))
+			if (tun_xdp_tx(tun->dev, &xdp))
 				goto err_redirect;
 			tun_xdp_flush(tun->dev);
 			rcu_read_unlock();
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ab3d7cbc4c49..01694e26f03e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -416,10 +416,10 @@ static void virtnet_xdp_flush(struct net_device *dev)
 }
 
 static int __virtnet_xdp_xmit(struct virtnet_info *vi,
-			      struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	struct xdp_frame *xdpf, *xdpf_sent;
+	struct xdp_frame *xdpf_sent;
 	struct send_queue *sq;
 	unsigned int len;
 	unsigned int qp;
@@ -432,10 +432,6 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
 		xdp_return_frame(xdpf_sent);
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	/* virtqueue want to use data area in-front of packet */
 	if (unlikely(xdpf->metasize > 0))
 		return -EOPNOTSUPP;
@@ -459,7 +455,7 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 	return 0;
 }
 
-static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
@@ -472,7 +468,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		return -ENXIO;
 
-	return __virtnet_xdp_xmit(vi, xdp);
+	return __virtnet_xdp_xmit(vi, xdpf);
 }
 
 static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
@@ -569,6 +565,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
+		struct xdp_frame *xdpf;
 		struct xdp_buff xdp;
 		void *orig_data;
 		u32 act;
@@ -611,7 +608,10 @@ static struct sk_buff *receive_small(struct net_device *dev,
 			delta = orig_data - xdp.data;
 			break;
 		case XDP_TX:
-			err = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			err = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(err)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				goto err_xdp;
@@ -702,6 +702,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
 		void *data;
@@ -766,7 +767,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			break;
 		case XDP_TX:
-			err = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			err = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(err)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				if (unlikely(xdp_page != page))
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..14e0777ffcfb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1165,7 +1165,7 @@ struct dev_ifalias {
  *	This function is used to set or query state related to XDP on the
  *	netdevice and manage BPF offload. See definition of
  *	enum bpf_netdev_command for details.
- * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
@@ -1356,7 +1356,7 @@ struct net_device_ops {
 	int			(*ndo_bpf)(struct net_device *dev,
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
-						struct xdp_buff *xdp);
+						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff93270d..3bb0cb98a9be 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2749,13 +2749,18 @@ static int __bpf_tx_xdp(struct net_device *dev,
 			struct xdp_buff *xdp,
 			u32 index)
 {
+	struct xdp_frame *xdpf;
 	int err;
 
 	if (!dev->netdev_ops->ndo_xdp_xmit) {
 		return -EOPNOTSUPP;
 	}
 
-	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 	if (err)
 		return err;
 	dev->netdev_ops->ndo_xdp_flush(dev);
@@ -2771,11 +2776,19 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
 		struct net_device *dev = fwd;
+		struct xdp_frame *xdpf;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
 			return -EOPNOTSUPP;
 
-		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf))
+			return -EOVERFLOW;
+
+		/* TODO: move to inside map code instead, for bulk support
+		 * err = dev_map_enqueue(dev, xdp);
+		 */
+		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);

^ permalink raw reply related

* Re: tcp hang when socket fills up ?
From: Michal Kubecek @ 2018-04-17 13:00 UTC (permalink / raw)
  To: netdev
  Cc: Dominique Martinet, Florian Westphal, Marcelo Ricardo Leitner,
	Eric Dumazet
In-Reply-To: <20180417123437.GA19885@nautica>

On Tue, Apr 17, 2018 at 02:34:37PM +0200, Dominique Martinet wrote:
> Michal Kubecek wrote on Tue, Apr 17, 2018:
> > Data (21 bytes) packet in reply direction. And somewhere between the
> > first and second debugging print, we ended up with sender scale=0 and
> > that value is then preserved from now on.
> > 
> > The only place between the two debug prints where we could change only
> > one of the td_sender values are the two calls to tcp_options() but
> > neither should be called now unless I missed something. I'll try to
> > think about it some more.
> 
> Could it have something to do with the way I setup the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK

This must be what nf_conntrack_proto_tcp.c calls "simultaneous open".

> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688882 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 
> As you say, only tcp_options() will clear only on side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

I agree that sender->td_maxwin is not zero so that the handshake above
probably left the conntrack in TCP_CONNTRACK_SYN_RECV state for some
reason. I'll try to go through the code with the pattern you mentioned
in mind.

Michal Kubecek

^ permalink raw reply

* [net-next V10 PATCH 15/16] xdp: transition into using xdp_frame for return API
From: Jesper Dangaard Brouer @ 2018-04-17 13:00 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame.  The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c     |    5 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
 drivers/net/tun.c                               |    4 ++--
 drivers/net/virtio_net.c                        |    2 +-
 include/net/xdp.h                               |    2 +-
 kernel/bpf/cpumap.c                             |    6 +++---
 net/core/xdp.c                                  |    4 +++-
 9 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 96c54cbfb1f9..c8bf4d35fdea 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -638,8 +638,7 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
 		if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
 			kfree(tx_buffer->raw_buf);
 		else if (ring_is_xdp(ring))
-			xdp_return_frame(tx_buffer->xdpf->data,
-					 &tx_buffer->xdpf->mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 		if (dma_unmap_len(tx_buffer, len))
@@ -842,7 +841,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 		/* free the skb/XDP data */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buf->xdpf->data, &tx_buf->xdpf->mem);
+			xdp_return_frame(tx_buf->xdpf);
 		else
 			napi_consume_skb(tx_buf->skb, napi_budget);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index abb5248e917e..7dd5038cfcc4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -241,8 +241,7 @@ struct ixgbe_tx_buffer {
 	unsigned long time_stamp;
 	union {
 		struct sk_buff *skb;
-		/* XDP uses address ptr on irq_clean */
-		void *data;
+		struct xdp_frame *xdpf;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -250,7 +249,6 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
-	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f10904ec2172..4f2864165723 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2386,6 +2386,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
+			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -5797,7 +5798,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8348,16 +8349,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
+	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	len = xdp->data_end - xdp->data;
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
 		return IXGBE_XDP_CONSUMED;
 
-	dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+	dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma))
 		return IXGBE_XDP_CONSUMED;
 
@@ -8372,8 +8378,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
-	tx_buffer->data = xdp->data;
-	tx_buffer->xdp_mem = xdp->rxq->mem;
+	tx_buffer->xdpf = xdpf;
 
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index f42436d7f2d9..7bbf0db27a01 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -890,6 +890,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 
 	dma_sync_single_range_for_cpu(rq->pdev, di->addr, wi->offset,
 				      frag_size, DMA_FROM_DEVICE);
+	prefetchw(va); /* xdp_frame data area */
 	prefetch(data);
 	wi->offset += frag_size;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 283bde85c455..bec130cdbd9d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -663,7 +663,7 @@ void tun_ptr_free(void *ptr)
 	if (tun_is_xdp_frame(ptr)) {
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -2196,7 +2196,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
 		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		struct sk_buff *skb = ptr;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 42d338fe9a8d..ab3d7cbc4c49 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -430,7 +430,7 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
+		xdp_return_frame(xdpf_sent);
 
 	xdpf = convert_to_xdp_frame(xdp);
 	if (unlikely(!xdpf))
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d0ee437753dc..137ad5f9f40f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -103,7 +103,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem);
+void xdp_return_frame(struct xdp_frame *xdpf);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index bcdc4dea5ce7..c95b04ec103e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 
 	while ((xdpf = ptr_ring_consume(ring)))
 		if (WARN_ON_ONCE(xdpf))
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdpf->data, &xdpf->mem);
+				xdp_return_frame(xdpf);
 				continue;
 			}
 
@@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 33e382afbd95..0c86b53a3a63 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -308,9 +308,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
+	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {

^ permalink raw reply related

* [net-next V10 PATCH 14/16] mlx5: use page_pool for xdp_return_frame call
From: Jesper Dangaard Brouer @ 2018-04-17 13:00 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator.  And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good.  Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP redirected frames were returned via
the page allocator.  The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.

The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing).  It is also planned to remove this unnecessary DMA
unmap in a later patchset

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V5: Updated patch desc

V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
 - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
 - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
 - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V10: Req from Tariq
 - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1a05d1072c5e..3317a4da87cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -534,6 +536,7 @@ struct mlx5e_rq {
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
+	struct page_pool      *page_pool;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2dca0933dfd3..f10037419978 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
 #include <linux/bpf.h>
+#include <net/page_pool.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			  struct mlx5e_rq_param *rqp,
 			  struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-	u32 byte_count;
+	u32 byte_count, pool_size;
 	int npages;
 	int wq_sz;
 	int err;
@@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
+	pool_size = 1 << params->log_rq_mtu_frames;
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+
+		pool_size = MLX5_MPWRQ_PAGES_PER_WQE << mlx5e_mpwqe_get_log_rq_size(params);
 		rq->post_wqes = mlx5e_post_rx_mpwqes;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -512,13 +517,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog) {
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-		if (err)
-			goto err_rq_wq_destroy;
+	/* Create a page_pool and register it with rxq */
+	pp_params.order     = rq->buff.page_order;
+	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
+	pp_params.pool_size = pool_size;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dev       = c->pdev;
+	pp_params.dma_dir   = rq->buff.map_dir;
+
+	/* page_pool can be used even when there is no rq->xdp_prog,
+	 * given page_pool does not handle DMA mapping there is no
+	 * required state to clear. And page_pool gracefully handle
+	 * elevated refcnt.
+	 */
+	rq->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(rq->page_pool)) {
+		if (rq->wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
+			kfree(rq->wqe.frag_info);
+		err = PTR_ERR(rq->page_pool);
+		rq->page_pool = NULL;
+		goto err_rq_wq_destroy;
 	}
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+					 MEM_TYPE_PAGE_POOL, rq->page_pool);
+	if (err)
+		goto err_rq_wq_destroy;
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
@@ -556,6 +579,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy(rq->page_pool);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -569,6 +594,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 		bpf_prog_put(rq->xdp_prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy(rq->page_pool);
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e24be05907f..f42436d7f2d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include <linux/bpf_trace.h>
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
+#include <net/page_pool.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -221,7 +222,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
 
-	dma_info->page = dev_alloc_pages(rq->buff.page_order);
+	dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!dma_info->page))
 		return -ENOMEM;
 
@@ -246,11 +247,16 @@ static void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
+	if (likely(recycle)) {
+		if (mlx5e_rx_cache_put(rq, dma_info))
+			return;
 
-	mlx5e_page_dma_unmap(rq, dma_info);
-	put_page(dma_info->page);
+		mlx5e_page_dma_unmap(rq, dma_info);
+		page_pool_recycle_direct(rq->page_pool, dma_info->page);
+	} else {
+		mlx5e_page_dma_unmap(rq, dma_info);
+		put_page(dma_info->page);
+	}
 }
 
 static inline bool mlx5e_page_reuse(struct mlx5e_rq *rq,

^ permalink raw reply related

* [net-next V10 PATCH 13/16] xdp: allow page_pool as an allocator type in xdp_return_frame
From: Jesper Dangaard Brouer @ 2018-04-17 12:59 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the drivers
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
responsibility to free the page_pool, but with a RCU free call.  This
is done easily via the page_pool helper page_pool_destroy() (which
avoids touching any driver code during the RCU callback, which could
happen after the driver have been unloaded).

V8: address issues found by kbuild test robot
 - Address sparse should be static warnings
 - Allow xdp.o to be compiled without page_pool.o

V9: Remove inline from .c file, compiler knows best

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |   14 +++++++++++
 include/net/xdp.h       |    3 ++
 net/core/xdp.c          |   60 ++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 65 insertions(+), 12 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 1fe77db59518..c79087153148 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -117,7 +117,12 @@ void __page_pool_put_page(struct page_pool *pool,
 
 static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
 {
+	/* When page_pool isn't compiled-in, net/core/xdp.c doesn't
+	 * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
+	 */
+#ifdef CONFIG_PAGE_POOL
 	__page_pool_put_page(pool, page, false);
+#endif
 }
 /* Very limited use-cases allow recycle direct */
 static inline void page_pool_recycle_direct(struct page_pool *pool,
@@ -126,4 +131,13 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
 	__page_pool_put_page(pool, page, true);
 }
 
+static inline bool is_page_pool_compiled_in(void)
+{
+#ifdef CONFIG_PAGE_POOL
+	return true;
+#else
+	return false;
+#endif
+}
+
 #endif /* _NET_PAGE_POOL_H */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5f67c62540aa..d0ee437753dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,6 +36,7 @@
 enum xdp_mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	MEM_TYPE_PAGE_POOL,
 	MEM_TYPE_MAX,
 };
 
@@ -44,6 +45,8 @@ struct xdp_mem_info {
 	u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8b2cb79b5de0..33e382afbd95 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/rhashtable.h>
+#include <net/page_pool.h>
 
 #include <net/xdp.h>
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
 	struct xdp_mem_info mem;
-	void *allocator;
+	union {
+		void *allocator;
+		struct page_pool *page_pool;
+	};
 	struct rhash_head node;
 	struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	/* Allow this ID to be reused */
 	ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-	/* TODO: Depending on allocator type/pointer free resources */
+	/* Notice, driver is expected to free the *allocator,
+	 * e.g. page_pool, and MUST also use RCU free.
+	 */
 
 	/* Poison memory */
 	xa->mem.id = 0xFFFF;
@@ -225,6 +231,17 @@ static int __mem_id_cyclic_get(gfp_t gfp)
 	return id;
 }
 
+static bool __is_supported_mem_type(enum xdp_mem_type type)
+{
+	if (type == MEM_TYPE_PAGE_POOL)
+		return is_page_pool_compiled_in();
+
+	if (type >= MEM_TYPE_MAX)
+		return false;
+
+	return true;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum xdp_mem_type type, void *allocator)
 {
@@ -238,13 +255,16 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 		return -EFAULT;
 	}
 
-	if (type >= MEM_TYPE_MAX)
-		return -EINVAL;
+	if (!__is_supported_mem_type(type))
+		return -EOPNOTSUPP;
 
 	xdp_rxq->mem.type = type;
 
-	if (!allocator)
+	if (!allocator) {
+		if (type == MEM_TYPE_PAGE_POOL)
+			return -EINVAL; /* Setup time check page_pool req */
 		return 0;
+	}
 
 	/* Delay init of rhashtable to save memory if feature isn't used */
 	if (!mem_id_init) {
@@ -290,15 +310,31 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+	struct xdp_mem_allocator *xa;
+	struct page *page;
+
+	switch (mem->type) {
+	case MEM_TYPE_PAGE_POOL:
+		rcu_read_lock();
+		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+		page = virt_to_head_page(data);
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		rcu_read_unlock();
+		break;
+	case MEM_TYPE_PAGE_SHARED:
 		page_frag_free(data);
-		return;
-	}
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
+		break;
+	case MEM_TYPE_PAGE_ORDER0:
+		page = virt_to_page(data); /* Assumes order0 page*/
 		put_page(page);
+		break;
+	default:
+		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
+		break;
 	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related

* [net-next V10 PATCH 12/16] page_pool: refurbish version of page_pool code
From: Jesper Dangaard Brouer @ 2018-04-17 12:59 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

Need a fast page recycle mechanism for ndo_xdp_xmit API for returning
pages on DMA-TX completion time, which have good cross CPU
performance, given DMA-TX completion time can happen on a remote CPU.

Refurbish my page_pool code, that was presented[1] at MM-summit 2016.
Adapted page_pool code to not depend the page allocator and
integration into struct page.  The DMA mapping feature is kept,
even-though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

V6: Adjustments requested by Eric Dumazet
 - Adjust ____cacheline_aligned_in_smp usage/placement
 - Move rcu_head in struct page_pool
 - Free pages quicker on destroy, minimize resources delayed an RCU period
 - Remove code for forward/backward compat ABI interface

V8: Issues found by kbuild test robot
 - Address sparse should be static warnings
 - Only compile+link when a driver use/select page_pool,
   mlx5 selects CONFIG_PAGE_POOL, although its first used in two patches

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig |    1 
 include/net/page_pool.h                         |  129 +++++++++
 net/Kconfig                                     |    3 
 net/core/Makefile                               |    1 
 net/core/page_pool.c                            |  317 +++++++++++++++++++++++
 5 files changed, 451 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index c032319f1cb9..12257034131e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -30,6 +30,7 @@ config MLX5_CORE_EN
 	bool "Mellanox Technologies ConnectX-4 Ethernet support"
 	depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE
 	depends on IPV6=y || IPV6=n || MLX5_CORE=m
+	select PAGE_POOL
 	default n
 	---help---
 	  Ethernet support in Mellanox Technologies ConnectX-4 NIC.
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index 000000000000..1fe77db59518
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.h
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but have fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involve replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (use page->private), then API user
+ * is responsible for invoking page_pool_put_page() once.  In-case of
+ * elevated refcnt, the DMA state is released, assuming other users of
+ * the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as shim-layer that
+ * fall-through to alloc_page.  As no state is kept on the page, the
+ * regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include <linux/mm.h> /* Needed by ptr_ring */
+#include <linux/ptr_ring.h>
+#include <linux/dma-direction.h>
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects, is due to XDP_DROP use-case.  As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, as it shares the same softirq/NAPI protection.  If
+ * cache is already full (or partly full) then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count;
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+	unsigned int	flags;
+	unsigned int	order;
+	unsigned int	pool_size;
+	int		nid;  /* Numa node id to allocate from pages from */
+	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+};
+
+struct page_pool {
+	struct rcu_head rcu;
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * Drivers allocation side usually already perform some kind
+	 * of resource protection.  Piggyback on this protection, and
+	 * require driver to protect allocation side.
+	 *
+	 * For NIC drivers this means, allocate a page_pool per
+	 * RX-queue. As the RX-queue is already protected by
+	 * Softirq/BH scheduling and napi_schedule. NAPI schedule
+	 * guarantee that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated synchronization
+	 * wise, because free's can happen on remote CPUs, with no
+	 * association with allocation resource.
+	 *
+	 * Use ptr_ring, as it separates consumer and producer
+	 * effeciently, it a way that doesn't bounce cache-lines.
+	 *
+	 * TODO: Implement bulk return pages into this structure.
+	 */
+	struct ptr_ring ring;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_put_page(pool, page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+					    struct page *page)
+{
+	__page_pool_put_page(pool, page, true);
+}
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..6fa1a4493b8c 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,9 @@ config MAY_USE_DEVLINK
 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
 	  devlink is a loadable module and the driver using it is built-in.
 
+config PAGE_POOL
+       bool
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..7080417f8bc8 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@ obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
+obj-$(CONFIG_PAGE_POOL) += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
new file mode 100644
index 000000000000..68bf07206744
--- /dev/null
+++ b/net/core/page_pool.c
@@ -0,0 +1,317 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.c
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+static int page_pool_init(struct page_pool *pool,
+			  const struct page_pool_params *params)
+{
+	unsigned int ring_qsize = 1024; /* Default */
+
+	memcpy(&pool->p, params, sizeof(pool->p));
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	/* Sanity limit mem that can be pinned down */
+	if (ring_qsize > 32768)
+		return -E2BIG;
+
+	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
+	 * which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Test for safe-context, caller should provide this guarantee */
+	if (likely(in_serving_softirq())) {
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill
+		 *
+		 * Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: the ring consumer lock is not really
+		 * needed due to the softirq/NAPI protection, but
+		 * later need the ability to reclaim pages on the
+		 * ring. Thus, keeping the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+
+	/* FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * need bulk allocation support from the page allocator code.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		goto skip_dma_map;
+
+	/* Setup DMA mapping: use page->private for DMA-addr
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	set_page_private(page, dma); /* page->private = dma; */
+
+skip_dma_map:
+	/* When page just alloc'ed is should/must have refcnt 1. */
+	return page;
+}
+
+/* For using page_pool replace: alloc_pages() API calls, but provide
+ * synchronization guarantee for allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages_slow(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+static void __page_pool_clean_page(struct page_pool *pool,
+				   struct page *page)
+{
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return;
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page_private(page),
+		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	set_page_private(page, 0);
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+	/* An optimization would be to call __free_pages(page, pool->p.order)
+	 * knowing page is not part of page-cache (thus avoiding a
+	 * __page_cache_release() call).
+	 */
+}
+
+static bool __page_pool_recycle_into_ring(struct page_pool *pool,
+				   struct page *page)
+{
+	int ret;
+	/* BH protection not needed if current is serving softirq */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return (ret == 0) ? true : false;
+}
+
+/* Only allow direct recycling in special circumstances, into the
+ * alloc side cache.  E.g. during RX-NAPI processing for XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct)
+{
+	/* This allocator is optimized for the XDP mode that uses
+	 * one-frame-per-page, but have fallbacks that act like the
+	 * regular page allocator APIs.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier done in page_ref_count / READ_ONCE */
+
+		if (allow_direct && in_serving_softirq())
+			if (__page_pool_recycle_direct(page, pool))
+				return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Cache full, fallback to free pages */
+			__page_pool_return_page(pool, page);
+		}
+		return;
+	}
+	/* Fallback/non-XDP mode: API user have elevated refcnt.
+	 *
+	 * Many drivers split up the page into fragments, and some
+	 * want to keep doing this to save memory and do refcnt based
+	 * recycling. Support this use case too, to ease drivers
+	 * switching between XDP/non-XDP.
+	 *
+	 * In-case page_pool maintains the DMA mapping, API user must
+	 * call page_pool_put_page once.  In this elevated refcnt
+	 * case, the DMA is unmapped/released, as driver is likely
+	 * doing refcnt based recycle tricks, meaning another process
+	 * will be invoking put_page.
+	 */
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+static void __page_pool_empty_ring(struct page_pool *pool)
+{
+	struct page *page;
+
+	/* Empty recycle ring */
+	while ((page = ptr_ring_consume(&pool->ring))) {
+		/* Verify the refcnt invariant of cached pages */
+		if (!(page_ref_count(page) == 1))
+			pr_crit("%s() page_pool refcnt %d violation\n",
+				__func__, page_ref_count(page));
+
+		__page_pool_return_page(pool, page);
+	}
+}
+
+static void __page_pool_destroy_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+
+	WARN(pool->alloc.count, "API usage violation");
+
+	__page_pool_empty_ring(pool);
+	ptr_ring_cleanup(&pool->ring, NULL);
+	kfree(pool);
+}
+
+/* Cleanup and release resources */
+void page_pool_destroy(struct page_pool *pool)
+{
+	struct page *page;
+
+	/* Empty alloc cache, assume caller made sure this is
+	 * no-longer in use, and page_pool_alloc_pages() cannot be
+	 * call concurrently.
+	 */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
+
+	/* No more consumers should exist, but producers could still
+	 * be in-flight.
+	 */
+	__page_pool_empty_ring(pool);
+
+	/* An xdp_mem_allocator can still ref page_pool pointer */
+	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
+}
+EXPORT_SYMBOL(page_pool_destroy);

^ permalink raw reply related

* [net-next V10 PATCH 11/16] xdp: rhashtable with allocator ID to pointer mapping
From: Jesper Dangaard Brouer @ 2018-04-17 12:59 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable, for creating ID to
pointer lookup table, because this is faster.

The problem that is being solved here is that, the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as the
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU during DMA-TX completion time . In an
xdp_frame the xdp_mem_info is stored, when it got converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID map to pointer. The removal and validity of of the
allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up-to debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

Use the IDA infrastructure for getting a cyclic increasing ID number,
that is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even-though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

V6: Per req of Alex Duyck
- Introduce rhashtable_lookup() call in later patch

V8: Address sparse should be static warnings (from kbuild test robot)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
 drivers/net/tun.c                             |    6 +
 drivers/net/virtio_net.c                      |    7 +
 include/net/xdp.h                             |   14 --
 net/core/xdp.c                                |  223 ++++++++++++++++++++++++-
 5 files changed, 241 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0bfe6cf2bf8b..f10904ec2172 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6370,7 +6370,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	struct device *dev = rx_ring->dev;
 	int orig_node = dev_to_node(dev);
 	int ring_node = -1;
-	int size;
+	int size, err;
 
 	size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6407,6 +6407,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 			     rx_ring->queue_index) < 0)
 		goto err;
 
+	err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err) {
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+		goto err;
+	}
+
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2c85e5cac2a9..283bde85c455 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -854,6 +854,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 				       tun->dev, tfile->queue_index);
 		if (err < 0)
 			goto out;
+		err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+			goto out;
+		}
 		err = 0;
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f50e1ad81ad4..42d338fe9a8d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1305,6 +1305,13 @@ static int virtnet_open(struct net_device *dev)
 		if (err < 0)
 			return err;
 
+		err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+			return err;
+		}
+
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index ea3773f94f65..5f67c62540aa 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,6 +41,7 @@ enum xdp_mem_type {
 
 struct xdp_mem_info {
 	u32 type; /* enum xdp_mem_type, but known size type */
+	u32 id;
 };
 
 struct xdp_rxq_info {
@@ -99,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-static inline
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
-{
-	if (mem->type == MEM_TYPE_PAGE_SHARED)
-		page_frag_free(data);
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
-		put_page(page);
-	}
-}
+void xdp_return_frame(void *data, struct xdp_mem_info *mem);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 7e6b3545277d..8b2cb79b5de0 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -5,6 +5,9 @@
  */
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <linux/rhashtable.h>
 
 #include <net/xdp.h>
 
@@ -13,6 +16,99 @@
 #define REG_STATE_UNREGISTERED	0x2
 #define REG_STATE_UNUSED	0x3
 
+static DEFINE_IDA(mem_id_pool);
+static DEFINE_MUTEX(mem_id_lock);
+#define MEM_ID_MAX 0xFFFE
+#define MEM_ID_MIN 1
+static int mem_id_next = MEM_ID_MIN;
+
+static bool mem_id_init; /* false */
+static struct rhashtable *mem_id_ht;
+
+struct xdp_mem_allocator {
+	struct xdp_mem_info mem;
+	void *allocator;
+	struct rhash_head node;
+	struct rcu_head rcu;
+};
+
+static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
+{
+	const u32 *k = data;
+	const u32 key = *k;
+
+	BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
+		     != sizeof(u32));
+
+	/* Use cyclic increasing ID as direct hash key, see rht_bucket_index */
+	return key << RHT_HASH_RESERVED_SPACE;
+}
+
+static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
+			  const void *ptr)
+{
+	const struct xdp_mem_allocator *xa = ptr;
+	u32 mem_id = *(u32 *)arg->key;
+
+	return xa->mem.id != mem_id;
+}
+
+static const struct rhashtable_params mem_id_rht_params = {
+	.nelem_hint = 64,
+	.head_offset = offsetof(struct xdp_mem_allocator, node),
+	.key_offset  = offsetof(struct xdp_mem_allocator, mem.id),
+	.key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id),
+	.max_size = MEM_ID_MAX,
+	.min_size = 8,
+	.automatic_shrinking = true,
+	.hashfn    = xdp_mem_id_hashfn,
+	.obj_cmpfn = xdp_mem_id_cmp,
+};
+
+static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
+{
+	struct xdp_mem_allocator *xa;
+
+	xa = container_of(rcu, struct xdp_mem_allocator, rcu);
+
+	/* Allow this ID to be reused */
+	ida_simple_remove(&mem_id_pool, xa->mem.id);
+
+	/* TODO: Depending on allocator type/pointer free resources */
+
+	/* Poison memory */
+	xa->mem.id = 0xFFFF;
+	xa->mem.type = 0xF0F0;
+	xa->allocator = (void *)0xDEAD9001;
+
+	kfree(xa);
+}
+
+static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+{
+	struct xdp_mem_allocator *xa;
+	int id = xdp_rxq->mem.id;
+	int err;
+
+	if (id == 0)
+		return;
+
+	mutex_lock(&mem_id_lock);
+
+	xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
+	if (!xa) {
+		mutex_unlock(&mem_id_lock);
+		return;
+	}
+
+	err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params);
+	WARN_ON(err);
+
+	call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
+
+	mutex_unlock(&mem_id_lock);
+}
+
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
 	/* Simplify driver cleanup code paths, allow unreg "unused" */
@@ -21,8 +117,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
+	__xdp_rxq_info_unreg_mem_model(xdp_rxq);
+
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
 	xdp_rxq->dev = NULL;
+
+	/* Reset mem info to defaults */
+	xdp_rxq->mem.id = 0;
+	xdp_rxq->mem.type = 0;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
 
@@ -72,20 +174,131 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
 
+static int __mem_id_init_hash_table(void)
+{
+	struct rhashtable *rht;
+	int ret;
+
+	if (unlikely(mem_id_init))
+		return 0;
+
+	rht = kzalloc(sizeof(*rht), GFP_KERNEL);
+	if (!rht)
+		return -ENOMEM;
+
+	ret = rhashtable_init(rht, &mem_id_rht_params);
+	if (ret < 0) {
+		kfree(rht);
+		return ret;
+	}
+	mem_id_ht = rht;
+	smp_mb(); /* mutex lock should provide enough pairing */
+	mem_id_init = true;
+
+	return 0;
+}
+
+/* Allocate a cyclic ID that maps to allocator pointer.
+ * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+ *
+ * Caller must lock mem_id_lock.
+ */
+static int __mem_id_cyclic_get(gfp_t gfp)
+{
+	int retries = 1;
+	int id;
+
+again:
+	id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp);
+	if (id < 0) {
+		if (id == -ENOSPC) {
+			/* Cyclic allocator, reset next id */
+			if (retries--) {
+				mem_id_next = MEM_ID_MIN;
+				goto again;
+			}
+		}
+		return id; /* errno */
+	}
+	mem_id_next = id + 1;
+
+	return id;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum xdp_mem_type type, void *allocator)
 {
+	struct xdp_mem_allocator *xdp_alloc;
+	gfp_t gfp = GFP_KERNEL;
+	int id, errno, ret;
+	void *ptr;
+
+	if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+		WARN(1, "Missing register, driver bug");
+		return -EFAULT;
+	}
+
 	if (type >= MEM_TYPE_MAX)
 		return -EINVAL;
 
 	xdp_rxq->mem.type = type;
 
-	if (allocator)
-		return -EOPNOTSUPP;
+	if (!allocator)
+		return 0;
+
+	/* Delay init of rhashtable to save memory if feature isn't used */
+	if (!mem_id_init) {
+		mutex_lock(&mem_id_lock);
+		ret = __mem_id_init_hash_table();
+		mutex_unlock(&mem_id_lock);
+		if (ret < 0) {
+			WARN_ON(1);
+			return ret;
+		}
+	}
+
+	xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
+	if (!xdp_alloc)
+		return -ENOMEM;
+
+	mutex_lock(&mem_id_lock);
+	id = __mem_id_cyclic_get(gfp);
+	if (id < 0) {
+		errno = id;
+		goto err;
+	}
+	xdp_rxq->mem.id = id;
+	xdp_alloc->mem  = xdp_rxq->mem;
+	xdp_alloc->allocator = allocator;
+
+	/* Insert allocator into ID lookup table */
+	ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
+	if (IS_ERR(ptr)) {
+		errno = PTR_ERR(ptr);
+		goto err;
+	}
+
+	mutex_unlock(&mem_id_lock);
 
-	/* TODO: Allocate an ID that maps to allocator pointer
-	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
-	 */
 	return 0;
+err:
+	mutex_unlock(&mem_id_lock);
+	kfree(xdp_alloc);
+	return errno;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
+
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+		page_frag_free(data);
+		return;
+	}
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related

* [net-next V10 PATCH 10/16] mlx5: register a memory model when XDP is enabled
From: Jesper Dangaard Brouer @ 2018-04-17 12:59 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enable a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b29c1d93f058..2dca0933dfd3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -512,6 +512,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
+	/* This must only be activate for order-0 pages */
+	if (rq->xdp_prog) {
+		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+						 MEM_TYPE_PAGE_ORDER0, NULL);
+		if (err)
+			goto err_rq_wq_destroy;
+	}
+
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox