Netdev List

* Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
From: Or Gerlitz @ 2018-04-17 13:23 UTC (permalink / raw)
  To: Jiri Pirko, Jakub Kicinski
  Cc: Linux Netdev List, David Miller, Ido Schimmel, mlxsw, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, Michael Chan, Ganesh Goudar,
	Saeed Mahameed, Simon Horman, Pieter Jansen van Vuuren,
	John Hurley, Dirk van der Merwe, Alexander Duyck, Or Gerlitz,
	David Ahern, vijaya.guvva,
	"Burla, Satananda" <satananda
In-Reply-To: <20180322105522.8186-1-jiri@resnulli.us>

On Thu, Mar 22, 2018 at 1:55 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> This patchset resolves 2 issues we have right now:
> 1) There are many netdevices / ports in the system, for port, pf, vf
>    representation but the user has no way to see which is which
> 2) The ndo_get_phys_port_name is implemented in each driver separately,
>    which may lead to inconsistent names between drivers.
>
> This patchset introduces port flavours which should address the first
> problem. I'm testing this with Netronome nfp hardware. When the user
> has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:

J/J (Jiri/Jakub) --

re "2 physical ports, 1 pf, and 4 vfs" --- does NFP expose one PF for
both physical ports?

FWIW, note that mlx5 and, AFAIK, every other device except mlx4 (...)
use an FPP (Function Per Port) scheme.

[..]

> The desired output should look like this:
> # devlink port
> pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
> pci/0000:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
> pci/0000:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
> pci/0000:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
> pci/0000:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
> pci/0000:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
> pci/0000:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3
> As you can see, the netdev names are generated according to the flavour
> and port number. In case the port is split, the split subnumber is also included.

What is the purpose/role of getting devlink ports here? Is it so that, at
the end of the day, drivers call devlink_port_get_phys_port_name() from
their ndo_get_phys_port_name implementation? Or do we gain further
advantages from doing so?

Or.
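
A minimal sketch of the naming scheme visible in the quoted devlink output,
assuming the name is simply a flavour prefix (p, pf, vf) plus the port number.
The enum, struct, and function names are hypothetical and are not the devlink
API from the patchset; the "s<subport>" suffix for split ports is likewise an
assumed format, since the cover letter only says the subnumber is included.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flavours mirroring the "flavour" column in the output above. */
enum port_flavour {
	PORT_FLAVOUR_PHYSICAL,
	PORT_FLAVOUR_PF_REP,
	PORT_FLAVOUR_VF_REP,
};

/* Hypothetical per-port attributes. */
struct port_attrs {
	enum port_flavour flavour;
	unsigned int number;
	bool split;
	unsigned int split_subport;
};

/* Build the phys_port_name string; returns 0 on success, -1 on error. */
static int port_phys_name(const struct port_attrs *a, char *buf, size_t len)
{
	int n;

	switch (a->flavour) {
	case PORT_FLAVOUR_PHYSICAL:
		if (a->split)	/* assumed format for split ports */
			n = snprintf(buf, len, "p%us%u", a->number,
				     a->split_subport);
		else
			n = snprintf(buf, len, "p%u", a->number);
		break;
	case PORT_FLAVOUR_PF_REP:
		n = snprintf(buf, len, "pf%u", a->number);
		break;
	case PORT_FLAVOUR_VF_REP:
		n = snprintf(buf, len, "vf%u", a->number);
		break;
	default:
		return -1;
	}

	/* Treat truncation as an error. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With one common helper of this kind, a driver would only fill in the flavour
and number, and every driver would produce the same "vf3"-style string for
the same kind of port.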

^ permalink raw reply

* Re: [PATCH v2 8/8] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
From: Andrew Lunn @ 2018-04-17 13:26 UTC (permalink / raw)
  To: Michael Schmitz; +Cc: netdev, linux-m68k, Michael.Karcher, Michael Karcher
In-Reply-To: <1523930895-6973-9-git-send-email-schmitzmic@gmail.com>

On Tue, Apr 17, 2018 at 02:08:15PM +1200, Michael Schmitz wrote:
> Add platform device driver to populate the ax88796 platform data from
> information provided by the XSurf100 zorro device driver.
> This driver will have to be loaded before loading the ax88796 module,
> or compiled as built-in.
> 
> Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>
> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
> ---
>  drivers/net/ethernet/8390/Kconfig    |   14 +-
>  drivers/net/ethernet/8390/Makefile   |    1 +
>  drivers/net/ethernet/8390/xsurf100.c |  411 ++++++++++++++++++++++++++++++++++
>  3 files changed, 425 insertions(+), 1 deletions(-)
>  create mode 100644 drivers/net/ethernet/8390/xsurf100.c
> 
> diff --git a/drivers/net/ethernet/8390/Kconfig b/drivers/net/ethernet/8390/Kconfig
> index fdc6734..0cadd45 100644
> --- a/drivers/net/ethernet/8390/Kconfig
> +++ b/drivers/net/ethernet/8390/Kconfig
> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>  
>  config AX88796
>  	tristate "ASIX AX88796 NE2000 clone support"
> -	depends on (ARM || MIPS || SUPERH)
> +	depends on (ARM || MIPS || SUPERH || AMIGA)

Hi Michael

Will it compile on other platforms? If so, it is a good idea to add
COMPILE_TEST as well.

	     Andrew

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-17 13:30 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Andy Gospodarek, Or Gerlitz, David Miller, Anjali Singhai Jain,
	Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <f6a731b1-d8a5-d9f4-f758-4f36070fd02c@intel.com>

On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
> 
> On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
> > > <sridhar.samudrala@intel.com> wrote:
> > > 
> > > > I meant between PFs on 2 compute nodes.
> > > If the PF serves as uplink rep, it functions as  a switch port -- applications
> > > don't run on switch ports. One way to get apps to run on the host in switchdev
> > > mode is probe one of the VFs there.
> > > 
> > > 
> > > 
> So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
> seen on the host, no more PF netdev.

That is not the functionality I would propose.  The PF netdev will still be there.

> Are you going to expose another way to change sriov_num_vfs when the device is in
> 'switchdev' mode OR do we need to switch to 'legacy' mode to increase/decrease the number of
> VFs?

Since the PF netdev will not disappear, the standard ways to configure the
number of VFs, etc. are still available.

> Even in switchdev mode, i guess it will be possible for host apps to use the IP configured
> on the uplink rep to talk externally.
> 
> In case of multiple uplinks, are you exposing one uplink-rep netdev per uplink?

^ permalink raw reply

* Re: tcp hang when socket fills up ?
From: Florian Westphal @ 2018-04-17 13:29 UTC (permalink / raw)
  To: Michal Kubecek, netdev, Florian Westphal, Marcelo Ricardo Leitner,
	Eric Dumazet
  Cc: Jozsef Kadlecsik
In-Reply-To: <20180417123437.GA19885@nautica>

Dominique Martinet <asmadeus@codewreck.org> wrote:

[ CC Jozsef ]

> Could it have something to do with the way I setup the connection?
> I don't think the "both remotes call connect() with carefully selected
> source/dest port" is a very common case..
> 
> If you look at the tcpdump outputs I attached the sequence usually is
> something like
>  server > client SYN
>  client > server SYN
>  server > client SYNACK
>  client > server ACK
> 
> ultimately it IS a connection, but with an extra SYN packet in front of
> it (that first SYN opens up the conntrack of the nat so that the
> client's syn can come in, the client's conntrack will be that of a
> normal connection since its first SYN goes in directly after the
> server's (it didn't see the server's SYN))
> 
> Looking at my logs again, I'm seeing the same as you:
> 
> This looks like the actual SYN/SYN/SYNACK/ACK:
>  - 14.364090 seq=505004283 likely SYN coming out of server
>  - 14.661731 seq=1913287797 on next line it says receiver
> end=505004284 so likely the matching SYN from client
> Which this time gets a proper SYNACK from server:
> 14.662020 seq=505004283 ack=1913287798
> And following final dataless ACK:
> 14.687570 seq=1913287798 ack=505004284
> 
> Then as you point out some data ACK, where the scale poofs:
> 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> 14.688824 tcp_in_window: 
> 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> 14.688882 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
>
> As you say, only tcp_options() will clear just one side of the scales.
> We don't have sender->td_maxwin == 0 (printed) so I see no other way
> than we are in the last else if:
>  - we have after(end, sender->td_end) (end=1913287819 > sender
> end=1913287798)
>  - I assume the tcp state machine must be confused because of the
> SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> but since this is a data packet it doesn't have the tcp option for scale
> thus scale resets.

Yes, this looks correct. Jozsef, can you please have a look?

The problem seems to be that conntrack believes that the ACK packet
re-initializes the connection:

                /*
                 * RFC 793: "if a TCP is reinitialized ... then it need
                 * not wait at all; it must only be sure to use sequence
                 * numbers larger than those recently used."
                 */
                sender->td_end =
                sender->td_maxend = end;
                sender->td_maxwin = (win == 0 ? 1 : win);

                tcp_options(skb, dataoff, tcph, sender);

and the last line clears the scale value (no wscale option in a data packet).


Transitions are:
 server > client SYN          sNO -> sSS
 client > server SYN          sSS -> sS2
 server > client SYNACK       sS2 -> sSR /* here */
 client > server ACK          sSR -> sES

SYN/ACK was observed in original direction so we hit
state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
when we see the ack packet and end up in the 'TCP is reinitialized' branch.

AFAICS, without this, connection would move to sES just fine,
as the data ack is in window.
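
To illustrate why re-running the option parser on a data packet loses the
scale, here is a small userspace model. It assumes the parser resets the scale
to 0 before walking the options, as described above. The struct and the
function name parse_scale() are hypothetical; this is not the conntrack code.

```c
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_EOL	0
#define TCPOPT_NOP	1
#define TCPOPT_WINDOW	3

/* Hypothetical, trimmed-down per-direction state. */
struct seq_state {
	uint8_t td_scale;
};

/*
 * Model of an option parser that starts from scale = 0 and only sets it
 * again when a window scale option is present in this packet.
 */
static void parse_scale(const uint8_t *opts, size_t len, struct seq_state *st)
{
	size_t i = 0;

	st->td_scale = 0;	/* reset first; this is the step that bites */

	while (i < len) {
		uint8_t kind = opts[i];

		if (kind == TCPOPT_EOL)
			return;
		if (kind == TCPOPT_NOP) {
			i++;
			continue;
		}
		/* Every other option carries a length byte. */
		if (i + 1 >= len || opts[i + 1] < 2 || i + opts[i + 1] > len)
			return;	/* malformed option list */
		if (kind == TCPOPT_WINDOW && opts[i + 1] == 3)
			st->td_scale = opts[i + 2] > 14 ? 14 : opts[i + 2];
		i += opts[i + 1];
	}
}
```

Feeding it SYN options that contain wscale 7 leaves td_scale at 7. Feeding it
the empty option list of a data ACK afterwards leaves td_scale at 0, which
matches the scale=7 to scale=0 transition in the quoted log.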

^ permalink raw reply

* Re: [PATCH] net: change the comment of dev_mc_init
From: David Miller @ 2018-04-17 13:37 UTC (permalink / raw)
  To: sunlw.fnst; +Cc: netdev
In-Reply-To: <e893355c-0a4c-8db5-c812-68e8edd2ded4@cn.fujitsu.com>

From: sunlianwen <sunlw.fnst@cn.fujitsu.com>
Date: Tue, 17 Apr 2018 15:27:01 +0800

> the comment of dev_mc_init() is wrong. which use dev_mc_flush
> instead of dev_mc_init.
> 
> 
> Signed-off-by:Lianwen Sun <sunlw.fnst@cn.fujitsu.com>

Please don't put so many empty lines, one is enough to separate your
commit log message from the signoff.

> diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
> index e3e6a3e2ca22..d884d8f5f0e5 100644
> --- a/net/core/dev_addr_lists.c
> +++ b/net/core/dev_addr_lists.c
> @@ -839,7 +839,7 @@ void dev_mc_flush(struct net_device *dev)
>  EXPORT_SYMBOL(dev_mc_flush);
> 
>  /**
> - *     dev_mc_flush - Init multicast address list
> + *     dev_mc_init - Init multicast address list
>   *     @dev: device
>   *
>   *     Init multicast address list.
> -- 
> 2.17.0

This patch was corrupted by your email client, it transformed TAB
characters into sequences of SPACEs.

^ permalink raw reply

* Re: [PATCH] VSOCK: make af_vsock.ko removable again
From: David Miller @ 2018-04-17 13:45 UTC (permalink / raw)
  To: stefanha; +Cc: netdev, xiyou.wangcong, jhansen
In-Reply-To: <20180417062558.18018-1-stefanha@redhat.com>

From: Stefan Hajnoczi <stefanha@redhat.com>
Date: Tue, 17 Apr 2018 14:25:58 +0800

> Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> vsock_init_tables()") introduced a module_init() function without a
> corresponding module_exit() function.
> 
> Modules with an init function can only be removed if they also have an
> exit function.  Therefore the vsock module was considered "permanent"
> and could not be removed.
> 
> This patch adds an empty module_exit() function so that "rmmod vsock"
> works.  No explicit cleanup is required because:
> 
> 1. Transports call vsock_core_exit() upon exit and cannot be removed
>    while sockets are still alive.
> 2. vsock_diag.ko does not perform any action that requires cleanup by
>    vsock.ko.
> 
> Reported-by: Xiumei Mu <xmu@redhat.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Jorgen Hansen <jhansen@vmware.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Applied, but please provide a proper Fixes: tag next time.  I added it
for you this time.

^ permalink raw reply

* Re: [PATCH v2 3/8] net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
From: David Miller @ 2018-04-17 13:51 UTC (permalink / raw)
  To: geert
  Cc: schmitzmic, glaubitz, netdev, andrew, linux-m68k, Michael.Karcher,
	kernel
In-Reply-To: <CAMuHMdUZa31pbmyieoqyh-RnsRtFKPOVbrzC9DveV8h0sZh_qw@mail.gmail.com>

From: Geert Uytterhoeven <geert@linux-m68k.org>
Date: Tue, 17 Apr 2018 10:20:25 +0200

> BTW, I have a git alias for that:
> 
> $ git help fixes
> `git fixes' is aliased to `show --format='Fixes: %h ("%s")' -s'
> $ git fixes 82533ad9a1c
> Fixes: 82533ad9a1c ("net: ethernet: ax88796: don't call free_irq
> without request_irq first")

Thanks for sharing :)

^ permalink raw reply

* Re: One question about __tcp_select_window()
From: Wang Jian @ 2018-04-17 13:53 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAP4sYWWtXeux8EH0dSkxN-aZ5b7MvLWTgkhejeHz+d+2=qFQ2A@mail.gmail.com>

I tested the fix with 4.17.0-rc1+ and it seems to work.

1. iperf -c IP -i 20 -t 60 -w 1K
 with-fix vs without-fix : 1.15Gbits/sec vs 1.05Gbits/sec
I also tried other window sizes and got similar results.

2. Traced snd_wnd with tcp probe.
with-fix vs without-fix: 1245568 vs 1042816

3. I don't see any extra retransmits/drops.

On Sun, Apr 15, 2018 at 8:50 PM, Wang Jian <jianjian.wang1@gmail.com> wrote:
> Hi all,
>
> While reading the __tcp_select_window() code, I found that it may return a
> smaller window.
> Below is one scenario I thought of; it may not be right:
> In function __tcp_select_window(), assume:
> full_space is 6 MSS, free_space is 2 MSS, tp->rcv_wnd is 3 MSS.
> And assume window scaling is disabled; then
> window = tp->rcv_wnd, so "window > free_space" holds,
> and it will round down free_space and return it.
>
> Is this expected behavior? The comment is also saying
> "Get the largest window that is a nice multiple of mss."
>
> Should we do something like the below? Or am I missing something?
> I don't know how to verify it now.
>
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2680,9 +2680,9 @@ u32 __tcp_select_window(struct sock *sk)
>                  * We also don't do any window rounding when the free space
>                  * is too small.
>                  */
> -               if (window <= free_space - mss || window > free_space)
> +               if (window <= free_space - mss)
>                         window = rounddown(free_space, mss);
> -               else if (mss == full_space &&
> +               else if (window <= free_space && mss == full_space &&
>                          free_space > window + (full_space >> 1))
>                         window = free_space;
>         }
>
> Thanks.
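
The scenario from the quoted mail can be checked in userspace. The sketch
below copies only the rounding branch from the '-' and '+' lines of the diff
into two standalone functions; the function names and rounddown_int() are
hypothetical, and the rest of __tcp_select_window() is not modeled.

```c
/* Round x down to a multiple of y (y > 0). */
static int rounddown_int(int x, int y)
{
	return x - (x % y);
}

/* The rounding branch as it reads in the '-' lines of the diff. */
static int select_window_orig(int window, int free_space, int full_space,
			      int mss)
{
	if (window <= free_space - mss || window > free_space)
		window = rounddown_int(free_space, mss);
	else if (mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
	return window;
}

/* The same branch with the '+' lines applied. */
static int select_window_patched(int window, int free_space, int full_space,
				 int mss)
{
	if (window <= free_space - mss)
		window = rounddown_int(free_space, mss);
	else if (window <= free_space && mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;
	return window;
}
```

With mss = 1460, full_space = 6 * mss, free_space = 2 * mss and a current
window of 3 * mss, the original branch returns 2 * mss, while the patched
branch keeps 3 * mss.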

^ permalink raw reply

* Re: [PATCH 0/3] Receive Side Coalescing for macb driver
From: David Miller @ 2018-04-17 13:53 UTC (permalink / raw)
  To: rafalo; +Cc: nicolas.ferre, netdev, linux-kernel, piotrs, ltyrala
In-Reply-To: <BN3PR0701MB1122B2FF2A1C54CD363B16A7C9B70@BN3PR0701MB1122.namprd07.prod.outlook.com>

From: Rafal Ozieblo <rafalo@cadence.com>
Date: Tue, 17 Apr 2018 08:59:35 +0000

> If IP supports RSC and skb has 2B reserved for alignment we end up
> with no packets received correctly (2B missing in each skb).
> We can either leave a few customers without support in the Linux driver or
> let them use the driver with decreased performance.

You probably want to find a way to memmove() the headers to be aligned
in this case, and then point the data afterwards into SKB page frags.
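
One way to read this suggestion, as a userspace sketch only: move the headers
into a separate linear area at a 2-byte offset so the IP header lands on a
4-byte boundary, and leave the payload where the hardware wrote it. The struct
rx_view and split_and_align() are hypothetical names, and real driver code
would work on an skb and page frags rather than plain buffers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN	14
#define IP_ALIGN_PAD	2	/* the 2B reserve discussed above */

/* Hypothetical view of one received frame after the split. */
struct rx_view {
	uint8_t *hdr;		/* moved headers, Ethernet header first */
	size_t hdr_len;
	const uint8_t *payload;	/* left where the hardware wrote it */
	size_t payload_len;
};

/*
 * 'frame' is the buffer as received, Ethernet header at offset 0.
 * 'linear' is a separate header area, assumed to start 4-byte aligned.
 * Returns 0 on success, -1 if the buffers are too small.
 */
static int split_and_align(uint8_t *linear, size_t linear_len,
			   const uint8_t *frame, size_t frame_len,
			   size_t hdr_len, struct rx_view *out)
{
	if (hdr_len > frame_len || hdr_len > linear_len ||
	    linear_len - hdr_len < IP_ALIGN_PAD)
		return -1;

	/* Headers go to offset 2, so the IP header starts at offset 16. */
	memmove(linear + IP_ALIGN_PAD, frame, hdr_len);

	out->hdr = linear + IP_ALIGN_PAD;
	out->hdr_len = hdr_len;
	out->payload = frame + hdr_len;	/* referenced, not copied */
	out->payload_len = frame_len - hdr_len;
	return 0;
}
```

Only the headers are copied, so the per-packet cost stays small compared to
copying the whole frame.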

^ permalink raw reply

* Re: [PATCH 10/10] net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
From: Geert Uytterhoeven @ 2018-04-17 13:53 UTC (permalink / raw)
  To: Michael Schmitz; +Cc: netdev, Linux/m68k, Michael.Karcher, Michael Karcher
In-Reply-To: <1523916285-6057-11-git-send-email-schmitzmic@gmail.com>

Hi Michael,

Thanks for your patch!

On Tue, Apr 17, 2018 at 12:04 AM, Michael Schmitz <schmitzmic@gmail.com> wrote:
> Add platform device driver to populate the ax88796 platform data from
> information provided by the XSurf100 zorro device driver.
> This driver will have to be loaded before loading the ax88796 module,
> or compiled as built-in.

Is that really true? The platform device should be probed when both the
device and driver have been registered, but order shouldn't matter.

> Signed-off-by: Michael Karcher <kernel@mkarcher.dialup.fu-berlin.de>

Missing "From: Michael Karcher ..."?

> Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>

> --- a/drivers/net/ethernet/8390/Kconfig
> +++ b/drivers/net/ethernet/8390/Kconfig
> @@ -30,7 +30,7 @@ config PCMCIA_AXNET
>
>  config AX88796
>         tristate "ASIX AX88796 NE2000 clone support"
> -       depends on (ARM || MIPS || SUPERH)
> +       depends on (ARM || MIPS || SUPERH || AMIGA)

s/AMIGA/ZORRO/, for consistency with the below.

>         select CRC32
>         select PHYLIB
>         select MDIO_BITBANG
> @@ -45,6 +45,18 @@ config AX88796_93CX6
>         ---help---
>           Select this if your platform comes with an external 93CX6 eeprom.
>
> +config XSURF100
> +       tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
> +       depends on ZORRO
> +       depends on AX88796

It's a bit unfortunate the user has to enable _two_ config options to enable
this driver.

I see two solutions for that:

1) Hide the XSURF100 symbol, so it gets enabled automatically if AX88796 is
   enabled on a Zorro bus system:

    config XSURF100
            tristate
            depends on ZORRO
            default AX88796

2) Hide the AX88796 symbol, and let it be selected by XSURF100:

    config AX88796
            tristate "ASIX AX88796 NE2000 clone support" if !ZORRO
            depends on ARM || MIPS || SUPERH || ZORRO
            ...

    config XSURF100
            tristate "Amiga XSurf 100 AX88796/NE2000 clone support"
            depends on ZORRO
            select AX88796

> --- /dev/null
> +++ b/drivers/net/ethernet/8390/xsurf100.c
> @@ -0,0 +1,411 @@
> +#include <linux/module.h>
> +#include <linux/netdevice.h>
> +#include <linux/platform_device.h>
> +#include <linux/zorro.h>
> +#include <net/ax88796.h>
> +#include <asm/amigaints.h>
> +
> +#define ZORRO_PROD_INDIVIDUAL_COMPUTERS_X_SURF100 \
> +               ZORRO_ID(INDIVIDUAL_COMPUTERS, 0x64, 0)

Another long define to get rid of? ;-)

> +/* Hard reset the card. This used to pause for the same period that a
> + * 8390 reset command required, but that shouldn't be necessary.
> + */
> +static void ax_reset_8390(struct net_device *dev)
> +{
> +       struct ei_device *ei_local = netdev_priv(dev);
> +       unsigned long reset_start_time = jiffies;
> +       void __iomem *addr = (void __iomem *)dev->base_addr;
> +
> +       netif_dbg(ei_local, hw, dev, "resetting the 8390 t=%ld...\n", jiffies);
> +
> +       ei_outb(ei_inb(addr + NE_RESET), addr + NE_RESET);
> +
> +       ei_local->txing = 0;
> +       ei_local->dmaing = 0;
> +
> +       /* This check _should_not_ be necessary, omit eventually. */
> +       while ((ei_inb(addr + EN0_ISR) & ENISR_RESET) == 0) {
> +               if (time_after(jiffies, reset_start_time + 2 * HZ / 100)) {
> +                       netdev_warn(dev, "%s: did not complete.\n", __func__);
> +                       break;
> +               }

cpu_relax()?

How long does this usually take? If > 1 ms, you can use e.g. msleep(1)
instead of cpu_relax().


> +       }
> +
> +       ei_outb(ENISR_RESET, addr + EN0_ISR);   /* Ack intr. */
> +}

> +       if (ei_local->dmaing) {
> +               netdev_err(dev,
> +                          "DMAing conflict in %s "
> +                          "[DMAstat:%d][irqlock:%d].\n",

Please don't split error messages, as that makes it more difficult to
grep for them.

> +                          __func__,
> +                          ei_local->dmaing, ei_local->irqlock);
> +               return;

> +static int xsurf100_probe(struct zorro_dev *zdev,
> +                         const struct zorro_device_id *ent)
> +{

> +       /* error handling for ioremap regs */
> +       if (!ax88796_data.base_regs) {
> +               dev_err(&zdev->dev, "Cannot ioremap area %p (registers)\n",
> +                       (void *)zdev->resource.start);

Please use %pR to format struct resource.
Documentation/core-api/printk-formats.rst

> +       /* error handling for ioremap data */
> +       if (!ax88796_data.data_area) {
> +               dev_err(&zdev->dev, "Cannot ioremap area %p (32-bit access)\n",
> +                       (void *)zdev->resource.start + XS100_8390_DATA32_BASE);

%pR

> +static void xsurf100_remove(struct zorro_dev *zdev)
> +{
> +       struct platform_device *pdev;
> +       struct xsurf100_ax_plat_data *xs100;
> +
> +       pdev = zorro_get_drvdata(zdev);
> +       xs100 = dev_get_platdata(&pdev->dev);

struct platform_device *pdev = zorro_get_drvdata(zdev);
struct xsurf100_ax_plat_data *xs100 = dev_get_platdata(&pdev->dev);

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: Alexei Starovoitov @ 2018-04-17 13:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Network Development, Björn Töpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann, Tariq Toukan
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

On Tue, Apr 17, 2018 at 5:58 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Resubmit V10 against net-next, as it contains NIC driver changes.
>
> This patchset works towards supporting different XDP RX-ring memory
> allocators.  As this will be needed by the AF_XDP zero-copy mode.
>
> The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
> RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
> change through the patchset).
>
> A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
> And both ndo_xdp_xmit and the new xdp_return_frame end up using this.
>
> Support for a driver-supplied allocator is implemented, and a
> refurbished version of page_pool is the first return allocator type
> introduced.  This will be an integration point for AF_XDP zero-copy.
>
> The mlx5 driver evolves into using the page_pool, and sees a performance
> increase (with ndo_xdp_xmit out the ixgbe driver) from 6Mpps to 12Mpps.
>
>
> The patchset stops at 16 patches (one over the limit), but more API changes
> are planned.  Specifically extending ndo_xdp_xmit and xdp_return_frame
> APIs to support bulking.  As this will address some known limits.
>
> V2: Updated according to Tariq's feedback
> V3: Updated based on feedback from Jason Wang and Alex Duyck
> V4: Updated based on feedback from Tariq and Jason
> V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
> V6: Updated based on feedback from Eric Dumazet and Alex Duyck
> V7: Adapt to i40e that got XDP_REDIRECT support in-between
> V8:
>  Updated based on feedback kbuild test robot, and adjust for mlx5 changes
>  page_pool only compiled into kernel when drivers Kconfig 'select' feature
> V9:
>  Remove some inline statements, let compiler decide what to inline
>  Fix return value in virtio_net driver
>  Adjust for mlx5 changes in-between submissions
> V10:
>  Minor adjust for mlx5 requested by Tariq
>  Resubmit against net-next

Looks like you forgot to include the extra patch to fix up the
xdp_adjust_head() helper. Otherwise the reused xdp_frame at the top of the
packet leaks kernel pointers into the bpf program.
Could you please respin with that change included?

^ permalink raw reply

* Re: [PATCH net-next] liquidio: Enhanced ethtool stats
From: David Miller @ 2018-04-17 13:57 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	intiyaz.basha
In-Reply-To: <20180417063052.GA3331@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Mon, 16 Apr 2018 23:30:53 -0700

> From: Intiyaz Basha <intiyaz.basha@cavium.com>
> 
> 1. Added red_drops stats. Inbound packets dropped by RED, buffer exhaustion
> 2. Included fcs_err, jabber_err, l2_err and frame_err errors under
>    rx_errors
> 3. Included fifo_err, dmac_drop, red_drops, fw_err_pko, fw_err_link and
>    fw_err_drop under rx_dropped
> 4. Included max_collision_fail, max_deferral_fail, total_collisions,
>    fw_err_pko, fw_err_link, fw_err_drop and fw_err_pki under tx_dropped
> 5. Counting dma mapping errors
> 6. Added some firmware stats description and removed for some
> 
> Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
> Acked-by: Derek Chickles <derek.chickles@cavium.com>
> Acked-by: Satanand Burla <satananda.burla@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied, thank you.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Or Gerlitz @ 2018-04-17 13:58 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Samudrala, Sridhar, David Miller, Anjali Singhai Jain,
	Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180417133028.GI33938@C02RW35GFVH8.dhcp.broadcom.net>

On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek
<andrew.gospodarek@broadcom.com> wrote:
> On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
>>
>> On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
>> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
>> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
>> > > <sridhar.samudrala@intel.com> wrote:
>> > >
>> > > > I meant between PFs on 2 compute nodes.
>> > > If the PF serves as uplink rep, it functions as  a switch port -- applications
>> > > don't run on switch ports. One way to get apps to run on the host in switchdev
>> > > mode is probe one of the VFs there.
>> > >
>> > >
>> > >
>> So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
>> seen on the host, no more PF netdev.
>
> That is not the functionality I would propose.  The PF netdev will still be there.

Andy,

Basically LGTM. So even in smartnic configs, the PF @ the host is still
privileged to create/destroy VFs or provision MACs for them, even if it is
not the e-switch manager anymore?

Actually, AFAIK this can also work differently, e.g. a smartnic FW
"pushes" the VFs into the host without them being under a host admin directive.

^ permalink raw reply

* Re: [PATCH] net: qrtr: add MODULE_ALIAS_NETPROTO macro
From: David Miller @ 2018-04-17 13:58 UTC (permalink / raw)
  To: nicolas.dechesne; +Cc: bjorn.andersson, netdev, linux-kernel
In-Reply-To: <20180417120326.11022-1-nicolas.dechesne@linaro.org>

From: Nicolas Dechesne <nicolas.dechesne@linaro.org>
Date: Tue, 17 Apr 2018 14:03:26 +0200

> To ensure that qrtr can be loaded automatically, when needed, if it is compiled
> as module.
> 
> Signed-off-by: Nicolas Dechesne <nicolas.dechesne@linaro.org>

Applied.

^ permalink raw reply

* Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: David Miller @ 2018-04-17 14:04 UTC (permalink / raw)
  To: brouer
  Cc: netdev, bjorn.topel, magnus.karlsson, eugenia, jasowang,
	john.fastabend, eranbe, saeedm, galp, borkmann,
	alexei.starovoitov, tariqt
In-Reply-To: <152396988259.12633.11175312729378665019.stgit@firesoul>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 17 Apr 2018 14:58:52 +0200

> Resubmit V10 against net-next, as it contains NIC driver changes.

Series applied, thanks Jesper.

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-17 14:04 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180417124716.wsypd5zl4n4galrz@debian>

On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > [...]
> > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > +			      void **ctx)
> > > > > > > +{
> > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > +	unsigned int i, j;
> > > > > > > +
> > > > > > > +	/* Clear data ptr. */
> > > > > > > +	vq->desc_state[head].data = NULL;
> > > > > > > +
> > > > > > > +	i = head;
> > > > > > > +
> > > > > > > +	for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > +		desc->flags = 0x0;
> > > > > > Looks like this is unnecessary.
> > > > > It's safer to zero it. If we don't zero it, after we
> > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > this function, the desc is still available to the
> > > > > device.
> > > > 
> > > > Well detach_unused_buf_packed() should be called after device is stopped,
> > > > otherwise even if you try to clear, there will still be a window that device
> > > > may use it.
> > > 
> > > This is not about whether the device has been stopped or
> > > not. We don't have other places to re-initialize the ring
> > > descriptors and wrap_counter. So they need to be set to
> > > the correct values when doing detach_unused_buf.
> > > 
> > > Best regards,
> > > Tiwei Bie
> > 
> > find vqs is the time to do it.
> 
> The .find_vqs() will call .setup_vq() which will eventually
> call vring_create_virtqueue(). It's a different case. Here
> we're talking about re-initializing the descs and updating
> the wrap counter when detaching the unused descs (In this
> case, split ring just needs to decrease vring.avail->idx).
> 
> Best regards,
> Tiwei Bie

There's no requirement that virtqueue_detach_unused_buf re-initializes
the descs. It happens on the cleanup path just before drivers delete the
vqs.

-- 
MST

^ permalink raw reply

* Re: [net-next V10 PATCH 00/16] XDP redirect memory return API
From: David Miller @ 2018-04-17 14:07 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: brouer, netdev, bjorn.topel, magnus.karlsson, eugenia, jasowang,
	john.fastabend, eranbe, saeedm, galp, borkmann, tariqt
In-Reply-To: <CAADnVQ+fhDt-U6Kd7wAfZ759egfH8qcNPdKbBuHfBNCRqHGcfw@mail.gmail.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 17 Apr 2018 06:53:33 -0700

> Looks like you forgot to include the extra patch to fix up the
> xdp_adjust_head() helper. Otherwise the reused xdp_frame at the top of the
> packet leaks kernel pointers into the bpf program.
> Could you please respin with that change included?

Just in time, I was about to push this back out. :)

Jesper, please respin with Alexei's requested changes.

^ permalink raw reply

* [bpf-next PATCH] samples/bpf: fix xdp_monitor user output for tracepoint exception
From: Jesper Dangaard Brouer @ 2018-04-17 14:08 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Alexei Starovoitov, Jesper Dangaard Brouer

The variable rec_i contains an XDP action code not an error.
Thus, using err2str() was wrong, it should have been action2str().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/xdp_monitor_user.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdp_monitor_user.c b/samples/bpf/xdp_monitor_user.c
index eec14520d513..894bc64c2cac 100644
--- a/samples/bpf/xdp_monitor_user.c
+++ b/samples/bpf/xdp_monitor_user.c
@@ -330,7 +330,7 @@ static void stats_print(struct stats_record *stats_rec,
 			pps = calc_pps_u64(r, p, t);
 			if (pps > 0)
 				printf(fmt1, "Exception", i,
-				       0.0, pps, err2str(rec_i));
+				       0.0, pps, action2str(rec_i));
 		}
 		pps = calc_pps_u64(&rec->total, &prev->total, t);
 		if (pps > 0)
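
For readers unfamiliar with the distinction, here is a minimal sketch
of an action-code-to-name lookup in the spirit of action2str(). It is
not the code from samples/bpf/xdp_monitor_user.c: the enum, the table
and the function name are hypothetical stand-ins, and only the numeric
values (XDP_ABORTED = 0 through XDP_REDIRECT = 4) follow enum
xdp_action in the uapi header.

```c
#include <stddef.h>

/* Local stand-in for enum xdp_action; names are hypothetical. */
enum xdp_action_sketch {
	SK_XDP_ABORTED = 0,
	SK_XDP_DROP,
	SK_XDP_PASS,
	SK_XDP_TX,
	SK_XDP_REDIRECT,
};

/* One printable name per action code, indexed by the code itself. */
static const char *const action_names_sketch[] = {
	[SK_XDP_ABORTED]  = "XDP_ABORTED",
	[SK_XDP_DROP]     = "XDP_DROP",
	[SK_XDP_PASS]     = "XDP_PASS",
	[SK_XDP_TX]       = "XDP_TX",
	[SK_XDP_REDIRECT] = "XDP_REDIRECT",
};

/* Return the name for an action code, or NULL if it is out of range. */
const char *action2str_sketch(int action)
{
	size_t n = sizeof(action_names_sketch) / sizeof(action_names_sketch[0]);

	if (action >= 0 && (size_t)action < n)
		return action_names_sketch[action];
	return NULL;
}
```

Feeding such an index to a table of error labels instead would print a
label unrelated to the action that raised the exception, which is the
mismatch the patch removes.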

* Re: [PATCH][net-next] net: ip tos cgroup
From: Daniel Borkmann @ 2018-04-17 14:11 UTC (permalink / raw)
  To: Li RongQing; +Cc: netdev, tj, ast, brakmo
In-Reply-To: <1523936161-11676-1-git-send-email-lirongqing@baidu.com>

On 04/17/2018 05:36 AM, Li RongQing wrote:
> ip tos segment can be changed by setsockopt(IP_TOS), or by iptables;
> this patch creates a new method to change socket tos segment of
> processes based on cgroup
> 
> The usage:
> 
>     1. mount ip_tos cgroup, and setting tos value
>     mount -t cgroup -o ip_tos ip_tos /cgroups/tos
>     echo tos_value >/cgroups/tos/ip_tos.tos
>     2. then move processes to cgroup, or create processes in cgroup
> 
> Signed-off-by: jimyan <jimyan@baidu.com>
> Signed-off-by: Li RongQing <lirongqing@baidu.com>

This functionality is already possible with the help of BPF programs
attached to cgroups. Have you had a chance to look into that?

>  include/linux/cgroup_subsys.h |   4 ++
>  include/net/tos_cgroup.h      |  35 ++++++++++++
>  net/ipv4/Kconfig              |  10 ++++
>  net/ipv4/Makefile             |   1 +
>  net/ipv4/af_inet.c            |   2 +
>  net/ipv4/tos_cgroup.c         | 128 ++++++++++++++++++++++++++++++++++++++++++
>  net/ipv6/af_inet6.c           |   2 +
>  7 files changed, 182 insertions(+)
>  create mode 100644 include/net/tos_cgroup.h
>  create mode 100644 net/ipv4/tos_cgroup.c
> 
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index acb77dcff3b4..1b86eda1c23e 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -61,6 +61,10 @@ SUBSYS(pids)
>  SUBSYS(rdma)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_IP_TOS_CGROUP)
> +SUBSYS(ip_tos)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/include/net/tos_cgroup.h b/include/net/tos_cgroup.h
> new file mode 100644
> index 000000000000..0868e921faf3
> --- /dev/null
> +++ b/include/net/tos_cgroup.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _IP_TOS_CGROUP_H
> +#define _IP_TOS_CGROUP_H
> +
> +#include <linux/cgroup.h>
> +#include <linux/hardirq.h>
> +
> +struct tos_cgroup_state {
> +	struct cgroup_subsys_state css;
> +	u32 tos;
> +};
> +
> +#if IS_ENABLED(CONFIG_IP_TOS_CGROUP)
> +static inline u32 task_ip_tos(struct task_struct *p)
> +{
> +	u32 tos;
> +
> +	if (in_interrupt())
> +		return 0;
> +
> +	rcu_read_lock();
> +	tos = container_of(task_css(p, ip_tos_cgrp_id),
> +			struct tos_cgroup_state, css)->tos;
> +	rcu_read_unlock();
> +
> +	return tos;
> +}
> +#else /* !CONFIG_IP_TOS_CGROUP */
> +static inline u32 task_ip_tos(struct task_struct *p)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_IP_TOS_CGROUP */
> +#endif  /* _IP_TOS_CGROUP_H */
> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 80dad301361d..57070bbb0394 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -753,3 +753,13 @@ config TCP_MD5SIG
>  	  on the Internet.
>  
>  	  If unsure, say N.
> +
> +config IP_TOS_CGROUP
> +	bool "ip tos cgroup"
> +	depends on CGROUPS
> +	default n
> +	---help---
> +	  Say Y here if you want to set ip packet tos based on the
> +	  control cgroup of their process.
> +
> +	  This can set ip packet tos
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index a07b7dd06def..12c708142d1f 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -61,6 +61,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
>  obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
>  obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
>  obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
> +obj-$(CONFIG_IP_TOS_CGROUP) += tos_cgroup.o
>  
>  obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
>  		      xfrm4_output.o xfrm4_protocol.o
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index eaed0367e669..e2dd902b06dd 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -120,6 +120,7 @@
>  #include <linux/mroute.h>
>  #endif
>  #include <net/l3mdev.h>
> +#include <net/tos_cgroup.h>
>  
>  #include <trace/events/sock.h>
>  
> @@ -356,6 +357,7 @@ static int inet_create(struct net *net, struct socket *sock, int protocol,
>  	inet->mc_index	= 0;
>  	inet->mc_list	= NULL;
>  	inet->rcv_tos	= 0;
> +	inet->tos       = task_ip_tos(current);
>  
>  	sk_refcnt_debug_inc(sk);
>  
> diff --git a/net/ipv4/tos_cgroup.c b/net/ipv4/tos_cgroup.c
> new file mode 100644
> index 000000000000..dbb828f5b464
> --- /dev/null
> +++ b/net/ipv4/tos_cgroup.c
> @@ -0,0 +1,128 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/module.h>
> +#include <linux/types.h>
> +#include <linux/errno.h>
> +#include <linux/slab.h>
> +#include <linux/skbuff.h>
> +#include <linux/cgroup.h>
> +#include <net/sock.h>
> +#include <net/inet_sock.h>
> +#include <net/tos_cgroup.h>
> +#include <linux/fdtable.h>
> +#include <net/route.h>
> +#include <net/inet_ecn.h>
> +#include <linux/sched/task.h>
> +
> +static inline
> +struct tos_cgroup_state *css_tos_cgroup(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct tos_cgroup_state, css) : NULL;
> +}
> +
> +static inline struct tos_cgroup_state *task_tos_cgroup(struct task_struct *task)
> +{
> +	return css_tos_cgroup(task_css(task, ip_tos_cgrp_id));
> +}
> +
> +static struct cgroup_subsys_state
> +*cgrp_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> +	struct tos_cgroup_state *cs;
> +
> +	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
> +	if (!cs)
> +		return ERR_PTR(-ENOMEM);
> +
> +	return &cs->css;
> +}
> +
> +static void cgrp_css_free(struct cgroup_subsys_state *css)
> +{
> +	kfree(css_tos_cgroup(css));
> +}
> +
> +static int update_tos(const void *v, struct file *file, unsigned int n)
> +{
> +	int err;
> +	struct socket *sock = sock_from_file(file, &err);
> +	unsigned char val = (unsigned char)*(u64 *)v;
> +
> +	if (sock && (sock->sk->sk_family == PF_INET ||
> +				sock->sk->sk_family == PF_INET6)) {
> +		struct inet_sock *inet = inet_sk(sock->sk);
> +
> +		lock_sock(sock->sk);
> +		if (sock->sk->sk_type == SOCK_STREAM) {
> +			val &= ~INET_ECN_MASK;
> +			val |= inet->tos & INET_ECN_MASK;
> +		}
> +		if (inet->tos != val) {
> +			inet->tos = val;
> +			sock->sk->sk_priority = rt_tos2priority(val);
> +			sk_dst_reset(sock->sk);
> +		}
> +		release_sock(sock->sk);
> +	}
> +	return 0;
> +}
> +
> +static void cgrp_attach(struct cgroup_taskset *tset)
> +{
> +	struct task_struct *p;
> +	struct cgroup_subsys_state *css;
> +	u64 v;
> +
> +	cgroup_taskset_for_each(p, css, tset) {
> +		task_lock(p);
> +		v = task_tos_cgroup(p)->tos;
> +		iterate_fd(p->files, 0, update_tos, (void *)&v);
> +		task_unlock(p);
> +	}
> +}
> +
> +static u64 read_tos(struct cgroup_subsys_state *css, struct cftype *cft)
> +{
> +	return css_tos_cgroup(css)->tos;
> +}
> +
> +static int
> +write_tos(struct cgroup_subsys_state *css, struct cftype *cft, u64 value)
> +{
> +	struct css_task_iter it;
> +	struct task_struct *task = NULL;
> +
> +	if (value < 0 || value > 255) {
> +		pr_info("Invalid TOS value\n");
> +		return 0;
> +	}
> +
> +	css_tos_cgroup(css)->tos = (u32)value;
> +
> +	css_task_iter_start(css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		task_lock(task);
> +		iterate_fd(task->files, 0, update_tos, (void *)&value);
> +		task_unlock(task);
> +	}
> +	css_task_iter_end(&it);
> +
> +	return 0;
> +}
> +
> +static struct cftype ss_files[] = {
> +	{
> +		.name = "tos",
> +		.read_u64 = read_tos,
> +		.write_u64 = write_tos,
> +	},
> +	{ }	/* terminate */
> +};
> +
> +struct cgroup_subsys ip_tos_cgrp_subsys = {
> +	.css_alloc	= cgrp_css_alloc,
> +	.css_free	= cgrp_css_free,
> +	.attach		= cgrp_attach,
> +	.legacy_cftypes	= ss_files,
> +};
> +
> +MODULE_LICENSE("GPL v2");
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 8da0b513f188..d33e240613e0 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -44,6 +44,7 @@
>  #include <linux/icmpv6.h>
>  #include <linux/netfilter_ipv6.h>
>  
> +#include <net/tos_cgroup.h>
>  #include <net/ip.h>
>  #include <net/ipv6.h>
>  #include <net/udp.h>
> @@ -223,6 +224,7 @@ static int inet6_create(struct net *net, struct socket *sock, int protocol,
>  	inet->mc_index	= 0;
>  	inet->mc_list	= NULL;
>  	inet->rcv_tos	= 0;
> +	inet->tos       = task_ip_tos(current);
>  
>  	if (net->ipv4.sysctl_ip_no_pmtu_disc)
>  		inet->pmtudisc = IP_PMTUDISC_DONT;
> 
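
As a side note on the quoted update_tos(): for SOCK_STREAM sockets it
keeps the socket's current ECN bits and takes only the remaining bits
from the cgroup value. A minimal user-space sketch of that merge step
follows. The function and macro names are hypothetical; the mask value
3 mirrors INET_ECN_MASK from include/net/inet_ecn.h.

```c
#include <stdint.h>

/* Low two bits of the TOS byte carry ECN; mirrors INET_ECN_MASK. */
#define ECN_MASK_SKETCH 0x3u

/*
 * Combine a requested TOS value with the ECN bits already set on the
 * socket, as the quoted patch does for stream sockets.
 */
uint8_t merge_tos_keep_ecn(uint8_t current_tos, uint8_t requested_tos)
{
	/* Drop whatever ECN bits the caller asked for. */
	uint8_t val = requested_tos & (uint8_t)~ECN_MASK_SKETCH;

	/* Carry over the ECN bits the socket already has. */
	val |= current_tos & ECN_MASK_SKETCH;
	return val;
}
```

With a current TOS of 0x02 and a requested value of 0x13, the result
is 0x12: the upper six bits come from the request, the lower two from
the socket.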

* Re: [PATCH/RFC net-next 1/5] ravb: fix inconsistent lock state at enabling tx timestamp
From: David Miller @ 2018-04-17 14:13 UTC (permalink / raw)
  To: horms+renesas
  Cc: sergei.shtylyov, magnus.damm, netdev, linux-renesas-soc,
	wsa+renesas, masaru.nagai.vx, kazuya.mizuguchi.ks
In-Reply-To: <20180417085030.32650-2-horms+renesas@verge.net.au>

From: Simon Horman <horms+renesas@verge.net.au>
Date: Tue, 17 Apr 2018 10:50:26 +0200

> From: Masaru Nagai <masaru.nagai.vx@renesas.com>
> 
> [   58.490829] =================================
> [   58.495205] [ INFO: inconsistent lock state ]
> [   58.499583] 4.9.0-yocto-standard-00007-g2ef7caf #57 Not tainted
 ...
> Fixes: f51bdc236b6c ("ravb: Add dma queue interrupt support")
> Signed-off-by: Masaru Nagai <masaru.nagai.vx@renesas.com>
> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>

The commit message really needs more than the lockdep dump: it should
explain what the problem is and how it was corrected.

Are the wrong interrupt types enabled?  Are they handled improperly?
It definitely isn't clear from just reading the patch.

* Re: [PATCH bpf-next 01/10] [bpf]: adding bpf_xdp_adjust_tail helper
From: kbuild test robot @ 2018-04-17 14:14 UTC (permalink / raw)
  To: Nikita V. Shirokov
  Cc: kbuild-all, Alexei Starovoitov, Daniel Borkmann, netdev,
	Nikita V. Shirokov
In-Reply-To: <20180417065131.3632-2-tehnerd@tehnerd.com>

[-- Attachment #1: Type: text/plain, Size: 1403 bytes --]

Hi Nikita,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Nikita-V-Shirokov/introduction-of-bpf_xdp_adjust_tail/20180417-211905
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-s1-201815 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   net/core/filter.c: In function '____bpf_xdp_adjust_tail':
>> net/core/filter.c:2726:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     void *data_end = xdp->data_end + offset;
     ^~~~

vim +2726 net/core/filter.c

  2719	
  2720	BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
  2721	{
  2722		/* only shrinking is allowed for now. */
  2723		if (unlikely(offset > 0))
  2724			return -EINVAL;
  2725	
> 2726		void *data_end = xdp->data_end + offset;
  2727	
  2728		if (unlikely(data_end < xdp->data + ETH_HLEN))
  2729			return -EINVAL;
  2730	
  2731		xdp->data_end = data_end;
  2732	
  2733		return 0;
  2734	}
  2735	
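
One common way to silence -Wdeclaration-after-statement is to hoist
the declaration to the top of the block and assign later. The sketch
below shows that shape in plain user-space C. It is only an
illustration, not the fix that was eventually applied: the struct, the
function name and the ETH_HLEN stand-in are hypothetical.

```c
#include <errno.h>

#define ETH_HLEN_SKETCH 14	/* stand-in for ETH_HLEN */

/* Reduced stand-in for struct xdp_buff with only the fields used here. */
struct xdp_buff_sketch {
	unsigned char *data;
	unsigned char *data_end;
};

int adjust_tail_sketch(struct xdp_buff_sketch *xdp, int offset)
{
	/* Declared before any statement, so C90-style rules are satisfied. */
	unsigned char *data_end;

	/* Only shrinking is allowed, as in the quoted helper. */
	if (offset > 0)
		return -EINVAL;

	data_end = xdp->data_end + offset;
	if (data_end < xdp->data + ETH_HLEN_SKETCH)
		return -EINVAL;

	xdp->data_end = data_end;
	return 0;
}
```

Behaviour is unchanged; only the position of the declaration moves.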

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28627 bytes --]

* Re: [PATCH/RFC net-next 4/5] ravb: remove undocumented processing
From: David Miller @ 2018-04-17 14:14 UTC (permalink / raw)
  To: horms+renesas
  Cc: sergei.shtylyov, magnus.damm, netdev, linux-renesas-soc,
	wsa+renesas, kazuya.mizuguchi.ks
In-Reply-To: <20180417085030.32650-5-horms+renesas@verge.net.au>

From: Simon Horman <horms+renesas@verge.net.au>
Date: Tue, 17 Apr 2018 10:50:29 +0200

> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> 
> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>

Why?  What was wrong with it?

Need more text and explanations in these commit messages please.

* Re: [PATCH/RFC net-next 3/5] ravb: do not write 1 to reserved bits
From: David Miller @ 2018-04-17 14:15 UTC (permalink / raw)
  To: horms+renesas
  Cc: sergei.shtylyov, magnus.damm, netdev, linux-renesas-soc,
	wsa+renesas, kazuya.mizuguchi.ks
In-Reply-To: <20180417085030.32650-4-horms+renesas@verge.net.au>

From: Simon Horman <horms+renesas@verge.net.au>
Date: Tue, 17 Apr 2018 10:50:28 +0200

> From: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> 
> This patch corrects writing 1 to reserved bits.
> The write value should be 0.
> 
> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>

How are we ending up in situations where the driver is trying to write
non-zero values to those fields in the first place?

The places creating those values should be making sure that the
reserved bits are never set.

If you mask out the reserved bits in the register writing locations,
this just hides bugs.
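
To illustrate the point, here is a minimal sketch with a made-up
register layout (none of these names or bit positions come from the
ravb driver). The value is composed only from named fields, so
reserved bits cannot be set by construction, and a separate check
reports a bad value instead of silently masking it.

```c
#include <stdint.h>

/* Hypothetical register: bits 1:0 select a mode, bit 4 enables. */
#define REG_MODE_MASK_SKETCH	0x00000003u
#define REG_ENABLE_SKETCH	0x00000010u
#define REG_VALID_MASK_SKETCH	(REG_MODE_MASK_SKETCH | REG_ENABLE_SKETCH)

enum reg_mode_sketch {
	REG_MODE_A = 0,
	REG_MODE_B = 1,
	REG_MODE_C = 2,
};

/* Compose the value from documented fields only. */
uint32_t reg_build_sketch(enum reg_mode_sketch mode, int enable)
{
	return (uint32_t)mode | (enable ? REG_ENABLE_SKETCH : 0u);
}

/* Return 1 if val touches documented bits only, 0 otherwise. */
int reg_value_is_clean_sketch(uint32_t val)
{
	return (val & ~REG_VALID_MASK_SKETCH) == 0;
}
```

A write helper could warn when reg_value_is_clean_sketch() fails, so
that the code producing the bad value gets noticed and fixed.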

* [PATCH bpf-next v3 0/8] bpf: document eBPF helpers and add a script to generate man page
From: Quentin Monnet @ 2018-04-17 14:34 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man

eBPF helper functions can be called from within eBPF programs to perform
a variety of tasks that would be otherwise hard or impossible to do with
eBPF itself. There is a growing number of such helper functions in the
kernel, but documentation is scarce. The main user space header file
does contain a short commented description of most helpers, but it is
somewhat outdated and incomplete. It is more of a "cheat sheet" than
real documentation accessible to new eBPF developers.

This commit attempts to improve the situation by replacing the existing
overview for the helpers with a more developed description. Furthermore,
a Python script is added to generate a manual page for eBPF helpers. The
workflow is the following, and requires the rst2man utility:

    $ ./scripts/bpf_helpers_doc.py \
            --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
    $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
    $ man /tmp/bpf-helpers.7

The objective is to keep all documentation related to the helpers in a
single place, and to be able to generate from here a manual page that
could be packaged in the man-pages repository and shipped with most
distributions.

Additionally, parsing the prototypes of the helper functions could
hopefully be reused, with a different Printer object, to generate
header files needed in some eBPF-related projects.

Regarding the description of each helper, it comprises several items:

- The function prototype.
- A description of the function and of its arguments (except for a
  couple of cases, when there are no arguments and the return value
  makes the function usage really obvious).
- A description of return values (if not void).

Additional items such as the list of compatible eBPF program and map
types for each helper, Linux kernel version that introduced the helper,
GPL-only restriction, and commit hash could be added in the future, but
it was decided on the mailing list to leave them aside for now.

For several helpers, descriptions are inspired (at times, nearly copied)
from the commit logs introducing them in the kernel. Many thanks to
their respective authors! They were completed as much as possible, the
objective being to have something easily accessible even for people just
starting with eBPF. There is probably a bit more work to do in this
direction for some helpers.

Some RST formatting is used in the descriptions (not in function
prototypes, to keep them readable, but the Python script provided in
order to generate the RST for the manual page does add formatting to
prototypes, to produce something pretty) to get "bold" and "italics" in
manual pages. Hopefully, the descriptions in the bpf.h file remain
perfectly readable. Note that the few trailing white spaces are
intentional; removing them would break paragraphs for rst2man.

The descriptions should ideally be updated each time someone adds a new
helper, or updates the behaviour (new socket option supported, ...) or
the interface (new flags available, ...) of existing ones.

To ease the review process, the documentation has been split into several
patches.

RFC v2 -> PATCH v3:
Several fixes on man page header and footer, and helpers documentation.
Please refer to individual patches for details.

RFC v1 -> RFC v2:
- Remove "For" (compatible program and map types), "Since" (minimal
  Linux kernel version required), "GPL only" sections and commit hashes
  for the helpers.
- Add comment on top of the description list to explain how this
  documentation is supposed to be processed.
- Update Python script accordingly (remove the same sections, and remove
  paragraphs on program types and GPL restrictions from man page
  header).
- Split series into several patches.

Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org

Quentin Monnet (8):
  bpf: add script and prepare bpf.h for new helpers documentation
  bpf: add documentation for eBPF helpers (01-11)
  bpf: add documentation for eBPF helpers (12-22)
  bpf: add documentation for eBPF helpers (23-32)
  bpf: add documentation for eBPF helpers (33-41)
  bpf: add documentation for eBPF helpers (42-50)
  bpf: add documentation for eBPF helpers (51-57)
  bpf: add documentation for eBPF helpers (58-64)

 include/uapi/linux/bpf.h   | 1604 +++++++++++++++++++++++++++++++++-----------
 scripts/bpf_helpers_doc.py |  434 ++++++++++++
 2 files changed, 1658 insertions(+), 380 deletions(-)
 create mode 100755 scripts/bpf_helpers_doc.py

-- 
2.14.1

* [PATCH bpf-next v3 1/8] bpf: add script and prepare bpf.h for new helpers documentation
From: Quentin Monnet @ 2018-04-17 14:34 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180417143438.7018-1-quentin.monnet@netronome.com>

Remove previous "overview" of eBPF helpers from user bpf.h header.
Replace it with a comment explaining how to process the new documentation
(to come in following patches) with a Python script to produce RST, then
man page documentation.

Also add the aforementioned Python script under scripts/. It is used to
process include/uapi/linux/bpf.h and extract helper descriptions, turning
them into an RST document that can then be processed with rst2man
to produce a man page. The script takes one "--filename <path/to/file>"
option. If the script is launched from scripts/ in the kernel root
directory, it should be able to find the location of the header to
parse, and "--filename <path/to/file>" is then optional. If it cannot
find the file, then the option becomes mandatory. RST-formatted
documentation is printed to standard output.

Typical workflow for producing the final man page would be:

    $ ./scripts/bpf_helpers_doc.py \
            --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
    $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
    $ man /tmp/bpf-helpers.7

Note that the tool kernel-doc cannot be used to document eBPF helpers,
whose signatures are not available directly in the header files
(pre-processor directives are used to produce them at the beginning of
the compilation process).
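
As a rough illustration of why a prototype-oriented tool has little to
work with here, the sketch below shows the kind of pre-processor
pattern involved: a single mapper macro lists the helper names, and
other macros expand it into an enum and a name table. It is modelled
on the helper ID list in bpf.h as I understand it, but every
identifier below is a hypothetical stand-in, not the real macro or
enum name.

```c
#include <stddef.h>

/* Single list of helper names; FN is applied to each entry. */
#define HELPER_MAPPER_SKETCH(FN)	\
	FN(unspec),			\
	FN(map_lookup_elem),		\
	FN(map_update_elem),		\
	FN(map_delete_elem),

/* First expansion: one enumerator per helper. */
#define AS_ENUM_SKETCH(x) HELPER_ ## x
enum helper_id_sketch {
	HELPER_MAPPER_SKETCH(AS_ENUM_SKETCH)
	HELPER_MAX_ID,
};
#undef AS_ENUM_SKETCH

/* Second expansion: one string per helper, in the same order. */
#define AS_NAME_SKETCH(x) #x
static const char *const helper_names_sketch[] = {
	HELPER_MAPPER_SKETCH(AS_NAME_SKETCH)
};
#undef AS_NAME_SKETCH

/* Return the helper name for an ID, or NULL if the ID is out of range. */
const char *helper_name_sketch(int id)
{
	if (id < 0 || id >= HELPER_MAX_ID)
		return NULL;
	return helper_names_sketch[id];
}
```

Nothing in such a header spells out a full C prototype for a helper,
so the descriptions have to carry the prototypes themselves for a
script to pick up.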

v3:
- Change license for man page.
- Remove "for safety reasons" from man page header text.
- Change "packets metadata" to "packets" in man page header text.
- Move and fix comment on helpers introducing no overhead.
- Remove "NOTES" section from man page footer.
- Add "LICENSE" section to man page footer.
- Edit description of file include/uapi/linux/bpf.h in man page footer.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 include/uapi/linux/bpf.h   | 406 ++----------------------------------------
 scripts/bpf_helpers_doc.py | 434 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 450 insertions(+), 390 deletions(-)
 create mode 100755 scripts/bpf_helpers_doc.py

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..45f77f01e672 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -365,396 +365,22 @@ union bpf_attr {
 	} raw_tracepoint;
 } __attribute__((aligned(8)));
 
-/* BPF helper function descriptions:
- *
- * void *bpf_map_lookup_elem(&map, &key)
- *     Return: Map value or NULL
- *
- * int bpf_map_update_elem(&map, &key, &value, flags)
- *     Return: 0 on success or negative error
- *
- * int bpf_map_delete_elem(&map, &key)
- *     Return: 0 on success or negative error
- *
- * int bpf_probe_read(void *dst, int size, void *src)
- *     Return: 0 on success or negative error
- *
- * u64 bpf_ktime_get_ns(void)
- *     Return: current ktime
- *
- * int bpf_trace_printk(const char *fmt, int fmt_size, ...)
- *     Return: length of buffer written or negative error
- *
- * u32 bpf_prandom_u32(void)
- *     Return: random value
- *
- * u32 bpf_raw_smp_processor_id(void)
- *     Return: SMP processor ID
- *
- * int bpf_skb_store_bytes(skb, offset, from, len, flags)
- *     store bytes into packet
- *     @skb: pointer to skb
- *     @offset: offset within packet from skb->mac_header
- *     @from: pointer where to copy bytes from
- *     @len: number of bytes to store into packet
- *     @flags: bit 0 - if true, recompute skb->csum
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_l3_csum_replace(skb, offset, from, to, flags)
- *     recompute IP checksum
- *     @skb: pointer to skb
- *     @offset: offset within packet where IP checksum is located
- *     @from: old value of header field
- *     @to: new value of header field
- *     @flags: bits 0-3 - size of header field
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_l4_csum_replace(skb, offset, from, to, flags)
- *     recompute TCP/UDP checksum
- *     @skb: pointer to skb
- *     @offset: offset within packet where TCP/UDP checksum is located
- *     @from: old value of header field
- *     @to: new value of header field
- *     @flags: bits 0-3 - size of header field
- *             bit 4 - is pseudo header
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_tail_call(ctx, prog_array_map, index)
- *     jump into another BPF program
- *     @ctx: context pointer passed to next program
- *     @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
- *     @index: 32-bit index inside array that selects specific program to run
- *     Return: 0 on success or negative error
- *
- * int bpf_clone_redirect(skb, ifindex, flags)
- *     redirect to another netdev
- *     @skb: pointer to skb
- *     @ifindex: ifindex of the net device
- *     @flags: bit 0 - if set, redirect to ingress instead of egress
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * u64 bpf_get_current_pid_tgid(void)
- *     Return: current->tgid << 32 | current->pid
- *
- * u64 bpf_get_current_uid_gid(void)
- *     Return: current_gid << 32 | current_uid
- *
- * int bpf_get_current_comm(char *buf, int size_of_buf)
- *     stores current->comm into buf
- *     Return: 0 on success or negative error
- *
- * u32 bpf_get_cgroup_classid(skb)
- *     retrieve a proc's classid
- *     @skb: pointer to skb
- *     Return: classid if != 0
- *
- * int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_vlan_pop(skb)
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_get_tunnel_key(skb, key, size, flags)
- * int bpf_skb_set_tunnel_key(skb, key, size, flags)
- *     retrieve or populate tunnel metadata
- *     @skb: pointer to skb
- *     @key: pointer to 'struct bpf_tunnel_key'
- *     @size: size of 'struct bpf_tunnel_key'
- *     @flags: room for future extensions
- *     Return: 0 on success or negative error
- *
- * u64 bpf_perf_event_read(map, flags)
- *     read perf event counter value
- *     @map: pointer to perf_event_array map
- *     @flags: index of event in the map or bitmask flags
- *     Return: value of perf event counter read or error code
- *
- * int bpf_redirect(ifindex, flags)
- *     redirect to another netdev
- *     @ifindex: ifindex of the net device
- *     @flags:
- *	  cls_bpf:
- *          bit 0 - if set, redirect to ingress instead of egress
- *          other bits - reserved
- *	  xdp_bpf:
- *	    all bits - reserved
- *     Return: cls_bpf: TC_ACT_REDIRECT on success or TC_ACT_SHOT on error
- *	       xdp_bfp: XDP_REDIRECT on success or XDP_ABORT on error
- * int bpf_redirect_map(map, key, flags)
- *     redirect to endpoint in map
- *     @map: pointer to dev map
- *     @key: index in map to lookup
- *     @flags: --
- *     Return: XDP_REDIRECT on success or XDP_ABORT on error
- *
- * u32 bpf_get_route_realm(skb)
- *     retrieve a dst's tclassid
- *     @skb: pointer to skb
- *     Return: realm if != 0
- *
- * int bpf_perf_event_output(ctx, map, flags, data, size)
- *     output perf raw sample
- *     @ctx: struct pt_regs*
- *     @map: pointer to perf_event_array map
- *     @flags: index of event in the map or bitmask flags
- *     @data: data on stack to be output as raw data
- *     @size: size of data
- *     Return: 0 on success or negative error
- *
- * int bpf_get_stackid(ctx, map, flags)
- *     walk user or kernel stack and return id
- *     @ctx: struct pt_regs*
- *     @map: pointer to stack_trace map
- *     @flags: bits 0-7 - numer of stack frames to skip
- *             bit 8 - collect user stack instead of kernel
- *             bit 9 - compare stacks by hash only
- *             bit 10 - if two different stacks hash into the same stackid
- *                      discard old
- *             other bits - reserved
- *     Return: >= 0 stackid on success or negative error
- *
- * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
- *     calculate csum diff
- *     @from: raw from buffer
- *     @from_size: length of from buffer
- *     @to: raw to buffer
- *     @to_size: length of to buffer
- *     @seed: optional seed
- *     Return: csum result or negative error code
- *
- * int bpf_skb_get_tunnel_opt(skb, opt, size)
- *     retrieve tunnel options metadata
- *     @skb: pointer to skb
- *     @opt: pointer to raw tunnel option data
- *     @size: size of @opt
- *     Return: option size
- *
- * int bpf_skb_set_tunnel_opt(skb, opt, size)
- *     populate tunnel options metadata
- *     @skb: pointer to skb
- *     @opt: pointer to raw tunnel option data
- *     @size: size of @opt
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_change_proto(skb, proto, flags)
- *     Change protocol of the skb. Currently supported is v4 -> v6,
- *     v6 -> v4 transitions. The helper will also resize the skb. eBPF
- *     program is expected to fill the new headers via skb_store_bytes
- *     and lX_csum_replace.
- *     @skb: pointer to skb
- *     @proto: new skb->protocol type
- *     @flags: reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_change_type(skb, type)
- *     Change packet type of skb.
- *     @skb: pointer to skb
- *     @type: new skb->pkt_type type
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_under_cgroup(skb, map, index)
- *     Check cgroup2 membership of skb
- *     @skb: pointer to skb
- *     @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
- *     @index: index of the cgroup in the bpf_map
- *     Return:
- *       == 0 skb failed the cgroup2 descendant test
- *       == 1 skb succeeded the cgroup2 descendant test
- *        < 0 error
- *
- * u32 bpf_get_hash_recalc(skb)
- *     Retrieve and possibly recalculate skb->hash.
- *     @skb: pointer to skb
- *     Return: hash
- *
- * u64 bpf_get_current_task(void)
- *     Returns current task_struct
- *     Return: current
- *
- * int bpf_probe_write_user(void *dst, void *src, int len)
- *     safely attempt to write to a location
- *     @dst: destination address in userspace
- *     @src: source address on stack
- *     @len: number of bytes to copy
- *     Return: 0 on success or negative error
- *
- * int bpf_current_task_under_cgroup(map, index)
- *     Check cgroup2 membership of current task
- *     @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
- *     @index: index of the cgroup in the bpf_map
- *     Return:
- *       == 0 current failed the cgroup2 descendant test
- *       == 1 current succeeded the cgroup2 descendant test
- *        < 0 error
- *
- * int bpf_skb_change_tail(skb, len, flags)
- *     The helper will resize the skb to the given new size, to be used f.e.
- *     with control messages.
- *     @skb: pointer to skb
- *     @len: new skb length
- *     @flags: reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_pull_data(skb, len)
- *     The helper will pull in non-linear data in case the skb is non-linear
- *     and not all of len are part of the linear section. Only needed for
- *     read/write with direct packet access.
- *     @skb: pointer to skb
- *     @len: len to make read/writeable
- *     Return: 0 on success or negative error
- *
- * s64 bpf_csum_update(skb, csum)
- *     Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
- *     @skb: pointer to skb
- *     @csum: csum to add
- *     Return: csum on success or negative error
- *
- * void bpf_set_hash_invalid(skb)
- *     Invalidate current skb->hash.
- *     @skb: pointer to skb
- *
- * int bpf_get_numa_node_id()
- *     Return: Id of current NUMA node.
- *
- * int bpf_skb_change_head()
- *     Grows headroom of skb and adjusts MAC header offset accordingly.
- *     Will extends/reallocae as required automatically.
- *     May change skb data pointer and will thus invalidate any check
- *     performed for direct packet access.
- *     @skb: pointer to skb
- *     @len: length of header to be pushed in front
- *     @flags: Flags (unused for now)
- *     Return: 0 on success or negative error
- *
- * int bpf_xdp_adjust_head(xdp_md, delta)
- *     Adjust the xdp_md.data by delta
- *     @xdp_md: pointer to xdp_md
- *     @delta: An positive/negative integer to be added to xdp_md.data
- *     Return: 0 on success or negative on error
- *
- * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
- *     Copy a NUL terminated string from unsafe address. In case the string
- *     length is smaller than size, the target is not padded with further NUL
- *     bytes. In case the string length is larger than size, just count-1
- *     bytes are copied and the last byte is set to NUL.
- *     @dst: destination address
- *     @size: maximum number of bytes to copy, including the trailing NUL
- *     @unsafe_ptr: unsafe address
- *     Return:
- *       > 0 length of the string including the trailing NUL on success
- *       < 0 error
- *
- * u64 bpf_get_socket_cookie(skb)
- *     Get the cookie for the socket stored inside sk_buff.
- *     @skb: pointer to skb
- *     Return: 8 Bytes non-decreasing number on success or 0 if the socket
- *     field is missing inside sk_buff
- *
- * u32 bpf_get_socket_uid(skb)
- *     Get the owner uid of the socket stored inside sk_buff.
- *     @skb: pointer to skb
- *     Return: uid of the socket owner on success or overflowuid if failed.
- *
- * u32 bpf_set_hash(skb, hash)
- *     Set full skb->hash.
- *     @skb: pointer to skb
- *     @hash: hash to set
- *
- * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
- *     Calls setsockopt. Not all opts are available, only those with
- *     integer optvals plus TCP_CONGESTION.
- *     Supported levels: SOL_SOCKET and IPPROTO_TCP
- *     @bpf_socket: pointer to bpf_socket
- *     @level: SOL_SOCKET or IPPROTO_TCP
- *     @optname: option name
- *     @optval: pointer to option value
- *     @optlen: length of optval in bytes
- *     Return: 0 or negative error
- *
- * int bpf_getsockopt(bpf_socket, level, optname, optval, optlen)
- *     Calls getsockopt. Not all opts are available.
- *     Supported levels: IPPROTO_TCP
- *     @bpf_socket: pointer to bpf_socket
- *     @level: IPPROTO_TCP
- *     @optname: option name
- *     @optval: pointer to option value
- *     @optlen: length of optval in bytes
- *     Return: 0 or negative error
- *
- * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
- *     Set callback flags for sock_ops
- *     @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
- *     @flags: flags value
- *     Return: 0 for no error
- *             -EINVAL if there is no full tcp socket
- *             bits in flags that are not supported by current kernel
- *
- * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
- *     Grow or shrink room in sk_buff.
- *     @skb: pointer to skb
- *     @len_diff: (signed) amount of room to grow/shrink
- *     @mode: operation mode (enum bpf_adj_room_mode)
- *     @flags: reserved for future use
- *     Return: 0 on success or negative error code
- *
- * int bpf_sk_redirect_map(map, key, flags)
- *     Redirect skb to a sock in map using key as a lookup key for the
- *     sock in map.
- *     @map: pointer to sockmap
- *     @key: key to lookup sock in map
- *     @flags: reserved for future use
- *     Return: SK_PASS
- *
- * int bpf_sock_map_update(skops, map, key, flags)
- *	@skops: pointer to bpf_sock_ops
- *	@map: pointer to sockmap to update
- *	@key: key to insert/update sock in map
- *	@flags: same flags as map update elem
- *
- * int bpf_xdp_adjust_meta(xdp_md, delta)
- *     Adjust the xdp_md.data_meta by delta
- *     @xdp_md: pointer to xdp_md
- *     @delta: An positive/negative integer to be added to xdp_md.data_meta
- *     Return: 0 on success or negative on error
- *
- * int bpf_perf_event_read_value(map, flags, buf, buf_size)
- *     read perf event counter value and perf event enabled/running time
- *     @map: pointer to perf_event_array map
- *     @flags: index of event in the map or bitmask flags
- *     @buf: buf to fill
- *     @buf_size: size of the buf
- *     Return: 0 on success or negative error code
- *
- * int bpf_perf_prog_read_value(ctx, buf, buf_size)
- *     read perf prog attached perf event counter and enabled/running time
- *     @ctx: pointer to ctx
- *     @buf: buf to fill
- *     @buf_size: size of the buf
- *     Return : 0 on success or negative error code
- *
- * int bpf_override_return(pt_regs, rc)
- *	@pt_regs: pointer to struct pt_regs
- *	@rc: the return value to set
- *
- * int bpf_msg_redirect_map(map, key, flags)
- *     Redirect msg to a sock in map using key as a lookup key for the
- *     sock in map.
- *     @map: pointer to sockmap
- *     @key: key to lookup sock in map
- *     @flags: reserved for future use
- *     Return: SK_PASS
- *
- * int bpf_bind(ctx, addr, addr_len)
- *     Bind socket to address. Only binding to IP is supported, no port can be
- *     set in addr.
- *     @ctx: pointer to context of type bpf_sock_addr
- *     @addr: pointer to struct sockaddr to bind socket to
- *     @addr_len: length of sockaddr structure
- *     Return: 0 on success or negative error code
+/* The description below is an attempt at providing documentation to eBPF
+ * developers about the multiple available eBPF helper functions. It can be
+ * parsed and used to produce a manual page. The workflow is the following,
+ * and requires the rst2man utility:
+ *
+ *     $ ./scripts/bpf_helpers_doc.py \
+ *             --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
+ *     $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
+ *     $ man /tmp/bpf-helpers.7
+ *
+ * Note that in order to produce this external documentation, some RST
+ * formatting is used in the descriptions to get "bold" and "italics" in
+ * manual pages. Also note that the few trailing white spaces are
+ * intentional; removing them would break paragraphs for rst2man.
+ *
+ * Start of BPF helper function descriptions:
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
new file mode 100755
index 000000000000..57857d597237
--- /dev/null
+++ b/scripts/bpf_helpers_doc.py
@@ -0,0 +1,434 @@
+#!/usr/bin/python3
+#
+# Copyright (C) 2018 Netronome Systems, Inc.
+#
+# This software is licensed under the GNU General Public License Version 2,
+# June 1991 as shown in the file COPYING in the top-level directory of this
+# source tree.
+#
+# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
+# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
+# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
+# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
+# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+# In case user attempts to run with Python 2.
+from __future__ import print_function
+
+import argparse
+import re
+import sys, os
+
+class NoHelperFound(BaseException):
+    pass
+
+class ParsingError(BaseException):
+    def __init__(self, line='<line not provided>', reader=None):
+        if reader:
+            BaseException.__init__(self,
+                                   'Error at file offset %d, parsing line: %s' %
+                                   (reader.tell(), line))
+        else:
+            BaseException.__init__(self, 'Error parsing line: %s' % line)
+
+class Helper(object):
+    """
+    An object representing the description of an eBPF helper function.
+    @proto: function prototype of the helper function
+    @desc: textual description of the helper function
+    @ret: description of the return value of the helper function
+    """
+    def __init__(self, proto='', desc='', ret=''):
+        self.proto = proto
+        self.desc = desc
+        self.ret = ret
+
+    def proto_break_down(self):
+        """
+        Break down helper function prototype into smaller chunks: return type,
+        name, distinct arguments.
+        """
+        arg_re = re.compile(r'^((const )?(struct )?(\w+|\.\.\.))( (\**)(\w+))?$')
+        res = {}
+        proto_re = re.compile(r'^(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$')
+
+        capture = proto_re.match(self.proto)
+        res['ret_type'] = capture.group(1)
+        res['ret_star'] = capture.group(2)
+        res['name']     = capture.group(3)
+        res['args'] = []
+
+        args    = capture.group(4).split(', ')
+        for a in args:
+            capture = arg_re.match(a)
+            res['args'].append({
+                'type' : capture.group(1),
+                'star' : capture.group(6),
+                'name' : capture.group(7)
+            })
+
+        return res
+
+class HeaderParser(object):
+    """
+    An object used to parse a file in order to extract the documentation of a
+    list of eBPF helper functions. All the helpers that can be retrieved are
+    stored as Helper objects, in the self.helpers array.
+    @filename: name of file to parse, usually include/uapi/linux/bpf.h in the
+               kernel tree
+    """
+    def __init__(self, filename):
+        self.reader = open(filename, 'r')
+        self.line = ''
+        self.helpers = []
+
+    def parse_helper(self):
+        proto    = self.parse_proto()
+        desc     = self.parse_desc()
+        ret      = self.parse_ret()
+        return Helper(proto=proto, desc=desc, ret=ret)
+
+    def parse_proto(self):
+        # Argument can be of shape:
+        #   - "void"
+        #   - "type  name"
+        #   - "type *name"
+        #   - Same as above, with "const" and/or "struct" in front of type
+        #   - "..." (undefined number of arguments, for bpf_trace_printk())
+        # There is at least one term ("void"), and at most five arguments.
+        p = re.compile(r'^ \* ((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
+        capture = p.match(self.line)
+        if not capture:
+            raise NoHelperFound
+        self.line = self.reader.readline()
+        return capture.group(1)
+
+    def parse_desc(self):
+        p = re.compile(r'^ \* \tDescription$')
+        capture = p.match(self.line)
+        if not capture:
+            # Helper can have empty description and we might be parsing another
+            # attribute: return but do not consume.
+            return ''
+        # Description can be several lines, some of them possibly empty, and it
+        # stops when another subsection title is met.
+        desc = ''
+        while True:
+            self.line = self.reader.readline()
+            if self.line == ' *\n':
+                desc += '\n'
+            else:
+                p = re.compile(r'^ \* \t\t(.*)')
+                capture = p.match(self.line)
+                if capture:
+                    desc += capture.group(1) + '\n'
+                else:
+                    break
+        return desc
+
+    def parse_ret(self):
+        p = re.compile(r'^ \* \tReturn$')
+        capture = p.match(self.line)
+        if not capture:
+            # Helper can have empty retval and we might be parsing another
+            # attribute: return but do not consume.
+            return ''
+        # Return value description can be several lines, some of them possibly
+        # empty, and it stops when another subsection title is met.
+        ret = ''
+        while True:
+            self.line = self.reader.readline()
+            if self.line == ' *\n':
+                ret += '\n'
+            else:
+                p = re.compile(r'^ \* \t\t(.*)')
+                capture = p.match(self.line)
+                if capture:
+                    ret += capture.group(1) + '\n'
+                else:
+                    break
+        return ret
+
+    def run(self):
+        # Advance to start of helper function descriptions.
+        offset = self.reader.read().find('* Start of BPF helper function descriptions:')
+        if offset == -1:
+            raise Exception('Could not find start of eBPF helper descriptions list')
+        self.reader.seek(offset)
+        self.reader.readline()
+        self.reader.readline()
+        self.line = self.reader.readline()
+
+        while True:
+            try:
+                helper = self.parse_helper()
+                self.helpers.append(helper)
+            except NoHelperFound:
+                break
+
+        self.reader.close()
+        print('Parsed description of %d helper function(s)' % len(self.helpers),
+              file=sys.stderr)
+
+###############################################################################
+
+class Printer(object):
+    """
+    A generic class for printers. Printers should be created with an array of
+    Helper objects, and implement a way to print them in the desired fashion.
+    @helpers: array of Helper objects to print to standard output
+    """
+    def __init__(self, helpers):
+        self.helpers = helpers
+
+    def print_header(self):
+        pass
+
+    def print_footer(self):
+        pass
+
+    def print_one(self, helper):
+        pass
+
+    def print_all(self):
+        self.print_header()
+        for helper in self.helpers:
+            self.print_one(helper)
+        self.print_footer()
+
+class PrinterRST(Printer):
+    """
+    A printer for dumping collected information about helpers as a ReStructured
+    Text page compatible with the rst2man program, which can be used to
+    generate a manual page for the helpers.
+    @helpers: array of Helper objects to print to standard output
+    """
+    def print_header(self):
+        header = '''\
+.. Copyright (C) All BPF authors and contributors from 2014 to present.
+.. See git log include/uapi/linux/bpf.h in kernel tree for details.
+.. 
+.. %%%LICENSE_START(VERBATIM)
+.. Permission is granted to make and distribute verbatim copies of this
+.. manual provided the copyright notice and this permission notice are
+.. preserved on all copies.
+.. 
+.. Permission is granted to copy and distribute modified versions of this
+.. manual under the conditions for verbatim copying, provided that the
+.. entire resulting derived work is distributed under the terms of a
+.. permission notice identical to this one.
+.. 
+.. Since the Linux kernel and libraries are constantly changing, this
+.. manual page may be incorrect or out-of-date.  The author(s) assume no
+.. responsibility for errors or omissions, or for damages resulting from
+.. the use of the information contained herein.  The author(s) may not
+.. have taken the same level of care in the production of this manual,
+.. which is licensed free of charge, as they might when working
+.. professionally.
+.. 
+.. Formatted or processed versions of this manual, if unaccompanied by
+.. the source, must acknowledge the copyright and authors of this work.
+.. %%%LICENSE_END
+.. 
+.. Please do not edit this file. It was generated from the documentation
+.. located in file include/uapi/linux/bpf.h of the Linux kernel sources
+.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
+.. repository (header and footer).
+
+===========
+BPF-HELPERS
+===========
+-------------------------------------------------------------------------------
+list of eBPF helper functions
+-------------------------------------------------------------------------------
+
+:Manual section: 7
+
+DESCRIPTION
+===========
+
+The extended Berkeley Packet Filter (eBPF) subsystem consists of programs
+written in a pseudo-assembly language, then attached to one of the several
+kernel hooks and run in reaction to specific events. This framework differs
+from the older, "classic" BPF (or "cBPF") in several aspects, one of them being
+the ability to call special functions (or "helpers") from within a program.
+These functions are restricted to a white-list of helpers defined in the
+kernel.
+
+These helpers are used by eBPF programs to interact with the system, or with
+the context in which they work. For instance, they can be used to print
+debugging messages, to get the time since the system was booted, to interact
+with eBPF maps, or to manipulate network packets. Since there are several eBPF
+program types, and they do not all run in the same context, each program type
+can only call a subset of those helpers.
+
+Due to eBPF conventions, a helper cannot have more than five arguments.
+
+Internally, eBPF programs call directly into the compiled helper functions
+without requiring any foreign-function interface. As a result, calling helpers
+introduces no overhead, thus offering excellent performance.
+
+This document is an attempt to list and document the helpers available to eBPF
+developers. They are sorted in chronological order (the oldest helpers in the
+kernel at the top).
+
+HELPERS
+=======
+'''
+        print(header)
+
+    def print_footer(self):
+        footer = r'''
+EXAMPLES
+========
+
+Example usage for most of the eBPF helpers listed in this manual page is
+available within the Linux kernel sources, at the following locations:
+
+* *samples/bpf/*
+* *tools/testing/selftests/bpf/*
+
+LICENSE
+=======
+
+Helpers related to tracing and monitoring functions are for "GPL-only" eBPF
+programs. These functions can be used to inspect the internals of the Linux
+kernel, which is under the GNU General Public License (GPL): so eBPF programs
+must themselves be compatible with this license to be allowed to use such
+helpers (the licensing rules are the same as for kernel modules, so dual
+licenses, such as "Dual BSD/GPL", may also be used).
+
+This requires that programs be loaded with the correct license string passed
+(via **attr**) to the **bpf**\ () system call, and generally translates into
+the C source code of such programs containing a line similar to the following:
+
+::
+
+	char ____license[] __attribute__((section("license"), used)) = "GPL";
+
+Helpers restricted to networking usage are free to use in non-GPL programs,
+because the packets on which they operate are just abstract bytes on the wire.
+
+IMPLEMENTATION
+==============
+
+This manual page is an effort to document the existing eBPF helper functions.
+But as of this writing, the BPF sub-system is under heavy development. New eBPF
+program or map types are added, along with new helper functions. Some helpers
+are occasionally made available for additional program types. So in spite of
+the efforts of the community, this page might not be up-to-date. If you want to
+check by yourself what helper functions exist in your kernel, or what types of
+programs they can support, here are some files in the kernel tree that you
+may be interested in:
+
+* *include/uapi/linux/bpf.h* is the main BPF header. It contains the full list
+  of all helper functions, as well as many other BPF definitions including most
+  of the flags, structs or constants used by the helpers.
+* *net/core/filter.c* contains the definition of most network-related helper
+  functions, and the list of program types from which they can be used.
+* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related
+  helpers.
+* *kernel/bpf/verifier.c* contains the functions used to check that valid types
+  of eBPF maps are used with a given helper function.
+* *kernel/bpf/* directory contains other files in which additional helpers are
+  defined (for cgroups, sockmaps, etc.).
+
+Compatibility between helper functions and program types can generally be found
+in the files where helper functions are defined. Look for the **struct
+bpf_func_proto** objects and for functions returning them: these functions
+contain a list of helpers that a given program type can call. Note that the
+**default:** label of the **switch ... case** used to filter helpers can call
+other functions, themselves allowing access to additional helpers. The
+requirement for GPL license is also in those **struct bpf_func_proto**.
+
+Compatibility between helper functions and map types can be found in the
+**check_map_func_compatibility**\ () function in file *kernel/bpf/verifier.c*.
+
+Helper functions that invalidate the checks on **data** and **data_end**
+pointers for network processing are listed in function
+**bpf_helper_changes_pkt_data**\ () in file *net/core/filter.c*.
+
+SEE ALSO
+========
+
+**bpf**\ (2),
+**cgroups**\ (7),
+**ip**\ (8),
+**perf_event_open**\ (2),
+**sendmsg**\ (2),
+**socket**\ (7),
+**tc-bpf**\ (8)'''
+        print(footer)
+
+    def print_proto(self, helper):
+        """
+        Format function prototype with bold and italics markers. This makes the
+        RST file less readable, but gives nice results in the manual page.
+        """
+        proto = helper.proto_break_down()
+
+        print('**%s %s%s(' % (proto['ret_type'],
+                              proto['ret_star'].replace('*', '\\*'),
+                              proto['name']),
+              end='')
+
+        comma = ''
+        for a in proto['args']:
+            one_arg = '{}{}'.format(comma, a['type'])
+            if a['name']:
+                if a['star']:
+                    one_arg += ' {}**\\ '.format(a['star'].replace('*', '\\*'))
+                else:
+                    one_arg += '** '
+                one_arg += '*{}*\\ **'.format(a['name'])
+            comma = ', '
+            print(one_arg, end='')
+
+        print(')**')
+
+    def print_one(self, helper):
+        self.print_proto(helper)
+
+        if (helper.desc):
+            print('\tDescription')
+            # Do not strip all newline characters: formatted code at the end of
+            # a section must be followed by a blank line.
+            for line in re.sub('\n$', '', helper.desc, count=1).split('\n'):
+                print('{}{}'.format('\t\t' if line else '', line))
+
+        if (helper.ret):
+            print('\tReturn')
+            for line in helper.ret.rstrip().split('\n'):
+                print('{}{}'.format('\t\t' if line else '', line))
+
+        print('')
+
+###############################################################################
+
+# If script is launched from scripts/ from kernel tree and can access
+# ../include/uapi/linux/bpf.h, use it as a default name for the file to parse,
+# otherwise the --filename argument will be required from the command line.
+script = os.path.abspath(sys.argv[0])
+linuxRoot = os.path.dirname(os.path.dirname(script))
+bpfh = os.path.join(linuxRoot, 'include/uapi/linux/bpf.h')
+
+argParser = argparse.ArgumentParser(description="""
+Parse eBPF header file and generate documentation for eBPF helper functions.
+The RST-formatted output produced can be turned into a manual page with the
+rst2man utility.
+""")
+if (os.path.isfile(bpfh)):
+    argParser.add_argument('--filename', help='path to include/uapi/linux/bpf.h',
+                           default=bpfh)
+else:
+    argParser.add_argument('--filename', help='path to include/uapi/linux/bpf.h')
+args = argParser.parse_args()
+
+# Parse file.
+headerParser = HeaderParser(args.filename)
+headerParser.run()
+
+# Print formatted output to standard output.
+printer = PrinterRST(headerParser.helpers)
+printer.print_all()
-- 
2.14.1
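
For readers following the parsing logic, here is a minimal standalone sketch
of how the two regular expressions in Helper.proto_break_down() split a helper
prototype into return type, name and arguments. It is an illustration only and
not part of the patch: the break_down() function name and the ValueError
handling are assumptions made for this example, and the patterns are written
as raw strings with the "..." alternative escaped.

```python
import re

# Patterns modelled on Helper.proto_break_down() from the patch. Raw strings
# and the escaped "..." are adjustments made for this sketch.
PROTO_RE = re.compile(r'^(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$')
ARG_RE = re.compile(r'^((const )?(struct )?(\w+|\.\.\.))( (\**)(\w+))?$')


def break_down(proto):
    """Split a helper prototype string into return type, name and arguments.

    break_down() is a hypothetical helper for this example; the patch does
    the same work inside Helper.proto_break_down().
    """
    capture = PROTO_RE.match(proto)
    if capture is None:
        raise ValueError('not a helper prototype: %r' % proto)

    args = []
    # Arguments are separated by ", " in the documented prototypes.
    for arg in capture.group(4).split(', '):
        m = ARG_RE.match(arg)
        if m is None:
            raise ValueError('cannot parse argument: %r' % arg)
        args.append({
            'type': m.group(1),   # e.g. "struct bpf_map", "u32", "..."
            'star': m.group(6),   # pointer stars, '' or None
            'name': m.group(7),   # argument name, None for "void"/"..."
        })

    return {
        'ret_type': capture.group(1),
        'ret_star': capture.group(2),
        'name': capture.group(3),
        'args': args,
    }


if __name__ == '__main__':
    example = 'void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)'
    print(break_down(example))
```

The variadic case used by bpf_trace_printk() goes through the same code path:
the "..." argument comes back with a type of "..." and no star or name, which
is why print_proto() in the patch checks a['name'] before emitting markup.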
