Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Cornelia Huck @ 2019-08-14 13:09 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Alex Williamson, Kirti Wankhede, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, cjia@nvidia.com, Jiri Pirko,
	netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB4866ABFDDD9DDCBC01F6CA90D1AD0@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Wed, 14 Aug 2019 12:27:01 +0000
Parav Pandit <parav@mellanox.com> wrote:

> + Jiri, + netdev 
> To get perspective on the ndo->phys_port_name for the representor netdev of mdev.
> 
> Hi Cornelia,
> 
> > -----Original Message-----
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Wednesday, August 14, 2019 1:32 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; cjia@nvidia.com
> > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > 
> > On Wed, 14 Aug 2019 05:54:36 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > > I get that part. I prefer to remove the UUID itself from the
> > > > > structure and therefore removing this API makes lot more sense?  
> > > >
> > > > Mdev and support tools around mdev are based on UUIDs because it's  
> > defined  
> > > > in the documentation.  
> > > When we introduce newer device naming scheme, it will update the  
> > documentation also.  
> > > May be that is the time to move to .rst format too.  
> > 
> > You are aware that there are existing tools that expect a uuid naming scheme,
> > right?
> >   
> Yes, Alex mentioned too.
> The good tool that I am aware of is [1], which is 4 months old. Not sure if it is part of any distros yet.
> 
> README also says, that it is in 'early in development. So we have scope to improve it for non UUID names, but lets discuss that more below.

The up-to-date reference for mdevctl is
https://github.com/mdevctl/mdevctl. There is currently an effort to get
this packaged in Fedora.

> 
> > >  
> > > > I don't think it's as simple as saying "voila, UUID dependencies are
> > > > removed, users are free to use arbitrary strings".  We'd need to
> > > > create some kind of naming policy, what characters are allows so
> > > > that we can potentially expand the creation parameters as has been
> > > > proposed a couple times, how do we deal with collisions and races,
> > > > and why should we make such a change when a UUID is a perfectly
> > > > reasonable devices name.  Thanks,
> > > >  
> > > Sure, we should define a policy on device naming to be more relaxed.
> > > We have enough examples in-kernel.
> > > Few that I am aware of are netdev (vxlan, macvlan, ipvlan, lot more), rdma  
> > etc which has arbitrary device names and ID based device names.  
> > >
> > > Collisions and race is already taken care today in the mdev core. Same  
> > unique device names continue.
> > 
> > I'm still completely missing a rationale _why_ uuids are supposedly
> > bad/restricting/etc.  
> There is nothing bad about uuid based naming.
> Its just too long name to derive phys_port_name of a netdev.
> In details below.
> 
> For a given mdev of networking type, we would like to have 
> (a) representor netdevice [2] 
> (b) associated devlink port [3]
> 
> Currently these representor netdevice exist only for the PCIe SR-IOV VFs.
> It is further getting extended for mdev without SR-IOV.
> 
> Each of the devlink port is attached to representor netdevice [4].
> 
> This netdevice phys_port_name should be a unique derived from some property of mdev.
> Udev/systemd uses phys_port_name to derive unique representor netdev name.
> This netdev name is further use by orchestration and switching software in user space.
> One such distro supported switching software is ovs [4], which relies on the persistent device name of the representor netdevice.

Ok, let me rephrase this to check that I understand this correctly. I'm
not sure about some of the terms you use here (even after looking at
the linked doc/code), but that's probably still ok.

We want to derive an unique (and probably persistent?) netdev name so
that userspace can refer to a representor netdevice. Makes sense.
For generating that name, udev uses the phys_port_name (which
represents the devlink port, IIUC). Also makes sense.

> 
> phys_port_name has limitation to be only 15 characters long.
> UUID doesn't fit in phys_port_name.

Understood. But why do we need to derive the phys_port_name from the
mdev device name? This netdevice use case seems to be just one use case
for using mdev devices? If this is a specialized mdev type for this
setup, why not just expose a shorter identifier via an extra attribute?

> Longer UUID names are creating snow ball effect, not just in networking stack but many user space tools too.

This snowball effect mainly comes from the device name ->
phys_port_name setup, IIUC.

> (as opposed to recently introduced mdevctl, are they more mdev tools which has dependency on UUID name?)

I am aware that people have written scripts etc. to manage their mdevs.
Given that the mdev infrastructure has been around for quite some time,
I'd say the chance of some of those scripts relying on uuid names is
non-zero.

> 
> Instead of mdev subsystem creating such effect, one option we are considering is to have shorter mdev names.
> (Similar to netdev, rdma, nvme devices).
> Such as mdev1, mdev2000 etc.
> 
> Second option I was considering is to have an optional alias for UUID based mdev.
> This name alias is given at time of mdev creation.
> Devlink port's phys_port_name is derived out of this shorter mdev name alias.
> This way, mdev remains to be UUID based with optional extension.
> However, I prefer first option to relax mdev naming scheme.

Actually, I think that second option makes much more sense, as you
avoid potentially breaking existing tooling.

> 
> > We want to uniquely identify a device, across different
> > types of vendor drivers. An uuid is a unique identifier and even a well-defined
> > one. Tools (e.g. mdevctl) are relying on it for mdev devices today.
> > 
> > What is the problem you're trying to solve?  
> Unique device naming is still achieved without UUID scheme by various subsystems in kernel using alpha-numeric string.
> Having such string based continue to provide unique names.
> 
> I hope I described the problem and two solutions above.
> 
> [1] https://github.com/awilliam/mdevctl
> [2] https://elixir.bootlin.com/linux/v5.3-rc4/source/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
> [3] http://man7.org/linux/man-pages/man8/devlink-port.8.html
> [4] https://elixir.bootlin.com/linux/v5.3-rc4/source/net/core/devlink.c#L6921
> [5] https://www.openvswitch.org/
> 


^ permalink raw reply

* Re: [PATCH v4 9/9] Input: add IOC3 serio driver
From: Jonas Gorski @ 2019-08-14 13:20 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Ralf Baechle, Paul Burton, James Hogan, Dmitry Torokhov,
	Lee Jones, David S. Miller, Srinivas Kandagatla, Alessandro Zummo,
	Alexandre Belloni, Greg Kroah-Hartman, Jiri Slaby,
	Evgeniy Polyakov, linux-mips, linux-kernel, linux-input,
	Network Development, linux-rtc, linux-serial
In-Reply-To: <20190809103235.16338-10-tbogendoerfer@suse.de>

Hi,

On Fri, 9 Aug 2019 at 12:33, Thomas Bogendoerfer <tbogendoerfer@suse.de> wrote:
>
> This patch adds a platform driver for supporting keyboard and mouse
> interface of SGI IOC3 chips.
>
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> ---
>  drivers/input/serio/Kconfig   |  10 +++
>  drivers/input/serio/Makefile  |   1 +
>  drivers/input/serio/ioc3kbd.c | 163 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 174 insertions(+)
>  create mode 100644 drivers/input/serio/ioc3kbd.c
>
> diff --git a/drivers/input/serio/Kconfig b/drivers/input/serio/Kconfig
> index f3e18f8ef9ca..373a1646019e 100644
> --- a/drivers/input/serio/Kconfig
> +++ b/drivers/input/serio/Kconfig
> @@ -165,6 +165,16 @@ config SERIO_MACEPS2
>           To compile this driver as a module, choose M here: the
>           module will be called maceps2.
>
> +config SERIO_SGI_IOC3
> +       tristate "SGI IOC3 PS/2 controller"
> +       depends on SGI_MFD_IOC3
> +       help
> +         Say Y here if you have an SGI Onyx2, SGI Octane or IOC3 PCI card
> +         and you want to attach and use a keyboard, mouse, or both.
> +
> +         To compile this driver as a module, choose M here: the
> +         module will be called ioc3kbd.
> +
>  config SERIO_LIBPS2
>         tristate "PS/2 driver library"
>         depends on SERIO_I8042 || SERIO_I8042=n
> diff --git a/drivers/input/serio/Makefile b/drivers/input/serio/Makefile
> index 67950a5ccb3f..6d97bad7b844 100644
> --- a/drivers/input/serio/Makefile
> +++ b/drivers/input/serio/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_HIL_MLC)         += hp_sdc_mlc.o hil_mlc.o
>  obj-$(CONFIG_SERIO_PCIPS2)     += pcips2.o
>  obj-$(CONFIG_SERIO_PS2MULT)    += ps2mult.o
>  obj-$(CONFIG_SERIO_MACEPS2)    += maceps2.o
> +obj-$(CONFIG_SERIO_SGI_IOC3)   += ioc3kbd.o
>  obj-$(CONFIG_SERIO_LIBPS2)     += libps2.o
>  obj-$(CONFIG_SERIO_RAW)                += serio_raw.o
>  obj-$(CONFIG_SERIO_AMS_DELTA)  += ams_delta_serio.o
> diff --git a/drivers/input/serio/ioc3kbd.c b/drivers/input/serio/ioc3kbd.c
> new file mode 100644
> index 000000000000..6840e3c23fed
> --- /dev/null
> +++ b/drivers/input/serio/ioc3kbd.c
> @@ -0,0 +1,163 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * SGI IOC3 PS/2 controller driver for linux
> + *
> + * Copyright (C) 2019 Thomas Bogendoerfer <tbogendoerfer@suse.de>
> + *
> + * Based on code Copyright (C) 2005 Stanislaw Skowronek <skylark@unaligned.org>
> + *               Copyright (C) 2009 Johannes Dickgreber <tanzy@gmx.de>
> + */
> +
> +#include <linux/delay.h>
> +#include <linux/init.h>
> +#include <linux/io.h>
> +#include <linux/serio.h>
> +#include <linux/module.h>
> +#include <linux/platform_device.h>
> +
> +#include <asm/sn/ioc3.h>
> +
> +struct ioc3kbd_data {
> +       struct ioc3_serioregs __iomem *regs;
> +       struct serio *kbd, *aux;
> +       int irq;
> +};
> +
> +static int ioc3kbd_write(struct serio *dev, u8 val)
> +{
> +       struct ioc3kbd_data *d = dev->port_data;
> +       unsigned long timeout = 0;
> +       u32 mask;
> +
> +       mask = (dev == d->aux) ? KM_CSR_M_WRT_PEND : KM_CSR_K_WRT_PEND;
> +       while ((readl(&d->regs->km_csr) & mask) && (timeout < 1000)) {
> +               udelay(100);
> +               timeout++;
> +       }
> +
> +       if (timeout >= 1000)
> +               return -ETIMEDOUT;
> +
> +       writel(val, dev == d->aux ? &d->regs->m_wd : &d->regs->k_wd);
> +
> +       return 0;
> +}
> +
> +static irqreturn_t ioc3kbd_intr(int itq, void *dev_id)
> +{
> +       struct ioc3kbd_data *d = dev_id;
> +       u32 data_k, data_m;
> +
> +       data_k = readl(&d->regs->k_rd);
> +       data_m = readl(&d->regs->m_rd);
> +
> +       if (data_k & KM_RD_VALID_0)
> +               serio_interrupt(d->kbd, (data_k >> KM_RD_DATA_0_SHIFT) & 0xff,
> +                               0);
> +       if (data_k & KM_RD_VALID_1)
> +               serio_interrupt(d->kbd, (data_k >> KM_RD_DATA_1_SHIFT) & 0xff,
> +                               0);
> +       if (data_k & KM_RD_VALID_2)
> +               serio_interrupt(d->kbd, (data_k >> KM_RD_DATA_2_SHIFT) & 0xff,
> +                               0);
> +       if (data_m & KM_RD_VALID_0)
> +               serio_interrupt(d->aux, (data_m >> KM_RD_DATA_0_SHIFT) & 0xff,
> +                               0);
> +       if (data_m & KM_RD_VALID_1)
> +               serio_interrupt(d->aux, (data_m >> KM_RD_DATA_1_SHIFT) & 0xff,
> +                               0);
> +       if (data_m & KM_RD_VALID_2)
> +               serio_interrupt(d->aux, (data_m >> KM_RD_DATA_2_SHIFT) & 0xff,
> +                               0);
> +
> +       return 0;
> +}
> +
> +static int ioc3kbd_probe(struct platform_device *pdev)
> +{
> +       struct ioc3_serioregs __iomem *regs;
> +       struct device *dev = &pdev->dev;
> +       struct ioc3kbd_data *d;
> +       struct serio *sk, *sa;
> +       int irq, ret;
> +
> +       regs = devm_platform_ioremap_resource(pdev, 0);
> +       if (IS_ERR(regs))
> +               return PTR_ERR(regs);
> +
> +       irq = platform_get_irq(pdev, 0);
> +       if (irq < 0)
> +               return -ENXIO;
> +
> +       d = devm_kzalloc(&pdev->dev, sizeof(*d), GFP_KERNEL);

&pdev->dev => dev

> +       if (!d)
> +               return -ENOMEM;
> +
> +       sk = kzalloc(sizeof(*sk), GFP_KERNEL);

any reason not to devm_kzalloc this as well? Then you won't need to
manually free it in the error cases.

> +       if (!sk)
> +               return -ENOMEM;
> +
> +       sa = kzalloc(sizeof(*sa), GFP_KERNEL);

same here.

> +       if (!sa) {
> +               kfree(sk);
> +               return -ENOMEM;
> +       }
> +
> +       sk->id.type = SERIO_8042;
> +       sk->write = ioc3kbd_write;
> +       snprintf(sk->name, sizeof(sk->name), "IOC3 keyboard %d", pdev->id);
> +       snprintf(sk->phys, sizeof(sk->phys), "ioc3/serio%dkbd", pdev->id);
> +       sk->port_data = d;
> +       sk->dev.parent = &pdev->dev;

&pdev->dev => dev

> +
> +       sa->id.type = SERIO_8042;
> +       sa->write = ioc3kbd_write;
> +       snprintf(sa->name, sizeof(sa->name), "IOC3 auxiliary %d", pdev->id);
> +       snprintf(sa->phys, sizeof(sa->phys), "ioc3/serio%daux", pdev->id);
> +       sa->port_data = d;
> +       sa->dev.parent = dev;
> +
> +       d->regs = regs;
> +       d->kbd = sk;
> +       d->aux = sa;
> +       d->irq = irq;
> +
> +       platform_set_drvdata(pdev, d);
> +       serio_register_port(d->kbd);
> +       serio_register_port(d->aux);
> +
> +       ret = devm_request_irq(&pdev->dev, irq, ioc3kbd_intr, IRQF_SHARED,
> +                              "ioc3-kbd", d);
> +       if (ret) {
> +               dev_err(&pdev->dev, "could not request IRQ %d\n", irq);
> +               serio_unregister_port(d->kbd);
> +               serio_unregister_port(d->aux);
> +               kfree(sk);
> +               kfree(sa);
> +               return ret;
> +       }
> +       return 0;
> +}
> +
> +static int ioc3kbd_remove(struct platform_device *pdev)
> +{
> +       struct ioc3kbd_data *d = platform_get_drvdata(pdev);
> +
> +       devm_free_irq(&pdev->dev, d->irq, d);
> +       serio_unregister_port(d->kbd);
> +       serio_unregister_port(d->aux);
> +       return 0;
> +}

and on that topic, won't you need to kfree d->kbd and d->aux here?
Unless you devm_kzalloc'd them.

Alternatively you could also just embed the two serio structs into
ioc3kbd_data, then you only need one allocation instead of three.


Regards

Jonas

^ permalink raw reply

* Re: [PATCH bpf-next 1/3] libbpf: add asm/unistd.h to xsk to get __NR_mmap2
From: Björn Töpel @ 2019-08-14 13:32 UTC (permalink / raw)
  To: Andrii Nakryiko, Magnus Karlsson, Björn Töpel,
	David S. Miller, Jesper Dangaard Brouer, john fastabend,
	Jakub Kicinski, Daniel Borkmann, Networking, bpf, Xdp, open list
In-Reply-To: <20190814115659.GC4142@khorivan>

On Wed, 14 Aug 2019 at 13:57, Ivan Khoronzhuk
<ivan.khoronzhuk@linaro.org> wrote:
>
> On Wed, Aug 14, 2019 at 12:24:05PM +0300, Ivan Khoronzhuk wrote:
> >On Tue, Aug 13, 2019 at 04:38:13PM -0700, Andrii Nakryiko wrote:
> >
> >Hi, Andrii
> >
> >>On Tue, Aug 13, 2019 at 3:24 AM Ivan Khoronzhuk
> >><ivan.khoronzhuk@linaro.org> wrote:
> >>>
> >>>That's needed to get __NR_mmap2 when mmap2 syscall is used.
> >>>
> >>>Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> >>>---
> >>> tools/lib/bpf/xsk.c | 1 +
> >>> 1 file changed, 1 insertion(+)
> >>>
> >>>diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> >>>index 5007b5d4fd2c..f2fc40f9804c 100644
> >>>--- a/tools/lib/bpf/xsk.c
> >>>+++ b/tools/lib/bpf/xsk.c
> >>>@@ -12,6 +12,7 @@
> >>> #include <stdlib.h>
> >>> #include <string.h>
> >>> #include <unistd.h>
> >>>+#include <asm/unistd.h>
> >>
> >>asm/unistd.h is not present in Github libbpf projection. Is there any
> >
> >Look on includes from
> >tools/lib/bpf/libpf.c
> >tools/lib/bpf/bpf.c
> >
> >That's how it's done... Copping headers to arch/arm will not
> >solve this, it includes both of them anyway, and anyway it needs
> >asm/unistd.h inclusion here, only because xsk.c needs __NR_*
> >
> >
>
> There is one more radical solution for this I can send, but I'm not sure how it
> can impact on other syscals/arches...
>
> Looks like:
>
>
> diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> index 9312066a1ae3..8b2f8ff7ce44 100644
> --- a/tools/lib/bpf/Makefile
> +++ b/tools/lib/bpf/Makefile
> @@ -113,6 +113,7 @@ override CFLAGS += -Werror -Wall
>  override CFLAGS += -fPIC
>  override CFLAGS += $(INCLUDES)
>  override CFLAGS += -fvisibility=hidden
> +override CFLAGS += -D_FILE_OFFSET_BITS=64
>

Hmm, isn't this glibc-ism? Does is it work for, say, musl or bionic?

If this is portable, and works on 32-, and 64-bit archs, I'm happy
with the patch. :-)


Björn

>  ifeq ($(VERBOSE),1)
>    Q =
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index f2fc40f9804c..ff2d03b8380d 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -75,23 +75,6 @@ struct xsk_nl_info {
>         int fd;
>  };
>
> -/* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
> - * Unfortunately, it is not part of glibc.
> - */
> -static inline void *xsk_mmap(void *addr, size_t length, int prot, int flags,
> -                            int fd, __u64 offset)
> -{
> -#ifdef __NR_mmap2
> -       unsigned int page_shift = __builtin_ffs(getpagesize()) - 1;
> -       long ret = syscall(__NR_mmap2, addr, length, prot, flags, fd,
> -                          (off_t)(offset >> page_shift));
> -
> -       return (void *)ret;
> -#else
> -       return mmap(addr, length, prot, flags, fd, offset);
> -#endif
> -}
> -
>  int xsk_umem__fd(const struct xsk_umem *umem)
>  {
>         return umem ? umem->fd : -EINVAL;
> @@ -211,10 +194,9 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
>                 goto out_socket;
>         }
>
> -       map = xsk_mmap(NULL, off.fr.desc +
> -                      umem->config.fill_size * sizeof(__u64),
> -                      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> -                      umem->fd, XDP_UMEM_PGOFF_FILL_RING);
> +       map = mmap(NULL, off.fr.desc + umem->config.fill_size * sizeof(__u64),
> +                  PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, umem->fd,
> +                  XDP_UMEM_PGOFF_FILL_RING);
>         if (map == MAP_FAILED) {
>                 err = -errno;
>                 goto out_socket;
> @@ -228,10 +210,9 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
>         fill->ring = map + off.fr.desc;
>         fill->cached_cons = umem->config.fill_size;
>
> -       map = xsk_mmap(NULL,
> -                      off.cr.desc + umem->config.comp_size * sizeof(__u64),
> -                      PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> -                      umem->fd, XDP_UMEM_PGOFF_COMPLETION_RING);
> +       map = mmap(NULL, off.cr.desc + umem->config.comp_size * sizeof(__u64),
> +                  PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, umem->fd,
> +                  XDP_UMEM_PGOFF_COMPLETION_RING);
>         if (map == MAP_FAILED) {
>                 err = -errno;
>                 goto out_mmap;
> @@ -552,11 +533,10 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
>         }
>
>         if (rx) {
> -               rx_map = xsk_mmap(NULL, off.rx.desc +
> -                                 xsk->config.rx_size * sizeof(struct xdp_desc),
> -                                 PROT_READ | PROT_WRITE,
> -                                 MAP_SHARED | MAP_POPULATE,
> -                                 xsk->fd, XDP_PGOFF_RX_RING);
> +               rx_map = mmap(NULL, off.rx.desc +
> +                             xsk->config.rx_size * sizeof(struct xdp_desc),
> +                             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> +                             xsk->fd, XDP_PGOFF_RX_RING);
>                 if (rx_map == MAP_FAILED) {
>                         err = -errno;
>                         goto out_socket;
> @@ -571,11 +551,10 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
>         xsk->rx = rx;
>
>         if (tx) {
> -               tx_map = xsk_mmap(NULL, off.tx.desc +
> -                                 xsk->config.tx_size * sizeof(struct xdp_desc),
> -                                 PROT_READ | PROT_WRITE,
> -                                 MAP_SHARED | MAP_POPULATE,
> -                                 xsk->fd, XDP_PGOFF_TX_RING);
> +               tx_map = mmap(NULL, off.tx.desc +
> +                             xsk->config.tx_size * sizeof(struct xdp_desc),
> +                             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
> +                             xsk->fd, XDP_PGOFF_TX_RING);
>                 if (tx_map == MAP_FAILED) {
>                         err = -errno;
>                         goto out_mmap_rx;
>
>
> If maintainers are ready to accept this I can send.
> What do you say?
>
> --
> Regards,
> Ivan Khoronzhuk

^ permalink raw reply

* Re: [PATCH 2/3] hwmon: raspberrypi: update MODULE_AUTHOR() email address
From: Guenter Roeck @ 2019-08-14 13:39 UTC (permalink / raw)
  To: Stefan Wahren
  Cc: Jean Delvare, David S. Miller, Srinivas Kandagatla, Shawn Guo,
	Sascha Hauer, Pengutronix Kernel Team, Fabio Estevam,
	NXP Linux Team, linux-hwmon, linux-arm-kernel, linux-kernel,
	netdev
In-Reply-To: <1565720249-6549-2-git-send-email-wahrenst@gmx.net>

On Tue, Aug 13, 2019 at 08:17:28PM +0200, Stefan Wahren wrote:
> The email address listed in MODULE_AUTHOR() will be disabled in the
> near future. Replace it with my private one.
> 
> Signed-off-by: Stefan Wahren <wahrenst@gmx.net>

Applied to hwmon-next.

Thanks,
Guenter

> ---
>  drivers/hwmon/raspberrypi-hwmon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --
> 2.7.4
> 
> diff --git a/drivers/hwmon/raspberrypi-hwmon.c b/drivers/hwmon/raspberrypi-hwmon.c
> index efe4bb1..d3a64a3 100644
> --- a/drivers/hwmon/raspberrypi-hwmon.c
> +++ b/drivers/hwmon/raspberrypi-hwmon.c
> @@ -146,7 +146,7 @@ static struct platform_driver rpi_hwmon_driver = {
>  };
>  module_platform_driver(rpi_hwmon_driver);
> 
> -MODULE_AUTHOR("Stefan Wahren <stefan.wahren@i2se.com>");
> +MODULE_AUTHOR("Stefan Wahren <wahrenst@gmx.net>");
>  MODULE_DESCRIPTION("Raspberry Pi voltage sensor driver");
>  MODULE_LICENSE("GPL v2");
>  MODULE_ALIAS("platform:raspberrypi-hwmon");

^ permalink raw reply

* Re: [PATCH 3/3] ocelot_ace: fix action of trap
From: Andrew Lunn @ 2019-08-14 13:42 UTC (permalink / raw)
  To: Y.b. Lu
  Cc: Allan W. Nielsen, netdev@vger.kernel.org, David S . Miller,
	Alexandre Belloni, Microchip Linux Driver Support
In-Reply-To: <VI1PR0401MB2237D9358AA17400E72A776EF8AD0@VI1PR0401MB2237.eurprd04.prod.outlook.com>

> [Y.b. Lu] PTP messages over Ethernet will use two multicast addresses.
> 01-80-C2-00-00-0E for peer delay messages.
> 01-1B-19-00-00-00 for other messages.
> 
> But only 01-80-C2-00-00-0E could be handled by hardware filter for BPDU frames (01-80-C2-00-00-0x).
> For PTP messages handling, trapping them to CPU through VCAP IS2 is the suggested way by Ocelot/Felix.
> 
> I have a question since you are experts.

There real expert is Richard Cochran <richardcochran@gmail.com>. He
implemented PTP support for the mv88e6xxx devices, maintains to core
code, and the linuxptp daemon. Any ptp support your post will be
reviewed by him.

> For other switches, whether they are always trapping PTP messages to CPU?

For the mv88e6xxx family, there is a per port bit which enabled
PTP. When enabled, PTP frames are recognised by the hardware and
forwarded to the CPU.

> Is there any common method in linux to configure switch to select
> trapping or just forwarding PTP messages?

The best answer to that is to look at other switch driver which
implement ptp. The ptp core expects a ptp_clock_info structure, and
one of its members is 'enable'.

    Andrew

^ permalink raw reply

* RE: [PATCH v2 0/2] Simplify mtty driver and mdev core
From: Parav Pandit @ 2019-08-14 13:45 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Kirti Wankhede, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, cjia@nvidia.com, Jiri Pirko,
	netdev@vger.kernel.org
In-Reply-To: <20190814150911.296da78c.cohuck@redhat.com>



> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Wednesday, August 14, 2019 6:39 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; cjia@nvidia.com; Jiri Pirko <jiri@mellanox.com>;
> netdev@vger.kernel.org
> Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> 
> On Wed, 14 Aug 2019 12:27:01 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > + Jiri, + netdev
> > To get perspective on the ndo->phys_port_name for the representor netdev
> of mdev.
> >
> > Hi Cornelia,
> >
> > > -----Original Message-----
> > > From: Cornelia Huck <cohuck@redhat.com>
> > > Sent: Wednesday, August 14, 2019 1:32 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; kvm@vger.kernel.org; linux-
> > > kernel@vger.kernel.org; cjia@nvidia.com
> > > Subject: Re: [PATCH v2 0/2] Simplify mtty driver and mdev core
> > >
> > > On Wed, 14 Aug 2019 05:54:36 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > > I get that part. I prefer to remove the UUID itself from the
> > > > > > structure and therefore removing this API makes lot more sense?
> > > > >
> > > > > Mdev and support tools around mdev are based on UUIDs because
> > > > > it's
> > > defined
> > > > > in the documentation.
> > > > When we introduce newer device naming scheme, it will update the
> > > documentation also.
> > > > May be that is the time to move to .rst format too.
> > >
> > > You are aware that there are existing tools that expect a uuid
> > > naming scheme, right?
> > >
> > Yes, Alex mentioned too.
> > The good tool that I am aware of is [1], which is 4 months old. Not sure if it is
> part of any distros yet.
> >
> > README also says, that it is in 'early in development. So we have scope to
> improve it for non UUID names, but lets discuss that more below.
> 
> The up-to-date reference for mdevctl is
> https://github.com/mdevctl/mdevctl. There is currently an effort to get this
> packaged in Fedora.
> 
Awesome.

> >
> > > >
> > > > > I don't think it's as simple as saying "voila, UUID dependencies
> > > > > are removed, users are free to use arbitrary strings".  We'd
> > > > > need to create some kind of naming policy, what characters are
> > > > > allows so that we can potentially expand the creation parameters
> > > > > as has been proposed a couple times, how do we deal with
> > > > > collisions and races, and why should we make such a change when
> > > > > a UUID is a perfectly reasonable devices name.  Thanks,
> > > > >
> > > > Sure, we should define a policy on device naming to be more relaxed.
> > > > We have enough examples in-kernel.
> > > > Few that I am aware of are netdev (vxlan, macvlan, ipvlan, lot
> > > > more), rdma
> > > etc which has arbitrary device names and ID based device names.
> > > >
> > > > Collisions and race is already taken care today in the mdev core.
> > > > Same
> > > unique device names continue.
> > >
> > > I'm still completely missing a rationale _why_ uuids are supposedly
> > > bad/restricting/etc.
> > There is nothing bad about uuid based naming.
> > Its just too long name to derive phys_port_name of a netdev.
> > In details below.
> >
> > For a given mdev of networking type, we would like to have
> > (a) representor netdevice [2]
> > (b) associated devlink port [3]
> >
> > Currently these representor netdevice exist only for the PCIe SR-IOV VFs.
> > It is further getting extended for mdev without SR-IOV.
> >
> > Each of the devlink port is attached to representor netdevice [4].
> >
> > This netdevice phys_port_name should be a unique derived from some
> property of mdev.
> > Udev/systemd uses phys_port_name to derive unique representor netdev
> name.
> > This netdev name is further use by orchestration and switching software in
> user space.
> > One such distro supported switching software is ovs [4], which relies on the
> persistent device name of the representor netdevice.
> 
> Ok, let me rephrase this to check that I understand this correctly. I'm not sure
> about some of the terms you use here (even after looking at the linked
> doc/code), but that's probably still ok.
> 
> We want to derive an unique (and probably persistent?) netdev name so that
> userspace can refer to a representor netdevice. Makes sense.
> For generating that name, udev uses the phys_port_name (which represents
> the devlink port, IIUC). Also makes sense.
> 
You understood it correctly.

> >
> > phys_port_name has limitation to be only 15 characters long.
> > UUID doesn't fit in phys_port_name.
> 
> Understood. But why do we need to derive the phys_port_name from the mdev
> device name? This netdevice use case seems to be just one use case for using
> mdev devices? If this is a specialized mdev type for this setup, why not just
> expose a shorter identifier via an extra attribute?
> 
Representor netdev, represents mdev's switch port (like PCI SRIOV VF's switch port).
So user must be able to relate this two objects in similar manner as SRIOV VFs.
Phys_port_name is derived from the PCI PF and VF numbering scheme.
Similarly mdev's such port should be derived from mdev's id/name/attribute.

> > Longer UUID names are creating snow ball effect, not just in networking stack
> but many user space tools too.
> 
> This snowball effect mainly comes from the device name -> phys_port_name
> setup, IIUC.
> 
Right.

> > (as opposed to recently introduced mdevctl, are they more mdev tools
> > which has dependency on UUID name?)
> 
> I am aware that people have written scripts etc. to manage their mdevs.
> Given that the mdev infrastructure has been around for quite some time, I'd
> say the chance of some of those scripts relying on uuid names is non-zero.
> 
Ok. but those scripts have never managed networking devices.
So those scripts won't break because they will always create mdev devices using UUID.
When they use these new networking devices, they need more things than their scripts.
So user space upgrade for such mixed mode case is reasonable.

> >
> > Instead of mdev subsystem creating such effect, one option we are
> considering is to have shorter mdev names.
> > (Similar to netdev, rdma, nvme devices).
> > Such as mdev1, mdev2000 etc.
> >
> > Second option I was considering is to have an optional alias for UUID based
> mdev.
> > This name alias is given at time of mdev creation.
> > Devlink port's phys_port_name is derived out of this shorter mdev name
> alias.
> > This way, mdev remains to be UUID based with optional extension.
> > However, I prefer first option to relax mdev naming scheme.
> 
> Actually, I think that second option makes much more sense, as you avoid
> potentially breaking existing tooling.
Let's first understand of what exactly will break with existing tool if they see non_uuid based device.
> 
Existing tooling continue to work with UUID devices.
Do you have example of what can break if they see non_uuid based device name?
I think you are clear, but to be sure, UUID based creation will continue to be there. Optionally mdev will be created with alpha-numeric string, if we don't it as additional attribute.

> >
> > > We want to uniquely identify a device, across different types of
> > > vendor drivers. An uuid is a unique identifier and even a
> > > well-defined one. Tools (e.g. mdevctl) are relying on it for mdev devices
> today.
> > >
> > > What is the problem you're trying to solve?
> > Unique device naming is still achieved without UUID scheme by various
> subsystems in kernel using alpha-numeric string.
> > Having such string based continue to provide unique names.
> >
> > I hope I described the problem and two solutions above.
> >
> > [1] https://github.com/awilliam/mdevctl
> > [2]
> > https://elixir.bootlin.com/linux/v5.3-rc4/source/drivers/net/ethernet/
> > mellanox/mlx5/core/en_rep.c [3]
> > http://man7.org/linux/man-pages/man8/devlink-port.8.html
> > [4]
> > https://elixir.bootlin.com/linux/v5.3-rc4/source/net/core/devlink.c#L6
> > 921
> > [5] https://www.openvswitch.org/
> >


^ permalink raw reply

* [PATCH bpf-next v4] libbpf: add xsk_ring_prod__nb_free() function
From: Eelco Chaudron @ 2019-08-14 13:51 UTC (permalink / raw)
  To: netdev
  Cc: ast, daniel, kafai, songliubraving, yhs, andrii.nakryiko,
	magnus.karlsson

When an AF_XDP application received X packets, it does not mean X
frames can be stuffed into the producer ring. To make it easier for
AF_XDP applications this API allows them to check how many frames can
be added into the ring.

The patch below looks like a name change only, but the xsk_prod__
prefix denotes that this API is exposed to be used by applications.

Besides, if you set the nb value to the size of the ring, you will
get the exact amount of slots available, at the cost of performance
(you touch shared state for sure). nb is there to limit the
touching of the shared state.

Also the example xdpsock application has been modified to use this
new API, so it's also able to process flows at a 1pps rate on veth
interfaces.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
---

v3 -> v4
  - Cleanedup commit message
  - Updated AF_XDP sample application to use this new API

v2 -> v3
  - Removed cache by pass option

v1 -> v2
  - Renamed xsk_ring_prod__free() to xsk_ring_prod__nb_free()
  - Add caching so it will only touch global state when needed

 samples/bpf/xdpsock_user.c | 109 ++++++++++++++++++++++++++++---------
 tools/lib/bpf/xsk.h        |   4 +-
 2 files changed, 86 insertions(+), 27 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 93eaaf7239b2..87115e233b54 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -461,9 +461,13 @@ static void kick_tx(struct xsk_socket_info *xsk)
 
 static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 {
-	u32 idx_cq = 0, idx_fq = 0;
-	unsigned int rcvd;
+	static u64 free_frames[NUM_FRAMES];
+	static size_t nr_free_frames;
+
+	u32 idx_cq = 0, idx_fq = 0, free_slots;
+	unsigned int rcvd, i;
 	size_t ndescs;
+	int ret;
 
 	if (!xsk->outstanding_tx)
 		return;
@@ -474,27 +478,52 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 
 	/* re-add completed Tx buffers */
 	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
-	if (rcvd > 0) {
-		unsigned int i;
-		int ret;
+	if (!rcvd)
+		return;
 
-		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
-		while (ret != rcvd) {
-			if (ret < 0)
-				exit_with_error(-ret);
-			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
-						     &idx_fq);
-		}
-		for (i = 0; i < rcvd; i++)
+	/* When xsk_ring_cons__peek() for example returns that 5 packets
+	 * have been received, it does not automatically mean that
+	 * xsk_ring_prod__reserve() will have 5 slots available. You will
+	 * see this, for example, when using a veth interface due to the
+	 * RX_BATCH_SIZE used by the generic driver.
+	 *
+	 * In this example we store unused buffers and try to re-stock
+	 * them the next iteration.
+	 */
+
+	free_slots = xsk_prod__nb_free(&xsk->umem->fq, rcvd + nr_free_frames);
+	if (free_slots > rcvd + nr_free_frames)
+		free_slots = rcvd + nr_free_frames;
+
+	ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots, &idx_fq);
+	while (ret != free_slots) {
+		if (ret < 0)
+			exit_with_error(-ret);
+		ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots,
+					     &idx_fq);
+	}
+	for (i = 0; i < rcvd; i++) {
+		u64 addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
+
+		if (i < free_slots)
 			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
-				*xsk_ring_cons__comp_addr(&xsk->umem->cq,
-							  idx_cq++);
+				addr;
+		else
+			free_frames[nr_free_frames++] = addr;
+	}
 
-		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
-		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
-		xsk->outstanding_tx -= rcvd;
-		xsk->tx_npkts += rcvd;
+	if (free_slots > rcvd) {
+		for (i = 0; i < (free_slots - rcvd); i++) {
+			u64 addr = free_frames[--nr_free_frames];
+			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
+				addr;
+		}
 	}
+
+	xsk_ring_prod__submit(&xsk->umem->fq, free_slots);
+	xsk_ring_cons__release(&xsk->umem->cq, rcvd);
+	xsk->outstanding_tx -= rcvd;
+	xsk->tx_npkts += rcvd;
 }
 
 static inline void complete_tx_only(struct xsk_socket_info *xsk)
@@ -517,19 +546,37 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
 
 static void rx_drop(struct xsk_socket_info *xsk)
 {
+	static u64 free_frames[NUM_FRAMES];
+	static size_t nr_free_frames;
+
 	unsigned int rcvd, i;
-	u32 idx_rx = 0, idx_fq = 0;
+	u32 idx_rx = 0, idx_fq = 0, free_slots;
 	int ret;
 
 	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
 	if (!rcvd)
 		return;
 
-	ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
-	while (ret != rcvd) {
+	/* When xsk_ring_cons__peek() for example returns that 5 packets
+	 * have been received, it does not automatically mean that
+	 * xsk_ring_prod__reserve() will have 5 slots available. You will
+	 * see this, for example, when using a veth interface due to the
+	 * RX_BATCH_SIZE used by the generic driver.
+	 *
+	 * In this example we store unused buffers and try to re-stock
+	 * them the next iteration.
+	 */
+
+	free_slots = xsk_prod__nb_free(&xsk->umem->fq, rcvd + nr_free_frames);
+	if (free_slots > rcvd + nr_free_frames)
+		free_slots = rcvd + nr_free_frames;
+
+	ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots, &idx_fq);
+	while (ret != free_slots) {
 		if (ret < 0)
 			exit_with_error(-ret);
-		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+		ret = xsk_ring_prod__reserve(&xsk->umem->fq, free_slots,
+					     &idx_fq);
 	}
 
 	for (i = 0; i < rcvd; i++) {
@@ -538,10 +585,22 @@ static void rx_drop(struct xsk_socket_info *xsk)
 		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
 
 		hex_dump(pkt, len, addr);
-		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
+		if (i < free_slots)
+			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
+				addr;
+		else
+			free_frames[nr_free_frames++] = addr;
+	}
+
+	if (free_slots > rcvd) {
+		for (i = 0; i < (free_slots - rcvd); i++) {
+			u64 addr = free_frames[--nr_free_frames];
+			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
+				addr;
+		}
 	}
 
-	xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+	xsk_ring_prod__submit(&xsk->umem->fq, free_slots);
 	xsk_ring_cons__release(&xsk->rx, rcvd);
 	xsk->rx_npkts += rcvd;
 }
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 833a6e60d065..cae506ab3f3c 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -76,7 +76,7 @@ xsk_ring_cons__rx_desc(const struct xsk_ring_cons *rx, __u32 idx)
 	return &descs[idx & rx->mask];
 }
 
-static inline __u32 xsk_prod_nb_free(struct xsk_ring_prod *r, __u32 nb)
+static inline __u32 xsk_prod__nb_free(struct xsk_ring_prod *r, __u32 nb)
 {
 	__u32 free_entries = r->cached_cons - r->cached_prod;
 
@@ -110,7 +110,7 @@ static inline __u32 xsk_cons_nb_avail(struct xsk_ring_cons *r, __u32 nb)
 static inline size_t xsk_ring_prod__reserve(struct xsk_ring_prod *prod,
 					    size_t nb, __u32 *idx)
 {
-	if (xsk_prod_nb_free(prod, nb) < nb)
+	if (xsk_prod__nb_free(prod, nb) < nb)
 		return 0;
 
 	*idx = prod->cached_prod;
-- 
2.18.1


^ permalink raw reply related

* Re: [PATCH 3/3] ocelot_ace: fix action of trap
From: Andrew Lunn @ 2019-08-14 13:52 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Y.b. Lu, netdev@vger.kernel.org, David S . Miller,
	Alexandre Belloni, Microchip Linux Driver Support
In-Reply-To: <20190814085711.7654bff2u66o4yjj@lx-anielsen.microsemi.net>

On Wed, Aug 14, 2019 at 10:57:12AM +0200, Allan W. Nielsen wrote:
> Hi Y.b. and Andrew,
> 
> The 08/14/2019 04:28, Y.b. Lu wrote:
> > > > I'd like to trap all IEEE 1588 PTP Ethernet frames to CPU through etype
> > > 0x88f7.
> > > 
> > > Is this the correct way to handle PTP for this switch? For other switches we
> > > don't need such traps. The switch itself identifies PTP frames and forwards
> > > them to the CPU so it can process them.
> > > 
> > > I'm just wondering if your general approach is wrong?
> > 
> > [Y.b. Lu] PTP messages over Ethernet will use two multicast addresses.
> > 01-80-C2-00-00-0E for peer delay messages.
> Yes, and as you write, this is a BPDU which must not be forwarded (and they are
> not).
> 
> > 01-1B-19-00-00-00 for other messages.
> Yes, this is a normal L2 multicast address, which by default are broadcastet.
> 
> > But only 01-80-C2-00-00-0E could be handled by hardware filter for BPDU frames
> > (01-80-C2-00-00-0x).  For PTP messages handling, trapping them to CPU through
> > VCAP IS2 is the suggested way by Ocelot/Felix.

Hi Allan

The typical userspace for this is linuxptp. It implements Boundary
Clock (BC), Ordinary Clock (OC) and Transparent Clock (TC). On
switches, it works great for L2 PTP. But it has architectural issues
for L3 PTP when used with a bridge. I've no idea if Richard is fixing
this.

> 3) It can be done via 'tc' using the trap action, but I do not know if this is
>    the desired way of doing it.

No, it is not. It could be the way you the implement
ptp_clock_info.enable() does the same as what TC could do, but TC
itself is not used, it should all be internal to the driver. And you
might also want to consider hiding such rules from TC, otherwise the
user might remove them and things break.

     Andrew

^ permalink raw reply

* Re: [PATCH v4 13/14] net: phy: adin: add ethtool get_stats support
From: Andrew Lunn @ 2019-08-14 14:04 UTC (permalink / raw)
  To: Ardelean, Alexandru
  Cc: davem@davemloft.net, hkallweit1@gmail.com,
	devicetree@vger.kernel.org, mark.rutland@arm.com,
	linux-kernel@vger.kernel.org, f.fainelli@gmail.com,
	netdev@vger.kernel.org, robh+dt@kernel.org
In-Reply-To: <2175a95d818172153e839f6bcf6d3d61a3e23dd8.camel@analog.com>

> So, I have to apologize again here.
> I guess I was an idiot/n00b about this.

Not a problem. If it is not something you have come across before, you
can easily miss the significance.

So you just need to modify the ordering and you are good to go.
Please add a comment in the code about this latching. We don't want
somebody changing the ordering and breaking it.

	 Thanks
		Andrew

^ permalink raw reply

* RE: [EXT] INFO: trying to register non-static key in del_timer_sync (2)
From: Ganapathi Bhat @ 2019-08-14 14:08 UTC (permalink / raw)
  To: Kalle Valo, Andrey Konovalov
  Cc: Dmitry Vyukov, syzbot, amitkarwar@gmail.com, davem@davemloft.net,
	huxinming820@gmail.com, linux-kernel@vger.kernel.org,
	linux-usb@vger.kernel.org, linux-wireless@vger.kernel.org,
	netdev@vger.kernel.org, nishants@marvell.com,
	syzkaller-bugs@googlegroups.com
In-Reply-To: <87k1bhb20j.fsf@kamboji.qca.qualcomm.com>

Hi Dmitry/Kalle,

> >>
> >> Hi Dmitry,
> >>
> >> We have a patch to fix this:
> >> https://patchwork.kernel.org/patch/10990275/
> >
> > Hi Ganapathi,
> >
> > Has this patch been accepted anywhere? This bug is still open on syzbot.
> 
> The patch is in "Changes Requested" state which means that the author is
> supposed to send a new version based on the review comments.
We will address the review comments and try to push the updated version soon;

Regards,
Ganapathi

^ permalink raw reply

* [PATCH bpf-next v2] net: Don't call XDP_SETUP_PROG when nothing is changed
From: Maxim Mikityanskiy @ 2019-08-14 14:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski
  Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, David S. Miller,
	Björn Töpel, Saeed Mahameed, Jesper Dangaard Brouer,
	John Fastabend, Martin KaFai Lau, Song Liu, Yonghong Song,
	Maxim Mikityanskiy
In-Reply-To: <5b123e9a-095f-1db4-da6e-5af6552430e1@iogearbox.net>

Don't uninstall an XDP program when none is installed, and don't install
an XDP program that has the same ID as the one already installed.

dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
XDP program. It means that the driver's ndo_bpf can be called with
XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
case happens if the user runs `ip link set eth0 xdp off` when there is
no XDP program attached.

The symmetrical case is possible when the user tries to set the program
that is already set.

The drivers typically perform some heavy operations on XDP_SETUP_PROG,
so they all have to handle these cases internally to return early if
they happen. This patch puts this check into the kernel code, so that
all drivers will benefit from it.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
---
v2 changes: Cover the case when the program is set, but isn't changed.

 net/core/dev.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 49589ed2018d..b1afafee3e2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8126,12 +8126,15 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		bpf_chk = generic_xdp_install;
 
 	if (fd >= 0) {
+		u32 prog_id;
+
 		if (!offload && __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) {
 			NL_SET_ERR_MSG(extack, "native and generic XDP can't be active at the same time");
 			return -EEXIST;
 		}
-		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
-		    __dev_xdp_query(dev, bpf_op, query)) {
+
+		prog_id = __dev_xdp_query(dev, bpf_op, query);
+		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && prog_id) {
 			NL_SET_ERR_MSG(extack, "XDP program already attached");
 			return -EBUSY;
 		}
@@ -8146,6 +8149,14 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			bpf_prog_put(prog);
 			return -EINVAL;
 		}
+
+		if (prog->aux->id == prog_id) {
+			bpf_prog_put(prog);
+			return 0;
+		}
+	} else {
+		if (!__dev_xdp_query(dev, bpf_op, query))
+			return 0;
 	}
 
 	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v4 9/9] Input: add IOC3 serio driver
From: Thomas Bogendoerfer @ 2019-08-14 14:37 UTC (permalink / raw)
  To: Jonas Gorski
  Cc: Ralf Baechle, Paul Burton, James Hogan, Dmitry Torokhov,
	Lee Jones, David S. Miller, Srinivas Kandagatla, Alessandro Zummo,
	Alexandre Belloni, Greg Kroah-Hartman, Jiri Slaby,
	Evgeniy Polyakov, linux-mips, linux-kernel, linux-input,
	Network Development, linux-rtc, linux-serial
In-Reply-To: <CAOiHx=kuQtOuNfsJ+fDrps+hbrbp5cPujmQpi8Vfy+0qeP8dtA@mail.gmail.com>

On Wed, 14 Aug 2019 15:20:14 +0200
Jonas Gorski <jonas.gorski@gmail.com> wrote:

> > +       d = devm_kzalloc(&pdev->dev, sizeof(*d), GFP_KERNEL);
> 
> &pdev->dev => dev

will change.

> 
> > +       if (!d)
> > +               return -ENOMEM;
> > +
> > +       sk = kzalloc(sizeof(*sk), GFP_KERNEL);
> 
> any reason not to devm_kzalloc this as well? Then you won't need to
> manually free it in the error cases.

it has different life time than the device, so it may not allocated
via devm_kzalloc

> > +static int ioc3kbd_remove(struct platform_device *pdev)
> > +{
> > +       struct ioc3kbd_data *d = platform_get_drvdata(pdev);
> > +
> > +       devm_free_irq(&pdev->dev, d->irq, d);
> > +       serio_unregister_port(d->kbd);
> > +       serio_unregister_port(d->aux);
> > +       return 0;
> > +}
> 
> and on that topic, won't you need to kfree d->kbd and d->aux here?

that's done in serio_release_port() by the serio core.

Thomas.

-- 
SUSE Linux GmbH
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* [PATCH net-next v2] net: dsa: mv88e6xxx: check for mode change in port_setup_mac
From: Marek Behún @ 2019-08-14 14:40 UTC (permalink / raw)
  To: netdev; +Cc: Andrew Lunn, Vivien Didelot, Heiner Kallweit, Marek Behún

The mv88e6xxx_port_setup_mac checks if the requested MAC settings are
different from the current ones, and if not, does nothing (since chaning
them requires putting the link down).

In this check it only looks if the triplet [link, speed, duplex] is
being changed.

This patch adds support to also check if the mode parameter (of type
phy_interface_t) is requested to be changed. The current mode is
computed by the ->port_link_state() method, and if it is different from
PHY_INTERFACE_MODE_NA, we check for equality with the requested mode.

In the implementations of the mv88e6250_port_link_state() method we set
the current mode to PHY_INTERFACE_MODE_NA - so the code does not check
for mode change on 6250.

In the mv88e6352_port_link_state() method, we use the cached cmode of
the port to determine the mode as phy_interface_t (and if it is not
enough, eg. for RGMII, we also look at the port control register for
RX/TX timings).

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 drivers/net/dsa/mv88e6xxx/chip.c |  4 +++-
 drivers/net/dsa/mv88e6xxx/port.c | 38 ++++++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/port.h |  1 +
 3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 818a83eb2dcb..9b3ad22a5b98 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -417,7 +417,9 @@ int mv88e6xxx_port_setup_mac(struct mv88e6xxx_chip *chip, int port, int link,
 	 */
 	if (state.link == link &&
 	    state.speed == speed &&
-	    state.duplex == duplex)
+	    state.duplex == duplex &&
+	    (state.interface == mode ||
+	     state.interface == PHY_INTERFACE_MODE_NA))
 		return 0;
 
 	/* Port's MAC control must not be changed unless the link is down */
diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index 04309ef0a1cc..c95cdb73e5a2 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -590,6 +590,7 @@ int mv88e6250_port_link_state(struct mv88e6xxx_chip *chip, int port,
 	state->link = !!(reg & MV88E6250_PORT_STS_LINK);
 	state->an_enabled = 1;
 	state->an_complete = state->link;
+	state->interface = PHY_INTERFACE_MODE_NA;
 
 	return 0;
 }
@@ -600,6 +601,43 @@ int mv88e6352_port_link_state(struct mv88e6xxx_chip *chip, int port,
 	int err;
 	u16 reg;
 
+	switch (chip->ports[port].cmode) {
+	case MV88E6XXX_PORT_STS_CMODE_RGMII:
+		err = mv88e6xxx_port_read(chip, port, MV88E6XXX_PORT_MAC_CTL,
+					  &reg);
+		if (err)
+			return err;
+
+		if ((reg & MV88E6XXX_PORT_MAC_CTL_RGMII_DELAY_RXCLK) &&
+		    (reg & MV88E6XXX_PORT_MAC_CTL_RGMII_DELAY_TXCLK))
+			state->interface = PHY_INTERFACE_MODE_RGMII_ID;
+		else if (reg & MV88E6XXX_PORT_MAC_CTL_RGMII_DELAY_RXCLK)
+			state->interface = PHY_INTERFACE_MODE_RGMII_RXID;
+		else if (reg & MV88E6XXX_PORT_MAC_CTL_RGMII_DELAY_TXCLK)
+			state->interface = PHY_INTERFACE_MODE_RGMII_TXID;
+		else
+			state->interface = PHY_INTERFACE_MODE_RGMII;
+		break;
+	case MV88E6XXX_PORT_STS_CMODE_1000BASE_X:
+		state->interface = PHY_INTERFACE_MODE_1000BASEX;
+		break;
+	case MV88E6XXX_PORT_STS_CMODE_SGMII:
+		state->interface = PHY_INTERFACE_MODE_SGMII;
+		break;
+	case MV88E6XXX_PORT_STS_CMODE_2500BASEX:
+		state->interface = PHY_INTERFACE_MODE_2500BASEX;
+		break;
+	case MV88E6XXX_PORT_STS_CMODE_XAUI:
+		state->interface = PHY_INTERFACE_MODE_XAUI;
+		break;
+	case MV88E6XXX_PORT_STS_CMODE_RXAUI:
+		state->interface = PHY_INTERFACE_MODE_RXAUI;
+		break;
+	default:
+		/* we do not support other cmode values here */
+		state->interface = PHY_INTERFACE_MODE_NA;
+	}
+
 	err = mv88e6xxx_port_read(chip, port, MV88E6XXX_PORT_STS, &reg);
 	if (err)
 		return err;
diff --git a/drivers/net/dsa/mv88e6xxx/port.h b/drivers/net/dsa/mv88e6xxx/port.h
index ceec771f8bfc..1abf5ea033e2 100644
--- a/drivers/net/dsa/mv88e6xxx/port.h
+++ b/drivers/net/dsa/mv88e6xxx/port.h
@@ -42,6 +42,7 @@
 #define MV88E6XXX_PORT_STS_TX_PAUSED		0x0020
 #define MV88E6XXX_PORT_STS_FLOW_CTL		0x0010
 #define MV88E6XXX_PORT_STS_CMODE_MASK		0x000f
+#define MV88E6XXX_PORT_STS_CMODE_RGMII		0x0007
 #define MV88E6XXX_PORT_STS_CMODE_100BASE_X	0x0008
 #define MV88E6XXX_PORT_STS_CMODE_1000BASE_X	0x0009
 #define MV88E6XXX_PORT_STS_CMODE_SGMII		0x000a
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 0/4] net: bridge: mdb: allow dump/add/del of host-joined entries
From: Nikolay Aleksandrov @ 2019-08-14 14:40 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov

Hi,
This set makes the bridge dump host-joined mdb entries, they should be
treated as normal entries since they take a slot and are aging out.
We already have notifications for them but we couldn't dump them until
now so they remained hidden. We dump them similar to how they're
notified, in order to keep user-space compatibility with the dumped
objects (e.g. iproute2 dumps mdbs in a format which can be fed into
add/del commands) we allow host-joined groups also to be added/deleted via
mdb commands. That can later be used for L2 mcast MAC manipulation as
was recently discussed. Note that iproute2 changes are not necessary,
this set will work with the current user-space mdb code.

Patch 01 - a trivial comment move
Patch 02 - factors out the mdb filling code so it can be
           re-used for the host-joined entries
Patch 03 - dumps host-joined entries
Patch 04 - allows manipulation of host-joined entries via standard mdb
           calls

Thanks,
 Nik

Nikolay Aleksandrov (4):
  net: bridge: mdb: move vlan comments
  net: bridge: mdb: factor out mdb filling
  net: bridge: mdb: dump host-joined entries as well
  net: bridge: mdb: allow add/delete for host-joined groups

 net/bridge/br_mdb.c       | 171 +++++++++++++++++++++++++-------------
 net/bridge/br_multicast.c |  24 ++++--
 net/bridge/br_private.h   |   2 +
 3 files changed, 133 insertions(+), 64 deletions(-)

-- 
2.21.0

^ permalink raw reply

* [PATCH net-next 1/4] net: bridge: mdb: move vlan comments
From: Nikolay Aleksandrov @ 2019-08-14 14:40 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov
In-Reply-To: <20190814144024.9710-1-nikolay@cumulusnetworks.com>

Trivial patch to move the vlan comments in their proper places above the
vid 0 checks.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 net/bridge/br_mdb.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 428af1abf8cc..ee6208c6d946 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -653,9 +653,6 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	br = netdev_priv(dev);
 
-	/* If vlan filtering is enabled and VLAN is not specified
-	 * install mdb entry on all vlans configured on the port.
-	 */
 	pdev = __dev_get_by_index(net, entry->ifindex);
 	if (!pdev)
 		return -ENODEV;
@@ -665,6 +662,9 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 
 	vg = nbp_vlan_group(p);
+	/* If vlan filtering is enabled and VLAN is not specified
+	 * install mdb entry on all vlans configured on the port.
+	 */
 	if (br_vlan_enabled(br->dev) && vg && entry->vid == 0) {
 		list_for_each_entry(v, &vg->vlan_list, vlist) {
 			entry->vid = v->vid;
@@ -745,9 +745,6 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	br = netdev_priv(dev);
 
-	/* If vlan filtering is enabled and VLAN is not specified
-	 * delete mdb entry on all vlans configured on the port.
-	 */
 	pdev = __dev_get_by_index(net, entry->ifindex);
 	if (!pdev)
 		return -ENODEV;
@@ -757,6 +754,9 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 		return -EINVAL;
 
 	vg = nbp_vlan_group(p);
+	/* If vlan filtering is enabled and VLAN is not specified
+	 * delete mdb entry on all vlans configured on the port.
+	 */
 	if (br_vlan_enabled(br->dev) && vg && entry->vid == 0) {
 		list_for_each_entry(v, &vg->vlan_list, vlist) {
 			entry->vid = v->vid;
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 2/4] net: bridge: mdb: factor out mdb filling
From: Nikolay Aleksandrov @ 2019-08-14 14:40 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov
In-Reply-To: <20190814144024.9710-1-nikolay@cumulusnetworks.com>

We have to factor out the mdb fill portion in order to re-use it later for
the bridge mdb entries. No functional changes intended.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 net/bridge/br_mdb.c | 68 ++++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 31 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index ee6208c6d946..77730983097e 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -77,6 +77,40 @@ static void __mdb_entry_to_br_ip(struct br_mdb_entry *entry, struct br_ip *ip)
 #endif
 }
 
+static int __mdb_fill_info(struct sk_buff *skb,
+			   struct net_bridge_port_group *p)
+{
+	struct nlattr *nest_ent;
+	struct br_mdb_entry e;
+
+	memset(&e, 0, sizeof(e));
+	__mdb_entry_fill_flags(&e, p->flags);
+	e.ifindex = p->port->dev->ifindex;
+	e.vid = p->addr.vid;
+	if (p->addr.proto == htons(ETH_P_IP))
+		e.addr.u.ip4 = p->addr.u.ip4;
+#if IS_ENABLED(CONFIG_IPV6)
+	if (p->addr.proto == htons(ETH_P_IPV6))
+		e.addr.u.ip6 = p->addr.u.ip6;
+#endif
+	e.addr.proto = p->addr.proto;
+	nest_ent = nla_nest_start_noflag(skb,
+					 MDBA_MDB_ENTRY_INFO);
+	if (!nest_ent)
+		return -EMSGSIZE;
+
+	if (nla_put_nohdr(skb, sizeof(e), &e) ||
+	    nla_put_u32(skb,
+			MDBA_MDB_EATTR_TIMER,
+			br_timer_value(&p->timer))) {
+		nla_nest_cancel(skb, nest_ent);
+		return -EMSGSIZE;
+	}
+	nla_nest_end(skb, nest_ent);
+
+	return 0;
+}
+
 static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
 			    struct net_device *dev)
 {
@@ -95,7 +129,6 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
 	hlist_for_each_entry_rcu(mp, &br->mdb_list, mdb_node) {
 		struct net_bridge_port_group *p;
 		struct net_bridge_port_group __rcu **pp;
-		struct net_bridge_port *port;
 
 		if (idx < s_idx)
 			goto skip;
@@ -108,41 +141,14 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
 
 		for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
 		      pp = &p->next) {
-			struct nlattr *nest_ent;
-			struct br_mdb_entry e;
-
-			port = p->port;
-			if (!port)
+			if (!p->port)
 				continue;
 
-			memset(&e, 0, sizeof(e));
-			e.ifindex = port->dev->ifindex;
-			e.vid = p->addr.vid;
-			__mdb_entry_fill_flags(&e, p->flags);
-			if (p->addr.proto == htons(ETH_P_IP))
-				e.addr.u.ip4 = p->addr.u.ip4;
-#if IS_ENABLED(CONFIG_IPV6)
-			if (p->addr.proto == htons(ETH_P_IPV6))
-				e.addr.u.ip6 = p->addr.u.ip6;
-#endif
-			e.addr.proto = p->addr.proto;
-			nest_ent = nla_nest_start_noflag(skb,
-							 MDBA_MDB_ENTRY_INFO);
-			if (!nest_ent) {
-				nla_nest_cancel(skb, nest2);
-				err = -EMSGSIZE;
-				goto out;
-			}
-			if (nla_put_nohdr(skb, sizeof(e), &e) ||
-			    nla_put_u32(skb,
-					MDBA_MDB_EATTR_TIMER,
-					br_timer_value(&p->timer))) {
-				nla_nest_cancel(skb, nest_ent);
+			err = __mdb_fill_info(skb, p);
+			if (err) {
 				nla_nest_cancel(skb, nest2);
-				err = -EMSGSIZE;
 				goto out;
 			}
-			nla_nest_end(skb, nest_ent);
 		}
 		nla_nest_end(skb, nest2);
 skip:
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 3/4] net: bridge: mdb: dump host-joined entries as well
From: Nikolay Aleksandrov @ 2019-08-14 14:40 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov
In-Reply-To: <20190814144024.9710-1-nikolay@cumulusnetworks.com>

Currently we dump only the port mdb entries but we can have host-joined
entries on the bridge itself and they should be treated as normal temp
mdbs, they're already notified:
$ bridge monitor all
[MDB]dev br0 port br0 grp ff02::8 temp

The group will not be shown in the bridge mdb output, but it takes 1 slot
and it's timing out. If it's only host-joined then the mdb show output
can even be empty.

After this patch we show the host-joined groups:
$ bridge mdb show
dev br0 port br0 grp ff02::8 temp

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 net/bridge/br_mdb.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 77730983097e..985273425117 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -78,22 +78,35 @@ static void __mdb_entry_to_br_ip(struct br_mdb_entry *entry, struct br_ip *ip)
 }
 
 static int __mdb_fill_info(struct sk_buff *skb,
+			   struct net_bridge_mdb_entry *mp,
 			   struct net_bridge_port_group *p)
 {
+	struct timer_list *mtimer;
 	struct nlattr *nest_ent;
 	struct br_mdb_entry e;
+	u8 flags = 0;
+	int ifindex;
 
 	memset(&e, 0, sizeof(e));
-	__mdb_entry_fill_flags(&e, p->flags);
-	e.ifindex = p->port->dev->ifindex;
-	e.vid = p->addr.vid;
-	if (p->addr.proto == htons(ETH_P_IP))
-		e.addr.u.ip4 = p->addr.u.ip4;
+	if (p) {
+		ifindex = p->port->dev->ifindex;
+		mtimer = &p->timer;
+		flags = p->flags;
+	} else {
+		ifindex = mp->br->dev->ifindex;
+		mtimer = &mp->timer;
+	}
+
+	__mdb_entry_fill_flags(&e, flags);
+	e.ifindex = ifindex;
+	e.vid = mp->addr.vid;
+	if (mp->addr.proto == htons(ETH_P_IP))
+		e.addr.u.ip4 = mp->addr.u.ip4;
 #if IS_ENABLED(CONFIG_IPV6)
-	if (p->addr.proto == htons(ETH_P_IPV6))
-		e.addr.u.ip6 = p->addr.u.ip6;
+	if (mp->addr.proto == htons(ETH_P_IPV6))
+		e.addr.u.ip6 = mp->addr.u.ip6;
 #endif
-	e.addr.proto = p->addr.proto;
+	e.addr.proto = mp->addr.proto;
 	nest_ent = nla_nest_start_noflag(skb,
 					 MDBA_MDB_ENTRY_INFO);
 	if (!nest_ent)
@@ -102,7 +115,7 @@ static int __mdb_fill_info(struct sk_buff *skb,
 	if (nla_put_nohdr(skb, sizeof(e), &e) ||
 	    nla_put_u32(skb,
 			MDBA_MDB_EATTR_TIMER,
-			br_timer_value(&p->timer))) {
+			br_timer_value(mtimer))) {
 		nla_nest_cancel(skb, nest_ent);
 		return -EMSGSIZE;
 	}
@@ -139,12 +152,20 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
 			break;
 		}
 
+		if (mp->host_joined) {
+			err = __mdb_fill_info(skb, mp, NULL);
+			if (err) {
+				nla_nest_cancel(skb, nest2);
+				break;
+			}
+		}
+
 		for (pp = &mp->ports; (p = rcu_dereference(*pp)) != NULL;
 		      pp = &p->next) {
 			if (!p->port)
 				continue;
 
-			err = __mdb_fill_info(skb, p);
+			err = __mdb_fill_info(skb, mp, p);
 			if (err) {
 				nla_nest_cancel(skb, nest2);
 				goto out;
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next 4/4] net: bridge: mdb: allow add/delete for host-joined groups
From: Nikolay Aleksandrov @ 2019-08-14 14:40 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov
In-Reply-To: <20190814144024.9710-1-nikolay@cumulusnetworks.com>

Currently this is needed only for user-space compatibility, so similar
object adds/deletes as the dumped ones would succeed. Later it can be
used for L2 mcast MAC add/delete.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 net/bridge/br_mdb.c       | 74 +++++++++++++++++++++++++++------------
 net/bridge/br_multicast.c | 24 +++++++++----
 net/bridge/br_private.h   |  2 ++
 3 files changed, 71 insertions(+), 29 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 985273425117..331a130b83b1 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -616,6 +616,19 @@ static int br_mdb_add_group(struct net_bridge *br, struct net_bridge_port *port,
 			return err;
 	}
 
+	/* host join */
+	if (!port) {
+		/* don't allow any flags for host-joined groups */
+		if (state)
+			return -EINVAL;
+		if (mp->host_joined)
+			return -EEXIST;
+
+		br_multicast_host_join(mp);
+
+		return 0;
+	}
+
 	for (pp = &mp->ports;
 	     (p = mlock_dereference(*pp, br)) != NULL;
 	     pp = &p->next) {
@@ -640,19 +653,21 @@ static int __br_mdb_add(struct net *net, struct net_bridge *br,
 {
 	struct br_ip ip;
 	struct net_device *dev;
-	struct net_bridge_port *p;
+	struct net_bridge_port *p = NULL;
 	int ret;
 
 	if (!netif_running(br->dev) || !br_opt_get(br, BROPT_MULTICAST_ENABLED))
 		return -EINVAL;
 
-	dev = __dev_get_by_index(net, entry->ifindex);
-	if (!dev)
-		return -ENODEV;
+	if (entry->ifindex != br->dev->ifindex) {
+		dev = __dev_get_by_index(net, entry->ifindex);
+		if (!dev)
+			return -ENODEV;
 
-	p = br_port_get_rtnl(dev);
-	if (!p || p->br != br || p->state == BR_STATE_DISABLED)
-		return -EINVAL;
+		p = br_port_get_rtnl(dev);
+		if (!p || p->br != br || p->state == BR_STATE_DISABLED)
+			return -EINVAL;
+	}
 
 	__mdb_entry_to_br_ip(entry, &ip);
 
@@ -680,15 +695,19 @@ static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	br = netdev_priv(dev);
 
-	pdev = __dev_get_by_index(net, entry->ifindex);
-	if (!pdev)
-		return -ENODEV;
+	if (entry->ifindex != br->dev->ifindex) {
+		pdev = __dev_get_by_index(net, entry->ifindex);
+		if (!pdev)
+			return -ENODEV;
 
-	p = br_port_get_rtnl(pdev);
-	if (!p || p->br != br || p->state == BR_STATE_DISABLED)
-		return -EINVAL;
+		p = br_port_get_rtnl(pdev);
+		if (!p || p->br != br || p->state == BR_STATE_DISABLED)
+			return -EINVAL;
+		vg = nbp_vlan_group(p);
+	} else {
+		vg = br_vlan_group(br);
+	}
 
-	vg = nbp_vlan_group(p);
 	/* If vlan filtering is enabled and VLAN is not specified
 	 * install mdb entry on all vlans configured on the port.
 	 */
@@ -727,6 +746,13 @@ static int __br_mdb_del(struct net_bridge *br, struct br_mdb_entry *entry)
 	if (!mp)
 		goto unlock;
 
+	/* host leave */
+	if (entry->ifindex == mp->br->dev->ifindex && mp->host_joined) {
+		br_multicast_host_leave(mp);
+		err = 0;
+		goto unlock;
+	}
+
 	for (pp = &mp->ports;
 	     (p = mlock_dereference(*pp, br)) != NULL;
 	     pp = &p->next) {
@@ -759,9 +785,9 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
 	struct net *net = sock_net(skb->sk);
 	struct net_bridge_vlan_group *vg;
+	struct net_bridge_port *p = NULL;
 	struct net_device *dev, *pdev;
 	struct br_mdb_entry *entry;
-	struct net_bridge_port *p;
 	struct net_bridge_vlan *v;
 	struct net_bridge *br;
 	int err;
@@ -772,15 +798,19 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 	br = netdev_priv(dev);
 
-	pdev = __dev_get_by_index(net, entry->ifindex);
-	if (!pdev)
-		return -ENODEV;
+	if (entry->ifindex != br->dev->ifindex) {
+		pdev = __dev_get_by_index(net, entry->ifindex);
+		if (!pdev)
+			return -ENODEV;
 
-	p = br_port_get_rtnl(pdev);
-	if (!p || p->br != br || p->state == BR_STATE_DISABLED)
-		return -EINVAL;
+		p = br_port_get_rtnl(pdev);
+		if (!p || p->br != br || p->state == BR_STATE_DISABLED)
+			return -EINVAL;
+		vg = nbp_vlan_group(p);
+	} else {
+		vg = br_vlan_group(br);
+	}
 
-	vg = nbp_vlan_group(p);
 	/* If vlan filtering is enabled and VLAN is not specified
 	 * delete mdb entry on all vlans configured on the port.
 	 */
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 9b379e110129..f92cb6751898 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -148,8 +148,7 @@ static void br_multicast_group_expired(struct timer_list *t)
 	if (!netif_running(br->dev) || timer_pending(&mp->timer))
 		goto out;
 
-	mp->host_joined = false;
-	br_mdb_notify(br->dev, NULL, &mp->addr, RTM_DELMDB, 0);
+	br_multicast_host_leave(mp);
 
 	if (mp->ports)
 		goto out;
@@ -512,6 +511,21 @@ static bool br_port_group_equal(struct net_bridge_port_group *p,
 	return ether_addr_equal(src, p->eth_addr);
 }
 
+void br_multicast_host_join(struct net_bridge_mdb_entry *mp)
+{
+	if (!mp->host_joined) {
+		mp->host_joined = true;
+		br_mdb_notify(mp->br->dev, NULL, &mp->addr, RTM_NEWMDB, 0);
+	}
+	mod_timer(&mp->timer, jiffies + mp->br->multicast_membership_interval);
+}
+
+void br_multicast_host_leave(struct net_bridge_mdb_entry *mp)
+{
+	mp->host_joined = false;
+	br_mdb_notify(mp->br->dev, NULL, &mp->addr, RTM_DELMDB, 0);
+}
+
 static int br_multicast_add_group(struct net_bridge *br,
 				  struct net_bridge_port *port,
 				  struct br_ip *group,
@@ -534,11 +548,7 @@ static int br_multicast_add_group(struct net_bridge *br,
 		goto err;
 
 	if (!port) {
-		if (!mp->host_joined) {
-			mp->host_joined = true;
-			br_mdb_notify(br->dev, NULL, &mp->addr, RTM_NEWMDB, 0);
-		}
-		mod_timer(&mp->timer, now + br->multicast_membership_interval);
+		br_multicast_host_join(mp);
 		goto out;
 	}
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index b7a4942ff1b3..a99dcbb9825c 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -702,6 +702,8 @@ void br_multicast_get_stats(const struct net_bridge *br,
 			    struct br_mcast_stats *dest);
 void br_mdb_init(void);
 void br_mdb_uninit(void);
+void br_multicast_host_join(struct net_bridge_mdb_entry *mp);
+void br_multicast_host_leave(struct net_bridge_mdb_entry *mp);
 
 #define mlock_dereference(X, br) \
 	rcu_dereference_protected(X, lockdep_is_held(&br->multicast_lock))
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Marcelo Ricardo Leitner @ 2019-08-14 14:41 UTC (permalink / raw)
  To: Bernd; +Cc: Neal Cardwell, netdev, edumazet
In-Reply-To: <CABOR3+zQ0yfbcon6bv5TXrrAomoWLxy101iEXqBycDTrhytDiA@mail.gmail.com>

On Fri, Aug 02, 2019 at 09:58:12PM +0200, Bernd wrote:
> 2019-08-02 21:14 GMT+02:00, Neal Cardwell <ncardwell@google.com>:
> > What's the exact kernel version you are using?
> 
> It is the RHEL errata kernel 3.10.0-957.21.3.el7 (rhsa-2019:1481), i
> need to check if there is a newer one.

FWIW, this one doesn't have the patch below.

  Marcelo

> 
> > Eric submitted a patch recently that may address your issue:
> >    tcp: be more careful in tcp_fragment()
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b617158dc096709d8600c53b6052144d12b89fab
> >
> > Would you be able to test your workload with that commit
> > cherry-picked, and see if the issue still occurs?
> 
> It only happens on a customer system in production up to now, so most
> likely not.
> 
> > That commit was targeted to many stable releases, so you may be able
> > to pick up that fix from a stable branch.
> 
> The only thing which is a bit strange, this is a Java client, and I am
> pretty sure we don’t set a small SO_SNDBUF, if anything it is
> increased (I need to verify that).
> 
> Not to worry, I guess I can now with your helpful pointer sort that
> out with Redhat.
> 
> gruss
> Bernd
> 

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/8] xsk: replace ndo_xsk_async_xmit with ndo_xsk_wakeup
From: Jonathan Lemon @ 2019-08-14 14:46 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, daniel, netdev, brouer, maximmi, bpf,
	bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, sridhar.samudrala, kevin.laatz, ilias.apalodimas,
	kiran.patil, axboe, maciej.fijalkowski, maciejromanfijalkowski,
	intel-wired-lan
In-Reply-To: <1565767643-4908-2-git-send-email-magnus.karlsson@intel.com>



On 14 Aug 2019, at 0:27, Magnus Karlsson wrote:

> This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new
> ndo provides the same functionality as before but with the addition of
> a new flags field that is used to specifiy if Rx, Tx or both should be
> woken up. The previous ndo only woke up Tx, as implied by the
> name. The i40e and ixgbe drivers (which are all the supported ones)
> are updated with this new interface.
>
> This new ndo will be used by the new need_wakeup functionality of XDP
> sockets that need to be able to wake up both Rx and Tx driver
> processing.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v4 2/8] xsk: add support for need_wakeup flag in AF_XDP rings
From: Jonathan Lemon @ 2019-08-14 14:47 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, daniel, netdev, brouer, maximmi, bpf,
	bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, sridhar.samudrala, kevin.laatz, ilias.apalodimas,
	kiran.patil, axboe, maciej.fijalkowski, maciejromanfijalkowski,
	intel-wired-lan
In-Reply-To: <1565767643-4908-3-git-send-email-magnus.karlsson@intel.com>

On 14 Aug 2019, at 0:27, Magnus Karlsson wrote:

> This commit adds support for a new flag called need_wakeup in the
> AF_XDP Tx and fill rings. When this flag is set, it means that the
> application has to explicitly wake up the kernel Rx (for the bit in
> the fill ring) or kernel Tx (for bit in the Tx ring) processing by
> issuing a syscall. Poll() can wake up both depending on the flags
> submitted and sendto() will wake up tx processing only.
>
> The main reason for introducing this new flag is to be able to
> efficiently support the case when application and driver is executing
> on the same core. Previously, the driver was just busy-spinning on the
> fill ring if it ran out of buffers in the HW and there were none on
> the fill ring. This approach works when the application is running on
> another core as it can replenish the fill ring while the driver is
> busy-spinning. Though, this is a lousy approach if both of them are
> running on the same core as the probability of the fill ring getting
> more entries when the driver is busy-spinning is zero. With this new
> feature the driver now sets the need_wakeup flag and returns to the
> application. The application can then replenish the fill queue and
> then explicitly wake up the Rx processing in the kernel using the
> syscall poll(). For Tx, the flag is only set to one if the driver has
> no outstanding Tx completion interrupts. If it has some, the flag is
> zero as it will be woken up by a completion interrupt anyway.
>
> As a nice side effect, this new flag also improves the performance of
> the case where application and driver are running on two different
> cores as it reduces the number of syscalls to the kernel. The kernel
> tells user space if it needs to be woken up by a syscall, and this
> eliminates many of the syscalls.
>
> This flag needs some simple driver support. If the driver does not
> support this, the Rx flag is always zero and the Tx flag is always
> one. This makes any application relying on this feature default to the
> old behaviour of not requiring any syscalls in the Rx path and always
> having to call sendto() in the Tx path.
>
> For backwards compatibility reasons, this feature has to be explicitly
> turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
> that you always turn it on as it so far always have had a positive
> performance impact.
>
> The name and inspiration of the flag has been taken from io_uring by
> Jens Axboe. Details about this feature in io_uring can be found in
> http://kernel.dk/io_uring.pdf, section 8.3.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next v2] net: dsa: mv88e6xxx: check for mode change in port_setup_mac
From: Andrew Lunn @ 2019-08-14 14:48 UTC (permalink / raw)
  To: Marek Behún; +Cc: netdev, Vivien Didelot, Heiner Kallweit
In-Reply-To: <20190814144024.27975-1-marek.behun@nic.cz>

On Wed, Aug 14, 2019 at 04:40:24PM +0200, Marek Behún wrote:
> The mv88e6xxx_port_setup_mac checks if the requested MAC settings are
> different from the current ones, and if not, does nothing (since chaning
> them requires putting the link down).
> 
> In this check it only looks if the triplet [link, speed, duplex] is
> being changed.
> 
> This patch adds support to also check if the mode parameter (of type
> phy_interface_t) is requested to be changed. The current mode is
> computed by the ->port_link_state() method, and if it is different from
> PHY_INTERFACE_MODE_NA, we check for equality with the requested mode.
> 
> In the implementations of the mv88e6250_port_link_state() method we set
> the current mode to PHY_INTERFACE_MODE_NA - so the code does not check
> for mode change on 6250.
> 
> In the mv88e6352_port_link_state() method, we use the cached cmode of
> the port to determine the mode as phy_interface_t (and if it is not
> enough, eg. for RGMII, we also look at the port control register for
> RX/TX timings).
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v4 3/8] i40e: add support for AF_XDP need_wakeup feature
From: Jonathan Lemon @ 2019-08-14 14:48 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, daniel, netdev, brouer, maximmi, bpf,
	bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, sridhar.samudrala, kevin.laatz, ilias.apalodimas,
	kiran.patil, axboe, maciej.fijalkowski, maciejromanfijalkowski,
	intel-wired-lan
In-Reply-To: <1565767643-4908-4-git-send-email-magnus.karlsson@intel.com>



On 14 Aug 2019, at 0:27, Magnus Karlsson wrote:

> This patch adds support for the need_wakeup feature of AF_XDP. If the
> application has told the kernel that it might sleep using the new bind
> flag XDP_USE_NEED_WAKEUP, the driver will then set this flag if it has
> no more buffers on the NIC Rx ring and yield to the application. For
> Tx, it will set the flag if it has no outstanding Tx completion
> interrupts and return to the application.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index d0ff5d8..42c9012 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -626,6 +626,15 @@ int i40e_clean_rx_irq_zc(struct i40e_ring 
> *rx_ring, int budget)
>
>  	i40e_finalize_xdp_rx(rx_ring, xdp_xmit);
>  	i40e_update_rx_stats(rx_ring, total_rx_bytes, total_rx_packets);
> +
> +	if (xsk_umem_uses_need_wakeup(rx_ring->xsk_umem)) {
> +		if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
> +			xsk_set_rx_need_wakeup(rx_ring->xsk_umem);
> +		else
> +			xsk_clear_rx_need_wakeup(rx_ring->xsk_umem);
> +
> +		return (int)total_rx_packets;
> +	}
>  	return failure ? budget : (int)total_rx_packets;

Can you elaborate why we're not returning the total budget on failure
for the wakeup case?


^ permalink raw reply

* Re: [PATCH bpf-next v4 5/8] libbpf: add support for need_wakeup flag in AF_XDP part
From: Jonathan Lemon @ 2019-08-14 14:49 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, daniel, netdev, brouer, maximmi, bpf,
	bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, sridhar.samudrala, kevin.laatz, ilias.apalodimas,
	kiran.patil, axboe, maciej.fijalkowski, maciejromanfijalkowski,
	intel-wired-lan
In-Reply-To: <1565767643-4908-6-git-send-email-magnus.karlsson@intel.com>



On 14 Aug 2019, at 0:27, Magnus Karlsson wrote:

> This commit adds support for the new need_wakeup flag in AF_XDP. The
> xsk_socket__create function is updated to handle this and a new
> function is introduced called xsk_ring_prod__needs_wakeup(). This
> function can be used by the application to check if Rx and/or Tx
> processing needs to be explicitly woken up.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v4 6/8] samples/bpf: add use of need_wakeup flag in xdpsock
From: Jonathan Lemon @ 2019-08-14 14:53 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, daniel, netdev, brouer, maximmi, bpf,
	bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, sridhar.samudrala, kevin.laatz, ilias.apalodimas,
	kiran.patil, axboe, maciej.fijalkowski, maciejromanfijalkowski,
	intel-wired-lan
In-Reply-To: <1565767643-4908-7-git-send-email-magnus.karlsson@intel.com>



On 14 Aug 2019, at 0:27, Magnus Karlsson wrote:

> This commit adds using the need_wakeup flag to the xdpsock sample
> application. It is turned on by default as we think it is a feature
> that seems to always produce a performance benefit, if the application
> has been written taking advantage of it. It can be turned off in the
> sample app by using the '-m' command line option.
>
> The txpush and l2fwd sub applications have also been updated to
> support poll() with multiple sockets.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>


> ---
>  samples/bpf/xdpsock_user.c | 192 
> ++++++++++++++++++++++++++++-----------------
>  1 file changed, 120 insertions(+), 72 deletions(-)
>
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index 93eaaf7..da84c76 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -67,8 +67,10 @@ static int opt_ifindex;
>  static int opt_queue;
>  static int opt_poll;
>  static int opt_interval = 1;
> -static u32 opt_xdp_bind_flags;
> +static u32 opt_xdp_bind_flags = XDP_USE_NEED_WAKEUP;
>  static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
> +static int opt_timeout = 1000;
> +static bool opt_need_wakeup = true;
>  static __u32 prog_id;
>
>  struct xsk_umem_info {
> @@ -352,6 +354,7 @@ static struct option long_options[] = {
>  	{"zero-copy", no_argument, 0, 'z'},
>  	{"copy", no_argument, 0, 'c'},
>  	{"frame-size", required_argument, 0, 'f'},
> +	{"no-need-wakeup", no_argument, 0, 'm'},
>  	{0, 0, 0, 0}
>  };
>
> @@ -372,6 +375,7 @@ static void usage(const char *prog)
>  		"  -z, --zero-copy      Force zero-copy mode.\n"
>  		"  -c, --copy           Force copy mode.\n"
>  		"  -f, --frame-size=n   Set the frame size (must be a power of two, 
> default is %d).\n"
> +		"  -m, --no-need-wakeup Turn off use of driver need wakeup flag.\n"
>  		"\n";
>  	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
>  	exit(EXIT_FAILURE);
> @@ -384,8 +388,9 @@ static void parse_command_line(int argc, char 
> **argv)
>  	opterr = 0;
>
>  	for (;;) {
> -		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:", long_options,
> -				&option_index);
> +
> +		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:m",
> +				long_options, &option_index);
>  		if (c == -1)
>  			break;
>
> @@ -429,6 +434,9 @@ static void parse_command_line(int argc, char 
> **argv)
>  			break;
>  		case 'f':
>  			opt_xsk_frame_size = atoi(optarg);
> +		case 'm':
> +			opt_need_wakeup = false;
> +			opt_xdp_bind_flags &= ~XDP_USE_NEED_WAKEUP;
>  			break;
>  		default:
>  			usage(basename(argv[0]));
> @@ -459,7 +467,8 @@ static void kick_tx(struct xsk_socket_info *xsk)
>  	exit_with_error(errno);
>  }
>
> -static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> +static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
> +				     struct pollfd *fds)
>  {
>  	u32 idx_cq = 0, idx_fq = 0;
>  	unsigned int rcvd;
> @@ -468,7 +477,9 @@ static inline void complete_tx_l2fwd(struct 
> xsk_socket_info *xsk)
>  	if (!xsk->outstanding_tx)
>  		return;
>
> -	kick_tx(xsk);
> +	if (!opt_need_wakeup || xsk_ring_prod__needs_wakeup(&xsk->tx))
> +		kick_tx(xsk);
> +
>  	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
>  		xsk->outstanding_tx;
>
> @@ -482,6 +493,8 @@ static inline void complete_tx_l2fwd(struct 
> xsk_socket_info *xsk)
>  		while (ret != rcvd) {
>  			if (ret < 0)
>  				exit_with_error(-ret);
> +			if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
> +				ret = poll(fds, num_socks, opt_timeout);
>  			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
>  						     &idx_fq);
>  		}
> @@ -505,7 +518,8 @@ static inline void complete_tx_only(struct 
> xsk_socket_info *xsk)
>  	if (!xsk->outstanding_tx)
>  		return;
>
> -	kick_tx(xsk);
> +	if (!opt_need_wakeup || xsk_ring_prod__needs_wakeup(&xsk->tx))
> +		kick_tx(xsk);
>
>  	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx);
>  	if (rcvd > 0) {
> @@ -515,20 +529,25 @@ static inline void complete_tx_only(struct 
> xsk_socket_info *xsk)
>  	}
>  }
>
> -static void rx_drop(struct xsk_socket_info *xsk)
> +static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
>  {
>  	unsigned int rcvd, i;
>  	u32 idx_rx = 0, idx_fq = 0;
>  	int ret;
>
>  	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> -	if (!rcvd)
> +	if (!rcvd) {
> +		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
> +			ret = poll(fds, num_socks, opt_timeout);
>  		return;
> +	}
>
>  	ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
>  	while (ret != rcvd) {
>  		if (ret < 0)
>  			exit_with_error(-ret);
> +		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
> +			ret = poll(fds, num_socks, opt_timeout);
>  		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
>  	}

I'll just point out that it seems a little odd that if one ring
needs a wakeup, all the rings get polled again.  However, I suppose
that it does amortize costs of a kernel call.
-- 
Jonathan


> @@ -549,42 +568,65 @@ static void rx_drop(struct xsk_socket_info *xsk)
>  static void rx_drop_all(void)
>  {
>  	struct pollfd fds[MAX_SOCKS + 1];
> -	int i, ret, timeout, nfds = 1;
> +	int i, ret;
>
>  	memset(fds, 0, sizeof(fds));
>
>  	for (i = 0; i < num_socks; i++) {
>  		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
>  		fds[i].events = POLLIN;
> -		timeout = 1000; /* 1sn */
>  	}
>
>  	for (;;) {
>  		if (opt_poll) {
> -			ret = poll(fds, nfds, timeout);
> +			ret = poll(fds, num_socks, opt_timeout);
>  			if (ret <= 0)
>  				continue;
>  		}
>
>  		for (i = 0; i < num_socks; i++)
> -			rx_drop(xsks[i]);
> +			rx_drop(xsks[i], fds);
> +	}
> +}
> +
> +static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb)
> +{
> +	u32 idx;
> +
> +	if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) == 
> BATCH_SIZE) {
> +		unsigned int i;
> +
> +		for (i = 0; i < BATCH_SIZE; i++) {
> +			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr	=
> +				(frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT;
> +			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
> +				sizeof(pkt_data) - 1;
> +		}
> +
> +		xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
> +		xsk->outstanding_tx += BATCH_SIZE;
> +		frame_nb += BATCH_SIZE;
> +		frame_nb %= NUM_FRAMES;
>  	}
> +
> +	complete_tx_only(xsk);
>  }
>
> -static void tx_only(struct xsk_socket_info *xsk)
> +static void tx_only_all(void)
>  {
> -	int timeout, ret, nfds = 1;
> -	struct pollfd fds[nfds + 1];
> -	u32 idx, frame_nb = 0;
> +	struct pollfd fds[MAX_SOCKS];
> +	u32 frame_nb[MAX_SOCKS] = {};
> +	int i, ret;
>
>  	memset(fds, 0, sizeof(fds));
> -	fds[0].fd = xsk_socket__fd(xsk->xsk);
> -	fds[0].events = POLLOUT;
> -	timeout = 1000; /* 1sn */
> +	for (i = 0; i < num_socks; i++) {
> +		fds[0].fd = xsk_socket__fd(xsks[i]->xsk);
> +		fds[0].events = POLLOUT;
> +	}
>
>  	for (;;) {
>  		if (opt_poll) {
> -			ret = poll(fds, nfds, timeout);
> +			ret = poll(fds, num_socks, opt_timeout);
>  			if (ret <= 0)
>  				continue;
>
> @@ -592,69 +634,75 @@ static void tx_only(struct xsk_socket_info *xsk)
>  				continue;
>  		}
>
> -		if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
> -		    BATCH_SIZE) {
> -			unsigned int i;
> -
> -			for (i = 0; i < BATCH_SIZE; i++) {
> -				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
> -					= (frame_nb + i) * opt_xsk_frame_size;
> -				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
> -					sizeof(pkt_data) - 1;
> -			}
> -
> -			xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
> -			xsk->outstanding_tx += BATCH_SIZE;
> -			frame_nb += BATCH_SIZE;
> -			frame_nb %= NUM_FRAMES;
> -		}
> -
> -		complete_tx_only(xsk);
> +		for (i = 0; i < num_socks; i++)
> +			tx_only(xsks[i], frame_nb[i]);
>  	}
>  }
>
> -static void l2fwd(struct xsk_socket_info *xsk)
> +static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
>  {
> -	for (;;) {
> -		unsigned int rcvd, i;
> -		u32 idx_rx = 0, idx_tx = 0;
> -		int ret;
> +	unsigned int rcvd, i;
> +	u32 idx_rx = 0, idx_tx = 0;
> +	int ret;
>
> -		for (;;) {
> -			complete_tx_l2fwd(xsk);
> +	complete_tx_l2fwd(xsk, fds);
>
> -			rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
> -						   &idx_rx);
> -			if (rcvd > 0)
> -				break;
> -		}
> +	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> +	if (!rcvd) {
> +		if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
> +			ret = poll(fds, num_socks, opt_timeout);
> +		return;
> +	}
>
> +	ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
> +	while (ret != rcvd) {
> +		if (ret < 0)
> +			exit_with_error(-ret);
> +		if (xsk_ring_prod__needs_wakeup(&xsk->tx))
> +			kick_tx(xsk);
>  		ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
> -		while (ret != rcvd) {
> -			if (ret < 0)
> -				exit_with_error(-ret);
> -			ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
> -		}
> +	}
> +
> +	for (i = 0; i < rcvd; i++) {
> +		u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
> +		u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
> +		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +
> +		swap_mac_addresses(pkt);
>
> -		for (i = 0; i < rcvd; i++) {
> -			u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
> -							  idx_rx)->addr;
> -			u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
> -							 idx_rx++)->len;
> -			char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +		hex_dump(pkt, len, addr);
> +		xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
> +		xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
> +	}
>
> -			swap_mac_addresses(pkt);
> +	xsk_ring_prod__submit(&xsk->tx, rcvd);
> +	xsk_ring_cons__release(&xsk->rx, rcvd);
>
> -			hex_dump(pkt, len, addr);
> -			xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
> -			xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
> -		}
> +	xsk->rx_npkts += rcvd;
> +	xsk->outstanding_tx += rcvd;
> +}
> +
> +static void l2fwd_all(void)
> +{
> +	struct pollfd fds[MAX_SOCKS];
> +	int i, ret;
> +
> +	memset(fds, 0, sizeof(fds));
> +
> +	for (i = 0; i < num_socks; i++) {
> +		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> +		fds[i].events = POLLOUT | POLLIN;
> +	}
>
> -		xsk_ring_prod__submit(&xsk->tx, rcvd);
> -		xsk_ring_cons__release(&xsk->rx, rcvd);
> +	for (;;) {
> +		if (opt_poll) {
> +			ret = poll(fds, num_socks, opt_timeout);
> +			if (ret <= 0)
> +				continue;
> +		}
>
> -		xsk->rx_npkts += rcvd;
> -		xsk->outstanding_tx += rcvd;
> +		for (i = 0; i < num_socks; i++)
> +			l2fwd(xsks[i], fds);
>  	}
>  }
>
> @@ -705,9 +753,9 @@ int main(int argc, char **argv)
>  	if (opt_bench == BENCH_RXDROP)
>  		rx_drop_all();
>  	else if (opt_bench == BENCH_TXONLY)
> -		tx_only(xsks[0]);
> +		tx_only_all();
>  	else
> -		l2fwd(xsks[0]);
> +		l2fwd_all();
>
>  	return 0;
>  }
> -- 
> 2.7.4

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox