Netdev List
 help / color / mirror / Atom feed
* RE: linux-next: Tree for Aug 29 (mlx5)
From: Haiyang Zhang @ 2019-08-29 21:48 UTC (permalink / raw)
  To: Saeed Mahameed, sfr@canb.auug.org.au, Eran Ben Elisha,
	linux-next@vger.kernel.org, rdunlap@infradead.org
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Leon Romanovsky
In-Reply-To: <c92d20e27268f515e0d4c8a28f92c0da041c2acc.camel@mellanox.com>



> -----Original Message-----
> From: Saeed Mahameed <saeedm@mellanox.com>
> Sent: Thursday, August 29, 2019 2:32 PM
> To: sfr@canb.auug.org.au; Eran Ben Elisha <eranbe@mellanox.com>; linux-
> next@vger.kernel.org; rdunlap@infradead.org; Haiyang Zhang
> <haiyangz@microsoft.com>
> Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Leon
> Romanovsky <leonro@mellanox.com>
> Subject: Re: linux-next: Tree for Aug 29 (mlx5)
> 
> On Thu, 2019-08-29 at 12:55 -0700, Randy Dunlap wrote:
> > On 8/29/19 12:54 PM, Randy Dunlap wrote:
> > > On 8/29/19 4:08 AM, Stephen Rothwell wrote:
> > > > Hi all,
> > > >
> > > > Changes since 20190828:
> > > >
> > >
> > > on x86_64:
> > > when CONFIG_PCI_HYPERV=m
> >
> > and CONFIG_PCI_HYPERV_INTERFACE=m
> >
> 
> Haiyang and Eran, I think CONFIG_PCI_HYPERV_INTERFACE was never
> supposed to be a module ? it supposed to provide an always available
> interface to drivers ..
> 
> Anyway, maybe we need to imply CONFIG_PCI_HYPERV_INTERFACE in mlx5.

The symbolic dependency by driver mlx5e,  automatically triggers loading of
pci_hyperv_interface module. And this module can be loaded in any platforms.

Currently, mlx5e driver has #if IS_ENABLED(CONFIG_PCI_HYPERV_INTERFACE)
around the code using the interface.

I agree --
Adding "select PCI_HYPERV_INTERFACE" for mlx5e will clean up these #if's.

Thanks,
- Haiyang

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] nfp: bpf: add simple map op cache
From: Jakub Kicinski @ 2019-08-29 21:36 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, oss-drivers,
	jaco.gericke, Quentin Monnet
In-Reply-To: <CAPhsuW5ExXPXYi5D2MND5JREh8EKNHUvSNoBEJ7L3-XK3GD9mA@mail.gmail.com>

On Thu, 29 Aug 2019 14:29:44 -0700, Song Liu wrote:
> On Tue, Aug 27, 2019 at 10:40 PM Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
> >
> > Each get_next and lookup call requires a round trip to the device.
> > However, the device is capable of giving us a few entries back,
> > instead of just one.
> >
> > In this patch we ask for a small yet reasonable number of entries
> > (4) on every get_next call, and on subsequent get_next/lookup calls
> > check this little cache for a hit. The cache is only kept for 250us,
> > and is invalidated on every operation which may modify the map
> > (e.g. delete or update call). Note that operations may be performed
> > simultaneously, so we have to keep track of operations in flight.
> >
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> > ---
> >  drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 179 +++++++++++++++++-
> >  drivers/net/ethernet/netronome/nfp/bpf/fw.h   |   1 +
> >  drivers/net/ethernet/netronome/nfp/bpf/main.c |  18 ++
> >  drivers/net/ethernet/netronome/nfp/bpf/main.h |  23 +++
> >  .../net/ethernet/netronome/nfp/bpf/offload.c  |   3 +
> >  5 files changed, 215 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
> > index fcf880c82f3f..0e2db6ea79e9 100644
> > --- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
> > +++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/bug.h>
> >  #include <linux/jiffies.h>
> >  #include <linux/skbuff.h>
> > +#include <linux/timekeeping.h>
> >
> >  #include "../ccm.h"
> >  #include "../nfp_app.h"
> > @@ -175,29 +176,151 @@ nfp_bpf_ctrl_reply_val(struct nfp_app_bpf *bpf, struct cmsg_reply_map_op *reply,
> >         return &reply->data[bpf->cmsg_key_sz * (n + 1) + bpf->cmsg_val_sz * n];
> >  }
> >
> > +static bool nfp_bpf_ctrl_op_cache_invalidate(enum nfp_ccm_type op)
> > +{
> > +       return op == NFP_CCM_TYPE_BPF_MAP_UPDATE ||
> > +              op == NFP_CCM_TYPE_BPF_MAP_DELETE;
> > +}
> > +
> > +static bool nfp_bpf_ctrl_op_cache_capable(enum nfp_ccm_type op)
> > +{
> > +       return op == NFP_CCM_TYPE_BPF_MAP_LOOKUP ||
> > +              op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
> > +}
> > +
> > +static bool nfp_bpf_ctrl_op_cache_fill(enum nfp_ccm_type op)
> > +{
> > +       return op == NFP_CCM_TYPE_BPF_MAP_GETFIRST ||
> > +              op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
> > +}
> > +
> > +static unsigned int
> > +nfp_bpf_ctrl_op_cache_get(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
> > +                         const u8 *key, u8 *out_key, u8 *out_value,
> > +                         u32 *cache_gen)
> > +{
> > +       struct bpf_map *map = &nfp_map->offmap->map;
> > +       struct nfp_app_bpf *bpf = nfp_map->bpf;
> > +       unsigned int i, count, n_entries;
> > +       struct cmsg_reply_map_op *reply;
> > +
> > +       n_entries = nfp_bpf_ctrl_op_cache_fill(op) ? bpf->cmsg_cache_cnt : 1;
> > +
> > +       spin_lock(&nfp_map->cache_lock);
> > +       *cache_gen = nfp_map->cache_gen;
> > +       if (nfp_map->cache_blockers)
> > +               n_entries = 1;
> > +
> > +       if (nfp_bpf_ctrl_op_cache_invalidate(op))
> > +               goto exit_block;
> > +       if (!nfp_bpf_ctrl_op_cache_capable(op))
> > +               goto exit_unlock;
> > +
> > +       if (!nfp_map->cache)
> > +               goto exit_unlock;
> > +       if (nfp_map->cache_to < ktime_get_ns())
> > +               goto exit_invalidate;
> > +
> > +       reply = (void *)nfp_map->cache->data;
> > +       count = be32_to_cpu(reply->count);  
> 
> Do we need to check whether count is too big (from firmware bug)?

It's validated below, when the skb is received (see my "here" below)

> > +
> > +       for (i = 0; i < count; i++) {
> > +               void *cached_key;
> > +
> > +               cached_key = nfp_bpf_ctrl_reply_key(bpf, reply, i);
> > +               if (memcmp(cached_key, key, map->key_size))
> > +                       continue;
> > +
> > +               if (op == NFP_CCM_TYPE_BPF_MAP_LOOKUP)
> > +                       memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, i),
> > +                              map->value_size);
> > +               if (op == NFP_CCM_TYPE_BPF_MAP_GETNEXT) {
> > +                       if (i + 1 == count)
> > +                               break;
> > +
> > +                       memcpy(out_key,
> > +                              nfp_bpf_ctrl_reply_key(bpf, reply, i + 1),
> > +                              map->key_size);
> > +               }
> > +
> > +               n_entries = 0;
> > +               goto exit_unlock;
> > +       }
> > +       goto exit_unlock;
> > +
> > +exit_block:
> > +       nfp_map->cache_blockers++;
> > +exit_invalidate:
> > +       dev_consume_skb_any(nfp_map->cache);
> > +       nfp_map->cache = NULL;
> > +exit_unlock:
> > +       spin_unlock(&nfp_map->cache_lock);
> > +       return n_entries;
> > +}

> >  static int
> >  nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
> >                       u8 *key, u8 *value, u64 flags, u8 *out_key, u8 *out_value)
> >  {
> >         struct nfp_bpf_map *nfp_map = offmap->dev_priv;
> > +       unsigned int n_entries, reply_entries, count;
> >         struct nfp_app_bpf *bpf = nfp_map->bpf;
> >         struct bpf_map *map = &offmap->map;
> >         struct cmsg_reply_map_op *reply;
> >         struct cmsg_req_map_op *req;
> >         struct sk_buff *skb;
> > +       u32 cache_gen;
> >         int err;
> >
> >         /* FW messages have no space for more than 32 bits of flags */
> >         if (flags >> 32)
> >                 return -EOPNOTSUPP;
> >
> > +       /* Handle op cache */
> > +       n_entries = nfp_bpf_ctrl_op_cache_get(nfp_map, op, key, out_key,
> > +                                             out_value, &cache_gen);
> > +       if (!n_entries)
> > +               return 0;
> > +
> >         skb = nfp_bpf_cmsg_map_req_alloc(bpf, 1);
> > -       if (!skb)
> > -               return -ENOMEM;
> > +       if (!skb) {
> > +               err = -ENOMEM;
> > +               goto err_cache_put;
> > +       }
> >
> >         req = (void *)skb->data;
> >         req->tid = cpu_to_be32(nfp_map->tid);
> > -       req->count = cpu_to_be32(1);
> > +       req->count = cpu_to_be32(n_entries);
> >         req->flags = cpu_to_be32(flags);
> >
> >         /* Copy inputs */
> > @@ -207,16 +330,38 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
> >                 memcpy(nfp_bpf_ctrl_req_val(bpf, req, 0), value,
> >                        map->value_size);
> >
> > -       skb = nfp_ccm_communicate(&bpf->ccm, skb, op,
> > -                                 nfp_bpf_cmsg_map_reply_size(bpf, 1));
> > -       if (IS_ERR(skb))
> > -               return PTR_ERR(skb);
> > +       skb = nfp_ccm_communicate(&bpf->ccm, skb, op, 0);
> > +       if (IS_ERR(skb)) {
> > +               err = PTR_ERR(skb);
> > +               goto err_cache_put;
> > +       }
> > +
> > +       if (skb->len < sizeof(*reply)) {
> > +               cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d!\n",
> > +                         op, skb->len);
> > +               err = -EIO;
> > +               goto err_free;
> > +       }
> >
> >         reply = (void *)skb->data;
> > +       count = be32_to_cpu(reply->count);
> >         err = nfp_bpf_ctrl_rc_to_errno(bpf, &reply->reply_hdr);
> > +       /* FW responds with message sized to hold the good entries,
> > +        * plus one extra entry if there was an error.
> > +        */
> > +       reply_entries = count + !!err;
> > +       if (n_entries > 1 && count)
> > +               err = 0;
> >         if (err)
> >                 goto err_free;
> >
> > +       if (skb->len != nfp_bpf_cmsg_map_reply_size(bpf, reply_entries)) {

here, reply_entries is derived directly from reply->count

> > +               cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d for %d entries!\n",
> > +                         op, skb->len, reply_entries);
> > +               err = -EIO;
> > +               goto err_free;
> > +       }
> > +
> >         /* Copy outputs */
> >         if (out_key)
> >                 memcpy(out_key, nfp_bpf_ctrl_reply_key(bpf, reply, 0),
> > @@ -225,11 +370,13 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
> >                 memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, 0),
> >                        map->value_size);
> >
> > -       dev_consume_skb_any(skb);
> > +       nfp_bpf_ctrl_op_cache_put(nfp_map, op, skb, cache_gen);
> >
> >         return 0;
> >  err_free:
> >         dev_kfree_skb_any(skb);
> > +err_cache_put:
> > +       nfp_bpf_ctrl_op_cache_put(nfp_map, op, NULL, cache_gen);
> >         return err;
> >  }
> >
> > @@ -275,7 +422,21 @@ unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf)
> >
> >  unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
> >  {
> > -       return max(NFP_NET_DEFAULT_MTU, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
> > +       return max3(NFP_NET_DEFAULT_MTU,
> > +                   nfp_bpf_cmsg_map_req_size(bpf, NFP_BPF_MAP_CACHE_CNT),
> > +                   nfp_bpf_cmsg_map_reply_size(bpf, NFP_BPF_MAP_CACHE_CNT));
> > +}
> > +
> > +unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf)
> > +{
> > +       unsigned int mtu, req_max, reply_max, entry_sz;
> > +
> > +       mtu = bpf->app->ctrl->dp.mtu;
> > +       entry_sz = bpf->cmsg_key_sz + bpf->cmsg_val_sz;
> > +       req_max = (mtu - sizeof(struct cmsg_req_map_op)) / entry_sz;
> > +       reply_max = (mtu - sizeof(struct cmsg_reply_map_op)) / entry_sz;
> > +
> > +       return min3(req_max, reply_max, NFP_BPF_MAP_CACHE_CNT);
> >  }
> >
> >  void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
> > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
> > index 06c4286bd79e..a83a0ad5e27d 100644
> > --- a/drivers/net/ethernet/netronome/nfp/bpf/fw.h
> > +++ b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
> > @@ -24,6 +24,7 @@ enum bpf_cap_tlv_type {
> >         NFP_BPF_CAP_TYPE_QUEUE_SELECT   = 5,
> >         NFP_BPF_CAP_TYPE_ADJUST_TAIL    = 6,
> >         NFP_BPF_CAP_TYPE_ABI_VERSION    = 7,
> > +       NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT = 8,
> >  };
> >
> >  struct nfp_bpf_cap_tlv_func {
> > diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
> > index 2b1773ed3de9..8f732771d3fa 100644
> > --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
> > +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
> > @@ -299,6 +299,14 @@ nfp_bpf_parse_cap_adjust_tail(struct nfp_app_bpf *bpf, void __iomem *value,
> >         return 0;
> >  }
> >
> > +static int
> > +nfp_bpf_parse_cap_cmsg_multi_ent(struct nfp_app_bpf *bpf, void __iomem *value,
> > +                                u32 length)
> > +{
> > +       bpf->cmsg_multi_ent = true;
> > +       return 0;
> > +}
> > +
> >  static int
> >  nfp_bpf_parse_cap_abi_version(struct nfp_app_bpf *bpf, void __iomem *value,
> >                               u32 length)
> > @@ -375,6 +383,11 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
> >                                                           length))
> >                                 goto err_release_free;
> >                         break;
> > +               case NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT:
> > +                       if (nfp_bpf_parse_cap_cmsg_multi_ent(app->priv, value,
> > +                                                            length))  
> 
> Do we plan to extend nfp_bpf_parse_cap_cmsg_multi_ent() to return
> non-zero in the
> future?

Yes, the TLV format allows for the entry to be extended and then
parsing may fail. It's mostly a pattern the BPF TLV parsing follows,
though.

^ permalink raw reply

* Re: linux-next: Tree for Aug 29 (mlx5)
From: Saeed Mahameed @ 2019-08-29 21:31 UTC (permalink / raw)
  To: sfr@canb.auug.org.au, Eran Ben Elisha, linux-next@vger.kernel.org,
	rdunlap@infradead.org, haiyangz
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Leon Romanovsky
In-Reply-To: <52bcddef-fcf2-8de5-d15a-9e7ee2d5b14d@infradead.org>

On Thu, 2019-08-29 at 12:55 -0700, Randy Dunlap wrote:
> On 8/29/19 12:54 PM, Randy Dunlap wrote:
> > On 8/29/19 4:08 AM, Stephen Rothwell wrote:
> > > Hi all,
> > > 
> > > Changes since 20190828:
> > > 
> > 
> > on x86_64:
> > when CONFIG_PCI_HYPERV=m
> 
> and CONFIG_PCI_HYPERV_INTERFACE=m
> 

Haiyang and Eran, I think CONFIG_PCI_HYPERV_INTERFACE was never
supposed to be a module ? it supposed to provide an always available 
interface to drivers .. 

Anyway, maybe we need to imply CONFIG_PCI_HYPERV_INTERFACE in mlx5.

Thanks,
Saeed.
 
> > and mxlx5 is builtin (=y).
> > 
> > ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function
> > `mlx5_unload':
> > main.c:(.text+0x5d): undefined reference to `mlx5_hv_vhca_cleanup'
> > ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function
> > `mlx5_cleanup_once':
> > main.c:(.text+0x158): undefined reference to `mlx5_hv_vhca_destroy'
> > ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function
> > `mlx5_load_one':
> > main.c:(.text+0x4191): undefined reference to `mlx5_hv_vhca_create'
> > ld: main.c:(.text+0x4772): undefined reference to
> > `mlx5_hv_vhca_init'
> > ld: main.c:(.text+0x4b07): undefined reference to
> > `mlx5_hv_vhca_cleanup'
> > 
> > 
> 
> 

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] nfp: bpf: rework MTU checking
From: Song Liu @ 2019-08-29 21:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, oss-drivers,
	jaco.gericke, Quentin Monnet
In-Reply-To: <20190828053629.28658-2-jakub.kicinski@netronome.com>

On Tue, Aug 27, 2019 at 10:38 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> If control channel MTU is too low to support map operations a warning
> will be printed. This is not enough, we want to make sure probe fails
> in such scenario, as this would clearly be a faulty configuration.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] nfp: bpf: add simple map op cache
From: Song Liu @ 2019-08-29 21:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, oss-drivers,
	jaco.gericke, Quentin Monnet
In-Reply-To: <20190828053629.28658-3-jakub.kicinski@netronome.com>

On Tue, Aug 27, 2019 at 10:40 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> Each get_next and lookup call requires a round trip to the device.
> However, the device is capable of giving us a few entries back,
> instead of just one.
>
> In this patch we ask for a small yet reasonable number of entries
> (4) on every get_next call, and on subsequent get_next/lookup calls
> check this little cache for a hit. The cache is only kept for 250us,
> and is invalidated on every operation which may modify the map
> (e.g. delete or update call). Note that operations may be performed
> simultaneously, so we have to keep track of operations in flight.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 179 +++++++++++++++++-
>  drivers/net/ethernet/netronome/nfp/bpf/fw.h   |   1 +
>  drivers/net/ethernet/netronome/nfp/bpf/main.c |  18 ++
>  drivers/net/ethernet/netronome/nfp/bpf/main.h |  23 +++
>  .../net/ethernet/netronome/nfp/bpf/offload.c  |   3 +
>  5 files changed, 215 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
> index fcf880c82f3f..0e2db6ea79e9 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
> @@ -6,6 +6,7 @@
>  #include <linux/bug.h>
>  #include <linux/jiffies.h>
>  #include <linux/skbuff.h>
> +#include <linux/timekeeping.h>
>
>  #include "../ccm.h"
>  #include "../nfp_app.h"
> @@ -175,29 +176,151 @@ nfp_bpf_ctrl_reply_val(struct nfp_app_bpf *bpf, struct cmsg_reply_map_op *reply,
>         return &reply->data[bpf->cmsg_key_sz * (n + 1) + bpf->cmsg_val_sz * n];
>  }
>
> +static bool nfp_bpf_ctrl_op_cache_invalidate(enum nfp_ccm_type op)
> +{
> +       return op == NFP_CCM_TYPE_BPF_MAP_UPDATE ||
> +              op == NFP_CCM_TYPE_BPF_MAP_DELETE;
> +}
> +
> +static bool nfp_bpf_ctrl_op_cache_capable(enum nfp_ccm_type op)
> +{
> +       return op == NFP_CCM_TYPE_BPF_MAP_LOOKUP ||
> +              op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
> +}
> +
> +static bool nfp_bpf_ctrl_op_cache_fill(enum nfp_ccm_type op)
> +{
> +       return op == NFP_CCM_TYPE_BPF_MAP_GETFIRST ||
> +              op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
> +}
> +
> +static unsigned int
> +nfp_bpf_ctrl_op_cache_get(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
> +                         const u8 *key, u8 *out_key, u8 *out_value,
> +                         u32 *cache_gen)
> +{
> +       struct bpf_map *map = &nfp_map->offmap->map;
> +       struct nfp_app_bpf *bpf = nfp_map->bpf;
> +       unsigned int i, count, n_entries;
> +       struct cmsg_reply_map_op *reply;
> +
> +       n_entries = nfp_bpf_ctrl_op_cache_fill(op) ? bpf->cmsg_cache_cnt : 1;
> +
> +       spin_lock(&nfp_map->cache_lock);
> +       *cache_gen = nfp_map->cache_gen;
> +       if (nfp_map->cache_blockers)
> +               n_entries = 1;
> +
> +       if (nfp_bpf_ctrl_op_cache_invalidate(op))
> +               goto exit_block;
> +       if (!nfp_bpf_ctrl_op_cache_capable(op))
> +               goto exit_unlock;
> +
> +       if (!nfp_map->cache)
> +               goto exit_unlock;
> +       if (nfp_map->cache_to < ktime_get_ns())
> +               goto exit_invalidate;
> +
> +       reply = (void *)nfp_map->cache->data;
> +       count = be32_to_cpu(reply->count);

Do we need to check whether count is too big (from firmware bug)?

> +
> +       for (i = 0; i < count; i++) {
> +               void *cached_key;
> +
> +               cached_key = nfp_bpf_ctrl_reply_key(bpf, reply, i);
> +               if (memcmp(cached_key, key, map->key_size))
> +                       continue;
> +
> +               if (op == NFP_CCM_TYPE_BPF_MAP_LOOKUP)
> +                       memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, i),
> +                              map->value_size);
> +               if (op == NFP_CCM_TYPE_BPF_MAP_GETNEXT) {
> +                       if (i + 1 == count)
> +                               break;
> +
> +                       memcpy(out_key,
> +                              nfp_bpf_ctrl_reply_key(bpf, reply, i + 1),
> +                              map->key_size);
> +               }
> +
> +               n_entries = 0;
> +               goto exit_unlock;
> +       }
> +       goto exit_unlock;
> +
> +exit_block:
> +       nfp_map->cache_blockers++;
> +exit_invalidate:
> +       dev_consume_skb_any(nfp_map->cache);
> +       nfp_map->cache = NULL;
> +exit_unlock:
> +       spin_unlock(&nfp_map->cache_lock);
> +       return n_entries;
> +}
> +
> +static void
> +nfp_bpf_ctrl_op_cache_put(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
> +                         struct sk_buff *skb, u32 cache_gen)
> +{
> +       bool blocker, filler;
> +
> +       blocker = nfp_bpf_ctrl_op_cache_invalidate(op);
> +       filler = nfp_bpf_ctrl_op_cache_fill(op);
> +       if (blocker || filler) {
> +               u64 to = 0;
> +
> +               if (filler)
> +                       to = ktime_get_ns() + NFP_BPF_MAP_CACHE_TIME_NS;
> +
> +               spin_lock(&nfp_map->cache_lock);
> +               if (blocker) {
> +                       nfp_map->cache_blockers--;
> +                       nfp_map->cache_gen++;
> +               }
> +               if (filler && !nfp_map->cache_blockers &&
> +                   nfp_map->cache_gen == cache_gen) {
> +                       nfp_map->cache_to = to;
> +                       swap(nfp_map->cache, skb);
> +               }
> +               spin_unlock(&nfp_map->cache_lock);
> +       }
> +
> +       dev_consume_skb_any(skb);
> +}
> +
>  static int
>  nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
>                       u8 *key, u8 *value, u64 flags, u8 *out_key, u8 *out_value)
>  {
>         struct nfp_bpf_map *nfp_map = offmap->dev_priv;
> +       unsigned int n_entries, reply_entries, count;
>         struct nfp_app_bpf *bpf = nfp_map->bpf;
>         struct bpf_map *map = &offmap->map;
>         struct cmsg_reply_map_op *reply;
>         struct cmsg_req_map_op *req;
>         struct sk_buff *skb;
> +       u32 cache_gen;
>         int err;
>
>         /* FW messages have no space for more than 32 bits of flags */
>         if (flags >> 32)
>                 return -EOPNOTSUPP;
>
> +       /* Handle op cache */
> +       n_entries = nfp_bpf_ctrl_op_cache_get(nfp_map, op, key, out_key,
> +                                             out_value, &cache_gen);
> +       if (!n_entries)
> +               return 0;
> +
>         skb = nfp_bpf_cmsg_map_req_alloc(bpf, 1);
> -       if (!skb)
> -               return -ENOMEM;
> +       if (!skb) {
> +               err = -ENOMEM;
> +               goto err_cache_put;
> +       }
>
>         req = (void *)skb->data;
>         req->tid = cpu_to_be32(nfp_map->tid);
> -       req->count = cpu_to_be32(1);
> +       req->count = cpu_to_be32(n_entries);
>         req->flags = cpu_to_be32(flags);
>
>         /* Copy inputs */
> @@ -207,16 +330,38 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
>                 memcpy(nfp_bpf_ctrl_req_val(bpf, req, 0), value,
>                        map->value_size);
>
> -       skb = nfp_ccm_communicate(&bpf->ccm, skb, op,
> -                                 nfp_bpf_cmsg_map_reply_size(bpf, 1));
> -       if (IS_ERR(skb))
> -               return PTR_ERR(skb);
> +       skb = nfp_ccm_communicate(&bpf->ccm, skb, op, 0);
> +       if (IS_ERR(skb)) {
> +               err = PTR_ERR(skb);
> +               goto err_cache_put;
> +       }
> +
> +       if (skb->len < sizeof(*reply)) {
> +               cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d!\n",
> +                         op, skb->len);
> +               err = -EIO;
> +               goto err_free;
> +       }
>
>         reply = (void *)skb->data;
> +       count = be32_to_cpu(reply->count);
>         err = nfp_bpf_ctrl_rc_to_errno(bpf, &reply->reply_hdr);
> +       /* FW responds with message sized to hold the good entries,
> +        * plus one extra entry if there was an error.
> +        */
> +       reply_entries = count + !!err;
> +       if (n_entries > 1 && count)
> +               err = 0;
>         if (err)
>                 goto err_free;
>
> +       if (skb->len != nfp_bpf_cmsg_map_reply_size(bpf, reply_entries)) {
> +               cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d for %d entries!\n",
> +                         op, skb->len, reply_entries);
> +               err = -EIO;
> +               goto err_free;
> +       }
> +
>         /* Copy outputs */
>         if (out_key)
>                 memcpy(out_key, nfp_bpf_ctrl_reply_key(bpf, reply, 0),
> @@ -225,11 +370,13 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
>                 memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, 0),
>                        map->value_size);
>
> -       dev_consume_skb_any(skb);
> +       nfp_bpf_ctrl_op_cache_put(nfp_map, op, skb, cache_gen);
>
>         return 0;
>  err_free:
>         dev_kfree_skb_any(skb);
> +err_cache_put:
> +       nfp_bpf_ctrl_op_cache_put(nfp_map, op, NULL, cache_gen);
>         return err;
>  }
>
> @@ -275,7 +422,21 @@ unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf)
>
>  unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
>  {
> -       return max(NFP_NET_DEFAULT_MTU, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
> +       return max3(NFP_NET_DEFAULT_MTU,
> +                   nfp_bpf_cmsg_map_req_size(bpf, NFP_BPF_MAP_CACHE_CNT),
> +                   nfp_bpf_cmsg_map_reply_size(bpf, NFP_BPF_MAP_CACHE_CNT));
> +}
> +
> +unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf)
> +{
> +       unsigned int mtu, req_max, reply_max, entry_sz;
> +
> +       mtu = bpf->app->ctrl->dp.mtu;
> +       entry_sz = bpf->cmsg_key_sz + bpf->cmsg_val_sz;
> +       req_max = (mtu - sizeof(struct cmsg_req_map_op)) / entry_sz;
> +       reply_max = (mtu - sizeof(struct cmsg_reply_map_op)) / entry_sz;
> +
> +       return min3(req_max, reply_max, NFP_BPF_MAP_CACHE_CNT);
>  }
>
>  void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
> index 06c4286bd79e..a83a0ad5e27d 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/fw.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
> @@ -24,6 +24,7 @@ enum bpf_cap_tlv_type {
>         NFP_BPF_CAP_TYPE_QUEUE_SELECT   = 5,
>         NFP_BPF_CAP_TYPE_ADJUST_TAIL    = 6,
>         NFP_BPF_CAP_TYPE_ABI_VERSION    = 7,
> +       NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT = 8,
>  };
>
>  struct nfp_bpf_cap_tlv_func {
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
> index 2b1773ed3de9..8f732771d3fa 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
> @@ -299,6 +299,14 @@ nfp_bpf_parse_cap_adjust_tail(struct nfp_app_bpf *bpf, void __iomem *value,
>         return 0;
>  }
>
> +static int
> +nfp_bpf_parse_cap_cmsg_multi_ent(struct nfp_app_bpf *bpf, void __iomem *value,
> +                                u32 length)
> +{
> +       bpf->cmsg_multi_ent = true;
> +       return 0;
> +}
> +
>  static int
>  nfp_bpf_parse_cap_abi_version(struct nfp_app_bpf *bpf, void __iomem *value,
>                               u32 length)
> @@ -375,6 +383,11 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
>                                                           length))
>                                 goto err_release_free;
>                         break;
> +               case NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT:
> +                       if (nfp_bpf_parse_cap_cmsg_multi_ent(app->priv, value,
> +                                                            length))

Do we plan to extend nfp_bpf_parse_cap_cmsg_multi_ent() to return
non-zero in the
future?

> +                               goto err_release_free;
> +                       break;
>                 default:
>                         nfp_dbg(cpp, "unknown BPF capability: %d\n", type);
>                         break;
> @@ -426,6 +439,11 @@ static int nfp_bpf_start(struct nfp_app *app)
>                 return -EINVAL;
>         }
>
> +       if (bpf->cmsg_multi_ent)
> +               bpf->cmsg_cache_cnt = nfp_bpf_ctrl_cmsg_cache_cnt(bpf);
> +       else
> +               bpf->cmsg_cache_cnt = 1;
> +
>         return 0;
>  }
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> index f4802036eb42..fac9c6f9e197 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
> @@ -99,6 +99,7 @@ enum pkt_vec {
>   * @maps_neutral:      hash table of offload-neutral maps (on pointer)
>   *
>   * @abi_version:       global BPF ABI version
> + * @cmsg_cache_cnt:    number of entries to read for caching
>   *
>   * @adjust_head:       adjust head capability
>   * @adjust_head.flags:         extra flags for adjust head
> @@ -124,6 +125,7 @@ enum pkt_vec {
>   * @pseudo_random:     FW initialized the pseudo-random machinery (CSRs)
>   * @queue_select:      BPF can set the RX queue ID in packet vector
>   * @adjust_tail:       BPF can simply trunc packet size for adjust tail
> + * @cmsg_multi_ent:    FW can pack multiple map entries in a single cmsg
>   */
>  struct nfp_app_bpf {
>         struct nfp_app *app;
> @@ -134,6 +136,8 @@ struct nfp_app_bpf {
>         unsigned int cmsg_key_sz;
>         unsigned int cmsg_val_sz;
>
> +       unsigned int cmsg_cache_cnt;
> +
>         struct list_head map_list;
>         unsigned int maps_in_use;
>         unsigned int map_elems_in_use;
> @@ -169,6 +173,7 @@ struct nfp_app_bpf {
>         bool pseudo_random;
>         bool queue_select;
>         bool adjust_tail;
> +       bool cmsg_multi_ent;
>  };
>
>  enum nfp_bpf_map_use {
> @@ -183,11 +188,21 @@ struct nfp_bpf_map_word {
>         unsigned char non_zero_update   :1;
>  };
>
> +#define NFP_BPF_MAP_CACHE_CNT          4U
> +#define NFP_BPF_MAP_CACHE_TIME_NS      (250 * 1000)
> +
>  /**
>   * struct nfp_bpf_map - private per-map data attached to BPF maps for offload
>   * @offmap:    pointer to the offloaded BPF map
>   * @bpf:       back pointer to bpf app private structure
>   * @tid:       table id identifying map on datapath
> + *
> + * @cache_lock:        protects @cache_blockers, @cache_to, @cache
> + * @cache_blockers:    number of ops in flight which block caching
> + * @cache_gen: counter incremented by every blocker on exit
> + * @cache_to:  time when cache will no longer be valid (ns)
> + * @cache:     skb with cached response
> + *
>   * @l:         link on the nfp_app_bpf->map_list list
>   * @use_map:   map of how the value is used (in 4B chunks)
>   */
> @@ -195,6 +210,13 @@ struct nfp_bpf_map {
>         struct bpf_offloaded_map *offmap;
>         struct nfp_app_bpf *bpf;
>         u32 tid;
> +
> +       spinlock_t cache_lock;
> +       u32 cache_blockers;
> +       u32 cache_gen;
> +       u64 cache_to;
> +       struct sk_buff *cache;
> +
>         struct list_head l;
>         struct nfp_bpf_map_word use_map[];
>  };
> @@ -566,6 +588,7 @@ void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv);
>
>  unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf);
>  unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf);
> +unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf);
>  long long int
>  nfp_bpf_ctrl_alloc_map(struct nfp_app_bpf *bpf, struct bpf_map *map);
>  void
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> index 39c9fec222b4..88fab6a82acf 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
> @@ -385,6 +385,7 @@ nfp_bpf_map_alloc(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
>         offmap->dev_priv = nfp_map;
>         nfp_map->offmap = offmap;
>         nfp_map->bpf = bpf;
> +       spin_lock_init(&nfp_map->cache_lock);
>
>         res = nfp_bpf_ctrl_alloc_map(bpf, &offmap->map);
>         if (res < 0) {
> @@ -407,6 +408,8 @@ nfp_bpf_map_free(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
>         struct nfp_bpf_map *nfp_map = offmap->dev_priv;
>
>         nfp_bpf_ctrl_free_map(bpf, nfp_map);
> +       dev_consume_skb_any(nfp_map->cache);
> +       WARN_ON_ONCE(nfp_map->cache_blockers);
>         list_del_init(&nfp_map->l);
>         bpf->map_elems_in_use -= offmap->map.max_entries;
>         bpf->maps_in_use--;
> --
> 2.21.0
>

^ permalink raw reply

* Re: [PATCH v2 net-next 02/15] MIPS: SGI-IP27: restructure ioc3 register access
From: Shannon Nelson @ 2019-08-29 21:26 UTC (permalink / raw)
  To: Thomas Bogendoerfer, Ralf Baechle, Paul Burton, James Hogan,
	David S. Miller, linux-mips, linux-kernel, netdev
In-Reply-To: <20190829155014.9229-3-tbogendoerfer@suse.de>

On 8/29/19 8:50 AM, Thomas Bogendoerfer wrote:
> Break up the big ioc3 register struct into functional pieces to
> make use in sub-function drivers more straightforward. And while
> doing that get rid of all volatile access by using readX/writeX.
>
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> ---

> diff --git a/arch/mips/sgi-ip27/ip27-console.c b/arch/mips/sgi-ip27/ip27-console.c
> index 6bdb48d41276..5886bee89d06 100644
> --- a/arch/mips/sgi-ip27/ip27-console.c
> +++ b/arch/mips/sgi-ip27/ip27-console.c
> @@ -35,6 +35,7 @@ void prom_putchar(char c)
>   {
>   	struct ioc3_uartregs *uart = console_uart();
>   
> -	while ((uart->iu_lsr & 0x20) == 0);
> -	uart->iu_thr = c;
> +	while ((readb(&uart->iu_lsr) & 0x20) == 0)
> +		;
> +	writeb(c, &uart->iu_thr);
>   }

Is it ever possible to never see your bit get set?
Instead of a tight forever spin, you might add a short delay and a retry 
limit.

I see this in several other times in the following code as well.  It 
might be interesting to see how many times through and perhaps how many 
usecs are normally spent in these loops.

Not a binding request, just a thought...

sln



^ permalink raw reply

* Re: [PATCH v3 03/11] net/mlx5e: Remove unlikely() from WARN*() condition
From: Saeed Mahameed @ 2019-08-29 21:23 UTC (permalink / raw)
  To: efremov@linux.com, linux-kernel@vger.kernel.org,
	davem@davemloft.net
  Cc: joe@perches.com, Boris Pismenny, netdev@vger.kernel.org,
	leon@kernel.org, akpm@linux-foundation.org
In-Reply-To: <20190829165025.15750-3-efremov@linux.com>

On Thu, 2019-08-29 at 19:50 +0300, Denis Efremov wrote:
> "unlikely(WARN_ON_ONCE(x))" is excessive. WARN_ON_ONCE() already uses
> unlikely() internally.
> 
> Signed-off-by: Denis Efremov <efremov@linux.com>
> Cc: Boris Pismenny <borisp@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Leon Romanovsky <leon@kernel.org>
> Cc: Joe Perches <joe@perches.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: netdev@vger.kernel.org
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git
> a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> index 7833ddef0427..e5222d17df35 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
> @@ -408,7 +408,7 @@ struct sk_buff *mlx5e_ktls_handle_tx_skb(struct
> net_device *netdev,
>  		goto out;
>  
>  	tls_ctx = tls_get_ctx(skb->sk);
> -	if (unlikely(WARN_ON_ONCE(tls_ctx->netdev != netdev)))
> +	if (WARN_ON_ONCE(tls_ctx->netdev != netdev))
>  		goto err_out;
>  
>  	priv_tx = mlx5e_get_ktls_tx_priv_ctx(tls_ctx);

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

Dave, you can take this one.

^ permalink raw reply

* Re: [PATCH v2 net-next 00/15] ioc3-eth improvements
From: Jakub Kicinski @ 2019-08-29 21:15 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Ralf Baechle, Paul Burton, James Hogan, David S. Miller,
	linux-mips, linux-kernel, netdev
In-Reply-To: <20190829155014.9229-1-tbogendoerfer@suse.de>

On Thu, 29 Aug 2019 17:49:58 +0200, Thomas Bogendoerfer wrote:
> In my patch series for splitting out the serial code from ioc3-eth
> by using a MFD device there was one big patch for ioc3-eth.c,
> which wasn't really usefull for reviews. This series contains the
> ioc3-eth changes splitted in smaller steps and few more cleanups.
> Only the conversion to MFD will be done later in a different series.
> 
> Changes in v2:
> - use net_err_ratelimited for printing various ioc3 errors
> - added missing clearing of rx buf valid flags into ioc3_alloc_rings
> - use __func__ for printing out of memory messages

Only a few more comments on patch 5, otherwise looks good!

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Alexei Starovoitov @ 2019-08-29 21:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov, luto, davem,
	peterz, rostedt, netdev, bpf, kernel-team, linux-api
In-Reply-To: <20190829222530.3c6163ac@carbon>

On Thu, Aug 29, 2019 at 10:25:30PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 29 Aug 2019 20:05:49 +0200
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> 
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> > 
> > > On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:  
> > >> Alexei Starovoitov <ast@kernel.org> writes:
> > >>   
> > >> > CAP_BPF allows the following BPF operations:
> > >> > - Loading all types of BPF programs
> > >> > - Creating all types of BPF maps except:
> > >> >    - stackmap that needs CAP_TRACING
> > >> >    - devmap that needs CAP_NET_ADMIN
> > >> >    - cpumap that needs CAP_SYS_ADMIN  
> > >> 
> > >> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?  
> > >
> > > Currently it's cap_sys_admin and I think it should stay this way
> > > because it creates kthreads.  
> > 
> > Ah, right. I can sorta see that makes sense because of the kthreads, but
> > it also means that you can use all of XDP *except* cpumap with
> > CAP_NET_ADMIN+CAP_BPF. That is bound to create confusion, isn't it?
>  
> Hmm... I see 'cpumap' primarily as a network stack feature.  It is about
> starting the network stack on a specific CPU, allocating and building
> SKBs on that remote CPU.  It can only be used together with XDP_REDIRECT.
> I would prefer CAP_NET_ADMIN like the devmap, to keep the XDP
> capabilities consistent.

I don't mind relaxing cpumap to cap_net_admin.
Looking at the reaction to the rest of the set. I'd rather discuss it
and do it later after basic cap_bpf is in.


^ permalink raw reply

* Re: [PATCH v2 net-next 05/15] net: sgi: ioc3-eth: allocate space for desc rings only once
From: Jakub Kicinski @ 2019-08-29 21:05 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Ralf Baechle, Paul Burton, James Hogan, David S. Miller,
	linux-mips, linux-kernel, netdev
In-Reply-To: <20190829155014.9229-6-tbogendoerfer@suse.de>

On Thu, 29 Aug 2019 17:50:03 +0200, Thomas Bogendoerfer wrote:
> Memory for descriptor rings are allocated/freed, when interface is
> brought up/down. Since the size of the rings is not changeable by
> hardware, we now allocate rings now during probe and free it, when
> device is removed.
> 
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> ---
>  drivers/net/ethernet/sgi/ioc3-eth.c | 103 ++++++++++++++++++------------------
>  1 file changed, 51 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
> index ba18a53fbbe6..d9d94a55ac34 100644
> --- a/drivers/net/ethernet/sgi/ioc3-eth.c
> +++ b/drivers/net/ethernet/sgi/ioc3-eth.c
> @@ -803,25 +803,17 @@ static void ioc3_free_rings(struct ioc3_private *ip)
>  	struct sk_buff *skb;
>  	int rx_entry, n_entry;
>  
> -	if (ip->txr) {
> -		ioc3_clean_tx_ring(ip);
> -		free_pages((unsigned long)ip->txr, 2);
> -		ip->txr = NULL;
> -	}
> +	ioc3_clean_tx_ring(ip);
>  
> -	if (ip->rxr) {
> -		n_entry = ip->rx_ci;
> -		rx_entry = ip->rx_pi;
> +	n_entry = ip->rx_ci;
> +	rx_entry = ip->rx_pi;
>  
> -		while (n_entry != rx_entry) {
> -			skb = ip->rx_skbs[n_entry];
> -			if (skb)
> -				dev_kfree_skb_any(skb);
> +	while (n_entry != rx_entry) {
> +		skb = ip->rx_skbs[n_entry];
> +		if (skb)
> +			dev_kfree_skb_any(skb);

I think dev_kfree_skb_any() accepts NULL

>  
> -			n_entry = (n_entry + 1) & RX_RING_MASK;
> -		}
> -		free_page((unsigned long)ip->rxr);
> -		ip->rxr = NULL;
> +		n_entry = (n_entry + 1) & RX_RING_MASK;
>  	}
>  }
>  
> @@ -829,49 +821,34 @@ static void ioc3_alloc_rings(struct net_device *dev)
>  {
>  	struct ioc3_private *ip = netdev_priv(dev);
>  	struct ioc3_erxbuf *rxb;
> -	unsigned long *rxr;
>  	int i;
>  
> -	if (!ip->rxr) {
> -		/* Allocate and initialize rx ring.  4kb = 512 entries  */
> -		ip->rxr = (unsigned long *)get_zeroed_page(GFP_ATOMIC);
> -		rxr = ip->rxr;
> -		if (!rxr)
> -			pr_err("%s: get_zeroed_page() failed!\n", __func__);
> -
> -		/* Now the rx buffers.  The RX ring may be larger but
> -		 * we only allocate 16 buffers for now.  Need to tune
> -		 * this for performance and memory later.
> -		 */
> -		for (i = 0; i < RX_BUFFS; i++) {
> -			struct sk_buff *skb;
> +	/* Now the rx buffers.  The RX ring may be larger but
> +	 * we only allocate 16 buffers for now.  Need to tune
> +	 * this for performance and memory later.
> +	 */
> +	for (i = 0; i < RX_BUFFS; i++) {
> +		struct sk_buff *skb;
>  
> -			skb = ioc3_alloc_skb(RX_BUF_ALLOC_SIZE, GFP_ATOMIC);
> -			if (!skb) {
> -				show_free_areas(0, NULL);
> -				continue;
> -			}
> +		skb = ioc3_alloc_skb(RX_BUF_ALLOC_SIZE, GFP_ATOMIC);
> +		if (!skb) {
> +			show_free_areas(0, NULL);
> +			continue;
> +		}
>  
> -			ip->rx_skbs[i] = skb;
> +		ip->rx_skbs[i] = skb;
>  
> -			/* Because we reserve afterwards. */
> -			skb_put(skb, (1664 + RX_OFFSET));
> -			rxb = (struct ioc3_erxbuf *)skb->data;
> -			rxr[i] = cpu_to_be64(ioc3_map(rxb, 1));
> -			skb_reserve(skb, RX_OFFSET);
> -		}
> -		ip->rx_ci = 0;
> -		ip->rx_pi = RX_BUFFS;
> +		/* Because we reserve afterwards. */
> +		skb_put(skb, (1664 + RX_OFFSET));
> +		rxb = (struct ioc3_erxbuf *)skb->data;
> +		ip->rxr[i] = cpu_to_be64(ioc3_map(rxb, 1));
> +		skb_reserve(skb, RX_OFFSET);
>  	}
> +	ip->rx_ci = 0;
> +	ip->rx_pi = RX_BUFFS;
>  
> -	if (!ip->txr) {
> -		/* Allocate and initialize tx rings.  16kb = 128 bufs.  */
> -		ip->txr = (struct ioc3_etxd *)__get_free_pages(GFP_KERNEL, 2);
> -		if (!ip->txr)
> -			pr_err("%s: __get_free_pages() failed!\n", __func__);
> -		ip->tx_pi = 0;
> -		ip->tx_ci = 0;
> -	}
> +	ip->tx_pi = 0;
> +	ip->tx_ci = 0;
>  }
>  
>  static void ioc3_init_rings(struct net_device *dev)
> @@ -1239,6 +1216,23 @@ static int ioc3_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	timer_setup(&ip->ioc3_timer, ioc3_timer, 0);
>  
>  	ioc3_stop(ip);
> +
> +	/* Allocate and rx ring.  4kb = 512 entries  */
> +	ip->rxr = (unsigned long *)get_zeroed_page(GFP_ATOMIC);
> +	if (!ip->rxr) {
> +		pr_err("ioc3-eth: rx ring allocation failed\n");
> +		err = -ENOMEM;
> +		goto out_stop;
> +	}
> +
> +	/* Allocate tx rings.  16kb = 128 bufs.  */
> +	ip->txr = (struct ioc3_etxd *)__get_free_pages(GFP_KERNEL, 2);
> +	if (!ip->txr) {
> +		pr_err("ioc3-eth: tx ring allocation failed\n");
> +		err = -ENOMEM;
> +		goto out_stop;
> +	}

Please just use kcalloc()/kmalloc_array() here, and make sure the flags
are set to GFP_KERNEL whenever possible. Here and in ioc3_alloc_rings()
it looks like GFP_ATOMIC is unnecessary.

^ permalink raw reply

* Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Florian Westphal @ 2019-08-29 20:58 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Pablo Neira Ayuso, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	Alexey Kuznetsov, Hideaki YOSHIFUJI
In-Reply-To: <b6585989069fd832a65b73d1c4f4319a10714165.camel@linux.ibm.com>

Leonardo Bras <leonardo@linux.ibm.com> wrote:
> On Thu, 2019-08-29 at 17:04 -0300, Leonardo Bras wrote:
> > > Thats a good point -- Leonardo, is the
> > > "net.bridge.bridge-nf-call-ip6tables" sysctl on?
> > 
> > Running
> > # sudo sysctl -a
> > I can see:
> > net.bridge.bridge-nf-call-ip6tables = 1
> 
> Also, doing
> # echo 0 >  /proc/sys/net/bridge/bridge-nf-call-ip6tables 
> And then trying to boot the guest will not crash the host.
> 
> Which would make sense, since host iptables is not dealing with guest
> IPv6 packets.

Yes.

> So, the real cause of this bug is the bridge making host ip6tables deal
> with guest IPv6 packets ? 
> If so, would it be ok if write a patch testing ipv6_mod_enabled()
> before passing guest ipv6 packets to host ip6tables? 

I'm not sure.  This switch is very old, it was added 10 years ago
in v2.6.31-rc1.

Even if we disable call-ip6tables in br_netfilter we will at least
in addition need a patch for nft_fib_netdev.c.

From a "avoid calls to ipv6 stack when its disabled" standpoint,
the safest fix is to disable call-ip6tables functionality if ipv6
module is off *and* fix nft_fib_netdev.c to BREAK in ipv6 is off case.

I started to place a list of suspicous modules here, but that got out
of hand quickly.

So, given I don't want to plaster ipv6_mod_enabled() everywhere, I
would suggest this course of action:

1. add a patch to BREAK in nft_fib_netdev.c for !ipv6_mod_enabled()
2. change net/bridge/br_netfilter_hooks.c, br_nf_pre_routing() to
   make sure ipv6_mod_enabled() is true before doing the ipv6 stack
   "emulation".

Makes sense?

Thanks,
Florian

^ permalink raw reply

* Re: [bpf-next, v2] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Song Liu @ 2019-08-29 20:41 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, Networking
In-Reply-To: <20190826162517.8082-1-danieltimlee@gmail.com>

On Mon, Aug 26, 2019 at 9:52 AM Daniel T. Lee <danieltimlee@gmail.com> wrote:
>
> Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> to 600. To make this size flexible, a new map 'pcktsz' is added.
>
> By updating new packet size to this map from the userland,
> xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
>
> If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> will be 600 as a default.

Please also cc bpf@vger.kernel.org for bpf patches.

>
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

With a nit below.

[...]

> diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> index a3596b617c4c..29ade7caf841 100644
> --- a/samples/bpf/xdp_adjust_tail_user.c
> +++ b/samples/bpf/xdp_adjust_tail_user.c
> @@ -72,6 +72,7 @@ static void usage(const char *cmd)
>         printf("Usage: %s [...]\n", cmd);
>         printf("    -i <ifname|ifindex> Interface\n");
>         printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> +       printf("    -P <MAX_PCKT_SIZE> Default: 600\n");

nit: printf("    -P <MAX_PCKT_SIZE> Default: %u\n", MAX_PCKT_SIZE);

^ permalink raw reply

* Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Leonardo Bras @ 2019-08-29 20:29 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, David S. Miller
  Cc: netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	Alexey Kuznetsov, Hideaki YOSHIFUJI
In-Reply-To: <db0f02c5b1a995fde174f036540a3d11008cf116.camel@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

On Thu, 2019-08-29 at 17:04 -0300, Leonardo Bras wrote:
> > Thats a good point -- Leonardo, is the
> > "net.bridge.bridge-nf-call-ip6tables" sysctl on?
> 
> Running
> # sudo sysctl -a
> I can see:
> net.bridge.bridge-nf-call-ip6tables = 1

Also, doing
# echo 0 >  /proc/sys/net/bridge/bridge-nf-call-ip6tables 
And then trying to boot the guest will not crash the host.

Which would make sense, since host iptables is not dealing with guest
IPv6 packets.

So, the real cause of this bug is the bridge making host ip6tables deal
with guest IPv6 packets ? 
If so, would it be ok if write a patch testing ipv6_mod_enabled()
before passing guest ipv6 packets to host ip6tables? 


Best regards,

>  
> So this packets are sent to host iptables for processing?
> 
> 
> (Sorry for the delay, I did not received the previous e-mails.
> Please include me in to/cc.)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Florian Westphal @ 2019-08-29 20:29 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Pablo Neira Ayuso, Florian Westphal, David S. Miller,
	netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	Alexey Kuznetsov, Hideaki YOSHIFUJI
In-Reply-To: <db0f02c5b1a995fde174f036540a3d11008cf116.camel@linux.ibm.com>

Leonardo Bras <leonardo@linux.ibm.com> wrote:
> > Thats a good point -- Leonardo, is the
> > "net.bridge.bridge-nf-call-ip6tables" sysctl on?
> 
> Running
> # sudo sysctl -a
> I can see:
> net.bridge.bridge-nf-call-ip6tables = 1
>  
> So this packets are sent to host iptables for processing?

Yes, this is an hold hack that was made because ebtables is
very feature-limited.

However, as I mentioned before I don't think there is anything
we can do here except audit all affected nft expressions and ip6tables
matches and add this check where needed.  ip6t_rpfilter.c comes to mind.

In any case your patch looks ok to me.

> (Sorry for the delay, I did not received the previous e-mails.
> Please include me in to/cc.)

Sorry about that.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Jesper Dangaard Brouer @ 2019-08-29 20:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Alexei Starovoitov, luto, davem, peterz,
	rostedt, netdev, bpf, kernel-team, linux-api, brouer
In-Reply-To: <87imqfhmo2.fsf@toke.dk>

On Thu, 29 Aug 2019 20:05:49 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:  
> >> Alexei Starovoitov <ast@kernel.org> writes:
> >>   
> >> > CAP_BPF allows the following BPF operations:
> >> > - Loading all types of BPF programs
> >> > - Creating all types of BPF maps except:
> >> >    - stackmap that needs CAP_TRACING
> >> >    - devmap that needs CAP_NET_ADMIN
> >> >    - cpumap that needs CAP_SYS_ADMIN  
> >> 
> >> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?  
> >
> > Currently it's cap_sys_admin and I think it should stay this way
> > because it creates kthreads.  
> 
> Ah, right. I can sorta see that makes sense because of the kthreads, but
> it also means that you can use all of XDP *except* cpumap with
> CAP_NET_ADMIN+CAP_BPF. That is bound to create confusion, isn't it?
 
Hmm... I see 'cpumap' primarily as a network stack feature.  It is about
starting the network stack on a specific CPU, allocating and building
SKBs on that remote CPU.  It can only be used together with XDP_REDIRECT.
I would prefer CAP_NET_ADMIN like the devmap, to keep the XDP
capabilities consistent.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Leonardo Bras @ 2019-08-29 20:04 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, David S. Miller
  Cc: netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	Alexey Kuznetsov, Hideaki YOSHIFUJI
In-Reply-To: <20190821141505.2394-1-leonardo@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 336 bytes --]

> Thats a good point -- Leonardo, is the
> "net.bridge.bridge-nf-call-ip6tables" sysctl on?

Running
# sudo sysctl -a
I can see:
net.bridge.bridge-nf-call-ip6tables = 1
 
So this packets are sent to host iptables for processing?


(Sorry for the delay, I did not received the previous e-mails.
Please include me in to/cc.)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next] tools: libbpf: update extended attributes version of bpf_object__open()
From: Song Liu @ 2019-08-29 20:02 UTC (permalink / raw)
  To: Anton Protopopov
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann, Martin Lau,
	Yonghong Song, Networking, bpf, linux-kernel@vger.kernel.org
In-Reply-To: <20190815000330.12044-1-a.s.protopopov@gmail.com>



> On Aug 14, 2019, at 5:03 PM, Anton Protopopov <a.s.protopopov@gmail.com> wrote:
> 

[...]

> 
> 
> int bpf_object__unload(struct bpf_object *obj)
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index e8f70977d137..634f278578dd 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -63,8 +63,13 @@ LIBBPF_API libbpf_print_fn_t libbpf_set_print(libbpf_print_fn_t fn);
> struct bpf_object;
> 
> struct bpf_object_open_attr {
> -	const char *file;
> +	union {
> +		const char *file;
> +		const char *obj_name;
> +	};
> 	enum bpf_prog_type prog_type;
> +	void *obj_buf;
> +	size_t obj_buf_sz;
> };

I think this would break dynamically linked libbpf. No?

Thanks,
Song


^ permalink raw reply

* Re: linux-next: Tree for Aug 29 (mlx5)
From: Randy Dunlap @ 2019-08-29 19:55 UTC (permalink / raw)
  To: Stephen Rothwell, Linux Next Mailing List
  Cc: Linux Kernel Mailing List, netdev@vger.kernel.org, Saeed Mahameed,
	Leon Romanovsky
In-Reply-To: <3cbf3e88-53b5-0eb3-9863-c4031b9aed9f@infradead.org>

On 8/29/19 12:54 PM, Randy Dunlap wrote:
> On 8/29/19 4:08 AM, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20190828:
>>
> 
> 
> on x86_64:
> when CONFIG_PCI_HYPERV=m

and CONFIG_PCI_HYPERV_INTERFACE=m

> and mxlx5 is builtin (=y).
> 
> ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function `mlx5_unload':
> main.c:(.text+0x5d): undefined reference to `mlx5_hv_vhca_cleanup'
> ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function `mlx5_cleanup_once':
> main.c:(.text+0x158): undefined reference to `mlx5_hv_vhca_destroy'
> ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function `mlx5_load_one':
> main.c:(.text+0x4191): undefined reference to `mlx5_hv_vhca_create'
> ld: main.c:(.text+0x4772): undefined reference to `mlx5_hv_vhca_init'
> ld: main.c:(.text+0x4b07): undefined reference to `mlx5_hv_vhca_cleanup'
> 
> 


-- 
~Randy

^ permalink raw reply

* Re: linux-next: Tree for Aug 29 (mlx5)
From: Randy Dunlap @ 2019-08-29 19:54 UTC (permalink / raw)
  To: Stephen Rothwell, Linux Next Mailing List
  Cc: Linux Kernel Mailing List, netdev@vger.kernel.org, Saeed Mahameed,
	Leon Romanovsky
In-Reply-To: <20190829210845.41a9e193@canb.auug.org.au>

On 8/29/19 4:08 AM, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20190828:
> 


on x86_64:
when CONFIG_PCI_HYPERV=m
and mxlx5 is builtin (=y).

ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function `mlx5_unload':
main.c:(.text+0x5d): undefined reference to `mlx5_hv_vhca_cleanup'
ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function `mlx5_cleanup_once':
main.c:(.text+0x158): undefined reference to `mlx5_hv_vhca_destroy'
ld: drivers/net/ethernet/mellanox/mlx5/core/main.o: in function `mlx5_load_one':
main.c:(.text+0x4191): undefined reference to `mlx5_hv_vhca_create'
ld: main.c:(.text+0x4772): undefined reference to `mlx5_hv_vhca_init'
ld: main.c:(.text+0x4b07): undefined reference to `mlx5_hv_vhca_cleanup'


-- 
~Randy

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Ido Schimmel @ 2019-08-29 19:36 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jiri Pirko, Horatiu Vultur, alexandre.belloni, UNGLinuxDriver,
	davem, allan.nielsen, ivecera, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829182957.GA17530@lunn.ch>

On Thu, Aug 29, 2019 at 08:29:57PM +0200, Andrew Lunn wrote:
> > Hi Andrew,
> > 
> > What happens when you run tcpdump on a routed interface without putting
> > it in promiscuous mode ('-p')? If it is a pure software switch, then you
> > see all unicast packets addressed to your interface's MAC address. What
> > happens when the same is done on a hardware switch? With the proposed
> > solution you will not get the same result.
> > 
> > On a software switch, when you run tcpdump without '-p', do you incur
> > major packet loss? No. Will this happen when you punt several Tbps to
> > your CPU on the hardware switch? Yes.
> 
> Hi Ido
> 
> Please think about the general case, not your hardware. A DSA switch
> generally has 1G ports. And the connection to the host is generally
> 1G, maybe 2.5G. So if i put one interface into promisc mode, i will
> probably receive the majority of the traffic on that port, so long as
> there is not too much traffic from other ports towards the CPU.
> 
> I also don't expect any major packet loss in the switch. It is still
> hardware switching, but also sending a copy to the CPU. That copy will
> have the offload_fwd_mark bit set, so the bridge will discard the
> frame. The switch egress queue towards the CPU might overflow, but
> that means tcpdump does not get to see all the frames, and some
> traffic which is actually heading to the CPU is lost. But that can
> happen anyway.

The potential packet loss was only one example why using promiscuous
mode as an indication to punt all traffic to the CPU is wrong. I also
mentioned that you will not capture any traffic (besides
control/exception) when '-p' is specified.

> We should also think about the different classes of users. Somebody
> using a TOR switch with a NOS is very different to a user of a SOHO
> switch in their WiFi access point. The first probably knows tc very
> well, the second has probably never heard of it, and just wants
> tcpdump to work like on their desktop.

I fully agree that we should make it easy for users to capture offloaded
traffic, which is why I suggested patching libpcap. Add a flag to
capable netdevs that tells libpcap that in order to capture all the
traffic from this interface it needs to add a tc filter with a trap
action. That way zero familiarity with tc is required from users.

I really believe that instead of interpreting IFF_PROMISC in exotic ways
and pushing all this logic into the kernel, we should instead teach user
space utilities to capture offloaded traffic.

^ permalink raw reply

* Re: Is bug 200755 in anyone's queue??
From: Willem de Bruijn @ 2019-08-29 19:26 UTC (permalink / raw)
  To: Steve Zabele
  Cc: Network Development, shum, vladimir116, saifi.khan, saifi.khan,
	Daniel Borkmann, on2k16nm, Stephen Hemminger
In-Reply-To: <01db01d559e5$64d71de0$2e8559a0$@net>

On Fri, Aug 23, 2019 at 3:11 PM Steve Zabele <zabele@comcast.net> wrote:
>
> Hi folks,
>
> Is there a way to find out where the SO_REUSEPORT bug reported a year ago in
> August (and apparently has been a bug with kernels later than 4.4) is being
> addressed?
>
> The bug characteristics, simple standalone test code demonstrating the bug,
> and an assessment of the likely location/cause of the bug within the kernel
> are all described here
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200755
>
> I'm really hoping this gets fixed so we can move forward on updating our
> kernels/Ubuntu release from our aging 4.4/16.04 release
>
> Thanks!
>
> Steve
>
>
>
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, July 16, 2019 10:03 AM
> To: Steve Zabele
> Cc: shum@canndrew.org; vladimir116@gmail.com; saifi.khan@DataSynergy.org;
> saifi.khan@strikr.in; daniel@iogearbox.net; on2k16nm@gmail.com
> Subject: Re: Is bug 200755 in anyone's queue??
>
> On Tue, 16 Jul 2019 09:43:24 -0400
> "Steve Zabele" <zabele@comcast.net> wrote:
>
>
> > I came across bug report 200755 trying to figure out why some code I had
> > provided to customers a while ago no longer works with the current Linux
> > kernel. See
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=200755
> >
> > I've verified that, as reported, 'connect' no longer works for UDP.
> > Moreover, it appears it has been broken since the 4.5 kernel has been
> > released.
> >
> >
> >
> > It does also appear that the intended new feature of doing round robin
> > assignments to different UDP sockets opened with SO_REUSEPORT also does
> not
> > work as described.
> >
> >
> >
> > Since the original bug report was made nearly a year ago for the 4.14
> kernel
> > (and the bug is also still present in the 4.15 kernel) I'm curious if
> anyone
> > is on the hook to get this fixed any time soon.
> >
> >
> >
> > I'd rather not have to do my own demultiplexing using a single socket in
> > user space to work around what is clearly a (maybe not so recently
> > introduced) kernel bug if at all possible. My code had worked just fine on
> > 3.X kernels, and appears to work okay up through 4.4.
> >
>
> Kernel developers do not use bugzilla, I forward bug reports
> to netdev@vger.kernel.org (after filtering).

SO_REUSEPORT was not intended to be used in this way. Opening
multiple connected sockets with the same local port.

But since the interface allowed connect after joining a group, and
that is being used, I guess that point is moot. Still, I'm a bit
surprised that it ever worked as described.

Also note that the default distribution algorithm is not round robin
assignment, but hash based. So multiple consecutive datagrams arriving
at the same socket is not unexpected.

I suspect that this quick hack might "work". It seemed to on the
supplied .c file:

                  score = compute_score(sk, net, saddr, sport,
                                        daddr, hnum, dif, sdif);
                  if (score > badness) {
  -                       if (sk->sk_reuseport) {
  +                       if (sk->sk_reuseport && !sk->sk_state !=
TCP_ESTABLISHED) {

But a more robust approach, that also works on existing kernels, is to
swap the default distribution algorithm with a custom BPF based one (
SO_ATTACH_REUSEPORT_EBPF).

^ permalink raw reply

* Re: general protection fault in tls_sk_proto_close (2)
From: Jakub Kicinski @ 2019-08-29 18:53 UTC (permalink / raw)
  To: John Fastabend
  Cc: Hillf Danton, syzbot, aviadye, borisp, daniel, davejwatson, davem,
	linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <5d681e0011c7b_6b462ad11252c5c084@john-XPS-13-9370.notmuch>

On Thu, 29 Aug 2019 11:48:32 -0700, John Fastabend wrote:
> Jakub Kicinski wrote:
> > On Thu, 29 Aug 2019 11:52:00 +0800, Hillf Danton wrote:  
> > > Alternatively work is done if sock is closed again. Anyway ctx is reset
> > > under sock's callback lock in write mode.
> > > 
> > > --- a/net/tls/tls_main.c
> > > +++ b/net/tls/tls_main.c
> > > @@ -295,6 +295,8 @@ static void tls_sk_proto_close(struct so
> > >  	long timeo = sock_sndtimeo(sk, 0);
> > >  	bool free_ctx;
> > >  
> > > +	if (!ctx)
> > > +		return;
> > >  	if (ctx->tx_conf == TLS_SW)
> > >  		tls_sw_cancel_work_tx(ctx);  
> > 
> > That's no bueno, the real socket's close will never get called.  
> 
> Seems when we refactored BPF side we dropped the check for ULP on one
> path so I'll add that back now. It would be nice and seems we are
> getting closer now that tls side is a bit more dynamic if the ordering
> didn't matter.

We'd probably need some more generic way of communicating the changes
in sk_proto stack, e.g. by moving the update into one of sk_proto
callbacks? but yes.

> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 1330a7442e5b..30d11558740e 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -666,6 +666,8 @@ static int sock_hash_update_common(struct bpf_map *map, void *key,
>         WARN_ON_ONCE(!rcu_read_lock_held());
>         if (unlikely(flags > BPF_EXIST))
>                 return -EINVAL;
> +       if (unlikely(icsk->icsk_ulp_data))
> +               return -EINVAL;
> 
>         link = sk_psock_init_link();
>         if (!link)

Thanks! That looks good, if you feel like submitting officially feel
free to add my Reviewed-by!

^ permalink raw reply

* Re: auto-split of commit. Was: [PATCH bpf-next 04/10] tools/bpf: add libbpf_prog_type_(from|to)_str helpers
From: Jakub Kicinski @ 2019-08-29 18:50 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Alexei Starovoitov, Greg Kroah-Hartman, Julia Kartseva, ast,
	Thomas Gleixner, rdna, bpf, daniel, netdev, kernel-team
In-Reply-To: <20190829181039.GD28011@kernel.org>

On Thu, 29 Aug 2019 15:10:39 -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Aug 29, 2019 at 10:16:56AM -0700, Alexei Starovoitov escreveu:
> > On Thu, Aug 29, 2019 at 08:51:51AM +0200, Greg Kroah-Hartman wrote:  
> > > That being said, from a "are you keeping the correct authorship info",
> > > yes, it sounds like you are doing the correct thing here.  
> 
> > > Look at what I do for stable kernels, I take the original commit and add
> > > it to "another tree" keeping the original author and s-o-b chain intact,
> > > and adding a "this is the original git commit id" type message to the
> > > changelog text so that people can link it back to the original.  
>  
> > I think you're describing 'git cherry-pick -x'.
> > The question was about taking pieces of the original commit. Not the whole commit.
> > Author field obviously stays, but SOB is questionable.
> > If author meant to change X and Y and Z. Silently taking only Z chunk of the diff
> > doesn't quite seem right.
> > If we document that such commit split happens in Documentation/bpf/bpf_devel_QA.rst
> > do you think it will be enough to properly inform developers?  
> 
> Can't we instead establish the rule that for something to be added to
> tools/include/ it should first land in a separate commit in include/,
> ditto for the other things tools/ copies from the kernel sources.

In practice in for BPF work the tools/include/ patch is always part of
the same patch set, since the patch sets usually include libbpf support,
tests that need libbpf etc.
 
> That was the initial intention of tools/include/ and also that is how
> tools/perf/check-headers.h works, warning when something ot out of sync,
> etc.
> 
> I.e. the tools/ maintainers should refuse patches that touch both
> tools/include and tools/.
> 
> wdyt?

It's not only about include/. The series that sparked this query is
moving code from tools/bpf/ to tools/lib/bpf/. And each move is split
into two commits add and delete. That's utterly pointless and a waste
of reviewers' time.

But the question is larger still. As I said vendors maintain
out-of-tree version of their drivers, by necessity, e.g.:

https://github.com/Netronome/nfp-drv-kmods is a #ifdef'd version of 
driver/net/ethernet/netronome/nfp.

If there is a problem of loosing SOB when we only apply a part of a
commit, e.g.

https://github.com/Netronome/nfp-drv-kmods/commit/79941cccea4a7720539e35a72c3ba789e4d4bf8c

which is part of:

ef01adae0e43 ("net: sched: use major priority number as hardware priority")

upstream - then we really need a clear ruling here.

^ permalink raw reply

* Re: general protection fault in tls_sk_proto_close (2)
From: John Fastabend @ 2019-08-29 18:48 UTC (permalink / raw)
  To: Jakub Kicinski, Hillf Danton
  Cc: john.fastabend, syzbot, aviadye, borisp, daniel, davejwatson,
	davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <20190829094343.0248c61c@cakuba.netronome.com>

Jakub Kicinski wrote:
> On Thu, 29 Aug 2019 11:52:00 +0800, Hillf Danton wrote:
> > Alternatively work is done if sock is closed again. Anyway ctx is reset
> > under sock's callback lock in write mode.
> > 
> > --- a/net/tls/tls_main.c
> > +++ b/net/tls/tls_main.c
> > @@ -295,6 +295,8 @@ static void tls_sk_proto_close(struct so
> >  	long timeo = sock_sndtimeo(sk, 0);
> >  	bool free_ctx;
> >  
> > +	if (!ctx)
> > +		return;
> >  	if (ctx->tx_conf == TLS_SW)
> >  		tls_sw_cancel_work_tx(ctx);
> 
> That's no bueno, the real socket's close will never get called.

Seems when we refactored BPF side we dropped the check for ULP on one
path so I'll add that back now. It would be nice and seems we are
getting closer now that tls side is a bit more dynamic if the ordering
didn't matter.

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 1330a7442e5b..30d11558740e 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -666,6 +666,8 @@ static int sock_hash_update_common(struct bpf_map *map, void *key,
        WARN_ON_ONCE(!rcu_read_lock_held());
        if (unlikely(flags > BPF_EXIST))
                return -EINVAL;
+       if (unlikely(icsk->icsk_ulp_data))
+               return -EINVAL;

        link = sk_psock_init_link();
        if (!link)

^ permalink raw reply related

* Proposal: r8152 firmware patching framework
From: Prashant Malani @ 2019-08-29 18:40 UTC (permalink / raw)
  To: Hayes Wang, David Miller, netdev; +Cc: nic_swsd, Grant Grundler

Hi,

The r8152 driver source code distributed by Realtek (on
www.realtek.com) contains firmware patches. This involves binary
byte-arrays being written byte/word-wise to the hardware memory
Example: grundler@chromium.org (cc-ed) has an experimental patch which
includes the firmware patching code which was distributed with the
Realtek source :
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1417953

It would be nice to have a way to incorporate these firmware fixes
into the upstream code. Since having indecipherable byte-arrays is not
possible upstream, I propose the following:
- We use the assistance of Realtek to come up with a format which the
firmware patch files can follow (this can be documented in the
comments).
       - A real simple format could look like this:
               +
<section1><size_in_bytes><address1><data1><address2><data2>...<addressN><dataN><section2>...
                + The driver would be able to understand how to parse
each section (e.g is each data entry a byte or a word?)

- We use request_firmware() to load the firmware, parse it and write
the data to the relevant registers.

I'm unfamiliar with what the preferred method of firmware patching is,
so I hope the maintainers can help suggest the best path forward.

As an aside: It would be great if Realtek could publish a list of
fixes that the firmware patches implement (I think a list on the
driver download page on the Realtek website would be an excellent
starting point).

Thanks and Best regards,

-Prashant

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox