Netdev List
 help / color / mirror / Atom feed
* Re: [Patch net] netrom: hold sock when setting skb->destructor
From: David Miller @ 2019-07-24 22:49 UTC (permalink / raw)
  To: xiyou.wangcong
  Cc: netdev, syzbot+622bdabb128acc33427d, syzbot+6eaef7158b19e3fec3a0,
	syzbot+9399c158fcc09b21d0d2, syzbot+a34e5f3d0300163f0c87, ralf
In-Reply-To: <20190723034122.23166-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon, 22 Jul 2019 20:41:22 -0700

> sock_efree() releases the sock refcnt, if we don't hold this refcnt
> when setting skb->destructor to it, the refcnt would not be balanced.
> This leads to several bug reports from syzbot.
> 
> I have checked other users of sock_efree(), all of them hold the
> sock refcnt.
> 
> Fixes: c8c8218ec5af ("netrom: fix a memory leak in nr_rx_frame()")
> Reported-and-tested-by: <syzbot+622bdabb128acc33427d@syzkaller.appspotmail.com>
> Reported-and-tested-by: <syzbot+6eaef7158b19e3fec3a0@syzkaller.appspotmail.com>
> Reported-and-tested-by: <syzbot+9399c158fcc09b21d0d2@syzkaller.appspotmail.com>
> Reported-and-tested-by: <syzbot+a34e5f3d0300163f0c87@syzkaller.appspotmail.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [RFC PATCH net-next 00/12] drop_monitor: Capture dropped packets and metadata
From: David Miller @ 2019-07-24 22:48 UTC (permalink / raw)
  To: toke
  Cc: idosch, netdev, nhorman, dsahern, roopa, nikolay, jakub.kicinski,
	andy, f.fainelli, andrew, vivien.didelot, mlxsw, idosch
In-Reply-To: <87imrt4zzg.fsf@toke.dk>

From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Mon, 22 Jul 2019 21:43:15 +0200

> I like this!

I definitely think this is going in the right direction as well.

^ permalink raw reply

* Re: [PATCH] net: sfc: falcon: convert to i2c_new_dummy_device
From: David Miller @ 2019-07-24 22:47 UTC (permalink / raw)
  To: wsa+renesas
  Cc: linux-i2c, linux-net-drivers, ecree, mhabets, netdev,
	linux-kernel
In-Reply-To: <20190722172635.4535-1-wsa+renesas@sang-engineering.com>

From: Wolfram Sang <wsa+renesas@sang-engineering.com>
Date: Mon, 22 Jul 2019 19:26:35 +0200

> Move from i2c_new_dummy() to i2c_new_dummy_device(). So, we now get an
> ERRPTR which we use in error handling.
> 
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>

Solarflare folks, please review/test.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] [net-next] mlx4: avoid large stack usage in mlx4_init_hca()
From: David Miller @ 2019-07-24 22:47 UTC (permalink / raw)
  To: arnd
  Cc: tariqt, ereza, jackm, eli, moshe, jiri, netdev, linux-rdma,
	linux-kernel, clang-built-linux
In-Reply-To: <20190722150204.1157315-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Mon, 22 Jul 2019 17:01:55 +0200

> The mlx4_dev_cap and mlx4_init_hca_param are really too large
> to be put on the kernel stack, as shown by this clang warning:
> 
> drivers/net/ethernet/mellanox/mlx4/main.c:3304:12: error: stack frame size of 1088 bytes in function 'mlx4_load_one' [-Werror,-Wframe-larger-than=]
> 
> With gcc, the problem is the same, but it does not warn because
> it does not inline this function, and therefore stays just below
> the warning limit, while clang is just above it.
> 
> Use kzalloc for dynamic allocation instead of putting them
> on stack. This gets the combined stack frame down to 424 bytes.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] qed: reduce maximum stack frame size
From: David Miller @ 2019-07-24 22:46 UTC (permalink / raw)
  To: arnd
  Cc: aelior, GR-everest-linux-l2, Yuval.Mintz, manishc,
	michal.kalderon, skalluru, denis.bolotin, rverma, netdev,
	linux-kernel, clang-built-linux
In-Reply-To: <20190722150133.1157096-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Mon, 22 Jul 2019 17:01:23 +0200

> clang warns about an overly large stack frame in one function
> when it decides to inline all __qed_get_vport_*() functions into
> __qed_get_vport_stats():
> 
> drivers/net/ethernet/qlogic/qed/qed_l2.c:1889:13: error: stack frame size of 1128 bytes in function '_qed_get_vport_stats' [-Werror,-Wframe-larger-than=]
> 
> Use a noinline_for_stack annotation to prevent clang from inlining
> these, which keeps the maximum stack usage at around half of that
> in the worst case, similar to what we get with gcc.
> 
> Fixes: 86622ee75312 ("qed: Move statistics to L2 code")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH] ovs: datapath: hide clang frame-overflow warnings
From: David Miller @ 2019-07-24 22:45 UTC (permalink / raw)
  To: arnd
  Cc: pshelar, xiangxia.m.yue, johannes.berg, kjlu, netdev, dev,
	linux-kernel, clang-built-linux
In-Reply-To: <20190722150018.1156794-1-arnd@arndb.de>

From: Arnd Bergmann <arnd@arndb.de>
Date: Mon, 22 Jul 2019 17:00:01 +0200

> Some functions in the datapath code are factored out so that each
> one has a stack frame smaller than 1024 bytes with gcc. However,
> when compiling with clang, the functions are inlined more aggressively
> and combined again so we get
> 
> net/openvswitch/datapath.c:1124:12: error: stack frame size of 1528 bytes in function 'ovs_flow_cmd_set' [-Werror,-Wframe-larger-than=]
> 
> Marking both get_flow_actions() and ovs_nla_init_match_and_action()
> as 'noinline_for_stack' gives us the same behavior that we see with
> gcc, and no warning. Note that this does not mean we actually use
> less stack, as the functions call each other, and we still get
> three copies of the large 'struct sw_flow_key' type on the stack.
> 
> The comment tells us that this was previously considered safe,
> presumably since the netlink parsing functions are called with
> a known backchain that does not also use a lot of stack space.
> 
> Fixes: 9cc9a5cb176c ("datapath: Avoid using stack larger than 1024.")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied, thanks Arnd.

^ permalink raw reply

* Re: [PATCH bpf-next 1/7] bpf/flow_dissector: pass input flags to BPF flow dissector program
From: Song Liu @ 2019-07-24 22:45 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, Networking, bpf, David S . Miller,
	Alexei Starovoitov, Daniel Borkmann, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <20190724223210.GA3500@mini-arch>

On Wed, Jul 24, 2019 at 3:32 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 07/24, Song Liu wrote:
> > On Wed, Jul 24, 2019 at 10:11 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > C flow dissector supports input flags that tell it to customize parsing
> > > by either stopping early or trying to parse as deep as possible. Pass
> > > those flags to the BPF flow dissector so it can make the same
> > > decisions. In the next commits I'll add support for those flags to
> > > our reference bpf_flow.c
> > >
> > > Cc: Willem de Bruijn <willemb@google.com>
> > > Cc: Petar Penkov <ppenkov@google.com>
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  include/linux/skbuff.h       | 2 +-
> > >  include/net/flow_dissector.h | 4 ----
> > >  include/uapi/linux/bpf.h     | 5 +++++
> > >  net/bpf/test_run.c           | 2 +-
> > >  net/core/flow_dissector.c    | 5 +++--
> > >  5 files changed, 10 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > index 718742b1c505..9b7a8038beec 100644
> > > --- a/include/linux/skbuff.h
> > > +++ b/include/linux/skbuff.h
> > > @@ -1271,7 +1271,7 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> > >
> > >  struct bpf_flow_dissector;
> > >  bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
> > > -                     __be16 proto, int nhoff, int hlen);
> > > +                     __be16 proto, int nhoff, int hlen, unsigned int flags);
> > >
> > >  bool __skb_flow_dissect(const struct net *net,
> > >                         const struct sk_buff *skb,
> > > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> > > index 90bd210be060..3e2642587b76 100644
> > > --- a/include/net/flow_dissector.h
> > > +++ b/include/net/flow_dissector.h
> > > @@ -253,10 +253,6 @@ enum flow_dissector_key_id {
> > >         FLOW_DISSECTOR_KEY_MAX,
> > >  };
> > >
> > > -#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG                BIT(0)
> > > -#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL    BIT(1)
> > > -#define FLOW_DISSECTOR_F_STOP_AT_ENCAP         BIT(2)
> > > -
> > >  struct flow_dissector_key {
> > >         enum flow_dissector_key_id key_id;
> > >         size_t offset; /* offset of struct flow_dissector_key_*
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index fa1c753dcdbc..b4ad19bd6aa8 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -3507,6 +3507,10 @@ enum bpf_task_fd_type {
> > >         BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> > >  };
> > >
> > > +#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG                (1U << 0)
> > > +#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL    (1U << 1)
> > > +#define FLOW_DISSECTOR_F_STOP_AT_ENCAP         (1U << 2)
> >
> > Do we have to move these?
> These have to be a part of UAPI one way or another. The easiest thing
> we can do is to call the existing flags a UAPI and move them into
> exported headers. Alternatively, we can introduce new set of UAPI flags
> and do some kind of conversion between exported and internal ones.
>
> Since it's pretty easy to add/deprecate them (see cover letter), I've
> decided to just move them instead of adding another set and doing
> conversion. I'm open to suggestions if you think otherwise.

I see. This looks good to me. Thanks for the explanation.

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-24 22:44 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf
In-Reply-To: <CAPhsuW4HPjXE+zZGmPM9GVPgnVieRr0WOuXfM0W6ec3SB4imDw@mail.gmail.com>

On Wed, Jul 24, 2019 at 2:40 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
> >
> > On a successful call buf and buf_len will contain correct data and in
> > case prev_key was provided (not for the first walk, since prev_key is
> > NULL) it will contain the last_key copied into the prev_key which will
> > simplify next call.
> >
> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
> >
> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
> >
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
> > It is also important to note that with  a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
> >
> > Suggested-by: Stanislav Fomichev <sdf@google.com>
> > Signed-off-by: Brian Vazquez <brianvv@google.com>
> > ---
> >  include/uapi/linux/bpf.h |   9 +++
> >  kernel/bpf/syscall.c     | 117 +++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 126 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index fa1c753dcdbc7..66dab5385170d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -106,6 +106,7 @@ enum bpf_cmd {
> >         BPF_TASK_FD_QUERY,
> >         BPF_MAP_LOOKUP_AND_DELETE_ELEM,
> >         BPF_MAP_FREEZE,
> > +       BPF_MAP_DUMP,
> >  };
> >
> >  enum bpf_map_type {
> > @@ -388,6 +389,14 @@ union bpf_attr {
> >                 __u64           flags;
> >         };
> >
> > +       struct { /* struct used by BPF_MAP_DUMP command */
> > +               __aligned_u64   prev_key;
> > +               __aligned_u64   buf;
> > +               __aligned_u64   buf_len; /* input/output: len of buf */
> > +               __u64           flags;
>
> Please add explanation of flags here.

got it!

> Also, we need to update the
> comments of BPF_F_LOCK for BPF_MAP_DUMP.

What do you mean? I didn't get this part.

>
> > +               __u32           map_fd;
> > +       } dump;
> > +
> >         struct { /* anonymous struct used by BPF_PROG_LOAD command */
> >                 __u32           prog_type;      /* one of enum bpf_prog_type */
> >                 __u32           insn_cnt;
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 86cdc2f7bb56e..0c35505aa219f 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr)
> >         return err;
> >  }
> >
> > +/* last field in 'union bpf_attr' used by this command */
> > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd
> > +
> > +static int map_dump(union bpf_attr *attr)
> > +{
> > +       void __user *ukey = u64_to_user_ptr(attr->dump.prev_key);
> > +       void __user *ubuf = u64_to_user_ptr(attr->dump.buf);
> > +       u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len);
> > +       int ufd = attr->dump.map_fd;
> > +       struct bpf_map *map;
> > +       void *buf, *prev_key, *key, *value;
> > +       u32 value_size, elem_size, buf_len, cp_len;
> > +       struct fd f;
> > +       int err;
> > +       bool first_key = false;
> > +
> > +       if (CHECK_ATTR(BPF_MAP_DUMP))
> > +               return -EINVAL;
> > +
> > +       if (attr->dump.flags & ~BPF_F_LOCK)
> > +               return -EINVAL;
> > +
> > +       f = fdget(ufd);
> > +       map = __bpf_map_get(f);
> > +       if (IS_ERR(map))
> > +               return PTR_ERR(map);
> > +       if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) {
> > +               err = -EPERM;
> > +               goto err_put;
> > +       }
> > +
> > +       if ((attr->dump.flags & BPF_F_LOCK) &&
> > +           !map_value_has_spin_lock(map)) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
>
> We can share these lines with map_lookup_elem(). Maybe
> add another helper function?

Which are the lines you are referring to? the dump.flags? It makes
sense so that way when a new flag is added you only need to modify
them in one spot.

> > +
> > +       if (map->map_type == BPF_MAP_TYPE_QUEUE ||
> > +           map->map_type == BPF_MAP_TYPE_STACK) {
> > +               err = -ENOTSUPP;
> > +               goto err_put;
> > +       }
> > +
> > +       value_size = bpf_map_value_size(map);
> > +
> > +       err = get_user(buf_len, ubuf_len);
> > +       if (err)
> > +               goto err_put;
> > +
> > +       elem_size = map->key_size + value_size;
> > +       if (buf_len < elem_size) {
> > +               err = -EINVAL;
> > +               goto err_put;
> > +       }
> > +
> > +       if (ukey) {
> > +               prev_key = __bpf_copy_key(ukey, map->key_size);
> > +               if (IS_ERR(prev_key)) {
> > +                       err = PTR_ERR(prev_key);
> > +                       goto err_put;
> > +               }
> > +       } else {
> > +               prev_key = NULL;
> > +               first_key = true;
> > +       }
> > +
> > +       err = -ENOMEM;
> > +       buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN);
> > +       if (!buf)
> > +               goto err_put;
> > +
> > +       key = buf;
> > +       value = key + map->key_size;
> > +       for (cp_len = 0; cp_len + elem_size <= buf_len;) {
> > +               if (signal_pending(current)) {
> > +                       err = -EINTR;
> > +                       break;
> > +               }
> > +
> > +               rcu_read_lock();
> > +               err = map->ops->map_get_next_key(map, prev_key, key);
>
> If prev_key is deleted before map_get_next_key(), we get the first key
> again. This is pretty weird.

Yes, I know. But note that the current scenario happens even for the
old interface (imagine you are walking a map from userspace and you
tried get_next_key the prev_key was removed, you will start again from
the beginning without noticing it).
I tried to sent a patch in the past but I was missing some context:
before NULL was used to get the very first_key the interface relied in
a random (non existent) key to retrieve the first_key in the map, and
I was told what we still have to support that scenario.

>
> > +               rcu_read_unlock();
> > +
> > +               if (err)
> > +                       break;
> > +
> > +               err = bpf_map_copy_value(map, key, value, attr->dump.flags);
> > +
> > +               if (err == -ENOENT)
> > +                       continue;
> > +               if (err)
> > +                       goto free_buf;
> > +
> > +               if (copy_to_user(ubuf + cp_len, buf, elem_size)) {
> > +                       err = -EFAULT;
> > +                       goto free_buf;
> > +               }
> > +
> > +               prev_key = key;
> > +               cp_len += elem_size;
> > +       }
> > +
> > +       if (err == -ENOENT && cp_len)
> > +               err = 0;
> > +       if (!err && (copy_to_user(ubuf_len, &cp_len, sizeof(cp_len)) ||
> > +                   (!first_key && copy_to_user(ukey, key, map->key_size))))
> > +               err = -EFAULT;
> > +free_buf:
> > +       kfree(buf);
> > +err_put:
> > +       fdput(f);
> > +       return err;
> > +}
> > +
> >  #define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
> >
> >  static int map_lookup_and_delete_elem(union bpf_attr *attr)
> > @@ -2910,6 +3024,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
> >         case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> >                 err = map_lookup_and_delete_elem(&attr);
> >                 break;
> > +       case BPF_MAP_DUMP:
> > +               err = map_dump(&attr);
> > +               break;
> >         default:
> >                 err = -EINVAL;
> >                 break;
> > --
> > 2.22.0.657.g960e92d24f-goog
> >

Thanks for reviewing it!!

^ permalink raw reply

* Re: [PATCH RFC net-next] net: use helper skb_ensure_writable in skb_checksum_help and skb_crc32c_csum_help
From: David Miller @ 2019-07-24 22:43 UTC (permalink / raw)
  To: hkallweit1; +Cc: netdev
In-Reply-To: <d5791755-5b1e-6dc6-fa9d-da3bb0353847@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Wed, 24 Jul 2019 23:47:49 +0200

> Instead of open-coding making the checksum writable we can use an
> appropriate helper. skb_ensure_writable is a candidate, however we
> might also use skb_header_unclone. Hints welcome.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

FWIW these conversions look accurate to me.

The only real difference is the perhaps unnecessary pskb_may_pull()
check.

^ permalink raw reply

* Re: [PATCH v2 net-next] r8169: improve rtl_set_rx_mode
From: David Miller @ 2019-07-24 22:41 UTC (permalink / raw)
  To: hkallweit1; +Cc: nic_swsd, netdev
In-Reply-To: <2f0e6d3c-aa44-3334-aab0-f158f46e4aa9@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Wed, 24 Jul 2019 23:34:45 +0200

> This patch improves and simplifies rtl_set_rx_mode a little.
> No functional change intended.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
> v2:
> - change variable declarations to reverse xmas tree

Applied.

^ permalink raw reply

* Re: [net-next v2 0/5][pull request] 1GbE Intel Wired LAN Driver Updates 2019-07-24
From: David Miller @ 2019-07-24 22:35 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190724212613.1580-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 24 Jul 2019 14:26:08 -0700

> This series contains updates to igc and e1000e client drivers only.
> 
> Sasha provides a couple of cleanups to remove code that is not needed
> and reduce structure sizes.  Updated the MAC reset flow to use the
> device reset flow instead of a port reset flow.  Added addition device
> id's that will be supported.
> 
> Kai-Heng Feng provides a workaround for a possible stalled packet issue
> in our ICH devices due to a clock recovery from the PCH being too slow.
> 
> v2: removed the last patch in the series that supposedly fixed a MAC/PHY
>     de-sync potential issue while waiting for additional information from
>     hardware engineers.

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Willem de Bruijn @ 2019-07-24 22:33 UTC (permalink / raw)
  To: Brian Vazquez
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf
In-Reply-To: <CABCgpaWCLJtDx8kHNiQZneqYZkZ3fzRGnipT5__kmwMhu01g=w@mail.gmail.com>

> > > Because maps can be called from userspace and kernel code, this function
> > > can have a scenario where the next_key was found but by the time we
> > > try to retrieve the value the element is not there, in this case the
> > > function continues and tries to get a new next_key value, skipping the
> > > deleted key. If at some point the function find itself trap in a loop,
> > > it will return -EINTR.
> >
> > Good to point this out! I don't think that unbounded continue;
> > statements until an interrupt happens is sufficient. Please bound the
> > number of retries to a low number.
>
> And what would it be a good number? Maybe 3 attempts?

3 sounds good to me.

> And in that case what error should be reported?

One that's unambiguous and somewhat intuitive for the given issue.
Perhaps EBUSY?

> > > The function will try to fit as much as possible in the buf provided and
> > > will return -EINVAL if buf_len is smaller than elem_size.
> > >
> > > QUEUE and STACK maps are not supported.
> > >
> > > Note that map_dump doesn't guarantee that reading the entire table is
> > > consistent since this function is always racing with kernel and user code
> > > but the same behaviour is found when the entire table is walked using
> > > the current interfaces: map_get_next_key + map_lookup_elem.
> >
> > > It is also important to note that with  a locked map, the lock is grabbed
> > > for 1 entry at the time, meaning that the returned buf might or might not
> > > be consistent.
> >
> > Would it be informative to signal to the caller if the read was
> > complete and consistent (because the entire table was read while the
> > lock was held)?
>
> Mmm.. not sure how we could signal that to the caller.  But I don't
> think there's a way to know it was consistent

Okay, that makes for a simple answer :) No need to try to add a signal, then.

^ permalink raw reply

* Re: [PATCH bpf-next 3/7] bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
From: Song Liu @ 2019-07-24 22:33 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, bpf, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, Willem de Bruijn, Petar Penkov
In-Reply-To: <20190724170018.96659-4-sdf@google.com>

On Wed, Jul 24, 2019 at 10:11 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> This will allow us to write tests for those flags.
>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  net/bpf/test_run.c | 39 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
> index 4e41d15a1098..444a7baed791 100644
> --- a/net/bpf/test_run.c
> +++ b/net/bpf/test_run.c
> @@ -377,6 +377,22 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
>         return ret;
>  }
>
> +static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx)
> +{
> +       /* make sure the fields we don't use are zeroed */
> +       if (!range_is_zero(ctx, 0, offsetof(struct bpf_flow_keys, flags)))
> +               return -EINVAL;
> +
> +       /* flags is allowed */
> +
> +       if (!range_is_zero(ctx, offsetof(struct bpf_flow_keys, flags) +
> +                          FIELD_SIZEOF(struct bpf_flow_keys, flags),
> +                          sizeof(struct bpf_flow_keys)))
> +               return -EINVAL;
> +
> +       return 0;
> +}
> +
>  int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>                                      const union bpf_attr *kattr,
>                                      union bpf_attr __user *uattr)
> @@ -384,9 +400,11 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>         u32 size = kattr->test.data_size_in;
>         struct bpf_flow_dissector ctx = {};
>         u32 repeat = kattr->test.repeat;
> +       struct bpf_flow_keys *user_ctx;
>         struct bpf_flow_keys flow_keys;
>         u64 time_start, time_spent = 0;
>         const struct ethhdr *eth;
> +       unsigned int flags = 0;
>         u32 retval, duration;
>         void *data;
>         int ret;
> @@ -395,9 +413,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>         if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
>                 return -EINVAL;
>
> -       if (kattr->test.ctx_in || kattr->test.ctx_out)
> -               return -EINVAL;
> -
>         if (size < ETH_HLEN)
>                 return -EINVAL;
>
> @@ -410,6 +425,18 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>         if (!repeat)
>                 repeat = 1;
>
> +       user_ctx = bpf_ctx_init(kattr, sizeof(struct bpf_flow_keys));
> +       if (IS_ERR(user_ctx)) {
> +               kfree(data);
> +               return PTR_ERR(user_ctx);
> +       }
> +       if (user_ctx) {
> +               ret = verify_user_bpf_flow_keys(user_ctx);
> +               if (ret)
> +                       goto out;
> +               flags = user_ctx->flags;
> +       }
> +
>         ctx.flow_keys = &flow_keys;
>         ctx.data = data;
>         ctx.data_end = (__u8 *)data + size;
> @@ -419,7 +446,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>         time_start = ktime_get_ns();
>         for (i = 0; i < repeat; i++) {
>                 retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
> -                                         size, 0);
> +                                         size, flags);
>
>                 if (signal_pending(current)) {
>                         preempt_enable();
> @@ -450,8 +477,12 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
>
>         ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
>                               retval, duration);
> +       if (!ret)
> +               ret = bpf_ctx_finish(kattr, uattr, user_ctx,
> +                                    sizeof(struct bpf_flow_keys));
>
>  out:
>         kfree(data);
> +       kfree(user_ctx);

nit: it is not really necessary now, but maybe put kfree(user_ctx)
before kfree(data).

>         return ret;
>  }
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply

* Re: [PATCH bpf-next 1/7] bpf/flow_dissector: pass input flags to BPF flow dissector program
From: Stanislav Fomichev @ 2019-07-24 22:32 UTC (permalink / raw)
  To: Song Liu
  Cc: Stanislav Fomichev, Networking, bpf, David S . Miller,
	Alexei Starovoitov, Daniel Borkmann, Willem de Bruijn,
	Petar Penkov
In-Reply-To: <CAPhsuW6wq_6Pf80yV7oEb0uW7Xv9=UKAbTm4XJLyKAtSmDzCBQ@mail.gmail.com>

On 07/24, Song Liu wrote:
> On Wed, Jul 24, 2019 at 10:11 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > C flow dissector supports input flags that tell it to customize parsing
> > by either stopping early or trying to parse as deep as possible. Pass
> > those flags to the BPF flow dissector so it can make the same
> > decisions. In the next commits I'll add support for those flags to
> > our reference bpf_flow.c
> >
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/skbuff.h       | 2 +-
> >  include/net/flow_dissector.h | 4 ----
> >  include/uapi/linux/bpf.h     | 5 +++++
> >  net/bpf/test_run.c           | 2 +-
> >  net/core/flow_dissector.c    | 5 +++--
> >  5 files changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 718742b1c505..9b7a8038beec 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1271,7 +1271,7 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> >
> >  struct bpf_flow_dissector;
> >  bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
> > -                     __be16 proto, int nhoff, int hlen);
> > +                     __be16 proto, int nhoff, int hlen, unsigned int flags);
> >
> >  bool __skb_flow_dissect(const struct net *net,
> >                         const struct sk_buff *skb,
> > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> > index 90bd210be060..3e2642587b76 100644
> > --- a/include/net/flow_dissector.h
> > +++ b/include/net/flow_dissector.h
> > @@ -253,10 +253,6 @@ enum flow_dissector_key_id {
> >         FLOW_DISSECTOR_KEY_MAX,
> >  };
> >
> > -#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG                BIT(0)
> > -#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL    BIT(1)
> > -#define FLOW_DISSECTOR_F_STOP_AT_ENCAP         BIT(2)
> > -
> >  struct flow_dissector_key {
> >         enum flow_dissector_key_id key_id;
> >         size_t offset; /* offset of struct flow_dissector_key_*
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index fa1c753dcdbc..b4ad19bd6aa8 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3507,6 +3507,10 @@ enum bpf_task_fd_type {
> >         BPF_FD_TYPE_URETPROBE,          /* filename + offset */
> >  };
> >
> > +#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG                (1U << 0)
> > +#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL    (1U << 1)
> > +#define FLOW_DISSECTOR_F_STOP_AT_ENCAP         (1U << 2)
> 
> Do we have to move these?
These have to be a part of UAPI one way or another. The easiest thing
we can do is to call the existing flags a UAPI and move them into
exported headers. Alternatively, we can introduce new set of UAPI flags
and do some kind of conversion between exported and internal ones.

Since it's pretty easy to add/deprecate them (see cover letter), I've
decided to just move them instead of adding another set and doing
conversion. I'm open to suggestions if you think otherwise.

^ permalink raw reply

* Re: [PATCH net-next] net/tls: add myself as a co-maintainer
From: David Miller @ 2019-07-24 22:30 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: alexei.starovoitov, daniel, john.fastabend, davejwatson, borisp,
	aviadye, netdev, oss-drivers, simon.horman, ast
In-Reply-To: <20190724180248.6480-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Wed, 24 Jul 2019 11:02:48 -0700

> I've been spending quite a bit of time fixing and
> preventing bit rot in the core TLS code. TLS seems
> to only be growing in importance, I'd like to help
> ensuring the quality of our implementation.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Acked-by: Simon Horman <simon.horman@netronome.com>

I'll aplly this to 'net', thanks.

^ permalink raw reply

* Re: [PATCH -next v2] net/ixgbevf: fix a compilation error of skb_frag_t
From: David Miller @ 2019-07-24 22:29 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: cai, netdev, linux-kernel
In-Reply-To: <8749f22284c5a557fe50a5dcc956c5d2c80037e2.camel@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 24 Jul 2019 15:02:15 -0700

> On Wed, 2019-07-24 at 14:41 -0700, David Miller wrote:
>> From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>> Date: Wed, 24 Jul 2019 09:39:26 -0700
>> 
>> > Dave I will pick this up and add it to my queue.
>> 
>> How soon will you get that to me?  The sooner this build fix is cured
>> the
>> better.
> 
> Go ahead and pick it up, I was able to get it through an initial round
> of testing with no issues.  No need to wait for me to re-send it, I
> will go ahead and ACK Qian's patch.

Ok, done.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next] selftests: mlxsw: Fix typo in qos_mc_aware.sh
From: David Miller @ 2019-07-24 22:27 UTC (permalink / raw)
  To: shuah
  Cc: standby24x7, linux-kernel, jiri, idosch, linux-kselftest, rdunlap,
	netdev
In-Reply-To: <7e69dda0-bca2-4b78-19cb-b66d097503c0@kernel.org>

From: shuah <shuah@kernel.org>
Date: Wed, 24 Jul 2019 16:00:56 -0600

> On 7/24/19 3:51 PM, David Miller wrote:
>> From: Masanari Iida <standby24x7@gmail.com>
>> Date: Thu, 25 Jul 2019 00:29:51 +0900
>> 
>>> This patch fix some spelling typo in qos_mc_aware.sh
>>>
>>> Signed-off-by: Masanari Iida <standby24x7@gmail.com>
>>> Acked-by: Randy Dunlap <rdunlap@infradead.org>
>> Applied.
>> 
> 
> I applied to this my fixes branch this morning on auto-pilot
> without realizing that it is in your domain :)
> 
> Would you like like me to drop it from mine?

It probably doesn't matter that it exists in two trees :)

^ permalink raw reply

* Re: [PATCH] net: phy: mscc: initialize stats array
From: David Miller @ 2019-07-24 22:26 UTC (permalink / raw)
  To: schwab; +Cc: netdev, andrew, f.fainelli, hkallweit1, linux-kernel
In-Reply-To: <mvmh87bih1y.fsf@suse.de>

From: Andreas Schwab <schwab@suse.de>
Date: Wed, 24 Jul 2019 17:32:57 +0200

> The memory allocated for the stats array may contain arbitrary data.
> 
> Signed-off-by: Andreas Schwab <schwab@suse.de>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-24 22:26 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, LKML, Network Development, bpf
In-Reply-To: <CAF=yD-+a=t_YizdJpb_Q+zxR7iP-V-EarNsp9tjnFTRBjOtFvA@mail.gmail.com>

On Wed, Jul 24, 2019 at 12:55 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 1:10 PM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map, wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > To start dumping the map from the beginning you must specify NULL as
> > the prev_key.
> >
> > The new API returns 0 when it successfully copied all the elements
> > requested or it copied less because there weren't more elements to
> > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0.
>
> I think I understand this, but perhaps it can be explained a bit more
> concisely without reference to ENOENT and error masking. The function
> returns the min of the number of requested elements and the number of
> remaining elements in the map from the given starting point
> (prev_key).

That sounds better to me. Thanks!

> > On a successful call buf and buf_len will contain correct data and in
> > case prev_key was provided (not for the first walk, since prev_key is
> > NULL) it will contain the last_key copied into the prev_key which will
> > simplify next call.
> >
> > Only when it can't find a single element it will return -ENOENT meaning
> > that the map has been entirely walked. When an error is return buf,
> > buf_len and prev_key shouldn't be read nor used.
>
> That's common for error handling. No need to state explicitly.

 Ack

>
> > Because maps can be called from userspace and kernel code, this function
> > can have a scenario where the next_key was found but by the time we
> > try to retrieve the value the element is not there, in this case the
> > function continues and tries to get a new next_key value, skipping the
> > deleted key. If at some point the function find itself trap in a loop,
> > it will return -EINTR.
>
> Good to point this out! I don't think that unbounded continue;
> statements until an interrupt happens is sufficient. Please bound the
> number of retries to a low number.

And what would it be a good number? Maybe 3 attempts? And in that case
what error should be reported?
>
> > The function will try to fit as much as possible in the buf provided and
> > will return -EINVAL if buf_len is smaller than elem_size.
> >
> > QUEUE and STACK maps are not supported.
> >
> > Note that map_dump doesn't guarantee that reading the entire table is
> > consistent since this function is always racing with kernel and user code
> > but the same behaviour is found when the entire table is walked using
> > the current interfaces: map_get_next_key + map_lookup_elem.
>
> > It is also important to note that with  a locked map, the lock is grabbed
> > for 1 entry at the time, meaning that the returned buf might or might not
> > be consistent.
>
> Would it be informative to signal to the caller if the read was
> complete and consistent (because the entire table was read while the
> lock was held)?

Mmm.. not sure how we could signal that to the caller.  But I don't
think there's a way to know it was consistent (i.e. one element was
removed in bucket 20 and you are copying the keys in bucket 15, when
you get to bucket 20 there's no way to know that some entries were
removed when you traversed them). The lock is held for just 1 single
entry not the entire table.
Maybe clarify more that in the commit message?

Thanks for reviewing!

^ permalink raw reply

* Re: [PATCH bpf-next 2/7] bpf/flow_dissector: document flags
From: Song Liu @ 2019-07-24 22:24 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, bpf, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, Willem de Bruijn, Petar Penkov
In-Reply-To: <20190724170018.96659-3-sdf@google.com>

On Wed, Jul 24, 2019 at 10:11 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> Describe what each input flag does and who uses it.
>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  Documentation/bpf/prog_flow_dissector.rst | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst
> index ed343abe541e..0f3f380b2ce4 100644
> --- a/Documentation/bpf/prog_flow_dissector.rst
> +++ b/Documentation/bpf/prog_flow_dissector.rst
> @@ -26,6 +26,7 @@ and output arguments.
>    * ``nhoff`` - initial offset of the networking header
>    * ``thoff`` - initial offset of the transport header, initialized to nhoff
>    * ``n_proto`` - L3 protocol type, parsed out of L2 header
> +  * ``flags`` - optional flags
>
>  Flow dissector BPF program should fill out the rest of the ``struct
>  bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be
> @@ -101,6 +102,23 @@ can be called for both cases and would have to be written carefully to
>  handle both cases.
>
>
> +Flags
> +=====
> +
> +``flow_keys->flags`` might contain optional input flags that work as follows:
> +
> +* ``FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells BPF flow dissector to continue
> +  parsing first fragment; the default expected behavior is that flow dissector
> +  returns as soon as it finds out that the packet is fragmented;
> +  used by ``eth_get_headlen`` to estimate length of all headers for GRO.
> +* ``FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells BPF flow dissector to stop
> +  parsing as soon as it reaches IPv6 flow label; used by ``___skb_get_hash``
> +  and ``__skb_get_hash_symmetric`` to get flow hash.
> +* ``FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells BPF flow dissector to stop
> +  parsing as soon as it reaches encapsulated headers; used by routing
> +  infrastructure.
> +
> +
>  Reference Implementation
>  ========================
>
> --
> 2.22.0.657.g960e92d24f-goog
>

^ permalink raw reply

* [PATCH iproute2] iplink: document the 'link change' subcommand
From: Matteo Croce @ 2019-07-24 19:12 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, David Ahern

ip link can set parameters both via the 'set' and 'change' keyword.
In fact, 'change' is an alias for 'set'.
Document this in the help and manpage.

Fixes: 1d93483985f0 ("iplink: use netlink for link configuration")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 ip/iplink.c           | 4 ++--
 man/man8/ip-link.8.in | 7 ++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 212a0885..be237c93 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -64,12 +64,12 @@ void iplink_usage(void)
 			"\n"
 			"	ip link delete { DEVICE | dev DEVICE | group DEVGROUP } type TYPE [ ARGS ]\n"
 			"\n"
-			"	ip link set { DEVICE | dev DEVICE | group DEVGROUP }\n"
+			"	ip link { set | change } { DEVICE | dev DEVICE | group DEVGROUP }\n"
 			"			[ { up | down } ]\n"
 			"			[ type TYPE ARGS ]\n");
 	} else
 		fprintf(stderr,
-			"Usage: ip link set DEVICE [ { up | down } ]\n");
+			"Usage: ip link { set | change } DEVICE [ { up | down } ]\n");
 
 	fprintf(stderr,
 		"		[ arp { on | off } ]\n"
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 883d8807..d4be1a0e 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -1802,7 +1802,8 @@ deleted since it is the default group.
 .BI type " TYPE "
 specifies the type of the device.
 
-.SS ip link set - change device attributes
+.SS ip link set
+.SS ip link change - change device attributes
 
 .PP
 .B Warning:
@@ -1815,6 +1816,10 @@ can move the system to an unpredictable state. The solution
 is to avoid changing several parameters with one
 .B ip link set
 call.
+.B set
+and
+.B change
+are synonyms and have identical syntax and behavior.
 
 .TP
 .BI dev " DEVICE "
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH bpf-next 1/7] bpf/flow_dissector: pass input flags to BPF flow dissector program
From: Song Liu @ 2019-07-24 22:21 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, bpf, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, Willem de Bruijn, Petar Penkov
In-Reply-To: <20190724170018.96659-2-sdf@google.com>

On Wed, Jul 24, 2019 at 10:11 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> C flow dissector supports input flags that tell it to customize parsing
> by either stopping early or trying to parse as deep as possible. Pass
> those flags to the BPF flow dissector so it can make the same
> decisions. In the next commits I'll add support for those flags to
> our reference bpf_flow.c
>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h       | 2 +-
>  include/net/flow_dissector.h | 4 ----
>  include/uapi/linux/bpf.h     | 5 +++++
>  net/bpf/test_run.c           | 2 +-
>  net/core/flow_dissector.c    | 5 +++--
>  5 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 718742b1c505..9b7a8038beec 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1271,7 +1271,7 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
>
>  struct bpf_flow_dissector;
>  bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
> -                     __be16 proto, int nhoff, int hlen);
> +                     __be16 proto, int nhoff, int hlen, unsigned int flags);
>
>  bool __skb_flow_dissect(const struct net *net,
>                         const struct sk_buff *skb,
> diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> index 90bd210be060..3e2642587b76 100644
> --- a/include/net/flow_dissector.h
> +++ b/include/net/flow_dissector.h
> @@ -253,10 +253,6 @@ enum flow_dissector_key_id {
>         FLOW_DISSECTOR_KEY_MAX,
>  };
>
> -#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG                BIT(0)
> -#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL    BIT(1)
> -#define FLOW_DISSECTOR_F_STOP_AT_ENCAP         BIT(2)
> -
>  struct flow_dissector_key {
>         enum flow_dissector_key_id key_id;
>         size_t offset; /* offset of struct flow_dissector_key_*
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fa1c753dcdbc..b4ad19bd6aa8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3507,6 +3507,10 @@ enum bpf_task_fd_type {
>         BPF_FD_TYPE_URETPROBE,          /* filename + offset */
>  };
>
> +#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG                (1U << 0)
> +#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL    (1U << 1)
> +#define FLOW_DISSECTOR_F_STOP_AT_ENCAP         (1U << 2)

Do we have to move these?

Thanks,
Song

^ permalink raw reply

* Re: [PATCH bpf-next 0/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call
From: Brian Vazquez @ 2019-07-24 22:15 UTC (permalink / raw)
  To: Song Liu
  Cc: Brian Vazquez, Alexei Starovoitov, Daniel Borkmann,
	David S . Miller, Stanislav Fomichev, Willem de Bruijn,
	Petar Penkov, open list, Networking, bpf
In-Reply-To: <CAPhsuW7PU1PP91e8vD2diwhBAwGJHWu6wAKOoBThT86f4r5OJw@mail.gmail.com>

On Wed, Jul 24, 2019 at 12:20 PM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Wed, Jul 24, 2019 at 10:09 AM Brian Vazquez <brianvv@google.com> wrote:
> >
> > This introduces a new command to retrieve multiple number of entries
> > from a bpf map.
> >
> > This new command can be executed from the existing BPF syscall as
> > follows:
> >
> > err =  bpf(BPF_MAP_DUMP, union bpf_attr *attr, u32 size)
> > using attr->dump.map_fd, attr->dump.prev_key, attr->dump.buf,
> > attr->dump.buf_len
> > returns zero or negative error, and populates buf and buf_len on
> > succees
> >
> > This implementation is wrapping the existing bpf methods:
> > map_get_next_key and map_lookup_elem
> >
> > Note that this implementation can be extended later to do dump and
> > delete by extending map_lookup_and_delete_elem (currently it only works
> > for bpf queue/stack maps) and either use a new flag in map_dump or a new
> > command map_dump_and_delete.
> >
> > Results show that even with a 1-elem_size buffer, it runs ~40 faster
>
> Why is the new command 40% faster with 1-elem_size buffer?

The test is using a really simple map structure: uint64_t key,val.
Which makes the lookup_elem logic faster since it doesn't spend too
much time copying the value. My conclusion with the 40% was that this
new implementation only needs 1 syscall per elem compare to the 2
syscalls that we needed with previous implementation so in this
particular case the number of ops that we are doing is almost halved.
I did one experiment increasing the value_size (448*64B) and it was
only 14% faster with 1-elem_size buffer.

> > than the current implementation, improvements of ~85% are reported when
> > the buffer size is increased, although, after the buffer size is around
> > 5% of the total number of entries there's no huge difference in
> > increasing it.
> >
> > Tested:
> > Tried different size buffers to handle case where the bulk is bigger, or
> > the elements to retrieve are less than the existing ones, all runs read
> > a map of 100K entries. Below are the results(in ns) from the different
> > runs:
> >
> > buf_len_1:       69038725 entry-by-entry: 112384424 improvement
> > 38.569134
> > buf_len_2:       40897447 entry-by-entry: 111030546 improvement
> > 63.165590
> > buf_len_230:     13652714 entry-by-entry: 111694058 improvement
> > 87.776687
> > buf_len_5000:    13576271 entry-by-entry: 111101169 improvement
> > 87.780263
> > buf_len_73000:   14694343 entry-by-entry: 111740162 improvement
> > 86.849542
> > buf_len_100000:  13745969 entry-by-entry: 114151991 improvement
> > 87.958187
> > buf_len_234567:  14329834 entry-by-entry: 114427589 improvement
> > 87.476941
>
> It took me a while to figure out the meaning of 87.476941. It is probably
> a good idea to say 87.5% instead.

right, will change it in next version.

^ permalink raw reply

* Re: [PATCH bpf-next 0/7] bpf/flow_dissector: support input flags
From: Willem de Bruijn @ 2019-07-24 22:10 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, bpf, David Miller, Alexei Starovoitov,
	Daniel Borkmann, Willem de Bruijn, Petar Penkov
In-Reply-To: <20190724170018.96659-1-sdf@google.com>

On Wed, Jul 24, 2019 at 1:11 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> C flow dissector supports input flags that tell it to customize parsing
> by either stopping early or trying to parse as deep as possible.
> BPF flow dissector always parses as deep as possible which is sub-optimal.
> Pass input flags to the BPF flow dissector as well so it can make the same
> decisions.
>
> Series outline:
> * remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
> * export FLOW_DISSECTOR_F_XXX flags as uapi and pass them to BPF
>   flow dissector
> * add documentation for the export flags
> * support input flags in BPF_PROG_TEST_RUN via ctx_{in,out}
> * sync uapi to tools
> * support FLOW_DISSECTOR_F_PARSE_1ST_FRAG in selftest
> * support FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in kernel and selftest
> * support FLOW_DISSECTOR_F_STOP_AT_ENCAP in selftest
>
> Pros:
> * makes BPF flow dissector faster by avoiding burning extra cycles
> * existing BPF progs continue to work by ignoring the flags and always
>   parsing as deep as possible
>
> Cons:
> * new UAPI which we need to support (OTOH, if we need to deprecate some
>   flags, we can just stop setting them upon calling BPF programs)
>
> Some numbers (with .repeat = 4000000 in test_flow_dissector):
>         test_flow_dissector:PASS:ipv4-frag 35 nsec
>         test_flow_dissector:PASS:ipv4-frag 35 nsec
>         test_flow_dissector:PASS:ipv4-no-frag 32 nsec
>         test_flow_dissector:PASS:ipv4-no-frag 32 nsec
>
>         test_flow_dissector:PASS:ipv6-frag 39 nsec
>         test_flow_dissector:PASS:ipv6-frag 39 nsec
>         test_flow_dissector:PASS:ipv6-no-frag 36 nsec
>         test_flow_dissector:PASS:ipv6-no-frag 36 nsec
>
>         test_flow_dissector:PASS:ipv6-flow-label 36 nsec
>         test_flow_dissector:PASS:ipv6-flow-label 36 nsec
>         test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
>         test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
>
>         test_flow_dissector:PASS:ipip-encap 38 nsec
>         test_flow_dissector:PASS:ipip-encap 38 nsec
>         test_flow_dissector:PASS:ipip-no-encap 32 nsec
>         test_flow_dissector:PASS:ipip-no-encap 32 nsec
>
> The improvement is around 10%, but it's in a tight cache-hot
> BPF_PROG_TEST_RUN loop.
>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Petar Penkov <ppenkov@google.com>

This looks great to me. Thanks, Stan!

Acked-by: Willem de Bruijn <willemb@google.com>

^ permalink raw reply

* Re: [PATCH -next] net/ixgbevf: fix a compilation error of skb_frag_t
From: Jeff Kirsher @ 2019-07-24 22:02 UTC (permalink / raw)
  To: Qian Cai, willy; +Cc: davem, netdev, linux-kernel
In-Reply-To: <1563975157-30691-1-git-send-email-cai@lca.pw>

[-- Attachment #1: Type: text/plain, Size: 1535 bytes --]

On Wed, 2019-07-24 at 09:32 -0400, Qian Cai wrote:
> The linux-next commit "net: Rename skb_frag_t size to bv_len" [1]
> introduced a compilation error on powerpc as it forgot to rename
> "size"
> to "bv_len" for ixgbevf.
> 
> [1] 
> https://lore.kernel.org/netdev/20190723030831.11879-1-willy@infradead.org/T/#md052f1c7de965ccd1bdcb6f92e1990a52298eac5
> 
> In file included from ./include/linux/cache.h:5,
>                  from ./include/linux/printk.h:9,
>                  from ./include/linux/kernel.h:15,
>                  from ./include/linux/list.h:9,
>                  from ./include/linux/module.h:9,
>                  from
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:12:
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c: In function
> 'ixgbevf_xmit_frame_ring':
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4138:51: error:
> 'skb_frag_t' {aka 'struct bio_vec'} has no member named 'size'
>    count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
>                                                    ^
> ./include/uapi/linux/kernel.h:13:40: note: in definition of macro
> '__KERNEL_DIV_ROUND_UP'
>  #define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
>                                         ^
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:4138:12: note: in
> expansion of macro 'TXD_USE_COUNT'
>    count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
> 
> Signed-off-by: Qian Cai <cai@lca.pw>

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox