Netdev List


* [RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri
In-Reply-To: <1523386790-12396-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by the hypervisor to indicate that the
virtio_net device should act as a backup for another device with the same
MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..befb5944f3fd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP	  62	/* Act as backup for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3
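
As an illustration of what a feature bit at position 62 means in practice,
here is a minimal Python sketch that tests the bit in a 64-bit feature mask.
Only the bit numbers come from the patch above; the helper name has_feature
and the example mask are hypothetical and not part of any kernel or QEMU API.

```python
# Bit positions taken from include/uapi/linux/virtio_net.h in the patch above.
VIRTIO_NET_F_CTRL_MAC_ADDR = 23
VIRTIO_NET_F_BACKUP = 62
VIRTIO_NET_F_SPEED_DUPLEX = 63


def has_feature(features: int, bit: int) -> bool:
    """Return True if feature `bit` is set in the 64-bit mask `features`.

    Hypothetical helper for illustration only.
    """
    return bool((features >> bit) & 1)


# Made-up example of a feature mask a hypervisor might offer.
offered = (1 << VIRTIO_NET_F_CTRL_MAC_ADDR) | (1 << VIRTIO_NET_F_BACKUP)

print(has_feature(offered, VIRTIO_NET_F_BACKUP))        # True
print(has_feature(offered, VIRTIO_NET_F_SPEED_DUPLEX))  # False
```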


* [RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device
From: Sridhar Samudrala @ 2018-04-10 18:59 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor-controlled live migration to be supported with VMs that
have direct-attached SR-IOV VF devices.

Patch 1 introduces a new feature bit, VIRTIO_NET_F_BACKUP, that can be
used by the hypervisor to indicate that the virtio_net interface should
act as a backup for another device with the same MAC address.

Patch 2 introduces a bypass module that provides a generic interface for
paravirtual drivers to listen for netdev register/unregister/link change
events from PCI Ethernet devices with the same MAC and take over their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual
drivers to support the 2-netdev (netvsc) and 3-netdev (virtio_net) models.

Patch 3 extends virtio_net to use an alternate datapath when one is
available and registered. When the BACKUP feature is enabled, the virtio_net
driver creates an additional 'bypass' netdev that acts as a master device
and controls 2 slave devices. The original virtio_net netdev is registered
as the 'backup' netdev and a passthru/VF device with the same MAC gets
registered as the 'active' netdev. Both the 'bypass' and 'backup' netdevs are
associated with the same 'pci' device. The user accesses the network
interface via the 'bypass' netdev. The 'bypass' netdev chooses the 'active'
netdev as the default for transmits when it is available with link up and
running.
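
The transmit selection described for patch 3 can be modelled roughly as
follows. This is a toy Python sketch, not the driver code: the Netdev class,
its fields and pick_xmit_dev() are hypothetical names, and the fallback check
on the 'backup' netdev is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Netdev:
    """Toy stand-in for a slave netdev; the fields are illustrative only."""
    name: str
    running: bool = False
    link_up: bool = False

    def usable(self) -> bool:
        # "Available with link up and running", as worded in the cover letter.
        return self.running and self.link_up


def pick_xmit_dev(active: Optional[Netdev],
                  backup: Optional[Netdev]) -> Optional[Netdev]:
    """Prefer the 'active' (VF) netdev, else fall back to 'backup' (virtio_net)."""
    if active is not None and active.usable():
        return active
    # Assumption: the backup is only used when it is itself usable.
    if backup is not None and backup.usable():
        return backup
    return None  # no usable datapath in this model


vf = Netdev("vf0", running=True, link_up=True)
virtio = Netdev("virtio0", running=True, link_up=True)
print(pick_xmit_dev(vf, virtio).name)    # vf0
vf.link_up = False                       # VF link goes down
print(pick_xmit_dev(vf, virtio).name)    # virtio0
print(pick_xmit_dev(None, virtio).name)  # virtio0 (VF unplugged)
```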

Patch 4 refactors netvsc to use the registration/notification framework
supported by bypass module.

As this patch series initially focuses on use cases where the hypervisor
fully controls the VM networking and the guest is not expected to directly
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional use
cases, it should be possible to enable additional ops later by caching the
state in the virtio netdev and replaying it when the 'active' netdev gets
registered.

The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of the datapath to virtio before starting the migration.
After the migration is completed, the destination hypervisor sets the MAC
filter on the VF and plugs it back into the guest to switch over to the VF
datapath.
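
The migration sequence above can be replayed in a small Python model to check
that only one datapath carries traffic at each step. This is a sketch under
simplifying assumptions: GuestNet, its two flags and migrate() are
hypothetical names, and the real hypervisor and driver logic is more involved.

```python
from dataclasses import dataclass


@dataclass
class GuestNet:
    """Toy model of the two datapaths seen by one guest (illustrative only)."""
    vf_plugged: bool
    vf_mac_filter_set: bool

    def datapath(self) -> str:
        # In this model the VF carries traffic only when it is plugged into
        # the guest and its MAC filter is programmed; otherwise traffic
        # fails over to virtio.
        return "vf" if self.vf_plugged and self.vf_mac_filter_set else "virtio"


def migrate(src: GuestNet) -> list:
    """Replay the described steps and record the datapath after each one."""
    trace = [("start", src.datapath())]
    # Source host: unplug the VF and reset its MAC filter before migrating.
    src.vf_plugged = False
    src.vf_mac_filter_set = False
    trace.append(("source: VF unplugged", src.datapath()))
    # The guest arrives on the destination with virtio only.
    dst = GuestNet(vf_plugged=False, vf_mac_filter_set=False)
    trace.append(("migrated", dst.datapath()))
    # Destination host: set the MAC filter, then plug the VF back in.
    dst.vf_mac_filter_set = True
    dst.vf_plugged = True
    trace.append(("destination: VF plugged", dst.datapath()))
    return trace


for step, path in migrate(GuestNet(vf_plugged=True, vf_mac_filter_set=True)):
    print(f"{step}: {path}")
```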

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v6 RFC:
- Simplified virtio_net changes by moving all the ndo_ops of the
  bypass_netdev and create/destroy of bypass_netdev to the 'bypass' module.
- Avoided 2-phase registration (driver + instances).
- Introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags.
- Replaced mutex with a spinlock.

v5 RFC:
- Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
- Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- Made a small change to the virtio-net xmit path to only use the VF datapath
  for unicasts. Broadcasts/multicasts use the virtio datapath. This avoids
  east-west broadcasts going over the PCI link.
- Added support for the feature bit in QEMU.

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  net: Introduce generic bypass module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the bypass
    framework

 drivers/net/Kconfig             |   1 +
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/netvsc_drv.c | 219 ++++----------
 drivers/net/virtio_net.c        | 614 +++++++++++++++++++++++++++++++++++++++-
 include/net/bypass.h            |  80 ++++++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig                     |  18 ++
 net/core/Makefile               |   1 +
 net/core/bypass.c               | 406 ++++++++++++++++++++++++++
 9 files changed, 1184 insertions(+), 159 deletions(-)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

-- 
2.14.3


* Re: [RFC bpf-next v2 1/8] bpf: add script and prepare bpf.h for new helpers documentation
From: Alexei Starovoitov @ 2018-04-10 18:16 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-2-quentin.monnet@netronome.com>

On Tue, Apr 10, 2018 at 03:41:50PM +0100, Quentin Monnet wrote:
> Remove previous "overview" of eBPF helpers from user bpf.h header.
> Replace it by a comment explaining how to process the new documentation
> (to come in following patches) with a Python script to produce RST, then
> man page documentation.
> 
> Also add the aforementioned Python script under scripts/. It is used to
> process include/uapi/linux/bpf.h and to extract helper descriptions, to
> turn it into a RST document that can further be processed with rst2man
> to produce a man page. The script takes one "--filename <path/to/file>"
> option. If the script is launched from scripts/ in the kernel root
> directory, it should be able to find the location of the header to
> parse, and "--filename <path/to/file>" is then optional. If it cannot
> find the file, then the option becomes mandatory. RST-formatted
> documentation is printed to standard output.
> 
> Typical workflow for producing the final man page would be:
> 
>     $ ./scripts/bpf_helpers_doc.py \
>             --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>     $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>     $ man /tmp/bpf-helpers.7
> 
> Note that the tool kernel-doc cannot be used to document eBPF helpers,
> whose signatures are not available directly in the header files
> (pre-processor directives are used to produce them at the beginning of
> the compilation process).
> 
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h   | 406 ++------------------------------------------
>  scripts/bpf_helpers_doc.py | 414 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 430 insertions(+), 390 deletions(-)
>  create mode 100755 scripts/bpf_helpers_doc.py
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c5ec89732a8d..45f77f01e672 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -365,396 +365,22 @@ union bpf_attr {
>  	} raw_tracepoint;
>  } __attribute__((aligned(8)));
>  
> -/* BPF helper function descriptions:
> - *
> - * void *bpf_map_lookup_elem(&map, &key)
> - *     Return: Map value or NULL
> - *
> - * int bpf_map_update_elem(&map, &key, &value, flags)
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_map_delete_elem(&map, &key)
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_probe_read(void *dst, int size, void *src)
> - *     Return: 0 on success or negative error
> - *
> - * u64 bpf_ktime_get_ns(void)
> - *     Return: current ktime
> - *
> - * int bpf_trace_printk(const char *fmt, int fmt_size, ...)
> - *     Return: length of buffer written or negative error
> - *
> - * u32 bpf_prandom_u32(void)
> - *     Return: random value
> - *
> - * u32 bpf_raw_smp_processor_id(void)
> - *     Return: SMP processor ID
> - *
> - * int bpf_skb_store_bytes(skb, offset, from, len, flags)
> - *     store bytes into packet
> - *     @skb: pointer to skb
> - *     @offset: offset within packet from skb->mac_header
> - *     @from: pointer where to copy bytes from
> - *     @len: number of bytes to store into packet
> - *     @flags: bit 0 - if true, recompute skb->csum
> - *             other bits - reserved
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_l3_csum_replace(skb, offset, from, to, flags)
> - *     recompute IP checksum
> - *     @skb: pointer to skb
> - *     @offset: offset within packet where IP checksum is located
> - *     @from: old value of header field
> - *     @to: new value of header field
> - *     @flags: bits 0-3 - size of header field
> - *             other bits - reserved
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_l4_csum_replace(skb, offset, from, to, flags)
> - *     recompute TCP/UDP checksum
> - *     @skb: pointer to skb
> - *     @offset: offset within packet where TCP/UDP checksum is located
> - *     @from: old value of header field
> - *     @to: new value of header field
> - *     @flags: bits 0-3 - size of header field
> - *             bit 4 - is pseudo header
> - *             other bits - reserved
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_tail_call(ctx, prog_array_map, index)
> - *     jump into another BPF program
> - *     @ctx: context pointer passed to next program
> - *     @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
> - *     @index: 32-bit index inside array that selects specific program to run
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_clone_redirect(skb, ifindex, flags)
> - *     redirect to another netdev
> - *     @skb: pointer to skb
> - *     @ifindex: ifindex of the net device
> - *     @flags: bit 0 - if set, redirect to ingress instead of egress
> - *             other bits - reserved
> - *     Return: 0 on success or negative error
> - *
> - * u64 bpf_get_current_pid_tgid(void)
> - *     Return: current->tgid << 32 | current->pid
> - *
> - * u64 bpf_get_current_uid_gid(void)
> - *     Return: current_gid << 32 | current_uid
> - *
> - * int bpf_get_current_comm(char *buf, int size_of_buf)
> - *     stores current->comm into buf
> - *     Return: 0 on success or negative error
> - *
> - * u32 bpf_get_cgroup_classid(skb)
> - *     retrieve a proc's classid
> - *     @skb: pointer to skb
> - *     Return: classid if != 0
> - *
> - * int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_skb_vlan_pop(skb)
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_skb_get_tunnel_key(skb, key, size, flags)
> - * int bpf_skb_set_tunnel_key(skb, key, size, flags)
> - *     retrieve or populate tunnel metadata
> - *     @skb: pointer to skb
> - *     @key: pointer to 'struct bpf_tunnel_key'
> - *     @size: size of 'struct bpf_tunnel_key'
> - *     @flags: room for future extensions
> - *     Return: 0 on success or negative error
> - *
> - * u64 bpf_perf_event_read(map, flags)
> - *     read perf event counter value
> - *     @map: pointer to perf_event_array map
> - *     @flags: index of event in the map or bitmask flags
> - *     Return: value of perf event counter read or error code
> - *
> - * int bpf_redirect(ifindex, flags)
> - *     redirect to another netdev
> - *     @ifindex: ifindex of the net device
> - *     @flags:
> - *	  cls_bpf:
> - *          bit 0 - if set, redirect to ingress instead of egress
> - *          other bits - reserved
> - *	  xdp_bpf:
> - *	    all bits - reserved
> - *     Return: cls_bpf: TC_ACT_REDIRECT on success or TC_ACT_SHOT on error
> - *	       xdp_bfp: XDP_REDIRECT on success or XDP_ABORT on error
> - * int bpf_redirect_map(map, key, flags)
> - *     redirect to endpoint in map
> - *     @map: pointer to dev map
> - *     @key: index in map to lookup
> - *     @flags: --
> - *     Return: XDP_REDIRECT on success or XDP_ABORT on error
> - *
> - * u32 bpf_get_route_realm(skb)
> - *     retrieve a dst's tclassid
> - *     @skb: pointer to skb
> - *     Return: realm if != 0
> - *
> - * int bpf_perf_event_output(ctx, map, flags, data, size)
> - *     output perf raw sample
> - *     @ctx: struct pt_regs*
> - *     @map: pointer to perf_event_array map
> - *     @flags: index of event in the map or bitmask flags
> - *     @data: data on stack to be output as raw data
> - *     @size: size of data
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_get_stackid(ctx, map, flags)
> - *     walk user or kernel stack and return id
> - *     @ctx: struct pt_regs*
> - *     @map: pointer to stack_trace map
> - *     @flags: bits 0-7 - numer of stack frames to skip
> - *             bit 8 - collect user stack instead of kernel
> - *             bit 9 - compare stacks by hash only
> - *             bit 10 - if two different stacks hash into the same stackid
> - *                      discard old
> - *             other bits - reserved
> - *     Return: >= 0 stackid on success or negative error
> - *
> - * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
> - *     calculate csum diff
> - *     @from: raw from buffer
> - *     @from_size: length of from buffer
> - *     @to: raw to buffer
> - *     @to_size: length of to buffer
> - *     @seed: optional seed
> - *     Return: csum result or negative error code
> - *
> - * int bpf_skb_get_tunnel_opt(skb, opt, size)
> - *     retrieve tunnel options metadata
> - *     @skb: pointer to skb
> - *     @opt: pointer to raw tunnel option data
> - *     @size: size of @opt
> - *     Return: option size
> - *
> - * int bpf_skb_set_tunnel_opt(skb, opt, size)
> - *     populate tunnel options metadata
> - *     @skb: pointer to skb
> - *     @opt: pointer to raw tunnel option data
> - *     @size: size of @opt
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_skb_change_proto(skb, proto, flags)
> - *     Change protocol of the skb. Currently supported is v4 -> v6,
> - *     v6 -> v4 transitions. The helper will also resize the skb. eBPF
> - *     program is expected to fill the new headers via skb_store_bytes
> - *     and lX_csum_replace.
> - *     @skb: pointer to skb
> - *     @proto: new skb->protocol type
> - *     @flags: reserved
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_skb_change_type(skb, type)
> - *     Change packet type of skb.
> - *     @skb: pointer to skb
> - *     @type: new skb->pkt_type type
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_skb_under_cgroup(skb, map, index)
> - *     Check cgroup2 membership of skb
> - *     @skb: pointer to skb
> - *     @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
> - *     @index: index of the cgroup in the bpf_map
> - *     Return:
> - *       == 0 skb failed the cgroup2 descendant test
> - *       == 1 skb succeeded the cgroup2 descendant test
> - *        < 0 error
> - *
> - * u32 bpf_get_hash_recalc(skb)
> - *     Retrieve and possibly recalculate skb->hash.
> - *     @skb: pointer to skb
> - *     Return: hash
> - *
> - * u64 bpf_get_current_task(void)
> - *     Returns current task_struct
> - *     Return: current
> - *
> - * int bpf_probe_write_user(void *dst, void *src, int len)
> - *     safely attempt to write to a location
> - *     @dst: destination address in userspace
> - *     @src: source address on stack
> - *     @len: number of bytes to copy
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_current_task_under_cgroup(map, index)
> - *     Check cgroup2 membership of current task
> - *     @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
> - *     @index: index of the cgroup in the bpf_map
> - *     Return:
> - *       == 0 current failed the cgroup2 descendant test
> - *       == 1 current succeeded the cgroup2 descendant test
> - *        < 0 error
> - *
> - * int bpf_skb_change_tail(skb, len, flags)
> - *     The helper will resize the skb to the given new size, to be used f.e.
> - *     with control messages.
> - *     @skb: pointer to skb
> - *     @len: new skb length
> - *     @flags: reserved
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_skb_pull_data(skb, len)
> - *     The helper will pull in non-linear data in case the skb is non-linear
> - *     and not all of len are part of the linear section. Only needed for
> - *     read/write with direct packet access.
> - *     @skb: pointer to skb
> - *     @len: len to make read/writeable
> - *     Return: 0 on success or negative error
> - *
> - * s64 bpf_csum_update(skb, csum)
> - *     Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
> - *     @skb: pointer to skb
> - *     @csum: csum to add
> - *     Return: csum on success or negative error
> - *
> - * void bpf_set_hash_invalid(skb)
> - *     Invalidate current skb->hash.
> - *     @skb: pointer to skb
> - *
> - * int bpf_get_numa_node_id()
> - *     Return: Id of current NUMA node.
> - *
> - * int bpf_skb_change_head()
> - *     Grows headroom of skb and adjusts MAC header offset accordingly.
> - *     Will extends/reallocae as required automatically.
> - *     May change skb data pointer and will thus invalidate any check
> - *     performed for direct packet access.
> - *     @skb: pointer to skb
> - *     @len: length of header to be pushed in front
> - *     @flags: Flags (unused for now)
> - *     Return: 0 on success or negative error
> - *
> - * int bpf_xdp_adjust_head(xdp_md, delta)
> - *     Adjust the xdp_md.data by delta
> - *     @xdp_md: pointer to xdp_md
> - *     @delta: An positive/negative integer to be added to xdp_md.data
> - *     Return: 0 on success or negative on error
> - *
> - * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
> - *     Copy a NUL terminated string from unsafe address. In case the string
> - *     length is smaller than size, the target is not padded with further NUL
> - *     bytes. In case the string length is larger than size, just count-1
> - *     bytes are copied and the last byte is set to NUL.
> - *     @dst: destination address
> - *     @size: maximum number of bytes to copy, including the trailing NUL
> - *     @unsafe_ptr: unsafe address
> - *     Return:
> - *       > 0 length of the string including the trailing NUL on success
> - *       < 0 error
> - *
> - * u64 bpf_get_socket_cookie(skb)
> - *     Get the cookie for the socket stored inside sk_buff.
> - *     @skb: pointer to skb
> - *     Return: 8 Bytes non-decreasing number on success or 0 if the socket
> - *     field is missing inside sk_buff
> - *
> - * u32 bpf_get_socket_uid(skb)
> - *     Get the owner uid of the socket stored inside sk_buff.
> - *     @skb: pointer to skb
> - *     Return: uid of the socket owner on success or overflowuid if failed.
> - *
> - * u32 bpf_set_hash(skb, hash)
> - *     Set full skb->hash.
> - *     @skb: pointer to skb
> - *     @hash: hash to set
> - *
> - * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
> - *     Calls setsockopt. Not all opts are available, only those with
> - *     integer optvals plus TCP_CONGESTION.
> - *     Supported levels: SOL_SOCKET and IPPROTO_TCP
> - *     @bpf_socket: pointer to bpf_socket
> - *     @level: SOL_SOCKET or IPPROTO_TCP
> - *     @optname: option name
> - *     @optval: pointer to option value
> - *     @optlen: length of optval in bytes
> - *     Return: 0 or negative error
> - *
> - * int bpf_getsockopt(bpf_socket, level, optname, optval, optlen)
> - *     Calls getsockopt. Not all opts are available.
> - *     Supported levels: IPPROTO_TCP
> - *     @bpf_socket: pointer to bpf_socket
> - *     @level: IPPROTO_TCP
> - *     @optname: option name
> - *     @optval: pointer to option value
> - *     @optlen: length of optval in bytes
> - *     Return: 0 or negative error
> - *
> - * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
> - *     Set callback flags for sock_ops
> - *     @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
> - *     @flags: flags value
> - *     Return: 0 for no error
> - *             -EINVAL if there is no full tcp socket
> - *             bits in flags that are not supported by current kernel
> - *
> - * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
> - *     Grow or shrink room in sk_buff.
> - *     @skb: pointer to skb
> - *     @len_diff: (signed) amount of room to grow/shrink
> - *     @mode: operation mode (enum bpf_adj_room_mode)
> - *     @flags: reserved for future use
> - *     Return: 0 on success or negative error code
> - *
> - * int bpf_sk_redirect_map(map, key, flags)
> - *     Redirect skb to a sock in map using key as a lookup key for the
> - *     sock in map.
> - *     @map: pointer to sockmap
> - *     @key: key to lookup sock in map
> - *     @flags: reserved for future use
> - *     Return: SK_PASS
> - *
> - * int bpf_sock_map_update(skops, map, key, flags)
> - *	@skops: pointer to bpf_sock_ops
> - *	@map: pointer to sockmap to update
> - *	@key: key to insert/update sock in map
> - *	@flags: same flags as map update elem
> - *
> - * int bpf_xdp_adjust_meta(xdp_md, delta)
> - *     Adjust the xdp_md.data_meta by delta
> - *     @xdp_md: pointer to xdp_md
> - *     @delta: An positive/negative integer to be added to xdp_md.data_meta
> - *     Return: 0 on success or negative on error
> - *
> - * int bpf_perf_event_read_value(map, flags, buf, buf_size)
> - *     read perf event counter value and perf event enabled/running time
> - *     @map: pointer to perf_event_array map
> - *     @flags: index of event in the map or bitmask flags
> - *     @buf: buf to fill
> - *     @buf_size: size of the buf
> - *     Return: 0 on success or negative error code
> - *
> - * int bpf_perf_prog_read_value(ctx, buf, buf_size)
> - *     read perf prog attached perf event counter and enabled/running time
> - *     @ctx: pointer to ctx
> - *     @buf: buf to fill
> - *     @buf_size: size of the buf
> - *     Return : 0 on success or negative error code
> - *
> - * int bpf_override_return(pt_regs, rc)
> - *	@pt_regs: pointer to struct pt_regs
> - *	@rc: the return value to set
> - *
> - * int bpf_msg_redirect_map(map, key, flags)
> - *     Redirect msg to a sock in map using key as a lookup key for the
> - *     sock in map.
> - *     @map: pointer to sockmap
> - *     @key: key to lookup sock in map
> - *     @flags: reserved for future use
> - *     Return: SK_PASS
> - *
> - * int bpf_bind(ctx, addr, addr_len)
> - *     Bind socket to address. Only binding to IP is supported, no port can be
> - *     set in addr.
> - *     @ctx: pointer to context of type bpf_sock_addr
> - *     @addr: pointer to struct sockaddr to bind socket to
> - *     @addr_len: length of sockaddr structure
> - *     Return: 0 on success or negative error code
> +/* The description below is an attempt at providing documentation to eBPF
> + * developers about the multiple available eBPF helper functions. It can be
> + * parsed and used to produce a manual page. The workflow is the following,
> + * and requires the rst2man utility:
> + *
> + *     $ ./scripts/bpf_helpers_doc.py \
> + *             --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
> + *     $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
> + *     $ man /tmp/bpf-helpers.7
> + *
> + * Note that in order to produce this external documentation, some RST
> + * formatting is used in the descriptions to get "bold" and "italics" in
> + * manual pages. Also note that the few trailing white spaces are
> + * intentional, removing them would break paragraphs for rst2man.
> + *
> + * Start of BPF helper function descriptions:
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
> new file mode 100755
> index 000000000000..3a15ba3f0a83
> --- /dev/null
> +++ b/scripts/bpf_helpers_doc.py
> @@ -0,0 +1,414 @@
> +#!/usr/bin/python3
> +#
> +# Copyright (C) 2018 Netronome Systems, Inc.
> +#
> +# This software is licensed under the GNU General License Version 2,
> +# June 1991 as shown in the file COPYING in the top-level directory of this
> +# source tree.

Please use an SPDX license identifier instead.
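
For instance, the top of the script could look roughly like the sketch below.
The exact identifier is an assumption (GPL version 2 only, going by the
license text quoted above); the copyright line is copied from the patch.

```python
#!/usr/bin/python3
# SPDX-License-Identifier: GPL-2.0-only
#
# Copyright (C) 2018 Netronome Systems, Inc.
```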

> +#
> +# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
> +# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
> +# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> +# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
> +# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
> +# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
> +
> +# In case user attempts to run with Python 2.
> +from __future__ import print_function
> +
> +import argparse
> +import re
> +import sys, os
> +
> +class NoHelperFound(BaseException):
> +    pass
> +
> +class ParsingError(BaseException):
> +    def __init__(self, line='<line not provided>', reader=None):
> +        if reader:
> +            BaseException.__init__(self,
> +                                   'Error at file offset %d, parsing line: %s' %
> +                                   (reader.tell(), line))
> +        else:
> +            BaseException.__init__(self, 'Error parsing line: %s' % line)
> +
> +class Helper(object):
> +    """
> +    An object representing the description of an eBPF helper function.
> +    @proto: function prototype of the helper function
> +    @desc: textual description of the helper function
> +    @ret: description of the return value of the helper function
> +    """
> +    def __init__(self, proto='', desc='', ret=''):
> +        self.proto = proto
> +        self.desc = desc
> +        self.ret = ret
> +
> +    def proto_break_down(self):
> +        """
> +        Break down helper function protocol into smaller chunks: return type,
> +        name, distincts arguments.
> +        """
> +        arg_re = re.compile('^((const )?(struct )?(\w+|...))( (\**)(\w+))?$')
> +        res = {}
> +        proto_re = re.compile('^(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$')
> +
> +        capture = proto_re.match(self.proto)
> +        res['ret_type'] = capture.group(1)
> +        res['ret_star'] = capture.group(2)
> +        res['name']     = capture.group(3)
> +        res['args'] = []
> +
> +        args    = capture.group(4).split(', ')
> +        for a in args:
> +            capture = arg_re.match(a)
> +            res['args'].append({
> +                'type' : capture.group(1),
> +                'star' : capture.group(6),
> +                'name' : capture.group(7)
> +            })
> +
> +        return res
> +
> +class HeaderParser(object):
> +    """
> +    An object used to parse a file in order to extract the documentation of a
> +    list of eBPF helper functions. All the helpers that can be retrieved are
> +    stored as Helper object, in the self.helpers() array.
> +    @filename: name of file to parse, usually include/uapi/linux/bpf.h in the
> +               kernel tree
> +    """
> +    def __init__(self, filename):
> +        self.reader = open(filename, 'r')
> +        self.line = ''
> +        self.helpers = []
> +
> +    def parse_helper(self):
> +        proto    = self.parse_proto()
> +        desc     = self.parse_desc()
> +        ret      = self.parse_ret()
> +        return Helper(proto=proto, desc=desc, ret=ret)
> +
> +    def parse_proto(self):
> +        # Argument can be of shape:
> +        #   - "void"
> +        #   - "type  name"
> +        #   - "type *name"
> +        #   - Same as above, with "const" and/or "struct" in front of type
> +        #   - "..." (undefined number of arguments, for bpf_trace_printk())
> +        # There is at least one term ("void"), and at most five arguments.
> +        p = re.compile('^ \* ((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
> +        capture = p.match(self.line)
> +        if not capture:
> +            raise NoHelperFound
> +        self.line = self.reader.readline()
> +        return capture.group(1)
> +
> +    def parse_desc(self):
> +        p = re.compile('^ \* \tDescription$')
> +        capture = p.match(self.line)
> +        if not capture:
> +            # Helper can have empty description and we might be parsing another
> +            # attribute: return but do not consume.
> +            return ''
> +        # Description can be several lines, some of them possibly empty, and it
> +        # stops when another subsection title is met.
> +        desc = ''
> +        while True:
> +            self.line = self.reader.readline()
> +            if self.line == ' *\n':
> +                desc += '\n'
> +            else:
> +                p = re.compile('^ \* \t\t(.*)')
> +                capture = p.match(self.line)
> +                if capture:
> +                    desc += capture.group(1) + '\n'
> +                else:
> +                    break
> +        return desc
> +
> +    def parse_ret(self):
> +        p = re.compile('^ \* \tReturn$')
> +        capture = p.match(self.line)
> +        if not capture:
> +            # Helper can have empty retval and we might be parsing another
> +            # attribute: return but do not consume.
> +            return ''
> +        # Return value description can be several lines, some of them possibly
> +        # empty, and it stops when another subsection title is met.
> +        ret = ''
> +        while True:
> +            self.line = self.reader.readline()
> +            if self.line == ' *\n':
> +                ret += '\n'
> +            else:
> +                p = re.compile('^ \* \t\t(.*)')
> +                capture = p.match(self.line)
> +                if capture:
> +                    ret += capture.group(1) + '\n'
> +                else:
> +                    break
> +        return ret
> +
> +    def run(self):
> +        # Advance to start of helper function descriptions.
> +        offset = self.reader.read().find('* Start of BPF helper function descriptions:')
> +        if offset == -1:
> +            raise Exception('Could not find start of eBPF helper descriptions list')
> +        self.reader.seek(offset)
> +        self.reader.readline()
> +        self.reader.readline()
> +        self.line = self.reader.readline()
> +
> +        while True:
> +            try:
> +                helper = self.parse_helper()
> +                self.helpers.append(helper)
> +            except NoHelperFound:
> +                break
> +
> +        self.reader.close()
> +        print('Parsed description of %d helper function(s)' % len(self.helpers),
> +              file=sys.stderr)
> +
> +###############################################################################
> +
> +class Printer(object):
> +    """
> +    A generic class for printers. Printers should be created with an array of
> +    Helper objects, and implement a way to print them in the desired fashion.
> +    @helpers: array of Helper objects to print to standard output
> +    """
> +    def __init__(self, helpers):
> +        self.helpers = helpers
> +
> +    def print_header(self):
> +        pass
> +
> +    def print_footer(self):
> +        pass
> +
> +    def print_one(self, helper):
> +        pass
> +
> +    def print_all(self):
> +        self.print_header()
> +        for helper in self.helpers:
> +            self.print_one(helper)
> +        self.print_footer()
> +
> +class PrinterRST(Printer):
> +    """
> +    A printer for dumping collected information about helpers as a ReStructured
> +    Text page compatible with the rst2man program, which can be used to
> +    generate a manual page for the helpers.
> +    @helpers: array of Helper objects to print to standard output
> +    """
> +    def print_header(self):
> +        header = '''\
> +.. Copyright (C) 2018 Netronome Systems, Inc.

I think it would be good to capture the copyrights of all authors who added
the helpers being documented. Since a lot of the text was copied from commit
logs, it's only fair to preserve the copyrights.
The man page file is automatically generated by the Python script,
and the script itself is copyrighted by Netronome. That's fine, but the text
of the man page is not Netronome's only.
I'm not sure what the solution would be. Maybe something like:
"
Copyright (C) All BPF authors and contributors from 2011 to present
See git log include/uapi/linux/bpf.h for details
"
?

> +.. 
> +.. %%%LICENSE_START(VERBATIM)
> +.. Permission is granted to make and distribute verbatim copies of this
> +.. manual provided the copyright notice and this permission notice are
> +.. preserved on all copies.
> +.. 
> +.. Permission is granted to copy and distribute modified versions of this
> +.. manual under the conditions for verbatim copying, provided that the
> +.. entire resulting derived work is distributed under the terms of a
> +.. permission notice identical to this one.
> +.. 
> +.. Since the Linux kernel and libraries are constantly changing, this
> +.. manual page may be incorrect or out-of-date.  The author(s) assume no
> +.. responsibility for errors or omissions, or for damages resulting from
> +.. the use of the information contained herein.  The author(s) may not
> +.. have taken the same level of care in the production of this manual,
> +.. which is licensed free of charge, as they might when working
> +.. professionally.
> +.. 
> +.. Formatted or processed versions of this manual, if unaccompanied by
> +.. the source, must acknowledge the copyright and authors of this work.
> +.. %%%LICENSE_END
> +.. 
> +.. Please do not edit this file. It was generated from the documentation
> +.. located in file include/uapi/linux/bpf.h of the Linux kernel sources
> +.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
> +.. repository (header and footer).
> +
> +===========
> +BPF-HELPERS
> +===========
> +-------------------------------------------------------------------------------
> +list of eBPF helper functions
> +-------------------------------------------------------------------------------
> +
> +:Manual section: 7
> +
> +DESCRIPTION
> +===========
> +
> +The extended Berkeley Packet Filter (eBPF) subsystem consists in programs
> +written in a pseudo-assembly language, then attached to one of the several
> +kernel hooks and run in reaction of specific events. This framework differs
> +from the older, "classic" BPF (or "cBPF") in several aspects, one of them being
> +the ability to call special functions (or "helpers") from within a program. For
> +security reasons, these functions are restricted to a white-list of helpers
> +defined in the kernel.

'for security reasons' sounds a bit odd. Maybe 'for safety reasons'?
Or drop that part.

> +
> +These helpers are used by eBPF programs to interact with the system, or with
> +the context in which they work. For instance, they can be used to print
> +debugging messages, to get the time since the system was booted, to interact
> +with eBPF maps, or to manipulate network packets metadata. Since there are

s/packets metadata/packets/

> +several eBPF program types, and that they do not run in the same context, each
> +program type can only call a subset of those helpers.
> +
> +Due to eBPF conventions, a helper can not have more than five arguments.
> +
> +This document is an attempt to list and document the helpers available to eBPF
> +developers. They are sorted by chronological order (the oldest helpers in the
> +kernel at the top).
> +
> +HELPERS
> +=======
> +'''
> +        print(header)
> +
> +    def print_footer(self):
> +        footer = '''
> +NOTES
> +=====
> +
> +On the performance side, eBPF programs move to the stack all arguments to pass
> +to the helpers, and call directly into the compiled helper functions without

"move to the stack all arguments" ?! I'm not sure what you're trying to say.
The arguments stay in registers for the call.

> +requiring any foreign-function interface. As a result, calling helpers
> +introduce very little overhead.

Not true. It's zero overhead. Literally. Very little is not the same as zero.

> +
> +EXAMPLES
> +========
> +
> +Example usage for most of the eBPF helpers listed in this manual page are
> +available within the Linux kernel sources, at the following locations:
> +
> +* *samples/bpf/*
> +* *tools/testing/selftests/bpf/*
> +
> +IMPLEMENTATION
> +==============
> +
> +This manual page is an effort to document the existing eBPF helper functions.
> +But as of this writing, the BPF sub-system is under heavy development. New eBPF
> +program or map types are added, along with new helper functions. Some helpers
> +are occasionally made available for additional program types. So in spite of
> +the efforts of the community, this page might not be up-to-date. If you want to
> +check by yourself what helper functions exist in your kernel, or what types of
> +programs they can support, here are some files among the kernel tree that you
> +may be interested in:
> +
> +* *include/uapi/linux/bpf.h* contains the full list of all helper functions.
> +* *net/core/filter.c* contains the definition of most network-related helper
> +  functions, and the list of program types from which they can be used.
> +* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related
> +  helpers.
> +* *kernel/bpf/verifier.c* contains the functions used to check that valid types
> +  of eBPF maps are used with a given helper function.
> +* *kernel/bpf/* directory contains other files in which additional helpers are
> +  defined (for cgroups, sockmaps, etc.).
> +
> +Compatibility between helper functions and program types can generally be found
> +in the files where helper functions are defined. Look for the **struct
> +bpf_func_proto** objects and for functions returning them: these functions
> +contain a list of helpers that a given program type can call. Note that the
> +**default:** label of the **switch ... case** used to filter helpers can call
> +other functions, themselves allowing access to additional helpers. The
> +requirement for GPL license is also in those **struct bpf_func_proto**.

I think it would be good to add here that most networking helpers are non-GPL
because they operate on packets, which are abstract bytes on the wire,
whereas most tracing helpers are GPL, since they inspect the guts of
the Linux kernel, which is GPL itself.
That's the main reason why adding an extra 'gpl=yes/no' to each helper
description is redundant.

^ permalink raw reply

* [PATCH v2 iproute2-next 1/1] tc: jsonify skbedit action
From: Roman Mashak @ 2018-04-10 18:04 UTC (permalink / raw)
  To: dsahern; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak

v2:
   Fixed string formats in print_string()

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 tc/m_skbedit.c | 53 +++++++++++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/tc/m_skbedit.c b/tc/m_skbedit.c
index db5c64caf2ba..7391fc7f158c 100644
--- a/tc/m_skbedit.c
+++ b/tc/m_skbedit.c
@@ -168,9 +168,8 @@ static int print_skbedit(struct action_util *au, FILE *f, struct rtattr *arg)
 	struct rtattr *tb[TCA_SKBEDIT_MAX + 1];
 
 	SPRINT_BUF(b1);
-	__u32 *priority;
-	__u32 *mark;
-	__u16 *queue_mapping, *ptype;
+	__u32 priority;
+	__u16 ptype;
 	struct tc_skbedit *p = NULL;
 
 	if (arg == NULL)
@@ -179,43 +178,49 @@ static int print_skbedit(struct action_util *au, FILE *f, struct rtattr *arg)
 	parse_rtattr_nested(tb, TCA_SKBEDIT_MAX, arg);
 
 	if (tb[TCA_SKBEDIT_PARMS] == NULL) {
-		fprintf(f, "[NULL skbedit parameters]");
+		print_string(PRINT_FP, NULL, "%s", "[NULL skbedit parameters]");
 		return -1;
 	}
 	p = RTA_DATA(tb[TCA_SKBEDIT_PARMS]);
 
-	fprintf(f, " skbedit");
+	print_string(PRINT_ANY, "kind", "%s ", "skbedit");
 
 	if (tb[TCA_SKBEDIT_QUEUE_MAPPING] != NULL) {
-		queue_mapping = RTA_DATA(tb[TCA_SKBEDIT_QUEUE_MAPPING]);
-		fprintf(f, " queue_mapping %u", *queue_mapping);
+		print_uint(PRINT_ANY, "queue_mapping", "queue_mapping %u",
+			   rta_getattr_u16(tb[TCA_SKBEDIT_QUEUE_MAPPING]));
 	}
 	if (tb[TCA_SKBEDIT_PRIORITY] != NULL) {
-		priority = RTA_DATA(tb[TCA_SKBEDIT_PRIORITY]);
-		fprintf(f, " priority %s", sprint_tc_classid(*priority, b1));
+		priority = rta_getattr_u32(tb[TCA_SKBEDIT_PRIORITY]);
+		print_string(PRINT_ANY, "priority", " priority %s",
+			     sprint_tc_classid(priority, b1));
 	}
 	if (tb[TCA_SKBEDIT_MARK] != NULL) {
-		mark = RTA_DATA(tb[TCA_SKBEDIT_MARK]);
-		fprintf(f, " mark %d", *mark);
+		print_uint(PRINT_ANY, "mark", " mark %u",
+			   rta_getattr_u32(tb[TCA_SKBEDIT_MARK]));
 	}
 	if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
-		ptype = RTA_DATA(tb[TCA_SKBEDIT_PTYPE]);
-		if (*ptype == PACKET_HOST)
-			fprintf(f, " ptype host");
-		else if (*ptype == PACKET_BROADCAST)
-			fprintf(f, " ptype broadcast");
-		else if (*ptype == PACKET_MULTICAST)
-			fprintf(f, " ptype multicast");
-		else if (*ptype == PACKET_OTHERHOST)
-			fprintf(f, " ptype otherhost");
+		ptype = rta_getattr_u16(tb[TCA_SKBEDIT_PTYPE]);
+		if (ptype == PACKET_HOST)
+			print_string(PRINT_ANY, "ptype", " ptype %s", "host");
+		else if (ptype == PACKET_BROADCAST)
+			print_string(PRINT_ANY, "ptype", " ptype %s",
+				     "broadcast");
+		else if (ptype == PACKET_MULTICAST)
+			print_string(PRINT_ANY, "ptype", " ptype %s",
+				     "multicast");
+		else if (ptype == PACKET_OTHERHOST)
+			print_string(PRINT_ANY, "ptype", " ptype %s",
+				     "otherhost");
 		else
-			fprintf(f, " ptype %d", *ptype);
+			print_uint(PRINT_ANY, "ptype", " ptype %u", ptype);
 	}
 
 	print_action_control(f, " ", p->action, "");
 
-	fprintf(f, "\n\t index %u ref %d bind %d",
-		p->index, p->refcnt, p->bindcnt);
+	print_string(PRINT_FP, NULL, "%s", _SL_);
+	print_uint(PRINT_ANY, "index", "\t index %u", p->index);
+	print_int(PRINT_ANY, "ref", " ref %d", p->refcnt);
+	print_int(PRINT_ANY, "bind", " bind %d", p->bindcnt);
 
 	if (show_stats) {
 		if (tb[TCA_SKBEDIT_TM]) {
@@ -225,7 +230,7 @@ static int print_skbedit(struct action_util *au, FILE *f, struct rtattr *arg)
 		}
 	}
 
-	fprintf(f, "\n ");
+	print_string(PRINT_FP, NULL, "%s", _SL_);
 
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related

* Re: [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)
From: Alexei Starovoitov @ 2018-04-10 17:56 UTC (permalink / raw)
  To: Quentin Monnet; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-3-quentin.monnet@netronome.com>

On Tue, Apr 10, 2018 at 03:41:51PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Alexei:
> 
> - bpf_map_lookup_elem()
> - bpf_map_update_elem()
> - bpf_map_delete_elem()
> - bpf_probe_read()
> - bpf_ktime_get_ns()
> - bpf_trace_printk()
> - bpf_skb_store_bytes()
> - bpf_l3_csum_replace()
> - bpf_l4_csum_replace()
> - bpf_tail_call()
> - bpf_clone_redirect()
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h | 199 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 199 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 45f77f01e672..2bc653a3a20f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -381,6 +381,205 @@ union bpf_attr {
>   * intentional, removing them would break paragraphs for rst2man.
>   *
>   * Start of BPF helper function descriptions:
> + *
> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
> + * 	Description
> + * 		Perform a lookup in *map* for an entry associated to *key*.
> + * 	Return
> + * 		Map value associated to *key*, or **NULL** if no entry was
> + * 		found.
> + *
> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
> + * 	Description
> + * 		Add or update the value of the entry associated to *key* in
> + * 		*map* with *value*. *flags* is one of:
> + *
> + * 		**BPF_NOEXIST**
> + * 			The entry for *key* must not exist in the map.
> + * 		**BPF_EXIST**
> + * 			The entry for *key* must already exist in the map.
> + * 		**BPF_ANY**
> + * 			No condition on the existence of the entry for *key*.
> + *
> + * 		These flags are only useful for maps of type
> + * 		**BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
> + * 		should be used.

I think that's not entirely accurate.
The flags work as expected for all other map types as well,
and for LRU maps, sockmaps and map-in-map the flags have practical use cases.
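
A minimal sketch of the flag semantics being discussed, written as a toy
Python model rather than kernel code. The dict stands in for a hash-like map,
map_update_elem() is a hypothetical name used only for illustration, and the
error codes are assumed to mirror what the hash map implementation returns;
other map types may behave differently.

```python
import errno

# Flag values as defined in include/uapi/linux/bpf.h.
BPF_ANY = 0      # create a new element or update an existing one
BPF_NOEXIST = 1  # create a new element only if it does not exist yet
BPF_EXIST = 2    # update an existing element only


def map_update_elem(table, key, value, flags):
    """Toy model of the flag checks; 'table' is a plain dict."""
    if flags not in (BPF_ANY, BPF_NOEXIST, BPF_EXIST):
        return -errno.EINVAL
    exists = key in table
    if flags == BPF_NOEXIST and exists:
        return -errno.EEXIST
    if flags == BPF_EXIST and not exists:
        return -errno.ENOENT
    table[key] = value
    return 0
```

With this model, BPF_NOEXIST gives "insert only" behaviour and BPF_EXIST gives
"replace only" behaviour, which is the kind of use case that matters for maps
shared between several writers.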

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_delete_elem(struct bpf_map *map, void *key)
> + * 	Description
> + * 		Delete entry with *key* from *map*.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_probe_read(void *dst, u32 size, const void *src)
> + * 	Description
> + * 		For tracing programs, safely attempt to read *size* bytes from
> + * 		address *src* and store the data in *dst*.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * u64 bpf_ktime_get_ns(void)
> + * 	Description
> + * 		Return the time elapsed since system boot, in nanoseconds.
> + * 	Return
> + * 		Current *ktime*.
> + *
> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
> + * 	Description
> + * 		This helper is a "printk()-like" facility for debugging. It
> + * 		prints a message defined by format *fmt* (of size *fmt_size*)
> + * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
> + * 		available. It can take up to three additional **u64**
> + * 		arguments (as an eBPF helpers, the total number of arguments is
> + * 		limited to five). Each time the helper is called, it appends a
> + * 		line that looks like the following:
> + *
> + * 		::
> + *
> + * 			telnet-470   [001] .N.. 419421.045894: 0x00000001: BPF command: 2
> + *
> + * 		In the above:
> + *
> + * 			* ``telnet`` is the name of the current task.
> + * 			* ``470`` is the PID of the current task.
> + * 			* ``001`` is the CPU number on which the task is
> + * 			  running.
> + * 			* In ``.N..``, each character refers to a set of
> + * 			  options (whether irqs are enabled, scheduling
> + * 			  options, whether hard/softirqs are running, level of
> + * 			  preempt_disabled respectively). **N** means that
> + * 			  **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
> + * 			  are set.
> + * 			* ``419421.045894`` is a timestamp.
> + * 			* ``0x00000001`` is a fake value used by BPF for the
> + * 			  instruction pointer register.
> + * 			* ``BPF command: 2`` is the message formatted with
> + * 			  *fmt*.

The above depends on how trace_pipe was configured. It's the default
configuration for many, but it would be good to explain this a bit better.

> + *
> + * 		The conversion specifiers supported by *fmt* are similar, but
> + * 		more limited than for printk(). They are **%d**, **%i**,
> + * 		**%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
> + * 		**%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
> + * 		of field, padding with zeroes, etc.) is available, and the
> + * 		helper will silently fail if it encounters an unknown
> + * 		specifier.

This is not true. bpf_trace_printk will return -EINVAL for an unknown specifier.
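
To make the expected behaviour concrete, here is a simplified Python model of
the format check. It is a sketch only: check_trace_printk_fmt() is a
hypothetical name, the specifier list is taken from the quoted description,
and the limit of three arguments comes from the same text. Other checks the
kernel may perform (for example on how often **%s** can appear) are
deliberately not modelled.

```python
import errno
import re

# Conversion specifiers listed in the quoted helper description.
# Longer alternatives come first so that "%lld" is not read as "%l" + "ld".
_SPEC = re.compile(r'%(lld|lli|llu|llx|ld|li|lu|lx|d|i|u|x|p|s)')
MAX_ARGS = 3  # up to three additional u64 arguments


def check_trace_printk_fmt(fmt):
    """Return 0 if fmt only uses supported specifiers, else -EINVAL."""
    count = 0
    pos = 0
    while True:
        pos = fmt.find('%', pos)
        if pos == -1:
            return 0
        match = _SPEC.match(fmt, pos)
        if match is None:
            # Unknown specifier or modifier such as "%5d": report an error
            # instead of failing silently.
            return -errno.EINVAL
        count += 1
        if count > MAX_ARGS:
            return -errno.EINVAL
        pos = match.end()
```

In this model a format such as "BPF command: %d\n" is accepted, while "%5d"
or a fourth conversion is rejected with -EINVAL.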

> + *
> + * 		Also, note that **bpf_trace_printk**\ () is slow, and should
> + * 		only be used for debugging purposes. For passing values to user
> + * 		space, perf events should be preferred.

Please mention the giant dmesg warning that people will definitely
notice when they try to use this helper.

> + * 	Return
> + * 		The number of bytes written to the buffer, or a negative error
> + * 		in case of failure.
> + *
> + * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
> + * 	Description
> + * 		Store *len* bytes from address *from* into the packet
> + * 		associated to *skb*, at *offset*. *flags* are a combination of
> + * 		**BPF_F_RECOMPUTE_CSUM** (automatically recompute the
> + * 		checksum for the packet after storing the bytes) and
> + * 		**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + * 		**->swhash** and *skb*\ **->l4hash** to 0).
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
> + * 	Description
> + * 		Recompute the IP checksum for the packet associated to *skb*.
> + * 		Computation is incremental, so the helper must know the former
> + * 		value of the header field that was modified (*from*), the new
> + * 		value of this field (*to*), and the number of bytes (2 or 4)
> + * 		for this field, stored in *size*. Alternatively, it is possible
> + * 		to store the difference between the previous and the new values
> + * 		of the header field in *to*, by setting *from* and *size* to 0.
> + * 		For both methods, *offset* indicates the location of the IP
> + * 		checksum within the packet.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
> + * 	Description
> + * 		Recompute the TCP or UDP checksum for the packet associated to
> + * 		*skb*. Computation is incremental, so the helper must know the
> + * 		former value of the header field that was modified (*from*),
> + * 		the new value of this field (*to*), and the number of bytes (2
> + * 		or 4) for this field, stored on the lowest four bits of
> + * 		*flags*. Alternatively, it is possible to store the difference
> + * 		between the previous and the new values of the header field in
> + * 		*to*, by setting *from* and the four lowest bits of *flags* to
> + * 		0. For both methods, *offset* indicates the location of the IP
> + * 		checksum within the packet. In addition to the size of the
> + * 		field, *flags* can be added (bitwise OR) actual flags. With
> + * 		**BPF_F_MARK_MANGLED_0**, a null checksum is left untouched
> + * 		(unless **BPF_F_MARK_ENFORCE** is added as well), and for
> + * 		updates resulting in a null checksum the value is set to
> + * 		**CSUM_MANGLED_0** instead. Flag **BPF_F_PSEUDO_HDR**
> + * 		indicates the checksum is to be computed against a
> + * 		pseudo-header.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
> + * 	Description
> + * 		This special helper is used to trigger a "tail call", or in
> + * 		other words, to jump into another eBPF program. The contents of
> + * 		eBPF registers and stack are not modified, the new program
> + * 		"inherits" them from the caller. This mechanism allows for

"inherits" is a technically correct, but misleading statement,
since the callee program cannot access the caller's registers and stack.

> + * 		program chaining, either for raising the maximum number of
> + * 		available eBPF instructions, or to execute given programs in
> + * 		conditional blocks. For security reasons, there is an upper
> + * 		limit to the number of successive tail calls that can be
> + * 		performed.
> + *
> + * 		Upon call of this helper, the program attempts to jump into a
> + * 		program referenced at index *index* in *prog_array_map*, a
> + * 		special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
> + * 		*ctx*, a pointer to the context.
> + *
> + * 		If the call succeeds, the kernel immediately runs the first
> + * 		instruction of the new program. This is not a function call,
> + * 		and it never goes back to the previous program. If the call
> + * 		fails, then the helper has no effect, and the caller continues
> + * 		to run its own instructions. A call can fail if the destination
> + * 		program for the jump does not exist (i.e. *index* is superior
> + * 		to the number of entries in *prog_array_map*), or if the
> + * 		maximum number of tail calls has been reached for this chain of
> + * 		programs. This limit is defined in the kernel by the macro
> + * 		**MAX_TAIL_CALL_CNT** (not accessible to user space), which
> + * 		is currently set to 32.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
> + * 	Description
> + * 		Clone and redirect the packet associated to *skb* to another
> + * 		net device of index *ifindex*. The only flag supported for now
> + * 		is **BPF_F_INGRESS**, which indicates the packet is to be
> + * 		redirected to the ingress interface instead of (by default)
> + * 		egress.

IMO the above sentence is prone to misinterpretation.
Can you rephrase it to say that both redirect to ingress and redirect to egress
are supported, and that the flag is used to indicate which path to take?

> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> -- 
> 2.14.1
> 

^ permalink raw reply

* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Andrey Ignatov @ 2018-04-10 17:50 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	Lawrence Brakmo, Yonghong Song, Josef Bacik
In-Reply-To: <20180410144157.4831-8-quentin.monnet@netronome.com>

Quentin Monnet <quentin.monnet@netronome.com> [Tue, 2018-04-10 07:43 -0700]:
> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
> + * 	Description
> + * 		Bind the socket associated to *ctx* to the address pointed by
> + * 		*addr*, of length *addr_len*. This allows for making outgoing
> + * 		connection from the desired IP address, which can be useful for
> + * 		example when all processes inside a cgroup should use one
> + * 		single IP address on a host that has multiple IP configured.
> + *
> + * 		This helper works for IPv4 and IPv6, TCP and UDP sockets. The
> + * 		domain (*addr*\ **->sa_family**) must be **AF_INET** (or
> + * 		**AF_INET6**). Looking for a free port to bind to can be
> + * 		expensive, therefore binding to port is not permitted by the
> + * 		helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
> + * 		must be set to zero.
> + *
> + * 		As for the remote end, both parts of it can be overridden,
> + * 		remote IP and remote port. This can be useful if an application
> + * 		inside a cgroup wants to connect to another application inside
> + * 		the same cgroup or to itself, but knows nothing about the IP
> + * 		address assigned to the cgroup.

The last paragraph ("As for the remote end ...") is not relevant to
bpf_bind() and should be removed. It describes the sys_connect hook itself,
which can call bpf_bind() but also has other functionality (and that
other functionality is what this paragraph describes).


-- 
Andrey Ignatov

^ permalink raw reply

* [next-queue PATCH v7 10/10] igb: Add support for adding offloaded clsflower filters
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

This allows filters added by tc-flower and specifying MAC addresses,
Ethernet types, and the VLAN priority field, to be offloaded to the
controller.

This reuses most of the infrastructure used by ethtool, but clsflower
filters are kept in a separate list, so they are invisible to
ethtool.

To set up clsflower offloading:

$ tc qdisc replace dev eth0 handle 100: parent root mqprio \
     	   	   num_tc 3 map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
		   queues 1@0 1@1 2@2 hw 0
(clsflower offloading depends on the network driver being configured
with multiple traffic classes; we use mqprio's 'num_tc' parameter to
set it to 3)

$ tc qdisc add dev eth0 ingress

Examples of filters:

$ tc filter add dev eth0 parent ffff: flower \
     	    dst_mac aa:aa:aa:aa:aa:aa \
	    hw_tc 2 skip_sw
(just a simple filter filtering for the destination MAC address and
steering that traffic to queue 2)

$ tc filter add dev enp2s0 parent ffff: proto 0x22f0 flower \
     	    src_mac cc:cc:cc:cc:cc:cc \
	    hw_tc 1 skip_sw
(as the i210 doesn't support steering traffic based on the source
address alone, we need an additional match criterion; in this case
we are using the ethernet type (0x22f0) to steer traffic to queue 1)

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb.h      |   2 +
 drivers/net/ethernet/intel/igb/igb_main.c | 188 +++++++++++++++++++++-
 2 files changed, 188 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index b9b965921e9f..a413284fada6 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -465,6 +465,7 @@ struct igb_nfc_input {
 struct igb_nfc_filter {
 	struct hlist_node nfc_node;
 	struct igb_nfc_input filter;
+	unsigned long cookie;
 	u16 etype_reg_index;
 	u16 sw_idx;
 	u16 action;
@@ -604,6 +605,7 @@ struct igb_adapter {
 
 	/* RX network flow classification support */
 	struct hlist_head nfc_filter_list;
+	struct hlist_head cls_flower_list;
 	unsigned int nfc_filter_count;
 	/* lock for RX network flow classification filter */
 	spinlock_t nfc_lock;
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e3f33fb8064e..3c2e68dd0902 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2499,16 +2499,197 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
 	return 0;
 }
 
+#define ETHER_TYPE_FULL_MASK ((__force __be16)~0)
+#define VLAN_PRIO_FULL_MASK (0x07)
+
+static int igb_parse_cls_flower(struct igb_adapter *adapter,
+				struct tc_cls_flower_offload *f,
+				int traffic_class,
+				struct igb_nfc_filter *input)
+{
+	struct netlink_ext_ack *extack = f->common.extack;
+
+	if (f->dissector->used_keys &
+	    ~(BIT(FLOW_DISSECTOR_KEY_BASIC) |
+	      BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Unsupported key used, only BASIC, CONTROL, ETH_ADDRS and VLAN are supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+		struct flow_dissector_key_eth_addrs *key, *mask;
+
+		key = skb_flow_dissector_target(f->dissector,
+						FLOW_DISSECTOR_KEY_ETH_ADDRS,
+						f->key);
+		mask = skb_flow_dissector_target(f->dissector,
+						 FLOW_DISSECTOR_KEY_ETH_ADDRS,
+						 f->mask);
+
+		if (!is_zero_ether_addr(mask->dst)) {
+			if (!is_broadcast_ether_addr(mask->dst)) {
+				NL_SET_ERR_MSG_MOD(extack, "Only full masks are supported for destination MAC address");
+				return -EINVAL;
+			}
+
+			input->filter.match_flags |=
+				IGB_FILTER_FLAG_DST_MAC_ADDR;
+			ether_addr_copy(input->filter.dst_addr, key->dst);
+		}
+
+		if (!is_zero_ether_addr(mask->src)) {
+			if (!is_broadcast_ether_addr(mask->src)) {
+				NL_SET_ERR_MSG_MOD(extack, "Only full masks are supported for source MAC address");
+				return -EINVAL;
+			}
+
+			input->filter.match_flags |=
+				IGB_FILTER_FLAG_SRC_MAC_ADDR;
+			ether_addr_copy(input->filter.src_addr, key->src);
+		}
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
+		struct flow_dissector_key_basic *key, *mask;
+
+		key = skb_flow_dissector_target(f->dissector,
+						FLOW_DISSECTOR_KEY_BASIC,
+						f->key);
+		mask = skb_flow_dissector_target(f->dissector,
+						 FLOW_DISSECTOR_KEY_BASIC,
+						 f->mask);
+
+		if (mask->n_proto) {
+			if (mask->n_proto != ETHER_TYPE_FULL_MASK) {
+				NL_SET_ERR_MSG_MOD(extack, "Only full mask is supported for EtherType filter");
+				return -EINVAL;
+			}
+
+			input->filter.match_flags |= IGB_FILTER_FLAG_ETHER_TYPE;
+			input->filter.etype = key->n_proto;
+		}
+	}
+
+	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_VLAN)) {
+		struct flow_dissector_key_vlan *key, *mask;
+
+		key = skb_flow_dissector_target(f->dissector,
+						FLOW_DISSECTOR_KEY_VLAN,
+						f->key);
+		mask = skb_flow_dissector_target(f->dissector,
+						 FLOW_DISSECTOR_KEY_VLAN,
+						 f->mask);
+
+		if (mask->vlan_priority) {
+			if (mask->vlan_priority != VLAN_PRIO_FULL_MASK) {
+				NL_SET_ERR_MSG_MOD(extack, "Only full mask is supported for VLAN priority");
+				return -EINVAL;
+			}
+
+			input->filter.match_flags |= IGB_FILTER_FLAG_VLAN_TCI;
+			input->filter.vlan_tci = key->vlan_priority;
+		}
+	}
+
+	input->action = traffic_class;
+	input->cookie = f->cookie;
+
+	return 0;
+}
+
 static int igb_configure_clsflower(struct igb_adapter *adapter,
 				   struct tc_cls_flower_offload *cls_flower)
 {
-	return -EOPNOTSUPP;
+	struct netlink_ext_ack *extack = cls_flower->common.extack;
+	struct igb_nfc_filter *filter, *f;
+	int err, tc;
+
+	tc = tc_classid_to_hwtc(adapter->netdev, cls_flower->classid);
+	if (tc < 0) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid traffic class");
+		return -EINVAL;
+	}
+
+	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
+	if (!filter)
+		return -ENOMEM;
+
+	err = igb_parse_cls_flower(adapter, cls_flower, tc, filter);
+	if (err < 0)
+		goto err_parse;
+
+	spin_lock(&adapter->nfc_lock);
+
+	hlist_for_each_entry(f, &adapter->nfc_filter_list, nfc_node) {
+		if (!memcmp(&f->filter, &filter->filter, sizeof(f->filter))) {
+			err = -EEXIST;
+			NL_SET_ERR_MSG_MOD(extack,
+					   "This filter is already set in ethtool");
+			goto err_locked;
+		}
+	}
+
+	hlist_for_each_entry(f, &adapter->cls_flower_list, nfc_node) {
+		if (!memcmp(&f->filter, &filter->filter, sizeof(f->filter))) {
+			err = -EEXIST;
+			NL_SET_ERR_MSG_MOD(extack,
+					   "This filter is already set in cls_flower");
+			goto err_locked;
+		}
+	}
+
+	err = igb_add_filter(adapter, filter);
+	if (err < 0) {
+		NL_SET_ERR_MSG_MOD(extack, "Could not add filter to the adapter");
+		goto err_locked;
+	}
+
+	hlist_add_head(&filter->nfc_node, &adapter->cls_flower_list);
+
+	spin_unlock(&adapter->nfc_lock);
+
+	return 0;
+
+err_locked:
+	spin_unlock(&adapter->nfc_lock);
+
+err_parse:
+	kfree(filter);
+
+	return err;
 }
 
 static int igb_delete_clsflower(struct igb_adapter *adapter,
 				struct tc_cls_flower_offload *cls_flower)
 {
-	return -EOPNOTSUPP;
+	struct igb_nfc_filter *filter;
+	int err;
+
+	spin_lock(&adapter->nfc_lock);
+
+	hlist_for_each_entry(filter, &adapter->cls_flower_list, nfc_node)
+		if (filter->cookie == cls_flower->cookie)
+			break;
+
+	if (!filter) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	err = igb_erase_filter(adapter, filter);
+	if (err < 0)
+		goto out;
+
+	hlist_del(&filter->nfc_node);
+	kfree(filter);
+
+out:
+	spin_unlock(&adapter->nfc_lock);
+
+	return err;
 }
 
 static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
@@ -9353,6 +9534,9 @@ static void igb_nfc_filter_exit(struct igb_adapter *adapter)
 	hlist_for_each_entry(rule, &adapter->nfc_filter_list, nfc_node)
 		igb_erase_filter(adapter, rule);
 
+	hlist_for_each_entry(rule, &adapter->cls_flower_list, nfc_node)
+		igb_erase_filter(adapter, rule);
+
 	spin_unlock(&adapter->nfc_lock);
 }
 
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 09/10] igb: Add the skeletons for tc-flower offloading
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

This adds basic functions needed to implement offloading for filters
created by tc-flower.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 66 +++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 1b6fad88107a..e3f33fb8064e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -36,6 +36,7 @@
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
 #include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
 #include <linux/net_tstamp.h>
 #include <linux/mii.h>
 #include <linux/ethtool.h>
@@ -2498,6 +2499,69 @@ static int igb_offload_cbs(struct igb_adapter *adapter,
 	return 0;
 }
 
+static int igb_configure_clsflower(struct igb_adapter *adapter,
+				   struct tc_cls_flower_offload *cls_flower)
+{
+	return -EOPNOTSUPP;
+}
+
+static int igb_delete_clsflower(struct igb_adapter *adapter,
+				struct tc_cls_flower_offload *cls_flower)
+{
+	return -EOPNOTSUPP;
+}
+
+static int igb_setup_tc_cls_flower(struct igb_adapter *adapter,
+				   struct tc_cls_flower_offload *cls_flower)
+{
+	switch (cls_flower->command) {
+	case TC_CLSFLOWER_REPLACE:
+		return igb_configure_clsflower(adapter, cls_flower);
+	case TC_CLSFLOWER_DESTROY:
+		return igb_delete_clsflower(adapter, cls_flower);
+	case TC_CLSFLOWER_STATS:
+		return -EOPNOTSUPP;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int igb_setup_tc_block_cb(enum tc_setup_type type, void *type_data,
+				 void *cb_priv)
+{
+	struct igb_adapter *adapter = cb_priv;
+
+	if (!tc_cls_can_offload_and_chain0(adapter->netdev, type_data))
+		return -EOPNOTSUPP;
+
+	switch (type) {
+	case TC_SETUP_CLSFLOWER:
+		return igb_setup_tc_cls_flower(adapter, type_data);
+
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int igb_setup_tc_block(struct igb_adapter *adapter,
+			      struct tc_block_offload *f)
+{
+	if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+		return -EOPNOTSUPP;
+
+	switch (f->command) {
+	case TC_BLOCK_BIND:
+		return tcf_block_cb_register(f->block, igb_setup_tc_block_cb,
+					     adapter, adapter);
+	case TC_BLOCK_UNBIND:
+		tcf_block_cb_unregister(f->block, igb_setup_tc_block_cb,
+					adapter);
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			void *type_data)
 {
@@ -2506,6 +2570,8 @@ static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
 	switch (type) {
 	case TC_SETUP_QDISC_CBS:
 		return igb_offload_cbs(adapter, type_data);
+	case TC_SETUP_BLOCK:
+		return igb_setup_tc_block(adapter, type_data);
 
 	default:
 		return -EOPNOTSUPP;
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 08/10] igb: Add MAC address support for ethtool nftuple filters
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

This adds the capability of configuring the queue steering of arriving
packets based on their source and destination MAC addresses.

On the i210, source address steering (i.e. driving traffic to a
specific queue) does not work, but filtering (i.e. accepting traffic
based on the source address) does. So, trying to add a filter that
specifies only a source address is rejected with an error.
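
As an illustration only, the "source address only" test used by
igb_add_filter() in this patch can be modelled in user space as below.
The flag values are the ones from igb.h in this series; the names
(FILTER_FLAG_*, only_source_address) are shortened stand-ins, not
driver symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same values as enum igb_filter_match_flags in this series. */
enum filter_match_flags {
	FILTER_FLAG_ETHER_TYPE   = 0x1,
	FILTER_FLAG_VLAN_TCI     = 0x2,
	FILTER_FLAG_SRC_MAC_ADDR = 0x4,
	FILTER_FLAG_DST_MAC_ADDR = 0x8,
};

/* True when no flag other than "source MAC address" is set. */
bool only_source_address(uint8_t match_flags)
{
	/* Only the four flags above are defined in this sketch. */
	assert((match_flags & ~0xFu) == 0);

	return !(match_flags & ~FILTER_FLAG_SRC_MAC_ADDR);
}
```

With this predicate, a rule combining a source address with an
ethertype (as in the second ethtool example) is accepted, while a rule
carrying the source address alone is refused. Note that the expression
is also true when no match flag is set at all.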

In practical terms this adds support for the following use cases,
characterized by these examples:

$ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
(this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
to the RX queue 0)

$ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 \
          proto 0x22f0 action 3
(this will direct packets with source address "44:44:44:44:44:44" and
ethertype 0x22f0 to the RX queue 3)

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 43 ++++++++++++++++++--
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 31b2960a7869..6697c273ab59 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2495,6 +2495,23 @@ static int igb_get_ethtool_nfc_entry(struct igb_adapter *adapter,
 			fsp->h_ext.vlan_tci = rule->filter.vlan_tci;
 			fsp->m_ext.vlan_tci = htons(VLAN_PRIO_MASK);
 		}
+		if (rule->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+			ether_addr_copy(fsp->h_u.ether_spec.h_dest,
+					rule->filter.dst_addr);
+			/* As we only support matching by the full
+			 * mask, return the mask to userspace
+			 */
+			eth_broadcast_addr(fsp->m_u.ether_spec.h_dest);
+		}
+		if (rule->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+			ether_addr_copy(fsp->h_u.ether_spec.h_source,
+					rule->filter.src_addr);
+			/* As we only support matching by the full
+			 * mask, return the mask to userspace
+			 */
+			eth_broadcast_addr(fsp->m_u.ether_spec.h_source);
+		}
+
 		return 0;
 	}
 	return -EINVAL;
@@ -2768,8 +2785,16 @@ static int igb_rxnfc_write_vlan_prio_filter(struct igb_adapter *adapter,
 
 int igb_add_filter(struct igb_adapter *adapter, struct igb_nfc_filter *input)
 {
+	struct e1000_hw *hw = &adapter->hw;
 	int err = -EINVAL;
 
+	if (hw->mac.type == e1000_i210 &&
+	    !(input->filter.match_flags & ~IGB_FILTER_FLAG_SRC_MAC_ADDR)) {
+		dev_err(&adapter->pdev->dev,
+			"i210 doesn't support flow classification rules specifying only source addresses.\n");
+		return -EOPNOTSUPP;
+	}
+
 	if (input->filter.match_flags & IGB_FILTER_FLAG_ETHER_TYPE) {
 		err = igb_rxnfc_write_etype_filter(adapter, input);
 		if (err)
@@ -2933,10 +2958,6 @@ static int igb_add_ethtool_nfc_entry(struct igb_adapter *adapter,
 	if ((fsp->flow_type & ~FLOW_EXT) != ETHER_FLOW)
 		return -EINVAL;
 
-	if (fsp->m_u.ether_spec.h_proto != ETHER_TYPE_FULL_MASK &&
-	    fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK))
-		return -EINVAL;
-
 	input = kzalloc(sizeof(*input), GFP_KERNEL);
 	if (!input)
 		return -ENOMEM;
@@ -2946,6 +2967,20 @@ static int igb_add_ethtool_nfc_entry(struct igb_adapter *adapter,
 		input->filter.match_flags = IGB_FILTER_FLAG_ETHER_TYPE;
 	}
 
+	/* Only support matching addresses by the full mask */
+	if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_source)) {
+		input->filter.match_flags |= IGB_FILTER_FLAG_SRC_MAC_ADDR;
+		ether_addr_copy(input->filter.src_addr,
+				fsp->h_u.ether_spec.h_source);
+	}
+
+	/* Only support matching addresses by the full mask */
+	if (is_broadcast_ether_addr(fsp->m_u.ether_spec.h_dest)) {
+		input->filter.match_flags |= IGB_FILTER_FLAG_DST_MAC_ADDR;
+		ether_addr_copy(input->filter.dst_addr,
+				fsp->h_u.ether_spec.h_dest);
+	}
+
 	if ((fsp->flow_type & FLOW_EXT) && fsp->m_ext.vlan_tci) {
 		if (fsp->m_ext.vlan_tci != htons(VLAN_PRIO_MASK)) {
 			err = -EINVAL;
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 07/10] igb: Enable nfc filters to specify MAC addresses
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

This allows igb_add_filter()/igb_erase_filter() to work on filters
that include MAC addresses (both source and destination).

For now, this only exposes the functionality; the next commit glues
ethtool into this. Later in this series, these APIs are used to allow
offloading of cls_flower filters.
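
A note on the 'err = min_t(int, err, 0)' lines in the diff below: the
MAC filter helpers in this series return the table index on success
and a negative errno on failure, while igb_add_filter() wants 0 on
success. A minimal user-space sketch of that conversion, with
hypothetical names (add_entry, to_errno_style, TABLE_SIZE) that do not
exist in the driver:

```c
#include <assert.h>
#include <errno.h>

#define TABLE_SIZE 4	/* hypothetical table size for this sketch */

/* Stand-in for the add helper: returns the slot index that was
 * taken, or -ENOSPC when the table is full.
 */
int add_entry(int *used_slots)
{
	assert(used_slots);

	if (*used_slots >= TABLE_SIZE)
		return -ENOSPC;

	return (*used_slots)++;
}

/* Same effect as 'err = min_t(int, err, 0)': any index (>= 0)
 * becomes 0, a negative error code is kept unchanged.
 */
int to_errno_style(int ret)
{
	return ret < 0 ? ret : 0;
}
```

Without the clamp, a successful add into any slot other than 0 would
be treated as a failure by the 'if (err)' check that follows.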

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb.h         |  4 +++
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 28 ++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index f48ba090fd6a..b9b965921e9f 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -442,6 +442,8 @@ struct hwmon_buff {
 enum igb_filter_match_flags {
 	IGB_FILTER_FLAG_ETHER_TYPE = 0x1,
 	IGB_FILTER_FLAG_VLAN_TCI   = 0x2,
+	IGB_FILTER_FLAG_SRC_MAC_ADDR   = 0x4,
+	IGB_FILTER_FLAG_DST_MAC_ADDR   = 0x8,
 };
 
 #define IGB_MAX_RXNFC_FILTERS 16
@@ -456,6 +458,8 @@ struct igb_nfc_input {
 	u8 match_flags;
 	__be16 etype;
 	__be16 vlan_tci;
+	u8 src_addr[ETH_ALEN];
+	u8 dst_addr[ETH_ALEN];
 };
 
 struct igb_nfc_filter {
diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index 5975d432836f..31b2960a7869 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2776,6 +2776,25 @@ int igb_add_filter(struct igb_adapter *adapter, struct igb_nfc_filter *input)
 			return err;
 	}
 
+	if (input->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR) {
+		err = igb_add_mac_steering_filter(adapter,
+						  input->filter.dst_addr,
+						  input->action, 0);
+		err = min_t(int, err, 0);
+		if (err)
+			return err;
+	}
+
+	if (input->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR) {
+		err = igb_add_mac_steering_filter(adapter,
+						  input->filter.src_addr,
+						  input->action,
+						  IGB_MAC_STATE_SRC_ADDR);
+		err = min_t(int, err, 0);
+		if (err)
+			return err;
+	}
+
 	if (input->filter.match_flags & IGB_FILTER_FLAG_VLAN_TCI)
 		err = igb_rxnfc_write_vlan_prio_filter(adapter, input);
 
@@ -2824,6 +2843,15 @@ int igb_erase_filter(struct igb_adapter *adapter, struct igb_nfc_filter *input)
 		igb_clear_vlan_prio_filter(adapter,
 					   ntohs(input->filter.vlan_tci));
 
+	if (input->filter.match_flags & IGB_FILTER_FLAG_SRC_MAC_ADDR)
+		igb_del_mac_steering_filter(adapter, input->filter.src_addr,
+					    input->action,
+					    IGB_MAC_STATE_SRC_ADDR);
+
+	if (input->filter.match_flags & IGB_FILTER_FLAG_DST_MAC_ADDR)
+		igb_del_mac_steering_filter(adapter, input->filter.dst_addr,
+					    input->action, 0);
+
 	return 0;
 }
 
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 05/10] igb: Add support for enabling queue steering in filters
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

On some igb models (82575 and i210) the MAC address filters can
control to which queue the packet will be assigned.

This extends the 'state' with one more state to signify that queue
selection should be enabled for that filter.

As 82575 parts are no longer easily obtained (and this was developed
against i210), only support for the i210 model is enabled.

These functions are exported and will be used in the next patch.
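
For reference, a rough user-space model of how the control bits of
RAH end up being composed for an 82575/i210 entry once this patch is
applied. The AV, ASEL and QSEL values are the ones shown in this
series; RAH_POOL_1 (0x00040000) and the limit of four queues are
assumptions, as neither is visible in these diffs.

```c
#include <assert.h>
#include <stdint.h>

#define RAH_AV			0x80000000u
#define RAH_ASEL_SRC_ADDR	0x00010000u
#define RAH_QSEL_ENABLE		0x10000000u
#define RAH_POOL_1		0x00040000u	/* assumed value */

#define MAC_STATE_SRC_ADDR	 0x4
#define MAC_STATE_QUEUE_STEERING 0x8

/* Control bits for a valid entry, following the 82575/i210 branch
 * of igb_rar_set_index(): the queue number is always written, but
 * it is only honoured when QSEL is enabled.
 */
uint32_t rah_control_bits(uint8_t state, uint8_t queue)
{
	uint32_t rar_high = RAH_AV;

	assert(queue < 4);	/* assumed number of RX queues */

	if (state & MAC_STATE_SRC_ADDR)
		rar_high |= RAH_ASEL_SRC_ADDR;

	if (state & MAC_STATE_QUEUE_STEERING)
		rar_high |= RAH_QSEL_ENABLE;

	rar_high |= RAH_POOL_1 * queue;

	return rar_high;
}
```

For example, a destination steering filter to queue 3 gives
0x900C0000 in this model, while a plain "in use" entry on queue 0
gives just the valid bit, 0x80000000.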

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 .../net/ethernet/intel/igb/e1000_defines.h    |  1 +
 drivers/net/ethernet/intel/igb/igb.h          |  6 +++++
 drivers/net/ethernet/intel/igb/igb_main.c     | 26 +++++++++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 5417edbe3125..d3d1d868e7ba 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -492,6 +492,7 @@
  */
 #define E1000_RAH_AV  0x80000000        /* Receive descriptor valid */
 #define E1000_RAH_ASEL_SRC_ADDR 0x00010000
+#define E1000_RAH_QSEL_ENABLE 0x10000000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
 #define E1000_RAH_POOL_MASK 0x03FC0000
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index f3ecda46f287..f48ba090fd6a 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -475,6 +475,7 @@ struct igb_mac_addr {
 #define IGB_MAC_STATE_DEFAULT	0x1
 #define IGB_MAC_STATE_IN_USE	0x2
 #define IGB_MAC_STATE_SRC_ADDR  0x4
+#define IGB_MAC_STATE_QUEUE_STEERING 0x8
 
 /* board specific private data structure */
 struct igb_adapter {
@@ -740,4 +741,9 @@ int igb_add_filter(struct igb_adapter *adapter,
 int igb_erase_filter(struct igb_adapter *adapter,
 		     struct igb_nfc_filter *input);
 
+int igb_add_mac_steering_filter(struct igb_adapter *adapter,
+				const u8 *addr, u8 queue, u8 flags);
+int igb_del_mac_steering_filter(struct igb_adapter *adapter,
+				const u8 *addr, u8 queue, u8 flags);
+
 #endif /* _IGB_H_ */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 2033ec3c9b27..e3da35cab786 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6935,6 +6935,28 @@ static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
 	return igb_del_mac_filter_flags(adapter, addr, queue, 0);
 }
 
+int igb_add_mac_steering_filter(struct igb_adapter *adapter,
+				const u8 *addr, u8 queue, u8 flags)
+{
+	struct e1000_hw *hw = &adapter->hw;
+
+	/* In theory, this should be supported on 82575 as well, but
+	 * that part wasn't easily accessible during development.
+	 */
+	if (hw->mac.type != e1000_i210)
+		return -EOPNOTSUPP;
+
+	return igb_add_mac_filter_flags(adapter, addr, queue,
+					IGB_MAC_STATE_QUEUE_STEERING | flags);
+}
+
+int igb_del_mac_steering_filter(struct igb_adapter *adapter,
+				const u8 *addr, u8 queue, u8 flags)
+{
+	return igb_del_mac_filter_flags(adapter, addr, queue,
+					IGB_MAC_STATE_QUEUE_STEERING | flags);
+}
+
 static int igb_uc_sync(struct net_device *netdev, const unsigned char *addr)
 {
 	struct igb_adapter *adapter = netdev_priv(netdev);
@@ -8784,6 +8806,10 @@ static void igb_rar_set_index(struct igb_adapter *adapter, u32 index)
 		switch (hw->mac.type) {
 		case e1000_82575:
 		case e1000_i210:
+			if (adapter->mac_table[index].state &
+			    IGB_MAC_STATE_QUEUE_STEERING)
+				rar_high |= E1000_RAH_QSEL_ENABLE;
+
 			rar_high |= E1000_RAH_POOL_1 *
 				      adapter->mac_table[index].queue;
 			break;
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 06/10] igb: Allow filters to be added for the local MAC address
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

Users expect that, when adding a steering filter for the local MAC
address, all the traffic directed to that address will go to the
queue given in the filter.

Currently, it's not possible to configure entries in the "in use"
state, which is the normal state of the local MAC address entry (it is
the default). This patch allows overriding the steering configuration
of "in use" entries, if the filter to be added matches the address and
address type (source or destination) of an existing entry.

Entries referring to the local MAC address get special handling: when
they are removed, only the steering configuration is reset.
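
The two rules above can be sketched in user space as follows. This is
a simplified model, not the driver code: struct mac_entry and the
function names are stand-ins, and 'default_queue' plays the role of
adapter->vfs_allocated_count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN		6
#define MAC_STATE_DEFAULT	0x1
#define MAC_STATE_IN_USE	0x2
#define MAC_STATE_SRC_ADDR	0x4

struct mac_entry {
	uint8_t addr[ETH_ALEN];
	uint8_t queue;
	uint8_t state;
};

/* A free entry can always be used; an "in use" entry only when it
 * already holds the same address with the same address type.
 */
bool entry_can_be_used(const struct mac_entry *e, const uint8_t *addr,
		       uint8_t flags)
{
	assert(e && addr);

	if (!(e->state & MAC_STATE_IN_USE))
		return true;

	if ((e->state & MAC_STATE_SRC_ADDR) != (flags & MAC_STATE_SRC_ADDR))
		return false;

	return memcmp(e->addr, addr, ETH_ALEN) == 0;
}

/* Deleting the default entry only resets its steering configuration;
 * any other entry is cleared completely.
 */
void entry_delete(struct mac_entry *e, uint8_t default_queue)
{
	assert(e);

	if (e->state & MAC_STATE_DEFAULT) {
		e->state = MAC_STATE_DEFAULT | MAC_STATE_IN_USE;
		e->queue = default_queue;
	} else {
		memset(e, 0, sizeof(*e));
	}
}
```

In this model the local MAC address entry keeps its address across an
add/delete cycle, so normal reception for the interface is not
disturbed when the steering filter goes away.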

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 40 ++++++++++++++++++++---
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index e3da35cab786..1b6fad88107a 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6844,6 +6844,27 @@ static void igb_set_default_mac_filter(struct igb_adapter *adapter)
 	igb_rar_set_index(adapter, 0);
 }
 
+/* If the filter to be added and an already existing filter express
+ * the same address and address type, it should be possible to only
+ * override the other configurations, for example the queue to steer
+ * traffic.
+ */
+static bool igb_mac_entry_can_be_used(const struct igb_mac_addr *entry,
+				      const u8 *addr, const u8 flags)
+{
+	if (!(entry->state & IGB_MAC_STATE_IN_USE))
+		return true;
+
+	if ((entry->state & IGB_MAC_STATE_SRC_ADDR) !=
+	    (flags & IGB_MAC_STATE_SRC_ADDR))
+		return false;
+
+	if (!ether_addr_equal(addr, entry->addr))
+		return false;
+
+	return true;
+}
+
 /* Add a MAC filter for 'addr' directing matching traffic to 'queue',
  * 'flags' is used to indicate what kind of match is made, match is by
  * default for the destination address, if matching by source address
@@ -6866,7 +6887,8 @@ static int igb_add_mac_filter_flags(struct igb_adapter *adapter,
 	 * addresses.
 	 */
 	for (i = 0; i < rar_entries; i++) {
-		if (adapter->mac_table[i].state & IGB_MAC_STATE_IN_USE)
+		if (!igb_mac_entry_can_be_used(&adapter->mac_table[i],
+					       addr, flags))
 			continue;
 
 		ether_addr_copy(adapter->mac_table[i].addr, addr);
@@ -6918,9 +6940,19 @@ static int igb_del_mac_filter_flags(struct igb_adapter *adapter,
 		if (!ether_addr_equal(adapter->mac_table[i].addr, addr))
 			continue;
 
-		adapter->mac_table[i].state = 0;
-		memset(adapter->mac_table[i].addr, 0, ETH_ALEN);
-		adapter->mac_table[i].queue = 0;
+		/* When a filter for the default address is "deleted",
+		 * we return it to its initial configuration
+		 */
+		if (adapter->mac_table[i].state & IGB_MAC_STATE_DEFAULT) {
+			adapter->mac_table[i].state =
+				IGB_MAC_STATE_DEFAULT | IGB_MAC_STATE_IN_USE;
+			adapter->mac_table[i].queue =
+				adapter->vfs_allocated_count;
+		} else {
+			adapter->mac_table[i].state = 0;
+			adapter->mac_table[i].queue = 0;
+			memset(adapter->mac_table[i].addr, 0, ETH_ALEN);
+		}
 
 		igb_rar_set_index(adapter, i);
 		return 0;
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 04/10] igb: Add support for MAC address filters specifying source addresses
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

Makes it possible to direct packets to queues based on their source
address. Documents the expected usage of the 'flags' parameter.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 .../net/ethernet/intel/igb/e1000_defines.h    |  1 +
 drivers/net/ethernet/intel/igb/igb.h          |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c     | 40 ++++++++++++++++---
 3 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 98534f765e0e..5417edbe3125 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -491,6 +491,7 @@
  * manageability enabled, allowing us room for 15 multicast addresses.
  */
 #define E1000_RAH_AV  0x80000000        /* Receive descriptor valid */
+#define E1000_RAH_ASEL_SRC_ADDR 0x00010000
 #define E1000_RAL_MAC_ADDR_LEN 4
 #define E1000_RAH_MAC_ADDR_LEN 2
 #define E1000_RAH_POOL_MASK 0x03FC0000
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 8dbc399b345e..f3ecda46f287 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -474,6 +474,7 @@ struct igb_mac_addr {
 
 #define IGB_MAC_STATE_DEFAULT	0x1
 #define IGB_MAC_STATE_IN_USE	0x2
+#define IGB_MAC_STATE_SRC_ADDR  0x4
 
 /* board specific private data structure */
 struct igb_adapter {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 976898d39d6e..2033ec3c9b27 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6844,8 +6844,14 @@ static void igb_set_default_mac_filter(struct igb_adapter *adapter)
 	igb_rar_set_index(adapter, 0);
 }
 
-static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
-			      const u8 queue)
+/* Add a MAC filter for 'addr' directing matching traffic to 'queue',
+ * 'flags' is used to indicate what kind of match is made, match is by
+ * default for the destination address, if matching by source address
+ * is desired the flag IGB_MAC_STATE_SRC_ADDR can be used.
+ */
+static int igb_add_mac_filter_flags(struct igb_adapter *adapter,
+				    const u8 *addr, const u8 queue,
+				    const u8 flags)
 {
 	struct e1000_hw *hw = &adapter->hw;
 	int rar_entries = hw->mac.rar_entry_count -
@@ -6865,7 +6871,7 @@ static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
 
 		ether_addr_copy(adapter->mac_table[i].addr, addr);
 		adapter->mac_table[i].queue = queue;
-		adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE;
+		adapter->mac_table[i].state |= IGB_MAC_STATE_IN_USE | flags;
 
 		igb_rar_set_index(adapter, i);
 		return i;
@@ -6874,8 +6880,21 @@ static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
 	return -ENOSPC;
 }
 
-static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+static int igb_add_mac_filter(struct igb_adapter *adapter, const u8 *addr,
 			      const u8 queue)
+{
+	return igb_add_mac_filter_flags(adapter, addr, queue, 0);
+}
+
+/* Remove a MAC filter for 'addr' directing matching traffic to
+ * 'queue', 'flags' is used to indicate what kind of match need to be
+ * removed, match is by default for the destination address, if
+ * matching by source address is to be removed the flag
+ * IGB_MAC_STATE_SRC_ADDR can be used.
+ */
+static int igb_del_mac_filter_flags(struct igb_adapter *adapter,
+				    const u8 *addr, const u8 queue,
+				    const u8 flags)
 {
 	struct e1000_hw *hw = &adapter->hw;
 	int rar_entries = hw->mac.rar_entry_count -
@@ -6892,12 +6911,14 @@ static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
 	for (i = 0; i < rar_entries; i++) {
 		if (!(adapter->mac_table[i].state & IGB_MAC_STATE_IN_USE))
 			continue;
+		if ((adapter->mac_table[i].state & flags) != flags)
+			continue;
 		if (adapter->mac_table[i].queue != queue)
 			continue;
 		if (!ether_addr_equal(adapter->mac_table[i].addr, addr))
 			continue;
 
-		adapter->mac_table[i].state &= ~IGB_MAC_STATE_IN_USE;
+		adapter->mac_table[i].state = 0;
 		memset(adapter->mac_table[i].addr, 0, ETH_ALEN);
 		adapter->mac_table[i].queue = 0;
 
@@ -6908,6 +6929,12 @@ static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
 	return -ENOENT;
 }
 
+static int igb_del_mac_filter(struct igb_adapter *adapter, const u8 *addr,
+			      const u8 queue)
+{
+	return igb_del_mac_filter_flags(adapter, addr, queue, 0);
+}
+
 static int igb_uc_sync(struct net_device *netdev, const unsigned char *addr)
 {
 	struct igb_adapter *adapter = netdev_priv(netdev);
@@ -8751,6 +8778,9 @@ static void igb_rar_set_index(struct igb_adapter *adapter, u32 index)
 		if (is_valid_ether_addr(addr))
 			rar_high |= E1000_RAH_AV;
 
+		if (adapter->mac_table[index].state & IGB_MAC_STATE_SRC_ADDR)
+			rar_high |= E1000_RAH_ASEL_SRC_ADDR;
+
 		switch (hw->mac.type) {
 		case e1000_82575:
 		case e1000_i210:
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 03/10] igb: Enable the hardware traffic class feature bit for igb models
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

This will allow functionality depending on the hardware being traffic
class aware to work. In particular, the tc-flower offloading checks
verify that this bit is set.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 0a79fef3c4fb..976898d39d6e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2807,6 +2807,9 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (hw->mac.type >= e1000_82576)
 		netdev->features |= NETIF_F_SCTP_CRC;
 
+	if (hw->mac.type >= e1000_i350)
+		netdev->features |= NETIF_F_HW_TC;
+
 #define IGB_GSO_PARTIAL_FEATURES (NETIF_F_GSO_GRE | \
 				  NETIF_F_GSO_GRE_CSUM | \
 				  NETIF_F_GSO_IPXIP4 | \
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 01/10] igb: Fix not adding filter elements to the list
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

Because the order of the parameters passed to 'hlist_add_behind()' was
inverted, the 'parent' node was added "behind" the 'input'. As 'input'
is not in the list, this causes the 'input' node to be lost.
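
To make the effect of the swapped arguments concrete, here is a small
user-space model. It keeps the argument order of the kernel's
hlist_add_behind(n, prev), but uses a simplified forward-only node
(the real struct hlist_node also has a 'pprev' pointer); the names
add_behind, reachable and input_is_listed are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, forward-only list node. */
struct node {
	struct node *next;
};

/* Same argument order as hlist_add_behind(n, prev):
 * insert 'n' right after 'prev'.
 */
static void add_behind(struct node *n, struct node *prev)
{
	assert(n && prev);

	n->next = prev->next;
	prev->next = n;
}

/* Walk the list from 'first' and report whether 'target' is on it. */
static int reachable(const struct node *first, const struct node *target)
{
	for (; first; first = first->next)
		if (first == target)
			return 1;

	return 0;
}

/* Returns 1 if 'input' ends up on the list, 0 if it is lost.
 * 'swapped' selects the argument order used before this fix.
 */
int input_is_listed(int swapped)
{
	struct node parent = { NULL };	/* already on the list */
	struct node input = { NULL };	/* new, not on the list yet */
	struct node *first = &parent;	/* list head points at 'parent' */

	if (swapped)
		add_behind(&parent, &input);	/* order before the fix */
	else
		add_behind(&input, &parent);	/* order after the fix */

	return reachable(first, &input);
}
```

With the old order, the head still points at 'parent' and nothing on
the list points at 'input', so a walk from the head never sees the new
filter.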

Fixes: 0e71def25281 ("igb: add support of RX network flow classification")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ethtool.c b/drivers/net/ethernet/intel/igb/igb_ethtool.c
index e77ba0d5866d..5975d432836f 100644
--- a/drivers/net/ethernet/intel/igb/igb_ethtool.c
+++ b/drivers/net/ethernet/intel/igb/igb_ethtool.c
@@ -2865,7 +2865,7 @@ static int igb_update_ethtool_nfc_entry(struct igb_adapter *adapter,
 
 	/* add filter to the list */
 	if (parent)
-		hlist_add_behind(&parent->nfc_node, &input->nfc_node);
+		hlist_add_behind(&input->nfc_node, &parent->nfc_node);
 	else
 		hlist_add_head(&input->nfc_node, &adapter->nfc_filter_list);
 
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 02/10] igb: Fix queue selection on MAC filters on i210
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia
In-Reply-To: <20180410174959.18757-1-vinicius.gomes@intel.com>

On the RAH registers, the meaning of the "queue" parameter for traffic
steering differs depending on the controller model: there is the 82575
meaning, in which "queue" means an RX hardware queue, and the i350
meaning, where it is a reception pool.

Previously, queue selection had no effect on i210-based controllers
because the QSEL bit of the RAH register wasn't being set.

This patch separates the condition into discrete cases, so the
different handling is clearer.
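
The difference between the two encodings (multiplication versus
shift in the diff below) can be seen in a small user-space sketch.
RAH_POOL_MASK is E1000_RAH_POOL_MASK from e1000_defines.h; the value
of RAH_POOL_1 (0x00040000) is an assumption here, as it does not
appear in this series, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define RAH_POOL_1	0x00040000u	/* assumed value */
#define RAH_POOL_MASK	0x03FC0000u

/* 82575/i210 meaning: the field holds the queue number itself. */
uint32_t rah_queue_index(unsigned int queue)
{
	assert(queue <= 0xFF);	/* must fit in the 8-bit field */

	return (RAH_POOL_1 * queue) & RAH_POOL_MASK;
}

/* i350 meaning: the field is a bitmap with one bit per pool. */
uint32_t rah_pool_bitmap(unsigned int pool)
{
	assert(pool < 8);	/* 8 bits in the field */

	return (RAH_POOL_1 << pool) & RAH_POOL_MASK;
}
```

For the value 3, the first form writes 0x000C0000 (the number 3 in the
field) and the second writes 0x00200000 (only bit 3 of the field set),
which is why the two families cannot share one expression.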

Fixes: 83c21335c876 ("igb: improve MAC filter handling")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index c1c0bc30a16d..0a79fef3c4fb 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8748,12 +8748,17 @@ static void igb_rar_set_index(struct igb_adapter *adapter, u32 index)
 		if (is_valid_ether_addr(addr))
 			rar_high |= E1000_RAH_AV;
 
-		if (hw->mac.type == e1000_82575)
+		switch (hw->mac.type) {
+		case e1000_82575:
+		case e1000_i210:
 			rar_high |= E1000_RAH_POOL_1 *
-				    adapter->mac_table[index].queue;
-		else
+				      adapter->mac_table[index].queue;
+			break;
+		default:
 			rar_high |= E1000_RAH_POOL_1 <<
-				    adapter->mac_table[index].queue;
+				adapter->mac_table[index].queue;
+			break;
+		}
 	}
 
 	wr32(E1000_RAL(index), rar_low);
-- 
2.17.0

^ permalink raw reply related

* [next-queue PATCH v7 00/10] igb: offloading of receive filters
From: Vinicius Costa Gomes @ 2018-04-10 17:49 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: Vinicius Costa Gomes, jeffrey.t.kirsher, netdev,
	jesus.sanchez-palencia

Hi,

Known issue:
 - It seems that the QSEL bits in the RAH registers do not have
   any effect for source addresses (i.e. steering doesn't work for
   source address filters); everything points to a hardware (or
   documentation) issue;

Changes from v6:
 - Because the i210 doesn't support steering traffic per source
   address (only filtering works), an error is emitted when only a
   source address is specified in a classification rule;

Changes from v5:
 - Filters can now be added for local MAC addresses, when removed,
   they return to their initial configuration (thanks for the testing
   Aaron);

Changes from v4:
 - Added a new bit to the MAC address filters internal
 representation to mean that some filters are steering filters (i.e.
 they direct traffic to queues);
 - And, this is only supported in i210;

Changes from v3:
 - Addressed review comments from Aaron F. Brown and
   Jakub Kicinski;

Changes from v2:
 - Addressed review comments from Jakub Kicinski, mostly about coding
   style adjustments and more consistent error reporting;

Changes from v1:
 - Addressed review comments from Alexander Duyck and Florian
   Fainelli;
 - Adding and removing cls_flower filters are now proposed in the same
   patch;
 - cls_flower filters are kept in a separated list from "ethtool"
   filters (so that section of the original cover letter is no longer
   valid);
 - The patch adding support for ethtool filters is now independent from
   the rest of the series;

Original cover letter:

This series enables some ethtool and tc-flower filters to be offloaded
to igb-based network controllers. This is useful when the system
configurator wants to steer certain kinds of traffic to a specific
hardware queue.

The first two commits are bug fixes.

The basis of this series is to export the internal API used to
configure address filters, so it can be used by ethtool, and to
extend the functionality so a source address can be handled.

Then, we enable the tc-flower offloading implementation to re-use the
same infrastructure as ethtool, storing its filters in the per-adapter
"nfc" (Network Filter Config?) list. For consistency, destructive
access is kept separate, i.e. a filter added by tc-flower can only be
removed by tc-flower, but ethtool can read them all.

Only support for VLAN Prio, Source and Destination MAC Address, and
Ethertype is enabled for now.

Open question:
  - igb is initialized with the number of traffic classes as 1, if we
  want to use multiple traffic classes we need to increase this value,
  the only way I could find is to use mqprio (for example). Should igb
  be initialized with, say, the number of queues as its "num_tc"?

--

Vinicius Costa Gomes (10):
  igb: Fix not adding filter elements to the list
  igb: Fix queue selection on MAC filters on i210
  igb: Enable the hardware traffic class feature bit for igb models
  igb: Add support for MAC address filters specifying source addresses
  igb: Add support for enabling queue steering in filters
  igb: Allow filters to be added for the local MAC address
  igb: Enable nfc filters to specify MAC addresses
  igb: Add MAC address support for ethtool nftuple filters
  igb: Add the skeletons for tc-flower offloading
  igb: Add support for adding offloaded clsflower filters

 .../net/ethernet/intel/igb/e1000_defines.h    |   2 +
 drivers/net/ethernet/intel/igb/igb.h          |  13 +
 drivers/net/ethernet/intel/igb/igb_ethtool.c  |  73 +++-
 drivers/net/ethernet/intel/igb/igb_main.c     | 370 +++++++++++++++++-
 4 files changed, 441 insertions(+), 17 deletions(-)

--
2.17.0

^ permalink raw reply

* Re: [PATCH] ibmvnic: Disable irqs before exiting reset from closed state
From: David Miller @ 2018-04-10 17:39 UTC (permalink / raw)
  To: jallen; +Cc: netdev

John, if you want this to be submitted to stable you'll need to provide
backports for the various -stable trees and this patch doesn't even
come close to applying cleanly to v4.16, for example.

Thank you.

^ permalink raw reply

* Re: [PATCH] net: wireless: b43legacy: Replace GFP_ATOMIC with GFP_KERNEL in dma_tx_fragment
From: Michael Büsch @ 2018-04-10 16:57 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: Larry.Finger, kvalo, netdev, linux-wireless, b43-dev,
	linux-kernel
In-Reply-To: <1523368459-32128-1-git-send-email-baijiaju1990@gmail.com>

On Tue, 10 Apr 2018 21:54:19 +0800
Jia-Ju Bai <baijiaju1990@gmail.com> wrote:

> dma_tx_fragment() is never called in atomic context.
> 
> dma_tx_fragment() is only called by b43legacy_dma_tx(), which is 
> only called by b43legacy_tx_work().
> b43legacy_tx_work() is only set as a parameter of INIT_WORK() in 
> b43legacy_wireless_init().
> 
> Despite never getting called from atomic context,
> dma_tx_fragment() calls alloc_skb() with GFP_ATOMIC,
> which does not sleep for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> which can sleep and improve the possibility of successful allocation.
> 
> This was found by a static analysis tool named DCNS, written by myself.
> I also checked it manually.
> 
> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
> ---
>  drivers/net/wireless/broadcom/b43legacy/dma.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wireless/broadcom/b43legacy/dma.c b/drivers/net/wireless/broadcom/b43legacy/dma.c
> index cfa617d..2f0c64c 100644
> --- a/drivers/net/wireless/broadcom/b43legacy/dma.c
> +++ b/drivers/net/wireless/broadcom/b43legacy/dma.c
> @@ -1064,7 +1064,7 @@ static int dma_tx_fragment(struct b43legacy_dmaring *ring,
>  	meta->dmaaddr = map_descbuffer(ring, skb->data, skb->len, 1);
>  	/* create a bounce buffer in zone_dma on mapping failure. */
>  	if (b43legacy_dma_mapping_error(ring, meta->dmaaddr, skb->len, 1)) {
> -		bounce_skb = alloc_skb(skb->len, GFP_ATOMIC | GFP_DMA);
> +		bounce_skb = alloc_skb(skb->len, GFP_KERNEL | GFP_DMA);
>  		if (!bounce_skb) {
>  			ring->current_slot = old_top_slot;
>  			ring->used_slots = old_used_slots;


Ack.
I think the GFP_ATOMIC came from the days where we did DMA operations
under spinlock instead of mutex.

The same thing can be done in b43.

Also 
setup_rx_descbuffer(ring, desc, meta, GFP_ATOMIC)
could be GFP_KERNEL in dma_rx().
This function is called from IRQ thread context.

-- 
Michael

^ permalink raw reply

* Re: [PATCH] net: bridge: add missing NULL checks
From: Laszlo Toth @ 2018-04-10 17:22 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Laszlo Toth, Stephen Hemminger, David S. Miller, bridge, netdev
In-Reply-To: <3db78abd-bc13-528f-9da3-cb9a0ce0f617@cumulusnetworks.com>

On Mon, Apr 09, 2018 at 01:25:41AM +0300, Nikolay Aleksandrov wrote:
> On 08/04/18 20:49, Laszlo Toth wrote:
> >br_port_get_rtnl() can return NULL
> >
> >Signed-off-by: Laszlo Toth <laszlth@gmail.com>
> >---
> >  net/bridge/br_netlink.c | 12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> >
> 
> Nacked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> More below.
> 
> >diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> >index 015f465c..cbec11f 100644
> >--- a/net/bridge/br_netlink.c
> >+++ b/net/bridge/br_netlink.c
> >@@ -939,14 +939,17 @@ static int br_port_slave_changelink(struct net_device *brdev,
> >  				    struct nlattr *data[],
> >  				    struct netlink_ext_ack *extack)
> >  {
> >+	struct net_bridge_port *port = br_port_get_rtnl(dev);
> >  	struct net_bridge *br = netdev_priv(brdev);
> >  	int ret;
> >  	if (!data)
> >  		return 0;
> >+	if (!port)
> >+		return -EINVAL;
> 
> If we're here, it means the master device of dev is a bridge => dev is a bridge port,
> since we're running with RTNL that cannot change, so this check is unnecessary.
> 
> Have you actually hit a bug with this code ?
> 
> >  	spin_lock_bh(&br->lock);
> >-	ret = br_setport(br_port_get_rtnl(dev), data);
> >+	ret = br_setport(port, data);
> >  	spin_unlock_bh(&br->lock);
> >  	return ret;
> >@@ -956,7 +959,12 @@ static int br_port_fill_slave_info(struct sk_buff *skb,
> >  				   const struct net_device *brdev,
> >  				   const struct net_device *dev)
> >  {
> >-	return br_port_fill_attrs(skb, br_port_get_rtnl(dev));
> >+	struct net_bridge_port *port = br_port_get_rtnl(dev);
> >+
> >+	if (!port)
> >+		return -EINVAL;
> >+
> >+	return br_port_fill_attrs(skb, port);
> 
> Same rationale here, fill_slave_info is called via a master device's ops
> under RTNL, which means dev is a bridge port and that also cannot change.
> 
> If you have hit a bug with this code, can we see the trace ?
> The problem might be elsewhere.

There was a NULL dereference in br_port_fill_attrs(), but on a much
older release w/ a probably buggy and custom driver,
so there is no real problem to trace.
Anyway I thought I'd make a quick patch from it, but you're right,
it's pointless to validate twice. 

Please just ignore the patch.

Laszlo

> 
> Thanks,
>  Nik
> 
> >  }
> >  static size_t br_port_get_slave_size(const struct net_device *brdev,
> >
> 

^ permalink raw reply

* Re: [RFC net-next 2/2] net: net-sysfs: Reduce netstat_show read_lock critical section
From: David Miller @ 2018-04-10 17:17 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20180410170812.18905-2-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue, 10 Apr 2018 10:08:12 -0700

> Instead of holding the device chain read_lock across the call to
> dev_get_stats, hold it only to check dev_isalive. If the dev is alive,
> take a reference via dev_hold, then release the read_lock.
> 
> When done handling the device, dev_put it.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

My feedback here is the same as for patch #1.

Two atomics for a shorter RCU lock hold time is not that great of
a tradeoff.

^ permalink raw reply

* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: David Miller @ 2018-04-10 17:16 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20180410170812.18905-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Tue, 10 Apr 2018 10:08:11 -0700

> The current net proc fs sequence file implementation to show current
> namespace netdevs list statistics and mc lists holds the rcu lock
> throughout the whole process, from dev seq start up to dev seq stop.
> 
> This is really greedy and demanding on device drivers, since
> ndo_get_stats64 is called from dev_seq_show while the rcu lock is held.
> 
> The rcu lock is needed to guarantee that device chain is not modified
> while the dev sequence file is walking through it and handling the
> netdev at the same time.
> 
> To minimize this critical section and drastically reduce the time the rcu
> lock is held, all we need is to grab the rcu lock only for the brief
> moment where we are looking for the next netdev to handle. If one is found,
> dev_hold it to guarantee it is kept alive while accessed later in the seq
> show callback, and release the rcu lock immediately.
> 
> The current netdev being handled will be released via dev_put when the seq
> next callback is called or dev seq stop is called.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

The tradeoff here is that now you are doing two unnecessary atomic
operations per stats dump.

That is what the RCU lock allows us to avoid.

^ permalink raw reply

* [RFC net-next 2/2] net: net-sysfs: Reduce netstat_show read_lock critical section
From: Saeed Mahameed @ 2018-04-10 17:08 UTC (permalink / raw)
  To: netdev; +Cc: Saeed Mahameed
In-Reply-To: <20180410170812.18905-1-saeedm@mellanox.com>

Instead of holding the device chain read_lock across the call to
dev_get_stats, hold it only to check dev_isalive. If the dev is alive,
take a reference via dev_hold, then release the read_lock.

When done handling the device, dev_put it.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 net/core/net-sysfs.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c476f0794132..ee6f9fed43df 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -563,13 +563,20 @@ static ssize_t netstat_show(const struct device *d,
 		offset % sizeof(u64) != 0);
 
 	read_lock(&dev_base_lock);
-	if (dev_isalive(dev)) {
+	if (dev_isalive(dev))
+		dev_hold(dev);
+	else
+		dev = NULL;
+	read_unlock(&dev_base_lock);
+
+	if (dev) {
 		struct rtnl_link_stats64 temp;
 		const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
 
+		dev_put(dev);
 		ret = sprintf(buf, fmt_u64, *(u64 *)(((u8 *)stats) + offset));
 	}
-	read_unlock(&dev_base_lock);
+
 	return ret;
 }
 
-- 
2.14.3

^ permalink raw reply related

* [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Saeed Mahameed @ 2018-04-10 17:08 UTC (permalink / raw)
  To: netdev; +Cc: Saeed Mahameed

The current net proc fs sequence file implementation to show current
namespace netdevs list statistics and mc lists holds the rcu lock
throughout the whole process, from dev seq start up to dev seq stop.

This is really greedy and demanding on device drivers, since
ndo_get_stats64 is called from dev_seq_show while the rcu lock is held.

The rcu lock is needed to guarantee that device chain is not modified
while the dev sequence file is walking through it and handling the
netdev at the same time.

To minimize this critical section and drastically reduce the time the rcu
lock is held, all we need is to grab the rcu lock only for the brief
moment where we are looking for the next netdev to handle. If one is found,
dev_hold it to guarantee it is kept alive while accessed later in the seq
show callback, and release the rcu lock immediately.

The current netdev being handled will be released via dev_put when the seq
next callback is called or dev seq stop is called.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 net/core/net-procfs.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 9737302907b1..9d5ce6a203d2 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -31,19 +31,24 @@ static inline struct net_device *dev_from_same_bucket(struct seq_file *seq, loff

 static inline struct net_device *dev_from_bucket(struct seq_file *seq, loff_t *pos)
 {
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	unsigned int bucket;

+	rcu_read_lock();
 	do {
 		dev = dev_from_same_bucket(seq, pos);
-		if (dev)
-			return dev;
+		if (dev) {
+			dev_hold(dev);
+			goto unlock;
+		}

 		bucket = get_bucket(*pos) + 1;
 		*pos = set_bucket_offset(bucket, 1);
 	} while (bucket < NETDEV_HASHENTRIES);

-	return NULL;
+unlock:
+	rcu_read_unlock();
+	return dev;
 }

 /*
@@ -51,9 +56,7 @@ static inline struct net_device *dev_from_bucket(struct seq_file *seq, loff_t *p
  *	in detail.
  */
 static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
 {
-	rcu_read_lock();
 	if (!*pos)
 		return SEQ_START_TOKEN;

@@ -66,13 +69,15 @@ static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
 static void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	++*pos;
+	if (v && v != SEQ_START_TOKEN)
+		dev_put(v);
 	return dev_from_bucket(seq, pos);
 }

 static void dev_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
 {
-	rcu_read_unlock();
+	if (v && v != SEQ_START_TOKEN)
+		dev_put(v);
 }

 static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
-- 
2.14.3

^ permalink raw reply related

* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Yonghong Song @ 2018-04-10 16:58 UTC (permalink / raw)
  To: Quentin Monnet, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <20180410144157.4831-8-quentin.monnet@netronome.com>



On 4/10/18 7:41 AM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions:
> 
> Helpers from Lawrence:
> - bpf_setsockopt()
> - bpf_getsockopt()
> - bpf_sock_ops_cb_flags_set()
> 
> Helpers from Yonghong:
> - bpf_perf_event_read_value()
> - bpf_perf_prog_read_value()
> 
> Helper from Josef:
> - bpf_override_return()
> 
> Helper from Andrey:
> - bpf_bind()
> 
> Cc: Lawrence Brakmo <brakmo@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: Andrey Ignatov <rdna@fb.com>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>   include/uapi/linux/bpf.h | 184 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 184 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 15d9ccafebbe..7343af4196c8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1208,6 +1208,28 @@ union bpf_attr {
>    * 	Return
>    * 		0
>    *
> + * int bpf_setsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
> + * 	Description
> + * 		Emulate a call to **setsockopt()** on the socket associated to
> + * 		*bpf_socket*, which must be a full socket. The *level* at
> + * 		which the option resides and the name *optname* of the option
> + * 		must be specified, see **setsockopt(2)** for more information.
> + * 		The option value of length *optlen* is pointed by *optval*.
> + *
> + * 		This helper actually implements a subset of **setsockopt()**.
> + * 		It supports the following *level*\ s:
> + *
> + * 		* **SOL_SOCKET**, which supports the following *optname*\ s:
> + * 		  **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**,
> + * 		  **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**.
> + * 		* **IPPROTO_TCP**, which supports the following *optname*\ s:
> + * 		  **TCP_CONGESTION**, **TCP_BPF_IW**,
> + * 		  **TCP_BPF_SNDCWND_CLAMP**.
> + * 		* **IPPROTO_IP**, which supports *optname* **IP_TOS**.
> + * 		* **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
>    * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
>    * 	Description
>    * 		Grow or shrink the room for data in the packet associated to
> @@ -1255,6 +1277,168 @@ union bpf_attr {
>    * 		performed again.
>    * 	Return
>    * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
> + * 	Description
> + * 		Read the value of a perf event counter, and store it into *buf*
> + * 		of size *buf_size*. This helper relies on a *map* of type
> + * 		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
> + * 		event counter is selected at the creation of the *map*. The

The nature of the perf event counter is selected when *map* is updated 
with perf_event fd's.

> + * 		*map* is an array whose size is the number of available CPU
> + * 		cores, and each cell contains a value relative to one core. The

It is confusing to mix core/cpu here. Maybe just use perf_event 
convention, always using cpu?

> + * 		value to retrieve is indicated by *flags*, that contains the
> + * 		index of the core to look up, masked with
> + * 		**BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
> + * 		**BPF_F_CURRENT_CPU** to indicate that the value for the
> + * 		current CPU core should be retrieved.
> + *
> + * 		This helper behaves in a way close to
> + * 		**bpf_perf_event_read**\ () helper, save that instead of
> + * 		just returning the value observed, it fills the *buf*
> + * 		structure. This allows for additional data to be retrieved: in
> + * 		particular, the enabled and running times (in *buf*\
> + * 		**->enabled** and *buf*\ **->running**, respectively) are
> + * 		copied.
> + *
> + * 		These values are interesting, because hardware PMU (Performance
> + * 		Monitoring Unit) counters are limited resources. When there are
> + * 		more PMU based perf events opened than available counters,
> + * 		kernel will multiplex these events so each event gets certain
> + * 		percentage (but not all) of the PMU time. In case that
> + * 		multiplexing happens, the number of samples or counter value
> + * 		will not reflect the case compared to when no multiplexing
> + * 		occurs. This makes comparison between different runs difficult.
> + * 		Typically, the counter value should be normalized before
> + * 		comparing to other experiments. The usual normalization is done
> + * 		as follows.
> + *
> + * 		::
> + *
> + * 			normalized_counter = counter * t_enabled / t_running
> + *
> + * 		Where t_enabled is the time enabled for event and t_running is
> + * 		the time running for event since last normalization. The
> + * 		enabled and running times are accumulated since the perf event
> + * 		open. To achieve scaling factor between two invocations of an
> + * 		eBPF program, users can use CPU id as the key (which is
> + * 		typical for perf array usage model) to remember the previous
> + * 		value and do the calculation inside the eBPF program.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_perf_prog_read_value(struct bpf_perf_event_data_kern *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
> + * 	Description
> + * 		For an eBPF program attached to a perf event, retrieve the
> + * 		value of the event counter associated to *ctx* and store it in
> + * 		the structure pointed by *buf* and of size *buf_size*. Enabled
> + * 		and running times are also stored in the structure (see
> + * 		description of helper **bpf_perf_event_read_value**\ () for
> + * 		more details).
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_getsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
> + * 	Description
> + * 		Emulate a call to **getsockopt()** on the socket associated to
> + * 		*bpf_socket*, which must be a full socket. The *level* at
> + * 		which the option resides and the name *optname* of the option
> + * 		must be specified, see **getsockopt(2)** for more information.
> + * 		The retrieved value is stored in the structure pointed by
> + * 		*optval* and of length *optlen*.
> + *
> + * 		This helper actually implements a subset of **getsockopt()**.
> + * 		It supports the following *level*\ s:
> + *
> + * 		* **IPPROTO_TCP**, which supports *optname*
> + * 		  **TCP_CONGESTION**.
> + * 		* **IPPROTO_IP**, which supports *optname* **IP_TOS**.
> + * 		* **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_override_return(struct pt_regs *regs, u64 rc)
> + * 	Description
> + * 		Used for error injection, this helper uses kprobes to override
> + * 		the return value of the probed function, and to set it to *rc*.
> + * 		The first argument is the context *regs* on which the kprobe
> + * 		works.
> + *
> + * 		This helper works by setting the PC (program counter)
> + * 		to an override function which is run in place of the original
> + * 		probed function. This means the probed function is not run at
> + * 		all. The replacement function just returns with the required
> + * 		value.
> + *
> + * 		This helper has security implications, and thus is subject to
> + * 		restrictions. It is only available if the kernel was compiled
> + * 		with the **CONFIG_BPF_KPROBE_OVERRIDE** configuration
> + * 		option, and in this case it only works on functions tagged with
> + * 		**ALLOW_ERROR_INJECTION** in the kernel code.
> + *
> + * 		Also, the helper is only available for the architectures having
> + * 		the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing,
> + * 		x86 architecture is the only one to support this feature.
> + * 	Return
> + * 		0
> + *
> + * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops_kern *bpf_sock, int argval)
> + * 	Description
> + * 		Attempt to set the value of the **bpf_sock_ops_cb_flags** field
> + * 		for the full TCP socket associated to *bpf_sock* to
> + * 		*argval*.
> + *
> + * 		The primary use of this field is to determine if there should
> + * 		be calls to eBPF programs of type
> + * 		**BPF_PROG_TYPE_SOCK_OPS** at various points in the TCP
> + * 		code. A program of the same type can change its value, per
> + * 		connection and as necessary, when the connection is
> + * 		established. This field is directly accessible for reading, but
> + * 		this helper must be used for updates in order to return an
> + * 		error if an eBPF program tries to set a callback that is not
> + * 		supported in the current kernel.
> + *
> + * 		The supported callback values that *argval* can combine are:
> + *
> + * 		* **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out)
> + * 		* **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission)
> + * 		* **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change)
> + *
> + * 		Here are some examples of where one could call such eBPF
> + * 		program:
> + *
> + * 		* When RTO fires.
> + * 		* When a packet is retransmitted.
> + * 		* When the connection terminates.
> + * 		* When a packet is sent.
> + * 		* When a packet is received.
> + * 	Return
> + * 		Code **-EINVAL** if the socket is not a full TCP socket;
> + * 		otherwise, a positive number containing the bits that could not
> + * 		be set is returned (which comes down to 0 if all bits were set
> + * 		as required).
> + *
> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
> + * 	Description
> + * 		Bind the socket associated to *ctx* to the address pointed by
> + * 		*addr*, of length *addr_len*. This allows for making outgoing
> + * 		connection from the desired IP address, which can be useful for
> + * 		example when all processes inside a cgroup should use one
> + * 		single IP address on a host that has multiple IPs configured.
> + *
> + * 		This helper works for IPv4 and IPv6, TCP and UDP sockets. The
> + * 		domain (*addr*\ **->sa_family**) must be **AF_INET** (or
> + * 		**AF_INET6**). Looking for a free port to bind to can be
> + * 		expensive, therefore binding to port is not permitted by the
> + * 		helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
> + * 		must be set to zero.
> + *
> + * 		As for the remote end, both parts of it can be overridden,
> + * 		remote IP and remote port. This can be useful if an application
> + * 		inside a cgroup wants to connect to another application inside
> + * 		the same cgroup or to itself, but knows nothing about the IP
> + * 		address assigned to the cgroup.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> 
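
For the normalization described in the bpf_perf_event_read_value() text, a
small worked sketch may help. It is plain userspace C, not eBPF: struct
perf_sample is a hypothetical stand-in holding the counter, enabled and
running values, and scaled_delta() is an illustrative helper name. Because
enabled and running times accumulate since the event was opened, the sketch
scales the difference between two snapshots rather than the raw totals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the three values the helper fills in. */
struct perf_sample {
	uint64_t counter;	/* raw counter value */
	uint64_t enabled;	/* accumulated time enabled */
	uint64_t running;	/* accumulated time actually counting */
};

/*
 * Scale the counter delta between two snapshots:
 *   delta_counter * delta_enabled / delta_running
 * Returns 0.0 when the event did not run at all in the interval.
 */
static double scaled_delta(const struct perf_sample *prev,
			   const struct perf_sample *cur)
{
	uint64_t d_counter, d_enabled, d_running;

	/* Snapshots are cumulative, so cur must not be older than prev. */
	assert(cur->counter >= prev->counter);
	assert(cur->enabled >= prev->enabled);
	assert(cur->running >= prev->running);

	d_counter = cur->counter - prev->counter;
	d_enabled = cur->enabled - prev->enabled;
	d_running = cur->running - prev->running;

	if (d_running == 0)
		return 0.0;

	return (double)d_counter * (double)d_enabled / (double)d_running;
}
```

For example, if the counter advanced by 2000 while the event was enabled for
200 time units but only running for 100, the scaled value is
2000 * 200 / 100 = 4000.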

^ permalink raw reply
