Netdev List

Netdev List
 help / color / mirror / Atom feed

* [RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>

Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.

The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.

This patch contains descriptions for the following helper functions, all
writter by Alexei:

- bpf_get_current_pid_tgid()
- bpf_get_current_uid_gid()
- bpf_get_current_comm()
- bpf_skb_vlan_push()
- bpf_skb_vlan_pop()
- bpf_skb_get_tunnel_key()
- bpf_skb_set_tunnel_key()
- bpf_redirect()
- bpf_perf_event_output()
- bpf_get_stackid()
- bpf_get_current_task()

Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 include/uapi/linux/bpf.h | 237 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 237 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2bc653a3a20f..f3ea8824efbc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -580,6 +580,243 @@ union bpf_attr {
  * 		performed again.
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_get_current_pid_tgid(void)
+ * 	Return
+ * 		A 64-bit integer containing the current tgid and pid, and
+ * 		created as such:
+ * 		*current_task*\ **->tgid << 32 \|**
+ * 		*current_task*\ **->pid**.
+ *
+ * u64 bpf_get_current_uid_gid(void)
+ * 	Return
+ * 		A 64-bit integer containing the current GID and UID, and
+ * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
+ *
+ * int bpf_get_current_comm(char *buf, u32 size_of_buf)
+ * 	Description
+ * 		Copy the **comm** attribute of the current task into *buf* of
+ * 		*size_of_buf*. The **comm** attribute contains the name of
+ * 		the executable (excluding the path) for the current task. The
+ * 		*size_of_buf* must be strictly positive. On success, the
+ * 		helper makes sure that the *buf* is NUL-terminated. On failure,
+ * 		it is filled with zeroes.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
+ * 	Description
+ * 		Push a *vlan_tci* (VLAN tag control information) of protocol
+ * 		*vlan_proto* to the packet associated to *skb*, then update
+ * 		the checksum. Note that if *vlan_proto* is different from
+ * 		**ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
+ * 		be **ETH_P_8021Q**.
+ *
+ * 		A call to this helper is susceptible to change data from the
+ * 		packet. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_vlan_pop(struct sk_buff *skb)
+ * 	Description
+ * 		Pop a VLAN header from the packet associated to *skb*.
+ *
+ * 		A call to this helper is susceptible to change data from the
+ * 		packet. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * 	Description
+ * 		Get tunnel metadata. This helper takes a pointer *key* to an
+ * 		empty **struct bpf_tunnel_key** of **size**, that will be
+ * 		filled with tunnel metadata for the packet associated to *skb*.
+ * 		The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
+ * 		indicates that the tunnel is based on IPv6 protocol instead of
+ * 		IPv4.
+ *
+ * 		This is typically used on the receive path to perform a lookup
+ * 		or a packet redirection based on the value of *key*:
+ *
+ * 		::
+ *
+ * 			struct bpf_tunnel_key key = {};
+ * 			bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
+ * 			     lookup or redirect based on key ...
+ *
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
+ * 	Description
+ * 		Populate tunnel metadata for packet associated to *skb.* The
+ * 		tunnel metadata is set to the contents of *key*, of *size*. The
+ * 		*flags* can be set to a combination of the following values:
+ *
+ * 		**BPF_F_TUNINFO_IPV6**
+ * 			Indicate that the tunnel is based on IPv6 protocol
+ * 			instead of IPv4.
+ * 		**BPF_F_ZERO_CSUM_TX**
+ * 			For IPv4 packets, add a flag to tunnel metadata
+ * 			indicating that checksum computation should be skipped
+ * 			and checksum set to zeroes.
+ * 		**BPF_F_DONT_FRAGMENT**
+ * 			Add a flag to tunnel metadata indicating that the
+ * 			packet should not be fragmented.
+ * 		**BPF_F_SEQ_NUMBER**
+ * 			Add a flag to tunnel metadata indicating that a
+ * 			sequence number should be added to tunnel header before
+ * 			sending the packet. This flag was added for GRE
+ * 			encapsulation, but might be used with other protocols
+ * 			as well in the future.
+ *
+ * 		Here is a typical usage on the transmit path:
+ *
+ * 		::
+ *
+ * 			struct bpf_tunnel_key key;
+ * 			     populate key ...
+ * 			bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
+ * 			bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
+ *
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_redirect(u32 ifindex, u64 flags)
+ * 	Description
+ * 		Redirect the packet to another net device of index *ifindex*.
+ * 		This helper is somewhat similar to **bpf_clone_redirect**\
+ * 		(), except that the packet is not cloned, which provides
+ * 		increased performance.
+ *
+ * 		For hooks other than XDP, *flags* can be set to
+ * 		**BPF_F_INGRESS**, which indicates the packet is to be
+ * 		redirected to the ingress interface instead of (by default)
+ * 		egress. Currently, XDP does not support any flag.
+ * 	Return
+ * 		For XDP, the helper returns **XDP_REDIRECT** on success or
+ * 		**XDP_ABORT** on error. For other program types, the values
+ * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
+ * 		error.
+ *
+ * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
+ * 	Description
+ * 		Write perf raw sample into a perf event held by *map* of type
+ * 		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf event must
+ * 		have the following attributes: **PERF_SAMPLE_RAW** as
+ * 		**sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
+ * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
+ *
+ * 		The *flags* are used to indicate the index in *map* for which
+ * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
+ * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
+ * 		to indicate that the index of the current CPU core should be
+ * 		used.
+ *
+ * 		The value to write, of *size*, is passed through eBPF stack and
+ * 		pointed by *data*.
+ *
+ * 		The context of the program *ctx* needs also be passed to the
+ * 		helper, and will get interpreted as a pointer to a **struct
+ * 		pt_reg**.
+ *
+ * 		On user space, a program willing to read the values needs to
+ * 		call **perf_event_open**\ () on the perf event (either for
+ * 		one or for all CPUs) and to store the file descriptor into the
+ * 		*map*. This must be done before the eBPF program can send data
+ * 		into it. An example is available in file
+ * 		*samples/bpf/trace_output_user.c* in the Linux kernel source
+ * 		tree (the eBPF program counterpart is in
+ * 		*samples/bpf/trace_output_kern.c*). It looks like the
+ * 		following snippet:
+ *
+ * 		::
+ *
+ * 			volatile struct perf_event_mmap_page *header;
+ * 			struct perf_event_attr attr = {
+ * 			        .sample_type = PERF_SAMPLE_RAW,
+ * 			        .type = PERF_TYPE_SOFTWARE,
+ * 			        .config = PERF_COUNT_SW_BPF_OUTPUT,
+ * 			};
+ * 			int page_size;
+ * 			int mmap_size;
+ * 			int key = 0;
+ * 			int pmu_fd;
+ * 			void *base;
+ * 			
+ * 			if (load_bpf_file(filename))
+ * 			        return -1;
+ * 			
+ * 			pmu_fd = sys_perf_event_open(&attr,
+ * 			                             -1, // pid
+ * 			                              0, // cpu
+ * 			                             -1, // group_fd
+ * 			                              0);
+ * 			
+ * 			assert(pmu_fd >= 0);
+ * 			assert(bpf_map_update_elem(map_fd[0], &key,
+ * 			                           &pmu_fd, BPF_ANY) == 0);
+ * 			assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
+ * 			
+ * 			page_size = getpagesize();
+ * 			mmap_size = page_size * (page_cnt + 1);
+ * 			
+ * 			base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
+ * 			            MAP_SHARED, fd, 0);
+ * 			if (base == MAP_FAILED)
+ * 			        return -1;
+ * 			
+ * 			header = base;
+ *
+ * 		**bpf_perf_event_output**\ () achieves better performance
+ * 		than **bpf_trace_printk**\ () for sharing data with user
+ * 		space, and is much better suitable for streaming data from eBPF
+ * 		programs.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
+ * 	Description
+ * 		Walk a user or a kernel stack and return its id. To achieve
+ * 		this, the helper needs *ctx*, which is a pointer to the context
+ * 		on which the tracing program is executed, and a pointer to a
+ * 		*map* of type **BPF_MAP_TYPE_STACK_TRACE**.
+ *
+ * 		The last argument, *flags*, holds the number of stack frames to
+ * 		skip (from 0 to 255), masked with
+ * 		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ * 		a combination of the following flags:
+ *
+ * 		**BPF_F_USER_STACK**
+ * 			Collect a user space stack instead of a kernel stack.
+ * 		**BPF_F_FAST_STACK_CMP**
+ * 			Compare stacks by hash only.
+ * 		**BPF_F_REUSE_STACKID**
+ * 			If two different stacks hash into the same *stackid*,
+ * 			discard the old one.
+ *
+ * 		The stack id retrieved is a 32 bit long integer handle which
+ * 		can be further combined with other data (including other stack
+ * 		ids) and used as a key into maps. This can be useful for
+ * 		generating a variety of graphs (such as flame graphs or off-cpu
+ * 		graphs).
+ *
+ * 		For walking a stack, this helper is an improvement over
+ * 		**bpf_probe_read**\ (), which can be used with unrolled loops
+ * 		but is not efficient and consumes a lot of eBPF instructions.
+ * 		Instead, **bpf_get_stackid**\ () can collect up to
+ * 		**PERF_MAX_STACK_DEPTH** both kernel and user frames.
+ * 	Return
+ * 		The positive or null stack id on success, or a negative error
+ * 		in case of failure.
+ *
+ * u64 bpf_get_current_task(void)
+ * 	Return
+ * 		A pointer to the current task struct.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
-- 
2.14.1

^ permalink raw reply related

* [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>

Add documentation for eBPF helper functions to bpf.h user header file.
This documentation can be parsed with the Python script provided in
another commit of the patch series, in order to provide a RST document
that can later be converted into a man page.

The objective is to make the documentation easily understandable and
accessible to all eBPF developers, including beginners.

This patch contains descriptions for the following helper functions, all
written by Alexei:

- bpf_map_lookup_elem()
- bpf_map_update_elem()
- bpf_map_delete_elem()
- bpf_probe_read()
- bpf_ktime_get_ns()
- bpf_trace_printk()
- bpf_skb_store_bytes()
- bpf_l3_csum_replace()
- bpf_l4_csum_replace()
- bpf_tail_call()
- bpf_clone_redirect()

Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 include/uapi/linux/bpf.h | 199 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 199 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 45f77f01e672..2bc653a3a20f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -381,6 +381,205 @@ union bpf_attr {
  * intentional, removing them would break paragraphs for rst2man.
  *
  * Start of BPF helper function descriptions:
+ *
+ * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
+ * 	Description
+ * 		Perform a lookup in *map* for an entry associated to *key*.
+ * 	Return
+ * 		Map value associated to *key*, or **NULL** if no entry was
+ * 		found.
+ *
+ * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
+ * 	Description
+ * 		Add or update the value of the entry associated to *key* in
+ * 		*map* with *value*. *flags* is one of:
+ *
+ * 		**BPF_NOEXIST**
+ * 			The entry for *key* must not exist in the map.
+ * 		**BPF_EXIST**
+ * 			The entry for *key* must already exist in the map.
+ * 		**BPF_ANY**
+ * 			No condition on the existence of the entry for *key*.
+ *
+ * 		These flags are only useful for maps of type
+ * 		**BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
+ * 		should be used.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_delete_elem(struct bpf_map *map, void *key)
+ * 	Description
+ * 		Delete entry with *key* from *map*.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_probe_read(void *dst, u32 size, const void *src)
+ * 	Description
+ * 		For tracing programs, safely attempt to read *size* bytes from
+ * 		address *src* and store the data in *dst*.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_ktime_get_ns(void)
+ * 	Description
+ * 		Return the time elapsed since system boot, in nanoseconds.
+ * 	Return
+ * 		Current *ktime*.
+ *
+ * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
+ * 	Description
+ * 		This helper is a "printk()-like" facility for debugging. It
+ * 		prints a message defined by format *fmt* (of size *fmt_size*)
+ * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
+ * 		available. It can take up to three additional **u64**
+ * 		arguments (as an eBPF helpers, the total number of arguments is
+ * 		limited to five). Each time the helper is called, it appends a
+ * 		line that looks like the following:
+ *
+ * 		::
+ *
+ * 			telnet-470   [001] .N.. 419421.045894: 0x00000001: BPF command: 2
+ *
+ * 		In the above:
+ *
+ * 			* ``telnet`` is the name of the current task.
+ * 			* ``470`` is the PID of the current task.
+ * 			* ``001`` is the CPU number on which the task is
+ * 			  running.
+ * 			* In ``.N..``, each character refers to a set of
+ * 			  options (whether irqs are enabled, scheduling
+ * 			  options, whether hard/softirqs are running, level of
+ * 			  preempt_disabled respectively). **N** means that
+ * 			  **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
+ * 			  are set.
+ * 			* ``419421.045894`` is a timestamp.
+ * 			* ``0x00000001`` is a fake value used by BPF for the
+ * 			  instruction pointer register.
+ * 			* ``BPF command: 2`` is the message formatted with
+ * 			  *fmt*.
+ *
+ * 		The conversion specifiers supported by *fmt* are similar, but
+ * 		more limited than for printk(). They are **%d**, **%i**,
+ * 		**%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
+ * 		**%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
+ * 		of field, padding with zeroes, etc.) is available, and the
+ * 		helper will silently fail if it encounters an unknown
+ * 		specifier.
+ *
+ * 		Also, note that **bpf_trace_printk**\ () is slow, and should
+ * 		only be used for debugging purposes. For passing values to user
+ * 		space, perf events should be preferred.
+ * 	Return
+ * 		The number of bytes written to the buffer, or a negative error
+ * 		in case of failure.
+ *
+ * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
+ * 	Description
+ * 		Store *len* bytes from address *from* into the packet
+ * 		associated to *skb*, at *offset*. *flags* are a combination of
+ * 		**BPF_F_RECOMPUTE_CSUM** (automatically recompute the
+ * 		checksum for the packet after storing the bytes) and
+ * 		**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
+ * 		**->swhash** and *skb*\ **->l4hash** to 0).
+ *
+ * 		A call to this helper is susceptible to change data from the
+ * 		packet. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
+ * 	Description
+ * 		Recompute the IP checksum for the packet associated to *skb*.
+ * 		Computation is incremental, so the helper must know the former
+ * 		value of the header field that was modified (*from*), the new
+ * 		value of this field (*to*), and the number of bytes (2 or 4)
+ * 		for this field, stored in *size*. Alternatively, it is possible
+ * 		to store the difference between the previous and the new values
+ * 		of the header field in *to*, by setting *from* and *size* to 0.
+ * 		For both methods, *offset* indicates the location of the IP
+ * 		checksum within the packet.
+ *
+ * 		A call to this helper is susceptible to change data from the
+ * 		packet. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
+ * 	Description
+ * 		Recompute the TCP or UDP checksum for the packet associated to
+ * 		*skb*. Computation is incremental, so the helper must know the
+ * 		former value of the header field that was modified (*from*),
+ * 		the new value of this field (*to*), and the number of bytes (2
+ * 		or 4) for this field, stored on the lowest four bits of
+ * 		*flags*. Alternatively, it is possible to store the difference
+ * 		between the previous and the new values of the header field in
+ * 		*to*, by setting *from* and the four lowest bits of *flags* to
+ * 		0. For both methods, *offset* indicates the location of the IP
+ * 		checksum within the packet. In addition to the size of the
+ * 		field, *flags* can be added (bitwise OR) actual flags. With
+ * 		**BPF_F_MARK_MANGLED_0**, a null checksum is left untouched
+ * 		(unless **BPF_F_MARK_ENFORCE** is added as well), and for
+ * 		updates resulting in a null checksum the value is set to
+ * 		**CSUM_MANGLED_0** instead. Flag **BPF_F_PSEUDO_HDR**
+ * 		indicates the checksum is to be computed against a
+ * 		pseudo-header.
+ *
+ * 		A call to this helper is susceptible to change data from the
+ * 		packet. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
+ * 	Description
+ * 		This special helper is used to trigger a "tail call", or in
+ * 		other words, to jump into another eBPF program. The contents of
+ * 		eBPF registers and stack are not modified, the new program
+ * 		"inherits" them from the caller. This mechanism allows for
+ * 		program chaining, either for raising the maximum number of
+ * 		available eBPF instructions, or to execute given programs in
+ * 		conditional blocks. For security reasons, there is an upper
+ * 		limit to the number of successive tail calls that can be
+ * 		performed.
+ *
+ * 		Upon call of this helper, the program attempts to jump into a
+ * 		program referenced at index *index* in *prog_array_map*, a
+ * 		special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
+ * 		*ctx*, a pointer to the context.
+ *
+ * 		If the call succeeds, the kernel immediately runs the first
+ * 		instruction of the new program. This is not a function call,
+ * 		and it never goes back to the previous program. If the call
+ * 		fails, then the helper has no effect, and the caller continues
+ * 		to run its own instructions. A call can fail if the destination
+ * 		program for the jump does not exist (i.e. *index* is superior
+ * 		to the number of entries in *prog_array_map*), or if the
+ * 		maximum number of tail calls has been reached for this chain of
+ * 		programs. This limit is defined in the kernel by the macro
+ * 		**MAX_TAIL_CALL_CNT** (not accessible to user space), which
+ * 		is currently set to 32.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
+ * 	Description
+ * 		Clone and redirect the packet associated to *skb* to another
+ * 		net device of index *ifindex*. The only flag supported for now
+ * 		is **BPF_F_INGRESS**, which indicates the packet is to be
+ * 		redirected to the ingress interface instead of (by default)
+ * 		egress.
+ *
+ * 		A call to this helper is susceptible to change data from the
+ * 		packet. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again.
+ * 	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
-- 
2.14.1

^ permalink raw reply related

* [RFC bpf-next v2 1/8] bpf: add script and prepare bpf.h for new helpers documentation
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man
In-Reply-To: <20180410144157.4831-1-quentin.monnet@netronome.com>

Remove previous "overview" of eBPF helpers from user bpf.h header.
Replace it by a comment explaining how to process the new documentation
(to come in following patches) with a Python script to produce RST, then
man page documentation.

Also add the aforementioned Python script under scripts/. It is used to
process include/uapi/linux/bpf.h and to extract helper descriptions, to
turn it into a RST document that can further be processed with rst2man
to produce a man page. The script takes one "--filename <path/to/file>"
option. If the script is launched from scripts/ in the kernel root
directory, it should be able to find the location of the header to
parse, and "--filename <path/to/file>" is then optional. If it cannot
find the file, then the option becomes mandatory. RST-formatted
documentation is printed to standard output.

Typical workflow for producing the final man page would be:

    $ ./scripts/bpf_helpers_doc.py \
            --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
    $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
    $ man /tmp/bpf-helpers.7

Note that the tool kernel-doc cannot be used to document eBPF helpers,
whose signatures are not available directly in the header files
(pre-processor directives are used to produce them at the beginning of
the compilation process).

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 include/uapi/linux/bpf.h   | 406 ++------------------------------------------
 scripts/bpf_helpers_doc.py | 414 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 430 insertions(+), 390 deletions(-)
 create mode 100755 scripts/bpf_helpers_doc.py

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..45f77f01e672 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -365,396 +365,22 @@ union bpf_attr {
 	} raw_tracepoint;
 } __attribute__((aligned(8)));
 
-/* BPF helper function descriptions:
- *
- * void *bpf_map_lookup_elem(&map, &key)
- *     Return: Map value or NULL
- *
- * int bpf_map_update_elem(&map, &key, &value, flags)
- *     Return: 0 on success or negative error
- *
- * int bpf_map_delete_elem(&map, &key)
- *     Return: 0 on success or negative error
- *
- * int bpf_probe_read(void *dst, int size, void *src)
- *     Return: 0 on success or negative error
- *
- * u64 bpf_ktime_get_ns(void)
- *     Return: current ktime
- *
- * int bpf_trace_printk(const char *fmt, int fmt_size, ...)
- *     Return: length of buffer written or negative error
- *
- * u32 bpf_prandom_u32(void)
- *     Return: random value
- *
- * u32 bpf_raw_smp_processor_id(void)
- *     Return: SMP processor ID
- *
- * int bpf_skb_store_bytes(skb, offset, from, len, flags)
- *     store bytes into packet
- *     @skb: pointer to skb
- *     @offset: offset within packet from skb->mac_header
- *     @from: pointer where to copy bytes from
- *     @len: number of bytes to store into packet
- *     @flags: bit 0 - if true, recompute skb->csum
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_l3_csum_replace(skb, offset, from, to, flags)
- *     recompute IP checksum
- *     @skb: pointer to skb
- *     @offset: offset within packet where IP checksum is located
- *     @from: old value of header field
- *     @to: new value of header field
- *     @flags: bits 0-3 - size of header field
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_l4_csum_replace(skb, offset, from, to, flags)
- *     recompute TCP/UDP checksum
- *     @skb: pointer to skb
- *     @offset: offset within packet where TCP/UDP checksum is located
- *     @from: old value of header field
- *     @to: new value of header field
- *     @flags: bits 0-3 - size of header field
- *             bit 4 - is pseudo header
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_tail_call(ctx, prog_array_map, index)
- *     jump into another BPF program
- *     @ctx: context pointer passed to next program
- *     @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY
- *     @index: 32-bit index inside array that selects specific program to run
- *     Return: 0 on success or negative error
- *
- * int bpf_clone_redirect(skb, ifindex, flags)
- *     redirect to another netdev
- *     @skb: pointer to skb
- *     @ifindex: ifindex of the net device
- *     @flags: bit 0 - if set, redirect to ingress instead of egress
- *             other bits - reserved
- *     Return: 0 on success or negative error
- *
- * u64 bpf_get_current_pid_tgid(void)
- *     Return: current->tgid << 32 | current->pid
- *
- * u64 bpf_get_current_uid_gid(void)
- *     Return: current_gid << 32 | current_uid
- *
- * int bpf_get_current_comm(char *buf, int size_of_buf)
- *     stores current->comm into buf
- *     Return: 0 on success or negative error
- *
- * u32 bpf_get_cgroup_classid(skb)
- *     retrieve a proc's classid
- *     @skb: pointer to skb
- *     Return: classid if != 0
- *
- * int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci)
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_vlan_pop(skb)
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_get_tunnel_key(skb, key, size, flags)
- * int bpf_skb_set_tunnel_key(skb, key, size, flags)
- *     retrieve or populate tunnel metadata
- *     @skb: pointer to skb
- *     @key: pointer to 'struct bpf_tunnel_key'
- *     @size: size of 'struct bpf_tunnel_key'
- *     @flags: room for future extensions
- *     Return: 0 on success or negative error
- *
- * u64 bpf_perf_event_read(map, flags)
- *     read perf event counter value
- *     @map: pointer to perf_event_array map
- *     @flags: index of event in the map or bitmask flags
- *     Return: value of perf event counter read or error code
- *
- * int bpf_redirect(ifindex, flags)
- *     redirect to another netdev
- *     @ifindex: ifindex of the net device
- *     @flags:
- *	  cls_bpf:
- *          bit 0 - if set, redirect to ingress instead of egress
- *          other bits - reserved
- *	  xdp_bpf:
- *	    all bits - reserved
- *     Return: cls_bpf: TC_ACT_REDIRECT on success or TC_ACT_SHOT on error
- *	       xdp_bfp: XDP_REDIRECT on success or XDP_ABORT on error
- * int bpf_redirect_map(map, key, flags)
- *     redirect to endpoint in map
- *     @map: pointer to dev map
- *     @key: index in map to lookup
- *     @flags: --
- *     Return: XDP_REDIRECT on success or XDP_ABORT on error
- *
- * u32 bpf_get_route_realm(skb)
- *     retrieve a dst's tclassid
- *     @skb: pointer to skb
- *     Return: realm if != 0
- *
- * int bpf_perf_event_output(ctx, map, flags, data, size)
- *     output perf raw sample
- *     @ctx: struct pt_regs*
- *     @map: pointer to perf_event_array map
- *     @flags: index of event in the map or bitmask flags
- *     @data: data on stack to be output as raw data
- *     @size: size of data
- *     Return: 0 on success or negative error
- *
- * int bpf_get_stackid(ctx, map, flags)
- *     walk user or kernel stack and return id
- *     @ctx: struct pt_regs*
- *     @map: pointer to stack_trace map
- *     @flags: bits 0-7 - numer of stack frames to skip
- *             bit 8 - collect user stack instead of kernel
- *             bit 9 - compare stacks by hash only
- *             bit 10 - if two different stacks hash into the same stackid
- *                      discard old
- *             other bits - reserved
- *     Return: >= 0 stackid on success or negative error
- *
- * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
- *     calculate csum diff
- *     @from: raw from buffer
- *     @from_size: length of from buffer
- *     @to: raw to buffer
- *     @to_size: length of to buffer
- *     @seed: optional seed
- *     Return: csum result or negative error code
- *
- * int bpf_skb_get_tunnel_opt(skb, opt, size)
- *     retrieve tunnel options metadata
- *     @skb: pointer to skb
- *     @opt: pointer to raw tunnel option data
- *     @size: size of @opt
- *     Return: option size
- *
- * int bpf_skb_set_tunnel_opt(skb, opt, size)
- *     populate tunnel options metadata
- *     @skb: pointer to skb
- *     @opt: pointer to raw tunnel option data
- *     @size: size of @opt
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_change_proto(skb, proto, flags)
- *     Change protocol of the skb. Currently supported is v4 -> v6,
- *     v6 -> v4 transitions. The helper will also resize the skb. eBPF
- *     program is expected to fill the new headers via skb_store_bytes
- *     and lX_csum_replace.
- *     @skb: pointer to skb
- *     @proto: new skb->protocol type
- *     @flags: reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_change_type(skb, type)
- *     Change packet type of skb.
- *     @skb: pointer to skb
- *     @type: new skb->pkt_type type
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_under_cgroup(skb, map, index)
- *     Check cgroup2 membership of skb
- *     @skb: pointer to skb
- *     @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
- *     @index: index of the cgroup in the bpf_map
- *     Return:
- *       == 0 skb failed the cgroup2 descendant test
- *       == 1 skb succeeded the cgroup2 descendant test
- *        < 0 error
- *
- * u32 bpf_get_hash_recalc(skb)
- *     Retrieve and possibly recalculate skb->hash.
- *     @skb: pointer to skb
- *     Return: hash
- *
- * u64 bpf_get_current_task(void)
- *     Returns current task_struct
- *     Return: current
- *
- * int bpf_probe_write_user(void *dst, void *src, int len)
- *     safely attempt to write to a location
- *     @dst: destination address in userspace
- *     @src: source address on stack
- *     @len: number of bytes to copy
- *     Return: 0 on success or negative error
- *
- * int bpf_current_task_under_cgroup(map, index)
- *     Check cgroup2 membership of current task
- *     @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
- *     @index: index of the cgroup in the bpf_map
- *     Return:
- *       == 0 current failed the cgroup2 descendant test
- *       == 1 current succeeded the cgroup2 descendant test
- *        < 0 error
- *
- * int bpf_skb_change_tail(skb, len, flags)
- *     The helper will resize the skb to the given new size, to be used f.e.
- *     with control messages.
- *     @skb: pointer to skb
- *     @len: new skb length
- *     @flags: reserved
- *     Return: 0 on success or negative error
- *
- * int bpf_skb_pull_data(skb, len)
- *     The helper will pull in non-linear data in case the skb is non-linear
- *     and not all of len are part of the linear section. Only needed for
- *     read/write with direct packet access.
- *     @skb: pointer to skb
- *     @len: len to make read/writeable
- *     Return: 0 on success or negative error
- *
- * s64 bpf_csum_update(skb, csum)
- *     Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
- *     @skb: pointer to skb
- *     @csum: csum to add
- *     Return: csum on success or negative error
- *
- * void bpf_set_hash_invalid(skb)
- *     Invalidate current skb->hash.
- *     @skb: pointer to skb
- *
- * int bpf_get_numa_node_id()
- *     Return: Id of current NUMA node.
- *
- * int bpf_skb_change_head()
- *     Grows headroom of skb and adjusts MAC header offset accordingly.
- *     Will extends/reallocae as required automatically.
- *     May change skb data pointer and will thus invalidate any check
- *     performed for direct packet access.
- *     @skb: pointer to skb
- *     @len: length of header to be pushed in front
- *     @flags: Flags (unused for now)
- *     Return: 0 on success or negative error
- *
- * int bpf_xdp_adjust_head(xdp_md, delta)
- *     Adjust the xdp_md.data by delta
- *     @xdp_md: pointer to xdp_md
- *     @delta: An positive/negative integer to be added to xdp_md.data
- *     Return: 0 on success or negative on error
- *
- * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
- *     Copy a NUL terminated string from unsafe address. In case the string
- *     length is smaller than size, the target is not padded with further NUL
- *     bytes. In case the string length is larger than size, just count-1
- *     bytes are copied and the last byte is set to NUL.
- *     @dst: destination address
- *     @size: maximum number of bytes to copy, including the trailing NUL
- *     @unsafe_ptr: unsafe address
- *     Return:
- *       > 0 length of the string including the trailing NUL on success
- *       < 0 error
- *
- * u64 bpf_get_socket_cookie(skb)
- *     Get the cookie for the socket stored inside sk_buff.
- *     @skb: pointer to skb
- *     Return: 8 Bytes non-decreasing number on success or 0 if the socket
- *     field is missing inside sk_buff
- *
- * u32 bpf_get_socket_uid(skb)
- *     Get the owner uid of the socket stored inside sk_buff.
- *     @skb: pointer to skb
- *     Return: uid of the socket owner on success or overflowuid if failed.
- *
- * u32 bpf_set_hash(skb, hash)
- *     Set full skb->hash.
- *     @skb: pointer to skb
- *     @hash: hash to set
- *
- * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
- *     Calls setsockopt. Not all opts are available, only those with
- *     integer optvals plus TCP_CONGESTION.
- *     Supported levels: SOL_SOCKET and IPPROTO_TCP
- *     @bpf_socket: pointer to bpf_socket
- *     @level: SOL_SOCKET or IPPROTO_TCP
- *     @optname: option name
- *     @optval: pointer to option value
- *     @optlen: length of optval in bytes
- *     Return: 0 or negative error
- *
- * int bpf_getsockopt(bpf_socket, level, optname, optval, optlen)
- *     Calls getsockopt. Not all opts are available.
- *     Supported levels: IPPROTO_TCP
- *     @bpf_socket: pointer to bpf_socket
- *     @level: IPPROTO_TCP
- *     @optname: option name
- *     @optval: pointer to option value
- *     @optlen: length of optval in bytes
- *     Return: 0 or negative error
- *
- * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
- *     Set callback flags for sock_ops
- *     @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
- *     @flags: flags value
- *     Return: 0 for no error
- *             -EINVAL if there is no full tcp socket
- *             bits in flags that are not supported by current kernel
- *
- * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
- *     Grow or shrink room in sk_buff.
- *     @skb: pointer to skb
- *     @len_diff: (signed) amount of room to grow/shrink
- *     @mode: operation mode (enum bpf_adj_room_mode)
- *     @flags: reserved for future use
- *     Return: 0 on success or negative error code
- *
- * int bpf_sk_redirect_map(map, key, flags)
- *     Redirect skb to a sock in map using key as a lookup key for the
- *     sock in map.
- *     @map: pointer to sockmap
- *     @key: key to lookup sock in map
- *     @flags: reserved for future use
- *     Return: SK_PASS
- *
- * int bpf_sock_map_update(skops, map, key, flags)
- *	@skops: pointer to bpf_sock_ops
- *	@map: pointer to sockmap to update
- *	@key: key to insert/update sock in map
- *	@flags: same flags as map update elem
- *
- * int bpf_xdp_adjust_meta(xdp_md, delta)
- *     Adjust the xdp_md.data_meta by delta
- *     @xdp_md: pointer to xdp_md
- *     @delta: An positive/negative integer to be added to xdp_md.data_meta
- *     Return: 0 on success or negative on error
- *
- * int bpf_perf_event_read_value(map, flags, buf, buf_size)
- *     read perf event counter value and perf event enabled/running time
- *     @map: pointer to perf_event_array map
- *     @flags: index of event in the map or bitmask flags
- *     @buf: buf to fill
- *     @buf_size: size of the buf
- *     Return: 0 on success or negative error code
- *
- * int bpf_perf_prog_read_value(ctx, buf, buf_size)
- *     read perf prog attached perf event counter and enabled/running time
- *     @ctx: pointer to ctx
- *     @buf: buf to fill
- *     @buf_size: size of the buf
- *     Return : 0 on success or negative error code
- *
- * int bpf_override_return(pt_regs, rc)
- *	@pt_regs: pointer to struct pt_regs
- *	@rc: the return value to set
- *
- * int bpf_msg_redirect_map(map, key, flags)
- *     Redirect msg to a sock in map using key as a lookup key for the
- *     sock in map.
- *     @map: pointer to sockmap
- *     @key: key to lookup sock in map
- *     @flags: reserved for future use
- *     Return: SK_PASS
- *
- * int bpf_bind(ctx, addr, addr_len)
- *     Bind socket to address. Only binding to IP is supported, no port can be
- *     set in addr.
- *     @ctx: pointer to context of type bpf_sock_addr
- *     @addr: pointer to struct sockaddr to bind socket to
- *     @addr_len: length of sockaddr structure
- *     Return: 0 on success or negative error code
+/* The description below is an attempt at providing documentation to eBPF
+ * developers about the multiple available eBPF helper functions. It can be
+ * parsed and used to produce a manual page. The workflow is the following,
+ * and requires the rst2man utility:
+ *
+ *     $ ./scripts/bpf_helpers_doc.py \
+ *             --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
+ *     $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
+ *     $ man /tmp/bpf-helpers.7
+ *
+ * Note that in order to produce this external documentation, some RST
+ * formatting is used in the descriptions to get "bold" and "italics" in
+ * manual pages. Also note that the few trailing white spaces are
+ * intentional, removing them would break paragraphs for rst2man.
+ *
+ * Start of BPF helper function descriptions:
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
new file mode 100755
index 000000000000..3a15ba3f0a83
--- /dev/null
+++ b/scripts/bpf_helpers_doc.py
@@ -0,0 +1,414 @@
+#!/usr/bin/python3
+#
+# Copyright (C) 2018 Netronome Systems, Inc.
+#
+# This software is licensed under the GNU General License Version 2,
+# June 1991 as shown in the file COPYING in the top-level directory of this
+# source tree.
+#
+# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
+# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
+# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
+# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
+# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+# In case user attempts to run with Python 2.
+from __future__ import print_function
+
+import argparse
+import re
+import sys, os
+
+class NoHelperFound(BaseException):
+    pass
+
+class ParsingError(BaseException):
+    def __init__(self, line='<line not provided>', reader=None):
+        if reader:
+            BaseException.__init__(self,
+                                   'Error at file offset %d, parsing line: %s' %
+                                   (reader.tell(), line))
+        else:
+            BaseException.__init__(self, 'Error parsing line: %s' % line)
+
+class Helper(object):
+    """
+    An object representing the description of an eBPF helper function.
+    @proto: function prototype of the helper function
+    @desc: textual description of the helper function
+    @ret: description of the return value of the helper function
+    """
+    def __init__(self, proto='', desc='', ret=''):
+        self.proto = proto
+        self.desc = desc
+        self.ret = ret
+
+    def proto_break_down(self):
+        """
+        Break down helper function protocol into smaller chunks: return type,
+        name, distincts arguments.
+        """
+        arg_re = re.compile('^((const )?(struct )?(\w+|...))( (\**)(\w+))?$')
+        res = {}
+        proto_re = re.compile('^(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$')
+
+        capture = proto_re.match(self.proto)
+        res['ret_type'] = capture.group(1)
+        res['ret_star'] = capture.group(2)
+        res['name']     = capture.group(3)
+        res['args'] = []
+
+        args    = capture.group(4).split(', ')
+        for a in args:
+            capture = arg_re.match(a)
+            res['args'].append({
+                'type' : capture.group(1),
+                'star' : capture.group(6),
+                'name' : capture.group(7)
+            })
+
+        return res
+
+class HeaderParser(object):
+    """
+    An object used to parse a file in order to extract the documentation of a
+    list of eBPF helper functions. All the helpers that can be retrieved are
+    stored as Helper object, in the self.helpers() array.
+    @filename: name of file to parse, usually include/uapi/linux/bpf.h in the
+               kernel tree
+    """
+    def __init__(self, filename):
+        self.reader = open(filename, 'r')
+        self.line = ''
+        self.helpers = []
+
+    def parse_helper(self):
+        proto    = self.parse_proto()
+        desc     = self.parse_desc()
+        ret      = self.parse_ret()
+        return Helper(proto=proto, desc=desc, ret=ret)
+
+    def parse_proto(self):
+        # Argument can be of shape:
+        #   - "void"
+        #   - "type  name"
+        #   - "type *name"
+        #   - Same as above, with "const" and/or "struct" in front of type
+        #   - "..." (undefined number of arguments, for bpf_trace_printk())
+        # There is at least one term ("void"), and at most five arguments.
+        p = re.compile('^ \* ((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$')
+        capture = p.match(self.line)
+        if not capture:
+            raise NoHelperFound
+        self.line = self.reader.readline()
+        return capture.group(1)
+
+    def parse_desc(self):
+        p = re.compile('^ \* \tDescription$')
+        capture = p.match(self.line)
+        if not capture:
+            # Helper can have empty description and we might be parsing another
+            # attribute: return but do not consume.
+            return ''
+        # Description can be several lines, some of them possibly empty, and it
+        # stops when another subsection title is met.
+        desc = ''
+        while True:
+            self.line = self.reader.readline()
+            if self.line == ' *\n':
+                desc += '\n'
+            else:
+                p = re.compile('^ \* \t\t(.*)')
+                capture = p.match(self.line)
+                if capture:
+                    desc += capture.group(1) + '\n'
+                else:
+                    break
+        return desc
+
+    def parse_ret(self):
+        p = re.compile('^ \* \tReturn$')
+        capture = p.match(self.line)
+        if not capture:
+            # Helper can have empty retval and we might be parsing another
+            # attribute: return but do not consume.
+            return ''
+        # Return value description can be several lines, some of them possibly
+        # empty, and it stops when another subsection title is met.
+        ret = ''
+        while True:
+            self.line = self.reader.readline()
+            if self.line == ' *\n':
+                ret += '\n'
+            else:
+                p = re.compile('^ \* \t\t(.*)')
+                capture = p.match(self.line)
+                if capture:
+                    ret += capture.group(1) + '\n'
+                else:
+                    break
+        return ret
+
+    def run(self):
+        # Advance to start of helper function descriptions.
+        offset = self.reader.read().find('* Start of BPF helper function descriptions:')
+        if offset == -1:
+            raise Exception('Could not find start of eBPF helper descriptions list')
+        self.reader.seek(offset)
+        self.reader.readline()
+        self.reader.readline()
+        self.line = self.reader.readline()
+
+        while True:
+            try:
+                helper = self.parse_helper()
+                self.helpers.append(helper)
+            except NoHelperFound:
+                break
+
+        self.reader.close()
+        print('Parsed description of %d helper function(s)' % len(self.helpers),
+              file=sys.stderr)
+
+###############################################################################
+
+class Printer(object):
+    """
+    A generic class for printers. Printers should be created with an array of
+    Helper objects, and implement a way to print them in the desired fashion.
+    @helpers: array of Helper objects to print to standard output
+    """
+    def __init__(self, helpers):
+        self.helpers = helpers
+
+    def print_header(self):
+        pass
+
+    def print_footer(self):
+        pass
+
+    def print_one(self, helper):
+        pass
+
+    def print_all(self):
+        self.print_header()
+        for helper in self.helpers:
+            self.print_one(helper)
+        self.print_footer()
+
+class PrinterRST(Printer):
+    """
+    A printer for dumping collected information about helpers as a ReStructured
+    Text page compatible with the rst2man program, which can be used to
+    generate a manual page for the helpers.
+    @helpers: array of Helper objects to print to standard output
+    """
+    def print_header(self):
+        header = '''\
+.. Copyright (C) 2018 Netronome Systems, Inc.
+.. 
+.. %%%LICENSE_START(VERBATIM)
+.. Permission is granted to make and distribute verbatim copies of this
+.. manual provided the copyright notice and this permission notice are
+.. preserved on all copies.
+.. 
+.. Permission is granted to copy and distribute modified versions of this
+.. manual under the conditions for verbatim copying, provided that the
+.. entire resulting derived work is distributed under the terms of a
+.. permission notice identical to this one.
+.. 
+.. Since the Linux kernel and libraries are constantly changing, this
+.. manual page may be incorrect or out-of-date.  The author(s) assume no
+.. responsibility for errors or omissions, or for damages resulting from
+.. the use of the information contained herein.  The author(s) may not
+.. have taken the same level of care in the production of this manual,
+.. which is licensed free of charge, as they might when working
+.. professionally.
+.. 
+.. Formatted or processed versions of this manual, if unaccompanied by
+.. the source, must acknowledge the copyright and authors of this work.
+.. %%%LICENSE_END
+.. 
+.. Please do not edit this file. It was generated from the documentation
+.. located in file include/uapi/linux/bpf.h of the Linux kernel sources
+.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
+.. repository (header and footer).
+
+===========
+BPF-HELPERS
+===========
+-------------------------------------------------------------------------------
+list of eBPF helper functions
+-------------------------------------------------------------------------------
+
+:Manual section: 7
+
+DESCRIPTION
+===========
+
+The extended Berkeley Packet Filter (eBPF) subsystem consists in programs
+written in a pseudo-assembly language, then attached to one of the several
+kernel hooks and run in reaction of specific events. This framework differs
+from the older, "classic" BPF (or "cBPF") in several aspects, one of them being
+the ability to call special functions (or "helpers") from within a program. For
+security reasons, these functions are restricted to a white-list of helpers
+defined in the kernel.
+
+These helpers are used by eBPF programs to interact with the system, or with
+the context in which they work. For instance, they can be used to print
+debugging messages, to get the time since the system was booted, to interact
+with eBPF maps, or to manipulate network packets metadata. Since there are
+several eBPF program types, and that they do not run in the same context, each
+program type can only call a subset of those helpers.
+
+Due to eBPF conventions, a helper can not have more than five arguments.
+
+This document is an attempt to list and document the helpers available to eBPF
+developers. They are sorted by chronological order (the oldest helpers in the
+kernel at the top).
+
+HELPERS
+=======
+'''
+        print(header)
+
+    def print_footer(self):
+        footer = '''
+NOTES
+=====
+
+On the performance side, eBPF programs move to the stack all arguments to pass
+to the helpers, and call directly into the compiled helper functions without
+requiring any foreign-function interface. As a result, calling helpers
+introduce very little overhead.
+
+EXAMPLES
+========
+
+Example usage for most of the eBPF helpers listed in this manual page are
+available within the Linux kernel sources, at the following locations:
+
+* *samples/bpf/*
+* *tools/testing/selftests/bpf/*
+
+IMPLEMENTATION
+==============
+
+This manual page is an effort to document the existing eBPF helper functions.
+But as of this writing, the BPF sub-system is under heavy development. New eBPF
+program or map types are added, along with new helper functions. Some helpers
+are occasionally made available for additional program types. So in spite of
+the efforts of the community, this page might not be up-to-date. If you want to
+check by yourself what helper functions exist in your kernel, or what types of
+programs they can support, here are some files among the kernel tree that you
+may be interested in:
+
+* *include/uapi/linux/bpf.h* contains the full list of all helper functions.
+* *net/core/filter.c* contains the definition of most network-related helper
+  functions, and the list of program types from which they can be used.
+* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related
+  helpers.
+* *kernel/bpf/verifier.c* contains the functions used to check that valid types
+  of eBPF maps are used with a given helper function.
+* *kernel/bpf/* directory contains other files in which additional helpers are
+  defined (for cgroups, sockmaps, etc.).
+
+Compatibility between helper functions and program types can generally be found
+in the files where helper functions are defined. Look for the **struct
+bpf_func_proto** objects and for functions returning them: these functions
+contain a list of helpers that a given program type can call. Note that the
+**default:** label of the **switch ... case** used to filter helpers can call
+other functions, themselves allowing access to additional helpers. The
+requirement for GPL license is also in those **struct bpf_func_proto**.
+
+Compatibility between helper functions and map types can be found in the
+**check_map_func_compatibility**\ () function in file *kernel/bpf/verifier.c*.
+
+Helper functions that invalidate the checks on **data** and **data_end**
+pointers for network processing are listed in function
+**bpf_helper_changes_pkt_data**\ () in file *net/core/filter.c*.
+
+SEE ALSO
+========
+
+**bpf**\ (2),
+**cgroups**\ (7),
+**ip**\ (8),
+**perf_event_open**\ (2),
+**sendmsg**\ (2),
+**socket**\ (7),
+**tc-bpf**\ (8)'''
+        print(footer)
+
+    def print_proto(self, helper):
+        """
+        Format function protocol with bold and italics markers. This makes RST
+        file less readable, but gives nice results in the manual page.
+        """
+        proto = helper.proto_break_down()
+
+        print('**%s %s%s(' % (proto['ret_type'],
+                              proto['ret_star'].replace('*', '\\*'),
+                              proto['name']),
+              end='')
+
+        comma = ''
+        for a in proto['args']:
+            one_arg = '{}{}'.format(comma, a['type'])
+            if a['name']:
+                if a['star']:
+                    one_arg += ' {}**\ '.format(a['star'].replace('*', '\\*'))
+                else:
+                    one_arg += '** '
+                one_arg += '*{}*\\ **'.format(a['name'])
+            comma = ', '
+            print(one_arg, end='')
+
+        print(')**')
+
+    def print_one(self, helper):
+        self.print_proto(helper)
+
+        if (helper.desc):
+            print('\tDescription')
+            # Do not strip all newline characters: formatted code at the end of
+            # a section must be followed by a blank line.
+            for line in re.sub('\n$', '', helper.desc, count=1).split('\n'):
+                print('{}{}'.format('\t\t' if line else '', line))
+
+        if (helper.ret):
+            print('\tReturn')
+            for line in helper.ret.rstrip().split('\n'):
+                print('{}{}'.format('\t\t' if line else '', line))
+
+        print('')
+
+###############################################################################
+
+# If script is launched from scripts/ from kernel tree and can access
+# ../include/uapi/linux/bpf.h, use it as a default name for the file to parse,
+# otherwise the --filename argument will be required from the command line.
+script = os.path.abspath(sys.argv[0])
+linuxRoot = os.path.dirname(os.path.dirname(script))
+bpfh = os.path.join(linuxRoot, 'include/uapi/linux/bpf.h')
+
+argParser = argparse.ArgumentParser(description="""
+Parse eBPF header file and generate documentation for eBPF helper functions.
+The RST-formatted output produced can be turned into a manual page with the
+rst2man utility.
+""")
+if (os.path.isfile(bpfh)):
+    argParser.add_argument('--filename', help='path to include/uapi/linux/bpf.h',
+                           default=bpfh)
+else:
+    argParser.add_argument('--filename', help='path to include/uapi/linux/bpf.h')
+args = argParser.parse_args()
+
+# Parse file.
+headerParser = HeaderParser(args.filename)
+headerParser.run()
+
+# Print formatted output to standard output.
+printer = PrinterRST(headerParser.helpers)
+printer.print_all()
-- 
2.14.1

^ permalink raw reply related

* [RFC bpf-next v2 0/8] bpf: document eBPF helpers and add a script to generate man page
From: Quentin Monnet @ 2018-04-10 14:41 UTC (permalink / raw)
  To: daniel, ast; +Cc: netdev, oss-drivers, quentin.monnet, linux-doc, linux-man

eBPF helper functions can be called from within eBPF programs to perform
a variety of tasks that would be otherwise hard or impossible to do with
eBPF itself. There is a growing number of such helper functions in the
kernel, but documentation is scarce. The main user space header file
does contain a short commented description of most helpers, but it is
somewhat outdated and not complete. It is more a "cheat sheet" than a
real documentation accessible to new eBPF developers.

This commit attempts to improve the situation by replacing the existing
overview for the helpers with a more developed description. Furthermore,
a Python script is added to generate a manual page for eBPF helpers. The
workflow is the following, and requires the rst2man utility:

    $ ./scripts/bpf_helpers_doc.py \
            --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
    $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
    $ man /tmp/bpf-helpers.7

The objective is to keep all documentation related to the helpers in a
single place, and to be able to generate from here a manual page that
could be packaged in the man-pages repository and shipped with most
distributions.

Additionally, parsing the prototypes of the helper functions could
hopefully be reused, with a different Printer object, to generate
header files needed in some eBPF-related projects.

Regarding the description of each helper, it comprises several items:

- The function prototype.
- A description of the function and of its arguments (except for a
  couple of cases, when there are no arguments and the return value
  makes the function usage really obvious).
- A description of return values (if not void).

Additional items such as the list of compatible eBPF program and map
types for each helper, Linux kernel version that introduced the helper,
GPL-only restriction, and commit hash could be added in the future, but
it was decided on the mailing list to leave them aside for now.

For several helpers, descriptions are inspired (at times, nearly copied)
from the commit logs introducing them in the kernel--Many thanks to
their respective authors! They were completed as much as possible, the
objective being to have something easily accessible even for people just
starting with eBPF. There is probably a bit more work to do in this
direction for some helpers.

Some RST formatting is used in the descriptions (not in function
prototypes, to keep them readable, but the Python script provided in
order to generate the RST for the manual page does add formatting to
prototypes, to produce something pretty) to get "bold" and "italics" in
manual pages. Hopefully, the descriptions in bpf.h file remains
perfectly readable. Note that the few trailing white spaces are
intentional, removing them would break paragraphs for rst2man.

The descriptions should ideally be updated each time someone adds a new
helper, or updates the behaviour (new socket option supported, ...) or
the interface (new flags available, ...) of existing ones.

The second RFC for this set splits the documentation into several patches.
Ideally all helper descriptions should be reviewed by the respective
authors of the functions they describe. Please do not hesitate to suggest
improvements to make descriptions more complete or accessible.

v2:
- Remove "For" (compatible program and map types), "Since" (minimal
  Linux kernel version required), "GPL only" sections and commit hashes
  for the helpers.
- Add comment on top of the description list to explain how this
  documentation is supposed to be processed.
- Update Python script accordingly (remove the same sections, and remove
  paragraphs on program types and GPL restrictions from man page
  header).
- Split series into several patches.

Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>

Quentin Monnet (8):
  bpf: add script and prepare bpf.h for new helpers documentation
  bpf: add documentation for eBPF helpers (01-11)
  bpf: add documentation for eBPF helpers (12-22)
  bpf: add documentation for eBPF helpers (23-32)
  bpf: add documentation for eBPF helpers (33-41)
  bpf: add documentation for eBPF helpers (42-50)
  bpf: add documentation for eBPF helpers (51-57)
  bpf: add documentation for eBPF helpers (58-64)

 include/uapi/linux/bpf.h   | 1580 +++++++++++++++++++++++++++++++++-----------
 scripts/bpf_helpers_doc.py |  414 ++++++++++++
 2 files changed, 1616 insertions(+), 378 deletions(-)
 create mode 100755 scripts/bpf_helpers_doc.py

-- 
2.14.1

^ permalink raw reply

* Re: [patch net] devlink: convert occ_get op to separate registration
From: David Miller @ 2018-04-10 14:36 UTC (permalink / raw)
  To: jiri; +Cc: Alexander.Levin, jiri, netdev, idosch, stable
In-Reply-To: <20180410141309.GD2028@nanopsycho>

From: Jiri Pirko <jiri@resnulli.us>
Date: Tue, 10 Apr 2018 16:13:09 +0200

> Tue, Apr 10, 2018 at 03:49:40PM CEST, Alexander.Levin@microsoft.com wrote:
>>Hi,
>>
>>[This is an automated email]
>>
>>This commit has been processed because it contains a "Fixes:" tag,
>>fixing commit: d9f9b9a4d05f devlink: Add support for resource abstraction.
> 
> I don't think we need this fix in stable. Thanks.

And as I stated in another email, I don't think we need these automated
things either.

^ permalink raw reply

* [PATCH] net: usb: hso: Replace GFP_ATOMIC with GFP_KERNEL in hso_create_device
From: Jia-Ju Bai @ 2018-04-10 14:35 UTC (permalink / raw)
  To: davem, andreas, johan, johannes.berg, stephen, Linyu.Yuan
  Cc: linux-usb, netdev, linux-kernel, Jia-Ju Bai

hso_create_device() is never called in atomic context.

The call chains ending up at hso_create_device() are:
[1] hso_create_device() <- hso_create_bulk_serial_device() <- hso_probe()
[2] hso_create_device() <- hso_create_mux_serial_device() <- hso_probe()
[3] hso_create_device() <- hso_create_net_device() <- hso_probe()
hso_probe() is set as ".probe" in struct usb_driver, 
so it is not called in atomic context.

Despite never getting called from atomic context,
hso_create_device() calls kzalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improve the possibility of sucessful allocation.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
 drivers/net/usb/hso.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/usb/hso.c b/drivers/net/usb/hso.c
index d7a3379..3d7a33f 100644
--- a/drivers/net/usb/hso.c
+++ b/drivers/net/usb/hso.c
@@ -2332,7 +2332,7 @@ static struct hso_device *hso_create_device(struct usb_interface *intf,
 {
 	struct hso_device *hso_dev;

-	hso_dev = kzalloc(sizeof(*hso_dev), GFP_ATOMIC);
+	hso_dev = kzalloc(sizeof(*hso_dev), GFP_KERNEL);
 	if (!hso_dev)
 		return NULL;

-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-10 14:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <878t9wx8xw.fsf@xmission.com>

On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >>>
> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >>>
> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >>>
> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >>> into all network namespaces.
> >> >> >>>
> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >>>
> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >>>
> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >>>                 return;
> >> >> >>>
> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >>>         j       return;
> >> >> >>> }
> >> >> >>>
> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >>>
> >> >> >>>  sudo unshare -U --map-root
> >> >> >>>  udevadm monitor -k
> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >>>  modprobe kvm
> >> >> >>>  # or
> >> >> >>>  rmmod kvm
> >> >> >>>
> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >>> with them.
> >> >> >>>
> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >>> make a decision what uevents should be sent.
> >> >> >>>
> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >>>   event will always be filtered.
> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >>>   the event will be filtered as well.
> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >>> anything with interesting devices.
> >> >> >>>
> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >>> ---
> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >>>
> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >>>  		return sock_ns != ns;
> >> >> >>>  	}
> >> >> >>>  
> >> >> >>> -	return 0;
> >> >> >>> +	/*
> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >>> +	 * namespace below.
> >> >> >>> +	 */
> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >>> +		return 1;
> >> >> >>> +
> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >>>  }
> >> >> >>>  #endif
> >> >> >>
> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> > 
> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> > how this comes about.
> >> >> > There is only one case where this currently breaks and this is as I
> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> 
> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> Now it will return 1 sometimes.
> >> >
> >> > Oh sure, it's in the commit message though. The callchain is
> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >
> >> > This codepiece will check whether the openened socket holds
> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> > which it won't because we don't have device namespaces and all devices
> >> > belong to the initial set of namespaces.
> >> >
> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >                              CAP_NET_BROADCAST))
> >> >         j       return;
> >> >
> >> 
> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> on their socket and has had someone with the appropriate privileges
> >> assign a peerrnetid.
> >> 
> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> as it might be, and if you can pass the other two checks I think it is
> >> pointless (because the peernet attributes are not generated for
> >> kobj_uevents) but valid to receive events from outside your network
> >> namespace.
> >> 
> >> 
> >> I might be missing something but I don't see anything excluding network
> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> logic.
> >> 
> >> The uevent_sock_list has one entry per network namespace. And
> >> kobject_uevent_net_broacast appears to walk each one.
> >> 
> >> I had a memory of filtering uevent messages and I had a memory
> >> that the netlink_has_listeners had a per network namespace effect.
> >> Neither seems true from my inspection of the code tonight.
> >> 
> >> If we are not filtering ordinary uevents at least at the user namespace
> >> level that does seem to be at least a nuisance.
> >> 
> >> 
> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> there are some real efficiency improvements that we could make to
> >> kobject_uevent_net_broadcast if nothing else.
> >> 
> >> Perhaps you could see where uevents are broadcast by poking
> >> the sysfs uevent of an existing device, and triggering a retransmit.
> >
> > @Eric, so I did some intensive testing over the weekend and forget everything I
> > said before. Uevents are not filtered by the kernel at the moment. This is
> > currently - apart from network devices - a pure userspace thing. Specifically,
> > anyone  on the host can listen to all uevents from everywhere. It's neither
> > filtered by user nor by network namespace. The fact that I used
> >
> > udevadm --debug monitor
> >
> > to test my prior hypothesis was what led me to believe that uevents are already
> > correctly filtered.
> > Instead, what is actually happening is that every udev implementation out there
> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> > Specifically,
> 
> Yes.  I remember something of the sort.  I believe udev also filters to
> ensure that the netlink port id == 0 to ensure the message came from
> the kernel and was not spoofed in any way.
> 
> > - legacy standalone udevd:
> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> > - eudevd
> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> > - systemd-udevd
> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> > - ueventd (Android)
> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
> >
> > For all of those I could trace this behavior back to the first released
> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
> > I could trace it back to the first version that is still available on
> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> > trivially true that it has the same behavior from the beginning.)
> > Android filters uevents in the same way but removed that check on January 8
> > 2018 for what I think is invalid reasoning. The good news is that this is only
> > in their master branch. It hasn't even made it into an rc release for Android 8
> > yet. I filed a bug against Android and offered them a fix if they agree.
> >
> > In any case, userspace udev is not making use of uevents at all right now since
> > any uid != 0 events are **explicitly** discarded.
> > The fact that you receive uevents for
> >
> > sudo unshare -U --map-root -n
> > udevadm --debug monitor
> >
> > is simply explained by the fact that container(0) <=> host(0) at which point
> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> > discard it.
> > The use case for receiving uevents in containers/user namespaces is definitely
> > there but that's what the uevent injection patch series was for that we merged.
> > This is a much safer and saner solution.
> > The fact that all udev implementations filter uevents by uid != 0 very much
> > seems like a security mechanism in userspace that we probably should provide by
> > isolating uevents based on user and/or network namespaces.
> 
> So in summary.  Uevents are filtered in a user namespace (by userspace)
> because the received uid != 0.  It instead == 65534 == "nobody" because
> the global root uid is not mapped.

Exactly.

> 
> Which means that we can modify the kernel to not send to all network
> namespaces whose user_ns != &init_user_ns because we know that userspace
> will ignore the message because of the uid anyway.  Which means when

Yes.

> net-next reopens you can send that patch.  But please base it on just
> not including network namespaces in the list, as that is much more
> efficient than adding more conditions to the filter.

I'll send a patch out once net-next reopens. I'll also make sure to
inform all udev userspace implementations of the change. It won't affect
them but it is nice for them to know that they're safer now.

Something like this (Proper commit message and so on will be added once
I sent this out.):

>From 68d3c27435520cd600874999b6d9d17572854a7a Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Tue, 10 Apr 2018 11:56:49 +0200
Subject: [PATCH] netns: restrict uevents to initial user namespace

/* Here'll be a useful commit message describing in detail what's
 * happening once I sent it to net-next. */

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 lib/kobject_uevent.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..22a2c1a98b8f 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,16 @@ static int uevent_net_init(struct net *net)
 
 	net->uevent_sock = ue_sk;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_add_tail(&ue_sk->list, &uevent_sock_list);
-	mutex_unlock(&uevent_sock_mutex);
+	/*
+	 * Only sent uevents to uevent sockets that are located in network
+	 * namespaces owned by the initial user namespace.
+	 */
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_add_tail(&ue_sk->list, &uevent_sock_list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
+
 	return 0;
 }
 
@@ -713,9 +720,11 @@ static void uevent_net_exit(struct net *net)
 {
 	struct uevent_sock *ue_sk = net->uevent_sock;
 
-	mutex_lock(&uevent_sock_mutex);
-	list_del(&ue_sk->list);
-	mutex_unlock(&uevent_sock_mutex);
+	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns) {
+		mutex_lock(&uevent_sock_mutex);
+		list_del(&ue_sk->list);
+		mutex_unlock(&uevent_sock_mutex);
+	}
 
 	netlink_kernel_release(ue_sk->sk);
 	kfree(ue_sk);
-- 
2.15.1

> 
> Thank you for tracking down what is going on.

Sure! Thanks for properly poking at this.

Christian

^ permalink raw reply related

* Re: [patch net] devlink: convert occ_get op to separate registration
From: David Miller @ 2018-04-10 14:35 UTC (permalink / raw)
  To: Alexander.Levin; +Cc: jiri, jiri, netdev, idosch, stable
In-Reply-To: <DM5PR2101MB103228CD6B0A1EB82361E796FBBE0@DM5PR2101MB1032.namprd21.prod.outlook.com>

From: Sasha Levin <Alexander.Levin@microsoft.com>
Date: Tue, 10 Apr 2018 13:49:40 +0000

> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: d9f9b9a4d05f devlink: Add support for resource abstraction.
> 
> The bot has tested the following trees: v4.16.1.

This is nice, but it's becomming noise.

I pick and choose specific networking changes to queue up for -stable
and whether it has a Fixes tag or applies cleanly are not the primary
considerations I take into account when deciding what goes to stable
or not.

What matters primarily are two attributes:

1) What is the impact of the bug?

2) How risky is the change?

3) Did someone explicitly ask for the backport?  Why?

These automated emails tell me nothing about either attribute.

I figure out what Fixes: tags are involved and whether the patch
applies cleanly when I process my stable queue of networking patches.

For various reasons I have to do this all manually anyways.  I always
do a full analysis of where the bug came from, and also do manual
code inspection to find the cases where a Fixes: tag indicates a commit
that was also put into stable branches and what impact that has on
the backport of the stable fix for a particular stable tree.

So I would like to kindly ask that you stop sending these automated
emails, they really are not helping my networking stable backport
process at all, and instead are just making more emails I have to
delete from my inbox every day.

Thank you very much for your kind consideration in this matter.

^ permalink raw reply

* Re: [PATCH] lan78xx: Don't reset the interface on open
From: Phil Elwell @ 2018-04-10 14:33 UTC (permalink / raw)
  To: Nisar.Sayed, Woojung.Huh, UNGLinuxDriver, agraf, tbogendoerfer,
	netdev, linux-usb, linux-kernel
In-Reply-To: <CE371C1263339941885964188A0225FA3A3F82@CHN-SV-EXMX03.mchp-main.com>

Hi Nisar,

On 10/04/2018 15:16, Nisar.Sayed@microchip.com wrote:
> Thanks Phil, for identifying the issues.
> 
>> -	ret = lan78xx_reset(dev);
>> -	if (ret < 0)
>> -		goto done;
>> -
>>  	phy_start(net->phydev);
>>
>>  	netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
>> --
> 
> You may need to start the interrupts before "phy_start" instead of suppressing call to "lan78xx_reset".
> 
> +             if (dev->domain_data.phyirq > 0)
> +                             phy_start_interrupts(dev->net->phydev);

Shouldn't phy_connect_direct, called from lan78xx_phy_init, already have enabled interrupts for us?

This patch addresses two problems - time wasted by renegotiating the link after the reset and the
missed interrupt - and I'd like both to be fixed. Unless you can come up with a good reason for
performing the reset from the open handler I think it should be removed.

Phil

^ permalink raw reply

* Re: ath9k: dfs: remove accidental use of stack VLA
From: Kalle Valo @ 2018-04-10 14:33 UTC (permalink / raw)
  To: Gustavo A. R. Silva
  Cc: QCA ath9k Development, linux-wireless, netdev, linux-kernel,
	Gustavo A. R. Silva
In-Reply-To: <20180330183401.GA16758@embeddedgus>

"Gustavo A. R. Silva" <gustavo@embeddedor.com> wrote:

> In preparation to enabling -Wvla, remove accidental use of stack VLA.
> 
> This avoids an accidental stack VLA (since the compiler thinks
> the value of FFT_NUM_SAMPLES can change, even when marked
> "const"). This just replaces it with a #define.
> 
> Also, fixed as part of the directive to remove all VLAs from
> the kernel: https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

9c27489a3454 ath9k: dfs: remove accidental use of stack VLA

-- 
https://patchwork.kernel.org/patch/10318271/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: ath10k: fix spelling mistake: "tiggers" -> "triggers"
From: Kalle Valo @ 2018-04-10 14:32 UTC (permalink / raw)
  To: Colin Ian King
  Cc: linux-wireless, netdev, kernel-janitors, linux-kernel, ath10k,
	Kalle Valo
In-Reply-To: <20180329165949.29155-1-colin.king@canonical.com>

Colin Ian King <colin.king@canonical.com> wrote:

> Trivial fix to spelling mistake in message text
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

6ce36faeac35 ath10k: fix spelling mistake: "tiggers" -> "triggers"

-- 
https://patchwork.kernel.org/patch/10315725/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: ath10k: fix spelling mistake: "tiggers" -> "triggers"
From: Kalle Valo @ 2018-04-10 14:32 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Kalle Valo, ath10k, linux-wireless, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180329165949.29155-1-colin.king@canonical.com>

Colin Ian King <colin.king@canonical.com> wrote:

> Trivial fix to spelling mistake in message text
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

6ce36faeac35 ath10k: fix spelling mistake: "tiggers" -> "triggers"

-- 
https://patchwork.kernel.org/patch/10315725/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: [next] wil6210: fix potential null dereference of ndev before null check
From: Kalle Valo @ 2018-04-10 14:30 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Maya Erez, linux-wireless, wil6210, netdev, kernel-janitors,
	linux-kernel
In-Reply-To: <20180328174027.31551-1-colin.king@canonical.com>

Colin Ian King <colin.king@canonical.com> wrote:

> The pointer ndev is being dereferenced before it is being null checked,
> hence there is a potential null pointer deference. Fix this by only
> dereferencing ndev after it has been null checked
> 
> Detected by CoverityScan, CID#1467010 ("Dereference before null check")
> 
> Fixes: e00243fab84b ("wil6210: infrastructure for multiple virtual interfaces")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> Reviewed-by: Maya Erez <merez@codeaurora.org>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

db5a4d5e1073 wil6210: fix potential null dereference of ndev before null check

-- 
https://patchwork.kernel.org/patch/10313705/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* [PATCH net 2/2] tun: send netlink notification when the device is modified
From: Sabrina Dubroca @ 2018-04-10 14:28 UTC (permalink / raw)
  To: netdev; +Cc: Sabrina Dubroca, Thomas Haller, Stefano Brivio
In-Reply-To: <0b61a3298ffa3a3fbc4ae0cae1573b368301a211.1523370272.git.sd@queasysnail.net>

I added dumping of link information about tun devices over netlink in
commit 1ec010e70593 ("tun: export flags, uid, gid, queue information
over netlink"), but didn't add the missing netlink notifications when
the device's exported properties change.

This patch adds notifications when owner/group or flags are modified,
when queues are attached/detached, and when a tun fd is closed.

Reported-by: Thomas Haller <thaller@redhat.com>
Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 drivers/net/tun.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c9e68fd76a37..28583aa0c17d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -743,8 +743,15 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 
 static void tun_detach(struct tun_file *tfile, bool clean)
 {
+	struct tun_struct *tun;
+	struct net_device *dev;
+
 	rtnl_lock();
+	tun = rtnl_dereference(tfile->tun);
+	dev = tun ? tun->dev : NULL;
 	__tun_detach(tfile, clean);
+	if (dev)
+		netdev_state_change(dev);
 	rtnl_unlock();
 }
 
@@ -2562,13 +2569,15 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 			/* One or more queue has already been attached, no need
 			 * to initialize the device again.
 			 */
+			netdev_state_change(dev);
 			return 0;
 		}
 
 		tun->flags = (tun->flags & ~TUN_FEATURES) |
 			      (ifr->ifr_flags & TUN_FEATURES);
-	}
-	else {
+
+		netdev_state_change(dev);
+	} else {
 		char *name;
 		unsigned long flags = 0;
 		int queues = ifr->ifr_flags & IFF_MULTI_QUEUE ?
@@ -2808,6 +2817,9 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
 	} else
 		ret = -EINVAL;
 
+	if (ret >= 0)
+		netdev_state_change(tun->dev);
+
 unlock:
 	rtnl_unlock();
 	return ret;
@@ -2848,6 +2860,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	unsigned int ifindex;
 	int le;
 	int ret;
+	bool do_notify = false;
 
 	if (cmd == TUNSETIFF || cmd == TUNSETQUEUE ||
 	    (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) {
@@ -2944,10 +2957,12 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		if (arg && !(tun->flags & IFF_PERSIST)) {
 			tun->flags |= IFF_PERSIST;
 			__module_get(THIS_MODULE);
+			do_notify = true;
 		}
 		if (!arg && (tun->flags & IFF_PERSIST)) {
 			tun->flags &= ~IFF_PERSIST;
 			module_put(THIS_MODULE);
+			do_notify = true;
 		}
 
 		tun_debug(KERN_INFO, tun, "persist %s\n",
@@ -2962,6 +2977,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 			break;
 		}
 		tun->owner = owner;
+		do_notify = true;
 		tun_debug(KERN_INFO, tun, "owner set to %u\n",
 			  from_kuid(&init_user_ns, tun->owner));
 		break;
@@ -2974,6 +2990,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 			break;
 		}
 		tun->group = group;
+		do_notify = true;
 		tun_debug(KERN_INFO, tun, "group set to %u\n",
 			  from_kgid(&init_user_ns, tun->group));
 		break;
@@ -3133,6 +3150,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		break;
 	}
 
+	if (do_notify)
+		netdev_state_change(tun->dev);
+
 unlock:
 	rtnl_unlock();
 	if (tun)
-- 
2.16.2

^ permalink raw reply related

* [PATCH net 1/2] tun: set the flags before registering the netdevice
From: Sabrina Dubroca @ 2018-04-10 14:28 UTC (permalink / raw)
  To: netdev; +Cc: Sabrina Dubroca, Thomas Haller, Stefano Brivio

Otherwise, register_netdevice advertises the creation of the device with
the default flags, instead of what the user requested.

Reported-by: Thomas Haller <thaller@redhat.com>
Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over netlink")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
---
 drivers/net/tun.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ba262f40ad..c9e68fd76a37 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2564,6 +2564,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 			 */
 			return 0;
 		}
+
+		tun->flags = (tun->flags & ~TUN_FEATURES) |
+			      (ifr->ifr_flags & TUN_FEATURES);
 	}
 	else {
 		char *name;
@@ -2642,6 +2645,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 				     ~(NETIF_F_HW_VLAN_CTAG_TX |
 				       NETIF_F_HW_VLAN_STAG_TX);
 
+		tun->flags = (tun->flags & ~TUN_FEATURES) |
+			      (ifr->ifr_flags & TUN_FEATURES);
+
 		INIT_LIST_HEAD(&tun->disabled);
 		err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI);
 		if (err < 0)
@@ -2656,9 +2662,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
 
-	tun->flags = (tun->flags & ~TUN_FEATURES) |
-		(ifr->ifr_flags & TUN_FEATURES);
-
 	/* Make sure persistent devices do not get stuck in
 	 * xoff state.
 	 */
-- 
2.16.2

^ permalink raw reply related

* Re: ath10k: avoid possible string overflow
From: Kalle Valo @ 2018-04-10 14:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Maharaja Kennadyrajan, Manikanta Pubbisetty, Johannes Berg,
	Arnd Bergmann, netdev, Anilkumar Kolli, Carl Huang,
	linux-wireless, linux-kernel, ath10k, Kalle Valo,
	Gustavo A. R. Silva
In-Reply-To: <20180328220635.3704458-1-arnd@arndb.de>

Arnd Bergmann <arnd@arndb.de> wrote:

> The way that 'strncat' is used here raised a warning in gcc-8:
> 
> drivers/net/wireless/ath/ath10k/wmi.c: In function 'ath10k_wmi_tpc_stats_final_disp_tables':
> drivers/net/wireless/ath/ath10k/wmi.c:4649:4: error: 'strncat' output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
> 
> Effectively, this is simply a strcat() but the use of strncat() suggests
> some form of overflow check. Regardless of whether this might actually
> overflow, using strlcat() instead of strncat() avoids the warning and
> makes the code more robust.
> 
> Fixes: bc64d05220f3 ("ath10k: debugfs support to get final TPC stats for 10.4 variants")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

6707ba0105a2 ath10k: avoid possible string overflow

-- 
https://patchwork.kernel.org/patch/10314201/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* Re: ath10k: avoid possible string overflow
From: Kalle Valo @ 2018-04-10 14:28 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Kalle Valo, Arnd Bergmann, Manikanta Pubbisetty, Anilkumar Kolli,
	Carl Huang, Gustavo A. R. Silva, Johannes Berg,
	Maharaja Kennadyrajan, ath10k, linux-wireless, netdev,
	linux-kernel
In-Reply-To: <20180328220635.3704458-1-arnd@arndb.de>

Arnd Bergmann <arnd@arndb.de> wrote:

> The way that 'strncat' is used here raised a warning in gcc-8:
> 
> drivers/net/wireless/ath/ath10k/wmi.c: In function 'ath10k_wmi_tpc_stats_final_disp_tables':
> drivers/net/wireless/ath/ath10k/wmi.c:4649:4: error: 'strncat' output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
> 
> Effectively, this is simply a strcat() but the use of strncat() suggests
> some form of overflow check. Regardless of whether this might actually
> overflow, using strlcat() instead of strncat() avoids the warning and
> makes the code more robust.
> 
> Fixes: bc64d05220f3 ("ath10k: debugfs support to get final TPC stats for 10.4 variants")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>

Patch applied to ath-next branch of ath.git, thanks.

6707ba0105a2 ath10k: avoid possible string overflow

-- 
https://patchwork.kernel.org/patch/10314201/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

^ permalink raw reply

* RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend
From: Liang, Cunming @ 2018-04-10 14:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Bie, Tiwei, Jason Wang, alex.williamson@redhat.com,
	ddutile@redhat.com, Duyck, Alexander H,
	virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
	kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, Daly, Dan, Wang, Zhihong, Tan, Jianfeng,
	Wang, Xiao W
In-Reply-To: <20180410163105-mutt-send-email-mst@kernel.org>



> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Tuesday, April 10, 2018 9:36 PM
> To: Liang, Cunming <cunming.liang@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>; Bie, Tiwei <tiwei.bie@intel.com>;
> Jason Wang <jasowang@redhat.com>; alex.williamson@redhat.com;
> ddutile@redhat.com; Duyck, Alexander H <alexander.h.duyck@intel.com>;
> virtio-dev@lists.oasis-open.org; linux-kernel@vger.kernel.org;
> kvm@vger.kernel.org; virtualization@lists.linux-foundation.org;
> netdev@vger.kernel.org; Daly, Dan <dan.daly@intel.com>; Wang, Zhihong
> <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>; Wang, Xiao
> W <xiao.w.wang@intel.com>
> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost
> backend
> 
> On Tue, Apr 10, 2018 at 09:23:53AM +0000, Liang, Cunming wrote:
> >
> >
> > > -----Original Message-----
> > > From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> > > Sent: Tuesday, April 10, 2018 3:52 PM
> > > To: Bie, Tiwei <tiwei.bie@intel.com>; Jason Wang
> > > <jasowang@redhat.com>
> > > Cc: mst@redhat.com; alex.williamson@redhat.com; ddutile@redhat.com;
> > > Duyck, Alexander H <alexander.h.duyck@intel.com>;
> > > virtio-dev@lists.oasis- open.org; linux-kernel@vger.kernel.org;
> > > kvm@vger.kernel.org; virtualization@lists.linux-foundation.org;
> > > netdev@vger.kernel.org; Daly, Dan <dan.daly@intel.com>; Liang,
> > > Cunming <cunming.liang@intel.com>; Wang, Zhihong
> > > <zhihong.wang@intel.com>; Tan, Jianfeng <jianfeng.tan@intel.com>;
> > > Wang, Xiao W <xiao.w.wang@intel.com>
> > > Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based
> > > hardware vhost backend
> > >
> > > On 10/04/2018 06:57, Tiwei Bie wrote:
> > > >> So you just move the abstraction layer from qemu to kernel, and
> > > >> you still need different drivers in kernel for different device
> > > >> interfaces of accelerators. This looks even more complex than
> > > >> leaving it in qemu. As you said, another idea is to implement
> > > >> userspace vhost backend for accelerators which seems easier and
> > > >> could co-work with other parts of qemu without inventing new type of
> messages.
> > > >
> > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > vendor specific hardware drivers in QEMU?
> > >
> > > I think so.  We have vendor-specific quirks, and at some point there
> > > was an idea of using quirks to implement (vendor-specific) live
> > > migration support for assigned devices.
> >
> > Vendor-specific quirks of accessing VGA is a small portion. Other major portions
> are still handled by guest driver.
> >
> > While in this case, when saying various vendor specific drivers in QEMU, it says
> QEMU takes over and provides the entire user space device drivers. Some parts
> are even not relevant to vhost, they're basic device function enabling. Moreover,
> it could be different kinds of devices(network/block/...) under vhost. No matter
> # of vendors or # of types, total LOC is not small.
> >
> > The idea is to avoid introducing these extra complexity out of QEMU, keeping
> vhost adapter simple. As vhost protocol is de factor standard, it leverages kernel
> device driver to provide the diversity. Changing once in QEMU, then it supports
> multi-vendor devices whose drivers naturally providing kernel driver there.
> >
> > If QEMU is going to build a user space driver framework there, we're open mind
> on that, even leveraging DPDK as the underlay library. Looking forward to more
> others' comments from community.
> >
> > Steve
> 
> Dependency on a kernel driver is fine IMHO. It's the dependency on a DPDK
> backend that makes people unhappy, since the functionality in question is setup-
> time only.

Agreed, we don't see dependency on kernel driver is a problem.

mdev based vhost backend (this patch set) is independent with vhost-user extension patch set. In fact, there're a few vhost-user providers, DPDK librte_vhost is one of them. FD.IO/VPP and snabbswitch have their own vhost-user providers. So I can't agree on vhost-user extension patch depends on DPDK backend. But anyway, that's the topic of another mail thread.

> 
> As others said, we do not need to go overeboard. A couple of small vendor-
> specific quirks in qemu isn't a big deal.

It's quite challenge to identify it's small or not, there's no uniform metric.

It's only dependent on QEMU itself, that's the obvious benefit. Tradeoff is to build the entire device driver. We don't object to do that in QEMU, but wanna make sure to understand the boundary size clearly.

> 
> > >
> > > Paolo

^ permalink raw reply

* [PATCH] net: sbni: Replace mdelay with msleep in sbni_probe1
From: Jia-Ju Bai @ 2018-04-10 14:22 UTC (permalink / raw)
  To: dhowells; +Cc: netdev, linux-kernel, Jia-Ju Bai

sbni_probe1() is never called in atomic context.

The call chains ending up at sbni_probe1() are:
[1] sbni_probe1() <- sbni_init() <- sbni_probe() <- net_olddevs_init()
[2] sbni_probe1() <- sbni_isa_probe() <- sbni_init() <- 
	sbni_probe() <- net_olddevs_init()
[3] sbni_probe1() <- sbni_pci_probe() <- sbni_init() <- 
	sbni_probe() <- net_olddevs_init()

net_olddevs_init() is set as a parameter of device_initcall().
This function is not called in atomic context.

Despite never getting called from atomic context, sbni_probe1()
calls mdelay() to busily wait.
This is not necessary and can be replaced with msleep() to 
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
 drivers/net/wan/sbni.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wan/sbni.c b/drivers/net/wan/sbni.c
index bde8c03..b7dd665 100644
--- a/drivers/net/wan/sbni.c
+++ b/drivers/net/wan/sbni.c
@@ -361,7 +361,7 @@ static int __init sbni_init(struct net_device *dev)
 		irq_mask = probe_irq_on();
 		outb( EN_INT | TR_REQ, ioaddr + CSR0 );
 		outb( PR_RES, ioaddr + CSR1 );
-		mdelay(50);
+		msleep(50);
 		irq = probe_irq_off(irq_mask);
 		outb( 0, ioaddr + CSR0 );
 
-- 
1.9.1

^ permalink raw reply related

* Re: marvell switch
From: Ran Shalit @ 2018-04-10 14:21 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev
In-Reply-To: <20180405204622.GE17495@lunn.ch>

On Thu, Apr 5, 2018 at 11:46 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> > Hi Ran
>> >
>> > The Marvell driver makes each port act like a normal Linux network
>> > interface. So if you want to enable a port, do
>> >
>> > ip link set lan0 up
>> >
>> > Want to add an ip address to a port
>> >
>> > ip addr add 10.42.42.42/24 dev lan0
>> >

I would please like to ask one more about it, if I may.
I thought I understood it all, but seems not.
The switch ports are connected to other PCs (except for one port which
is connected to cpu).
What is therefore the purpose of adding ip address to such ports (the
ip is configured externally in PC).

Thank you for the time,
ranran

>> > Want to bridge two ports
>> >
>> > ip link add name br0 type bridge
>> > ip link set dev br0 up
>> > ip link set dev lan0 master br0
>> > ip link set dev lan1 master br0
>> >
>> > Just treat them as normal interfaces.
>> >
>>
>> If I may please ask,
>> What is the purpose of using bridge for configuring switch interfaces.
>> Is it in order to isolate some ports from others?
>> I ask because according to my understanding the default configuration of
>> the driver is to enable switch in "flat" configuration, i.e. as if all
>> ports are connected to each other.
>
> Please think about what i said. They are standard Linux network
> interfaces. Do standard Linux network interfaces bridge themselves
> together by default? No, you need to configure a bridge.
>
>          Andrew

^ permalink raw reply

* RE: [PATCH] lan78xx: Don't reset the interface on open
From: Nisar.Sayed @ 2018-04-10 14:16 UTC (permalink / raw)
  To: phil, Woojung.Huh, UNGLinuxDriver, agraf, tbogendoerfer, netdev,
	linux-usb, linux-kernel
In-Reply-To: <1523362705-30032-1-git-send-email-phil@raspberrypi.org>

Thanks Phil, for identifying the issues.

> -	ret = lan78xx_reset(dev);
> -	if (ret < 0)
> -		goto done;
> -
>  	phy_start(net->phydev);
> 
>  	netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
> --

You may need to start the interrupts before "phy_start" instead of suppressing call to "lan78xx_reset".

+             if (dev->domain_data.phyirq > 0)
+                             phy_start_interrupts(dev->net->phydev);

- Nisar

^ permalink raw reply

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
From: William Tu @ 2018-04-10 14:14 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	Zhang, Qi Z, ravineet.singh
In-Reply-To: <CAJ+HfNgu+RE6G3=4dnyfaEZGxcw-NB1dAZKNgrZm0TFKooVdKQ@mail.gmail.com>

On Mon, Apr 9, 2018 at 11:47 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> 2018-04-09 23:51 GMT+02:00 William Tu <u9012063@gmail.com>:
>> On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> This RFC introduces a new address family called AF_XDP that is
>>> optimized for high performance packet processing and, in upcoming
>>> patch sets, zero-copy semantics. In this v2 version, we have removed
>>> all zero-copy related code in order to make it smaller, simpler and
>>> hopefully more review friendly. This RFC only supports copy-mode for
>>> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
>>> using the XDP_DRV path. Zero-copy support requires XDP and driver
>>> changes that Jesper Dangaard Brouer is working on. Some of his work is
>>> already on the mailing list for review. We will publish our zero-copy
>>> support for RX and TX on top of his patch sets at a later point in
>>> time.
>>>
>>> An AF_XDP socket (XSK) is created with the normal socket()
>>> syscall. Associated with each XSK are two queues: the RX queue and the
>>> TX queue. A socket can receive packets on the RX queue and it can send
>>> packets on the TX queue. These queues are registered and sized with
>>> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
>>> mandatory to have at least one of these queues for each socket. In
>>> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
>>> packet buffers. An RX or TX descriptor points to a data buffer in a
>>> memory area called a UMEM. RX and TX can share the same UMEM so that a
>>> packet does not have to be copied between RX and TX. Moreover, if a
>>> packet needs to be kept for a while due to a possible retransmit, the
>>> descriptor that points to that packet can be changed to point to
>>> another and reused right away. This again avoids copying data.
>>>
>>> This new dedicated packet buffer area is called a UMEM. It consists of
>>> a number of equally size frames and each frame has a unique frame
>>> id. A descriptor in one of the queues references a frame by
>>> referencing its frame id. The user space allocates memory for this
>>> UMEM using whatever means it feels is most appropriate (malloc, mmap,
>>> huge pages, etc). This memory area is then registered with the kernel
>>> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
>>> the FILL queue and the COMPLETION queue. The fill queue is used by the
>>> application to send down frame ids for the kernel to fill in with RX
>>> packet data. References to these frames will then appear in the RX
>>> queue of the XSK once they have been received. The completion queue,
>>> on the other hand, contains frame ids that the kernel has transmitted
>>> completely and can now be used again by user space, for either TX or
>>> RX. Thus, the frame ids appearing in the completion queue are ids that
>>> were previously transmitted using the TX queue. In summary, the RX and
>>> FILL queues are used for the RX path and the TX and COMPLETION queues
>>> are used for the TX path.
>>>
>> Can we register a UMEM to multiple device's queue?
>>
>
> No, one UMEM, one netdev queue in this RFC. That being said, there's
> nothing stopping a user from creating an additional UMEM, say UMEM',
> pointing to the same memory as UMEM, but bound to another
> netdev/queue. Note that the user space application has to make sure
> that the buffer handling is sane (user/kernel frame ownership).
>
> We used to allow to share UMEM between unrelated sockets, but after
> the introduction of the UMEM queues (fill/completion) that's no the
> case any more. For the zero-copy scenario, having to manage multiple
> DMA mappings per UMEM was a bit of a mess, so we went for the simpler
> (current) solution with one UMEM per netdev/queue.
>
>> So far the l2fwd sample code is sending/receiving from the same
>> queue. I'm thinking about forwarding packets from one device to another.
>> Now I'm copying packets from one device's RX desc to another device's TX
>> completion queue. But this introduces one extra copy.
>>
>
> So you've setup two identical UMEMs? Then you can just forward the
> incoming Rx descriptor to the other netdev's Tx queue. Note, that you
> only need to copy the descriptor, not the actual frame data.
>

Thanks!
I will give it a try, I guess you're saying I can do below:

int sfd1; // for device1
int sfd2; // for device2
...
// create 2 umem
umem1 = calloc(1, sizeof(*umem));
umem2 = calloc(1, sizeof(*umem));

// allocate 1 shared buffer, 1 xdp_umem_reg
posix_memalign(&bufs, ...)
mr.addr = (__u64)bufs; // shared for umem1,2
...

// umem reg the same mr
setsockopt(sfd1, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))
setsockopt(sfd2, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr))

// setup fill, completion, mmap for sfd1 and sfd2
...

Since both device can put frame data in 'bufs', I only need to copy
the descs between 2 umem1 and umem2. Am I understand correct?

Regards,
William

^ permalink raw reply

* [PATCH] net: wimax: i2400m: Replace GFP_ATOMIC with GFP_KERNEL in i2400m_tx_setup
From: Jia-Ju Bai @ 2018-04-10 14:14 UTC (permalink / raw)
  To: inaky.perez-gonzalez; +Cc: linux-wimax, netdev, linux-kernel, Jia-Ju Bai

i2400m_tx_setup() is never called in atomic context.

The call chains ending up at i2400m_tx_setup() are:
[1] i2400m_tx_setup() <- __i2400m_dev_start() <- 
	__i2400m_dev_reset_handle()
[2] i2400m_tx_setup() <- __i2400m_dev_start() <- i2400m_dev_start() <- 
	i2400m_setup() <- i2400mu_probe()
[3] i2400m_tx_setup() <- __i2400m_dev_start() <- i2400m_post_reset() <- 
	i2400mu_post_reset()

__i2400m_dev_reset_handle() is only set as a parameter of INIT_WORK() in 
i2400m_init().
i2400mu_probe() is only set as ".probe" in struct usb_driver.
i2400mu_post_reset() is only set as ".post_reset" in struct usb_driver.
These functions are not called in atomic context.

Despite never getting called from atomic context,
i2400m_tx_setup() calls kmalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improve the possibility of sucessful allocation.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
 drivers/net/wimax/i2400m/tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wimax/i2400m/tx.c b/drivers/net/wimax/i2400m/tx.c
index f20886a..cfbc187 100644
--- a/drivers/net/wimax/i2400m/tx.c
+++ b/drivers/net/wimax/i2400m/tx.c
@@ -973,7 +973,7 @@ int i2400m_tx_setup(struct i2400m *i2400m)
 	 * the WS was scheduled on another CPU */
 	INIT_WORK(&i2400m->wake_tx_ws, i2400m_wake_tx_work);

-	tx_buf = kmalloc(I2400M_TX_BUF_SIZE, GFP_ATOMIC);
+	tx_buf = kmalloc(I2400M_TX_BUF_SIZE, GFP_KERNEL);
 	if (tx_buf == NULL) {
 		result = -ENOMEM;
 		goto error_kmalloc;
-- 
1.9.1

^ permalink raw reply related

* Re: [patch net] devlink: convert occ_get op to separate registration
From: Jiri Pirko @ 2018-04-10 14:13 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Jiri Pirko, netdev@vger.kernel.org, davem@davemloft.net,
	idosch@mellanox.com, stable@vger.kernel.org
In-Reply-To: <DM5PR2101MB103228CD6B0A1EB82361E796FBBE0@DM5PR2101MB1032.namprd21.prod.outlook.com>

Tue, Apr 10, 2018 at 03:49:40PM CEST, Alexander.Levin@microsoft.com wrote:
>Hi,
>
>[This is an automated email]
>
>This commit has been processed because it contains a "Fixes:" tag,
>fixing commit: d9f9b9a4d05f devlink: Add support for resource abstraction.

I don't think we need this fix in stable. Thanks.


>
>The bot has tested the following trees: v4.16.1.
>
>v4.16.1: Failed to apply! Possible dependencies:
>    37923ed6b8ce ("netdevsim: Add simple FIB resource controller via devlink")
>    3ed898e8cd05 ("mlxsw: spectrum_kvdl: Make some functions static")
>    4f4bbf7c4e3d ("devlink: Perform cleanup of resource_set cb")
>    51d3c08e3371 ("mlxsw: spectrum_kvdl: Add support for linear division resources")
>    7f47b19bd744 ("mlxsw: spectrum_kvdl: Add support for per part occupancy")
>    88d2fbcda145 ("mlxsw: spectrum: Pass mlxsw_core as arg of mlxsw_sp_kvdl_resources_register()")
>    c8276dd250e9 ("mlxsw: spectrum_kvdl: Fix handling of resource_size_param")
>    f9b9120119ca ("mlxsw: Constify devlink_resource_ops")
>
>
>--
>Thanks,
>Sasha

^ permalink raw reply

* Re: aquantia - BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 on reboot
From: Yanko Kaneti @ 2018-04-10 14:06 UTC (permalink / raw)
  To: Igor Russkikh; +Cc: netdev
In-Reply-To: <6df8b628-cec5-9ce5-8430-e9a2cd27346d@aquantia.com>

On Tue, 2018-04-10 at 15:53 +0300, Igor Russkikh wrote:
> On 10.04.2018 15:42, Yanko Kaneti wrote:
> > Hello, 
> > 
> > Since 90869ddfefeb net: aquantia: Implement pci shutdown callback
> > I get the below oops on reboot.  Without the callback everything works
> > as expected.
> > 
> 
> Thanks, we also recently found out that.
> 
> Could you please try the below patch?

Works for me, as in, no crashes on reboot.

Thanks
-Yanko


> 
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> index c96a921..32f6d2e 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> @@ -951,9 +951,11 @@ void aq_nic_shutdown(struct aq_nic_s *self)
>  
>  	netif_device_detach(self->ndev);
>  
> -	err = aq_nic_stop(self);
> -	if (err < 0)
> -		goto err_exit;
> +	if (netif_running(self->ndev)) {
> +		err = aq_nic_stop(self);
> +		if (err < 0)
> +			goto err_exit;
> +	}
>  	aq_nic_deinit(self);
>  
>  err_exit:

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox