Linux Documentation
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v3 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Daniel Borkmann @ 2018-04-19 10:29 UTC (permalink / raw)
  To: Quentin Monnet, ast; +Cc: netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180417143438.7018-4-quentin.monnet@netronome.com>

On 04/17/2018 04:34 PM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Alexei:
> 
> - bpf_get_current_pid_tgid()
> - bpf_get_current_uid_gid()
> - bpf_get_current_comm()
> - bpf_skb_vlan_push()
> - bpf_skb_vlan_pop()
> - bpf_skb_get_tunnel_key()
> - bpf_skb_set_tunnel_key()
> - bpf_redirect()
> - bpf_perf_event_output()
> - bpf_get_stackid()
> - bpf_get_current_task()
> 
> v3:
> - bpf_skb_get_tunnel_key(): Change and improve description and example.
> - bpf_redirect(): Improve description of BPF_F_INGRESS flag.
> - bpf_perf_event_output(): Fix first sentence of description. Delete
>   wrong statement on context being evaluated as a struct pt_reg. Remove
>   the long yet incomplete example.
> - bpf_get_stackid(): Add a note about PERF_MAX_STACK_DEPTH being
>   configurable.
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h | 225 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 225 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 02b7d522b3c0..c59bf5b28164 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -591,6 +591,231 @@ union bpf_attr {
>   * 		performed again.
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
> + *
> + * u64 bpf_get_current_pid_tgid(void)
> + * 	Return
> + * 		A 64-bit integer containing the current tgid and pid, and
> + * 		created as such:
> + * 		*current_task*\ **->tgid << 32 \|**
> + * 		*current_task*\ **->pid**.
> + *
> + * u64 bpf_get_current_uid_gid(void)
> + * 	Return
> + * 		A 64-bit integer containing the current GID and UID, and
> + * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
> + *
> + * int bpf_get_current_comm(char *buf, u32 size_of_buf)
> + * 	Description
> + * 		Copy the **comm** attribute of the current task into *buf* of
> + * 		*size_of_buf*. The **comm** attribute contains the name of
> + * 		the executable (excluding the path) for the current task. The
> + * 		*size_of_buf* must be strictly positive. On success, the
> + * 		helper makes sure that the *buf* is NUL-terminated. On failure,
> + * 		it is filled with zeroes.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
> + * 	Description
> + * 		Push a *vlan_tci* (VLAN tag control information) of protocol
> + * 		*vlan_proto* to the packet associated to *skb*, then update
> + * 		the checksum. Note that if *vlan_proto* is different from
> + * 		**ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
> + * 		be **ETH_P_8021Q**.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_vlan_pop(struct sk_buff *skb)
> + * 	Description
> + * 		Pop a VLAN header from the packet associated to *skb*.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
> + * 	Description
> + * 		Get tunnel metadata. This helper takes a pointer *key* to an
> + * 		empty **struct bpf_tunnel_key** of **size**, that will be
> + * 		filled with tunnel metadata for the packet associated to *skb*.
> + * 		The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
> + * 		indicates that the tunnel is based on IPv6 protocol instead of
> + * 		IPv4.
> + *
> + * 		The **struct bpf_tunnel_key** is an object that generalizes the
> + * 		principal parameters used by various tunneling protocols into a
> + * 		single struct. This way, it can be used to easily make a
> + * 		decision based on the contents of the encapsulation header,
> + * 		"summarized" in this struct. In particular, it holds the IP
> + * 		address of the remote end (IPv4 or IPv6, depending on the case)
> + * 		in *key*\ **->remote_ipv4** or *key*\ **->remote_ipv6**.

I would also mention the tunnel_id which is typically mapped to a vni, allowing
to make this id programmable together with bpf_skb_set_tunnel_key() helper.

> + * 		Let's imagine that the following code is part of a program
> + * 		attached to the TC ingress interface, on one end of a GRE
> + * 		tunnel, and is supposed to filter out all messages coming from
> + * 		remote ends with IPv4 address other than 10.0.0.1:
> + *
> + * 		::
> + *
> + * 			int ret;
> + * 			struct bpf_tunnel_key key = {};
> + * 			
> + * 			ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
> + * 			if (ret < 0)
> + * 				return TC_ACT_SHOT;	// drop packet
> + * 			
> + * 			if (key.remote_ipv4 != 0x0a000001)
> + * 				return TC_ACT_SHOT;	// drop packet
> + * 			
> + * 			return TC_ACT_OK;		// accept packet

Lets also add a small sentence that this interface can be used with all
encap devs that can operate in 'collect metadata' mode, where instead of
having one netdevice per specific configuration, the 'collect metadata'
mode only requires a single device where the configuration can be extracted
from those BPF helpers. Could also mentioned this can be used together with
vxlan, geneve, gre and ipip tunnels.

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
> + * 	Description
> + * 		Populate tunnel metadata for packet associated to *skb.* The
> + * 		tunnel metadata is set to the contents of *key*, of *size*. The
> + * 		*flags* can be set to a combination of the following values:
> + *
> + * 		**BPF_F_TUNINFO_IPV6**
> + * 			Indicate that the tunnel is based on IPv6 protocol
> + * 			instead of IPv4.
> + * 		**BPF_F_ZERO_CSUM_TX**
> + * 			For IPv4 packets, add a flag to tunnel metadata
> + * 			indicating that checksum computation should be skipped
> + * 			and checksum set to zeroes.
> + * 		**BPF_F_DONT_FRAGMENT**
> + * 			Add a flag to tunnel metadata indicating that the
> + * 			packet should not be fragmented.
> + * 		**BPF_F_SEQ_NUMBER**
> + * 			Add a flag to tunnel metadata indicating that a
> + * 			sequence number should be added to tunnel header before
> + * 			sending the packet. This flag was added for GRE
> + * 			encapsulation, but might be used with other protocols
> + * 			as well in the future.
> + *
> + * 		Here is a typical usage on the transmit path:
> + *
> + * 		::
> + *
> + * 			struct bpf_tunnel_key key;
> + * 			     populate key ...
> + * 			bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
> + * 			bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);

See above, maybe this can just reference bpf_skb_get_tunnel_key() from here.

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_redirect(u32 ifindex, u64 flags)
> + * 	Description
> + * 		Redirect the packet to another net device of index *ifindex*.
> + * 		This helper is somewhat similar to **bpf_clone_redirect**\
> + * 		(), except that the packet is not cloned, which provides
> + * 		increased performance.
> + *
> + * 		Save for XDP, both ingress and egress interfaces can be used

s/Save/Same/ ?

> + * 		for redirection. The **BPF_F_INGRESS** value in *flags* is used

(In XDP case, BPF_F_INGRESS cannot be used.)

> + * 		to make the distinction (ingress path is selected if the flag
> + * 		is present, egress path otherwise). Currently, XDP only
> + * 		supports redirection to the egress interface, and accepts no
> + * 		flag at all.
> + * 	Return
> + * 		For XDP, the helper returns **XDP_REDIRECT** on success or
> + * 		**XDP_ABORT** on error. For other program types, the values
> + * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
> + * 		error.
> + *
> + * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
> + * 	Description
> + * 		Write raw *data* blob into a special BPF perf event held by
> + * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf
> + * 		event must have the following attributes: **PERF_SAMPLE_RAW**
> + * 		as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
> + * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
> + *
> + * 		The *flags* are used to indicate the index in *map* for which
> + * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
> + * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
> + * 		to indicate that the index of the current CPU core should be
> + * 		used.
> + *
> + * 		The value to write, of *size*, is passed through eBPF stack and
> + * 		pointed by *data*.
> + *
> + * 		The context of the program *ctx* needs also be passed to the
> + * 		helper.
> + *
> + * 		On user space, a program willing to read the values needs to
> + * 		call **perf_event_open**\ () on the perf event (either for
> + * 		one or for all CPUs) and to store the file descriptor into the
> + * 		*map*. This must be done before the eBPF program can send data
> + * 		into it. An example is available in file
> + * 		*samples/bpf/trace_output_user.c* in the Linux kernel source
> + * 		tree (the eBPF program counterpart is in
> + * 		*samples/bpf/trace_output_kern.c*).
> + *
> + * 		**bpf_perf_event_output**\ () achieves better performance
> + * 		than **bpf_trace_printk**\ () for sharing data with user
> + * 		space, and is much better suitable for streaming data from eBPF
> + * 		programs.

Would also mentioned that this helper can be used out of tc and XDP BPF
programs as well and allows for passing i) only custom structs, ii) only
packet payload, or iii) a combination of both to user space listeners.

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
> + * 	Description
> + * 		Walk a user or a kernel stack and return its id. To achieve
> + * 		this, the helper needs *ctx*, which is a pointer to the context
> + * 		on which the tracing program is executed, and a pointer to a
> + * 		*map* of type **BPF_MAP_TYPE_STACK_TRACE**.
> + *
> + * 		The last argument, *flags*, holds the number of stack frames to
> + * 		skip (from 0 to 255), masked with
> + * 		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
> + * 		a combination of the following flags:
> + *
> + * 		**BPF_F_USER_STACK**
> + * 			Collect a user space stack instead of a kernel stack.
> + * 		**BPF_F_FAST_STACK_CMP**
> + * 			Compare stacks by hash only.
> + * 		**BPF_F_REUSE_STACKID**
> + * 			If two different stacks hash into the same *stackid*,
> + * 			discard the old one.
> + *
> + * 		The stack id retrieved is a 32 bit long integer handle which
> + * 		can be further combined with other data (including other stack
> + * 		ids) and used as a key into maps. This can be useful for
> + * 		generating a variety of graphs (such as flame graphs or off-cpu
> + * 		graphs).
> + *
> + * 		For walking a stack, this helper is an improvement over
> + * 		**bpf_probe_read**\ (), which can be used with unrolled loops
> + * 		but is not efficient and consumes a lot of eBPF instructions.
> + * 		Instead, **bpf_get_stackid**\ () can collect up to
> + * 		**PERF_MAX_STACK_DEPTH** both kernel and user frames. Note that
> + * 		this limit can be controlled with the **sysctl** program, and
> + * 		that it should be manually increased in order to profile long
> + * 		user stacks (such as stacks for Java programs). To do so, use:
> + *
> + * 		::
> + *
> + * 			# sysctl kernel.perf_event_max_stack=<new value>
> + *
> + * 	Return
> + * 		The positive or null stack id on success, or a negative error
> + * 		in case of failure.
> + *
> + * u64 bpf_get_current_task(void)
> + * 	Return
> + * 		A pointer to the current task struct.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 4/8] bpf: add documentation for eBPF helpers (23-32)
From: Daniel Borkmann @ 2018-04-19 11:16 UTC (permalink / raw)
  To: Quentin Monnet, ast; +Cc: netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180417143438.7018-5-quentin.monnet@netronome.com>

On 04/17/2018 04:34 PM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Daniel:
> 
> - bpf_get_prandom_u32()
> - bpf_get_smp_processor_id()
> - bpf_get_cgroup_classid()
> - bpf_get_route_realm()
> - bpf_skb_load_bytes()
> - bpf_csum_diff()
> - bpf_skb_get_tunnel_opt()
> - bpf_skb_set_tunnel_opt()
> - bpf_skb_change_proto()
> - bpf_skb_change_type()
> 
> v3:
> - bpf_get_prandom_u32(): Fix helper name :(. Add description, including
>   a note on the internal random state.
> - bpf_get_smp_processor_id(): Add description, including a note on the
>   processor id remaining stable during program run.
> - bpf_get_cgroup_classid(): State that CONFIG_CGROUP_NET_CLASSID is
>   required to use the helper. Add a reference to related documentation.
>   State that placing a task in net_cls controller disables cgroup-bpf.
> - bpf_get_route_realm(): State that CONFIG_CGROUP_NET_CLASSID is
>   required to use this helper.
> - bpf_skb_load_bytes(): Fix comment on current use cases for the helper.
> 
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h | 152 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 152 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c59bf5b28164..d748f65a8f58 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -483,6 +483,23 @@ union bpf_attr {
>   * 		The number of bytes written to the buffer, or a negative error
>   * 		in case of failure.
>   *
> + * u32 bpf_get_prandom_u32(void)
> + * 	Description
> + * 		Get a pseudo-random number. Note that this helper uses its own
> + * 		pseudo-random internal state, and cannot be used to infer the
> + * 		seed of other random functions in the kernel.

We should still add that this prng is not cryptographically secure.

> + * 	Return
> + * 		A random 32-bit unsigned value.
> + *
> + * u32 bpf_get_smp_processor_id(void)
> + * 	Description
> + * 		Get the SMP (Symmetric multiprocessing) processor id. Note that

Nit: s/Symmetric/symmetric/ ?

> + * 		all programs run with preemption disabled, which means that the
> + * 		SMP processor id is stable during all the execution of the
> + * 		program.
> + * 	Return
> + * 		The SMP id of the processor running the program.
> + *
>   * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
>   * 	Description
>   * 		Store *len* bytes from address *from* into the packet
> @@ -615,6 +632,27 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
> + * 	Description
> + * 		Retrieve the classid for the current task, i.e. for the
> + * 		net_cls (network classifier) cgroup to which *skb* belongs.
> + *
> + * 		This helper is only available is the kernel was compiled with
> + * 		the **CONFIG_CGROUP_NET_CLASSID** configuration option set to
> + * 		"**y**" or to "**m**".
> + *
> + * 		Note that placing a task into the net_cls controller completely
> + * 		disables the execution of eBPF programs with the cgroup.

I'm not sure I follow the above sentence, what do you mean by that?

I would definitely also add here that this helper is limited to cgroups v1
only, and that it works on clsact TC egress hook but not the ingress one.

> + * 		Also note that, in the above description, the "network
> + * 		classifier" cgroup does not designate a generic classifier, but
> + * 		a particular mechanism that provides an interface to tag
> + * 		network packets with a specific class identifier. See also the

The "generic classifier" part is a bit strange to parse. I would probably
leave the first part out and explain that this provides a means to tag
packets based on a user-provided ID for all traffic coming from the tasks
belonging to the related cgroup.

> + * 		related kernel documentation, available from the Linux sources
> + * 		in file *Documentation/cgroup-v1/net_cls.txt*.
> + * 	Return
> + * 		The classid, or 0 for the default unconfigured classid.
> + *
>   * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>   * 	Description
>   * 		Push a *vlan_tci* (VLAN tag control information) of protocol
> @@ -734,6 +772,16 @@ union bpf_attr {
>   * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>   * 		error.
>   *
> + * u32 bpf_get_route_realm(struct sk_buff *skb)
> + * 	Description
> + * 		Retrieve the realm or the route, that is to say the
> + * 		**tclassid** field of the destination for the *skb*. This
> + * 		helper is available only if the kernel was compiled with
> + * 		**CONFIG_IP_ROUTE_CLASSID** configuration option.

Could mention that this is a similar user provided tag like in the net_cls
case with cgroups only that this applies to routes here (dst entries) which
hold this tag.

Also, should say that this works with clsact TC egress hook or alternatively
conventional classful egress qdiscs, but not on TC ingress. In case of clsact
TC egress hook this has the advantage that the dst entry has not been dropped
yet in the xmit path. Therefore, the dst entry does not need to be artificially
held via netif_keep_dst() for a classful qdisc until the skb is freed.

> + * 	Return
> + * 		The realm of the route for the packet associated to *sdb*, or 0

Typo: sdb

> + * 		if none was found.
> + *
>   * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
>   * 	Description
>   * 		Write raw *data* blob into a special BPF perf event held by
> @@ -770,6 +818,23 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * int bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
> + * 	Description
> + * 		This helper was provided as an easy way to load data from a
> + * 		packet. It can be used to load *len* bytes from *offset* from
> + * 		the packet associated to *skb*, into the buffer pointed by
> + * 		*to*.
> + *
> + * 		Since Linux 4.7, usage of this helper has mostly been replaced
> + * 		by "direct packet access", enabling packet data to be
> + * 		manipulated with *skb*\ **->data** and *skb*\ **->data_end**
> + * 		pointing respectively to the first byte of packet data and to
> + * 		the byte after the last byte of packet data. However, it
> + * 		remains useful if one wishes to read large quantities of data
> + * 		at once from a packet.

I would add: s/at once from a packet/at once from a packet into the BPF stack/

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
>   * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
>   * 	Description
>   * 		Walk a user or a kernel stack and return its id. To achieve
> @@ -813,6 +878,93 @@ union bpf_attr {
>   * 		The positive or null stack id on success, or a negative error
>   * 		in case of failure.
>   *
> + * s64 bpf_csum_diff(__be32 *from, u32 from_size, __be32 *to, u32 to_size, __wsum seed)
> + * 	Description
> + * 		Compute a checksum difference, from the raw buffer pointed by
> + * 		*from*, of length *from_size* (that must be a multiple of 4),
> + * 		towards the raw buffer pointed by *to*, of size *to_size*
> + * 		(same remark). An optional *seed* can be added to the value.

Wrt seed, we should explicitly mention that this can be cascaded but also that
this helper works in combination with the l3/l4 csum ones where you feed in this
diff coming from bpf_csum_diff().

> + * 		This is flexible enough to be used in several ways:
> + *
> + * 		* With *from_size* == 0, *to_size* > 0 and *seed* set to
> + * 		  checksum, it can be used when pushing new data.
> + * 		* With *from_size* > 0, *to_size* == 0 and *seed* set to
> + * 		  checksum, it can be used when removing data from a packet.
> + * 		* With *from_size* > 0, *to_size* > 0 and *seed* set to 0, it
> + * 		  can be used to compute a diff. Note that *from_size* and
> + * 		  *to_size* do not need to be equal.
> + * 	Return
> + * 		The checksum result, or a negative error code in case of
> + * 		failure.
> + *
> + * int bpf_skb_get_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size)
> + * 	Description
> + * 		Retrieve tunnel options metadata for the packet associated to
> + * 		*skb*, and store the raw tunnel option data to the buffer *opt*
> + * 		of *size*.
> + * 	Return
> + * 		The size of the option data retrieved.
> + *
> + * int bpf_skb_set_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size)
> + * 	Description
> + * 		Set tunnel options metadata for the packet associated to *skb*
> + * 		to the option data contained in the raw buffer *opt* of *size*.

Also here the same remark with collect meta data I made earlier, and as a
particular example where this can be used in combination with geneve where
this allows for pushing and retrieving (bpf_skb_get_tunnel_opt() case)
arbitrary TLVs from the BPF program that allows for full customization.

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags)
> + * 	Description
> + * 		Change the protocol of the *skb* to *proto*. Currently
> + * 		supported are transition from IPv4 to IPv6, and from IPv6 to
> + * 		IPv4. The helper takes care of the groundwork for the
> + * 		transition, including resizing the socket buffer. The eBPF
> + * 		program is expected to fill the new headers, if any, via
> + * 		**skb_store_bytes**\ () and to recompute the checksums with
> + * 		**bpf_l3_csum_replace**\ () and **bpf_l4_csum_replace**\
> + * 		().

Could mention the main use case for NAT64 out of an BPF program.

> + *
> + * 		Internally, the GSO type is marked as dodgy so that headers are
> + * 		checked and segments are recalculated by the GSO/GRO engine.
> + * 		The size for GSO target is adapted as well.
> + *
> + * 		All values for *flags* are reserved for future usage, and must
> + * 		be left at zero.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_change_type(struct sk_buff *skb, u32 type)
> + * 	Description
> + * 		Change the packet type for the packet associated to *skb*. This
> + * 		comes down to setting *skb*\ **->pkt_type** to *type*, except
> + * 		the eBPF program does not have a write access to *skb*\
> + * 		**->pkt_type** beside this helper. Using a helper here allows
> + * 		for graceful handling of errors.
> + *
> + * 		The major use case is to change incoming *skb*s to
> + * 		**PACKET_HOST** in a programmatic way instead of having to
> + * 		recirculate via **redirect**\ (..., **BPF_F_INGRESS**), for
> + * 		example.
> + *
> + * 		Note that *type* only allows certain values. At this time, they
> + * 		are:
> + *
> + * 		**PACKET_HOST**
> + * 			Packet is for us.
> + * 		**PACKET_BROADCAST**
> + * 			Send packet to all.
> + * 		**PACKET_MULTICAST**
> + * 			Send packet to group.
> + * 		**PACKET_OTHERHOST**
> + * 			Send packet to someone else.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
>   * u64 bpf_get_current_task(void)
>   * 	Return
>   * 		A pointer to the current task struct.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 5/8] bpf: add documentation for eBPF helpers (33-41)
From: Daniel Borkmann @ 2018-04-19 12:27 UTC (permalink / raw)
  To: Quentin Monnet, ast; +Cc: netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180417143438.7018-6-quentin.monnet@netronome.com>

On 04/17/2018 04:34 PM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Daniel:
> 
> - bpf_get_hash_recalc()
> - bpf_skb_change_tail()
> - bpf_skb_pull_data()
> - bpf_csum_update()
> - bpf_set_hash_invalid()
> - bpf_get_numa_node_id()
> - bpf_set_hash()
> - bpf_skb_adjust_room()
> - bpf_xdp_adjust_meta()
> 
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h | 155 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 155 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d748f65a8f58..3a40f5debac2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -965,9 +965,164 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * u32 bpf_get_hash_recalc(struct sk_buff *skb)
> + * 	Description
> + * 		Retrieve the hash of the packet, *skb*\ **->hash**. If it is
> + * 		not set, in particular if the hash was cleared due to mangling,
> + * 		recompute this hash. Later accesses to the hash can be done
> + * 		directly with *skb*\ **->hash**.
> + *
> + * 		Calling **bpf_set_hash_invalid**\ (), changing a packet
> + * 		prototype with **bpf_skb_change_proto**\ (), or calling
> + * 		**bpf_skb_store_bytes**\ () with the
> + * 		**BPF_F_INVALIDATE_HASH** are actions susceptible to clear
> + * 		the hash and to trigger a new computation for the next call to
> + * 		**bpf_get_hash_recalc**\ ().
> + * 	Return
> + * 		The 32-bit hash.
> + *
>   * u64 bpf_get_current_task(void)
>   * 	Return
>   * 		A pointer to the current task struct.
> + *
> + * int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
> + * 	Description
> + * 		Resize (trim or grow) the packet associated to *skb* to the
> + * 		new *len*. The *flags* are reserved for future usage, and must
> + * 		be left at zero.
> + *
> + * 		The basic idea is that the helper performs the needed work to
> + * 		change the size of the packet, then the eBPF program rewrites
> + * 		the rest via helpers like **bpf_skb_store_bytes**\ (),
> + * 		**bpf_l3_csum_replace**\ (), **bpf_l3_csum_replace**\ ()
> + * 		and others. This helper is a slow path utility intended for
> + * 		replies with control messages. And because it is targeted for
> + * 		slow path, the helper itself can afford to be slow: it
> + * 		implicitly linearizes, unclones and drops offloads from the
> + * 		*skb*.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_pull_data(struct sk_buff *skb, u32 len)
> + * 	Description
> + * 		Pull in non-linear data in case the *skb* is non-linear and not
> + * 		all of *len* are part of the linear section. Make *len* bytes
> + * 		from *skb* readable and writable. If a zero value is passed for
> + * 		*len*, then the whole length of the *skb* is pulled.
> + *
> + * 		This helper is only needed for reading and writing with direct
> + * 		packet access.
> + *
> + * 		For direct packet access, when testing that offsets to access
> + * 		are within packet boundaries (test on *skb*\ **->data_end**)
> + * 		fails, programs just bail out, or, in the direct read case, use

I would add here to why it can fail, meaning either due to invalid offsets
or due to the requested data being in non-linear parts of the skb where then
either the bpf_skb_load_bytes() can be used as you mentioned or the data
pulled in via bpf_skb_pull_data().

> + * 		**bpf_skb_load_bytes()** as an alternative to overcome this
> + * 		limitation. If such data sits in non-linear parts, it is
> + * 		possible to pull them in once with the new helper, retest and
> + * 		eventually access them.

You do this here, but maybe slightly rearranging this one paragraph a bit as
to why one would use either of the helpers would help reading flow a bit.

> + * 		At the same time, this also makes sure the skb is uncloned,
> + * 		which is a necessary condition for direct write. As this needs
> + * 		to be an invariant for the write part only, the verifier
> + * 		detects writes and adds a prologue that is calling
> + * 		**bpf_skb_pull_data()** to effectively unclone the skb from the
> + * 		very beginning in case it is indeed cloned.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * s64 bpf_csum_update(struct sk_buff *skb, __wsum csum)
> + * 	Description
> + * 		Add the checksum *csum* into *skb*\ **->csum** in case the
> + * 		driver fed us an IP checksum. Return an error otherwise. This

It's not IP checksum specifically (if that is what you meant), it's when the
driver propagates CHECKSUM_COMPLETE to the skb, where the device has supplied
the checksum of the whole packet into skb->csum. At TC ingress time, this
covers everything starting from net header offset to the end of the skb since
mac hdr skb->csum has been pulled already. Main use case indeed direct packet
access.

> + * 		header is intended to be used in combination with
> + * 		**bpf_csum_diff()** helper, in particular when the checksum
> + * 		needs to be updated after data has been written into the packet
> + * 		through direct packet access.
> + * 	Return
> + * 		The checksum on success, or a negative error code in case of
> + * 		failure.
> + *
> + * void bpf_set_hash_invalid(struct sk_buff *skb)
> + * 	Description
> + * 		Invalidate the current *skb*\ **->hash**. It can be used after
> + * 		mangling on headers through direct packet access, in order to
> + * 		indicate that the hash is outdated and to trigger a
> + * 		recalculation the next time the kernel tries to access this
> + * 		hash.

[...] hash or through the helper bpf_get_hash_recalc().

> + *
> + * int bpf_get_numa_node_id(void)
> + * 	Description
> + * 		Return the id of the current NUMA node. The primary use case
> + * 		for this helper is the selection of sockets for the local NUMA
> + * 		node, when the program is attached to sockets using the
> + * 		**SO_ATTACH_REUSEPORT_EBPF** option (see also **socket(7)**).

I would mention that this also available for other types similarly to
bpf_get_smp_processor_id() helper though. (Otherwise one might read that
this could not be the case.)

> + * 	Return
> + * 		The id of current NUMA node.
> + *
> + * u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
> + * 	Description
> + * 		Set the full hash for *skb* (set the field *skb*\ **->hash**)
> + * 		to value *hash*.
> + * 	Return
> + * 		0
> + *
> + * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
> + * 	Description
> + * 		Grow or shrink the room for data in the packet associated to
> + * 		*skb* by *len_diff*, and according to the selected *mode*.
> + *
> + * 		There is a single supported mode at this time:
> + *
> + * 		* **BPF_ADJ_ROOM_NET**: Adjust room at the network layer
> + * 		  (room space is added or removed below the layer 3 header).
> + *
> + * 		All values for *flags* are reserved for future usage, and must
> + * 		be left at zero.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta)
> + * 	Description
> + * 		Adjust the address pointed by *xdp_md*\ **->data_meta** by
> + * 		*delta* (which can be positive or negative). Note that this
> + * 		operation modifies the address stored in *xdp_md*\ **->data**,
> + * 		so the latter must be loaded only after the helper has been
> + * 		called.
> + *
> + * 		The use of *xdp_md*\ **->data_meta** is optional and programs
> + * 		are not required to use it. The rationale is that when the
> + * 		packet is processed with XDP (e.g. as DoS filter), it is
> + * 		possible to push further meta data along with it before passing
> + * 		to the stack, and to give the guarantee that an ingress eBPF
> + * 		program attached as a TC classifier on the same device can pick
> + * 		this up for further post-processing. Since TC works with socket
> + * 		buffers, it remains possible to set from XDP the **mark** or
> + * 		**priority** pointers, or other pointers for the socket buffer.
> + * 		Having this scratch space generic and programmable allows for
> + * 		more flexibility as the user is free to store whatever meta
> + * 		data they need.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 6/8] bpf: add documentation for eBPF helpers (42-50)
From: Daniel Borkmann @ 2018-04-19 12:40 UTC (permalink / raw)
  To: Quentin Monnet, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Kaixu Xia,
	Martin KaFai Lau, Sargun Dhillon, Thomas Graf, Gianluca Borello,
	Chenbo Feng
In-Reply-To: <20180417143438.7018-7-quentin.monnet@netronome.com>

On 04/17/2018 04:34 PM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions:
> 
> Helper from Kaixu:
> - bpf_perf_event_read()
> 
> Helpers from Martin:
> - bpf_skb_under_cgroup()
> - bpf_xdp_adjust_head()
> 
> Helpers from Sargun:
> - bpf_probe_write_user()
> - bpf_current_task_under_cgroup()
> 
> Helper from Thomas:
> - bpf_skb_change_head()
> 
> Helper from Gianluca:
> - bpf_probe_read_str()
> 
> Helpers from Chenbo:
> - bpf_get_socket_cookie()
> - bpf_get_socket_uid()
> 
> v3:
> - bpf_perf_event_read(): Fix time of selection for perf event type in
>   description. Remove occurences of "cores" to avoid confusion with
>   "CPU".
> 
> Cc: Kaixu Xia <xiakaixu@huawei.com>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: Sargun Dhillon <sargun@sargun.me>
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: Gianluca Borello <g.borello@gmail.com>
> Cc: Chenbo Feng <fengc@google.com>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  include/uapi/linux/bpf.h | 158 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 158 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3a40f5debac2..dd79a1c82adf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -753,6 +753,25 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * u64 bpf_perf_event_read(struct bpf_map *map, u64 flags)
> + * 	Description
> + * 		Read the value of a perf event counter. This helper relies on a
> + * 		*map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of
> + * 		the perf event counter is selected when *map* is updated with
> + * 		perf event file descriptors. The *map* is an array whose size
> + * 		is the number of available CPUs, and each cell contains a value
> + * 		relative to one CPU. The value to retrieve is indicated by
> + * 		*flags*, that contains the index of the CPU to look up, masked
> + * 		with **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
> + * 		**BPF_F_CURRENT_CPU** to indicate that the value for the
> + * 		current CPU should be retrieved.
> + *
> + * 		Note that before Linux 4.13, only hardware perf event can be
> + * 		retrieved.
> + * 	Return
> + * 		The value of the perf event counter read from the map, or a
> + * 		negative error code in case of failure.
> + *
>   * int bpf_redirect(u32 ifindex, u64 flags)
>   * 	Description
>   * 		Redirect the packet to another net device of index *ifindex*.
> @@ -965,6 +984,17 @@ union bpf_attr {
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
>   *
> + * int bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index)
> + * 	Description
> + * 		Check whether *skb* is a descendant of the cgroup2 held by
> + * 		*map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
> + * 	Return
> + * 		The return value depends on the result of the test, and can be:
> + *
> + * 		* 0, if the *skb* failed the cgroup2 descendant test.
> + * 		* 1, if the *skb* succeeded the cgroup2 descendant test.
> + * 		* A negative error code, if an error occurred.
> + *
>   * u32 bpf_get_hash_recalc(struct sk_buff *skb)
>   * 	Description
>   * 		Retrieve the hash of the packet, *skb*\ **->hash**. If it is
> @@ -985,6 +1015,37 @@ union bpf_attr {
>   * 	Return
>   * 		A pointer to the current task struct.
>   *
> + * int bpf_probe_write_user(void *dst, const void *src, u32 len)
> + * 	Description
> + * 		Attempt in a safe way to write *len* bytes from the buffer
> + * 		*src* to *dst* in memory. It only works for threads that are in
> + * 		user context.

Plus the dst address must be a valid user space address.

> + * 		This helper should not be used to implement any kind of
> + * 		security mechanism because of TOC-TOU attacks, but rather to
> + * 		debug, divert, and manipulate execution of semi-cooperative
> + * 		processes.
> + *
> + * 		Keep in mind that this feature is meant for experiments, and it
> + * 		has a risk of crashing the system and running programs.

Ditto, crashing user space applications.

> + * 		Therefore, when an eBPF program using this helper is attached,
> + * 		a warning including PID and process name is printed to kernel
> + * 		logs.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_current_task_under_cgroup(struct bpf_map *map, u32 index)
> + * 	Description
> + * 		Check whether the probe is being run is the context of a given
> + * 		subset of the cgroup2 hierarchy. The cgroup2 to test is held by
> + * 		*map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*.
> + * 	Return
> + * 		The return value depends on the result of the test, and can be:
> + *
> + * 		* 0, if the *skb* task belongs to the cgroup2.
> + * 		* 1, if the *skb* task does not belong to the cgroup2.
> + * 		* A negative error code, if an error occurred.
> + *
>   * int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
>   * 	Description
>   * 		Resize (trim or grow) the packet associated to *skb* to the
> @@ -1069,6 +1130,103 @@ union bpf_attr {
>   * 	Return
>   * 		The id of current NUMA node.
>   *
> + * int bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags)
> + * 	Description
> + * 		Grows headroom of packet associated to *skb* and adjusts the
> + * 		offset of the MAC header accordingly, adding *len* bytes of
> + * 		space. It automatically extends and reallocates memory as
> + * 		required.
> + *
> + * 		This helper can be used on a layer 3 *skb* to push a MAC header
> + * 		for redirection into a layer 2 device.
> + *
> + * 		All values for *flags* are reserved for future usage, and must
> + * 		be left at zero.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta)
> + * 	Description
> + * 		Adjust (move) *xdp_md*\ **->data** by *delta* bytes. Note that
> + * 		it is possible to use a negative value for *delta*. This helper
> + * 		can be used to prepare the packet for pushing or popping
> + * 		headers.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr)
> + * 	Description
> + * 		Copy a NUL terminated string from an unsafe address
> + * 		*unsafe_ptr* to *dst*. The *size* should include the
> + * 		terminating NUL byte. In case the string length is smaller than
> + * 		*size*, the target is not padded with further NUL bytes. If the
> + * 		string length is larger than *size*, just *size*-1 bytes are
> + * 		copied and the last byte is set to NUL.
> + *
> + * 		On success, the length of the copied string is returned. This
> + * 		makes this helper useful in tracing programs for reading
> + * 		strings, and more importantly to get its length at runtime. See
> + * 		the following snippet:
> + *
> + * 		::
> + *
> + * 			SEC("kprobe/sys_open")
> + * 			void bpf_sys_open(struct pt_regs *ctx)
> + * 			{
> + * 			        char buf[PATHLEN]; // PATHLEN is defined to 256
> + * 			        int res = bpf_probe_read_str(buf, sizeof(buf),
> + * 				                             ctx->di);
> + *
> + * 				// Consume buf, for example push it to
> + * 				// userspace via bpf_perf_event_output(); we
> + * 				// can use res (the string length) as event
> + * 				// size, after checking its boundaries.
> + * 			}
> + *
> + * 		In comparison, using **bpf_probe_read()** helper here instead
> + * 		to read the string would require to estimate the length at
> + * 		compile time, and would often result in copying more memory
> + * 		than necessary.
> + *
> + * 		Another useful use case is when parsing individual process
> + * 		arguments or individual environment variables navigating
> + * 		*current*\ **->mm->arg_start** and *current*\
> + * 		**->mm->env_start**: using this helper and the return value,
> + * 		one can quickly iterate at the right offset of the memory area.
> + * 	Return
> + * 		On success, the strictly positive length of the string,
> + * 		including the trailing NUL character. On error, a negative
> + * 		value.
> + *
> + * u64 bpf_get_socket_cookie(struct sk_buff *skb)
> + * 	Description
> + * 		Retrieve the socket cookie generated by the kernel from a
> + * 		**struct sk_buff** with a known socket. If none has been set
> + * 		yet, generate a new cookie. This helper can be useful for
> + * 		monitoring per socket networking traffic statistics as it
> + * 		provides a unique socket identifier per namespace.
> + * 	Return
> + * 		A 8-byte long non-decreasing number on success, or 0 if the
> + * 		socket field is missing inside *skb*.
> + *
> + * u32 bpf_get_socket_uid(struct sk_buff *skb)
> + * 	Return
> + * 		The owner UID of the socket associated to *skb*. If the socket
> + * 		is **NULL**, or if it is not a full socket (i.e. if it is a
> + * 		time-wait or a request socket instead), **overflowuid** value
> + * 		is returned (note that **overflowuid** might also be the actual
> + * 		UID value for the socket).
> + *
>   * u32 bpf_set_hash(struct sk_buff *skb, u32 hash)
>   * 	Description
>   * 		Set the full hash for *skb* (set the field *skb*\ **->hash**)
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Quentin Monnet @ 2018-04-19 12:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	John Fastabend
In-Reply-To: <20180418174319.37f6c793@redhat.com>

2018-04-18 17:43 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
> On Wed, 18 Apr 2018 15:09:41 +0100
> Quentin Monnet <quentin.monnet@netronome.com> wrote:
> 
>> 2018-04-18 15:34 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
>>> On Tue, 17 Apr 2018 15:34:38 +0100
>>> Quentin Monnet <quentin.monnet@netronome.com> wrote:
>>>   
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 350459c583de..3d329538498f 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -1276,6 +1276,50 @@ union bpf_attr {
>>>>   * 	Return
>>>>   * 		0 on success, or a negative error in case of failure.
>>>>   *
>>>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
>>>> + * 	Description
>>>> + * 		Redirect the packet to the endpoint referenced by *map* at
>>>> + * 		index *key*. Depending on its type, his *map* can contain  
>>>                                                     ^^^
>>>
>>> "his" -> "this"  
>>
>> Thanks!
>>
>>>> + * 		references to net devices (for forwarding packets through other
>>>> + * 		ports), or to CPUs (for redirecting XDP frames to another CPU;
>>>> + * 		but this is only implemented for native XDP (with driver
>>>> + * 		support) as of this writing).
>>>> + *
>>>> + * 		All values for *flags* are reserved for future usage, and must
>>>> + * 		be left at zero.
>>>> + * 	Return
>>>> + * 		**XDP_REDIRECT** on success, or **XDP_ABORT** on error.
>>>> + *  
>>>
>>> "XDP_ABORT" -> "XDP_ABORTED"  
>>
>> Ouch. And I did the same for bpf_redirect(). Thanks for the catch.
>>
>>>
>>> I don't know if it's worth mentioning in the doc/man-page; that for XDP
>>> using bpf_redirect_map() is a HUGE performance advantage, compared to
>>> the bpf_redirect() call ?  
>>
>> It seems worth to me. How would you simply explain the reason for this
>> difference?
> 
> The basic reason is "bulking effect", as devmap avoids the NIC
> tailptr/doorbell update on every packet... how to write that in a doc
> format?
> 
> I wrote about why XDP_REDIRECT with maps are smart here:
>  http://vger.kernel.org/netconf2017_files/XDP_devel_update_NetConf2017_Seoul.pdf
> 
> Using maps for redirect, hopefully makes XDP_REDIRECT the last driver
> XDP action code we need.  As new types of redirect can be introduced
> without driver changes. See that AF_XDP also uses a map.
> 
> It is more subtle, but maps also function as a sorting step. Imagine
> your XDP program need to redirect out different interfaces (or CPUs in
> cpumap case), and packets arrive intermixed.  Packets get sorted into
> the different map indexes, and the xdp_do_flush_map() will trigger the
> flush operation.
> 
> 
> Happened to have an i40e NIC benchmark setup, and ran a single flow pktgen test.
> 
> Results with 'xdp_redirect_map'
>  13589297 pps (13,589,297) 
> 
> Results with 'xdp_redirect' NOT using devmap:
>   7567575 pps (7,567,575)
> 
> Just to point out the performance benefit of devmap...


Thanks for those details! This is an impressive change in performance
indeed.

I think I will just keep it simple for the documentation. I will add the
following for bpf_redirect_map():

    When used to redirect packets to net devices, this helper
    provides a high performance increase over **bpf_redirect**\ ().
    This is due to various implementation details of the underlying
    mechanisms, one of which is the fact that **bpf_redirect_map**\ ()
    tries to send packet as a "bulk" to the device.

And also append the following to bpf_redirect():

    The same effect can be attained with the more generic
    **bpf_redirect_map**\ (), which requires specific maps
    to be used but offers better performance.

Best,
Quentin
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Daniel Borkmann @ 2018-04-19 12:47 UTC (permalink / raw)
  To: Quentin Monnet, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Yonghong Song, Josef Bacik, Andrey Ignatov
In-Reply-To: <20180417143438.7018-8-quentin.monnet@netronome.com>

On 04/17/2018 04:34 PM, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions:
> 
> Helpers from Lawrence:
> - bpf_setsockopt()
> - bpf_getsockopt()
> - bpf_sock_ops_cb_flags_set()
> 
> Helpers from Yonghong:
> - bpf_perf_event_read_value()
> - bpf_perf_prog_read_value()
> 
> Helper from Josef:
> - bpf_override_return()
> 
> Helper from Andrey:
> - bpf_bind()
> 
> v3:
> - bpf_perf_event_read_value(): Fix time of selection for perf event type
>   in description. Remove occurences of "cores" to avoid confusion with
>   "CPU".
> - bpf_bind(): Remove last paragraph of description, which was off topic.
> 
> Cc: Lawrence Brakmo <brakmo@fb.com>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: Andrey Ignatov <rdna@fb.com>
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> 
> fix patch 7: Yonghong and Andrey
> ---
>  include/uapi/linux/bpf.h | 178 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 178 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index dd79a1c82adf..350459c583de 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1234,6 +1234,28 @@ union bpf_attr {
>   * 	Return
>   * 		0
>   *
> + * int bpf_setsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
> + * 	Description
> + * 		Emulate a call to **setsockopt()** on the socket associated to
> + * 		*bpf_socket*, which must be a full socket. The *level* at
> + * 		which the option resides and the name *optname* of the option
> + * 		must be specified, see **setsockopt(2)** for more information.
> + * 		The option value of length *optlen* is pointed by *optval*.
> + *
> + * 		This helper actually implements a subset of **setsockopt()**.
> + * 		It supports the following *level*\ s:
> + *
> + * 		* **SOL_SOCKET**, which supports the following *optname*\ s:
> + * 		  **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**,
> + * 		  **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**.
> + * 		* **IPPROTO_TCP**, which supports the following *optname*\ s:
> + * 		  **TCP_CONGESTION**, **TCP_BPF_IW**,
> + * 		  **TCP_BPF_SNDCWND_CLAMP**.
> + * 		* **IPPROTO_IP**, which supports *optname* **IP_TOS**.
> + * 		* **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
>   * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
>   * 	Description
>   * 		Grow or shrink the room for data in the packet associated to
> @@ -1281,6 +1303,162 @@ union bpf_attr {
>   * 		performed again.
>   * 	Return
>   * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size)
> + * 	Description
> + * 		Read the value of a perf event counter, and store it into *buf*
> + * 		of size *buf_size*. This helper relies on a *map* of type
> + * 		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf event
> + * 		counter is selected when *map* is updated with perf event file
> + * 		descriptors. The *map* is an array whose size is the number of
> + * 		available CPUs, and each cell contains a value relative to one
> + * 		CPU. The value to retrieve is indicated by *flags*, that
> + * 		contains the index of the CPU to look up, masked with
> + * 		**BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
> + * 		**BPF_F_CURRENT_CPU** to indicate that the value for the
> + * 		current CPU should be retrieved.
> + *
> + * 		This helper behaves in a way close to
> + * 		**bpf_perf_event_read**\ () helper, save that instead of
> + * 		just returning the value observed, it fills the *buf*
> + * 		structure. This allows for additional data to be retrieved: in
> + * 		particular, the enabled and running times (in *buf*\
> + * 		**->enabled** and *buf*\ **->running**, respectively) are
> + * 		copied.

Since you mention bpf_perf_event_read() here, we should mention that
bpf_perf_event_read_value() is recommended over bpf_perf_event_read()
in general. The latter bpf_perf_event_read() has some ABI quirks where
error and counter value are used as a return code (which is obviously
wrong to do since ranges may overlap). bpf_perf_event_read_value()
fixed this but also provides more features at the same time over the
old interface.

> + * 		These values are interesting, because hardware PMU (Performance
> + * 		Monitoring Unit) counters are limited resources. When there are
> + * 		more PMU based perf events opened than available counters,
> + * 		kernel will multiplex these events so each event gets certain
> + * 		percentage (but not all) of the PMU time. In case that
> + * 		multiplexing happens, the number of samples or counter value
> + * 		will not reflect the case compared to when no multiplexing
> + * 		occurs. This makes comparison between different runs difficult.
> + * 		Typically, the counter value should be normalized before
> + * 		comparing to other experiments. The usual normalization is done
> + * 		as follows.
> + *
> + * 		::
> + *
> + * 			normalized_counter = counter * t_enabled / t_running
> + *
> + * 		Where t_enabled is the time enabled for event and t_running is
> + * 		the time running for event since last normalization. The
> + * 		enabled and running times are accumulated since the perf event
> + * 		open. To achieve scaling factor between two invocations of an
> + * 		eBPF program, users can can use CPU id as the key (which is
> + * 		typical for perf array usage model) to remember the previous
> + * 		value and do the calculation inside the eBPF program.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_perf_prog_read_value(struct bpf_perf_event_data_kern *ctx, struct bpf_perf_event_value *buf, u32 buf_size)
> + * 	Description
> + * 		For en eBPF program attached to a perf event, retrieve the
> + * 		value of the event counter associated to *ctx* and store it in
> + * 		the structure pointed by *buf* and of size *buf_size*. Enabled
> + * 		and running times are also stored in the structure (see
> + * 		description of helper **bpf_perf_event_read_value**\ () for
> + * 		more details).

Ditto, mentioning here that bpf_perf_event_read_value() should be used instead.

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_getsockopt(struct bpf_sock_ops_kern *bpf_socket, int level, int optname, char *optval, int optlen)
> + * 	Description
> + * 		Emulate a call to **getsockopt()** on the socket associated to
> + * 		*bpf_socket*, which must be a full socket. The *level* at
> + * 		which the option resides and the name *optname* of the option
> + * 		must be specified, see **getsockopt(2)** for more information.
> + * 		The retrieved value is stored in the structure pointed by
> + * 		*opval* and of length *optlen*.
> + *
> + * 		This helper actually implements a subset of **getsockopt()**.
> + * 		It supports the following *level*\ s:
> + *
> + * 		* **IPPROTO_TCP**, which supports *optname*
> + * 		  **TCP_CONGESTION**.
> + * 		* **IPPROTO_IP**, which supports *optname* **IP_TOS**.
> + * 		* **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
[...]
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH bpf-next v3 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Jesper Dangaard Brouer @ 2018-04-19 12:58 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	John Fastabend, brouer
In-Reply-To: <62931080-6f69-3877-7a84-0df5be5754e6@netronome.com>

On Thu, 19 Apr 2018 13:44:41 +0100
Quentin Monnet <quentin.monnet@netronome.com> wrote:

> 2018-04-18 17:43 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
> > On Wed, 18 Apr 2018 15:09:41 +0100
> > Quentin Monnet <quentin.monnet@netronome.com> wrote:
> >   
> >> 2018-04-18 15:34 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>  
> >>> On Tue, 17 Apr 2018 15:34:38 +0100
> >>> Quentin Monnet <quentin.monnet@netronome.com> wrote:
> >>>     
> >>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>>> index 350459c583de..3d329538498f 100644
> >>>> --- a/include/uapi/linux/bpf.h
> >>>> +++ b/include/uapi/linux/bpf.h
> >>>> @@ -1276,6 +1276,50 @@ union bpf_attr {
> >>>>   * 	Return
> >>>>   * 		0 on success, or a negative error in case of failure.
> >>>>   *
> >>>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
> >>>> + * 	Description
> >>>> + * 		Redirect the packet to the endpoint referenced by *map* at
> >>>> + * 		index *key*. Depending on its type, his *map* can contain    
> >>>                                                     ^^^
> >>>
> >>> "his" -> "this"    
> >>
> >> Thanks!
> >>  
> >>>> + * 		references to net devices (for forwarding packets through other
> >>>> + * 		ports), or to CPUs (for redirecting XDP frames to another CPU;
> >>>> + * 		but this is only implemented for native XDP (with driver
> >>>> + * 		support) as of this writing).
> >>>> + *
> >>>> + * 		All values for *flags* are reserved for future usage, and must
> >>>> + * 		be left at zero.
> >>>> + * 	Return
> >>>> + * 		**XDP_REDIRECT** on success, or **XDP_ABORT** on error.
> >>>> + *    
> >>>
> >>> "XDP_ABORT" -> "XDP_ABORTED"    
> >>
> >> Ouch. And I did the same for bpf_redirect(). Thanks for the catch.
> >>  
> >>>
> >>> I don't know if it's worth mentioning in the doc/man-page; that for XDP
> >>> using bpf_redirect_map() is a HUGE performance advantage, compared to
> >>> the bpf_redirect() call ?    
> >>
> >> It seems worth to me. How would you simply explain the reason for this
> >> difference?  
> > 
> > The basic reason is "bulking effect", as devmap avoids the NIC
> > tailptr/doorbell update on every packet... how to write that in a doc
> > format?
> > 
> > I wrote about why XDP_REDIRECT with maps are smart here:
> >  http://vger.kernel.org/netconf2017_files/XDP_devel_update_NetConf2017_Seoul.pdf
> > 
> > Using maps for redirect, hopefully makes XDP_REDIRECT the last driver
> > XDP action code we need.  As new types of redirect can be introduced
> > without driver changes. See that AF_XDP also uses a map.
> > 
> > It is more subtle, but maps also function as a sorting step. Imagine
> > your XDP program need to redirect out different interfaces (or CPUs in
> > cpumap case), and packets arrive intermixed.  Packets get sorted into
> > the different map indexes, and the xdp_do_flush_map() will trigger the
> > flush operation.
> > 
> > 
> > Happened to have an i40e NIC benchmark setup, and ran a single flow pktgen test.
> > 
> > Results with 'xdp_redirect_map'
> >  13589297 pps (13,589,297) 
> > 
> > Results with 'xdp_redirect' NOT using devmap:
> >   7567575 pps (7,567,575)
> > 
> > Just to point out the performance benefit of devmap...  
> 
> 
> Thanks for those details! This is an impressive change in performance
> indeed.
> 
> I think I will just keep it simple for the documentation. I will add the
> following for bpf_redirect_map():
> 
>     When used to redirect packets to net devices, this helper
>     provides a high performance increase over **bpf_redirect**\ ().
>     This is due to various implementation details of the underlying
>     mechanisms, one of which is the fact that **bpf_redirect_map**\ ()
>     tries to send packet as a "bulk" to the device.
> 
> And also append the following to bpf_redirect():
> 
>     The same effect can be attained with the more generic
>     **bpf_redirect_map**\ (), which requires specific maps
>     to be used but offers better performance.

This sounds good to me! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v7 4/5] cpuset: Restrict load balancing off cpus to subset of cpus.isolated
From: Waiman Long @ 2018-04-19 13:47 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli, Waiman Long
In-Reply-To: <1524145624-23655-1-git-send-email-longman@redhat.com>

With the addition of "cpuset.cpus.isolated", it makes sense to add the
restriction that load balancing can only be turned off if the CPUs in
the isolated cpuset are subset of "cpuset.cpus.isolated".

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt |  7 ++++---
 kernel/cgroup/cpuset.c      | 29 ++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 8d89dc2..c4227ee 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1554,9 +1554,10 @@ Cpuset Interface Files
 	and will not be moved to other CPUs.
 
 	This flag is hierarchical and is inherited by child cpusets. It
-	can be turned off only when the CPUs in this cpuset aren't
-	listed in the cpuset.cpus of other sibling cgroups, and all
-	the child cpusets, if present, have this flag turned off.
+	can be explicitly turned off only when it is a direct child of
+	the root cgroup and the CPUs in this cpuset are subset of the
+	root's "cpuset.cpus.isolated".	Moreover, the CPUs cannot be
+	listed in the "cpuset.cpus" of other sibling cgroups.
 
 	Once it is off, it cannot be turned back on as long as the
 	parent cgroup still has this flag in the off state.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c746b18..d05c4c8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -511,6 +511,16 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
 	par = parent_cs(cur);
 
+	/*
+	 * On default hierarchy with sched_load_balance flag off, the cpu
+	 * list must be a subset of the parent's isolated CPU list, if
+	 * defined (root).
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+	    !is_sched_load_balance(trial) && par->isolation_count &&
+	    !cpumask_subset(trial->cpus_allowed, par->isolated_cpus))
+		goto out;
+
 	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
 	ret = -EACCES;
 	if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
@@ -1431,10 +1441,16 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	balance_flag_changed = (is_sched_load_balance(cs) !=
+				is_sched_load_balance(trialcs));
+
 	/*
 	 * On default hierarchy, turning off sched_load_balance flag implies
 	 * an implicit cpu_exclusive. Turning on sched_load_balance will
 	 * clear the cpu_exclusive flag.
+	 *
+	 * sched_load_balance can only be turned off if all the CPUs are
+	 * in the parent's isolated CPU list.
 	 */
 	if ((bit == CS_SCHED_LOAD_BALANCE) &&
 	    cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
@@ -1442,15 +1458,22 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
 		else
 			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+
+		if (balance_flag_changed && !turning_on) {
+			struct cpuset *parent = parent_cs(cs);
+
+			err = -EBUSY;
+			if (!parent->isolation_count ||
+			    !cpumask_subset(trialcs->cpus_allowed,
+					    parent->cpus_allowed))
+				goto out;
+		}
 	}
 
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
 
-	balance_flag_changed = (is_sched_load_balance(cs) !=
-				is_sched_load_balance(trialcs));
-
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v7 3/5] cpuset: Add a root-only cpus.isolated v2 control file
From: Waiman Long @ 2018-04-19 13:47 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli, Waiman Long
In-Reply-To: <1524145624-23655-1-git-send-email-longman@redhat.com>

In order to better support CPU isolation as well as multiple root
domains for deadline scheduling, the ability to carve out a set of CPUs
specifically for isolation or for another root domain will be useful.

A new root-only "cpuset.cpus.isolated" control file is now added for
holding the list of CPUs that will not be participating in load balancing
within the root cpuset. The root's effective cpu list will not contain
any CPUs that are in "cpuset.cpus.isolated" file.  These isolated CPUs,
however, can still be put into child cpusets and load balanced within
them if necessary.

For CPU isolation, putting the CPUs into this new control file and not
having them in any of the child cpusets should be enough. Those isolated
CPUs can also be put into a child cpuset with load balancing disabled
for finer-grained control.

For creating additional root domains for scheduling, a child cpuset
should only select an exclusive set of CPUs within the isolated set.

The "cpuset.cpus.isolated" control file should be set up before
any child cpusets are created. If child cpusets are present, changes
to this control file will not be allowed if any CPUs that will change
state are in any of the child cpusets.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt |  25 ++++++++++
 kernel/cgroup/cpuset.c      | 119 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 143 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c970bd7..8d89dc2 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1484,6 +1484,31 @@ Cpuset Interface Files
 	a subset of "cpuset.cpus".  Its value will be affected by CPU
 	hotplug events.
 
+  cpuset.cpus.isolated
+	A read-write multiple values file which exists on root cgroup
+	only.
+
+	It lists the CPUs that have been withdrawn from the root cgroup
+	for load balancing.  These CPUs can still be allocated to child
+	cpusets with load balancing enabled, if necessary.
+
+	If a child cpuset contains only an exclusive set of CPUs that are
+	a subset of the isolated CPUs and with load balancing enabled,
+	these CPUs will be load balanced on a separate root domain from
+	the one in the root cgroup.
+
+	Just putting the CPUs into "cpuset.cpus.isolated" will be
+	enough to disable load balancing on those CPUs as long as they
+	do not appear in a child cpuset with load balancing enabled.
+	Fine-grained control of cpu isolation can also be done by
+	putting these isolated CPUs into child cpusets with load
+	balancing disabled.
+
+	The "cpuset.cpus.isolated" should be set up before child
+	cpusets are created.  Once child cpusets are present, changes
+	to "cpuset.cpus.isolated" will not be allowed if the CPUs that
+	change their states are in any of the child cpusets.
+
   cpuset.mems
 	A read-write multiple values file which exists on non-root
 	cgroups.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 50c9254..c746b18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -109,6 +109,9 @@ struct cpuset {
 	cpumask_var_t effective_cpus;
 	nodemask_t effective_mems;
 
+	/* Isolated CPUs - root cpuset only */
+	cpumask_var_t isolated_cpus;
+
 	/*
 	 * This is old Memory Nodes tasks took on.
 	 *
@@ -134,6 +137,9 @@ struct cpuset {
 
 	/* for custom sched domain */
 	int relax_domain_level;
+
+	/* for isolated_cpus */
+	int isolation_count;
 };
 
 static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
@@ -909,7 +915,19 @@ static void update_cpumasks_hier(struct cpuset *cs, struct cpumask *new_cpus)
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
 
-		cpumask_and(new_cpus, cp->cpus_allowed, parent->effective_cpus);
+		/*
+		 * If parent has isolated CPUs, include them in the list
+		 * of allowable CPUs.
+		 */
+		if (parent->isolation_count) {
+			cpumask_or(new_cpus, parent->effective_cpus,
+				   parent->isolated_cpus);
+			cpumask_and(new_cpus, new_cpus, cpu_online_mask);
+			cpumask_and(new_cpus, new_cpus, cp->cpus_allowed);
+		} else {
+			cpumask_and(new_cpus, cp->cpus_allowed,
+				    parent->effective_cpus);
+		}
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
@@ -1004,6 +1022,85 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	return 0;
 }
 
+/**
+ * update_isolated_cpumask - update the isolated_cpus mask of the top cpuset
+ * @buf: buffer of cpu numbers written to this cpuset
+ *
+ * Changes to the isolated CPUs are not allowed if any of CPUs changing
+ * state are in any of the child cpusets. Called with cpuset_mutex held.
+ */
+static int update_isolated_cpumask(const char *buf)
+{
+	int retval;
+	int adding, deleting;
+	cpumask_var_t addmask, delmask;
+	struct cpuset *child;
+	struct cgroup_subsys_state *pos_css;
+
+	if (!alloc_cpumask_var(&addmask, GFP_KERNEL))
+		return -ENOMEM;
+	if (!alloc_cpumask_var(&delmask, GFP_KERNEL)) {
+		free_cpumask_var(addmask);
+		return -ENOMEM;
+	}
+	retval = cpulist_parse(buf, addmask);
+	if (retval)
+		goto out;
+
+	retval = -EINVAL;
+	if (!cpumask_subset(addmask, top_cpuset.cpus_allowed))
+		goto out;
+
+	retval = -EBUSY;
+	deleting = cpumask_andnot(delmask, top_cpuset.isolated_cpus, addmask);
+	adding   = cpumask_andnot(addmask, addmask, top_cpuset.isolated_cpus);
+
+	if (!adding && !deleting)
+		goto out_ok;
+
+	/*
+	 * Check if any CPUs in addmask or delmask are in a child cpuset.
+	 * An empty child cpus_allowed means it is the same as parent's
+	 * effective_cpus.
+	 */
+	cpuset_for_each_child(child, pos_css, &top_cpuset) {
+		if (cpumask_empty(child->cpus_allowed))
+			goto out;
+		if (adding && cpumask_intersects(child->cpus_allowed, addmask))
+			goto out;
+		if (deleting &&
+		    cpumask_intersects(child->cpus_allowed, delmask))
+			goto out;
+	}
+
+	/*
+	 * Change the isolated CPU list.
+	 * Newly added isolated CPUs will be removed from effective_cpus
+	 * and newly deleted ones will be added back if they are online.
+	 */
+	spin_lock_irq(&callback_lock);
+	if (adding)
+		cpumask_or(top_cpuset.isolated_cpus,
+			   top_cpuset.isolated_cpus, addmask);
+
+	if (deleting)
+		cpumask_andnot(top_cpuset.isolated_cpus,
+			       top_cpuset.isolated_cpus, delmask);
+
+	cpumask_andnot(top_cpuset.effective_cpus, cpu_online_mask,
+		       top_cpuset.isolated_cpus);
+
+	top_cpuset.isolation_count = cpumask_weight(top_cpuset.isolated_cpus);
+	spin_unlock_irq(&callback_lock);
+
+out_ok:
+	retval = 0;
+out:
+	free_cpumask_var(addmask);
+	free_cpumask_var(delmask);
+	return retval;
+}
+
 /*
  * Migrate memory region from one set of nodes to another.  This is
  * performed asynchronously as it can be called from process migration path
@@ -1612,6 +1709,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	FILE_MEMLIST,
 	FILE_EFFECTIVE_CPULIST,
 	FILE_EFFECTIVE_MEMLIST,
+	FILE_ISOLATED_CPULIST,
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_MEM_HARDWALL,
@@ -1733,6 +1831,12 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
+	if (of_cft(of)->private == FILE_ISOLATED_CPULIST) {
+		WARN_ON_ONCE(cs != &top_cpuset);
+		retval = update_isolated_cpumask(buf);
+		goto out_unlock;
+	}
+
 	trialcs = alloc_trial_cpuset(cs);
 	if (!trialcs) {
 		retval = -ENOMEM;
@@ -1789,6 +1893,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, void *v)
 	case FILE_EFFECTIVE_MEMLIST:
 		seq_printf(sf, "%*pbl\n", nodemask_pr_args(&cs->effective_mems));
 		break;
+	case FILE_ISOLATED_CPULIST:	/* Root only */
+		seq_printf(sf, "%*pbl\n", cpumask_pr_args(cs->isolated_cpus));
+		break;
 	default:
 		ret = -EINVAL;
 	}
@@ -1994,6 +2101,15 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "cpus.isolated",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_ISOLATED_CPULIST,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -2204,6 +2320,7 @@ int __init cpuset_init(void)
 
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
+	BUG_ON(!zalloc_cpumask_var(&top_cpuset.isolated_cpus, GFP_KERNEL));
 
 	cpumask_setall(top_cpuset.cpus_allowed);
 	nodes_setall(top_cpuset.mems_allowed);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v7 1/5] cpuset: Enable cpuset controller in default hierarchy
From: Waiman Long @ 2018-04-19 13:47 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli, Waiman Long
In-Reply-To: <1524145624-23655-1-git-send-email-longman@redhat.com>

Given the fact that thread mode had been merged into 4.14, it is now
time to enable cpuset to be used in the default hierarchy (cgroup v2)
as it is clearly threaded.

The cpuset controller had experienced feature creep since its
introduction more than a decade ago. Besides the core cpus and mems
control files to limit cpus and memory nodes, there are a bunch of
additional features that can be controlled from the userspace. Some of
the features are of doubtful usefulness and may not be actively used.

This patch enables cpuset controller in the default hierarchy with
a minimal set of features, namely just the cpus and mems and their
effective_* counterparts.  We can certainly add more features to the
default hierarchy in the future if there is a real user need for them
later on.

Alternatively, with the unified hiearachy, it may make more sense
to move some of those additional cpuset features, if desired, to
memory controller or may be to the cpu controller instead of staying
with cpuset.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/cgroup/cpuset.c      | 48 ++++++++++++++++++++++--
 2 files changed, 130 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeae..ed8ec66 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
        5-3-2. Writeback
      5-4. PID
        5-4-1. PID Interface Files
-     5-5. Device
-     5-6. RDMA
-       5-6-1. RDMA Interface Files
-     5-7. Misc
-       5-7-1. perf_event
+     5-5. Cpuset
+       5.5-1. Cpuset Interface Files
+     5-6. Device
+     5-7. RDMA
+       5-7-1. RDMA Interface Files
+     5-8. Misc
+       5-8-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
 of a new process would cause a cgroup policy to be violated.
 
 
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical.  That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+  cpuset.cpus
+	A read-write multiple values file which exists on non-root
+	cgroups.
+
+	It lists the CPUs allowed to be used by tasks within this
+	cgroup.  The CPU numbers are comma-separated numbers or
+	ranges.  For example:
+
+	  # cat cpuset.cpus
+	  0-4,6,8-10
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.cpus" or all the available CPUs if none is found.
+
+	The value of "cpuset.cpus" stays constant until the next update
+	and won't be affected by any CPU hotplug events.
+
+  cpuset.cpus.effective
+	A read-only multiple values file which exists on non-root
+	cgroups.
+
+	It lists the onlined CPUs that are actually allowed to be
+	used by tasks within the current cgroup.  If "cpuset.cpus"
+	is empty, it shows all the CPUs from the parent cgroup that
+	will be available to be used by this cgroup.  Otherwise, it is
+	a subset of "cpuset.cpus".  Its value will be affected by CPU
+	hotplug events.
+
+  cpuset.mems
+	A read-write multiple values file which exists on non-root
+	cgroups.
+
+	It lists the memory nodes allowed to be used by tasks within
+	this cgroup.  The memory node numbers are comma-separated
+	numbers or ranges.  For example:
+
+	  # cat cpuset.mems
+	  0-1,3
+
+	An empty value indicates that the cgroup is using the same
+	setting as the nearest cgroup ancestor with a non-empty
+	"cpuset.mems" or all the available memory nodes if none
+	is found.
+
+	The value of "cpuset.mems" stays constant until the next update
+	and won't be affected by any memory nodes hotplug events.
+
+  cpuset.mems.effective
+	A read-only multiple values file which exists on non-root
+	cgroups.
+
+	It lists the onlined memory nodes that are actually allowed to
+	be used by tasks within the current cgroup.  If "cpuset.mems"
+	is empty, it shows all the memory nodes from the parent cgroup
+	that will be available to be used by this cgroup.  Otherwise,
+	it is a subset of "cpuset.mems".  Its value will be affected
+	by memory nodes hotplug events.
+
+
 Device controller
 -----------------
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b42037e..419b758 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	return 0;
 }
 
-
 /*
  * for the common functions, 'private' gives the type of file
  */
 
-static struct cftype files[] = {
+static struct cftype legacy_files[] = {
 	{
 		.name = "cpus",
 		.seq_show = cpuset_common_seq_show,
@@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 };
 
 /*
+ * This is currently a minimal set for the default hierarchy. It can be
+ * expanded later on by migrating more features and control files from v1.
+ */
+static struct cftype dfl_files[] = {
+	{
+		.name = "cpus",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * NR_CPUS),
+		.private = FILE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems",
+		.seq_show = cpuset_common_seq_show,
+		.write = cpuset_write_resmask,
+		.max_write_len = (100U + 6 * MAX_NUMNODES),
+		.private = FILE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "cpus.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_CPULIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{
+		.name = "mems.effective",
+		.seq_show = cpuset_common_seq_show,
+		.private = FILE_EFFECTIVE_MEMLIST,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
+	{ }	/* terminate */
+};
+
+
+/*
  *	cpuset_css_alloc - allocate a cpuset css
  *	cgrp:	control group that the new cpuset will be part of
  */
@@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 	.post_attach	= cpuset_post_attach,
 	.bind		= cpuset_bind,
 	.fork		= cpuset_fork,
-	.legacy_cftypes	= files,
+	.legacy_cftypes	= legacy_files,
+	.dfl_cftypes	= dfl_files,
 	.early_init	= true,
+	.threaded	= true,
 };
 
 /**
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v7 2/5] cpuset: Add cpuset.sched_load_balance to v2
From: Waiman Long @ 2018-04-19 13:47 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli, Waiman Long
In-Reply-To: <1524145624-23655-1-git-send-email-longman@redhat.com>

The sched_load_balance flag is needed to enable CPU isolation similar
to what can be done with the "isolcpus" kernel boot parameter.

The sched_load_balance flag implies an implicit !cpu_exclusive as
it doesn't make sense to have an isolated CPU being load-balanced in
another cpuset.

For v2, this flag is hierarchical and is inherited by child cpusets. It
is not allowed to have this flag turn off in a parent cpuset, but on
in a child cpuset.

This flag is set by the parent and is not delegatable.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/cgroup-v2.txt | 22 ++++++++++++++++++
 kernel/cgroup/cpuset.c      | 56 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ed8ec66..c970bd7 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1514,6 +1514,28 @@ Cpuset Interface Files
 	it is a subset of "cpuset.mems".  Its value will be affected
 	by memory nodes hotplug events.
 
+  cpuset.sched_load_balance
+	A read-write single value file which exists on non-root cgroups.
+	The default is "1" (on), and the other possible value is "0"
+	(off).
+
+	When it is on, tasks within this cpuset will be load-balanced
+	by the kernel scheduler.  Tasks will be moved from CPUs with
+	high load to other CPUs within the same cpuset with less load
+	periodically.
+
+	When it is off, there will be no load balancing among CPUs on
+	this cgroup.  Tasks will stay in the CPUs they are running on
+	and will not be moved to other CPUs.
+
+	This flag is hierarchical and is inherited by child cpusets. It
+	can be turned off only when the CPUs in this cpuset aren't
+	listed in the cpuset.cpus of other sibling cgroups, and all
+	the child cpusets, if present, have this flag turned off.
+
+	Once it is off, it cannot be turned back on as long as the
+	parent cgroup still has this flag in the off state.
+
 
 Device controller
 -----------------
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 419b758..50c9254 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -407,15 +407,22 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set.  Call holding cpuset_mutex.
+ * are only set if the other's are set (on legacy hierarchy) or
+ * its sched_load_balance flag is only set if the other is set
+ * (on default hierarchy).  Caller holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
 {
-	return	cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
-		nodes_subset(p->mems_allowed, q->mems_allowed) &&
-		is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
-		is_mem_exclusive(p) <= is_mem_exclusive(q);
+	if (!cpumask_subset(p->cpus_allowed, q->cpus_allowed) ||
+	    !nodes_subset(p->mems_allowed, q->mems_allowed))
+		return false;
+
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys))
+		return is_sched_load_balance(p) <= is_sched_load_balance(q);
+	else
+		return is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
+		       is_mem_exclusive(p) <= is_mem_exclusive(q);
 }
 
 /**
@@ -498,7 +505,7 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 
 	par = parent_cs(cur);
 
-	/* On legacy hiearchy, we must be a subset of our parent cpuset. */
+	/* On legacy hierarchy, we must be a subset of our parent cpuset. */
 	ret = -EACCES;
 	if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
 		goto out;
@@ -1327,6 +1334,19 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	else
 		clear_bit(bit, &trialcs->flags);
 
+	/*
+	 * On default hierarchy, turning off sched_load_balance flag implies
+	 * an implicit cpu_exclusive. Turning on sched_load_balance will
+	 * clear the cpu_exclusive flag.
+	 */
+	if ((bit == CS_SCHED_LOAD_BALANCE) &&
+	    cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		if (turning_on)
+			clear_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+		else
+			set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
+	}
+
 	err = validate_change(cs, trialcs);
 	if (err < 0)
 		goto out;
@@ -1966,6 +1986,14 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
 
+	{
+		.name = "sched_load_balance",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SCHED_LOAD_BALANCE,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
+
 	{ }	/* terminate */
 };
 
@@ -1991,7 +2019,21 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
 	if (!alloc_cpumask_var(&cs->effective_cpus, GFP_KERNEL))
 		goto free_cpus;
 
-	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	/*
+	 * On default hierarchy, inherit parent's CS_SCHED_LOAD_BALANCE and
+	 * CS_CPU_EXCLUSIVE flag.
+	 */
+	if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+		struct cpuset *parent = css_cs(parent_css);
+
+		if (test_bit(CS_SCHED_LOAD_BALANCE, &parent->flags))
+			set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+		else
+			set_bit(CS_CPU_EXCLUSIVE, &cs->flags);
+	} else {
+		set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	}
+
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
 	cpumask_clear(cs->effective_cpus);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v7 5/5] cpuset: Make generate_sched_domains() recognize isolated_cpus
From: Waiman Long @ 2018-04-19 13:47 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli, Waiman Long
In-Reply-To: <1524145624-23655-1-git-send-email-longman@redhat.com>

The generate_sched_domains() function and the hotplug code are modified
to make them use the newly introduced isolated_cpus mask for schedule
domains generation.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/cgroup/cpuset.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d05c4c8..a67c77a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -683,13 +683,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	int ndoms = 0;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] struct cpumask slot */
 	struct cgroup_subsys_state *pos_css;
+	bool root_load_balance = is_sched_load_balance(&top_cpuset);
 
 	doms = NULL;
 	dattr = NULL;
 	csa = NULL;
 
 	/* Special case for the 99% of systems with one, full, sched domain */
-	if (is_sched_load_balance(&top_cpuset)) {
+	if (root_load_balance && !top_cpuset.isolation_count) {
 		ndoms = 1;
 		doms = alloc_sched_domains(ndoms);
 		if (!doms)
@@ -712,6 +713,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	csn = 0;
 
 	rcu_read_lock();
+	if (root_load_balance)
+		csa[csn++] = &top_cpuset;
 	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
 		if (cp == &top_cpuset)
 			continue;
@@ -722,6 +725,9 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		 * parent's cpus, so just skip them, and then we call
 		 * update_domain_attr_tree() to calc relax_domain_level of
 		 * the corresponding sched domain.
+		 *
+		 * If root is load-balancing, we can skip @cp if it
+		 * is a subset of the root's effective_cpus.
 		 */
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    !(is_sched_load_balance(cp) &&
@@ -729,6 +735,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
 					 housekeeping_cpumask(HK_FLAG_DOMAIN))))
 			continue;
 
+		if (root_load_balance &&
+		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
+			continue;
+
 		if (is_sched_load_balance(cp))
 			csa[csn++] = cp;
 
@@ -820,6 +830,12 @@ static int generate_sched_domains(cpumask_var_t **domains,
 	}
 	BUG_ON(nslot != ndoms);
 
+#ifdef CONFIG_DEBUG_KERNEL
+	for (i = 0; i < ndoms; i++)
+		pr_info("rebuild_sched_domains dom %d: %*pbl\n", i,
+			cpumask_pr_args(doms[i]));
+#endif
+
 done:
 	kfree(csa);
 
@@ -860,7 +876,12 @@ static void rebuild_sched_domains_locked(void)
 	 * passing doms with offlined cpu to partition_sched_domains().
 	 * Anyways, hotplug work item will rebuild sched domains.
 	 */
-	if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+	if (!top_cpuset.isolation_count &&
+	    !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
+		goto out;
+
+	if (top_cpuset.isolation_count &&
+	   !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
 		goto out;
 
 	/* Generate domain masks and attrs */
@@ -1102,6 +1123,7 @@ static int update_isolated_cpumask(const char *buf)
 
 	top_cpuset.isolation_count = cpumask_weight(top_cpuset.isolated_cpus);
 	spin_unlock_irq(&callback_lock);
+	rebuild_sched_domains_locked();
 
 out_ok:
 	retval = 0;
@@ -2530,6 +2552,11 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	cpumask_copy(&new_cpus, cpu_active_mask);
 	new_mems = node_states[N_MEMORY];
 
+	/*
+	 * If isolated_cpus is populated, it is likely that the check below
+	 * will produce a false positive on cpus_updated when the cpu list
+	 * isn't changed. It is extra work, but it is better to be safe.
+	 */
 	cpus_updated = !cpumask_equal(top_cpuset.effective_cpus, &new_cpus);
 	mems_updated = !nodes_equal(top_cpuset.effective_mems, new_mems);
 
@@ -2538,6 +2565,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		spin_lock_irq(&callback_lock);
 		if (!on_dfl)
 			cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
+
+		if (top_cpuset.isolation_count)
+			cpumask_andnot(&new_cpus, &new_cpus,
+					top_cpuset.isolated_cpus);
 		cpumask_copy(top_cpuset.effective_cpus, &new_cpus);
 		spin_unlock_irq(&callback_lock);
 		/* we don't mess with cpumasks of tasks in top_cpuset */
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v7 0/5] cpuset: Enable cpuset controller in default hierarchy
From: Waiman Long @ 2018-04-19 13:46 UTC (permalink / raw)
  To: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar
  Cc: cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
	Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli, Waiman Long

v7:
 - Add a root-only cpuset.cpus.isolated control file for CPU isolation.
 - Enforce that load_balancing can only be turned off on cpusets with
   CPUs from the isolated list.
 - Update sched domain generation to allow cpusets with CPUs only
   from the isolated CPU list to be in separate root domains.

v6:
 - Hide cpuset control knobs in root cgroup.
 - Rename effective_cpus and effective_mems to cpus.effective and
   mems.effective respectively.
 - Remove cpuset.flags and add cpuset.sched_load_balance instead
   as the behavior of sched_load_balance has changed and so is
   not a simple flag.
 - Update cgroup-v2.txt accordingly.

v5:
 - Add patch 2 to provide the cpuset.flags control knob for the
   sched_load_balance flag which should be the only feature that is
   essential as a replacement of the "isolcpus" kernel boot parameter.

v4:
 - Further minimize the feature set by removing the flags control knob.

v3:
 - Further trim the additional features down to just memory_migrate.
 - Update Documentation/cgroup-v2.txt.

v6 patch: https://lkml.org/lkml/2018/3/21/530

The purpose of this patchset is to provide a basic set of cpuset
features for cgroup v2. This basic set includes the non-root "cpus",
"mems", "cpus.effective" and "mems.effective", "sched_load_balance"
control file as well as a root-only "cpus.isolated".

The root-only "cpus.isolated" file is added to support use cases similar
to the "isolcpus" kernel parameter. CPUs from the isolated list can be
put into child cpusets where "sched_load_balance" can be disabled to
allow finer control of task-cpu mappings of those isolated CPUs.

On the other hand, enabling the "sched_load_balance" on a cpuset with
only CPUs from the isolated list will allow those CPUs to use a separate
root domain from that of the root cpuset.

This patchset does not exclude the possibility of adding more features
in the future after careful consideration.

Patch 1 enables cpuset in cgroup v2 with cpus, mems and their
effective counterparts.

Patch 2 adds sched_load_balance whose behavior changes in v2 to become
hierarchical and includes an implicit !cpu_exclusive.

Patch 3 adds a new root-only "cpuset.cpus.isolated" control file for
CPU isolation purpose.

Patch 4 adds the limitation that "sched_load_balance" can only be turned
off in a cpuset if all the CPUs in the cpuset are already in the root's
"cpuset.cpus.isolated".

Patch 5 modifies the sched domain generation code to generate separate root
sched domains if all the CPUs in a cpuset comes from "cpuset.cpus.isolated".

In other words, all the CPUs that need to be isolated or in separate
root domains have to be put into the "cpuset.cpus.isolated" first. Then
child cpusets can be created to partition those isolated CPUs into
either separate root domains with "sched_load_balance" on or really
isolated CPUs with "sched_load_balance" off. Load balancing cannot
be turned off at root.

Waiman Long (5):
  cpuset: Enable cpuset controller in default hierarchy
  cpuset: Add cpuset.sched_load_balance to v2
  cpuset: Add a root-only cpus.isolated v2 control file
  cpuset: Restrict load balancing off cpus to subset of cpus.isolated
  cpuset: Make generate_sched_domains() recognize isolated_cpus

 Documentation/cgroup-v2.txt | 138 ++++++++++++++++++++-
 kernel/cgroup/cpuset.c      | 287 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 404 insertions(+), 21 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] atomic_ops.rst: Fix wrong example code
From: Paul E. McKenney @ 2018-04-19 13:55 UTC (permalink / raw)
  To: SeongJae Park; +Cc: corbet, linux-kernel, linux-doc
In-Reply-To: <20180419084245.17096-1-sj38.park@gmail.com>

On Thu, Apr 19, 2018 at 05:42:44PM +0900, SeongJae Park wrote:
> Example code snippets for necessary of READ_ONCE() and WRITE_ONCE() has
> an unnecessary line of code and wrong condition.  This commit fixes
> them.
> 
> Signed-off-by: SeongJae Park <sj38.park@gmail.com>

Good catch!!!  I queued and pushed both patches for further review,
thank you!

						Thanx, Paul

> ---
>  Documentation/core-api/atomic_ops.rst | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst
> index fce929144ccd..4ea4af71e68a 100644
> --- a/Documentation/core-api/atomic_ops.rst
> +++ b/Documentation/core-api/atomic_ops.rst
> @@ -111,7 +111,6 @@ If the compiler can prove that do_something() does not store to the
>  variable a, then the compiler is within its rights transforming this to
>  the following::
> 
> -	tmp = a;
>  	if (a > 0)
>  		for (;;)
>  			do_something();
> @@ -119,7 +118,7 @@ the following::
>  If you don't want the compiler to do this (and you probably don't), then
>  you should use something like the following::
> 
> -	while (READ_ONCE(a) < 0)
> +	while (READ_ONCE(a) > 0)
>  		do_something();
> 
>  Alternatively, you could place a barrier() call in the loop.
> -- 
> 2.13.0
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 03/10] drivers/peci: Add support for PECI bus driver core
From: kbuild test robot @ 2018-04-19 18:59 UTC (permalink / raw)
  To: Jae Hyun Yoo
  Cc: kbuild-all, Alan Cox, Andrew Jeffery, Andrew Lunn,
	Andy Shevchenko, Arnd Bergmann, Benjamin Herrenschmidt,
	Fengguang Wu, Greg KH, Guenter Roeck, Haiyue Wang, James Feist,
	Jason M Biils, Jean Delvare, Joel Stanley, Julia Cartwright,
	Miguel Ojeda, Milton Miller II, Pavel Machek, Randy Dunlap,
	Stef van Os, Sumeet R Pawnikar, Vernon Mauery, linux-kernel,
	linux-doc, devicetree, linux-hwmon, linux-arm-kernel, openbmc,
	Jae Hyun Yoo
In-Reply-To: <20180410183212.16787-4-jae.hyun.yoo@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 2138 bytes --]

Hi Jae,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc1 next-20180419]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jae-Hyun-Yoo/PECI-device-driver-introduction/20180411-180018
config: x86_64-randconfig-s0-04192349 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers//peci/peci-core.c: In function 'peci_device_match':
>> drivers//peci/peci-core.c:739:6: error: implicit declaration of function 'peci_of_match_device' [-Werror=implicit-function-declaration]
     if (peci_of_match_device(drv->of_match_table, client))
         ^~~~~~~~~~~~~~~~~~~~
   At top level:
   drivers//peci/peci-core.c:840:28: warning: 'peci_new_device' defined but not used [-Wunused-function]
    static struct peci_client *peci_new_device(struct peci_adapter *adapter,
                               ^~~~~~~~~~~~~~~
   drivers//peci/peci-core.c:135:29: warning: 'peci_verify_adapter' defined but not used [-Wunused-function]
    static struct peci_adapter *peci_verify_adapter(struct device *dev)
                                ^~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/peci_of_match_device +739 drivers//peci/peci-core.c

   732	
   733	static int peci_device_match(struct device *dev, struct device_driver *drv)
   734	{
   735		struct peci_client *client = peci_verify_client(dev);
   736		struct peci_driver *driver;
   737	
   738		/* Attempt an OF style match */
 > 739		if (peci_of_match_device(drv->of_match_table, client))
   740			return 1;
   741	
   742		driver = to_peci_driver(drv);
   743	
   744		if (peci_match_id(driver->id_table, client))
   745			return 1;
   746	
   747		return 0;
   748	}
   749	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26598 bytes --]

^ permalink raw reply

* [PATCH v2] proc/stat: Separate out individual irq counts into /proc/stat_irqs
From: Waiman Long @ 2018-04-19 19:36 UTC (permalink / raw)
  To: Andrew Morton, Jonathan Corbet, Kate Stewart, Thomas Gleixner,
	Greg Kroah-Hartman
  Cc: linux-kernel, linux-doc, Randy Dunlap, Alexey Dobriyan,
	Waiman Long

It was found that reading /proc/stat could be time consuming on
systems with a lot of irqs. For example, reading /proc/stat in a
certain 2-socket Skylake server took about 4.6ms because it had over
5k irqs. In that particular case, the majority of the CPU cycles for
reading /proc/stat was spent in the kstat_irqs() function.  Therefore,
application performance can be impacted if the application reads
/proc/stat rather frequently.

The "intr" line within /proc/stat contains a sum total of all the
irqs that have been serviced followed by a list of irq counts for
each individual irq number. In many cases, the first number is good
enough. The individual irq counts may not provide that much more
information.

In order to avoid this kind of performance issue, all these individual
irq counts are now separated into a new /proc/stat_irqs file. The
sum total irq count will stay in /proc/stat and be duplicated in
/proc/stat_irqs. Applications that need to look up individual irq counts
will now have to look into /proc/stat_irqs instead of /proc/stat.

v2: Update Documentation/filesystems/proc.txt accordingly.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/filesystems/proc.txt | 22 ++++++++++++-----
 fs/proc/stat.c                     | 48 ++++++++++++++++++++++++++++++++------
 2 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 2a84bb3..15558ff 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1300,7 +1300,7 @@ since the system first booted.  For a quick look, simply cat the file:
   cpu  2255 34 2290 22625563 6290 127 456 0 0 0
   cpu0 1132 34 1441 11311718 3675 127 438 0 0 0
   cpu1 1123 0 849 11313845 2614 0 18 0 0 0
-  intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+  intr 114930548
   ctxt 1990473
   btime 1062191376
   processes 2915
@@ -1333,11 +1333,10 @@ second).  The meanings of the columns are as follows, from left to right:
 - guest: running a normal guest
 - guest_nice: running a niced guest
 
-The "intr" line gives counts of interrupts  serviced since boot time, for each
-of the  possible system interrupts.   The first  column  is the  total of  all
-interrupts serviced  including  unnumbered  architecture specific  interrupts;
-each  subsequent column is the  total for that particular numbered interrupt.
-Unnumbered interrupts are not shown, only summed into the total.
+The "intr" line gives the total of all interrupts including unnumbered
+architecture specific interrupts serviced since boot time.  To see the
+number of interrupts serviced for a particular numbered interrupt,
+the /proc/stat_irqs file should be used instead.
 
 The "ctxt" line gives the total number of context switches across all CPUs.
 
@@ -1359,6 +1358,17 @@ of the possible system softirqs. The first column is the total of all
 softirqs serviced; each subsequent column is the total for that particular
 softirq.
 
+To see the number of interrupts serviced for each of the numbered
+interrupts, the /proc/stat_irqs file can be viewed.
+
+  > cat /proc/stat_irqs
+  intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
+
+The "intr" line gives counts of interrupts  serviced since boot time, for each
+of the  possible system interrupts.   The first  column  is the  total of  all
+interrupts serviced  including  unnumbered  architecture specific  interrupts;
+each  subsequent column is the  total for that particular numbered interrupt.
+Unnumbered interrupts are not shown, only summed into the total.
 
 1.9 Ext4 file system parameters
 -------------------------------
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 59749df..79e3c03 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -155,11 +155,6 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_putc(p, '\n');
 	}
 	seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
-
 	seq_printf(p,
 		"\nctxt %llu\n"
 		"btime %llu\n"
@@ -181,15 +176,46 @@ static int show_stat(struct seq_file *p, void *v)
 	return 0;
 }
 
+/*
+ * Showing individual irq counts can be expensive if there are a lot of
+ * irqs. So it is done in a separate procfs file to reduce performance
+ * overhead of reading other statistical counts.
+ */
+static int show_stat_irqs(struct seq_file *p, void *v)
+{
+	int i, j;
+	u64 sum = 0;
+
+	for_each_possible_cpu(i) {
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+	}
+	sum += arch_irq_stat();
+
+	seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
+
+	for_each_irq_nr(j)
+		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
+
+	seq_putc(p, '\n');
+
+	return 0;
+}
+
 static int stat_open(struct inode *inode, struct file *file)
 {
 	size_t size = 1024 + 128 * num_online_cpus();
 
-	/* minimum size to display an interrupt count : 2 bytes */
-	size += 2 * nr_irqs;
 	return single_open_size(file, show_stat, NULL, size);
 }
 
+static int stat_irqs_open(struct inode *inode, struct file *file)
+{
+	size_t size = 1024 + 16 * nr_irqs;
+
+	return single_open_size(file, show_stat_irqs, NULL, size);
+}
+
 static const struct file_operations proc_stat_operations = {
 	.open		= stat_open,
 	.read		= seq_read,
@@ -197,9 +223,17 @@ static int stat_open(struct inode *inode, struct file *file)
 	.release	= single_release,
 };
 
+static const struct file_operations proc_stat_irqs_operations = {
+	.open		= stat_irqs_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 static int __init proc_stat_init(void)
 {
 	proc_create("stat", 0, NULL, &proc_stat_operations);
+	proc_create("stat_irqs", 0, NULL, &proc_stat_irqs_operations);
 	return 0;
 }
 fs_initcall(proc_stat_init);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v4 1/2] perf: riscv: preliminary RISC-V support
From: Atish Patra @ 2018-04-19 19:46 UTC (permalink / raw)
  To: Alan Kao, Palmer Dabbelt, Albert Ou, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Alex Solomatnikov, Jonathan Corbet,
	linux-riscv@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: Nick Hu, Greentime Hu
In-Reply-To: <1524017523-25076-2-git-send-email-alankao@andestech.com>

On 4/17/18 7:13 PM, Alan Kao wrote:
> This patch provide a basic PMU, riscv_base_pmu, which supports two
> general hardware event, instructions and cycles.  Furthermore, this
> PMU serves as a reference implementation to ease the portings in
> the future.
> 
> riscv_base_pmu should be able to run on any RISC-V machine that
> conforms to the Priv-Spec.  Note that the latest qemu model hasn't
> fully support a proper behavior of Priv-Spec 1.10 yet, but work
> around should be easy with very small fixes.  Please check
> https://github.com/riscv/riscv-qemu/pull/115 for future updates.
> 
> Cc: Nick Hu <nickhu@andestech.com>
> Cc: Greentime Hu <greentime@andestech.com>
> Signed-off-by: Alan Kao <alankao@andestech.com>
> ---
>   arch/riscv/Kconfig                  |  13 +
>   arch/riscv/include/asm/perf_event.h |  79 ++++-
>   arch/riscv/kernel/Makefile          |   1 +
>   arch/riscv/kernel/perf_event.c      | 482 ++++++++++++++++++++++++++++
>   4 files changed, 571 insertions(+), 4 deletions(-)
>   create mode 100644 arch/riscv/kernel/perf_event.c
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index c22ebe08e902..90d9c8e50377 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -203,6 +203,19 @@ config RISCV_ISA_C
>   config RISCV_ISA_A
>   	def_bool y
>   
> +menu "supported PMU type"
> +	depends on PERF_EVENTS
> +
> +config RISCV_BASE_PMU
> +	bool "Base Performance Monitoring Unit"
> +	def_bool y
> +	help
> +	  A base PMU that serves as a reference implementation and has limited
> +	  feature of perf.  It can run on any RISC-V machines so serves as the
> +	  fallback, but this option can also be disable to reduce kernel size.
> +
> +endmenu
> +
>   endmenu
>   
>   menu "Kernel type"
> diff --git a/arch/riscv/include/asm/perf_event.h b/arch/riscv/include/asm/perf_event.h
> index e13d2ff29e83..0e638a0c3feb 100644
> --- a/arch/riscv/include/asm/perf_event.h
> +++ b/arch/riscv/include/asm/perf_event.h
> @@ -1,13 +1,84 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
>   /*
>    * Copyright (C) 2018 SiFive
> + * Copyright (C) 2018 Andes Technology Corporation
>    *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public Licence
> - * as published by the Free Software Foundation; either version
> - * 2 of the Licence, or (at your option) any later version.
>    */
>   
>   #ifndef _ASM_RISCV_PERF_EVENT_H
>   #define _ASM_RISCV_PERF_EVENT_H
>   
> +#include <linux/perf_event.h>
> +#include <linux/ptrace.h>
> +
> +#define RISCV_BASE_COUNTERS	2
> +
> +/*
> + * The RISCV_MAX_COUNTERS parameter should be specified.
> + */
> +
> +#ifdef CONFIG_RISCV_BASE_PMU
> +#define RISCV_MAX_COUNTERS	2
> +#endif
> +
> +#ifndef RISCV_MAX_COUNTERS
> +#error "Please provide a valid RISCV_MAX_COUNTERS for the PMU."
> +#endif
> +
> +/*
> + * These are the indexes of bits in counteren register *minus* 1,
> + * except for cycle.  It would be coherent if it can directly mapped
> + * to counteren bit definition, but there is a *time* register at
> + * counteren[1].  Per-cpu structure is scarce resource here.
> + *
> + * According to the spec, an implementation can support counter up to
> + * mhpmcounter31, but many high-end processors has at most 6 general
> + * PMCs, we give the definition to MHPMCOUNTER8 here.
> + */
> +#define RISCV_PMU_CYCLE		0
> +#define RISCV_PMU_INSTRET	1
> +#define RISCV_PMU_MHPMCOUNTER3	2
> +#define RISCV_PMU_MHPMCOUNTER4	3
> +#define RISCV_PMU_MHPMCOUNTER5	4
> +#define RISCV_PMU_MHPMCOUNTER6	5
> +#define RISCV_PMU_MHPMCOUNTER7	6
> +#define RISCV_PMU_MHPMCOUNTER8	7
> +
> +#define RISCV_OP_UNSUPP		(-EOPNOTSUPP)
> +
> +struct cpu_hw_events {
> +	/* # currently enabled events*/
> +	int			n_events;
> +	/* currently enabled events */
> +	struct perf_event	*events[RISCV_MAX_COUNTERS];
> +	/* vendor-defined PMU data */
> +	void			*platform;
> +};
> +
> +struct riscv_pmu {
> +	struct pmu	*pmu;
> +
> +	/* generic hw/cache events table */
> +	const int	*hw_events;
> +	const int	(*cache_events)[PERF_COUNT_HW_CACHE_MAX]
> +				       [PERF_COUNT_HW_CACHE_OP_MAX]
> +				       [PERF_COUNT_HW_CACHE_RESULT_MAX];
> +	/* method used to map hw/cache events */
> +	int		(*map_hw_event)(u64 config);
> +	int		(*map_cache_event)(u64 config);
> +
> +	/* max generic hw events in map */
> +	int		max_events;
> +	/* number total counters, 2(base) + x(general) */
> +	int		num_counters;
> +	/* the width of the counter */
> +	int		counter_width;
> +
> +	/* vendor-defined PMU features */
> +	void		*platform;
> +
> +	irqreturn_t	(*handle_irq)(int irq_num, void *dev);
> +	int		irq;
> +};
> +
>   #endif /* _ASM_RISCV_PERF_EVENT_H */
> diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> index ffa439d4a364..f50d19816757 100644
> --- a/arch/riscv/kernel/Makefile
> +++ b/arch/riscv/kernel/Makefile
> @@ -39,5 +39,6 @@ obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
>   obj-$(CONFIG_FUNCTION_TRACER)	+= mcount.o
>   obj-$(CONFIG_DYNAMIC_FTRACE)	+= mcount-dyn.o
>   obj-$(CONFIG_FUNCTION_GRAPH_TRACER)	+= ftrace.o
> +obj-$(CONFIG_PERF_EVENTS)      += perf_event.o
>   
>   clean:
> diff --git a/arch/riscv/kernel/perf_event.c b/arch/riscv/kernel/perf_event.c
> new file mode 100644
> index 000000000000..ba3192afc470
> --- /dev/null
> +++ b/arch/riscv/kernel/perf_event.c
> @@ -0,0 +1,482 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2008 Thomas Gleixner <tglx@linutronix.de>
> + * Copyright (C) 2008-2009 Red Hat, Inc., Ingo Molnar
> + * Copyright (C) 2009 Jaswinder Singh Rajput
> + * Copyright (C) 2009 Advanced Micro Devices, Inc., Robert Richter
> + * Copyright (C) 2008-2009 Red Hat, Inc., Peter Zijlstra
> + * Copyright (C) 2009 Intel Corporation, <markus.t.metzger@intel.com>
> + * Copyright (C) 2009 Google, Inc., Stephane Eranian
> + * Copyright 2014 Tilera Corporation. All Rights Reserved.
> + * Copyright (C) 2018 Andes Technology Corporation
> + *
> + * Perf_events support for RISC-V platforms.
> + *
> + * Since the spec. (as of now, Priv-Spec 1.10) does not provide enough
> + * functionality for perf event to fully work, this file provides
> + * the very basic framework only.
> + *
> + * For platform portings, please check Documentations/riscv/pmu.txt.
> + *
> + * The Copyright line includes x86 and tile ones.
> + */
> +
> +#include <linux/kprobes.h>
> +#include <linux/kernel.h>
> +#include <linux/kdebug.h>
> +#include <linux/mutex.h>
> +#include <linux/bitmap.h>
> +#include <linux/irq.h>
> +#include <linux/interrupt.h>
> +#include <linux/perf_event.h>
> +#include <linux/atomic.h>
> +#include <linux/of.h>
> +#include <asm/perf_event.h>
> +
> +static const struct riscv_pmu *riscv_pmu __read_mostly;
> +static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
> +
> +/*
> + * Hardware & cache maps and their methods
> + */
> +
> +static const int riscv_hw_event_map[] = {
> +	[PERF_COUNT_HW_CPU_CYCLES]		= RISCV_PMU_CYCLE,
> +	[PERF_COUNT_HW_INSTRUCTIONS]		= RISCV_PMU_INSTRET,
> +	[PERF_COUNT_HW_CACHE_REFERENCES]	= RISCV_OP_UNSUPP,
> +	[PERF_COUNT_HW_CACHE_MISSES]		= RISCV_OP_UNSUPP,
> +	[PERF_COUNT_HW_BRANCH_INSTRUCTIONS]	= RISCV_OP_UNSUPP,
> +	[PERF_COUNT_HW_BRANCH_MISSES]		= RISCV_OP_UNSUPP,
> +	[PERF_COUNT_HW_BUS_CYCLES]		= RISCV_OP_UNSUPP,
> +};
> +
> +#define C(x) PERF_COUNT_HW_CACHE_##x
> +static const int riscv_cache_event_map[PERF_COUNT_HW_CACHE_MAX]
> +[PERF_COUNT_HW_CACHE_OP_MAX]
> +[PERF_COUNT_HW_CACHE_RESULT_MAX] = {
> +	[C(L1D)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +	},
> +	[C(L1I)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +	},
> +	[C(LL)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +	},
> +	[C(DTLB)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)] =  RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] =  RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +	},
> +	[C(ITLB)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +	},
> +	[C(BPU)] = {
> +		[C(OP_READ)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_WRITE)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +		[C(OP_PREFETCH)] = {
> +			[C(RESULT_ACCESS)] = RISCV_OP_UNSUPP,
> +			[C(RESULT_MISS)] = RISCV_OP_UNSUPP,
> +		},
> +	},
> +};
> +
> +static int riscv_map_hw_event(u64 config)
> +{
> +	if (config >= riscv_pmu->max_events)
> +		return -EINVAL;
> +
> +	return riscv_pmu->hw_events[config];
> +}
> +
> +int riscv_map_cache_decode(u64 config, unsigned int *type,
> +			   unsigned int *op, unsigned int *result)
> +{
> +	return -ENOENT;
> +}
> +
> +static int riscv_map_cache_event(u64 config)
> +{
> +	unsigned int type, op, result;
> +	int err = -ENOENT;
> +		int code;
> +
> +	err = riscv_map_cache_decode(config, &type, &op, &result);
> +	if (!riscv_pmu->cache_events || err)
> +		return err;
> +
> +	if (type >= PERF_COUNT_HW_CACHE_MAX ||
> +	    op >= PERF_COUNT_HW_CACHE_OP_MAX ||
> +	    result >= PERF_COUNT_HW_CACHE_RESULT_MAX)
> +		return -EINVAL;
> +
> +	code = (*riscv_pmu->cache_events)[type][op][result];
> +	if (code == RISCV_OP_UNSUPP)
> +		return -EINVAL;
> +
> +	return code;
> +}
> +
> +/*
> + * Low-level functions: reading/writing counters
> + */
> +
> +static inline u64 read_counter(int idx)
> +{
> +	u64 val = 0;
> +
> +	switch (idx) {
> +	case RISCV_PMU_CYCLE:
> +		val = csr_read(cycle);
> +		break;
> +	case RISCV_PMU_INSTRET:
> +		val = csr_read(instret);
> +		break;
> +	default:
> +		WARN_ON_ONCE(idx < 0 ||	idx > RISCV_MAX_COUNTERS);
> +		return -EINVAL;
> +	}
> +
> +	return val;
> +}
> +
> +static inline void write_counter(int idx, u64 value)
> +{
> +	/* currently not supported */
> +	WARN_ON_ONCE(1);
> +}
> +
> +/*
> + * pmu->read: read and update the counter
> + *
> + * Other architectures' implementation often have a xxx_perf_event_update
> + * routine, which can return counter values when called in the IRQ, but
> + * return void when being called by the pmu->read method.
> + */
> +static void riscv_pmu_read(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +	u64 prev_raw_count, new_raw_count;
> +	u64 oldval;
> +	int idx = hwc->idx;
> +	u64 delta;
> +
> +	do {
> +		prev_raw_count = local64_read(&hwc->prev_count);
> +		new_raw_count = read_counter(idx);
> +
> +		oldval = local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> +					 new_raw_count);
> +	} while (oldval != prev_raw_count);
> +
> +	/*
> +	 * delta is the value to update the counter we maintain in the kernel.
> +	 */
> +	delta = (new_raw_count - prev_raw_count) &
> +		((1ULL << riscv_pmu->counter_width) - 1);
> +	local64_add(delta, &event->count);
> +	/*
> +	 * Something like local64_sub(delta, &hwc->period_left) here is
> +	 * needed if there is an interrupt for perf.
> +	 */
> +}
> +
> +/*
> + * State transition functions:
> + *
> + * stop()/start() & add()/del()
> + */
> +
> +/*
> + * pmu->stop: stop the counter
> + */
> +static void riscv_pmu_stop(struct perf_event *event, int flags)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
> +	hwc->state |= PERF_HES_STOPPED;
> +
> +	if ((flags & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
> +		riscv_pmu->pmu->read(event);
> +		hwc->state |= PERF_HES_UPTODATE;
> +	}
> +}
> +
> +/*
> + * pmu->start: start the event.
> + */
> +static void riscv_pmu_start(struct perf_event *event, int flags)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
> +		return;
> +
> +	if (flags & PERF_EF_RELOAD) {
> +		WARN_ON_ONCE(!(event->hw.state & PERF_HES_UPTODATE));
> +
> +		/*
> +		 * Set the counter to the period to the next interrupt here,
> +		 * if you have any.
> +		 */
> +	}
> +
> +	hwc->state = 0;
> +	perf_event_update_userpage(event);
> +
> +	/*
> +	 * Since we cannot write to counters, this serves as an initialization
> +	 * to the delta-mechanism in pmu->read(); otherwise, the delta would be
> +	 * wrong when pmu->read is called for the first time.
> +	 */
> +	local64_set(&hwc->prev_count, read_counter(hwc->idx));
> +}
> +
> +/*
> + * pmu->add: add the event to PMU.
> + */
> +static int riscv_pmu_add(struct perf_event *event, int flags)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (cpuc->n_events == riscv_pmu->num_counters)
> +		return -ENOSPC;
> +
> +	/*
> +	 * We don't have general conunters, so no binding-event-to-counter
> +	 * process here.
> +	 *
> +	 * Indexing using hwc->config generally not works, since config may
> +	 * contain extra information, but here the only info we have in
> +	 * hwc->config is the event index.
> +	 */
> +	hwc->idx = hwc->config;
> +	cpuc->events[hwc->idx] = event;
> +	cpuc->n_events++;
> +
> +	hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
> +
> +	if (flags & PERF_EF_START)
> +		riscv_pmu->pmu->start(event, PERF_EF_RELOAD);
> +
> +	return 0;
> +}
> +
> +/*
> + * pmu->del: delete the event from PMU.
> + */
> +static void riscv_pmu_del(struct perf_event *event, int flags)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	cpuc->events[hwc->idx] = NULL;
> +	cpuc->n_events--;
> +	riscv_pmu->pmu->stop(event, PERF_EF_UPDATE);
> +	perf_event_update_userpage(event);
> +}
> +
> +/*
> + * Interrupt: a skeletion for reference.
> + */
> +
> +static DEFINE_MUTEX(pmc_reserve_mutex);
> +
> +irqreturn_t riscv_base_pmu_handle_irq(int irq_num, void *dev)
> +{
> +	return IRQ_NONE;
> +}
> +
> +static int reserve_pmc_hardware(void)
> +{
> +	int err = 0;
> +
> +	mutex_lock(&pmc_reserve_mutex);
> +	if (riscv_pmu->irq >=0 && riscv_pmu->handle_irq) {
> +		err = request_irq(riscv_pmu->irq, riscv_pmu->handle_irq,
> +				  IRQF_PERCPU, "riscv-base-perf", NULL);
> +	}
> +	mutex_unlock(&pmc_reserve_mutex);
> +
> +	return err;
> +}
> +
> +void release_pmc_hardware(void)
> +{
> +	mutex_lock(&pmc_reserve_mutex);
> +	if (riscv_pmu->irq >=0) {
> +		free_irq(riscv_pmu->irq, NULL);
> +	}
> +	mutex_unlock(&pmc_reserve_mutex);
> +}
> +
> +/*
> + * Event Initialization/Finalization
> + */
> +
> +static atomic_t riscv_active_events = ATOMIC_INIT(0);
> +
> +static void riscv_event_destroy(struct perf_event *event)
> +{
> +	if (atomic_dec_return(&riscv_active_events) == 0)
> +		release_pmc_hardware();
> +}
> +
> +static int riscv_event_init(struct perf_event *event)
> +{
> +	struct perf_event_attr *attr = &event->attr;
> +	struct hw_perf_event *hwc = &event->hw;
> +	int err;
> +	int code;
> +
> +	if (atomic_inc_return(&riscv_active_events) == 1) {
> +		err = reserve_pmc_hardware();
> +
> +		if (err) {
> +			pr_warn("PMC hardware not available\n");
> +			atomic_dec(&riscv_active_events);
> +			return -EBUSY;
> +		}
> +	}
> +
> +	switch (event->attr.type) {
> +	case PERF_TYPE_HARDWARE:
> +		code = riscv_pmu->map_hw_event(attr->config);
> +		break;
> +	case PERF_TYPE_HW_CACHE:
> +		code = riscv_pmu->map_cache_event(attr->config);
> +		break;
> +	case PERF_TYPE_RAW:
> +		return -EOPNOTSUPP;
> +	default:
> +		return -ENOENT;
> +	}
> +
> +	event->destroy = riscv_event_destroy;
> +	if (code < 0) {
> +		event->destroy(event);
> +		return code;
> +	}
> +
> +	/*
> +	 * idx is set to -1 because the index of a general event should not be
> +	 * decided until binding to some counter in pmu->add().
> +	 *
> +	 * But since we don't have such support, later in pmu->add(), we just
> +	 * use hwc->config as the index instead.
> +	 */
> +	hwc->config = code;
> +	hwc->idx = -1;
> +
> +	return 0;
> +}
> +
> +/*
> + * Initialization
> + */
> +
> +static struct pmu min_pmu = {
> +	.name		= "riscv-base",
> +	.event_init	= riscv_event_init,
> +	.add		= riscv_pmu_add,
> +	.del		= riscv_pmu_del,
> +	.start		= riscv_pmu_start,
> +	.stop		= riscv_pmu_stop,
> +	.read		= riscv_pmu_read,
> +};
> +
> +static const struct riscv_pmu riscv_base_pmu = {
> +	.pmu = &min_pmu,
> +	.max_events = ARRAY_SIZE(riscv_hw_event_map),
> +	.map_hw_event = riscv_map_hw_event,
> +	.hw_events = riscv_hw_event_map,
> +	.map_cache_event = riscv_map_cache_event,
> +	.cache_events = &riscv_cache_event_map,
> +	.counter_width = 63,
> +	.num_counters = RISCV_BASE_COUNTERS + 0,
> +	.handle_irq = &riscv_base_pmu_handle_irq,
> +
> +	/* This means this PMU has no IRQ. */
> +	.irq = -1,
> +};
> +
> +static const struct of_device_id riscv_pmu_of_ids[] = {
> +	{.compatible = "riscv,base-pmu",	.data = &riscv_base_pmu},
> +	{ /* sentinel value */ }
> +};
> +
> +int __init init_hw_perf_events(void)
> +{
> +	struct device_node *node = of_find_node_by_type(NULL, "pmu");
> +	const struct of_device_id *of_id;
> +	
> +	if (node && (of_id = of_match_node(riscv_pmu_of_ids, node)))
> +		riscv_pmu = of_id->data;
> +	else
> +		riscv_pmu = &riscv_base_pmu;
> +
> +	perf_pmu_register(riscv_pmu->pmu, "cpu", PERF_TYPE_RAW);
> +	return 0;
> +}
> +arch_initcall(init_hw_perf_events);
> 
Some check patch errors.

ERROR: spaces required around that '>=' (ctx:WxV)
#517: FILE: arch/riscv/kernel/perf_event.c:356:
+	if (riscv_pmu->irq >=0 && riscv_pmu->handle_irq) {
  	                   ^

ERROR: spaces required around that '>=' (ctx:WxV)
#529: FILE: arch/riscv/kernel/perf_event.c:368:
+	if (riscv_pmu->irq >=0) {
  	                   ^

WARNING: braces {} are not necessary for single statement blocks
#529: FILE: arch/riscv/kernel/perf_event.c:368:
+	if (riscv_pmu->irq >=0) {
+		free_irq(riscv_pmu->irq, NULL);
+	}

WARNING: DT compatible string "riscv,base-pmu" appears un-documented -- 
check ./Documentation/devicetree/bindings/
#626: FILE: arch/riscv/kernel/perf_event.c:465:
+	{.compatible = "riscv,base-pmu",	.data = &riscv_base_pmu},

ERROR: trailing whitespace
#634: FILE: arch/riscv/kernel/perf_event.c:473:
+^I$

ERROR: do not use assignment in if condition
#635: FILE: arch/riscv/kernel/perf_event.c:474:
+	if (node && (of_id = of_match_node(riscv_pmu_of_ids, node)))

total: 4 errors, 3 warnings, 595 lines checked


Regards,
Atish
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v3 07/10] Documentation: dt-bindings: Add documents for PECI hwmon client drivers
From: Jae Hyun Yoo @ 2018-04-19 19:48 UTC (permalink / raw)
  To: Rob Herring
  Cc: Alan Cox, Andrew Jeffery, Andrew Lunn, Andy Shevchenko,
	Arnd Bergmann, Benjamin Herrenschmidt, Fengguang Wu, Greg KH,
	Guenter Roeck, Haiyue Wang, James Feist, Jason M Biils,
	Jean Delvare, Joel Stanley, Julia Cartwright, Miguel Ojeda,
	Milton Miller II, Pavel Machek, Randy Dunlap, Stef van Os,
	Sumeet R Pawnikar, Vernon Mauery, linux-kernel@vger.kernel.org,
	linux-doc, devicetree, Linux HWMON List,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	OpenBMC Maillist
In-Reply-To: <884d3c16-84e8-e38a-731a-313ba7105176@linux.intel.com>

On 4/18/2018 2:57 PM, Jae Hyun Yoo wrote:
> On 4/18/2018 2:28 PM, Rob Herring wrote:
>> On Wed, Apr 18, 2018 at 3:28 PM, Jae Hyun Yoo
>> <jae.hyun.yoo@linux.intel.com> wrote:
>>> On 4/18/2018 7:32 AM, Rob Herring wrote:
>>>>
>>>> On Tue, Apr 17, 2018 at 3:40 PM, Jae Hyun Yoo
>>>> <jae.hyun.yoo@linux.intel.com> wrote:
>>>>>
>>>>> On 4/16/2018 4:51 PM, Jae Hyun Yoo wrote:
>>>>>>
>>>>>>
>>>>>> On 4/16/2018 4:22 PM, Jae Hyun Yoo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 4/16/2018 11:14 AM, Rob Herring wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 10, 2018 at 11:32:09AM -0700, Jae Hyun Yoo wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This commit adds dt-bindings documents for PECI cputemp and 
>>>>>>>>> dimmtemp
>>>>>>>>> client
>>>>>>>>> drivers.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>>>>> +Example:
>>>>>>>>> +    peci-bus@0 {
>>>>>>>>> +        #address-cells = <1>;
>>>>>>>>> +        #size-cells = <0>;
>>>>>>>>> +        < more properties >
>>>>>>>>> +
>>>>>>>>> +        peci-dimmtemp@cpu0 {
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> unit-address is wrong.
>>>>>>>>
>>>>>>>
>>>>>>> Will fix it using the reg value.
>>>>>>>
>>>>>>>> It is a different bus from cputemp? Otherwise, you have conflicting
>>>>>>>> addresses. If that's the case, probably should make it clear by
>>>>>>>> showing
>>>>>>>> different host adapters for each example.
>>>>>>>>
>>>>>>>
>>>>>>> It could be the same bus with cputemp. Also, client address 
>>>>>>> sharing is
>>>>>>> possible by PECI core if the functionality is different. I mean,
>>>>>>> cputemp and
>>>>>>> dimmtemp targeting the same client is possible case like this.
>>>>>>> peci-cputemp@30
>>>>>>> peci-dimmtemp@30
>>>>>>>
>>>>>>
>>>>>> Oh, I got your point. Probably, I should change these separate 
>>>>>> settings
>>>>>> into one like
>>>>>>
>>>>>> peci-client@30 {
>>>>>>        compatible = "intel,peci-client";
>>>>>>        reg = <0x30>;
>>>>>> };
>>>>>>
>>>>>> Then cputemp and dimmtemp drivers could refer the same compatible
>>>>>> string.
>>>>>> Will rewrite it.
>>>>>>
>>>>>
>>>>> I've checked it again and realized that it should use function 
>>>>> based node
>>>>> name like:
>>>>>
>>>>> peci-cputemp@30
>>>>> peci-dimmtemp@30
>>>>>
>>>>> If it use the same string like 'peci-client@30', the drivers cannot be
>>>>> selectively enabled. The client address sharing way is well handled in
>>>>> PECI
>>>>> core and this way would be better for the future implementations of 
>>>>> other
>>>>> PECI functional drivers such as crash dump driver and so on. So I'm 
>>>>> going
>>>>> change the unit-address only.
>>>>
>>>>
>>>> 2 nodes at the same address is wrong (and soon dtc will warn you on
>>>> this). You have 2 potential options. The first is you need additional
>>>> address information in the DT if these are in fact 2 independent
>>>> devices. This could be something like a function number to use
>>>> something from PCI addressing. From what I found on PECI, it doesn't
>>>> seem to have anything like that. The 2nd option is you have a single
>>>> DT node which registers multiple hwmon devices. DT nodes and drivers
>>>> don't have to be 1-1. Don't design your DT nodes from how you want to
>>>> partition drivers in some OS.
>>>>
>>>> Rob
>>>>
>>>
>>> Please correct me if I'm wrong but I'm still thinking that it is
>>> possible. Also, I did compile it but dtc doesn't make a warning. Let me
>>> show an another use case which is similar to this case:
>>
>> I did say *soon*. It's in dtc repo, but not the kernel copy yet.
>>
>>> In arch/arm/boot/dts/aspeed-g5.dtsi
>>> [...]
>>> lpc_host: lpc-host@80 {
>>>          compatible = "aspeed,ast2500-lpc-host", "simple-mfd", "syscon";
>>>          reg = <0x80 0x1e0>;
>>>          reg-io-width = <4>;
>>>
>>>          #address-cells = <1>;
>>>          #size-cells = <1>;
>>>          ranges = <0x0 0x80 0x1e0>;
>>>
>>>          lpc_ctrl: lpc-ctrl@0 {
>>>                  compatible = "aspeed,ast2500-lpc-ctrl";
>>>                  reg = <0x0 0x80>;
>>>                  clocks = <&syscon ASPEED_CLK_GATE_LCLK>;
>>>                  status = "disabled";
>>>          };
>>>
>>>          lpc_snoop: lpc-snoop@0 {
>>>                  compatible = "aspeed,ast2500-lpc-snoop";
>>>                  reg = <0x0 0x80>;
>>>                  interrupts = <8>;
>>>                  status = "disabled";
>>>          };
>>> }
>>> [...]
>>>
>>> This is device tree setting for LPC interface and its child nodes.
>>> LPC interface can be used as a multi-functional interface such as
>>> snoop 80, KCS, SIO and so on. In this use case, lpc-ctrl@0 and
>>> lpc-snoop@0 are sharing their address range from their individual
>>> driver modules and they can be registered quite well through both
>>> static dt or dynamic dtoverlay. PECI is also a multi-functional
>>> interface which is similar to the above case, I think.
>>
>> This case too is poor design and should be fixed as well. Simply put,
>> you can have 2 devices on a bus at the same address without some sort
>> of mux or arbitration device in the middle. If you have a device/block
>> with multiple functions provided to the OS, then it is the OS's
>> problem to arbitrate access. It is not a DT problem because OS's can
>> vary in how they handle that both from OS to OS and over time.
>>
>> Rob
>>
> 
> If I change it to a single DT node which registers 2 hwmon devices using
> the 2nd option above, then I still have 2 devices on a bus at the same
> address. Does it also make a problem to the OS then?
> 
> Jae

Additionally, I need to explain that there is one and only bus host
(adapter) and multiple clients on a PECI bus, and PECI spec doesn't
allow multiple originators so only the host device can originate
message. In this implementation, all message transactions on a bus from
client driver modules and user space will be serialized well in the PECI
core bus driver so bus occupation and traffic arbitration will be
managed well in the PECI core bus driver even in case of a bus has 2 
client drivers at the same address. I'm sure that this implementation 
doesn't make that kind of problem to OS.

Jae

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 0/7] i2c: clean up include/linux/i2c-*
From: Wolfram Sang @ 2018-04-19 20:00 UTC (permalink / raw)
  To: linux-i2c
  Cc: Wolfram Sang, linux-arm-kernel, linux-doc, linux-kernel,
	linux-media, linux-mips, linux-omap, linux-sh

Move all plain platform_data includes to the platform_data-dir
(except for i2c-pnx which can be moved into the driver itself).

My preference is to take these patches via the i2c tree. I can provide an
immutable branch if needed. But we can also discuss those going in via
arch-trees if dependencies are against us.

The current branch can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux.git i2c/platform_data

and buildbot had no complaints.

Looking forward to comments or Acks, Revs...

Kind regards,

   Wolfram


Wolfram Sang (7):
  i2c: i2c-gpio: move header to platform_data
  i2c: i2c-mux-gpio: move header to platform_data
  i2c: i2c-ocores: move header to platform_data
  i2c: i2c-omap: move header to platform_data
  i2c: i2c-pca-platform: move header to platform_data
  i2c: i2c-xiic: move header to platform_data
  i2c: pnx: move header into the driver

 Documentation/i2c/busses/i2c-ocores                |  2 +-
 Documentation/i2c/muxes/i2c-mux-gpio               |  4 +--
 MAINTAINERS                                        |  8 ++---
 arch/arm/mach-ks8695/board-acs5k.c                 |  2 +-
 arch/arm/mach-omap1/board-htcherald.c              |  2 +-
 arch/arm/mach-omap1/common.h                       |  2 +-
 arch/arm/mach-omap1/i2c.c                          |  2 +-
 arch/arm/mach-omap2/common.h                       |  2 +-
 arch/arm/mach-omap2/omap_hwmod_2420_data.c         |  2 +-
 arch/arm/mach-omap2/omap_hwmod_2430_data.c         |  2 +-
 arch/arm/mach-omap2/omap_hwmod_33xx_data.c         |  2 +-
 arch/arm/mach-omap2/omap_hwmod_3xxx_data.c         |  2 +-
 arch/arm/mach-omap2/omap_hwmod_44xx_data.c         |  2 +-
 arch/arm/mach-omap2/omap_hwmod_54xx_data.c         |  2 +-
 arch/arm/mach-omap2/omap_hwmod_7xx_data.c          |  2 +-
 arch/arm/mach-pxa/palmz72.c                        |  2 +-
 arch/arm/mach-pxa/viper.c                          |  2 +-
 arch/arm/mach-sa1100/simpad.c                      |  2 +-
 arch/mips/alchemy/board-gpr.c                      |  2 +-
 arch/sh/boards/board-sh7785lcr.c                   |  2 +-
 drivers/i2c/busses/i2c-gpio.c                      |  2 +-
 drivers/i2c/busses/i2c-i801.c                      |  2 +-
 drivers/i2c/busses/i2c-ocores.c                    |  2 +-
 drivers/i2c/busses/i2c-omap.c                      |  2 +-
 drivers/i2c/busses/i2c-pca-platform.c              |  2 +-
 drivers/i2c/busses/i2c-pnx.c                       | 21 +++++++++++-
 drivers/i2c/busses/i2c-xiic.c                      |  2 +-
 drivers/i2c/muxes/i2c-mux-gpio.c                   |  2 +-
 drivers/media/platform/marvell-ccic/mmp-driver.c   |  2 +-
 drivers/mfd/sm501.c                                |  2 +-
 drivers/mfd/timberdale.c                           |  4 +--
 include/linux/i2c-pnx.h                            | 38 ----------------------
 include/linux/{ => platform_data}/i2c-gpio.h       |  0
 include/linux/{ => platform_data}/i2c-mux-gpio.h   |  0
 include/linux/{ => platform_data}/i2c-ocores.h     |  0
 include/linux/{ => platform_data}/i2c-omap.h       |  0
 .../linux/{ => platform_data}/i2c-pca-platform.h   |  0
 include/linux/{ => platform_data}/i2c-xiic.h       |  0
 38 files changed, 55 insertions(+), 74 deletions(-)
 delete mode 100644 include/linux/i2c-pnx.h
 rename include/linux/{ => platform_data}/i2c-gpio.h (100%)
 rename include/linux/{ => platform_data}/i2c-mux-gpio.h (100%)
 rename include/linux/{ => platform_data}/i2c-ocores.h (100%)
 rename include/linux/{ => platform_data}/i2c-omap.h (100%)
 rename include/linux/{ => platform_data}/i2c-pca-platform.h (100%)
 rename include/linux/{ => platform_data}/i2c-xiic.h (100%)

-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 3/7] i2c: i2c-ocores: move header to platform_data
From: Wolfram Sang @ 2018-04-19 20:00 UTC (permalink / raw)
  To: linux-i2c
  Cc: Wolfram Sang, Peter Korsgaard, Jonathan Corbet, Lee Jones,
	linux-doc, linux-kernel
In-Reply-To: <20180419200015.15095-1-wsa@the-dreams.de>

This header only contains platform_data. Move it to the proper directory.

Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
---
 Documentation/i2c/busses/i2c-ocores            | 2 +-
 drivers/i2c/busses/i2c-ocores.c                | 2 +-
 drivers/mfd/timberdale.c                       | 2 +-
 include/linux/{ => platform_data}/i2c-ocores.h | 0
 4 files changed, 3 insertions(+), 3 deletions(-)
 rename include/linux/{ => platform_data}/i2c-ocores.h (100%)

diff --git a/Documentation/i2c/busses/i2c-ocores b/Documentation/i2c/busses/i2c-ocores
index c269aaa2f26a..c12fa9d3b050 100644
--- a/Documentation/i2c/busses/i2c-ocores
+++ b/Documentation/i2c/busses/i2c-ocores
@@ -18,7 +18,7 @@ Usage
 i2c-ocores uses the platform bus, so you need to provide a struct
 platform_device with the base address and interrupt number. The
 dev.platform_data of the device should also point to a struct
-ocores_i2c_platform_data (see linux/i2c-ocores.h) describing the
+ocores_i2c_platform_data (see linux/platform_data/i2c-ocores.h) describing the
 distance between registers and the input clock speed.
 There is also a possibility to attach a list of i2c_board_info which
 the i2c-ocores driver will add to the bus upon creation.
diff --git a/drivers/i2c/busses/i2c-ocores.c b/drivers/i2c/busses/i2c-ocores.c
index 8c42ca7107b2..d7da9adf7ee1 100644
--- a/drivers/i2c/busses/i2c-ocores.c
+++ b/drivers/i2c/busses/i2c-ocores.c
@@ -21,7 +21,7 @@
 #include <linux/i2c.h>
 #include <linux/interrupt.h>
 #include <linux/wait.h>
-#include <linux/i2c-ocores.h>
+#include <linux/platform_data/i2c-ocores.h>
 #include <linux/slab.h>
 #include <linux/io.h>
 #include <linux/log2.h>
diff --git a/drivers/mfd/timberdale.c b/drivers/mfd/timberdale.c
index cd4a6d7d6750..118d7ef727e6 100644
--- a/drivers/mfd/timberdale.c
+++ b/drivers/mfd/timberdale.c
@@ -30,7 +30,7 @@
 #include <linux/timb_gpio.h>
 
 #include <linux/i2c.h>
-#include <linux/i2c-ocores.h>
+#include <linux/platform_data/i2c-ocores.h>
 #include <linux/i2c-xiic.h>
 
 #include <linux/spi/spi.h>
diff --git a/include/linux/i2c-ocores.h b/include/linux/platform_data/i2c-ocores.h
similarity index 100%
rename from include/linux/i2c-ocores.h
rename to include/linux/platform_data/i2c-ocores.h
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 2/7] i2c: i2c-mux-gpio: move header to platform_data
From: Wolfram Sang @ 2018-04-19 20:00 UTC (permalink / raw)
  To: linux-i2c
  Cc: Wolfram Sang, Peter Korsgaard, Peter Rosin, Jonathan Corbet,
	Jean Delvare, linux-doc, linux-kernel
In-Reply-To: <20180419200015.15095-1-wsa@the-dreams.de>

This header only contains platform_data. Move it to the proper directory.

Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
---
 Documentation/i2c/muxes/i2c-mux-gpio             | 4 ++--
 MAINTAINERS                                      | 2 +-
 drivers/i2c/busses/i2c-i801.c                    | 2 +-
 drivers/i2c/muxes/i2c-mux-gpio.c                 | 2 +-
 include/linux/{ => platform_data}/i2c-mux-gpio.h | 0
 5 files changed, 5 insertions(+), 5 deletions(-)
 rename include/linux/{ => platform_data}/i2c-mux-gpio.h (100%)

diff --git a/Documentation/i2c/muxes/i2c-mux-gpio b/Documentation/i2c/muxes/i2c-mux-gpio
index 7a8d7d261632..893ecdfe6e43 100644
--- a/Documentation/i2c/muxes/i2c-mux-gpio
+++ b/Documentation/i2c/muxes/i2c-mux-gpio
@@ -30,12 +30,12 @@ i2c-mux-gpio uses the platform bus, so you need to provide a struct
 platform_device with the platform_data pointing to a struct
 i2c_mux_gpio_platform_data with the I2C adapter number of the master
 bus, the number of bus segments to create and the GPIO pins used
-to control it. See include/linux/i2c-mux-gpio.h for details.
+to control it. See include/linux/platform_data/i2c-mux-gpio.h for details.
 
 E.G. something like this for a MUX providing 4 bus segments
 controlled through 3 GPIO pins:
 
-#include <linux/i2c-mux-gpio.h>
+#include <linux/platform_data/i2c-mux-gpio.h>
 #include <linux/platform_device.h>
 
 static const unsigned myboard_gpiomux_gpios[] = {
diff --git a/MAINTAINERS b/MAINTAINERS
index 7aad64b62102..44b4bb5cf94e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5879,7 +5879,7 @@ M:	Peter Korsgaard <peter.korsgaard@barco.com>
 L:	linux-i2c@vger.kernel.org
 S:	Supported
 F:	drivers/i2c/muxes/i2c-mux-gpio.c
-F:	include/linux/i2c-mux-gpio.h
+F:	include/linux/platform_data/i2c-mux-gpio.h
 F:	Documentation/i2c/muxes/i2c-mux-gpio
 
 GENERIC HDLC (WAN) DRIVERS
diff --git a/drivers/i2c/busses/i2c-i801.c b/drivers/i2c/busses/i2c-i801.c
index e0d59e9ff3c6..bff160d1ce3f 100644
--- a/drivers/i2c/busses/i2c-i801.c
+++ b/drivers/i2c/busses/i2c-i801.c
@@ -106,7 +106,7 @@
 
 #if IS_ENABLED(CONFIG_I2C_MUX_GPIO) && defined CONFIG_DMI
 #include <linux/gpio.h>
-#include <linux/i2c-mux-gpio.h>
+#include <linux/platform_data/i2c-mux-gpio.h>
 #endif
 
 /* I801 SMBus address offsets */
diff --git a/drivers/i2c/muxes/i2c-mux-gpio.c b/drivers/i2c/muxes/i2c-mux-gpio.c
index 1a9973ede443..15a7cc0459fb 100644
--- a/drivers/i2c/muxes/i2c-mux-gpio.c
+++ b/drivers/i2c/muxes/i2c-mux-gpio.c
@@ -10,7 +10,7 @@
 
 #include <linux/i2c.h>
 #include <linux/i2c-mux.h>
-#include <linux/i2c-mux-gpio.h>
+#include <linux/platform_data/i2c-mux-gpio.h>
 #include <linux/platform_device.h>
 #include <linux/module.h>
 #include <linux/slab.h>
diff --git a/include/linux/i2c-mux-gpio.h b/include/linux/platform_data/i2c-mux-gpio.h
similarity index 100%
rename from include/linux/i2c-mux-gpio.h
rename to include/linux/platform_data/i2c-mux-gpio.h
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] Documentation/i2c: direct users to i2c-tools regarding i2c/smbus.h
From: Sam Hansen @ 2018-04-19 22:23 UTC (permalink / raw)
  To: linux-i2c, jdelvare, hansens, wsa, corbet, linux-doc,
	linux-kernel
  Cc: jdelvare, Sam Hansen, wsa, corbet, linux-doc, linux-kernel

The current examples reference i2c/smbus.h, which is the first reference
in Documentation/i2c/dev-interface to anything related to the i2c-tools
project.  This moves the existing reference to i2c-tools up into the
C-example, directing the user to the project's git repository.

Signed-off-by: Sam Hansen <hansens@google.com>
---
 Documentation/i2c/dev-interface | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/i2c/dev-interface b/Documentation/i2c/dev-interface
index fbed645ccd75..ed20f7b7851a 100644
--- a/Documentation/i2c/dev-interface
+++ b/Documentation/i2c/dev-interface
@@ -23,6 +23,10 @@ First, you need to include these two headers:
   #include <linux/i2c-dev.h>
   #include <i2c/smbus.h>
 
+The i2c/smbus header file is provided by the i2c-tools project.  For more info
+about i2c-tools, see:
+https://git.kernel.org/pub/scm/utils/i2c-tools/i2c-tools.git/
+
 Now, you have to decide which adapter you want to access. You should
 inspect /sys/class/i2c-dev/ or run "i2cdetect -l" to decide this.
 Adapter numbers are assigned somewhat dynamically, so you can not
@@ -163,10 +167,6 @@ what happened. The 'write' transactions return 0 on success; the
 returns the number of values read. The block buffers need not be longer
 than 32 bytes.
 
-The above functions are made available by linking against the libi2c library,
-which is provided by the i2c-tools project.  See:
-https://git.kernel.org/pub/scm/utils/i2c-tools/i2c-tools.git/.
-
 
 Implementation details
 ======================
-- 
2.17.0.484.g0c8726318c-goog

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v4 1/2] perf: riscv: preliminary RISC-V support
From: Alan Kao @ 2018-04-19 23:24 UTC (permalink / raw)
  To: Atish Patra
  Cc: Palmer Dabbelt, Albert Ou, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Alex Solomatnikov, Jonathan Corbet,
	linux-riscv@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Nick Hu, Greentime Hu
In-Reply-To: <c057c7d1-421f-f4eb-3f03-2abda16d876a@wdc.com>

On Thu, Apr 19, 2018 at 12:46:24PM -0700, Atish Patra wrote:
> On 4/17/18 7:13 PM, Alan Kao wrote:
> >This patch provide a basic PMU, riscv_base_pmu, which supports two
> >general hardware event, instructions and cycles.  Furthermore, this
> >PMU serves as a reference implementation to ease the portings in
> >the future.
> >
> >riscv_base_pmu should be able to run on any RISC-V machine that
> >conforms to the Priv-Spec.  Note that the latest qemu model hasn't
> >fully support a proper behavior of Priv-Spec 1.10 yet, but work
> >around should be easy with very small fixes.  Please check
> >https://github.com/riscv/riscv-qemu/pull/115 for future updates.
> >
> >Cc: Nick Hu <nickhu@andestech.com>
> >Cc: Greentime Hu <greentime@andestech.com>
> >Signed-off-by: Alan Kao <alankao@andestech.com>
> >---
> >  arch/riscv/Kconfig                  |  13 +
> >  arch/riscv/include/asm/perf_event.h |  79 ++++-
> >  arch/riscv/kernel/Makefile          |   1 +
> >  arch/riscv/kernel/perf_event.c      | 482 ++++++++++++++++++++++++++++
> >  4 files changed, 571 insertions(+), 4 deletions(-)
> >  create mode 100644 arch/riscv/kernel/perf_event.c
> >
> >diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >index c22ebe08e902..90d9c8e50377 100644
> >--- a/arch/riscv/Kconfig
> >+++ b/arch/riscv/Kconfig
> Some check patch errors.
> 
> ERROR: spaces required around that '>=' (ctx:WxV)
> #517: FILE: arch/riscv/kernel/perf_event.c:356:
> +	if (riscv_pmu->irq >=0 && riscv_pmu->handle_irq) {
>  	                   ^
> 
> ERROR: spaces required around that '>=' (ctx:WxV)
> #529: FILE: arch/riscv/kernel/perf_event.c:368:
> +	if (riscv_pmu->irq >=0) {
>  	                   ^
> 
> WARNING: braces {} are not necessary for single statement blocks
> #529: FILE: arch/riscv/kernel/perf_event.c:368:
> +	if (riscv_pmu->irq >=0) {
> +		free_irq(riscv_pmu->irq, NULL);
> +	}
> 
> WARNING: DT compatible string "riscv,base-pmu" appears un-documented --
> check ./Documentation/devicetree/bindings/
> #626: FILE: arch/riscv/kernel/perf_event.c:465:
> +	{.compatible = "riscv,base-pmu",	.data = &riscv_base_pmu},
> 
> ERROR: trailing whitespace
> #634: FILE: arch/riscv/kernel/perf_event.c:473:
> +^I$
> 
> ERROR: do not use assignment in if condition
> #635: FILE: arch/riscv/kernel/perf_event.c:474:
> +	if (node && (of_id = of_match_node(riscv_pmu_of_ids, node)))
> 
> total: 4 errors, 3 warnings, 595 lines checked
> 
> 
> Regards,
> Atish

Thanks for pointing this out.  I happened to develop this patchset on a machine
without the post-commit settings.  A new version is ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v5 0/2] perf: riscv: Preliminary Perf Event Support on RISC-V
From: Alan Kao @ 2018-04-19 23:27 UTC (permalink / raw)
  To: Palmer Dabbelt, Albert Ou, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Alex Solomatnikov, Atish Patra, Jonathan Corbet,
	linux-riscv, linux-doc, linux-kernel
  Cc: Alan Kao, Greentime Hu, Nick Hu

This implements the baseline PMU for RISC-V platforms.

To ease future PMU portings, a guide is also written, containing
perf concepts, arch porting practices and some hints.

Changes in v5:
 - Fix patch errors from checkpatch.pl.

Changes in v4:
 - Fix several compilation errors.  Sorry for that.
 - Raise a warning in the write_counter body.

Changes in v3:
 - Fix typos in the document.
 - Change the initialization routine from statically assigning PMU to
   device-tree-based methods, and set default to the PMU proposed in
   this patch.

Changes in v2:
 - Fix the bug reported by Alex, which was caused by not sufficient
   initialization.  Check https://lkml.org/lkml/2018/3/31/251 for the
   discussion.

Alan Kao (2):
  perf: riscv: preliminary RISC-V support
  perf: riscv: Add Document for Future Porting Guide

 Documentation/riscv/pmu.txt         | 249 ++++++++++++++
 arch/riscv/Kconfig                  |  13 +
 arch/riscv/include/asm/perf_event.h |  79 ++++-
 arch/riscv/kernel/Makefile          |   1 +
 arch/riscv/kernel/perf_event.c      | 485 ++++++++++++++++++++++++++++
 5 files changed, 823 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/riscv/pmu.txt
 create mode 100644 arch/riscv/kernel/perf_event.c

-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v5 2/2] perf: riscv: Add Document for Future Porting Guide
From: Alan Kao @ 2018-04-19 23:27 UTC (permalink / raw)
  To: Palmer Dabbelt, Albert Ou, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Alex Solomatnikov, Atish Patra, Jonathan Corbet,
	linux-riscv, linux-doc, linux-kernel
  Cc: Alan Kao, Nick Hu, Greentime Hu
In-Reply-To: <1524180470-8622-1-git-send-email-alankao@andestech.com>

Reviewed-by: Alex Solomatnikov <sols@sifive.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <greentime@andestech.com>
Signed-off-by: Alan Kao <alankao@andestech.com>
---
 Documentation/riscv/pmu.txt | 249 ++++++++++++++++++++++++++++++++++++
 1 file changed, 249 insertions(+)
 create mode 100644 Documentation/riscv/pmu.txt

diff --git a/Documentation/riscv/pmu.txt b/Documentation/riscv/pmu.txt
new file mode 100644
index 000000000000..b29f03a6d82f
--- /dev/null
+++ b/Documentation/riscv/pmu.txt
@@ -0,0 +1,249 @@
+Supporting PMUs on RISC-V platforms
+==========================================
+Alan Kao <alankao@andestech.com>, Mar 2018
+
+Introduction
+------------
+
+As of this writing, perf_event-related features mentioned in The RISC-V ISA
+Privileged Version 1.10 are as follows:
+(please check the manual for more details)
+
+* [m|s]counteren
+* mcycle[h], cycle[h]
+* minstret[h], instret[h]
+* mhpeventx, mhpcounterx[h]
+
+With such function set only, porting perf would require a lot of work, due to
+the lack of the following general architectural performance monitoring features:
+
+* Enabling/Disabling counters
+  Counters are just free-running all the time in our case.
+* Interrupt caused by counter overflow
+  No such feature in the spec.
+* Interrupt indicator
+  It is not possible to have many interrupt ports for all counters, so an
+  interrupt indicator is required for software to tell which counter has
+  just overflowed.
+* Writing to counters
+  There will be an SBI to support this since the kernel cannot modify the
+  counters [1].  Alternatively, some vendor considers to implement
+  hardware-extension for M-S-U model machines to write counters directly.
+
+This document aims to provide developers a quick guide on supporting their
+PMUs in the kernel.  The following sections briefly explain perf' mechanism
+and todos.
+
+You may check previous discussions here [1][2].  Also, it might be helpful
+to check the appendix for related kernel structures.
+
+
+1. Initialization
+-----------------
+
+*riscv_pmu* is a global pointer of type *struct riscv_pmu*, which contains
+various methods according to perf's internal convention and PMU-specific
+parameters.  One should declare such instance to represent the PMU.  By default,
+*riscv_pmu* points to a constant structure *riscv_base_pmu*, which has very
+basic support to a baseline QEMU model.
+
+Then he/she can either assign the instance's pointer to *riscv_pmu* so that
+the minimal and already-implemented logic can be leveraged, or invent his/her
+own *riscv_init_platform_pmu* implementation.
+
+In other words, existing sources of *riscv_base_pmu* merely provide a
+reference implementation.  Developers can flexibly decide how many parts they
+can leverage, and in the most extreme case, they can customize every function
+according to their needs.
+
+
+2. Event Initialization
+-----------------------
+
+When a user launches a perf command to monitor some events, it is first
+interpreted by the userspace perf tool into multiple *perf_event_open*
+system calls, and then each of them calls to the body of *event_init*
+member function that was assigned in the previous step.  In *riscv_base_pmu*'s
+case, it is *riscv_event_init*.
+
+The main purpose of this function is to translate the event provided by user
+into bitmap, so that HW-related control registers or counters can directly be
+manipulated.  The translation is based on the mappings and methods provided in
+*riscv_pmu*.
+
+Note that some features can be done in this stage as well:
+
+(1) interrupt setting, which is stated in the next section;
+(2) privilege level setting (user space only, kernel space only, both);
+(3) destructor setting.  Normally it is sufficient to apply *riscv_destroy_event*;
+(4) tweaks for non-sampling events, which will be utilized by functions such as
+*perf_adjust_period*, usually something like the follows:
+
+if (!is_sampling_event(event)) {
+        hwc->sample_period = x86_pmu.max_period;
+        hwc->last_period = hwc->sample_period;
+        local64_set(&hwc->period_left, hwc->sample_period);
+}
+
+In the case of *riscv_base_pmu*, only (3) is provided for now.
+
+
+3. Interrupt
+------------
+
+3.1. Interrupt Initialization
+
+This often occurs at the beginning of the *event_init* method. In common
+practice, this should be a code segment like
+
+int x86_reserve_hardware(void)
+{
+        int err = 0;
+
+        if (!atomic_inc_not_zero(&pmc_refcount)) {
+                mutex_lock(&pmc_reserve_mutex);
+                if (atomic_read(&pmc_refcount) == 0) {
+                        if (!reserve_pmc_hardware())
+                                err = -EBUSY;
+                        else
+                                reserve_ds_buffers();
+                }
+                if (!err)
+                        atomic_inc(&pmc_refcount);
+                mutex_unlock(&pmc_reserve_mutex);
+        }
+
+        return err;
+}
+
+And the magic is in *reserve_pmc_hardware*, which usually does atomic
+operations to make implemented IRQ accessible from some global function pointer.
+*release_pmc_hardware* serves the opposite purpose, and it is used in event
+destructors mentioned in previous section.
+
+(Note: From the implementations in all the architectures, the *reserve/release*
+pair are always IRQ settings, so the *pmc_hardware* seems somehow misleading.
+It does NOT deal with the binding between an event and a physical counter,
+which will be introduced in the next section.)
+
+3.2. IRQ Structure
+
+Basically, a IRQ runs the following pseudo code:
+
+for each hardware counter that triggered this overflow
+
+    get the event of this counter
+
+    // following two steps are defined as *read()*,
+    // check the section Reading/Writing Counters for details.
+    count the delta value since previous interrupt
+    update the event->count (# event occurs) by adding delta, and
+               event->hw.period_left by subtracting delta
+
+    if the event overflows
+        sample data
+        set the counter appropriately for the next overflow
+
+        if the event overflows again
+            too frequently, throttle this event
+        fi
+    fi
+
+end for
+
+However as of this writing, none of the RISC-V implementations have designed an
+interrupt for perf, so the details are to be completed in the future.
+
+4. Reading/Writing Counters
+---------------------------
+
+They seem symmetric but perf treats them quite differently.  For reading, there
+is a *read* interface in *struct pmu*, but it serves more than just reading.
+According to the context, the *read* function not only reads the content of the
+counter (event->count), but also updates the left period to the next interrupt
+(event->hw.period_left).
+
+But the core of perf does not need direct write to counters.  Writing counters
+is hidden behind the abstraction of 1) *pmu->start*, literally start counting so one
+has to set the counter to a good value for the next interrupt; 2) inside the IRQ
+it should set the counter to the same resonable value.
+
+Reading is not a problem in RISC-V but writing would need some effort, since
+counters are not allowed to be written by S-mode.
+
+
+5. add()/del()/start()/stop()
+-----------------------------
+
+Basic idea: add()/del() adds/deletes events to/from a PMU, and start()/stop()
+starts/stop the counter of some event in the PMU.  All of them take the same
+arguments: *struct perf_event *event* and *int flag*.
+
+Consider perf as a state machine, then you will find that these functions serve
+as the state transition process between those states.
+Three states (event->hw.state) are defined:
+
+* PERF_HES_STOPPED:	the counter is stopped
+* PERF_HES_UPTODATE:	the event->count is up-to-date
+* PERF_HES_ARCH:	arch-dependent usage ... we don't need this for now
+
+A normal flow of these state transitions are as follows:
+
+* A user launches a perf event, resulting in calling to *event_init*.
+* When being context-switched in, *add* is called by the perf core, with a flag
+  PERF_EF_START, which means that the event should be started after it is added.
+  At this stage, a general event is bound to a physical counter, if any.
+  The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, because it is now
+  stopped, and the (software) event count does not need updating.
+** *start* is then called, and the counter is enabled.
+   With flag PERF_EF_RELOAD, it writes an appropriate value to the counter (check
+   previous section for detail).
+   Nothing is written if the flag does not contain PERF_EF_RELOAD.
+   The state now is reset to none, because it is neither stopped nor updated
+   (the counting already started)
+* When being context-switched out, *del* is called.  It then checks out all the
+  events in the PMU and calls *stop* to update their counts.
+** *stop* is called by *del*
+   and the perf core with flag PERF_EF_UPDATE, and it often shares the same
+   subroutine as *read* with the same logic.
+   The state changes to PERF_HES_STOPPED and PERF_HES_UPTODATE, again.
+
+** Life cycle of these two pairs: *add* and *del* are called repeatedly as
+  tasks switch in-and-out; *start* and *stop* is also called when the perf core
+  needs a quick stop-and-start, for instance, when the interrupt period is being
+  adjusted.
+
+Current implementation is sufficient for now and can be easily extended to
+features in the future.
+
+A. Related Structures
+---------------------
+
+* struct pmu: include/linux/perf_event.h
+* struct riscv_pmu: arch/riscv/include/asm/perf_event.h
+
+  Both structures are designed to be read-only.
+
+  *struct pmu* defines some function pointer interfaces, and most of them take
+*struct perf_event* as a main argument, dealing with perf events according to
+perf's internal state machine (check kernel/events/core.c for details).
+
+  *struct riscv_pmu* defines PMU-specific parameters.  The naming follows the
+convention of all other architectures.
+
+* struct perf_event: include/linux/perf_event.h
+* struct hw_perf_event
+
+  The generic structure that represents perf events, and the hardware-related
+details.
+
+* struct riscv_hw_events: arch/riscv/include/asm/perf_event.h
+
+  The structure that holds the status of events, has two fixed members:
+the number of events and the array of the events.
+
+References
+----------
+
+[1] https://github.com/riscv/riscv-linux/pull/124
+[2] https://groups.google.com/a/groups.riscv.org/forum/#!topic/sw-dev/f19TmCNP6yA
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox