Netdev List
 help / color / mirror / Atom feed
* [PATCH net] tg3: prevent scheduling while atomic splat
From: Jonathan Toppins @ 2018-03-14 16:36 UTC (permalink / raw)
  To: netdev
  Cc: Andy Gospodarek, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, linux-kernel

The problem was introduced in commit
506b0a395f26 ("[netdrv] tg3: APE heartbeat changes"). The bug occurs
because tp->lock spinlock is held which is obtained in tg3_start
by way of tg3_full_lock(), line 11571. The documentation for usleep_range()
specifically states it cannot be used inside a spinlock.

Fixes: 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes")
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
---

Notes:
    The thing I need reviewed from Broadcom is if the udelay should be 20
    instead of 10, due to any timing changes introduced by the offending
    patch.

 drivers/net/ethernet/broadcom/tg3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index c1841db1b500..f2593978ae75 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -820,7 +820,7 @@ static int tg3_ape_event_lock(struct tg3 *tp, u32 timeout_us)
 
 		tg3_ape_unlock(tp, TG3_APE_LOCK_MEM);
 
-		usleep_range(10, 20);
+		udelay(10);
 		timeout_us -= (timeout_us > 10) ? 10 : timeout_us;
 	}
 
-- 
2.13.6

^ permalink raw reply related

* Re: [PATCH bpf-next v5 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Song Liu @ 2018-03-14 16:36 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev@vger.kernel.org, ast@kernel.org, Peter Zijlstra,
	Kernel Team, hannes@cmpxchg.org, Teng Qin
In-Reply-To: <3728d93e-67e3-8018-1025-7bab444c1f9e@iogearbox.net>



> On Mar 14, 2018, at 9:07 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
> Just a minor question below, the rest seems fine to me as far as I
> can tell.
> 
> On 03/13/2018 10:47 PM, Song Liu wrote:
> [...]
>> +enum bpf_stack_build_id_status {
>> +	/* user space need an empty entry to identify end of a trace */
>> +	BPF_STACK_BUILD_ID_EMPTY = 0,
>> +	/* with valid build_id and offset */
>> +	BPF_STACK_BUILD_ID_VALID = 1,
>> +	/* couldn't get build_id, fallback to ip */
>> +	BPF_STACK_BUILD_ID_IP = 2,
>> +};
>> +
>> +#define BPF_BUILD_ID_SIZE 20
>> +struct bpf_stack_build_id {
>> +	__s32		status;
>> +	unsigned char	build_id[BPF_BUILD_ID_SIZE];
>> +	union {
>> +		__u64	offset;
>> +		__u64	ip;
>> +	};
>> +};
> [...]>  BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>> 	   u64, flags)
>> {
>> 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
>> 	struct perf_callchain_entry *trace;
>> 	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
>> -	u32 max_depth = map->value_size / 8;
>> +	u32 max_depth = map->value_size / stack_map_data_size(map);
>> 	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
>> 	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
>> 	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>> @@ -128,11 +318,16 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>> 	bool user = flags & BPF_F_USER_STACK;
>> 	bool kernel = !user;
>> 	u64 *ips;
>> +	bool hash_matches;
>> 
>> 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
>> 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
>> 		return -EINVAL;
>> 
>> +	/* build_id+offset stack map only supports user stack */
>> +	if (stack_map_use_build_id(map) && !user)
>> +		return -EINVAL;
> 
> Instead of bailing out here, wouldn't it make sense to just reuse the
> BPF_STACK_BUILD_ID_IP status and use this 'fallback' for kernel similar
> to what we do anyway in stack_map_get_build_id_offset() when we cannot
> get the build id so that map can be used for both cases?

This a great idea! Let me implement it. 

Thanks,
Song

^ permalink raw reply

* Re: [PATCH net-next 2/2] skbuff: Fix not waking applications when errors are enqueued
From: Eric Dumazet @ 2018-03-14 16:40 UTC (permalink / raw)
  To: Vinicius Costa Gomes, netdev; +Cc: randy.e.witt, davem
In-Reply-To: <20180313203519.8638-3-vinicius.gomes@intel.com>



On 03/13/2018 01:35 PM, Vinicius Costa Gomes wrote:
> When errors are enqueued to the error queue via sock_queue_err_skb()
> function, it is possible that the waiting application is not notified.
> 
> Calling 'sk->sk_data_ready()' would not notify applications that
> selected only POLLERR events in poll() (for example).
> 
> Reported-by: Randy E. Witt <randy.e.witt@intel.com>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> ---
>   net/core/skbuff.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 715c13495ba6..6def3534f509 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4181,7 +4181,7 @@ int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
>   
>   	skb_queue_tail(&sk->sk_error_queue, skb);
>   	if (!sock_flag(sk, SOCK_DEAD))
> -		sk->sk_data_ready(sk);
> +		sk->sk_error_report(sk);
>   	return 0;
>   }
>   EXPORT_SYMBOL(sock_queue_err_skb);
> 

Reviewed-by: Eric Dumazet <edumazet@google.com>

Given this bug has been there forever, there is probably no rush to get 
the fix in 4.16, but this looks safe enough IMO.

^ permalink raw reply

* [PATCH net-next] sock: remove zerocopy sockopt restriction on closed tcp state
From: Willem de Bruijn @ 2018-03-14 16:49 UTC (permalink / raw)
  To: netdev; +Cc: soheil, kou, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Socket option SO_ZEROCOPY determines whether the kernel ignores or
processes flag MSG_ZEROCOPY on subsequent send calls. This to avoid
changing behavior for legacy processes.

Limiting the state change to closed sockets is annoying with passive
sockets and not necessary for correctness. Once created, zerocopy skbs
are processed based on their private state, not this socket flag.

Remove the constraint.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 Documentation/networking/msg_zerocopy.rst | 5 -----
 net/core/sock.c                           | 2 --
 2 files changed, 7 deletions(-)

diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index 291a01264967..fe46d4867e2d 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -72,11 +72,6 @@ this flag, a process must first signal intent by setting a socket option:
 	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
 		error(1, errno, "setsockopt zerocopy");
 
-Setting the socket option only works when the socket is in its initial
-(TCP_CLOSED) state.  Trying to set the option for a socket returned by accept(),
-for example, will lead to an EBUSY error. In this case, the option should be set
-to the listening socket and it will be inherited by the accepted sockets.
-
 Transmission
 ------------
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 27f218bba43f..a8962d912895 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1052,8 +1052,6 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
 			if (sk->sk_protocol != IPPROTO_TCP)
 				ret = -ENOTSUPP;
-			else if (sk->sk_state != TCP_CLOSE)
-				ret = -EBUSY;
 		} else if (sk->sk_family != PF_RDS) {
 			ret = -ENOTSUPP;
 		}
-- 
2.16.2.804.g6dcf76e118-goog

^ permalink raw reply related

* Re: [PATCH] can: enable multi-queue for SocketCAN devices
From: Jonas Mark (BT-FIR/ENG1) @ 2018-03-14 16:51 UTC (permalink / raw)
  To: Marc Kleine-Budde, Wolfgang Grandegger
  Cc: linux-can@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, hs@denx.de,
	ZHU Yi (BT-FIR/ENG1-Zhu), Jonas Mark (BT-FIR/ENG1)

Hello Marc,

> >> Do you have a driver or a patch to make a driver mq aware?
> >
> > Yes, we have CAN hardware with multiple queues and we also have a
> > SocketCAN driver for it.
> >
> > IMHO the driver will be of very little use for the Linux community
> > because the HW is proprietary.
> 
> That doesn't matter. It would be the first driver that makes use of the
> feature, so we can learn from it. And you might get a free review of
> your driver.

Of course we would be happy if somebody was volunteering to review
our driver for free.

The driver consists of two layers because the HW is accessed via SPI.
So there is a SPI driver of 1000 lines. On top of that is the CAN
driver and it has 600 lines. Because the HW does more than just
CAN there is another driver which implements a char device with
additional 300 lines.

Are there still volunteers?

> > Our motivation to separate this patch from the proprietary SocketCAN
> > driver
> 
> Looking at your email address I assume your employer sells devices with
> this hardware and the driver. So someone has to provide the sources for
> it anyways to fulfil the GPL license requirements. :)

I will have to check but I am pretty confident that we would be willing
to publish the driver on these mailing lists as well. We will anyhow
ship the source code (or a written offer) with every distribution.

If you are really serious about the review I will get the process
rolling for an early publication of our driver.

Greetings,
Mark

Bosch Building Technologies, Panel Software Fire (ST-FIR/ENG1) 
Bosch Sicherheitssysteme GmbH | Postfach 11 11 | 85626 Grasbrunn | GERMANY | www.boschsecurity.com 

Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart HRB 23118
Aufsichtsratsvorsitzender: Stefan Hartung; Geschäftsführung: Gert van Iperen, Andreas Bartz, Thomas Quante, Bernhard Schuster

^ permalink raw reply

* Re: [PATCH net-next] sock: remove zerocopy sockopt restriction on closed tcp state
From: David Miller @ 2018-03-14 16:52 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, soheil, kou, willemb
In-Reply-To: <20180314164919.179364-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 14 Mar 2018 12:49:19 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> Socket option SO_ZEROCOPY determines whether the kernel ignores or
> processes flag MSG_ZEROCOPY on subsequent send calls. This to avoid
> changing behavior for legacy processes.
> 
> Limiting the state change to closed sockets is annoying with passive
> sockets and not necessary for correctness. Once created, zerocopy skbs
> are processed based on their private state, not this socket flag.
> 
> Remove the constraint.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied, thanks Willem.

^ permalink raw reply

* Re: [PATCH net] net: use skb_to_full_sk() in skb_update_prio()
From: David Miller @ 2018-03-14 16:55 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, avagin, eric.dumazet
In-Reply-To: <20180314160416.232585-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 14 Mar 2018 09:04:16 -0700

> Andrei Vagin reported a KASAN: slab-out-of-bounds error in
> skb_update_prio()
> 
> Since SYNACK might be attached to a request socket, we need to
> get back to the listener socket.
> Since this listener is manipulated without locks, add const
> qualifiers to sock_cgroup_prioidx() so that the const can also
> be used in skb_update_prio()
> 
> Also add the const qualifier to sock_cgroup_classid() for consistency.
> 
> Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Andrei Vagin <avagin@virtuozzo.com>

Applied and queued up for -stable, thanks!

^ permalink raw reply

* [PATCH] vhost: fix vhost ioctl signature to build with clang
From: Sonny Rao @ 2018-03-14 17:05 UTC (permalink / raw)
  To: kvm; +Cc: Michael S . Tsirkin, Jason Wang, virtualization, netdev,
	linux-kernel

Clang is particularly anal about signed vs unsigned comparisons and
doesn't like the fact that some ioctl numbers set the MSB, so we get
this error when trying to build vhost on aarch64:

drivers/vhost/vhost.c:1400:7: error: overflow converting case value to
 switch condition type (3221794578 to 18446744072636378898)
 [-Werror, -Wswitch]
        case VHOST_GET_VRING_BASE:

3221794578 is 0xC008AF12 in hex
18446744072636378898 is 0xFFFFFFFFC008AF12 in hex

Fix this by using unsigned ints in the function signature for
vhost_vring_ioctl().

Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
---
 drivers/vhost/vhost.c | 2 +-
 drivers/vhost/vhost.h | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1b3e8d2d5c8b4..5316319d84081 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1337,7 +1337,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 	return -EFAULT;
 }
 
-long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp)
+long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp)
 {
 	struct file *eventfp, *filep = NULL;
 	bool pollstart = false, pollstop = false;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ac4b6056f19ae..d8ee85ae8fdcc 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -45,7 +45,7 @@ void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
 void vhost_work_flush(struct vhost_dev *dev, struct vhost_work *work);
-long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
+long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 
 struct vhost_log {
 	u64 addr;
@@ -177,7 +177,7 @@ void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
 void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
-long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
+long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
 
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH net-next 0/2] sfc: support FEC configuration
From: David Miller @ 2018-03-14 17:12 UTC (permalink / raw)
  To: ecree; +Cc: linux-net-drivers, netdev
In-Reply-To: <41f508db-b805-12d0-f50a-e45440470402@solarflare.com>

From: Edward Cree <ecree@solarflare.com>
Date: Wed, 14 Mar 2018 14:19:11 +0000

> Implements the ethtool get & set fecparam operations.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
From: David Ahern @ 2018-03-14 17:13 UTC (permalink / raw)
  To: Alexei Starovoitov, davem; +Cc: daniel, netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

On 3/13/18 8:39 PM, Alexei Starovoitov wrote:
> For our container management we've been using complicated and fragile setup
> consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> all containerized applications.
> The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable.

Why does VRF and the cgroup option to bind sockets to the VRF not solve
this problem for you? The VRF limits the source address choices.

^ permalink raw reply

* Re: [PATCH 0/5] DPAA Ethernet fixes
From: David Miller @ 2018-03-14 17:17 UTC (permalink / raw)
  To: madalin.bucur
  Cc: netdev, linux-kernel, camelia.groza, leoyang.li, linuxppc-dev,
	linux-arm-kernel
In-Reply-To: <20180314133732.24068-1-madalin.bucur@nxp.com>

From: Madalin Bucur <madalin.bucur@nxp.com>
Date: Wed, 14 Mar 2018 08:37:27 -0500

> This patch set is addressing several issues in the DPAA Ethernet
> driver suite:
> 
>  - module unload crash caused by wrong reference to device being left
>    in the cleanup code after the DSA related changes
>  - scheduling wile atomic bug in QMan code revealed during dpaa_eth
>    module unload
>  - a couple of error counter fixes, a duplicated init in dpaa_eth.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] tuntap: XDP_TX can use native XDP
From: David Miller @ 2018-03-14 17:21 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, linux-kernel, mst
In-Reply-To: <1520997820-8289-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Wed, 14 Mar 2018 11:23:40 +0800

> Now we have ndo_xdp_xmit, switch to use it instead of the slow generic
> XDP TX routine. XDP_TX on TAP gets ~20% improvements from ~1.5Mpps to
> ~1.8Mpps on 2.60GHz Core(TM) i7-5600U.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Applied, thanks Jason.

^ permalink raw reply

* Re: [PATCH RFC bpf-next 0/6] bpf: introduce cgroup-bpf bind, connect, post-bind hooks
From: Mahesh Bandewar (महेश बंडेवार) @ 2018-03-14 17:22 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Miller, daniel, linux-netdev, kernel-team
In-Reply-To: <20180314033934.3502167-1-ast@kernel.org>

On Tue, Mar 13, 2018 at 8:39 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> For our container management we've been using complicated and fragile setup
> consisting of LD_PRELOAD wrapper intercepting bind and connect calls from
> all containerized applications.
> The setup involves per-container IPs, policy, etc, so traditional
> network-only solutions that involve VRFs, netns, acls are not applicable.
You can keep the policies per cgroup but move the ip from cgroup to
net-ns and then none of these ebpf hacks are required since cgroup and
namespaces are orthogonal you can use cgroups in conjunction with
namespaces.

> Changing apps is not possible and LD_PRELOAD doesn't work
> for apps that don't use glibc like java and golang.
> BPF+cgroup looks to be the best solution for this problem.
> Hence we introduce 3 hooks:
> - at entry into sys_bind and sys_connect
>   to let bpf prog look and modify 'struct sockaddr' provided
>   by user space and fail bind/connect when appropriate
> - post sys_bind after port is allocated
>
> The approach works great and has zero overhead for anyone who doesn't
> use it and very low overhead when deployed.
>
> The main question for Daniel and Dave is what approach to take
> with prog types...
>
> In this patch set we introduce 6 new program types to make user
> experience easier:
>   BPF_PROG_TYPE_CGROUP_INET4_BIND,
>   BPF_PROG_TYPE_CGROUP_INET6_BIND,
>   BPF_PROG_TYPE_CGROUP_INET4_CONNECT,
>   BPF_PROG_TYPE_CGROUP_INET6_CONNECT,
>   BPF_PROG_TYPE_CGROUP_INET4_POST_BIND,
>   BPF_PROG_TYPE_CGROUP_INET6_POST_BIND,
>
> since v4 programs should not be using 'struct bpf_sock_addr'->user_ip6 fields
> and different prog type for v4 and v6 helps verifier reject such access
> at load time.
> Similarly bind vs connect are two different prog types too,
> since only sys_connect programs can call new bpf_bind() helper.
>
> This approach is very different from tcp-bpf where single
> 'struct bpf_sock_ops' and single prog type is used for different hooks.
> The field checks are done at run-time instead of load time.
>
> I think the approach taken by this patch set is justified,
> but we may do better if we extend BPF_PROG_ATTACH cmd
> with log_buf + log_size, then we should be able to combine
> bind+connect+v4+v6 into single program type.
> The idea that at load time the verifier will remember a bitmask
> of fields in bpf_sock_addr used by the program and helpers
> that program used, then at attach time we can check that
> hook is compatible with features used by the program and
> report human readable error message back via log_buf.
> We cannot do this right now with just EINVAL, since combinations
> of errors like 'using user_ip6 field but attaching to v4 hook'
> are too high to express as errno.
> This would be bigger change. If you folks think it's worth it
> we can go with this approach or if you think 6 new prog types
> is not too bad, we can leave the patch as-is.
> Comments?
> Other comments on patches are welcome.
>
> Andrey Ignatov (6):
>   bpf: Hooks for sys_bind
>   selftests/bpf: Selftest for sys_bind hooks
>   net: Introduce __inet_bind() and __inet6_bind
>   bpf: Hooks for sys_connect
>   selftests/bpf: Selftest for sys_connect hooks
>   bpf: Post-hooks for sys_bind
>
>  include/linux/bpf-cgroup.h                    |  68 +++-
>  include/linux/bpf_types.h                     |   6 +
>  include/linux/filter.h                        |  10 +
>  include/net/inet_common.h                     |   2 +
>  include/net/ipv6.h                            |   2 +
>  include/net/sock.h                            |   3 +
>  include/net/udp.h                             |   1 +
>  include/uapi/linux/bpf.h                      |  52 ++-
>  kernel/bpf/cgroup.c                           |  36 ++
>  kernel/bpf/syscall.c                          |  42 ++
>  kernel/bpf/verifier.c                         |   6 +
>  net/core/filter.c                             | 479 ++++++++++++++++++++++-
>  net/ipv4/af_inet.c                            |  60 ++-
>  net/ipv4/tcp_ipv4.c                           |  16 +
>  net/ipv4/udp.c                                |  14 +
>  net/ipv6/af_inet6.c                           |  47 ++-
>  net/ipv6/tcp_ipv6.c                           |  16 +
>  net/ipv6/udp.c                                |  20 +
>  tools/include/uapi/linux/bpf.h                |  39 +-
>  tools/testing/selftests/bpf/Makefile          |   8 +-
>  tools/testing/selftests/bpf/bpf_helpers.h     |   2 +
>  tools/testing/selftests/bpf/connect4_prog.c   |  45 +++
>  tools/testing/selftests/bpf/connect6_prog.c   |  61 +++
>  tools/testing/selftests/bpf/test_sock_addr.c  | 541 ++++++++++++++++++++++++++
>  tools/testing/selftests/bpf/test_sock_addr.sh |  57 +++
>  25 files changed, 1580 insertions(+), 53 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/connect4_prog.c
>  create mode 100644 tools/testing/selftests/bpf/connect6_prog.c
>  create mode 100644 tools/testing/selftests/bpf/test_sock_addr.c
>  create mode 100755 tools/testing/selftests/bpf/test_sock_addr.sh
>
> --
> 2.9.5
>

^ permalink raw reply

* Re: [PATCH net] tg3: prevent scheduling while atomic splat
From: Michael Chan @ 2018-03-14 17:22 UTC (permalink / raw)
  To: Jonathan Toppins
  Cc: Netdev, Andy Gospodarek, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, open list
In-Reply-To: <f439051c3964c6bb5342e25fb5327e0cdeba001e.1521045132.git.jtoppins@redhat.com>

On Wed, Mar 14, 2018 at 9:36 AM, Jonathan Toppins <jtoppins@redhat.com> wrote:
> The problem was introduced in commit
> 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes"). The bug occurs
> because tp->lock spinlock is held which is obtained in tg3_start
> by way of tg3_full_lock(), line 11571. The documentation for usleep_range()
> specifically states it cannot be used inside a spinlock.
>
> Fixes: 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes")
> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
> ---
>
> Notes:
>     The thing I need reviewed from Broadcom is if the udelay should be 20
>     instead of 10, due to any timing changes introduced by the offending
>     patch.

Thanks.  10 us is correct.

As a future improvement, we might want to see if we can release the
spinlock and go back to usleep_range().  The wait time is potentially
up to 20 msec which is quite long.

Acked-by: Michael Chan <michael.chan@broadcom.com>

^ permalink raw reply

* [PATCH bpf-next v6 2/2] bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID
From: Song Liu @ 2018-03-14 17:23 UTC (permalink / raw)
  To: netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng, Song Liu
In-Reply-To: <20180314172323.3957483-1-songliubraving@fb.com>

test_stacktrace_build_id() is added. It accesses tracepoint urandom_read
with "dd" and "urandom_read" and gathers stack traces. Then it reads the
stack traces from the stackmap.

urandom_read is a statically link binary that reads from /dev/urandom.
test_stacktrace_build_id() calls readelf to read build ID of urandom_read
and compares it with build ID from the stackmap.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 tools/include/uapi/linux/bpf.h                     |  22 +++
 tools/testing/selftests/bpf/Makefile               |  12 +-
 tools/testing/selftests/bpf/test_progs.c           | 164 ++++++++++++++++++++-
 .../selftests/bpf/test_stacktrace_build_id.c       |  60 ++++++++
 tools/testing/selftests/bpf/urandom_read.c         |  22 +++
 5 files changed, 278 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_stacktrace_build_id.c
 create mode 100644 tools/testing/selftests/bpf/urandom_read.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db6bdc3..1944d0a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -231,6 +231,28 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY		(1U << 3)
 #define BPF_F_WRONLY		(1U << 4)
 
+/* Flag for stack_map, store build_id+offset instead of pointer */
+#define BPF_F_STACK_BUILD_ID	(1U << 5)
+
+enum bpf_stack_build_id_status {
+	/* user space need an empty entry to identify end of a trace */
+	BPF_STACK_BUILD_ID_EMPTY = 0,
+	/* with valid build_id and offset */
+	BPF_STACK_BUILD_ID_VALID = 1,
+	/* couldn't get build_id, fallback to ip */
+	BPF_STACK_BUILD_ID_IP = 2,
+};
+
+#define BPF_BUILD_ID_SIZE 20
+struct bpf_stack_build_id {
+	__s32		status;
+	unsigned char	build_id[BPF_BUILD_ID_SIZE];
+	union {
+		__u64	offset;
+		__u64	ip;
+	};
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32	map_type;	/* one of enum bpf_map_type */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8567a858..b0d29fd 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -13,6 +13,14 @@ endif
 CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include
 LDLIBS += -lcap -lelf -lrt -lpthread
 
+TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
+all: $(TEST_CUSTOM_PROGS)
+
+$(TEST_CUSTOM_PROGS): urandom_read
+
+urandom_read: urandom_read.c
+	$(CC) -o $(TEST_CUSTOM_PROGS) -static $<
+
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user
@@ -21,7 +29,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
 	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
-	sample_map_ret0.o test_tcpbpf_kern.o
+	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -74,3 +82,5 @@ $(OUTPUT)/%.o: %.c
 	$(CLANG) $(CLANG_FLAGS) \
 		 -O2 -target bpf -emit-llvm -c $< -o - |      \
 	$(LLC) -march=bpf -mcpu=$(CPU) -filetype=obj -o $@
+
+EXTRA_CLEAN := $(TEST_CUSTOM_PROGS)
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 27ad540..e9df48b 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -841,7 +841,8 @@ static void test_tp_attach_query(void)
 static int compare_map_keys(int map1_fd, int map2_fd)
 {
 	__u32 key, next_key;
-	char val_buf[PERF_MAX_STACK_DEPTH * sizeof(__u64)];
+	char val_buf[PERF_MAX_STACK_DEPTH *
+		     sizeof(struct bpf_stack_build_id)];
 	int err;
 
 	err = bpf_map_get_next_key(map1_fd, NULL, &key);
@@ -964,6 +965,166 @@ static void test_stacktrace_map()
 	return;
 }
 
+static int extract_build_id(char *build_id, size_t size)
+{
+	FILE *fp;
+	char *line = NULL;
+	size_t len = 0;
+
+	fp = popen("readelf -n ./urandom_read | grep 'Build ID'", "r");
+	if (fp == NULL)
+		return -1;
+
+	if (getline(&line, &len, fp) == -1)
+		goto err;
+	fclose(fp);
+
+	if (len > size)
+		len = size;
+	memcpy(build_id, line, len);
+	build_id[len] = '\0';
+	return 0;
+err:
+	fclose(fp);
+	return -1;
+}
+
+static void test_stacktrace_build_id(void)
+{
+	int control_map_fd, stackid_hmap_fd, stackmap_fd;
+	const char *file = "./test_stacktrace_build_id.o";
+	int bytes, efd, err, pmu_fd, prog_fd;
+	struct perf_event_attr attr = {};
+	__u32 key, previous_key, val, duration = 0;
+	struct bpf_object *obj;
+	char buf[256];
+	int i, j;
+	struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH];
+	int build_id_matches = 0;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd);
+	if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
+		goto out;
+
+	/* Get the ID for the sched/sched_switch tracepoint */
+	snprintf(buf, sizeof(buf),
+		 "/sys/kernel/debug/tracing/events/random/urandom_read/id");
+	efd = open(buf, O_RDONLY, 0);
+	if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+		goto close_prog;
+
+	bytes = read(efd, buf, sizeof(buf));
+	close(efd);
+	if (CHECK(bytes <= 0 || bytes >= sizeof(buf),
+		  "read", "bytes %d errno %d\n", bytes, errno))
+		goto close_prog;
+
+	/* Open the perf event and attach bpf progrram */
+	attr.config = strtol(buf, NULL, 0);
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
+			 0 /* cpu 0 */, -1 /* group id */,
+			 0 /* flags */);
+	if (CHECK(pmu_fd < 0, "perf_event_open", "err %d errno %d\n",
+		  pmu_fd, errno))
+		goto close_prog;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	if (CHECK(err, "perf_event_ioc_enable", "err %d errno %d\n",
+		  err, errno))
+		goto close_pmu;
+
+	err = ioctl(pmu_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+	if (CHECK(err, "perf_event_ioc_set_bpf", "err %d errno %d\n",
+		  err, errno))
+		goto disable_pmu;
+
+	/* find map fds */
+	control_map_fd = bpf_find_map(__func__, obj, "control_map");
+	if (CHECK(control_map_fd < 0, "bpf_find_map control_map",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	stackid_hmap_fd = bpf_find_map(__func__, obj, "stackid_hmap");
+	if (CHECK(stackid_hmap_fd < 0, "bpf_find_map stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	stackmap_fd = bpf_find_map(__func__, obj, "stackmap");
+	if (CHECK(stackmap_fd < 0, "bpf_find_map stackmap", "err %d errno %d\n",
+		  err, errno))
+		goto disable_pmu;
+
+	assert(system("dd if=/dev/urandom of=/dev/zero count=4 2> /dev/null")
+	       == 0);
+	assert(system("./urandom_read if=/dev/urandom of=/dev/zero count=4 2> /dev/null") == 0);
+	/* disable stack trace collection */
+	key = 0;
+	val = 1;
+	bpf_map_update_elem(control_map_fd, &key, &val, 0);
+
+	/* for every element in stackid_hmap, we can find a corresponding one
+	 * in stackmap, and vise versa.
+	 */
+	err = compare_map_keys(stackid_hmap_fd, stackmap_fd);
+	if (CHECK(err, "compare_map_keys stackid_hmap vs. stackmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = compare_map_keys(stackmap_fd, stackid_hmap_fd);
+	if (CHECK(err, "compare_map_keys stackmap vs. stackid_hmap",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = extract_build_id(buf, 256);
+
+	if (CHECK(err, "get build_id with readelf",
+		  "err %d errno %d\n", err, errno))
+		goto disable_pmu;
+
+	err = bpf_map_get_next_key(stackmap_fd, NULL, &key);
+	if (CHECK(err, "get_next_key from stackmap",
+		  "err %d, errno %d\n", err, errno))
+		goto disable_pmu;
+
+	do {
+		char build_id[64];
+
+		err = bpf_map_lookup_elem(stackmap_fd, &key, id_offs);
+		if (CHECK(err, "lookup_elem from stackmap",
+			  "err %d, errno %d\n", err, errno))
+			goto disable_pmu;
+		for (i = 0; i < PERF_MAX_STACK_DEPTH; ++i)
+			if (id_offs[i].status == BPF_STACK_BUILD_ID_VALID &&
+			    id_offs[i].offset != 0) {
+				for (j = 0; j < 20; ++j)
+					sprintf(build_id + 2 * j, "%02x",
+						id_offs[i].build_id[j] & 0xff);
+				if (strstr(buf, build_id) != NULL)
+					build_id_matches = 1;
+			}
+		previous_key = key;
+	} while (bpf_map_get_next_key(stackmap_fd, &previous_key, &key) == 0);
+
+	CHECK(build_id_matches < 1, "build id match",
+	      "Didn't find expected build ID from the map");
+
+disable_pmu:
+	ioctl(pmu_fd, PERF_EVENT_IOC_DISABLE);
+
+close_pmu:
+	close(pmu_fd);
+
+close_prog:
+	bpf_object__close(obj);
+
+out:
+	return;
+}
+
 int main(void)
 {
 	test_pkt_access();
@@ -976,6 +1137,7 @@ int main(void)
 	test_obj_name();
 	test_tp_attach_query();
 	test_stacktrace_map();
+	test_stacktrace_build_id();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_stacktrace_build_id.c b/tools/testing/selftests/bpf/test_stacktrace_build_id.c
new file mode 100644
index 0000000..b755bd7
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_stacktrace_build_id.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2018 Facebook
+
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+#ifndef PERF_MAX_STACK_DEPTH
+#define PERF_MAX_STACK_DEPTH         127
+#endif
+
+struct bpf_map_def SEC("maps") control_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") stackid_hmap = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 10000,
+};
+
+struct bpf_map_def SEC("maps") stackmap = {
+	.type = BPF_MAP_TYPE_STACK_TRACE,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_stack_build_id)
+		* PERF_MAX_STACK_DEPTH,
+	.max_entries = 128,
+	.map_flags = BPF_F_STACK_BUILD_ID,
+};
+
+/* taken from /sys/kernel/debug/tracing/events/random/urandom_read/format */
+struct random_urandom_args {
+	unsigned long long pad;
+	int got_bits;
+	int pool_left;
+	int input_left;
+};
+
+SEC("tracepoint/random/urandom_read")
+int oncpu(struct random_urandom_args *args)
+{
+	__u32 key = 0, val = 0, *value_p;
+
+	value_p = bpf_map_lookup_elem(&control_map, &key);
+	if (value_p && *value_p)
+		return 0; /* skip if non-zero *value_p */
+
+	/* The size of stackmap and stackid_hmap should be the same */
+	key = bpf_get_stackid(args, &stackmap, BPF_F_USER_STACK);
+	if ((int)key >= 0)
+		bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1; /* ignored by tracepoints, required by libbpf.a */
diff --git a/tools/testing/selftests/bpf/urandom_read.c b/tools/testing/selftests/bpf/urandom_read.c
new file mode 100644
index 0000000..4acfdeb
--- /dev/null
+++ b/tools/testing/selftests/bpf/urandom_read.c
@@ -0,0 +1,22 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdlib.h>
+
+#define BUF_SIZE 256
+int main(void)
+{
+	int fd = open("/dev/urandom", O_RDONLY);
+	int i;
+	char buf[BUF_SIZE];
+
+	if (fd < 0)
+		return 1;
+	for (i = 0; i < 4; ++i)
+		read(fd, buf, BUF_SIZE);
+
+	close(fd);
+	return 0;
+}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next v6 0/2] bpf stackmap with build_id+offset
From: Song Liu @ 2018-03-14 17:23 UTC (permalink / raw)
  To: netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng, Song Liu

Changes v4 -> v6:

1. When kernel stack is added to stackmap with build_id, use fallback
   mechanism to store ip (status == BPF_STACK_BUILD_ID_IP).

Changes v4 -> v5:

1. Only allow build_id lookup in non-nmi context. Added comment and
   commit message to highlight this limitation.
2. Minor fix reported by kbuild test robot.

Changes v3 -> v4:

1. Add fallback when build_id lookup failed. In this case, status is set
   to BPF_STACK_BUILD_ID_IP, and ip of this entry is saved.
2. Handle cases where vma is only part of the file (vma->vm_pgoff != 0).
   Thanks to Teng for helping me identify this issue!
3. Address feedbacks for previous versions.

Song Liu (2):
  bpf: extend stackmap to save binary_build_id+offset instead of address
  bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID

 include/uapi/linux/bpf.h                           |  22 ++
 kernel/bpf/stackmap.c                              | 257 +++++++++++++++++++--
 tools/include/uapi/linux/bpf.h                     |  22 ++
 tools/testing/selftests/bpf/Makefile               |  12 +-
 tools/testing/selftests/bpf/test_progs.c           | 164 ++++++++++++-
 .../selftests/bpf/test_stacktrace_build_id.c       |  60 +++++
 tools/testing/selftests/bpf/urandom_read.c         |  22 ++
 7 files changed, 535 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_stacktrace_build_id.c
 create mode 100644 tools/testing/selftests/bpf/urandom_read.c

--
2.9.5

^ permalink raw reply

* [PATCH bpf-next v6 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Song Liu @ 2018-03-14 17:23 UTC (permalink / raw)
  To: netdev, ast, peterz, daniel; +Cc: kernel-team, hannes, qinteng, Song Liu
In-Reply-To: <20180314172323.3957483-1-songliubraving@fb.com>

Currently, bpf stackmap store address for each entry in the call trace.
To map these addresses to user space files, it is necessary to maintain
the mapping from these virtual address to symbols in the binary. Usually,
the user space profiler (such as perf) has to scan /proc/pid/maps at the
beginning of profiling, and monitor mmap2() calls afterwards. Given the
cost of maintaining the address map, this solution is not practical for
system wide profiling that is always on.

This patch tries to solve this problem with a variation of stackmap. This
variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
addresses, the variation stores ELF file build_id + offset.

Build_id is
only meaningful for user stack. If a kernel stack is added to a stackmap
with BPF_F_STACK_BUILD_ID, it will automatically fallback to only store
ip (status == BPF_STACK_BUILD_ID_IP).

Build ID is a 20-byte unique identifier for ELF files. The following
command shows the Build ID of /bin/bash:

  [user@]$ readelf -n /bin/bash
  ...
    Build ID: XXXXXXXXXX
  ...

With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
for each entry in the call trace, and translate it into the following
struct:

  struct bpf_stack_build_id_offset {
          __s32           status;
          unsigned char   build_id[BPF_BUILD_ID_SIZE];
          union {
                  __u64   offset;
                  __u64   ip;
          };
  };

The search of build_id is limited to the first page of the file, and this
page should be in page cache. Otherwise, we fallback to store ip for this
entry (ip field in struct bpf_stack_build_id_offset). This requires the
build_id to be stored in the first page. A quick survey of binary and
dynamic library files in a few different systems shows that almost all
binary and dynamic library files have build_id in the first page.

Build_id is only meaningful for user stack. If a kernel stack is added to
a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to
only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id
lookup failed for some reason, it will also fallback to store ip.

User space can access struct bpf_stack_build_id_offset with bpf
syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
maintain mapping from build id to binary files. This mostly static
mapping is much easier to maintain than per process address maps.

Note: Stackmap with build_id only works in non-nmi context at this time.
This is because we need to take mm->mmap_sem for find_vma(). If this
changes, we would like to allow build_id lookup in nmi context.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/uapi/linux/bpf.h |  22 ++++
 kernel/bpf/stackmap.c    | 257 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 257 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a66769..1e15d17 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -231,6 +231,28 @@ enum bpf_attach_type {
 #define BPF_F_RDONLY		(1U << 3)
 #define BPF_F_WRONLY		(1U << 4)
 
+/* Flag for stack_map, store build_id+offset instead of pointer */
+#define BPF_F_STACK_BUILD_ID	(1U << 5)
+
+enum bpf_stack_build_id_status {
+	/* user space need an empty entry to identify end of a trace */
+	BPF_STACK_BUILD_ID_EMPTY = 0,
+	/* with valid build_id and offset */
+	BPF_STACK_BUILD_ID_VALID = 1,
+	/* couldn't get build_id, fallback to ip */
+	BPF_STACK_BUILD_ID_IP = 2,
+};
+
+#define BPF_BUILD_ID_SIZE 20
+struct bpf_stack_build_id {
+	__s32		status;
+	unsigned char	build_id[BPF_BUILD_ID_SIZE];
+	union {
+		__u64	offset;
+		__u64	ip;
+	};
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32	map_type;	/* one of enum bpf_map_type */
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b0ecf43..57eeb12 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -9,16 +9,19 @@
 #include <linux/filter.h>
 #include <linux/stacktrace.h>
 #include <linux/perf_event.h>
+#include <linux/elf.h>
+#include <linux/pagemap.h>
 #include "percpu_freelist.h"
 
-#define STACK_CREATE_FLAG_MASK \
-	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+#define STACK_CREATE_FLAG_MASK					\
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
+	 BPF_F_STACK_BUILD_ID)
 
 struct stack_map_bucket {
 	struct pcpu_freelist_node fnode;
 	u32 hash;
 	u32 nr;
-	u64 ip[];
+	u64 data[];
 };
 
 struct bpf_stack_map {
@@ -29,6 +32,17 @@ struct bpf_stack_map {
 	struct stack_map_bucket *buckets[];
 };
 
+static inline bool stack_map_use_build_id(struct bpf_map *map)
+{
+	return (map->map_flags & BPF_F_STACK_BUILD_ID);
+}
+
+static inline int stack_map_data_size(struct bpf_map *map)
+{
+	return stack_map_use_build_id(map) ?
+		sizeof(struct bpf_stack_build_id) : sizeof(u64);
+}
+
 static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
 {
 	u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
@@ -68,8 +82,16 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 
 	/* check sanity of attributes */
 	if (attr->max_entries == 0 || attr->key_size != 4 ||
-	    value_size < 8 || value_size % 8 ||
-	    value_size / 8 > sysctl_perf_event_max_stack)
+	    value_size < 8 || value_size % 8)
+		return ERR_PTR(-EINVAL);
+
+	BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
+	if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
+		if (value_size % sizeof(struct bpf_stack_build_id) ||
+		    value_size / sizeof(struct bpf_stack_build_id)
+		    > sysctl_perf_event_max_stack)
+			return ERR_PTR(-EINVAL);
+	} else if (value_size / 8 > sysctl_perf_event_max_stack)
 		return ERR_PTR(-EINVAL);
 
 	/* hash table size must be power of 2 */
@@ -114,13 +136,184 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
+#define BPF_BUILD_ID 3
+/*
+ * Parse build id from the note segment. This logic can be shared between
+ * 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are
+ * identical.
+ */
+static inline int stack_map_parse_build_id(void *page_addr,
+					   unsigned char *build_id,
+					   void *note_start,
+					   Elf32_Word note_size)
+{
+	Elf32_Word note_offs = 0, new_offs;
+
+	/* check for overflow */
+	if (note_start < page_addr || note_start + note_size < note_start)
+		return -EINVAL;
+
+	/* only supports note that fits in the first page */
+	if (note_start + note_size > page_addr + PAGE_SIZE)
+		return -EINVAL;
+
+	while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
+		Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
+
+		if (nhdr->n_type == BPF_BUILD_ID &&
+		    nhdr->n_namesz == sizeof("GNU") &&
+		    nhdr->n_descsz == BPF_BUILD_ID_SIZE) {
+			memcpy(build_id,
+			       note_start + note_offs +
+			       ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
+			       BPF_BUILD_ID_SIZE);
+			return 0;
+		}
+		new_offs = note_offs + sizeof(Elf32_Nhdr) +
+			ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
+		if (new_offs <= note_offs)  /* overflow */
+			break;
+		note_offs = new_offs;
+	}
+	return -EINVAL;
+}
+
+/* Parse build ID from 32-bit ELF */
+static int stack_map_get_build_id_32(void *page_addr,
+				     unsigned char *build_id)
+{
+	Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
+	Elf32_Phdr *phdr;
+	int i;
+
+	/* only supports phdr that fits in one page */
+	if (ehdr->e_phnum >
+	    (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
+		return -EINVAL;
+
+	phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
+
+	for (i = 0; i < ehdr->e_phnum; ++i)
+		if (phdr[i].p_type == PT_NOTE)
+			return stack_map_parse_build_id(page_addr, build_id,
+					page_addr + phdr[i].p_offset,
+					phdr[i].p_filesz);
+	return -EINVAL;
+}
+
+/* Parse build ID from 64-bit ELF */
+static int stack_map_get_build_id_64(void *page_addr,
+				     unsigned char *build_id)
+{
+	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
+	Elf64_Phdr *phdr;
+	int i;
+
+	/* only supports phdr that fits in one page */
+	if (ehdr->e_phnum >
+	    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
+		return -EINVAL;
+
+	phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
+
+	for (i = 0; i < ehdr->e_phnum; ++i)
+		if (phdr[i].p_type == PT_NOTE)
+			return stack_map_parse_build_id(page_addr, build_id,
+					page_addr + phdr[i].p_offset,
+					phdr[i].p_filesz);
+	return -EINVAL;
+}
+
+/* Parse build ID of ELF file mapped to vma */
+static int stack_map_get_build_id(struct vm_area_struct *vma,
+				  unsigned char *build_id)
+{
+	Elf32_Ehdr *ehdr;
+	struct page *page;
+	void *page_addr;
+	int ret;
+
+	/* only works for page backed storage  */
+	if (!vma->vm_file)
+		return -EINVAL;
+
+	page = find_get_page(vma->vm_file->f_mapping, 0);
+	if (!page)
+		return -EFAULT;	/* page not mapped */
+
+	ret = -EINVAL;
+	page_addr = page_address(page);
+	ehdr = (Elf32_Ehdr *)page_addr;
+
+	/* compare magic x7f "ELF" */
+	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
+		goto out;
+
+	/* only support executable file and shared object file */
+	if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
+		goto out;
+
+	if (ehdr->e_ident[EI_CLASS] == ELFCLASS32)
+		ret = stack_map_get_build_id_32(page_addr, build_id);
+	else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64)
+		ret = stack_map_get_build_id_64(page_addr, build_id);
+out:
+	put_page(page);
+	return ret;
+}
+
+static void stack_map_get_build_id_offset(struct bpf_map *map,
+					  struct stack_map_bucket *bucket,
+					  u64 *ips, u32 trace_nr, bool user)
+{
+	int i;
+	struct vm_area_struct *vma;
+	struct bpf_stack_build_id *id_offs;
+
+	bucket->nr = trace_nr;
+	id_offs = (struct bpf_stack_build_id *)bucket->data;
+
+	/*
+	 * We cannot do up_read() in nmi context, so build_id lookup is
+	 * only supported for non-nmi events. If at some point, it is
+	 * possible to run find_vma() without taking the semaphore, we
+	 * would like to allow build_id lookup in nmi context.
+	 *
+	 * Same fallback is used for kernel stack (!user) on a stackmap
+	 * with build_id.
+	 */
+	if (!user || !current || !current->mm || in_nmi() ||
+	    down_read_trylock(&current->mm->mmap_sem) == 0) {
+		/* cannot access current->mm, fall back to ips */
+		for (i = 0; i < trace_nr; i++) {
+			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+			id_offs[i].ip = ips[i];
+		}
+		return;
+	}
+
+	for (i = 0; i < trace_nr; i++) {
+		vma = find_vma(current->mm, ips[i]);
+		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
+			/* per entry fall back to ips */
+			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+			id_offs[i].ip = ips[i];
+			continue;
+		}
+		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
+			- vma->vm_start;
+		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
+	}
+	up_read(&current->mm->mmap_sem);
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
 	struct perf_callchain_entry *trace;
 	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
-	u32 max_depth = map->value_size / 8;
+	u32 max_depth = map->value_size / stack_map_data_size(map);
 	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
 	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
 	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
@@ -128,6 +321,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	bool user = flags & BPF_F_USER_STACK;
 	bool kernel = !user;
 	u64 *ips;
+	bool hash_matches;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
@@ -156,24 +350,43 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	id = hash & (smap->n_buckets - 1);
 	bucket = READ_ONCE(smap->buckets[id]);
 
-	if (bucket && bucket->hash == hash) {
-		if (flags & BPF_F_FAST_STACK_CMP)
+	hash_matches = bucket && bucket->hash == hash;
+	/* fast cmp */
+	if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
+		return id;
+
+	if (stack_map_use_build_id(map)) {
+		/* for build_id+offset, pop a bucket before slow cmp */
+		new_bucket = (struct stack_map_bucket *)
+			pcpu_freelist_pop(&smap->freelist);
+		if (unlikely(!new_bucket))
+			return -ENOMEM;
+		stack_map_get_build_id_offset(map, new_bucket, ips,
+					      trace_nr, user);
+		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
+		if (hash_matches && bucket->nr == trace_nr &&
+		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
+			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
 			return id;
-		if (bucket->nr == trace_nr &&
-		    memcmp(bucket->ip, ips, trace_len) == 0)
+		}
+		if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
+			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
+			return -EEXIST;
+		}
+	} else {
+		if (hash_matches && bucket->nr == trace_nr &&
+		    memcmp(bucket->data, ips, trace_len) == 0)
 			return id;
+		if (bucket && !(flags & BPF_F_REUSE_STACKID))
+			return -EEXIST;
+
+		new_bucket = (struct stack_map_bucket *)
+			pcpu_freelist_pop(&smap->freelist);
+		if (unlikely(!new_bucket))
+			return -ENOMEM;
+		memcpy(new_bucket->data, ips, trace_len);
 	}
 
-	/* this call stack is not in the map, try to add it */
-	if (bucket && !(flags & BPF_F_REUSE_STACKID))
-		return -EEXIST;
-
-	new_bucket = (struct stack_map_bucket *)
-		pcpu_freelist_pop(&smap->freelist);
-	if (unlikely(!new_bucket))
-		return -ENOMEM;
-
-	memcpy(new_bucket->ip, ips, trace_len);
 	new_bucket->hash = hash;
 	new_bucket->nr = trace_nr;
 
@@ -212,8 +425,8 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	if (!bucket)
 		return -ENOENT;
 
-	trace_len = bucket->nr * sizeof(u64);
-	memcpy(value, bucket->ip, trace_len);
+	trace_len = bucket->nr * stack_map_data_size(map);
+	memcpy(value, bucket->data, trace_len);
 	memset(value + trace_len, 0, map->value_size - trace_len);
 
 	old_bucket = xchg(&smap->buckets[id], bucket);
-- 
2.9.5

^ permalink raw reply related

* RE: [Intel-wired-lan] [next-queue PATCH v4 2/8] igb: Fix queue selection on MAC filters on i210 and i211
From: Vinicius Costa Gomes @ 2018-03-14 17:25 UTC (permalink / raw)
  To: Brown, Aaron F, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Sanchez-Palencia, Jesus
In-Reply-To: <309B89C4C689E141A5FF6A0C5FB2118B8C8032BA@ORSMSX101.amr.corp.intel.com>

Hi,

"Brown, Aaron F" <aaron.f.brown@intel.com> writes:

>> --- a/drivers/net/ethernet/intel/igb/igb_main.c
>> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
>> @@ -8747,12 +8747,19 @@ static void igb_rar_set_index(struct igb_adapter
>> *adapter, u32 index)
>>  		if (is_valid_ether_addr(addr))
>>  			rar_high |= E1000_RAH_AV;
>> 
>> -		if (hw->mac.type == e1000_82575)
>> +		switch (hw->mac.type) {
>> +		case e1000_82575:
>> +		case e1000_i210:
>> +		case e1000_i211:
>> +			rar_high |= E1000_RAH_QSEL_ENABLE;
>>  			rar_high |= E1000_RAH_POOL_1 *
>> -				    adapter->mac_table[index].queue;
>> -		else
>> +				      adapter->mac_table[index].queue;
>> +			break;
>> +		default:
>>  			rar_high |= E1000_RAH_POOL_1 <<
>> -				    adapter->mac_table[index].queue;
>> +				adapter->mac_table[index].queue;
>
> Small nit.  Shouldn't this line be indented more to be a few spaces
> past the "|=" operator as above?

I don't know why my editor seemed to disagree, I will fix.


Thank you,
--
Vinicius

^ permalink raw reply

* [PATCH] vhost: add vsock compat ioctl
From: Sonny Rao @ 2018-03-14 17:26 UTC (permalink / raw)
  To: kvm
  Cc: Stefan Hajnoczi, Michael S . Tsirkin, Jason Wang, virtualization,
	netdev, linux-kernel

This will allow usage of vsock from 32-bit binaries on a 64-bit
kernel.

Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
---
 drivers/vhost/vsock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 0d14e2ff19f16..d0e65e92110e5 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -705,6 +705,7 @@ static const struct file_operations vhost_vsock_fops = {
 	.release        = vhost_vsock_dev_release,
 	.llseek		= noop_llseek,
 	.unlocked_ioctl = vhost_vsock_dev_ioctl,
+	.compat_ioctl   = vhost_vsock_dev_ioctl,
 };
 
 static struct miscdevice vhost_vsock_misc = {
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH net] tg3: prevent scheduling while atomic splat
From: Jonathan Toppins @ 2018-03-14 17:27 UTC (permalink / raw)
  To: Michael Chan
  Cc: Netdev, Andy Gospodarek, Siva Reddy Kallam, Prashant Sreedharan,
	Michael Chan, open list
In-Reply-To: <CACKFLi=R2q71JH2Jbqbpy8HrzNpg1UmnoNWKhEs7ooWLWoQOqQ@mail.gmail.com>

On 03/14/2018 01:22 PM, Michael Chan wrote:
> On Wed, Mar 14, 2018 at 9:36 AM, Jonathan Toppins <jtoppins@redhat.com> wrote:
>> The problem was introduced in commit
>> 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes"). The bug occurs
>> because tp->lock spinlock is held which is obtained in tg3_start
>> by way of tg3_full_lock(), line 11571. The documentation for usleep_range()
>> specifically states it cannot be used inside a spinlock.
>>
>> Fixes: 506b0a395f26 ("[netdrv] tg3: APE heartbeat changes")
>> Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
>> ---
>>
>> Notes:
>>     The thing I need reviewed from Broadcom is if the udelay should be 20
>>     instead of 10, due to any timing changes introduced by the offending
>>     patch.
> 
> Thanks.  10 us is correct.
> 
> As a future improvement, we might want to see if we can release the
> spinlock and go back to usleep_range().  The wait time is potentially
> up to 20 msec which is quite long.

Agreed, glad it is not just me wondering why a lock needs to be held for
reads. :-)

^ permalink raw reply

* Re: [PATCH net-next] liquidio: Add support for liquidio 10GBase-T NIC
From: David Miller @ 2018-03-14 17:35 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	veerasenareddy.burru
In-Reply-To: <20180314050445.GA8221@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Tue, 13 Mar 2018 22:04:45 -0700

> From: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>
> 
> Added ethtool changes to show port type as TP (Twisted Pair) for
> 10GBASE-T ports. Same driver and firmware works for liquidio NIC with
> SFP+ ports or TP ports.
> 
> Signed-off-by: Veerasenareddy Burru <veerasenareddy.burru@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH bpf-next v6 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Song Liu @ 2018-03-14 17:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org, ast@kernel.org, peterz@infradead.org,
	daniel@iogearbox.net
  Cc: Kernel Team, hannes@cmpxchg.org, Teng Qin
In-Reply-To: <20180314172323.3957483-2-songliubraving@fb.com>



> On Mar 14, 2018, at 10:23 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> Currently, bpf stackmap store address for each entry in the call trace.
> To map these addresses to user space files, it is necessary to maintain
> the mapping from these virtual address to symbols in the binary. Usually,
> the user space profiler (such as perf) has to scan /proc/pid/maps at the
> beginning of profiling, and monitor mmap2() calls afterwards. Given the
> cost of maintaining the address map, this solution is not practical for
> system wide profiling that is always on.
> 
> This patch tries to solve this problem with a variation of stackmap. This
> variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
> addresses, the variation stores ELF file build_id + offset.
> 
> Build_id is
> only meaningful for user stack. If a kernel stack is added to a stackmap
> with BPF_F_STACK_BUILD_ID, it will automatically fallback to only store
> ip (status == BPF_STACK_BUILD_ID_IP).

I forgot to delete the paragraph above. Please let me know if I should 
resend (or it can be removed when you apply it).

Thanks,
Song 


> Build ID is a 20-byte unique identifier for ELF files. The following
> command shows the Build ID of /bin/bash:
> 
>  [user@]$ readelf -n /bin/bash
>  ...
>    Build ID: XXXXXXXXXX
>  ...
> 
> With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
> for each entry in the call trace, and translate it into the following
> struct:
> 
>  struct bpf_stack_build_id_offset {
>          __s32           status;
>          unsigned char   build_id[BPF_BUILD_ID_SIZE];
>          union {
>                  __u64   offset;
>                  __u64   ip;
>          };
>  };
> 
> The search of build_id is limited to the first page of the file, and this
> page should be in page cache. Otherwise, we fallback to store ip for this
> entry (ip field in struct bpf_stack_build_id_offset). This requires the
> build_id to be stored in the first page. A quick survey of binary and
> dynamic library files in a few different systems shows that almost all
> binary and dynamic library files have build_id in the first page.
> 
> Build_id is only meaningful for user stack. If a kernel stack is added to
> a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to
> only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id
> lookup failed for some reason, it will also fallback to store ip.
> 
> User space can access struct bpf_stack_build_id_offset with bpf
> syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
> maintain mapping from build id to binary files. This mostly static
> mapping is much easier to maintain than per process address maps.
> 
> Note: Stackmap with build_id only works in non-nmi context at this time.
> This is because we need to take mm->mmap_sem for find_vma(). If this
> changes, we would like to allow build_id lookup in nmi context.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> include/uapi/linux/bpf.h |  22 ++++
> kernel/bpf/stackmap.c    | 257 +++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 257 insertions(+), 22 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a66769..1e15d17 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -231,6 +231,28 @@ enum bpf_attach_type {
> #define BPF_F_RDONLY		(1U << 3)
> #define BPF_F_WRONLY		(1U << 4)
> 
> +/* Flag for stack_map, store build_id+offset instead of pointer */
> +#define BPF_F_STACK_BUILD_ID	(1U << 5)
> +
> +enum bpf_stack_build_id_status {
> +	/* user space need an empty entry to identify end of a trace */
> +	BPF_STACK_BUILD_ID_EMPTY = 0,
> +	/* with valid build_id and offset */
> +	BPF_STACK_BUILD_ID_VALID = 1,
> +	/* couldn't get build_id, fallback to ip */
> +	BPF_STACK_BUILD_ID_IP = 2,
> +};
> +
> +#define BPF_BUILD_ID_SIZE 20
> +struct bpf_stack_build_id {
> +	__s32		status;
> +	unsigned char	build_id[BPF_BUILD_ID_SIZE];
> +	union {
> +		__u64	offset;
> +		__u64	ip;
> +	};
> +};
> +
> union bpf_attr {
> 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
> 		__u32	map_type;	/* one of enum bpf_map_type */
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index b0ecf43..57eeb12 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -9,16 +9,19 @@
> #include <linux/filter.h>
> #include <linux/stacktrace.h>
> #include <linux/perf_event.h>
> +#include <linux/elf.h>
> +#include <linux/pagemap.h>
> #include "percpu_freelist.h"
> 
> -#define STACK_CREATE_FLAG_MASK \
> -	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
> +#define STACK_CREATE_FLAG_MASK					\
> +	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
> +	 BPF_F_STACK_BUILD_ID)
> 
> struct stack_map_bucket {
> 	struct pcpu_freelist_node fnode;
> 	u32 hash;
> 	u32 nr;
> -	u64 ip[];
> +	u64 data[];
> };
> 
> struct bpf_stack_map {
> @@ -29,6 +32,17 @@ struct bpf_stack_map {
> 	struct stack_map_bucket *buckets[];
> };
> 
> +static inline bool stack_map_use_build_id(struct bpf_map *map)
> +{
> +	return (map->map_flags & BPF_F_STACK_BUILD_ID);
> +}
> +
> +static inline int stack_map_data_size(struct bpf_map *map)
> +{
> +	return stack_map_use_build_id(map) ?
> +		sizeof(struct bpf_stack_build_id) : sizeof(u64);
> +}
> +
> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
> {
> 	u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
> @@ -68,8 +82,16 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
> 
> 	/* check sanity of attributes */
> 	if (attr->max_entries == 0 || attr->key_size != 4 ||
> -	    value_size < 8 || value_size % 8 ||
> -	    value_size / 8 > sysctl_perf_event_max_stack)
> +	    value_size < 8 || value_size % 8)
> +		return ERR_PTR(-EINVAL);
> +
> +	BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64));
> +	if (attr->map_flags & BPF_F_STACK_BUILD_ID) {
> +		if (value_size % sizeof(struct bpf_stack_build_id) ||
> +		    value_size / sizeof(struct bpf_stack_build_id)
> +		    > sysctl_perf_event_max_stack)
> +			return ERR_PTR(-EINVAL);
> +	} else if (value_size / 8 > sysctl_perf_event_max_stack)
> 		return ERR_PTR(-EINVAL);
> 
> 	/* hash table size must be power of 2 */
> @@ -114,13 +136,184 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
> 	return ERR_PTR(err);
> }
> 
> +#define BPF_BUILD_ID 3
> +/*
> + * Parse build id from the note segment. This logic can be shared between
> + * 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are
> + * identical.
> + */
> +static inline int stack_map_parse_build_id(void *page_addr,
> +					   unsigned char *build_id,
> +					   void *note_start,
> +					   Elf32_Word note_size)
> +{
> +	Elf32_Word note_offs = 0, new_offs;
> +
> +	/* check for overflow */
> +	if (note_start < page_addr || note_start + note_size < note_start)
> +		return -EINVAL;
> +
> +	/* only supports note that fits in the first page */
> +	if (note_start + note_size > page_addr + PAGE_SIZE)
> +		return -EINVAL;
> +
> +	while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
> +		Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
> +
> +		if (nhdr->n_type == BPF_BUILD_ID &&
> +		    nhdr->n_namesz == sizeof("GNU") &&
> +		    nhdr->n_descsz == BPF_BUILD_ID_SIZE) {
> +			memcpy(build_id,
> +			       note_start + note_offs +
> +			       ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr),
> +			       BPF_BUILD_ID_SIZE);
> +			return 0;
> +		}
> +		new_offs = note_offs + sizeof(Elf32_Nhdr) +
> +			ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
> +		if (new_offs <= note_offs)  /* overflow */
> +			break;
> +		note_offs = new_offs;
> +	}
> +	return -EINVAL;
> +}
> +
> +/* Parse build ID from 32-bit ELF */
> +static int stack_map_get_build_id_32(void *page_addr,
> +				     unsigned char *build_id)
> +{
> +	Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr;
> +	Elf32_Phdr *phdr;
> +	int i;
> +
> +	/* only supports phdr that fits in one page */
> +	if (ehdr->e_phnum >
> +	    (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr))
> +		return -EINVAL;
> +
> +	phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr));
> +
> +	for (i = 0; i < ehdr->e_phnum; ++i)
> +		if (phdr[i].p_type == PT_NOTE)
> +			return stack_map_parse_build_id(page_addr, build_id,
> +					page_addr + phdr[i].p_offset,
> +					phdr[i].p_filesz);
> +	return -EINVAL;
> +}
> +
> +/* Parse build ID from 64-bit ELF */
> +static int stack_map_get_build_id_64(void *page_addr,
> +				     unsigned char *build_id)
> +{
> +	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr;
> +	Elf64_Phdr *phdr;
> +	int i;
> +
> +	/* only supports phdr that fits in one page */
> +	if (ehdr->e_phnum >
> +	    (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr))
> +		return -EINVAL;
> +
> +	phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr));
> +
> +	for (i = 0; i < ehdr->e_phnum; ++i)
> +		if (phdr[i].p_type == PT_NOTE)
> +			return stack_map_parse_build_id(page_addr, build_id,
> +					page_addr + phdr[i].p_offset,
> +					phdr[i].p_filesz);
> +	return -EINVAL;
> +}
> +
> +/* Parse build ID of ELF file mapped to vma */
> +static int stack_map_get_build_id(struct vm_area_struct *vma,
> +				  unsigned char *build_id)
> +{
> +	Elf32_Ehdr *ehdr;
> +	struct page *page;
> +	void *page_addr;
> +	int ret;
> +
> +	/* only works for page backed storage  */
> +	if (!vma->vm_file)
> +		return -EINVAL;
> +
> +	page = find_get_page(vma->vm_file->f_mapping, 0);
> +	if (!page)
> +		return -EFAULT;	/* page not mapped */
> +
> +	ret = -EINVAL;
> +	page_addr = page_address(page);
> +	ehdr = (Elf32_Ehdr *)page_addr;
> +
> +	/* compare magic x7f "ELF" */
> +	if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0)
> +		goto out;
> +
> +	/* only support executable file and shared object file */
> +	if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN)
> +		goto out;
> +
> +	if (ehdr->e_ident[EI_CLASS] == ELFCLASS32)
> +		ret = stack_map_get_build_id_32(page_addr, build_id);
> +	else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64)
> +		ret = stack_map_get_build_id_64(page_addr, build_id);
> +out:
> +	put_page(page);
> +	return ret;
> +}
> +
> +static void stack_map_get_build_id_offset(struct bpf_map *map,
> +					  struct stack_map_bucket *bucket,
> +					  u64 *ips, u32 trace_nr, bool user)
> +{
> +	int i;
> +	struct vm_area_struct *vma;
> +	struct bpf_stack_build_id *id_offs;
> +
> +	bucket->nr = trace_nr;
> +	id_offs = (struct bpf_stack_build_id *)bucket->data;
> +
> +	/*
> +	 * We cannot do up_read() in nmi context, so build_id lookup is
> +	 * only supported for non-nmi events. If at some point, it is
> +	 * possible to run find_vma() without taking the semaphore, we
> +	 * would like to allow build_id lookup in nmi context.
> +	 *
> +	 * Same fallback is used for kernel stack (!user) on a stackmap
> +	 * with build_id.
> +	 */
> +	if (!user || !current || !current->mm || in_nmi() ||
> +	    down_read_trylock(&current->mm->mmap_sem) == 0) {
> +		/* cannot access current->mm, fall back to ips */
> +		for (i = 0; i < trace_nr; i++) {
> +			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> +			id_offs[i].ip = ips[i];
> +		}
> +		return;
> +	}
> +
> +	for (i = 0; i < trace_nr; i++) {
> +		vma = find_vma(current->mm, ips[i]);
> +		if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) {
> +			/* per entry fall back to ips */
> +			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> +			id_offs[i].ip = ips[i];
> +			continue;
> +		}
> +		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
> +			- vma->vm_start;
> +		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
> +	}
> +	up_read(&current->mm->mmap_sem);
> +}
> +
> BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> 	   u64, flags)
> {
> 	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
> 	struct perf_callchain_entry *trace;
> 	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
> -	u32 max_depth = map->value_size / 8;
> +	u32 max_depth = map->value_size / stack_map_data_size(map);
> 	/* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
> 	u32 init_nr = sysctl_perf_event_max_stack - max_depth;
> 	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
> @@ -128,6 +321,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> 	bool user = flags & BPF_F_USER_STACK;
> 	bool kernel = !user;
> 	u64 *ips;
> +	bool hash_matches;
> 
> 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
> 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
> @@ -156,24 +350,43 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> 	id = hash & (smap->n_buckets - 1);
> 	bucket = READ_ONCE(smap->buckets[id]);
> 
> -	if (bucket && bucket->hash == hash) {
> -		if (flags & BPF_F_FAST_STACK_CMP)
> +	hash_matches = bucket && bucket->hash == hash;
> +	/* fast cmp */
> +	if (hash_matches && flags & BPF_F_FAST_STACK_CMP)
> +		return id;
> +
> +	if (stack_map_use_build_id(map)) {
> +		/* for build_id+offset, pop a bucket before slow cmp */
> +		new_bucket = (struct stack_map_bucket *)
> +			pcpu_freelist_pop(&smap->freelist);
> +		if (unlikely(!new_bucket))
> +			return -ENOMEM;
> +		stack_map_get_build_id_offset(map, new_bucket, ips,
> +					      trace_nr, user);
> +		trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
> +		if (hash_matches && bucket->nr == trace_nr &&
> +		    memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
> +			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
> 			return id;
> -		if (bucket->nr == trace_nr &&
> -		    memcmp(bucket->ip, ips, trace_len) == 0)
> +		}
> +		if (bucket && !(flags & BPF_F_REUSE_STACKID)) {
> +			pcpu_freelist_push(&smap->freelist, &new_bucket->fnode);
> +			return -EEXIST;
> +		}
> +	} else {
> +		if (hash_matches && bucket->nr == trace_nr &&
> +		    memcmp(bucket->data, ips, trace_len) == 0)
> 			return id;
> +		if (bucket && !(flags & BPF_F_REUSE_STACKID))
> +			return -EEXIST;
> +
> +		new_bucket = (struct stack_map_bucket *)
> +			pcpu_freelist_pop(&smap->freelist);
> +		if (unlikely(!new_bucket))
> +			return -ENOMEM;
> +		memcpy(new_bucket->data, ips, trace_len);
> 	}
> 
> -	/* this call stack is not in the map, try to add it */
> -	if (bucket && !(flags & BPF_F_REUSE_STACKID))
> -		return -EEXIST;
> -
> -	new_bucket = (struct stack_map_bucket *)
> -		pcpu_freelist_pop(&smap->freelist);
> -	if (unlikely(!new_bucket))
> -		return -ENOMEM;
> -
> -	memcpy(new_bucket->ip, ips, trace_len);
> 	new_bucket->hash = hash;
> 	new_bucket->nr = trace_nr;
> 
> @@ -212,8 +425,8 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
> 	if (!bucket)
> 		return -ENOENT;
> 
> -	trace_len = bucket->nr * sizeof(u64);
> -	memcpy(value, bucket->ip, trace_len);
> +	trace_len = bucket->nr * stack_map_data_size(map);
> +	memcpy(value, bucket->data, trace_len);
> 	memset(value + trace_len, 0, map->value_size - trace_len);
> 
> 	old_bucket = xchg(&smap->buckets[id], bucket);
> -- 
> 2.9.5
> 

^ permalink raw reply

* Re: [PATCH net v2] ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu
From: David Miller @ 2018-03-14 17:38 UTC (permalink / raw)
  To: sd; +Cc: netdev, sbrivio
In-Reply-To: <ab5d270fe823e0d4f7dc00471db68a66ebe68f54.1521019160.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Wed, 14 Mar 2018 10:21:14 +0100

> Prior to the rework of PMTU information storage in commit
> 2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer."),
> when a PMTU event advertising a PMTU smaller than
> net.ipv4.route.min_pmtu was received, we would disable setting the DF
> flag on packets by locking the MTU metric, and set the PMTU to
> net.ipv4.route.min_pmtu.
> 
> Since then, we don't disable DF, and set PMTU to
> net.ipv4.route.min_pmtu, so the intermediate router that has this link
> with a small MTU will have to drop the packets.
> 
> This patch reestablishes pre-2.6.39 behavior by splitting
> rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
> rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
> and is checked in ip_dont_fragment().
> 
> One possible workaround is to set net.ipv4.route.min_pmtu to a value low
> enough to accommodate the lowest MTU encountered.
> 
> Fixes: 2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer.")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v2: make rt_pmtu a bitfield
>     fix missing initializations of rt_mtu_locked

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH bpf-next v6 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: Daniel Borkmann @ 2018-03-14 17:39 UTC (permalink / raw)
  To: Song Liu, netdev@vger.kernel.org, ast@kernel.org,
	peterz@infradead.org
  Cc: Kernel Team, hannes@cmpxchg.org, Teng Qin
In-Reply-To: <15D381AB-87EE-4056-A5E1-9BF588710801@fb.com>

On 03/14/2018 06:36 PM, Song Liu wrote:
>> On Mar 14, 2018, at 10:23 AM, Song Liu <songliubraving@fb.com> wrote:
>>
>> Currently, bpf stackmap store address for each entry in the call trace.
>> To map these addresses to user space files, it is necessary to maintain
>> the mapping from these virtual address to symbols in the binary. Usually,
>> the user space profiler (such as perf) has to scan /proc/pid/maps at the
>> beginning of profiling, and monitor mmap2() calls afterwards. Given the
>> cost of maintaining the address map, this solution is not practical for
>> system wide profiling that is always on.
>>
>> This patch tries to solve this problem with a variation of stackmap. This
>> variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
>> addresses, the variation stores ELF file build_id + offset.
>>
>> Build_id is
>> only meaningful for user stack. If a kernel stack is added to a stackmap
>> with BPF_F_STACK_BUILD_ID, it will automatically fallback to only store
>> ip (status == BPF_STACK_BUILD_ID_IP).
> 
> I forgot to delete the paragraph above. Please let me know if I should 
> resend (or it can be removed when you apply it).

We can fix it up when applying, no problem.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next 0/2] skbuff: Fix applications not being woken for errors
From: Soheil Hassas Yeganeh @ 2018-03-14 17:39 UTC (permalink / raw)
  To: vinicius.gomes
  Cc: netdev, randy.e.witt, David Miller, Eric Dumazet,
	Willem de Bruijn
In-Reply-To: <CAF=yD-K2xH6uGxC3P7UQab5v26kP85z3axvY8L9NfigCLrP6Hw@mail.gmail.com>

On Wed, Mar 14, 2018 at 12:32 PM Willem de Bruijn <
willemdebruijn.kernel@gmail.com> wrote:

> On Tue, Mar 13, 2018 at 4:35 PM, Vinicius Costa Gomes
> <vinicius.gomes@intel.com> wrote:
> > Hi,
> >
> > Changes from the RFC:
> >  - tweaked commit messages;
> >
> > Original cover letter:
> >
> > This is actually a "bug report"-RFC instead of the more usual "new
> > feature"-RFC.
> >
> > We are developing an application that uses TX hardware timestamping to
> > make some measurements, and during development Randy Witt initially
> > reported that the application poll() never unblocked when TX hardware
> > timestamping was enabled.
> >
> > After some investigation, it turned out the problem wasn't only
> > exclusive to hardware timestamping, and could be reproduced with
> > software timestamping.
> >
> > Applying patch (1), and running txtimestamp like this, for example:
> >
> > $ ./txtimestamp -u -4 192.168.1.71 -c 1000 -D -l 1000 -F
> >
> > ('-u' to use UDP only, '-4' for ipv4 only, '-c 1000' to send 1000
> > packets for each test, '-D' to remove the delay between packets, '-l
> > 1000' to set the payload to 1000 bytes, '-F' for configuring poll() to
> > wait forever)
> >
> > will cause the application to become stuck in the poll() call in most
> > of the times. (Note: I couldn't reproduce the issue running against an
> > address that is routed through loopback.)
> >
> > Another interesting fact is that if the POLLIN event is added to the
> > poll() .events, poll() no longer becomes stuck,

> The process has registered interest only in POLLIN, which the call to
> sk_data_read (sock_def_readable) will trigger.

> > and more interestingly
> > the returned event in .revents is only POLLERR.

> datagram_poll will set (E)POLLERR based on non-empty sk_error_queue.

> > After a few debugging sessions, we got to 'sock_queue_err_skb()' and
> > how it notifies applications of the error just enqueued. Changing it
> > to use 'sk->sk_error_report()', fixes the issue for hardware and
> > software timestamping. That is patch (2).
> >
> > The "solution" proposed in patch (2) looks like too big a hammer,

> It looks fine to me. POLLERR is returned regardless of the mask a
> process sets up in pollfd.events. So waking with sk_error_report
> will fix this while still waking callers waiting on POLLIN.

> Note that on sock_dequeue_err_skb, if another notification (of the
> right kind) is waiting, sk_error_report is already called instead of
> sk_data_ready.

Thank you for the fix. It looks fine to me too, because we already only arm
POLLERR for sock_dequeue_err_skb, and POLLERR is always returned when error
queue is not empty:
   if (...skb_queue_empty(&sk->sk_error_queue))
      mask |= POLLERR ...

> This should perhaps go to net, instead of net-next (though not the test).

+1 I think the fix should go to net.

> If resending, a small nit in the test: please keep the alphabetical
> order in getopt. The filepath also looks a bit fishy, but git am applied
> the mbox from patchwork without issue.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox