Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] liquidio: Add spoof checking on a VF MAC address
From: David Miller @ 2018-09-06 22:52 UTC (permalink / raw)
  To: felix.manlunas
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	weilin.chang
In-Reply-To: <20180906014056.GA2159@felix-thinkpad.cavium.com>

From: Felix Manlunas <felix.manlunas@cavium.com>
Date: Wed, 5 Sep 2018 18:40:56 -0700

> From: Weilin Chang <weilin.chang@cavium.com>
> 
> 1. Provide the API to set/unset the spoof checking feature.
> 2. Add a function to periodically provide the count of found
>    packets with spoof VF MAC address.
> 3. Prevent VF MAC address changing while the spoofchk of the VF is
>    on unless the changing MAC address is issued from PF.
> 
> Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
> Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] liquidio CN23XX: Remove set but not used variable 'ring_flag'
From: David Miller @ 2018-09-06 22:54 UTC (permalink / raw)
  To: yuehaibing
  Cc: derek.chickles, satananda.burla, felix.manlunas, raghu.vatsavayi,
	netdev, kernel-janitors
In-Reply-To: <1536232929-117243-1-git-send-email-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Thu, 6 Sep 2018 11:22:09 +0000

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c: In function 'cn23xx_setup_octeon_vf_device':
> drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c:619:20: warning:
>  variable 'ring_flag' set but not used [-Wunused-but-set-variable]
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Herbert Xu @ 2018-09-07  3:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Olof Johansson, David Miller, Neil Horman,
	Marcelo Ricardo Leitner, Vladislav Yasevich, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, linux-crypto, LKML, linux-sctp, netdev,
	linux-decnet-user, kernel-team
In-Reply-To: <CANn89i+akEWrHELBkZJQOxok-ZfYy+FNPUWdPEfB6c4YyWLqJA@mail.gmail.com>

On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson <olof@lixom.net> wrote:
> >
> > Today these are all global shared variables per protocol, and in
> > particular tcp_memory_allocated can get hot on a system with
> > large number of CPUs and a substantial number of connections.
> >
> > Moving it over to a per-cpu variable makes it significantly cheaper,
> > and the added overhead when summing up the percpu copies is still smaller
> > than the cost of having a hot cacheline bouncing around.
> 
> I am curious. We never noticed contention on this variable, at least for TCP.

Yes these variables are heavily amortised so I'm surprised that
they would cause much contention.

> Please share some numbers with us.

Indeed.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH net-next 08/11] tun: switch to new type of msg_control
From: Jason Wang @ 2018-09-07  3:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906125051-mutt-send-email-mst@kernel.org>



On 2018年09月07日 00:54, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:23PM +0800, Jason Wang wrote:
>> This patch introduces to a new tun/tap specific msg_control:
>>
>> #define TUN_MSG_UBUF 1
>> #define TUN_MSG_PTR  2
>> struct tun_msg_ctl {
>>         int type;
>>         void *ptr;
>> };
>>
>> This allows us to pass different kinds of msg_control through
>> sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF) which will
>> be used by the existed vhost_net zerocopy code. The second is XDP
>> buff, which allows vhost_net to pass XDP buff to TUN. This could be
>> used to implement accepting an array of XDP buffs from vhost_net in
>> the following patches.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> At this point, do we want to just add a new sock opt for tap's
> benefit? Seems cleaner than (ab)using sendmsg.

I think it won't be much difference, we still need a void pointer.

>> ---
>>   drivers/net/tap.c      | 18 ++++++++++++------
>>   drivers/net/tun.c      |  6 +++++-
>>   drivers/vhost/net.c    |  7 +++++--
>>   include/linux/if_tun.h |  7 +++++++
>>   4 files changed, 29 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>> index f0f7cd977667..7996ed7cbf18 100644
>> --- a/drivers/net/tap.c
>> +++ b/drivers/net/tap.c
>> @@ -619,7 +619,7 @@ static inline struct sk_buff *tap_alloc_skb(struct sock *sk, size_t prepad,
>>   #define TAP_RESERVE HH_DATA_OFF(ETH_HLEN)
>>   
>>   /* Get packet from user space buffer */
>> -static ssize_t tap_get_user(struct tap_queue *q, struct msghdr *m,
>> +static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
>>   			    struct iov_iter *from, int noblock)
>>   {
>>   	int good_linear = SKB_MAX_HEAD(TAP_RESERVE);
>> @@ -663,7 +663,7 @@ static ssize_t tap_get_user(struct tap_queue *q, struct msghdr *m,
>>   	if (unlikely(len < ETH_HLEN))
>>   		goto err;
>>   
>> -	if (m && m->msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
>> +	if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
>>   		struct iov_iter i;
>>   
>>   		copylen = vnet_hdr.hdr_len ?
>> @@ -724,11 +724,11 @@ static ssize_t tap_get_user(struct tap_queue *q, struct msghdr *m,
>>   	tap = rcu_dereference(q->tap);
>>   	/* copy skb_ubuf_info for callback when skb has no error */
>>   	if (zerocopy) {
>> -		skb_shinfo(skb)->destructor_arg = m->msg_control;
>> +		skb_shinfo(skb)->destructor_arg = msg_control;
>>   		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
>>   		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>> -	} else if (m && m->msg_control) {
>> -		struct ubuf_info *uarg = m->msg_control;
>> +	} else if (msg_control) {
>> +		struct ubuf_info *uarg = msg_control;
>>   		uarg->callback(uarg, false);
>>   	}
>>   
>> @@ -1150,7 +1150,13 @@ static int tap_sendmsg(struct socket *sock, struct msghdr *m,
>>   		       size_t total_len)
>>   {
>>   	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
>> -	return tap_get_user(q, m, &m->msg_iter, m->msg_flags & MSG_DONTWAIT);
>> +	struct tun_msg_ctl *ctl = m->msg_control;
>> +
>> +	if (ctl && ctl->type != TUN_MSG_UBUF)
>> +		return -EINVAL;
>> +
>> +	return tap_get_user(q, ctl ? ctl->ptr : NULL, &m->msg_iter,
>> +			    m->msg_flags & MSG_DONTWAIT);
>>   }
>>   
>>   static int tap_recvmsg(struct socket *sock, struct msghdr *m,
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index ff1cbf3ebd50..c839a4bdcbd9 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2429,11 +2429,15 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>   	int ret;
>>   	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
>>   	struct tun_struct *tun = tun_get(tfile);
>> +	struct tun_msg_ctl *ctl = m->msg_control;
>>   
>>   	if (!tun)
>>   		return -EBADFD;
>>   
>> -	ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
>> +	if (ctl && ctl->type != TUN_MSG_UBUF)
>> +		return -EINVAL;
>> +
>> +	ret = tun_get_user(tun, tfile, ctl ? ctl->ptr : NULL, &m->msg_iter,
>>   			   m->msg_flags & MSG_DONTWAIT,
>>   			   m->msg_flags & MSG_MORE);
>>   	tun_put(tun);
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 4e656f89cb22..fb01ce6d981c 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -620,6 +620,7 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>>   		.msg_controllen = 0,
>>   		.msg_flags = MSG_DONTWAIT,
>>   	};
>> +	struct tun_msg_ctl ctl;
>>   	size_t len, total_len = 0;
>>   	int err;
>>   	struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
>> @@ -664,8 +665,10 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>>   			ubuf->ctx = nvq->ubufs;
>>   			ubuf->desc = nvq->upend_idx;
>>   			refcount_set(&ubuf->refcnt, 1);
>> -			msg.msg_control = ubuf;
>> -			msg.msg_controllen = sizeof(ubuf);
>> +			msg.msg_control = &ctl;
>> +			ctl.type = TUN_MSG_UBUF;
>> +			ctl.ptr = ubuf;
>> +			msg.msg_controllen = sizeof(ctl);
>>   			ubufs = nvq->ubufs;
>>   			atomic_inc(&ubufs->refcount);
>>   			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
>> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
>> index 3d2996dc7d85..ba46dced1f38 100644
>> --- a/include/linux/if_tun.h
>> +++ b/include/linux/if_tun.h
>> @@ -19,6 +19,13 @@
>>   
>>   #define TUN_XDP_FLAG 0x1UL
>>   
>> +#define TUN_MSG_UBUF 1
>> +#define TUN_MSG_PTR  2
> Looks like TUN_MSG_PTR should be pushed out to a follow-up patch?

Ok.

>
>> +struct tun_msg_ctl {
>> +	int type;
>> +	void *ptr;
>> +};
>> +
> type actually includes a size. Why not two short fields then?

Yes, this sounds better.

Thanks

>
>>   #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
>>   struct socket *tun_get_socket(struct file *);
>>   struct ptr_ring *tun_get_tx_ring(struct file *file);
>> -- 
>> 2.17.1

^ permalink raw reply

* [PATCH net-next] ixgbe: remove redundant function ixgbe_fw_recovery_mode()
From: YueHaibing @ 2018-09-07  3:38 UTC (permalink / raw)
  To: davem, jeffrey.t.kirsher
  Cc: linux-kernel, netdev, intel-wired-lan, YueHaibing

There are no in-tree callers.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
index 970f71d..0bd1294 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
@@ -3485,17 +3485,6 @@ void ixgbe_set_vlan_anti_spoofing(struct ixgbe_hw *hw, bool enable, int vf)
 }
 
 /**
- * ixgbe_fw_recovery_mode - Check if in FW NVM recovery mode
- * @hw: pointer to hardware structure
- */
-bool ixgbe_fw_recovery_mode(struct ixgbe_hw *hw)
-{
-	if (hw->mac.ops.fw_recovery_mode)
-		return hw->mac.ops.fw_recovery_mode(hw);
-	return false;
-}
-
-/**
  *  ixgbe_get_device_caps_generic - Get additional device capabilities
  *  @hw: pointer to hardware structure
  *  @device_caps: the EEPROM word with the extra device capabilities
-- 
2.7.0

^ permalink raw reply related

* Re: [PATCH net-next 10/11] tap: accept an array of XDP buffs through sendmsg()
From: Jason Wang @ 2018-09-07  3:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906135450-mutt-send-email-mst@kernel.org>



On 2018年09月07日 02:00, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:25PM +0800, Jason Wang wrote:
>> This patch implement TUN_MSG_PTR msg_control type. This type allows
>> the caller to pass an array of XDP buffs to tuntap through ptr field
>> of the tun_msg_control. Tap will build skb through those XDP buffers.
>>
>> This will avoid lots of indirect calls thus improves the icache
>> utilization and allows to do XDP batched flushing when doing XDP
>> redirection.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   drivers/net/tap.c | 73 +++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
>> index 7996ed7cbf18..50eb7bf22225 100644
>> --- a/drivers/net/tap.c
>> +++ b/drivers/net/tap.c
>> @@ -1146,14 +1146,83 @@ static const struct file_operations tap_fops = {
>>   #endif
>>   };
>>   
>> +static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp)
>> +{
>> +	struct virtio_net_hdr *gso = xdp->data_hard_start + sizeof(int);
>> +	int buflen = *(int *)xdp->data_hard_start;
>> +	int vnet_hdr_len = 0;
>> +	struct tap_dev *tap;
>> +	struct sk_buff *skb;
>> +	int err, depth;
>> +
>> +	if (q->flags & IFF_VNET_HDR)
>> +		vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>> +
>> +	skb = build_skb(xdp->data_hard_start, buflen);
>> +	if (!skb) {
>> +		err = -ENOMEM;
>> +		goto err;
>> +	}
> So fundamentally why is it called XDP?
> We just build and skb, don't we?

The reason is that the function accepts a pointer to XDP. And also the 
for the future development, I think the name is ok:

- we will probably do XDP offloading in this function.
- we may have a chance to call lower device's ndo_xdp_xmit() in the future.

Thanks

>
>> +
>> +	skb_reserve(skb, xdp->data - xdp->data_hard_start);
>> +	skb_put(skb, xdp->data_end - xdp->data);
>> +
>> +	skb_set_network_header(skb, ETH_HLEN);
>> +	skb_reset_mac_header(skb);
>> +	skb->protocol = eth_hdr(skb)->h_proto;
>> +
>> +	if (vnet_hdr_len) {
>> +		err = virtio_net_hdr_to_skb(skb, gso, tap_is_little_endian(q));
>> +		if (err)
>> +			goto err_kfree;
>> +	}
>> +
>> +	skb_probe_transport_header(skb, ETH_HLEN);
>> +
>> +	/* Move network header to the right position for VLAN tagged packets */
>> +	if ((skb->protocol == htons(ETH_P_8021Q) ||
>> +	     skb->protocol == htons(ETH_P_8021AD)) &&
>> +	    __vlan_get_protocol(skb, skb->protocol, &depth) != 0)
>> +		skb_set_network_header(skb, depth);
>> +
>> +	rcu_read_lock();
>> +	tap = rcu_dereference(q->tap);
>> +	if (tap) {
>> +		skb->dev = tap->dev;
>> +		dev_queue_xmit(skb);
>> +	} else {
>> +		kfree_skb(skb);
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	return 0;
>> +
>> +err_kfree:
>> +	kfree_skb(skb);
>> +err:
>> +	rcu_read_lock();
>> +		tap = rcu_dereference(q->tap);
>> +	if (tap && tap->count_tx_dropped)
>> +		tap->count_tx_dropped(tap);
>> +	rcu_read_unlock();
>> +	return err;
>> +}
>> +
>>   static int tap_sendmsg(struct socket *sock, struct msghdr *m,
>>   		       size_t total_len)
>>   {
>>   	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
>>   	struct tun_msg_ctl *ctl = m->msg_control;
>> +	struct xdp_buff *xdp;
>> +	int i;
>>   
>> -	if (ctl && ctl->type != TUN_MSG_UBUF)
>> -		return -EINVAL;
>> +	if (ctl && ((ctl->type & 0xF) == TUN_MSG_PTR)) {
>> +		for (i = 0; i < ctl->type >> 16; i++) {
>> +			xdp = &((struct xdp_buff *)ctl->ptr)[i];
>> +			tap_get_user_xdp(q, xdp);
>> +		}
>> +		return 0;
>> +	}
>>   
>>   	return tap_get_user(q, ctl ? ctl->ptr : NULL, &m->msg_iter,
>>   			    m->msg_flags & MSG_DONTWAIT);
>> -- 
>> 2.17.1

^ permalink raw reply

* Re: [RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps
From: Alexei Starovoitov @ 2018-09-07  0:13 UTC (permalink / raw)
  To: Mauricio Vasquez B; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev, joe
In-Reply-To: <153575074884.30050.17670029209466860207.stgit@kernel>

On Fri, Aug 31, 2018 at 11:25:48PM +0200, Mauricio Vasquez B wrote:
> In some applications this is needed have a pool of free elements, like for
> example the list of free L4 ports in a SNAT.  None of the current maps allow
> to do it as it is not possibleto get an any element without having they key
> it is associated to.
> 
> This patchset implements two new kind of eBPF maps: queue and stack.
> Those maps provide to eBPF programs the peek, push and pop operations, and for
> userspace applications a new bpf_map_lookup_and_delete_elem() is added.
> 
> Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
> 
> ---
> 
> I am sending this as an RFC because there is still an issue I am not sure how
> to solve.
> 
> The queue/stack maps have a linked list for saving the nodes, and a
> preallocation schema based on the pcpu_freelist already implemented and used
> in the htabmap.  Each time an element is pushed into the map, a node from the
> pcpu_freelist is taken and then added to the linked list.
> 
> The pop operation takes and *removes* the first node from the linked list, then
> it uses *call_rcu* to postpose freeing the node, i.e, the node is only returned
> to the pcpu_freelist when the rcu callback is executed.  This is needed because
> an element returned by the pop() operation should remain valid for the whole
> duration of the eBPF program.
> 
> The problem is that elements are not immediately returned to the free list, so
> in some cases the push operation could fail because there are not free nodes
> in the pcpu_freelist.
> 
> The following code snippet exposes that problem.
> 
> ...
> 	/* Push MAP_SIZE elements */
> 	for (i = 0; i < MAP_SIZE; i++)
> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
> 
> 	/* Pop all elements */
> 	for (i = 0; i < MAP_SIZE; i++)
> 		assert(bpf_map_lookup_and_delete_elem(fd, NULL, &val) == 0 &&
> 		       val == vals[i]);
> 
>   // sleep(1) <-- If I put this sleep, everything works.
> 	/* Push MAP_SIZE elements */
> 	for (i = 0; i < MAP_SIZE; i++)
> 		assert(bpf_map_update_elem(fd, NULL, &vals[i], 0) == 0);
>            ^^^
>            This fails because there are not available elements in pcpu_freelist
> ...
> 
> I think a possible solution is to oversize the pcpu_freelist (no idea by how
> much, maybe double or, or make it 1.5 time the max elements in the map?)
> I also have concerns about it, it would waste that memory in many cases and
> this is also probably that it doesn't solve the issue because that code snippet
> is puhsing and popping elements too fast, so even if the pcpu_freelist is much
> large a certain time instant all the elements could be used.
> 
> Is this really an important issue?
> Any idea of how to solve it?

It is important issue indeed and a difficult one to solve.
We have the same issue with hash map.
If the prog is doing:
value = lookup(key);
delete(key);
// here the prog shouldn't be accessing the value anymore, since the memory
// could have been reused, but value pointer is still valid and points to
// allocated memory

bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
validity of value without races.
With pcpu_freelist I don't think there is a solution.
We can have this queue/stack map without prealloc and use kmalloc/kfree
back and forth. Performance will not be as great, but for your use case,
I suspect, it will be good enough.
The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
If somebody starts doing push/pop for every packet, the rcu subsystem
will struggle and nothing we can do about it.

The only way I could think of to resolve this problem is to reuse
the logic that Joe is working on for socket lookups inside the program.
Joe,
how is that going? Could you repost the latest patches?

In such case the api for stack map will look like:

elem = bpf_map_pop_elem(stack);
// access elem
bpf_map_free_elem(elem);
// here prog is not allowed to access elem and verifier will catch that

elem = bpf_map_alloc_elem(stack);
// populate elem
bpf_map_push_elem(elem);
// here prog is not allowed to access elem and verifier will catch that

Then both pre-allocated elems and kmalloc/kfree will work fine
and no unbounded rcu issues in both cases.

^ permalink raw reply

* Re: [PATCH iproute2] bridge/mdb: fix missing new line when show bridge mdb
From: Hangbin Liu @ 2018-09-07  0:14 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, David Ahern
In-Reply-To: <20180906140053.0778cee3@shemminger-XPS-13-9360>

Hi Stephen,
On Thu, Sep 06, 2018 at 02:00:53PM +0100, Stephen Hemminger wrote:
> > @@ -164,6 +168,10 @@ static void print_mdb_entry(FILE *f, int ifindex, const struct br_mdb_entry *e,
> >  		print_string(PRINT_ANY, "timer", " %s",
> >  			     format_timer(timer));
> >  	}
> > +
> > +	if (!is_json_context())
> > +		fprintf(f, "\n");
> > +
> >  	close_json_object();
> >  }
> >  
> 
> Thanks for catching this.
> 
> Now that there is a json print library, the preferred pattern for
> this is:
> 	print_string(PRINT_FP, NULL, "\n", NULL);

Are we going to replace all printf() by json print library, even not in
json context? If yes, I can post a v2 patch. Becuase there are still a
lot fprintf() in mdb.c.

> 
> I plan to introduce a helper
> 	print_fp(...)
> 
> and it would be easier if all places were consistent.

cool, that would be more clear.

Thanks
Hangbin

^ permalink raw reply

* Re: [bpf-next 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov @ 2018-09-07  0:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	simon.horman, ecree, songliubraving, Tom Herbert,
	Willem de Bruijn
In-Reply-To: <CAG4SDVVJZwsqzkc3SCmtsSX41L6yv5nncVWwQSw0Qo7EnFV7Tg@mail.gmail.com>

On Mon, Sep 3, 2018 at 1:54 PM, Petar Penkov <ppenkov@google.com> wrote:
>
> On Sun, Sep 2, 2018 at 2:03 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 08/30/2018 08:22 PM, Petar Penkov wrote:
> >> From: Petar Penkov <ppenkov@google.com>
> >>
> >> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> >> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> >> path. The BPF program is per-network namespace.
> >>
> >> Signed-off-by: Petar Penkov <ppenkov@google.com>
> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > [...]
> >> +             err = check_flow_keys_access(env, off, size);
> >> +             if (!err && t == BPF_READ && value_regno >= 0)
> >> +                     mark_reg_unknown(env, regs, value_regno);
> >>       } else {
> >>               verbose(env, "R%d invalid mem access '%s'\n", regno,
> >>                       reg_type_str[reg->type]);
> >> @@ -1925,6 +1954,8 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
> >>       case PTR_TO_PACKET_META:
> >>               return check_packet_access(env, regno, reg->off, access_size,
> >>                                          zero_size_allowed);
> >> +     case PTR_TO_FLOW_KEYS:
> >> +             return check_flow_keys_access(env, reg->off, access_size);
> >>       case PTR_TO_MAP_VALUE:
> >>               return check_map_access(env, regno, reg->off, access_size,
> >>                                       zero_size_allowed);
> >> @@ -3976,6 +4007,7 @@ static bool may_access_skb(enum bpf_prog_type type)
> >>       case BPF_PROG_TYPE_SOCKET_FILTER:
> >>       case BPF_PROG_TYPE_SCHED_CLS:
> >>       case BPF_PROG_TYPE_SCHED_ACT:
> >> +     case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >>               return true;
> >
> > This one should not be added here. It would allow for LD_ABS to be used, but
> > you already have direct packet access as well as bpf_skb_load_bytes() helper
> > enabled. Downside on LD_ABS is that error path will exit the BPF prog with
> > return 0 for historical reasons w/o user realizing (here: to BPF_OK mapping).
> > So we should not encourage use of LD_ABS/IND anymore in eBPF context and
> > avoid surprises.
> >
> >>       default:
> >>               return false;
> >> @@ -4451,6 +4483,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
> >>       case PTR_TO_CTX:
> >>       case CONST_PTR_TO_MAP:
> >>       case PTR_TO_PACKET_END:
> >> +     case PTR_TO_FLOW_KEYS:
> >>               /* Only valid matches are exact, which memcmp() above
> >>                * would have accepted
> >>                */
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index c25eb36f1320..0143b9c0c67e 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -5092,6 +5092,17 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >>       }
> >>  }
> >>
> >> +static const struct bpf_func_proto *
> >> +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >> +{
> >> +     switch (func_id) {
> >> +     case BPF_FUNC_skb_load_bytes:
> >> +             return &bpf_skb_load_bytes_proto;
> >
> > Probably makes sense to also enable bpf_skb_pull_data helper for direct packet
> > access use to fetch non-linear data from here once.

On closer look, it turns out that __skb_flow_dissect takes a const skb
pointer, which conflicts with the realloc in bpf_skb_pull_data. But,
we also don't need it if I change bpf_flow_dissect_get_header to try
bpf_skb_load_bytes when direct packet access fails. This is very
similar to the existing use of __skb_header_pointer. See the below
snippet:

static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
                                                         __u16 hdr_size,
                                                         void *buffer)
{
        struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
        void *data_end = (__u8 *)(long)skb->data_end;
        void *data = (__u8 *)(long)skb->data;
        __u8 *hdr;

        /* Verifies this variable offset does not overflow */
        if (cb->nhoff > (USHRT_MAX - hdr_size))
                return NULL;

        hdr = data + cb->nhoff;
        if (hdr + hdr_size <= data_end)
                return hdr;

        if (bpf_skb_load_bytes(skb, cb->nhoff, buffer, hdr_size))
                return NULL;

        return buffer;
}

> >
> >> +     default:
> >> +             return bpf_base_func_proto(func_id);
> >> +     }
> >> +}
> >> +
> >>  static const struct bpf_func_proto *
> >>  lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >>  {
> >> @@ -5207,6 +5218,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
> >>       case bpf_ctx_range(struct __sk_buff, data):
> >>       case bpf_ctx_range(struct __sk_buff, data_meta):
> >>       case bpf_ctx_range(struct __sk_buff, data_end):
> >> +     case bpf_ctx_range(struct __sk_buff, flow_keys):
> >>               if (size != size_default)
> >>                       return false;
> >>               break;
> >> @@ -5235,6 +5247,7 @@ static bool sk_filter_is_valid_access(int off, int size,
> >>       case bpf_ctx_range(struct __sk_buff, data):
> >>       case bpf_ctx_range(struct __sk_buff, data_meta):
> >>       case bpf_ctx_range(struct __sk_buff, data_end):
> >> +     case bpf_ctx_range(struct __sk_buff, flow_keys):
> >>       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> > [...]
> > Thanks,
> > Daniel
>
> Thank you for your feedback, Daniel! I'll make these changes and submit a v2.
> Petar

^ permalink raw reply

* [PATCH bpf-next 0/2] bpf: add bpffs/bpftool dump for prog_array and map_in_map maps
From: Yonghong Song @ 2018-09-07  0:26 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team

The support to dump program array and map_in_map maps
for bpffs and bpftool is added. Patch #1 added bpffs support
and Patch #2 added bpftool support. Please see
individual patches for example output.

Yonghong Song (2):
  bpf: add bpffs pretty print for program array map
  tools/bpf: bpftool: support prog array map and map of maps

 kernel/bpf/arraymap.c   | 25 ++++++++++++++++++++++++-
 tools/bpf/bpftool/map.c | 11 +++--------
 2 files changed, 27 insertions(+), 9 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH bpf-next 2/2] tools/bpf: bpftool: support prog array map and map of maps
From: Yonghong Song @ 2018-09-07  0:26 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180907002605.1408960-1-yhs@fb.com>

Currently, prog array map and map of maps are not supported
in bpftool. This patch added the support.
Different from other map types, for prog array map and
map of maps, the key returned bpf_get_next_key() may not
point to a valid value. So for these two map types,
no error will be printed out when such a scenario happens.

The following is the plain and json dump if btf is not available:
  $ ./bpftool map dump id 10
    key: 08 00 00 00  value: 5c 01 00 00
    Found 1 element
  $ ./bpftool -jp map dump id 10
    [{
        "key": ["0x08","0x00","0x00","0x00"
        ],
        "value": ["0x5c","0x01","0x00","0x00"
        ]
    }]

If the BTF is available, the dump looks below:
  $ ./bpftool map dump id 2
    [{
            "key": 0,
            "value": 7
        }
    ]
  $ ./bpftool -jp map dump id 2
    [{
        "key": ["0x00","0x00","0x00","0x00"
        ],
        "value": ["0x07","0x00","0x00","0x00"
        ],
        "formatted": {
            "key": 0,
            "value": 7
        }
    }]

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/bpf/bpftool/map.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 9c55077ca5dd..af8ad32fa6e9 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -673,12 +673,6 @@ static int do_dump(int argc, char **argv)
 	if (fd < 0)
 		return -1;
 
-	if (map_is_map_of_maps(info.type) || map_is_map_of_progs(info.type)) {
-		p_err("Dumping maps of maps and program maps not supported");
-		close(fd);
-		return -1;
-	}
-
 	key = malloc(info.key_size);
 	value = alloc_value(&info);
 	if (!key || !value) {
@@ -732,7 +726,9 @@ static int do_dump(int argc, char **argv)
 				} else {
 					print_entry_plain(&info, key, value);
 				}
-		} else {
+			num_elems++;
+		} else if (!map_is_map_of_maps(info.type) &&
+			   !map_is_map_of_progs(info.type)) {
 			if (json_output) {
 				jsonw_name(json_wtr, "key");
 				print_hex_data_json(key, info.key_size);
@@ -749,7 +745,6 @@ static int do_dump(int argc, char **argv)
 		}
 
 		prev_key = key;
-		num_elems++;
 	}
 
 	if (json_output)
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 1/2] bpf: add bpffs pretty print for program array map
From: Yonghong Song @ 2018-09-07  0:26 UTC (permalink / raw)
  To: ast, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180907002605.1408960-1-yhs@fb.com>

Added bpffs pretty print for program array map. For a particular
array index, if the program array points to a valid program,
the "<index>: <prog_id>" will be printed out like
   0: 6
which means bpf program with id "6" is installed at index "0".

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 kernel/bpf/arraymap.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index f9d24121be99..dded84cbe814 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -553,6 +553,29 @@ static void bpf_fd_array_map_clear(struct bpf_map *map)
 		fd_array_map_delete_elem(map, &i);
 }
 
+static void prog_array_map_seq_show_elem(struct bpf_map *map, void *key,
+					 struct seq_file *m)
+{
+	void **elem, *ptr;
+	u32 prog_id;
+
+	rcu_read_lock();
+
+	elem = array_map_lookup_elem(map, key);
+	if (elem) {
+		ptr = READ_ONCE(*elem);
+		if (ptr) {
+			seq_printf(m, "%u: ", *(u32 *)key);
+			prog_id = prog_fd_array_sys_lookup_elem(ptr);
+			btf_type_seq_show(map->btf, map->btf_value_type_id,
+					  &prog_id, m);
+			seq_puts(m, "\n");
+		}
+	}
+
+	rcu_read_unlock();
+}
+
 const struct bpf_map_ops prog_array_map_ops = {
 	.map_alloc_check = fd_array_map_alloc_check,
 	.map_alloc = array_map_alloc,
@@ -564,7 +587,7 @@ const struct bpf_map_ops prog_array_map_ops = {
 	.map_fd_put_ptr = prog_fd_array_put_ptr,
 	.map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem,
 	.map_release_uref = bpf_fd_array_map_clear,
-	.map_check_btf = map_check_no_btf,
+	.map_seq_show_elem = prog_array_map_seq_show_elem,
 };
 
 static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file,
-- 
2.17.1

^ permalink raw reply related

* Re: [**EXTERNAL**] Re: VRF with enslaved L3 enabled bridge
From: David Ahern @ 2018-09-07  0:26 UTC (permalink / raw)
  To: D'Souza, Nelson, netdev@vger.kernel.org
In-Reply-To: <870CEE3F-48B1-448A-B986-FF0EF5140980@ciena.com>

On 9/5/18 12:00 PM, D'Souza, Nelson wrote:
> Just following up.... would you be able to confirm that this is a Linux VRF issue? 

I can confirm that I can reproduce the problem. Need to find time to dig
into it.

^ permalink raw reply

* Re: linux-next: build failure after merge of the bpf-next tree
From: Björn Töpel @ 2018-09-07  5:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: sfr, Daniel Borkmann, ast, Netdev, linux-next, LKML,
	Björn Töpel
In-Reply-To: <20180907002223.tpizglyhrwn26oox@ast-mbp.dhcp.thefacebook.com>

Den fre 7 sep. 2018 kl 02:23 skrev Alexei Starovoitov
<alexei.starovoitov@gmail.com>:
>
> On Fri, Sep 07, 2018 at 10:19:23AM +1000, Stephen Rothwell wrote:
> > Hi all,
> >
> > After merging the bpf-next tree, today's linux-next build (powerpc
> > ppc64_defconfig) failed like this:
> >
> > ERROR: ".xsk_reuseq_swap" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined!
> > ERROR: ".xsk_reuseq_free" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined!
> > ERROR: ".xsk_reuseq_prepare" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined
> >
> > Caused by commit
> >
> >   9654bd10da60 ("i40e: clean zero-copy XDP Rx ring on shutdown/reset")
> >
> > CONFIG_XDP_SOCKETS is not set for this build.
> >
> > I have used the version of the bfp-next tree from next-20180906 for today.
>
> merge conflict and build error...
> Bjorn, I'm thinking to toss the patches out of bpf-next and reapply
> cleaned up version of the patches...
> what do you think?
>

Yes, do that. I'll get back with a cleaned up v2.


Thanks,
Björn

^ permalink raw reply

* [PATCH] PCI: Reprogram bridge prefetch registers on resume
From: Daniel Drake @ 2018-09-07  5:36 UTC (permalink / raw)
  To: bhelgaas-hpIqsD4AKlfQT0dZR+AlfA
  Cc: mika.westerberg-VuQAYsv1563Yd54FQh9/CA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	rafael.j.wysocki-ral2JQCrhuEAvxtiuMwx3w,
	nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ,
	keith.busch-ral2JQCrhuEAvxtiuMwx3w, netdev-u79uwXL29TY76Z2rM5mHXA,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	andriy.shevchenko-VuQAYsv1563Yd54FQh9/CA,
	linux-6IF/jdPJHihWk0Htik3J/w, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	jonathan.derrick-ral2JQCrhuEAvxtiuMwx3w,
	hkallweit1-Re5JQEeQqe8AvxtiuMwx3w

On 38+ Intel-based Asus products, the nvidia GPU becomes unusable
after S3 suspend/resume. The affected products include multiple
generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
many errors such as:

    fifo: fault 00 [READ] at 0000005555555000 engine 00 [GR] client 04 [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
    DRM: failed to idle channel 0 [DRM]

Similarly, the nvidia proprietary driver also fails after resume
(black screen, 100% CPU usage in Xorg process). We shipped a sample
to Nvidia for diagnosis, and their response indicated that it's a
problem with the parent PCI bridge (on the Intel SoC), not the GPU.

Runtime suspend/resume works fine, only S3 suspend is affected.

We found a workaround: on resume, rewrite the Intel PCI bridge
'Prefetchable Base Upper 32 Bits' register (PCI_PREF_BASE_UPPER32). In
the cases that I checked, this register has value 0 and we just have to
rewrite that value.

It's very strange that rewriting the exact same register value
makes a difference, but it definitely makes the issue go away.
It's not just acting as some kind of memory barrier, because rewriting
other bridge registers does not work around the issue. There's something
magic in this particular register. We have confirmed this on all
the affected models we have in-hands (X542UQ, UX533FD, X530UN, V272UN).

Additionally, this workaround solves an issue where r8169 MSI-X
interrupts were broken after S3 suspend/resume on Asus X441UAR. This
issue was recently worked around in commit 7bb05b85bc2d ("r8169:
don't use MSI-X on RTL8106e"). It also fixes the same issue on
RTL6186evl/8111evl on an Aimfor-tech laptop that we had not yet
patched. I suspect it will also fix the issue that was worked around in
commit 7c53a722459c ("r8169: don't use MSI-X on RTL8168g").

Thomas Martitz reports that this workaround also solves an issue where
the AMD Radeon Polaris 10 GPU on the HP Zbook 14u G5 is unresponsive
after S3 suspend/resume.

From our testing, the affected Intel PCI bridges are:
Intel Skylake PCIe Controller (x16) [8086:1901] (rev 05)
Intel Skylake PCIe Controller (x16) [8086:1901] (rev 07)
Intel Device [8086:31d8] (rev f3)
Intel Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port B #1 [8086:5ad6] (rev fb)
Intel Celeron N3350/Pentium N4200/Atom E3900 Series PCI Express Port A #1 [8086:5ad8] (rev fb)
Intel Sunrise Point-LP PCI Express Root Port [8086:9d10] (rev f1)
Intel Sunrise Point-LP PCI Express Root Port #5 [8086:9d14] (rev f1)
Intel Device [8086:9dbc] (rev f0)

On resume, reprogram the PCI bridge prefetch registers, including the
magic register mentioned above.

This matches Win10 behaviour, which also rewrites these registers
during S3 resume (checked with qemu tracing).

Link: https://marc.info/?i=CAD8Lp46Y2eOR7WE28xToUL8s-aYiqPa0nS=1GSD0AxkddXq6+A@mail.gmail.com
Link: https://bugs.freedesktop.org/show_bug.cgi?id=105760
Signed-off-by: Daniel Drake <drake@endlessm.com>
---

Notes:
    Replaces patch:
       PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
    
    Below is the list of Asus products with Intel/Nvidia that we
    believe are affected by the GPU resume issue.
    
    I revised my counting method from my last patch to eliminate duplicate
    platforms that had multiple SKUs with the same DMI/GPU/bridge, that's why
    the product count reduced from 43 to 38.
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: FX502VD
    product_name: FX502VD
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev ff) (prog-if ff)
    	!!! Unknown header type 7f
    00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: FX570UD
    product_name: ASUS Gaming FX570UD
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1f40]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: GL553VD
    product_name: GL553VD
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
    00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: GL753VD
    product_name: GL753VD
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1590]
    00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: K401UQK
    product_name: K401UQK
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:14b0]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: P1440UF
    product_name: ASUSPRO P1440UF
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:174d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1f10]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: P2440UQ
    product_name: P2440UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:13ce]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: P2540NV
    product_name: P2540NV
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134f] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:17f0]
    00:13.0 PCI bridge [0604]: Intel Corporation Device [8086:5ad8] (rev fb) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: P2540UV
    product_name: P2540UV
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134f] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:132e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: P4540UQ
    product_name: P4540UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1650]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: UX331UN
    product_name: UX331UN
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1d12] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:15de]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: UX410UQK
    product_name: UX410UQK
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:138e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: UX430UQ
    product_name: UX430UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:139e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: UX533FD
    product_name: ZenBook UX533FD_UX533FD
    02:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. GP107M [GeForce GTX 1050 Mobile] [1043:14a1]
    00:1c.4 PCI bridge [0604]: Intel Corporation Device [8086:9dbc] (rev f0) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: V221ID
    product_name: V221ID
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134f] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:15f0]
    00:13.0 PCI bridge [0604]: Intel Corporation Device [8086:5ad8] (rev fb) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: V272UN
    product_name: Vivo AIO 27 V272UN
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1d10] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:17be]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X430UN
    product_name: VivoBook S14 X430UN
    01:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX150] [10de:1d10] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. GP108M [GeForce MX150] [1043:199e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X441MB
    product_name: X441MB
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:174e] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:171e]
    00:13.0 PCI bridge [0604]: Intel Corporation Device [8086:31d8] (rev f3) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X456UF
    product_name: X456UF
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1346] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:245a]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X510UQ
    product_name: X510UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:145e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X530UN
    product_name: VivoBook S15 X530UN
    01:00.0 3D controller [0302]: NVIDIA Corporation GP108M [GeForce MX150] [10de:1d10] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. GP108M [GeForce MX150] [1043:18ce]
    00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X541UV
    product_name: X541UV
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134f] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:11ee]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X542UN
    product_name: X542UN
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1d10] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1b10]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X542UQ
    product_name: X542UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation GM108M [GeForce 940MX] [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. GM108M [GeForce 940MX] [1043:142e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X555UB
    product_name: X555UB
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1347] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:246a]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X555UQ
    product_name: X555UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation GM108M [GeForce 940MX] [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. GM108M [GeForce 940MX] [1043:246a]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X556URK
    product_name: X556URK
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134e] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1490]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X570ZD
    product_name: VivoBook_ASUS Laptop X570ZD
    01:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. GP107M [GeForce GTX 1050 Mobile] [1043:11d1]
    00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:15d3] (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X580GD
    product_name: VivoBook_ASUSLaptop X580GD_X580GD
    01:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. GP107M [GeForce GTX 1050 Mobile] [1043:1fc0]
    00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 07) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X580VD
    product_name: X580VD
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1a10]
    00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X705FD
    product_name: VivoBook Pro 17 X705FD_X705FD
    02:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. GP107M [GeForce GTX 1050 Mobile] [1043:1431]
    00:1c.4 PCI bridge [0604]: Intel Corporation Device [8086:9dbc] (rev f0) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X705UD
    product_name: X705UD
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1b30]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X705UQ
    product_name: X705UQ
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:148e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: X751NV
    product_name: X751NV
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134f] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:13be]
    00:13.0 PCI bridge [0604]: Intel Corporation Device [8086:5ad8] (rev fb) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: Z240IE
    product_name: Z240IE
    01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c8d] (rev a1) (prog-if 00 [VGA controller])
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1750]
    00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: ZN220IC-K
    product_name: ZN220IC-K
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134e] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:117e]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: ZN241IC
    product_name: ZN241IC
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1900]
    00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
    
    sys_vendor: ASUSTeK COMPUTER INC.
    board_name: ZN270IE
    product_name: ZN270IE
    01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:134d] (rev a2)
    	Subsystem: ASUSTeK Computer Inc. Device [1043:1720]
    00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])

 drivers/pci/pci-driver.c | 14 ++++++++++++++
 drivers/pci/setup-bus.c  |  2 +-
 include/linux/pci.h      |  1 +
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index bef17c3fca67..034f816570ad 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -524,6 +524,20 @@ static void pci_pm_default_resume_early(struct pci_dev *pci_dev)
 	pci_power_up(pci_dev);
 	pci_restore_state(pci_dev);
 	pci_pme_restore(pci_dev);
+
+	/*
+	 * Redo the PCI bridge prefetch register setup.
+	 *
+	 * This works around an Intel PCI bridge issue seen on Asus and HP
+	 * laptops, where the GPU is not usable after S3 resume.
+	 * Even though PCI bridge register contents appear to be intact
+	 * at resume time, rewriting the value of PREF_BASE_UPPER32 is
+	 * required to make the GPU work.
+	 * Windows 10 also reprograms these registers during S3 resume.
+	 */
+	if (pci_dev->class == PCI_CLASS_BRIDGE_PCI << 8)
+		pci_setup_bridge_mmio_pref(pci_dev);
+
 	pci_fixup_device(pci_fixup_resume_early, pci_dev);
 }
 
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 79b1824e83b4..cb88288d2a69 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -630,7 +630,7 @@ static void pci_setup_bridge_mmio(struct pci_dev *bridge)
 	pci_write_config_dword(bridge, PCI_MEMORY_BASE, l);
 }
 
-static void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
+void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
 {
 	struct resource *res;
 	struct pci_bus_region region;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e72ca8dd6241..b15828fc26a4 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -934,6 +934,7 @@ struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);
 void pci_device_add(struct pci_dev *dev, struct pci_bus *bus);
 unsigned int pci_scan_child_bus(struct pci_bus *bus);
 void pci_bus_add_device(struct pci_dev *dev);
+void pci_setup_bridge_mmio_pref(struct pci_dev *bridge);
 void pci_read_bridge_bases(struct pci_bus *child);
 struct resource *pci_find_parent_resource(const struct pci_dev *dev,
 					  struct resource *res);
-- 
2.17.1

_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply related

* [PATCH 1/2] dt-bindings: marvell,prestera: Add common compatible string
From: Chris Packham @ 2018-09-07  0:59 UTC (permalink / raw)
  To: robh+dt, gregory.clement
  Cc: jason, davem, andrew, sebastian.hesselbarth, devicetree,
	linux-arm-kernel, linux-kernel, netdev, Chris Packham,
	Mark Rutland
In-Reply-To: <20180907005926.27134-1-chris.packham@alliedtelesis.co.nz>

Add "marvell,prestera" as a compatible string so that drivers can be
written to account for any prestera variant without needing to
specialise to the more specific values.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
---
 Documentation/devicetree/bindings/net/marvell,prestera.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/marvell,prestera.txt b/Documentation/devicetree/bindings/net/marvell,prestera.txt
index c329608fa887..83370ebf5b89 100644
--- a/Documentation/devicetree/bindings/net/marvell,prestera.txt
+++ b/Documentation/devicetree/bindings/net/marvell,prestera.txt
@@ -2,7 +2,7 @@ Marvell Prestera Switch Chip bindings
 -------------------------------------
 
 Required properties:
-- compatible: one of the following
+- compatible: must be "marvell,prestera" and one of the following
 	"marvell,prestera-98dx3236",
 	"marvell,prestera-98dx3336",
 	"marvell,prestera-98dx4251",
@@ -21,7 +21,7 @@ switch {
 	ranges = <0 MBUS_ID(0x03, 0x00) 0 0x100000>;
 
 	packet-processor@0 {
-		compatible = "marvell,prestera-98dx3236";
+		compatible = "marvell,prestera-98dx3236", "marvell,prestera";
 		reg = <0 0x4000000>;
 		interrupts = <33>, <34>, <35>;
 		dfx = <&dfx>;
-- 
2.18.0

^ permalink raw reply related

* [PATCH 2/2] ARM: dts: mvebu: add "marvell,prestera" to PP nodes
From: Chris Packham @ 2018-09-07  0:59 UTC (permalink / raw)
  To: robh+dt, gregory.clement
  Cc: jason, davem, andrew, sebastian.hesselbarth, devicetree,
	linux-arm-kernel, linux-kernel, netdev, Chris Packham,
	Mark Rutland
In-Reply-To: <20180907005926.27134-1-chris.packham@alliedtelesis.co.nz>

The compatible string "marvell,prestera" allows drivers to have code
common to any prestera variant.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
---
 arch/arm/boot/dts/armada-xp-98dx3236.dtsi | 2 +-
 arch/arm/boot/dts/armada-xp-98dx3336.dtsi | 2 +-
 arch/arm/boot/dts/armada-xp-98dx4251.dtsi | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/boot/dts/armada-xp-98dx3236.dtsi b/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
index 8d708cc22495..2185ea58abfe 100644
--- a/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
+++ b/arch/arm/boot/dts/armada-xp-98dx3236.dtsi
@@ -243,7 +243,7 @@
 			ranges = <0 MBUS_ID(0x03, 0x00) 0 0x100000>;
 
 			pp0: packet-processor@0 {
-				compatible = "marvell,prestera-98dx3236";
+				compatible = "marvell,prestera-98dx3236", "marvell,prestera";
 				reg = <0 0x4000000>;
 				interrupts = <33>, <34>, <35>;
 				dfx = <&dfx>;
diff --git a/arch/arm/boot/dts/armada-xp-98dx3336.dtsi b/arch/arm/boot/dts/armada-xp-98dx3336.dtsi
index 2f5fc67dd6dc..1d9d8a8ea60c 100644
--- a/arch/arm/boot/dts/armada-xp-98dx3336.dtsi
+++ b/arch/arm/boot/dts/armada-xp-98dx3336.dtsi
@@ -35,5 +35,5 @@
 };
 
 &pp0 {
-	compatible = "marvell,prestera-98dx3336";
+	compatible = "marvell,prestera-98dx3336", "marvell,prestera";
 };
diff --git a/arch/arm/boot/dts/armada-xp-98dx4251.dtsi b/arch/arm/boot/dts/armada-xp-98dx4251.dtsi
index 7a9e8839880b..48ffdc72bfc7 100644
--- a/arch/arm/boot/dts/armada-xp-98dx4251.dtsi
+++ b/arch/arm/boot/dts/armada-xp-98dx4251.dtsi
@@ -49,6 +49,6 @@
 };
 
 &pp0 {
-	compatible = "marvell,prestera-98dx4251";
+	compatible = "marvell,prestera-98dx4251", "marvell,prestera";
 	interrupts = <33>, <34>, <35>, <36>;
 };
-- 
2.18.0

^ permalink raw reply related

* Re: linux-next: build failure after merge of the bpf-next tree
From: Alexei Starovoitov @ 2018-09-07  5:45 UTC (permalink / raw)
  To: Björn Töpel
  Cc: sfr, Daniel Borkmann, ast, Netdev, linux-next, LKML,
	Björn Töpel
In-Reply-To: <CAJ+HfNgRd5ztzwJ9HtgnrDyf8F0ykODY_EGC91d9KiAfDbhcXA@mail.gmail.com>

On Fri, Sep 07, 2018 at 07:21:05AM +0200, Björn Töpel wrote:
> Den fre 7 sep. 2018 kl 02:23 skrev Alexei Starovoitov
> <alexei.starovoitov@gmail.com>:
> >
> > On Fri, Sep 07, 2018 at 10:19:23AM +1000, Stephen Rothwell wrote:
> > > Hi all,
> > >
> > > After merging the bpf-next tree, today's linux-next build (powerpc
> > > ppc64_defconfig) failed like this:
> > >
> > > ERROR: ".xsk_reuseq_swap" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined!
> > > ERROR: ".xsk_reuseq_free" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined!
> > > ERROR: ".xsk_reuseq_prepare" [drivers/net/ethernet/intel/i40e/i40e.ko] undefined
> > >
> > > Caused by commit
> > >
> > >   9654bd10da60 ("i40e: clean zero-copy XDP Rx ring on shutdown/reset")
> > >
> > > CONFIG_XDP_SOCKETS is not set for this build.
> > >
> > > I have used the version of the bfp-next tree from next-20180906 for today.
> >
> > merge conflict and build error...
> > Bjorn, I'm thinking to toss the patches out of bpf-next and reapply
> > cleaned up version of the patches...
> > what do you think?
> >
> 
> Yes, do that. I'll get back with a cleaned up v2.

Done.

Unfortunately during interactive rebase that removed your commits git didn't
preserve merge commits that came after yours,
so Jesper's and Yonghong's cover letters are not in the git history.
Yet all patches are in the same order.
Explicit revert with trail in the git history would have been worse.
patchworks + git isn't the most convenient workflow. sigh.

^ permalink raw reply

* Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Olof Johansson @ 2018-09-07  6:20 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Eric Dumazet, David Miller, Neil Horman, Marcelo Ricardo Leitner,
	Vladislav Yasevich, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	linux-crypto, LKML, linux-sctp, netdev, linux-decnet-user,
	kernel-team
In-Reply-To: <20180907033257.2nlgiqm2t4jiwhzc@gondor.apana.org.au>

Hi,

On Thu, Sep 6, 2018 at 8:32 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
>> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson <olof@lixom.net> wrote:
>> >
>> > Today these are all global shared variables per protocol, and in
>> > particular tcp_memory_allocated can get hot on a system with
>> > large number of CPUs and a substantial number of connections.
>> >
>> > Moving it over to a per-cpu variable makes it significantly cheaper,
>> > and the added overhead when summing up the percpu copies is still smaller
>> > than the cost of having a hot cacheline bouncing around.
>>
>> I am curious. We never noticed contention on this variable, at least for TCP.
>
> Yes these variables are heavily amortised so I'm surprised that
> they would cause much contention.
>
>> Please share some numbers with us.
>
> Indeed.

Certainly, just had to collect them again.

This is on a dual xeon box, with ~150-200k TCP connections. I see
about .7% CPU spent in __sk_mem_{reduce,raise}_allocated in the
inlined atomic ops, most of those in reduce.

Call path for reduce is practically all from tcp_write_timer on softirq:

                __sk_mem_reduce_allocated
                tcp_write_timer
                call_timer_fn
                run_timer_softirq
                __do_softirq
                irq_exit
                smp_apic_timer_interrupt
                apic_timer_interrupt
                cpuidle_enter_state

With this patch, I see about .18+.11+.07 = .36% in percpu-related
functions called from the same __sk_mem functions.

Now, that's a halving of cycles samples on that specific setup. The
real difference though, is on another platform where atomics are more
expensive. There, this makes a significant difference. Unfortunately,
I can't share specifics but I think this change stands on its own on
the dual xeon setup as well, maybe with slightly less strong wording
on just how hot the variable/line happens to be.

-Olof

^ permalink raw reply

* Re: [Intel-wired-lan] e1000e driver stuck at 10Mbps after reconnection
From: Camille Bordignon @ 2018-09-07  6:28 UTC (permalink / raw)
  To: Neftin, Sasha
  Cc: Alexander Duyck, Netdev, intel-wired-lan, David S. Miller,
	linux-kernel
In-Reply-To: <c1e2cf13-6bcc-498d-03ea-7c3792d675cb@intel.com>

Le mercredi 08 août 2018 à 18:00:28 (+0300), Neftin, Sasha a écrit :
> On 8/8/2018 17:24, Neftin, Sasha wrote:
> > On 8/7/2018 09:42, Camille Bordignon wrote:
> > > Le lundi 06 août 2018 à 15:45:29 (-0700), Alexander Duyck a écrit :
> > > > On Mon, Aug 6, 2018 at 4:59 AM, Camille Bordignon
> > > > <camille.bordignon@easymile.com> wrote:
> > > > > Hello,
> > > > > 
> > > > > Recently we experienced some issues with intel NIC (I219-LM
> > > > > and I219-V).
> > > > > It seems that after a wire reconnection, auto-negotation "fails" and
> > > > > link speed drips to 10 Mbps.
> > > > > 
> > > > >  From kernel logs:
> > > > > [17616.346150] e1000e: enp0s31f6 NIC Link is Down
> > > > > [17627.003322] e1000e: enp0s31f6 NIC Link is Up 10 Mbps Full
> > > > > Duplex, Flow Control: None
> > > > > [17627.003325] e1000e 0000:00:1f.6 enp0s31f6: 10/100 speed:
> > > > > disabling TSO
> > > > > 
> > > > > 
> > > > > $ethtool enp0s31f6
> > > > > Settings for enp0s31f6:
> > > > >          Supported ports: [ TP ]
> > > > >          Supported link modes:   10baseT/Half 10baseT/Full
> > > > >                                  100baseT/Half 100baseT/Full
> > > > >                                  1000baseT/Full
> > > > >          Supported pause frame use: No
> > > > >          Supports auto-negotiation: Yes
> > > > >          Supported FEC modes: Not reported
> > > > >          Advertised link modes:  10baseT/Half 10baseT/Full
> > > > >                                  100baseT/Half 100baseT/Full
> > > > >                                  1000baseT/Full
> > > > >          Advertised pause frame use: No
> > > > >          Advertised auto-negotiation: Yes
> > > > >          Advertised FEC modes: Not reported
> > > > >          Speed: 10Mb/s
> > > > >          Duplex: Full
> > > > >          Port: Twisted Pair
> > > > >          PHYAD: 1
> > > > >          Transceiver: internal
> > > > >          Auto-negotiation: on
> > > > >          MDI-X: on (auto)
> > > > >          Supports Wake-on: pumbg
> > > > >          Wake-on: g
> > > > >          Current message level: 0x00000007 (7)
> > > > >                                 drv probe link
> > > > >          Link detected: yes
> > > > > 
> > > > > 
> > > > > Notice that if disconnection last less than about 5 seconds,
> > > > > nothing wrong happens.
> > > > > And if after last failure, disconnection / connection occurs again and
> > > > > last less than 5 seconds, link speed is back to 1000 Mbps.
> > > > > 
> > > > > [18075.350678] e1000e: enp0s31f6 NIC Link is Down
> > > > > [18078.716245] e1000e: enp0s31f6 NIC Link is Up 1000 Mbps
> > > > > Full Duplex, Flow Control: None
> > > > > 
> > > > > The following patch seems to fix this issue.
> > > > > However I don't clearly understand why.
> > > > > 
> > > > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > index 3ba0c90e7055..763c013960f1 100644
> > > > > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > > > @@ -5069,7 +5069,7 @@ static bool e1000e_has_link(struct
> > > > > e1000_adapter *adapter)
> > > > >          case e1000_media_type_copper:
> > > > >                  if (hw->mac.get_link_status) {
> > > > >                          ret_val = hw->mac.ops.check_for_link(hw);
> > > > > -                       link_active = !hw->mac.get_link_status;
> > > > > +                       link_active = false;
> > > > >                  } else {
> > > > >                          link_active = true;
> > > > >                  }
> > > > > 
> > > > > Maybe this is related to watchdog task.
> > > > > 
> > > > > I've found out this fix by comparing with last commit that works fine :
> > > > > commit 0b76aae741abb9d16d2c0e67f8b1e766576f897d.
> > > > > However I don't know if this information is relevant.
> > > > > 
> > > > > Thank you.
> > > > > Camille Bordignon
> > > > 
> > > > What kernel were you testing this on? I know there have been a number
> > > > of changes over the past few months in this area and it would be
> > > > useful to know exactly what code base you started out with and what
> > > > the latest version of the kernel is you have tested.
> > > > 
> > > > Looking over the code change the net effect of it should be to add a 2
> > > > second delay from the time the link has changed until you actually
> > > > check the speed/duplex configuration. It is possible we could be
> > > > seeing some sort of timing issue and adding the 2 second delay after
> > > > the link event is enough time for things to stabilize and detect the
> > > > link at 1000 instead of 10/100.
> > > > 
> > > > - Alex
> > > 
> > > We've found out this issue using Fedora 27 (4.17.11-100.fc27.x86_64).
> > > 
> > > Then I've tested wth a more recent version of the driver v4.18-rc7 but
> > > behavior looks the same.
> > > 
> > > Thanks for you reply.
> > > 
> > > Camille Bordignon
> > > _______________________________________________
> > > Intel-wired-lan mailing list
> > > Intel-wired-lan@osuosl.org
> > > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> > > 
> > I've agree with Alex. Let's try add 2s delay after a link event. Please,
> > let us know if it will solve your problem.
> > Also, I would like recommend try work with different link partner and
> > see if you see same problem.
> > _______________________________________________
> > Intel-wired-lan mailing list
> > Intel-wired-lan@osuosl.org
> > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> Camille,
> My apologies, I wrong understand Alex. Please, do not try add delay. Please,
> check if you see same problem with different link partners.
> Thanks,
> Sasha
Hello,

I recently figured out that neither the previous patch nor commit
0b76aae741abb9d16d2c0e67f8b1e766576f897d fix this issue.

In these cases, after reproducing the issue, when ethernet wire is connected
kernel logs mention full speed (1000 Mbps) but actually it seems it is not.
The problem persists.

I made some simple tests with a web browser and a dowload speed test
site (fast.com). As a result speed was always around 8 Mbps.

And again after a quick ethernet wire disconnection / reconnection, same
test is around 500 Mbps.

I feel very confused.

Camille Bordignon

^ permalink raw reply

* [PATCH] can: rcar_can: convert to SPDX identifiers
From: Kuninori Morimoto @ 2018-09-07  2:00 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde
  Cc: Linux-Renesas, Linux-Net, linux-can


From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>

This patch updates license to use SPDX-License-Identifier
instead of verbose license text.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 drivers/net/can/rcar/Kconfig      | 1 +
 drivers/net/can/rcar/Makefile     | 1 +
 drivers/net/can/rcar/rcar_can.c   | 6 +-----
 drivers/net/can/rcar/rcar_canfd.c | 6 +-----
 4 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/net/can/rcar/Kconfig b/drivers/net/can/rcar/Kconfig
index 7b03a3a..bd5a8fc 100644
--- a/drivers/net/can/rcar/Kconfig
+++ b/drivers/net/can/rcar/Kconfig
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
 config CAN_RCAR
 	tristate "Renesas R-Car CAN controller"
 	depends on ARCH_RENESAS || ARM
diff --git a/drivers/net/can/rcar/Makefile b/drivers/net/can/rcar/Makefile
index 08de36a..c9185b0 100644
--- a/drivers/net/can/rcar/Makefile
+++ b/drivers/net/can/rcar/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
 #
 #  Makefile for the Renesas R-Car CAN & CAN FD controller drivers
 #
diff --git a/drivers/net/can/rcar/rcar_can.c b/drivers/net/can/rcar/rcar_can.c
index 11662f4..06f90a0 100644
--- a/drivers/net/can/rcar/rcar_can.c
+++ b/drivers/net/can/rcar/rcar_can.c
@@ -1,12 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0+
 /* Renesas R-Car CAN device driver
  *
  * Copyright (C) 2013 Cogent Embedded, Inc. <source@cogentembedded.com>
  * Copyright (C) 2013 Renesas Solutions Corp.
- *
- * This program is free software; you can redistribute  it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
  */
 
 #include <linux/module.h>
diff --git a/drivers/net/can/rcar/rcar_canfd.c b/drivers/net/can/rcar/rcar_canfd.c
index 602c19e..0541000 100644
--- a/drivers/net/can/rcar/rcar_canfd.c
+++ b/drivers/net/can/rcar/rcar_canfd.c
@@ -1,11 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0+
 /* Renesas R-Car CAN FD device driver
  *
  * Copyright (C) 2015 Renesas Electronics Corp.
- *
- * This program is free software; you can redistribute  it and/or modify it
- * under  the terms of  the GNU General  Public License as published by the
- * Free Software Foundation;  either version 2 of the  License, or (at your
- * option) any later version.
  */
 
 /* The R-Car CAN FD controller can operate in either one of the below two modes
-- 
2.7.4

^ permalink raw reply related

* [PATCH] crypto: Adds user space interface for ALG_SET_KEY_TYPE
From: Kalyani Akula @ 2018-09-07  6:40 UTC (permalink / raw)
  To: herbert, davem, kstewart, gregkh, tglx, pombredanne, linux-crypto,
	linux-kernel, netdev, saratcha
  Cc: Kalyani Akula

ALG_SET_KEY_TYPE requires caller to pass the key_type to be used
for AES encryption/decryption.

Sometimes the cipher key will be stored in the device's
hardware (eFuse, BBRAM etc).So,there is a need to specify the information
about the key-type to use it for Encrypt or Decrypt operations.

This patch implements the above requirement.


Signed-off-by: Kalyani Akula <kalyani.akula@xilinx.com>
---
 crypto/af_alg.c             |  7 +++++++
 crypto/algif_skcipher.c     | 12 ++++++++++++
 crypto/blkcipher.c          |  9 +++++++++
 crypto/skcipher.c           | 18 ++++++++++++++++++
 include/crypto/if_alg.h     |  2 ++
 include/crypto/skcipher.h   | 12 +++++++++++-
 include/linux/crypto.h      | 12 ++++++++++++
 include/uapi/linux/if_alg.h |  1 +
 8 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index b053179..9ea6b86 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -261,6 +261,13 @@ static int alg_setsockopt(struct socket *sock, int level, int optname,
                if (!type->setauthsize)
                        goto unlock;
                err = type->setauthsize(ask->private, optlen);
+               break;
+       case ALG_SET_KEY_TYPE:
+               if (sock->state == SS_CONNECTED)
+                       goto unlock;
+               if (!type->setkeytype)
+                       goto unlock;
+               err = type->setkeytype(ask->private, optval, optlen);
        }

 unlock:
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index cfdaab2..9164465 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -320,6 +320,17 @@ static int skcipher_setkey(void *private, const u8 *key, unsigned int keylen)
        return crypto_skcipher_setkey(private, key, keylen);
 }

+static int skcipher_setkeytype(void *private, const u8 *key,
+                              unsigned int keylen)
+{
+       struct skcipher_tfm *tfm = private;
+       int err;
+
+       err = crypto_skcipher_setkeytype(tfm->skcipher, key, keylen);
+
+       return err;
+}
+
 static void skcipher_sock_destruct(struct sock *sk)
 {
        struct alg_sock *ask = alg_sk(sk);
@@ -384,6 +395,7 @@ static int skcipher_accept_parent(void *private, struct sock *sk)
        .bind           =       skcipher_bind,
        .release        =       skcipher_release,
        .setkey         =       skcipher_setkey,
+       .setkeytype     =       skcipher_setkeytype,
        .accept         =       skcipher_accept_parent,
        .accept_nokey   =       skcipher_accept_parent_nokey,
        .ops            =       &algif_skcipher_ops,
diff --git a/crypto/blkcipher.c b/crypto/blkcipher.c
index f93abf1..3ea0e7f 100644
--- a/crypto/blkcipher.c
+++ b/crypto/blkcipher.c
@@ -408,6 +408,14 @@ static int setkey(struct crypto_tfm *tfm, const u8 *key, unsigned int keylen)
        return cipher->setkey(tfm, key, keylen);
 }

+static int setkeytype(struct crypto_tfm *tfm, const u8 *key,
+                     unsigned int keylen)
+{
+       struct blkcipher_alg *cipher = &tfm->__crt_alg->cra_blkcipher;
+
+       return cipher->setkeytype(tfm, key, keylen);
+}
+
 static int async_setkey(struct crypto_ablkcipher *tfm, const u8 *key,
                        unsigned int keylen)
 {
@@ -478,6 +486,7 @@ static int crypto_init_blkcipher_ops_sync(struct crypto_tfm *tfm)
        unsigned long addr;

        crt->setkey = setkey;
+       crt->setkeytype = setkeytype;
        crt->encrypt = alg->encrypt;
        crt->decrypt = alg->decrypt;

diff --git a/crypto/skcipher.c b/crypto/skcipher.c
index 0bd8c6c..cb794dd 100644
--- a/crypto/skcipher.c
+++ b/crypto/skcipher.c
@@ -604,6 +604,23 @@ static int skcipher_setkey_blkcipher(struct crypto_skcipher *tfm,
        return 0;
 }

+static int skcipher_setkeytype_blkcipher(struct crypto_skcipher *tfm,
+                                        const u8 *key, unsigned int keylen)
+{
+       struct crypto_blkcipher **ctx = crypto_skcipher_ctx(tfm);
+       struct crypto_blkcipher *blkcipher = *ctx;
+       int err;
+
+       crypto_blkcipher_clear_flags(blkcipher, ~0);
+       crypto_blkcipher_set_flags(blkcipher, crypto_skcipher_get_flags(tfm) &
+                       CRYPTO_TFM_REQ_MASK);
+       err = crypto_blkcipher_setkeytype(blkcipher, key, keylen);
+       crypto_skcipher_set_flags(tfm, crypto_blkcipher_get_flags(blkcipher) &
+                       CRYPTO_TFM_RES_MASK);
+
+       return err;
+}
+
 static int skcipher_crypt_blkcipher(struct skcipher_request *req,
                                    int (*crypt)(struct blkcipher_desc *,
                                                 struct scatterlist *,
@@ -670,6 +687,7 @@ static int crypto_init_skcipher_ops_blkcipher(struct crypto_tfm *tfm)
        tfm->exit = crypto_exit_skcipher_ops_blkcipher;

        skcipher->setkey = skcipher_setkey_blkcipher;
+       skcipher->setkeytype = skcipher_setkeytype_blkcipher;
        skcipher->encrypt = skcipher_encrypt_blkcipher;
        skcipher->decrypt = skcipher_decrypt_blkcipher;

diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 482461d..202298e 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -51,6 +51,8 @@ struct af_alg_type {
        void *(*bind)(const char *name, u32 type, u32 mask);
        void (*release)(void *private);
        int (*setkey)(void *private, const u8 *key, unsigned int keylen);
+       int (*setkeytype)(void *private, const u8 *keytype,
+                         unsigned int keylen);
        int (*accept)(void *private, struct sock *sk);
        int (*accept_nokey)(void *private, struct sock *sk);
        int (*setauthsize)(void *private, unsigned int authsize);
diff --git a/include/crypto/skcipher.h b/include/crypto/skcipher.h
index 2f327f0..54f6752 100644
--- a/include/crypto/skcipher.h
+++ b/include/crypto/skcipher.h
@@ -54,7 +54,9 @@ struct skcipher_givcrypt_request {

 struct crypto_skcipher {
        int (*setkey)(struct crypto_skcipher *tfm, const u8 *key,
-                     unsigned int keylen);
+                       unsigned int keylen);
+       int (*setkeytype)(struct crypto_skcipher *tfm, const u8 *key,
+                         unsigned int keylen);
        int (*encrypt)(struct skcipher_request *req);
        int (*decrypt)(struct skcipher_request *req);

@@ -125,6 +127,8 @@ struct crypto_skcipher {
 struct skcipher_alg {
        int (*setkey)(struct crypto_skcipher *tfm, const u8 *key,
                      unsigned int keylen);
+       int (*setkeytype)(struct crypto_skcipher *tfm, const u8 *key,
+                         unsigned int keylen);
        int (*encrypt)(struct skcipher_request *req);
        int (*decrypt)(struct skcipher_request *req);
        int (*init)(struct crypto_skcipher *tfm);
@@ -401,6 +405,12 @@ static inline int crypto_skcipher_setkey(struct crypto_skcipher *tfm,
        return tfm->setkey(tfm, key, keylen);
 }

+static inline int crypto_skcipher_setkeytype(struct crypto_skcipher *tfm,
+                                            const u8 *key, unsigned int keylen)
+{
+       return tfm->setkeytype(tfm, key, keylen);
+}
+
 static inline unsigned int crypto_skcipher_default_keysize(
        struct crypto_skcipher *tfm)
 {
diff --git a/include/linux/crypto.h b/include/linux/crypto.h
index e8839d3..bda8380 100644
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -292,6 +292,8 @@ struct ablkcipher_alg {
 struct blkcipher_alg {
        int (*setkey)(struct crypto_tfm *tfm, const u8 *key,
                      unsigned int keylen);
+       int (*setkeytype)(struct crypto_tfm *tfm, const u8 *keytype,
+                         unsigned int keylen);
        int (*encrypt)(struct blkcipher_desc *desc,
                       struct scatterlist *dst, struct scatterlist *src,
                       unsigned int nbytes);
@@ -563,6 +565,8 @@ struct blkcipher_tfm {
        void *iv;
        int (*setkey)(struct crypto_tfm *tfm, const u8 *key,
                      unsigned int keylen);
+       int (*setkeytype)(struct crypto_tfm *tfm, const u8 *key,
+                         unsigned int keylen);
        int (*encrypt)(struct blkcipher_desc *desc, struct scatterlist *dst,
                       struct scatterlist *src, unsigned int nbytes);
        int (*decrypt)(struct blkcipher_desc *desc, struct scatterlist *dst,
@@ -1281,6 +1285,14 @@ static inline int crypto_blkcipher_setkey(struct crypto_blkcipher *tfm,
                                                 key, keylen);
 }

+static inline int crypto_blkcipher_setkeytype(struct crypto_blkcipher *tfm,
+                                             const u8 *key,
+                                             unsigned int keylen)
+{
+       return crypto_blkcipher_crt(tfm)->setkeytype(crypto_blkcipher_tfm(tfm),
+                                                    key, keylen);
+}
+
 /**
  * crypto_blkcipher_encrypt() - encrypt plaintext
  * @desc: reference to the block cipher handle with meta data
diff --git a/include/uapi/linux/if_alg.h b/include/uapi/linux/if_alg.h
index bc2bcde..aa31b18 100644
--- a/include/uapi/linux/if_alg.h
+++ b/include/uapi/linux/if_alg.h
@@ -35,6 +35,7 @@ struct af_alg_iv {
 #define ALG_SET_OP                     3
 #define ALG_SET_AEAD_ASSOCLEN          4
 #define ALG_SET_AEAD_AUTHSIZE          5
+#define ALG_SET_KEY_TYPE               6

 /* Operations */
 #define ALG_OP_DECRYPT                 0
--
1.9.5

This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.

^ permalink raw reply related

* [PATCH] ethernet: renesas: convert to SPDX identifiers
From: Kuninori Morimoto @ 2018-09-07  2:02 UTC (permalink / raw)
  To: Sergei Shtylyov, David S. Miller, Mark Brown
  Cc: Geert Uytterhoeven, Robin Murphy, netdev, linux-renesas-soc


From: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>

This patch updates license to use SPDX-License-Identifier
instead of verbose license text.

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
 drivers/net/ethernet/renesas/Kconfig    | 1 +
 drivers/net/ethernet/renesas/Makefile   | 1 +
 drivers/net/ethernet/renesas/ravb_ptp.c | 6 +-----
 3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/renesas/Kconfig b/drivers/net/ethernet/renesas/Kconfig
index f3f7477..bb0ebdf 100644
--- a/drivers/net/ethernet/renesas/Kconfig
+++ b/drivers/net/ethernet/renesas/Kconfig
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
 #
 # Renesas device configuration
 #
diff --git a/drivers/net/ethernet/renesas/Makefile b/drivers/net/ethernet/renesas/Makefile
index a05102a..f21ab8c 100644
--- a/drivers/net/ethernet/renesas/Makefile
+++ b/drivers/net/ethernet/renesas/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
 #
 # Makefile for the Renesas device drivers.
 #
diff --git a/drivers/net/ethernet/renesas/ravb_ptp.c b/drivers/net/ethernet/renesas/ravb_ptp.c
index eede70e..0721b5c 100644
--- a/drivers/net/ethernet/renesas/ravb_ptp.c
+++ b/drivers/net/ethernet/renesas/ravb_ptp.c
@@ -1,13 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0+
 /* PTP 1588 clock using the Renesas Ethernet AVB
  *
  * Copyright (C) 2013-2015 Renesas Electronics Corporation
  * Copyright (C) 2015 Renesas Solutions Corp.
  * Copyright (C) 2015-2016 Cogent Embedded, Inc. <source@cogentembedded.com>
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License as published by
- *  the Free Software Foundation; either version 2 of the License, or
- *  (at your option) any later version.
  */
 
 #include "ravb.h"
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
From: Eric Dumazet @ 2018-09-07  7:03 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Herbert Xu, David Miller, Neil Horman, Marcelo Ricardo Leitner,
	Vladislav Yasevich, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	linux-crypto, LKML, linux-sctp, netdev, linux-decnet-user,
	kernel-team, Yuchung Cheng, Neal Cardwell
In-Reply-To: <CAOesGMgRrb4D2S_qWwgo00iNxbCL9EEGfhD5Ji-2HMWuZeq0Yw@mail.gmail.com>

On Thu, Sep 6, 2018 at 11:20 PM Olof Johansson <olof@lixom.net> wrote:
>
> Hi,
>
> On Thu, Sep 6, 2018 at 8:32 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > On Thu, Sep 06, 2018 at 12:33:58PM -0700, Eric Dumazet wrote:
> >> On Thu, Sep 6, 2018 at 12:21 PM Olof Johansson <olof@lixom.net> wrote:
> >> >
> >> > Today these are all global shared variables per protocol, and in
> >> > particular tcp_memory_allocated can get hot on a system with
> >> > large number of CPUs and a substantial number of connections.
> >> >
> >> > Moving it over to a per-cpu variable makes it significantly cheaper,
> >> > and the added overhead when summing up the percpu copies is still smaller
> >> > than the cost of having a hot cacheline bouncing around.
> >>
> >> I am curious. We never noticed contention on this variable, at least for TCP.
> >
> > Yes these variables are heavily amortised so I'm surprised that
> > they would cause much contention.
> >
> >> Please share some numbers with us.
> >
> > Indeed.
>
> Certainly, just had to collect them again.
>
> This is on a dual xeon box, with ~150-200k TCP connections. I see
> about .7% CPU spent in __sk_mem_{reduce,raise}_allocated in the
> inlined atomic ops, most of those in reduce.
>
> Call path for reduce is practically all from tcp_write_timer on softirq:
>
>                 __sk_mem_reduce_allocated
>                 tcp_write_timer
>                 call_timer_fn
>                 run_timer_softirq
>                 __do_softirq
>                 irq_exit
>                 smp_apic_timer_interrupt
>                 apic_timer_interrupt
>                 cpuidle_enter_state
>
> With this patch, I see about .18+.11+.07 = .36% in percpu-related
> functions called from the same __sk_mem functions.
>
> Now, that's a halving of cycles samples on that specific setup. The
> real difference though, is on another platform where atomics are more
> expensive. There, this makes a significant difference. Unfortunately,
> I can't share specifics but I think this change stands on its own on
> the dual xeon setup as well, maybe with slightly less strong wording
> on just how hot the variable/line happens to be.


Problem is : we have platforms with more than 100 cpus, and
sk_memory_allocated() cost will be too expensive,
especially if the host is under memory pressure, since all cpus will
touch their private counter.

per cpu variables do not really scale, they were ok 10 years ago when
no more than 16 cpus were the norm.

I would prefer change TCP to not aggressively call
__sk_mem_reduce_allocated() from tcp_write_timer()

Ideally only tcp_retransmit_timer() should attempt to reduce forward
allocations, after recurring timeout.

Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
sk_forward_alloc overflows")
we have better control over sockets having huge forward allocations.

Something like :

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 7fdf222a0bdfe9775970082f6b5dcdcc82b2ae1a..7e2e17cde9b6a9be835edfac26b64f4ce9411538
100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -505,6 +505,8 @@ void tcp_retransmit_timer(struct sock *sk)
                        mib_idx = LINUX_MIB_TCPTIMEOUTS;
                }
                __NET_INC_STATS(sock_net(sk), mib_idx);
+       } else {
+               sk_mem_reclaim(sk);
        }

        tcp_enter_loss(sk);
@@ -576,11 +578,11 @@ void tcp_write_timer_handler(struct sock *sk)

        if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ||
            !icsk->icsk_pending)
-               goto out;
+               return;

        if (time_after(icsk->icsk_timeout, jiffies)) {
                sk_reset_timer(sk, &icsk->icsk_retransmit_timer,
icsk->icsk_timeout);
-               goto out;
+               return;
        }

        tcp_mstamp_refresh(tcp_sk(sk));
@@ -602,9 +604,6 @@ void tcp_write_timer_handler(struct sock *sk)
                tcp_probe_timer(sk);
                break;
        }
-
-out:
-       sk_mem_reclaim(sk);
 }

^ permalink raw reply

* Re: [PATCH v2 net-next 3/4] net: make listified RX functions return number of good packets
From: Eric Dumazet @ 2018-09-07  2:27 UTC (permalink / raw)
  To: Edward Cree, davem; +Cc: linux-net-drivers, netdev
In-Reply-To: <d601bec7-0eaa-133d-f021-a340e38c45ed@solarflare.com>

On 09/06/2018 07:26 AM, Edward Cree wrote:
> Signed-off-by: Edward Cree <ecree@solarflare.com>

Lack of changelog here ?

I do not know what is a good packet.

You are adding a lot of conditional expressions, that cpu
will mispredict quite often.

Typical micro benchmarks wont really notice.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox