Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next v3 4/4] net: vhost: add rx busy polling in tx path
From: xiangxia.m.yue @ 2018-06-30  6:33 UTC (permalink / raw)
  To: jasowang
  Cc: mst, makita.toshiaki, virtualization, netdev, Tonghao Zhang,
	Tonghao Zhang
In-Reply-To: <1530340438-3039-1-git-send-email-xiangxia.m.yue@gmail.com>

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch improves the guest receive and transmit performance.
On the handle_tx side, we poll the sock receive queue at the
same time. handle_rx do that in the same way.

We set the poll-us=100us and use the iperf3 to test
its bandwidth, use the netperf to test throughput and mean
latency. When running the tests, the vhost-net kthread of
that VM, is alway 100% CPU. The commands are shown as below.

iperf3  -s -D
iperf3  -c IP -i 1 -P 1 -t 20 -M 1400

or
netserver
netperf -H IP -t TCP_RR -l 20 -- -O "THROUGHPUT,MEAN_LATENCY"

host -> guest:
iperf3:
* With the patch:     27.0 Gbits/sec
* Without the patch:  14.4 Gbits/sec

netperf (TCP_RR):
* With the patch:     48039.56 trans/s, 20.64us mean latency
* Without the patch:  46027.07 trans/s, 21.58us mean latency

This patch also improves the guest transmit performance.

guest -> host:
iperf3:
* With the patch:     27.2 Gbits/sec
* Without the patch:  24.4 Gbits/sec

netperf (TCP_RR):
* With the patch:     47963.25 trans/s, 20.71us mean latency
* Without the patch:  45796.70 trans/s, 21.68us mean latency

Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
---
 drivers/vhost/net.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 458f81d..fb43d82 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -478,17 +478,13 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 				    struct iovec iov[], unsigned int iov_size,
 				    unsigned int *out_num, unsigned int *in_num)
 {
-	unsigned long uninitialized_var(endtime);
+	struct vhost_net_virtqueue *nvq_rx = &net->vqs[VHOST_NET_VQ_RX];
 	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				  out_num, in_num, NULL, NULL);
 
 	if (r == vq->num && vq->busyloop_timeout) {
-		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-		while (vhost_can_busy_poll(vq->dev, endtime) &&
-		       vhost_vq_avail_empty(vq->dev, vq))
-			cpu_relax();
-		preempt_enable();
+		vhost_net_busy_poll(net, &nvq_rx->vq, vq, false);
+
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				      out_num, in_num, NULL, NULL);
 	}
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next v3 4/4] net: vhost: add rx busy polling in tx path
From: Jesper Dangaard Brouer @ 2018-06-30  7:03 UTC (permalink / raw)
  To: xiangxia.m.yue; +Cc: mst, netdev, brouer, virtualization, Tonghao Zhang
In-Reply-To: <1530340438-3039-5-git-send-email-xiangxia.m.yue@gmail.com>

On Fri, 29 Jun 2018 23:33:58 -0700
xiangxia.m.yue@gmail.com wrote:

> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> This patch improves the guest receive and transmit performance.
> On the handle_tx side, we poll the sock receive queue at the
> same time. handle_rx do that in the same way.
> 
> We set the poll-us=100us and use the iperf3 to test

Where/how do you configure poll-us=100us ?

Are you talking about /proc/sys/net/core/busy_poll ?


p.s. Nice performance boost! :-)

> its bandwidth, use the netperf to test throughput and mean
> latency. When running the tests, the vhost-net kthread of
> that VM, is alway 100% CPU. The commands are shown as below.
> 
> iperf3  -s -D
> iperf3  -c IP -i 1 -P 1 -t 20 -M 1400
> 
> or
> netserver
> netperf -H IP -t TCP_RR -l 20 -- -O "THROUGHPUT,MEAN_LATENCY"
> 
> host -> guest:
> iperf3:
> * With the patch:     27.0 Gbits/sec
> * Without the patch:  14.4 Gbits/sec
> 
> netperf (TCP_RR):
> * With the patch:     48039.56 trans/s, 20.64us mean latency
> * Without the patch:  46027.07 trans/s, 21.58us mean latency
> 
> This patch also improves the guest transmit performance.
> 
> guest -> host:
> iperf3:
> * With the patch:     27.2 Gbits/sec
> * Without the patch:  24.4 Gbits/sec
> 
> netperf (TCP_RR):
> * With the patch:     47963.25 trans/s, 20.71us mean latency
> * Without the patch:  45796.70 trans/s, 21.68us mean latency
> 
> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [net-next PATCH v5 1/7] net: Refactor XPS for CPUs and Rx queues
From: Nambiar, Amritha @ 2018-06-30  7:48 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, alexander.h.duyck, willemdebruijn.kernel,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <20180629.215938.2008552830198656487.davem@davemloft.net>

On 6/29/2018 5:59 AM, David Miller wrote:
> From: Amritha Nambiar <amritha.nambiar@intel.com>
> Date: Wed, 27 Jun 2018 15:31:18 -0700
> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index c6b377a..3790ac9 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>  ...
>> +static inline bool attr_test_online(unsigned long j,
>> +				    const unsigned long *online_mask,
>> +				    unsigned int nr_bits)
> 
> This is a networking header file, so for names you are putting into
> the global namespace please give them some kind of prefix that
> indicates it is about networking.
> 
Sure, will prefix these with 'netif_'.

^ permalink raw reply

* Re: [net-next PATCH v5 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
From: Nambiar, Amritha @ 2018-06-30  7:49 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, alexander.h.duyck, willemdebruijn.kernel,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <20180629.220511.870587764860032381.davem@davemloft.net>

On 6/29/2018 6:05 AM, David Miller wrote:
> From: Amritha Nambiar <amritha.nambiar@intel.com>
> Date: Wed, 27 Jun 2018 15:31:28 -0700
> 
>> @@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
>>  
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> +	/* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +	if (WARN_ON_ONCE((unsigned short)tx_queue > USHRT_MAX))
>> +		return;
>>  	sk->sk_tx_queue_mapping = tx_queue;
>>  }
>>  
>> +#define NO_QUEUE_MAPPING	USHRT_MAX
> 
> I think you need to check ">= USHRT_MAX" since USHRT_MAX is how you
> indicate no queue.
> 
Agree, I'll fix this.

^ permalink raw reply

* Re: [net-next PATCH v5 4/7] net: Record receive queue number for a connection
From: Nambiar, Amritha @ 2018-06-30  7:50 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, alexander.h.duyck, willemdebruijn.kernel,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <20180629.220639.1360385545962536071.davem@davemloft.net>

On 6/29/2018 6:06 AM, David Miller wrote:
> From: Amritha Nambiar <amritha.nambiar@intel.com>
> Date: Wed, 27 Jun 2018 15:31:34 -0700
> 
>> @@ -1702,6 +1709,13 @@ static inline int sk_tx_queue_get(const struct sock *sk)
>>  	return -1;
>>  }
>>  
>> +static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
>> +{
>> +#ifdef CONFIG_XPS
>> +	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>> +#endif
>> +}
> 
> Maybe add a >= NO_QUEUE_MAPPING check like you did for
> skc_tx_queue_mapping.
> 
Will fix.

^ permalink raw reply

* Re: [PATCH net-next 3/5] sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams
From: 吉藤英明 @ 2018-06-30  8:15 UTC (permalink / raw)
  To: Xin Long
  Cc: Marcelo Ricardo Leitner, Neil Horman, David Miller, network dev,
	linux-sctp, yoshfuji
In-Reply-To: <CADvbK_cHRA1aN+-tdzURqkZUf29w-0VwGHC_oKDbfLPjYQ1+qg@mail.gmail.com>

Hi,

2018-06-28 15:40 GMT+09:00 Xin Long <lucien.xin@gmail.com>:
> On Tue, Jun 26, 2018 at 8:02 PM, 吉藤英明
> <hideaki.yoshifuji@miraclelinux.com> wrote:
>> 2018-06-26 13:33 GMT+09:00 Xin Long <lucien.xin@gmail.com>:
>>> On Tue, Jun 26, 2018 at 12:31 AM, Marcelo Ricardo Leitner
>>> <marcelo.leitner@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> On Tue, Jun 26, 2018 at 01:12:00AM +0900, 吉藤英明 wrote:
>>>>> Hi,
>>>>>
>>>>> 2018-06-25 22:03 GMT+09:00 Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>:
>>>>> > On Mon, Jun 25, 2018 at 07:28:47AM -0400, Neil Horman wrote:
>>>>> >> On Mon, Jun 25, 2018 at 04:31:26PM +0900, David Miller wrote:
>>>>> >> > From: Xin Long <lucien.xin@gmail.com>
>>>>> >> > Date: Mon, 25 Jun 2018 10:14:35 +0800
>>>>> >> >
>>>>> >> > >  struct sctp_paddrparams {
>>>>> >> > > @@ -773,6 +775,8 @@ struct sctp_paddrparams {
>>>>> >> > >   __u32                   spp_pathmtu;
>>>>> >> > >   __u32                   spp_sackdelay;
>>>>> >> > >   __u32                   spp_flags;
>>>>> >> > > + __u32                   spp_ipv6_flowlabel;
>>>>> >> > > + __u8                    spp_dscp;
>>>>> >> > >  } __attribute__((packed, aligned(4)));
>>>>> >> >
>>>>> >> > I don't think you can change the size of this structure like this.
>>>>> >> >
>>>>> >> > This check in sctp_setsockopt_peer_addr_params():
>>>>> >> >
>>>>> >> >     if (optlen != sizeof(struct sctp_paddrparams))
>>>>> >> >             return -EINVAL;
>>>>> >> >
>>>>> >> > is going to trigger in old kernels when executing programs
>>>>> >> > built against the new struct definition.
>>>>> >
>>>>> > That will happen, yes, but do we really care about being future-proof
>>>>> > here? I mean: if we also update such check(s) to support dealing with
>>>>> > smaller-than-supported structs, newer kernels will be able to run
>>>>> > programs built against the old struct, and the new one; while building
>>>>> > using newer headers and running on older kernel may fool the
>>>>> > application in other ways too (like enabling support for something
>>>>> > that is available on newer kernel and that is not present in the older
>>>>> > one).
>>>>>
>>>>> We should not break existing apps.
>>>>> We still accept apps of pre-2.4 era without sin6_scope_id
>>>>> (e.g., net/ipv6/af_inet6.c:inet6_bind()).
>>>>
>>>> Yes. That's what I tried to say. That is supporting an old app built
>>>> with old kernel headers and running on a newer kernel, and not the
>>>> other way around (an app built with fresh headers and running on an
>>>> old kernel).
>>> To make it, I will update the check like:
>>>
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 1df5d07..c949d8c 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -2715,13 +2715,18 @@ static int
>>> sctp_setsockopt_peer_addr_params(struct sock *sk,
>>>         struct sctp_sock        *sp = sctp_sk(sk);
>>>         int error;
>>>         int hb_change, pmtud_change, sackdelay_change;
>>> +       int plen = sizeof(params);
>>> +       int old_plen = plen - sizeof(u32) * 2;
>>
>> if (optlen < offsetof(struct sctp_paddrparams, spp_ipv6_flowlabel))
>> maybe?
> Hi, yoshfuji,
> offsetof() is better. thank you.
>
>>
>>>
>>> -       if (optlen != sizeof(struct sctp_paddrparams))
>>> +       if (optlen != plen && optlen != old_plen)
>>>                 return -EINVAL;
>>>
>>>         if (copy_from_user(&params, optval, optlen))
>>>                 return -EFAULT;
>>>
>>> +       if (optlen == old_plen)
>>> +               params.spp_flags &= ~(SPP_DSCP | SPP_IPV6_FLOWLABEL);
>>
>> I think we should return -EINVAL if size is not new one.
> Sorry, if we returned  -EINVAL when size is the old one,
> how can we guarantee an old app built with old kernel
> headers and running on a newer kernel works well?
> or you meant?
> if ((params.spp_flags & (SPP_DSCP | SPP_IPV6_FLOWLABEL)) &&
>     optlen != plen)
>         return EINVAL;

Yes, I meant this (it should be -EINVAL though).


>
>>
>> --yoshfuji
>>
>>> +
>>>         /* Validate flags and value parameters. */
>>>         hb_change        = params.spp_flags & SPP_HB;
>>>         pmtud_change     = params.spp_flags & SPP_PMTUD;
>>> @@ -5591,10 +5596,13 @@ static int
>>> sctp_getsockopt_peer_addr_params(struct sock *sk, int len,
>>>         struct sctp_transport   *trans = NULL;
>>>         struct sctp_association *asoc = NULL;
>>>         struct sctp_sock        *sp = sctp_sk(sk);
>>> +       int plen = sizeof(params);
>>> +       int old_plen = plen - sizeof(u32) * 2;
>>>
>>> -       if (len < sizeof(struct sctp_paddrparams))
>>> +       if (len < old_plen)
>>>                 return -EINVAL;
>>> -       len = sizeof(struct sctp_paddrparams);
>>> +
>>> +       len = len >= plen ? plen : old_plen;
>>>         if (copy_from_user(&params, optval, len))
>>>                 return -EFAULT;
>>>
>>> does it look ok to you?

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Daniel Borkmann @ 2018-06-30  8:47 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: Jiri Benc, davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180629100133.16286f24@cakuba.netronome.com>

On 06/29/2018 07:01 PM, Jakub Kicinski wrote:
> On Fri, 29 Jun 2018 09:04:15 +0200, Daniel Borkmann wrote:
>> On 06/28/2018 06:54 PM, Jakub Kicinski wrote:
>>> On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:  
>>>> On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:  
>>>>> Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is
>>>>> right approach since this is opaque info and solely defined by the BPF prog
>>>>> that is using the generic helper.    
>>>>
>>>> Wouldn't it make sense to introduce some safeguards here (in a backward
>>>> compatible way, of course)? It's easy to mistakenly set data for a
>>>> different tunnel type in a BPF program and then be surprised by the
>>>> result. It might help users if such usage was detected by the kernel,
>>>> one way or another.  
>>>
>>> Well, that's how it works today ;)  
>>
>> Well, it was designed like that on purpose, to be i) agnostic of the underlying
>> device, ii) to not clutter BPF API with tens of different APIs effectively doing
>> the same thing, and at the same time to avoid adding protocol specifics. E.g. at
>> least core bits of bpf_skb_{set,get}_tunnel_key() will work whether I use vxlan
>> or geneve underneath (we are actually using it this way) and I could use things
>> like tun_id to encode custom meta data from BPF for either of them depending on flavor
>> picked by orchestration system. For the tunnel options in bpf_skb_{set,get}_tunnel_opt()
>> it's similar although here there needs to be awareness of the underlying dev depending
>> on whether you encode data into e.g. gbp or tlvs, etc. However, downside right now I
>> can see with a patch like below is that:
>>
>> i) People might still just keep using 'TUNNEL_OPTIONS_PRESENT path' since available
>> and backwards compatible with current/older kernels, ii) we cut bits away from
>> size over time for each new tunnel proto added in future that would support tunnel
>> options, iii) that extension is one-sided (at least below) and same would be needed
>> in getter part, and iv) there needs to be a way for the case when folks add new
>> tunnel options where we don't need to worry that we forget updating BPF_F_TUN_*
>> each time otherwise this will easily slip through and again people will just rely
>> on using TUNNEL_OPTIONS_PRESENT catchall. Given latter and in particular point i)
>> I wouldn't think it's worth the pain, the APIs were added to BPF in v4.6 so this
>> would buy them 2 more years wrt kernel compatibility with same functionality level.
>> And point v), I just noticed the patch is actually buggy: size is ARG_CONST_SIZE and
>> verifier will attempt to check the value whether the buffer passed in argument 2 is
>> valid or not, so using flags here in upper bits would let verification fail, you'd
>> really have to make a new helper just for this.
> 
> Ah, indeed.  I'd rather avoid a new helper, if we reuse an old one
> people can always write a program like:
> 
> 	err = helper(all_flags);
> 	if (err == -EINVAL)
> 		err = helper(fallback_flags);
> 
> With a new helper the program will not even load on old kernels :(
> 
> Could we add the flags new to bpf_skb_set_tunnel_key(), maybe?  It is a
> bit ugly because only options care about the flags and in theory
> bpf_skb_set_tunnel_key() doesn't have to be called before
> bpf_skb_set_tunnel_opt() ...

Right, though it's also ugly splitting things this way over two helpers
when this is _specifically_ only for one of them (unless you do need it for
hw offload where you need to signal for tunnel key _and_ options in BPF
itself but that's not from what I see is the case here). For sake of
'safeguard'-only I bet users won't do such fallback for reasons above but
again use currently TUNNEL_OPTIONS_PRESENT route, as is since it's simpler
and available since long time already. Thus I would prefer to leave it as
is in such case since ship has sailed long ago on this, and usage of this
extra check is not really obvious either, so I don't think it's worth it
and 'value-add' is marginal.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next v3 4/4] net: vhost: add rx busy polling in tx path
From: Tonghao Zhang @ 2018-06-30  8:48 UTC (permalink / raw)
  To: brouer; +Cc: mst, Linux Kernel Network Developers, virtualization,
	Tonghao Zhang
In-Reply-To: <20180630090303.043a06c7@redhat.com>

On Sat, Jun 30, 2018 at 3:03 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Fri, 29 Jun 2018 23:33:58 -0700
> xiangxia.m.yue@gmail.com wrote:
>
> > From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> >
> > This patch improves the guest receive and transmit performance.
> > On the handle_tx side, we poll the sock receive queue at the
> > same time. handle_rx do that in the same way.
> >
> > We set the poll-us=100us and use the iperf3 to test
>
> Where/how do you configure poll-us=100us ?
>
> Are you talking about /proc/sys/net/core/busy_poll ?
No, we set the poll-us in qemu. e.g
-netdev tap,ifname=tap0,id=hostnet0,vhost=on,script=no,downscript=no,poll-us=100

>
> p.s. Nice performance boost! :-)
>
> > its bandwidth, use the netperf to test throughput and mean
> > latency. When running the tests, the vhost-net kthread of
> > that VM, is alway 100% CPU. The commands are shown as below.
> >
> > iperf3  -s -D
> > iperf3  -c IP -i 1 -P 1 -t 20 -M 1400
> >
> > or
> > netserver
> > netperf -H IP -t TCP_RR -l 20 -- -O "THROUGHPUT,MEAN_LATENCY"
> >
> > host -> guest:
> > iperf3:
> > * With the patch:     27.0 Gbits/sec
> > * Without the patch:  14.4 Gbits/sec
> >
> > netperf (TCP_RR):
> > * With the patch:     48039.56 trans/s, 20.64us mean latency
> > * Without the patch:  46027.07 trans/s, 21.58us mean latency
> >
> > This patch also improves the guest transmit performance.
> >
> > guest -> host:
> > iperf3:
> > * With the patch:     27.2 Gbits/sec
> > * Without the patch:  24.4 Gbits/sec
> >
> > netperf (TCP_RR):
> > * With the patch:     47963.25 trans/s, 20.71us mean latency
> > * Without the patch:  45796.70 trans/s, 21.68us mean latency
> >
> > Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
From: David Miller @ 2018-06-30  8:50 UTC (permalink / raw)
  To: ka-cheong.poon; +Cc: netdev, santosh.shilimkar, rds-devel
In-Reply-To: <693463ed87c5956eed05a6266c4edf354ab3d51a.1530086216.git.ka-cheong.poon@oracle.com>

From: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Date: Wed, 27 Jun 2018 03:23:27 -0700

> This patch changes the internal representation of an IP address to use
> struct in6_addr.  IPv4 address is stored as an IPv4 mapped address.
> All the functions which take an IP address as argument are also
> changed to use struct in6_addr.  But RDS socket layer is not modified
> such that it still does not accept IPv6 address from an application.
> And RDS layer does not accept nor initiate IPv6 connections.
> 
> v2: Fixed sparse warnings.
> 
> Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>

I really don't like this.

An ipv4 mapped ipv6 address is not the same as an ipv4 address.

You are effectively preventing the use of ipv6 connections
using ipv4 mapped addresses.

Also, assuming the sockaddr type based upon size is wrong.
You have to check the family field, then you can decide
to interpret the rest of the sockaddr in one way or another
and also validate it's length.

^ permalink raw reply

* [net-next PATCH v6 0/7] Symmetric queue selection using XPS for Rx queues
From: Amritha Nambiar @ 2018-06-30  4:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom

This patch series implements support for Tx queue selection based on
Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
using sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.

XPS is refactored to support Tx queue selection based on either the
CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.

- /sys/class/net/<dev>/queues/tx-*/xps_rxqs

A set of receive queues can be mapped to a set of transmit queues (many:many),
although the common use case is a 1:1 mapping. This will enable sending
packets on the same Tx-Rx queue association as this is useful for busy polling
multi-threaded workloads where it is not possible to pin the threads to
a CPU. This is a rework of Sridhar's patch for symmetric queueing via
socket option:
https://www.spinics.net/lists/netdev/msg453106.html

Testing Hints:
Kernel:  Linux 4.17.0-rc7+
Interface:
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x00015e0b

Configuration:
ethtool -L $iface combined 16
ethtool -C $iface rx-usecs 1000
sysctl net.core.busy_poll=1000
ATR disabled:
ethtool -K $iface ntuple on

Workload:
Modified memcached that changes the thread selection policy to be based
on the incoming rx-queue of a connection using SO_INCOMING_NAPI_ID socket
option. The default is round-robin.

Default: No rxqs_map configured
Symmetric queues: Enable rxqs_map for all queues 1:1 mapped to Tx queue

System:
Architecture:          x86_64
CPU(s):                72
Model name:            Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads  400K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                 4/51/2215               2/30/5163
(usec)


intr/sec                        26655                   18606

contextswitch/sec               5145                    4044

insn per cycle                  0.43                    0.72

cache-misses                    6.919                   4.310
(% of all cache refs)

L1-dcache-load-                 4.49                    3.29
-misses
(% of all L1-dcache hits)

LLC-load-misses                 13.26                   8.96
(% of all LL-cache hits)

-------------------------------------------------------------------------------

32 threads  400K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                 10/112/5562             9/46/4637
(usec)


intr/sec                        30456                   27666

contextswitch/sec               7552                    5133

insn per cycle                  0.41                    0.49

cache-misses                    9.357                   2.769
(% of all cache refs)

L1-dcache-load-                 4.09                    3.98
-misses
(% of all L1-dcache hits)

LLC-load-misses                 12.96                   3.96
(% of all LL-cache hits)

-------------------------------------------------------------------------------

16 threads  800K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                  5/151/4989             9/69/2611
(usec)


intr/sec                        35686                   22907

contextswitch/sec               25522                   12281

insn per cycle                  0.67                    0.74

cache-misses                    8.652                   6.38
(% of all cache refs)

L1-dcache-load-                 3.19                    2.86
-misses
(% of all L1-dcache hits)

LLC-load-misses                 16.53                   11.99
(% of all LL-cache hits)

-------------------------------------------------------------------------------
32 threads  800K requests/sec
=============================
-------------------------------------------------------------------------------
                                Default                 Symmetric queues
-------------------------------------------------------------------------------
RTT min/avg/max                  6/163/6152             8/88/4209
(usec)


intr/sec                        47079                   26548

contextswitch/sec               42190                   39168

insn per cycle                  0.45                    0.54

cache-misses                    8.798                   4.668
(% of all cache refs)

L1-dcache-load-                 6.55                    6.29
-misses
(% of all L1-dcache hits)

LLC-load-misses                 13.91                   10.44
(% of all LL-cache hits)

-------------------------------------------------------------------------------

v6:
- Changed the names of some functions to begin with net_if.
- Cleaned up sk_tx_queue_set/sk_rx_queue_set functions.
- Added sk_rx_queue_clear to make it consistent with tx_queue_mapping
  initialization.

---

Amritha Nambiar (7):
      net: Refactor XPS for CPUs and Rx queues
      net: Use static_key for XPS maps
      net: sock: Change tx_queue_mapping in sock_common to unsigned short
      net: Record receive queue number for a connection
      net: Enable Tx queue selection based on Rx queues
      net-sysfs: Add interface for Rx queue(s) map per Tx queue
      Documentation: Add explanation for XPS using Rx-queue(s) map


 Documentation/ABI/testing/sysfs-class-net-queues |   11 +
 Documentation/networking/scaling.txt             |   61 ++++-
 include/linux/cpumask.h                          |   11 +
 include/linux/netdevice.h                        |   98 +++++++
 include/net/busy_poll.h                          |    1 
 include/net/sock.h                               |   52 ++++
 net/core/dev.c                                   |  288 +++++++++++++++-------
 net/core/net-sysfs.c                             |   87 ++++++-
 net/core/sock.c                                  |    2 
 net/ipv4/tcp_input.c                             |    3 
 10 files changed, 505 insertions(+), 109 deletions(-)

^ permalink raw reply

* [net-next PATCH v6 1/7] net: Refactor XPS for CPUs and Rx queues
From: Amritha Nambiar @ 2018-06-30  4:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/linux/cpumask.h   |   11 ++
 include/linux/netdevice.h |   98 ++++++++++++++++++++-
 net/core/dev.c            |  211 ++++++++++++++++++++++++++++++---------------
 net/core/net-sysfs.c      |    4 -
 4 files changed, 244 insertions(+), 80 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bf53d89..57f20a0 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_active(cpu)		((cpu) == 0)
 #endif
 
-/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
 {
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-	WARN_ON_ONCE(cpu >= nr_cpumask_bits);
+	WARN_ON_ONCE(cpu >= bits);
 #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+}
+
+/* verify cpu argument to cpumask_* operators */
+static inline unsigned int cpumask_check(unsigned int cpu)
+{
+	cpu_max_bits_warn(cpu, nr_cpumask_bits);
 	return cpu;
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6b377a..8bf8d61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -731,10 +731,15 @@ struct xps_map {
  */
 struct xps_dev_maps {
 	struct rcu_head rcu;
-	struct xps_map __rcu *cpu_map[0];
+	struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +		\
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +	\
 	(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+	(_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE	16
@@ -1910,7 +1915,8 @@ struct net_device {
 	int			watchdog_timeo;
 
 #ifdef CONFIG_XPS
-	struct xps_dev_maps __rcu *xps_maps;
+	struct xps_dev_maps __rcu *xps_cpus_map;
+	struct xps_dev_maps __rcu *xps_rxqs_map;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
@@ -3259,6 +3265,92 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, bool is_rxqs_map);
+
+/**
+ *	netif_attr_test_mask - Test a CPU or Rx queue set in a mask
+ *	@j: CPU/Rx queue index
+ *	@mask: bitmask of all cpus/rx queues
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
+ */
+static inline bool netif_attr_test_mask(unsigned long j,
+					const unsigned long *mask,
+					unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+	return test_bit(j, mask);
+}
+
+/**
+ *	netif_attr_test_online - Test for online CPU/Rx queue
+ *	@j: CPU/Rx queue index
+ *	@online_mask: bitmask for CPUs/Rx queues that are online
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns true if a CPU/Rx queue is online.
+ */
+static inline bool netif_attr_test_online(unsigned long j,
+					  const unsigned long *online_mask,
+					  unsigned int nr_bits)
+{
+	cpu_max_bits_warn(j, nr_bits);
+
+	if (online_mask)
+		return test_bit(j, online_mask);
+
+	return (j < nr_bits);
+}
+
+/**
+ *	netif_attrmask_next - get the next CPU/Rx queue in a cpu/Rx queues mask
+ *	@n: CPU/Rx queue index
+ *	@srcp: the cpumask/Rx queue mask pointer
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set.
+ */
+static inline unsigned int netif_attrmask_next(int n, const unsigned long *srcp,
+					       unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (srcp)
+		return find_next_bit(srcp, nr_bits, n + 1);
+
+	return n + 1;
+}
+
+/**
+ *	netif_attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
+ *	@n: CPU/Rx queue index
+ *	@src1p: the first CPUs/Rx queues mask pointer
+ *	@src2p: the second CPUs/Rx queues mask pointer
+ *	@nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set in both.
+ */
+static inline int netif_attrmask_next_and(int n, const unsigned long *src1p,
+					  const unsigned long *src2p,
+					  unsigned int nr_bits)
+{
+	/* -1 is a legal arg here. */
+	if (n != -1)
+		cpu_max_bits_warn(n, nr_bits);
+
+	if (src1p && src2p)
+		return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
+	else if (src1p)
+		return find_next_bit(src1p, nr_bits, n + 1);
+	else if (src2p)
+		return find_next_bit(src2p, nr_bits, n + 1);
+
+	return n + 1;
+}
 #else
 static inline int netif_set_xps_queue(struct net_device *dev,
 				      const struct cpumask *mask,
diff --git a/net/core/dev.c b/net/core/dev.c
index dffed64..7105955 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 	int pos;
 
 	if (dev_maps)
-		map = xmap_dereference(dev_maps->cpu_map[tci]);
+		map = xmap_dereference(dev_maps->attr_map[tci]);
 	if (!map)
 		return false;
 
@@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
 			break;
 		}
 
-		RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
+		RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
 		kfree_rcu(map, rcu);
 		return false;
 	}
@@ -2135,31 +2135,58 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
 	return active;
 }
 
+static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
+			   struct xps_dev_maps *dev_maps, unsigned int nr_ids,
+			   u16 offset, u16 count, bool is_rxqs_map)
+{
+	bool active = false;
+	int i, j;
+
+	for (j = -1; j = netif_attrmask_next(j, mask, nr_ids),
+	     j < nr_ids;)
+		active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
+					       count);
+	if (!active) {
+		if (is_rxqs_map) {
+			RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+		} else {
+			RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
+
+			for (i = offset + (count - 1); count--; i--)
+				netdev_queue_numa_node_write(
+					netdev_get_tx_queue(dev, i),
+							NUMA_NO_NODE);
+		}
+		kfree_rcu(dev_maps, rcu);
+	}
+}
+
 static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 				   u16 count)
 {
+	const unsigned long *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps;
-	int cpu, i;
-	bool active = false;
+	unsigned int nr_ids;
 
 	mutex_lock(&xps_map_mutex);
-	dev_maps = xmap_dereference(dev->xps_maps);
 
-	if (!dev_maps)
-		goto out_no_maps;
-
-	for_each_possible_cpu(cpu)
-		active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
-					       offset, count);
+	dev_maps = xmap_dereference(dev->xps_rxqs_map);
+	if (dev_maps) {
+		nr_ids = dev->num_rx_queues;
+		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
+			       count, true);
 
-	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
-		kfree_rcu(dev_maps, rcu);
 	}
 
-	for (i = offset + (count - 1); count--; i--)
-		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
-					     NUMA_NO_NODE);
+	dev_maps = xmap_dereference(dev->xps_cpus_map);
+	if (!dev_maps)
+		goto out_no_maps;
+
+	if (num_possible_cpus() > 1)
+		possible_mask = cpumask_bits(cpu_possible_mask);
+	nr_ids = nr_cpu_ids;
+	clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset, count,
+		       false);
 
 out_no_maps:
 	mutex_unlock(&xps_map_mutex);
@@ -2170,8 +2197,8 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
 	netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
 }
 
-static struct xps_map *expand_xps_map(struct xps_map *map,
-				      int cpu, u16 index)
+static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
+				      u16 index, bool is_rxqs_map)
 {
 	struct xps_map *new_map;
 	int alloc_len = XPS_MIN_MAP_ALLOC;
@@ -2183,7 +2210,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		return map;
 	}
 
-	/* Need to add queue to this CPU's existing map */
+	/* Need to add tx-queue to this CPU's/rx-queue's existing map */
 	if (map) {
 		if (pos < map->alloc_len)
 			return map;
@@ -2191,9 +2218,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 		alloc_len = map->alloc_len * 2;
 	}
 
-	/* Need to allocate new map to store queue on this CPU's map */
-	new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
-			       cpu_to_node(cpu));
+	/* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
+	 *  map
+	 */
+	if (is_rxqs_map)
+		new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
+	else
+		new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
+				       cpu_to_node(attr_index));
 	if (!new_map)
 		return NULL;
 
@@ -2205,14 +2237,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
 	return new_map;
 }
 
-int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
-			u16 index)
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+			  u16 index, bool is_rxqs_map)
 {
+	const unsigned long *online_mask = NULL, *possible_mask = NULL;
 	struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
-	int i, cpu, tci, numa_node_id = -2;
+	int i, j, tci, numa_node_id = -2;
 	int maps_sz, num_tc = 1, tc = 0;
 	struct xps_map *map, *new_map;
 	bool active = false;
+	unsigned int nr_ids;
 
 	if (dev->num_tc) {
 		num_tc = dev->num_tc;
@@ -2221,16 +2255,27 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -EINVAL;
 	}
 
-	maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
-	if (maps_sz < L1_CACHE_BYTES)
-		maps_sz = L1_CACHE_BYTES;
-
 	mutex_lock(&xps_map_mutex);
+	if (is_rxqs_map) {
+		maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		nr_ids = dev->num_rx_queues;
+	} else {
+		maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
+		if (num_possible_cpus() > 1) {
+			online_mask = cpumask_bits(cpu_online_mask);
+			possible_mask = cpumask_bits(cpu_possible_mask);
+		}
+		dev_maps = xmap_dereference(dev->xps_cpus_map);
+		nr_ids = nr_cpu_ids;
+	}
 
-	dev_maps = xmap_dereference(dev->xps_maps);
+	if (maps_sz < L1_CACHE_BYTES)
+		maps_sz = L1_CACHE_BYTES;
 
 	/* allocate memory for queue storage */
-	for_each_cpu_and(cpu, cpu_online_mask, mask) {
+	for (j = -1; j = netif_attrmask_next_and(j, online_mask, mask, nr_ids),
+	     j < nr_ids;) {
 		if (!new_dev_maps)
 			new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
 		if (!new_dev_maps) {
@@ -2238,73 +2283,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 			return -ENOMEM;
 		}
 
-		tci = cpu * num_tc + tc;
-		map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
+		tci = j * num_tc + tc;
+		map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
 				 NULL;
 
-		map = expand_xps_map(map, cpu, index);
+		map = expand_xps_map(map, j, index, is_rxqs_map);
 		if (!map)
 			goto error;
 
-		RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+		RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 	}
 
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
-	for_each_possible_cpu(cpu) {
+	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
-		for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
+		for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* We need to explicitly update tci as prevous loop
 		 * could break out early if dev_maps is NULL.
 		 */
-		tci = cpu * num_tc + tc;
+		tci = j * num_tc + tc;
 
-		if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
-			/* add queue to CPU maps */
+		if (netif_attr_test_mask(j, mask, nr_ids) &&
+		    netif_attr_test_online(j, online_mask, nr_ids)) {
+			/* add tx-queue to CPU/rx-queue maps */
 			int pos = 0;
 
-			map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+			map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			while ((pos < map->len) && (map->queues[pos] != index))
 				pos++;
 
 			if (pos == map->len)
 				map->queues[map->len++] = index;
 #ifdef CONFIG_NUMA
-			if (numa_node_id == -2)
-				numa_node_id = cpu_to_node(cpu);
-			else if (numa_node_id != cpu_to_node(cpu))
-				numa_node_id = -1;
+			if (!is_rxqs_map) {
+				if (numa_node_id == -2)
+					numa_node_id = cpu_to_node(j);
+				else if (numa_node_id != cpu_to_node(j))
+					numa_node_id = -1;
+			}
 #endif
 		} else if (dev_maps) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 
 		/* copy maps belonging to foreign traffic classes */
 		for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
 			/* fill in the new device map from the old device map */
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
-			RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
+			RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
 		}
 	}
 
-	rcu_assign_pointer(dev->xps_maps, new_dev_maps);
+	if (is_rxqs_map)
+		rcu_assign_pointer(dev->xps_rxqs_map, new_dev_maps);
+	else
+		rcu_assign_pointer(dev->xps_cpus_map, new_dev_maps);
 
 	/* Cleanup old maps */
 	if (!dev_maps)
 		goto out_no_old_maps;
 
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
-			map = xmap_dereference(dev_maps->cpu_map[tci]);
+	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
+			map = xmap_dereference(dev_maps->attr_map[tci]);
 			if (map && map != new_map)
 				kfree_rcu(map, rcu);
 		}
@@ -2317,19 +2370,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	active = true;
 
 out_no_new_maps:
-	/* update Tx queue numa node */
-	netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
-				     (numa_node_id >= 0) ? numa_node_id :
-				     NUMA_NO_NODE);
+	if (!is_rxqs_map) {
+		/* update Tx queue numa node */
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
+					     (numa_node_id >= 0) ?
+					     numa_node_id : NUMA_NO_NODE);
+	}
 
 	if (!dev_maps)
 		goto out_no_maps;
 
-	/* removes queue from unused CPUs */
-	for_each_possible_cpu(cpu) {
-		for (i = tc, tci = cpu * num_tc; i--; tci++)
+	/* removes tx-queue from unused CPUs/rx-queues */
+	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = tc, tci = j * num_tc; i--; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
-		if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
+		if (!netif_attr_test_mask(j, mask, nr_ids) ||
+		    !netif_attr_test_online(j, online_mask, nr_ids))
 			active |= remove_xps_queue(dev_maps, tci, index);
 		for (i = num_tc - tc, tci++; --i; tci++)
 			active |= remove_xps_queue(dev_maps, tci, index);
@@ -2337,7 +2394,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 
 	/* free map if not active */
 	if (!active) {
-		RCU_INIT_POINTER(dev->xps_maps, NULL);
+		if (is_rxqs_map)
+			RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+		else
+			RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
 		kfree_rcu(dev_maps, rcu);
 	}
 
@@ -2347,11 +2407,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	return 0;
 error:
 	/* remove any maps that we added */
-	for_each_possible_cpu(cpu) {
-		for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
-			new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
+	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
+	     j < nr_ids;) {
+		for (i = num_tc, tci = j * num_tc; i--; tci++) {
+			new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
 			map = dev_maps ?
-			      xmap_dereference(dev_maps->cpu_map[tci]) :
+			      xmap_dereference(dev_maps->attr_map[tci]) :
 			      NULL;
 			if (new_map && new_map != map)
 				kfree(new_map);
@@ -2363,6 +2424,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
 	kfree(new_dev_maps);
 	return -ENOMEM;
 }
+
+int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
+			u16 index)
+{
+	return __netif_set_xps_queue(dev, cpumask_bits(mask), index, false);
+}
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
@@ -3384,7 +3451,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	int queue_index = -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
 		unsigned int tci = skb->sender_cpu - 1;
 
@@ -3393,7 +3460,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 			tci += netdev_get_prio_tc_map(dev, skb->priority);
 		}
 
-		map = rcu_dereference(dev_maps->cpu_map[tci]);
+		map = rcu_dereference(dev_maps->attr_map[tci]);
 		if (map) {
 			if (map->len == 1)
 				queue_index = map->queues[0];
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..b39987c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
 		return -ENOMEM;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_maps);
+	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
 		for_each_possible_cpu(cpu) {
 			int i, tci = cpu * num_tc + tc;
 			struct xps_map *map;
 
-			map = rcu_dereference(dev_maps->cpu_map[tci]);
+			map = rcu_dereference(dev_maps->attr_map[tci]);
 			if (!map)
 				continue;
 

^ permalink raw reply related

* [net-next PATCH v6 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
From: Amritha Nambiar @ 2018-06-30  4:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

Change 'skc_tx_queue_mapping' field in sock_common structure from
'int' to 'unsigned short' type with ~0 indicating unset and
other positive queue values being set. This will accommodate adding
a new 'unsigned short' field in sock_common in the next patch for
rx_queue_mapping.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/sock.h |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b7541..37b09c8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -214,7 +214,7 @@ struct sock_common {
 		struct hlist_node	skc_node;
 		struct hlist_nulls_node skc_nulls_node;
 	};
-	int			skc_tx_queue_mapping;
+	unsigned short		skc_tx_queue_mapping;
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
+	/* sk_tx_queue_mapping accept only upto a 16-bit value */
+	if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
+		return;
 	sk->sk_tx_queue_mapping = tx_queue;
 }
 
+#define NO_QUEUE_MAPPING	USHRT_MAX
+
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = -1;
+	sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk ? sk->sk_tx_queue_mapping : -1;
+	if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING)
+		return sk->sk_tx_queue_mapping;
+
+	return -1;
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)

^ permalink raw reply related

* [net-next PATCH v6 4/7] net: Record receive queue number for a connection
From: Amritha Nambiar @ 2018-06-30  4:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 include/net/busy_poll.h |    1 +
 include/net/sock.h      |   28 ++++++++++++++++++++++++++++
 net/core/sock.c         |    2 ++
 net/ipv4/tcp_input.c    |    3 +++
 4 files changed, 34 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c518743..9e36fda6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id = skb->napi_id;
 #endif
+	sk_rx_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 37b09c8..2b097cc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_rx_queue_mapping: rx queue number for this connection
  *	@skc_flags: place holder for sk_flags
  *		%SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  *		%SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +216,9 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	unsigned short		skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+	unsigned short		skc_rx_queue_mapping;
+#endif
 	union {
 		int		skc_incoming_cpu;
 		u32		skc_rcv_wnd;
@@ -326,6 +330,9 @@ struct sock {
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
 #define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
+#endif
 
 #define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end		__sk_common.skc_dontcopy_end
@@ -1702,6 +1709,27 @@ static inline int sk_tx_queue_get(const struct sock *sk)
 	return -1;
 }
 
+static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+	if (skb_rx_queue_recorded(skb)) {
+		u16 rx_queue = skb_get_rx_queue(skb);
+
+		if (WARN_ON_ONCE(rx_queue == NO_QUEUE_MAPPING))
+			return;
+
+		sk->sk_rx_queue_mapping = rx_queue;
+	}
+#endif
+}
+
+static inline void sk_rx_queue_clear(struct sock *sk)
+{
+#ifdef CONFIG_XPS
+	sk->sk_rx_queue_mapping = NO_QUEUE_MAPPING;
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index bcc4182..dac6d78 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2818,6 +2818,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_pacing_rate = ~0U;
 	sk->sk_pacing_shift = 10;
 	sk->sk_incoming_cpu = -1;
+
+	sk_rx_queue_clear(sk);
 	/*
 	 * Before updating sk_refcnt, we must commit prior changes to memory
 	 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c5b341..b3b5aef 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <net/busy_poll.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5588,6 +5589,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	if (skb) {
 		icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
 		security_inet_conn_established(sk, skb);
+		sk_mark_napi_id(sk, skb);
 	}
 
 	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
@@ -6416,6 +6418,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	tcp_rsk(req)->snt_isn = isn;
 	tcp_rsk(req)->txhash = net_tx_rndhash();
 	tcp_openreq_init_rwin(req, sk, dst);
+	sk_rx_queue_set(req_to_sk(req), skb);
 	if (!want_cookie) {
 		tcp_reqsk_record_syn(sk, req, skb);
 		fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);

^ permalink raw reply related

* [net-next PATCH v6 2/7] net: Use static_key for XPS maps
From: Amritha Nambiar @ 2018-06-30  4:26 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes static_key
'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
Rx queues map.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/dev.c |   31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7105955..43b5575 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
+struct static_key xps_needed __read_mostly;
+EXPORT_SYMBOL(xps_needed);
+struct static_key xps_rxqs_needed __read_mostly;
+EXPORT_SYMBOL(xps_rxqs_needed);
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)		\
 	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
@@ -2168,14 +2172,18 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 	struct xps_dev_maps *dev_maps;
 	unsigned int nr_ids;
 
-	mutex_lock(&xps_map_mutex);
+	if (!static_key_false(&xps_needed))
+		return;
 
-	dev_maps = xmap_dereference(dev->xps_rxqs_map);
-	if (dev_maps) {
-		nr_ids = dev->num_rx_queues;
-		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
-			       count, true);
+	mutex_lock(&xps_map_mutex);
 
+	if (static_key_false(&xps_rxqs_needed)) {
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		if (dev_maps) {
+			nr_ids = dev->num_rx_queues;
+			clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
+				       offset, count, true);
+		}
 	}
 
 	dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2189,6 +2197,10 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 		       false);
 
 out_no_maps:
+	if (static_key_enabled(&xps_rxqs_needed))
+		static_key_slow_dec(&xps_rxqs_needed);
+
+	static_key_slow_dec(&xps_needed);
 	mutex_unlock(&xps_map_mutex);
 }
 
@@ -2297,6 +2309,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
+	static_key_slow_inc(&xps_needed);
+	if (is_rxqs_map)
+		static_key_slow_inc(&xps_rxqs_needed);
+
 	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
@@ -3450,6 +3466,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
+	if (!static_key_false(&xps_needed))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {

^ permalink raw reply related

* [net-next PATCH v6 5/7] net: Enable Tx queue selection based on Rx queues
From: Amritha Nambiar @ 2018-06-30  4:27 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 include/net/sock.h |   10 ++++++++
 net/core/dev.c     |   62 ++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 2b097cc..2ed99bf 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1730,6 +1730,16 @@ static inline void sk_rx_queue_clear(struct sock *sk)
 #endif
 }
 
+#ifdef CONFIG_XPS
+static inline int sk_rx_queue_get(const struct sock *sk)
+{
+	if (sk && sk->sk_rx_queue_mapping != NO_QUEUE_MAPPING)
+		return sk->sk_rx_queue_mapping;
+
+	return -1;
+}
+#endif
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
 	sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index 43b5575..08d58e0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3459,35 +3459,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+#ifdef CONFIG_XPS
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+			       struct xps_dev_maps *dev_maps, unsigned int tci)
+{
+	struct xps_map *map;
+	int queue_index = -1;
+
+	if (dev->num_tc) {
+		tci *= dev->num_tc;
+		tci += netdev_get_prio_tc_map(dev, skb->priority);
+	}
+
+	map = rcu_dereference(dev_maps->attr_map[tci]);
+	if (map) {
+		if (map->len == 1)
+			queue_index = map->queues[0];
+		else
+			queue_index = map->queues[reciprocal_scale(
+						skb_get_hash(skb), map->len)];
+		if (unlikely(queue_index >= dev->real_num_tx_queues))
+			queue_index = -1;
+	}
+	return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
 	struct xps_dev_maps *dev_maps;
-	struct xps_map *map;
+	struct sock *sk = skb->sk;
 	int queue_index = -1;
 
 	if (!static_key_false(&xps_needed))
 		return -1;
 
 	rcu_read_lock();
-	dev_maps = rcu_dereference(dev->xps_cpus_map);
+	if (!static_key_false(&xps_rxqs_needed))
+		goto get_cpus_map;
+
+	dev_maps = rcu_dereference(dev->xps_rxqs_map);
 	if (dev_maps) {
-		unsigned int tci = skb->sender_cpu - 1;
+		int tci = sk_rx_queue_get(sk);
 
-		if (dev->num_tc) {
-			tci *= dev->num_tc;
-			tci += netdev_get_prio_tc_map(dev, skb->priority);
-		}
+		if (tci >= 0 && tci < dev->num_rx_queues)
+			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+							  tci);
+	}
 
-		map = rcu_dereference(dev_maps->attr_map[tci]);
-		if (map) {
-			if (map->len == 1)
-				queue_index = map->queues[0];
-			else
-				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
-									   map->len)];
-			if (unlikely(queue_index >= dev->real_num_tx_queues))
-				queue_index = -1;
+get_cpus_map:
+	if (queue_index < 0) {
+		dev_maps = rcu_dereference(dev->xps_cpus_map);
+		if (dev_maps) {
+			unsigned int tci = skb->sender_cpu - 1;
+
+			queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+							  tci);
 		}
 	}
 	rcu_read_unlock();

^ permalink raw reply related

* [net-next PATCH v6 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
From: Amritha Nambiar @ 2018-06-30  4:27 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

Extend transmit queue sysfs attribute to configure Rx queue(s) map
per Tx queue. By default no receive queues are configured for the
Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 net/core/net-sysfs.c |   83 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b39987c..f25ac5f 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,88 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
 	= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+	struct net_device *dev = queue->dev;
+	struct xps_dev_maps *dev_maps;
+	unsigned long *mask, index;
+	int j, len, num_tc = 1, tc = 0;
+
+	index = get_netdev_queue_index(queue);
+
+	if (dev->num_tc) {
+		num_tc = dev->num_tc;
+		tc = netdev_txq_to_tc(dev, index);
+		if (tc < 0)
+			return -EINVAL;
+	}
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	dev_maps = rcu_dereference(dev->xps_rxqs_map);
+	if (!dev_maps)
+		goto out_no_maps;
+
+	for (j = -1; j = netif_attrmask_next(j, NULL, dev->num_rx_queues),
+	     j < dev->num_rx_queues;) {
+		int i, tci = j * num_tc + tc;
+		struct xps_map *map;
+
+		map = rcu_dereference(dev_maps->attr_map[tci]);
+		if (!map)
+			continue;
+
+		for (i = map->len; i--;) {
+			if (map->queues[i] == index) {
+				set_bit(j, mask);
+				break;
+			}
+		}
+	}
+out_no_maps:
+	rcu_read_unlock();
+
+	len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+	kfree(mask);
+
+	return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+			      size_t len)
+{
+	struct net_device *dev = queue->dev;
+	struct net *net = dev_net(dev);
+	unsigned long *mask, index;
+	int err;
+
+	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+		return -EPERM;
+
+	mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+		       GFP_KERNEL);
+	if (!mask)
+		return -ENOMEM;
+
+	index = get_netdev_queue_index(queue);
+
+	err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+	if (err) {
+		kfree(mask);
+		return err;
+	}
+
+	err = __netif_set_xps_queue(dev, mask, index, true);
+	kfree(mask);
+	return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+	= __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1372,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
 	&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
 	&xps_cpus_attribute.attr,
+	&xps_rxqs_attribute.attr,
 	&queue_tx_maxrate.attr,
 #endif
 	NULL

^ permalink raw reply related

* [net-next PATCH v6 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
From: Amritha Nambiar @ 2018-06-30  4:27 UTC (permalink / raw)
  To: netdev, davem
  Cc: alexander.h.duyck, willemdebruijn.kernel, amritha.nambiar,
	sridhar.samudrala, alexander.duyck, edumazet, hannes, tom, tom
In-Reply-To: <153033254062.8297.3821413056768389263.stgit@anamhost.jf.intel.com>

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
---
 Documentation/ABI/testing/sysfs-class-net-queues |   11 ++++
 Documentation/networking/scaling.txt             |   61 ++++++++++++++++++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
index 0c0df91..978b763 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -42,6 +42,17 @@ Description:
 		network device transmit queue. Possible vaules depend on the
 		number of available CPU(s) in the system.
 
+What:		/sys/class/<iface>/queues/tx-<queue>/xps_rxqs
+Date:		June 2018
+KernelVersion:	4.18.0
+Contact:	netdev@vger.kernel.org
+Description:
+		Mask of the receive queue(s) currently enabled to participate
+		into the Transmit Packet Steering packet processing flow for this
+		network device transmit queue. Possible values depend on the
+		number of available receive queue(s) in the network device.
+		Default is disabled.
+
 What:		/sys/class/<iface>/queues/tx-<queue>/byte_queue_limits/hold_time
 Date:		November 2011
 KernelVersion:	3.3
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index f55639d..b7056a8 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
 
 Transmit Packet Steering is a mechanism for intelligently selecting
 which transmit queue to use when transmitting a packet on a multi-queue
-device. To accomplish this, a mapping from CPU to hardware queue(s) is
-recorded. The goal of this mapping is usually to assign queues
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
 exclusively to a subset of CPUs, where the transmit completions for
 these queues are processed on a CPU within this set. This choice
 provides two benefits. First, contention on the device queue lock is
@@ -377,15 +382,40 @@ transmit queue). Secondly, cache miss rate on transmit completion is
 reduced, in particular for data cache lines that hold the sk_buff
 structures.
 
-XPS is configured per transmit queue by setting a bitmap of CPUs that
-may use that queue to transmit. The reverse mapping, from CPUs to
-transmit queues, is computed and maintained for each network device.
-When transmitting the first packet in a flow, the function
-get_xps_queue() is called to select a queue. This function uses the ID
-of the running CPU as a key into the CPU-to-queue lookup table. If the
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue associations for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection. In this model, sending the packets on the same
+transmit queue corresponding to the associated receive queue has benefits
+in keeping the CPU overhead low. Transmit completion work is locked into
+the same queue-association that a given application is polling on. This
+avoids the overhead of triggering an interrupt on another CPU. When the
+application cleans up the packets during the busy poll, transmit completion
+may be processed along with it in the same thread context and so result in
+reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
 ID matches a single queue, that is used for transmission. If multiple
 queues match, one is selected by using the flow hash to compute an index
-into the set.
+into the set. When selecting the transmit queue based on receive queue(s)
+map, the transmit device is not validated against the receive device as it
+requires expensive lookup operation in the datapath.
 
 The queue chosen for transmitting a particular flow is saved in the
 corresponding socket structure for the flow (e.g. a TCP connection).
@@ -404,11 +434,15 @@ acknowledged.
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
 default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+configured. To enable XPS, the bitmap of CPUs/receive-queues that may
+use a transmit queue is configured using the sysfs file entry:
 
+For selection based on CPUs map:
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+For selection based on receive-queues map:
+/sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
@@ -421,6 +455,11 @@ best CPUs to share a given queue are probably those that share the cache
 with the CPU that processes transmit completions for that queue
 (transmit interrupts).
 
+For transmit queue selection based on receive queue(s), XPS has to be
+explicitly configured mapping receive-queue(s) to transmit queue(s). If the
+user configuration for receive-queue map does not apply, then the transmit
+queue is selected based on the CPUs map.
+
 Per TX Queue rate limitation:
 =============================
 

^ permalink raw reply related

* Re: [PATCH net-next] net: stmmac: Add support for CBS QDISC
From: David Miller @ 2018-06-30  9:39 UTC (permalink / raw)
  To: Jose.Abreu
  Cc: netdev, Joao.Pinto, Vitor.Soares, peppe.cavallaro,
	alexandre.torgue
In-Reply-To: <daa3124702201f0d1c30da51f2a90cf432e39ba3.1530103308.git.joabreu@synopsys.com>

From: Jose Abreu <Jose.Abreu@synopsys.com>
Date: Wed, 27 Jun 2018 13:43:20 +0100

> This adds support for CBS reconfiguration using the TC application.
> 
> A new callback was added to TC ops struct and another one to DMA ops to
> reconfigure the channel mode.
> 
> Tested in GMAC5.10.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] tcp: fix Fast Open key endianness
From: David Miller @ 2018-06-30  9:42 UTC (permalink / raw)
  To: ycheng; +Cc: netdev, edumazet, ncardwell
In-Reply-To: <20180627230448.236488-1-ycheng@google.com>

From: Yuchung Cheng <ycheng@google.com>
Date: Wed, 27 Jun 2018 16:04:48 -0700

> Fast Open key could be stored in different endian based on the CPU.
> Previously hosts in different endianness in a server farm using
> the same key config (sysctl value) would produce different cookies.
> This patch fixes it by always storing it as little endian to keep
> same API for LE hosts.
> 
> Reported-by: Daniele Iamartino <danielei@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH] bnx2x: Mark expected switch fall-throughs
From: David Miller @ 2018-06-30  9:43 UTC (permalink / raw)
  To: gustavo; +Cc: ariel.elior, everest-linux-l2, michael.chan, netdev, linux-kernel
In-Reply-To: <20180628013223.GA22163@embeddedor.com>

From: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
Date: Wed, 27 Jun 2018 20:32:23 -0500

> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: add new SNMP counter for drops when try to queue in rcv queue
From: David Miller @ 2018-06-30  9:44 UTC (permalink / raw)
  To: laoar.shao; +Cc: edumazet, netdev, linux-kernel
In-Reply-To: <1530159776-29919-1-git-send-email-laoar.shao@gmail.com>

From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 28 Jun 2018 00:22:56 -0400

> When sk_rmem_alloc is larger than the receive buffer and we can't
> schedule more memory for it, the skb will be dropped.
> 
> In above situation, if this skb is put into the ofo queue,
> LINUX_MIB_TCPOFODROP is incremented to track it.
> 
> While if this skb is put into the receive queue, there's no record.
> So a new SNMP counter is introduced to track this behavior.
> 
> LINUX_MIB_TCPRCVQDROP:  Number of packets meant to be queued in rcv queue
> 			but dropped because socket rcvbuf limit hit.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net] atm: iphase: fix a 64 bit bug
From: David Miller @ 2018-06-30  9:45 UTC (permalink / raw)
  To: dan.carpenter; +Cc: 3chas3, linux-atm-general, netdev, kernel-janitors
In-Reply-To: <20180628092442.uu4ktcqymi65hbuh@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Thu, 28 Jun 2018 12:24:42 +0300

> The code assumes that there is 4 bytes in a pointer and it doesn't
> allocate enough memory.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] cnic: tidy up a size calculation
From: David Miller @ 2018-06-30  9:47 UTC (permalink / raw)
  To: dan.carpenter; +Cc: christophe.jaillet, keescook, netdev, kernel-janitors
In-Reply-To: <20180628093125.6qamire66fzkv47l@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Thu, 28 Jun 2018 12:31:25 +0300

> Static checkers complain that id_tbl->table points to longs and 4 bytes
> is smaller than sizeof(long).  But the since other side is dividing by
> 32 instead of sizeof(long), that means the current code works fine.
> 
> Anyway, it's more conventional to use the BITS_TO_LONGS() macro when
> we're allocating a bitmap.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 1/1] bnx2x: Fix receiving tx-timeout in error or recovery state.
From: David Miller @ 2018-06-30  9:48 UTC (permalink / raw)
  To: sudarsana.kalluru; +Cc: netdev, Ariel.Elior
In-Reply-To: <20180628115215.7709-1-sudarsana.kalluru@cavium.com>

From: Sudarsana Reddy Kalluru <sudarsana.kalluru@cavium.com>
Date: Thu, 28 Jun 2018 04:52:15 -0700

> Driver performs the internal reload when it receives tx-timeout event from
> the OS. Internal reload might fail in some scenarios e.g., fatal HW issues.
> In such cases OS still see the link, which would result in undesirable
> functionalities such as re-generation of tx-timeouts.
> The patch addresses this issue by indicating the link-down to OS when
> tx-timeout is detected, and keeping the link in down state till the
> internal reload is successful.
> 
> Please consider applying it to 'net' branch.
> 
> Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
> Signed-off-by: Ariel Elior <ariel.elior@cavium.com>

Applied.

^ permalink raw reply

* Re: [net 0/2] DPAA fixes
From: David Miller @ 2018-06-30  9:51 UTC (permalink / raw)
  To: madalin.bucur; +Cc: netdev, linux-kernel
In-Reply-To: <1530188811-8918-1-git-send-email-madalin.bucur@nxp.com>

From: Madalin Bucur <madalin.bucur@nxp.com>
Date: Thu, 28 Jun 2018 15:26:49 +0300

> A couple of fixes for the DPAA drivers, addressing an issue
> with short UDP or TCP frames (with padding) that were marked
> as having a wrong checksum and dropped by the FMan hardware
> and a problem with the buffer used for the scatter-gather
> table being too small as per the hardware requirements.

Series applied.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox