Netdev List


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Jason Wang @ 2017-09-01  3:25 UTC (permalink / raw)
  To: Willem de Bruijn, Michael S. Tsirkin
  Cc: Koichiro Den, virtualization, Network Development
In-Reply-To: <CAF=yD-+AjQLLUKdvnrwd2tqFtw4Hm81cR7WUJd65oLnziNGM8A@mail.gmail.com>



On 2017-08-31 22:30, Willem de Bruijn wrote:
>> Incomplete results at this stage, but I do see this correlation between
>> flows. It occurs even while not running out of zerocopy descriptors,
>> which I cannot yet explain.
>>
>> Running two threads in a guest, each with a udp socket, each
>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>
>> Sender A sends 1B datagrams.
>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>> to trigger zcopy_used in vhost net.
>>
>> A local receive process on the host receives both flows. To avoid
>> a deep copy when looping the packet onto the receive path,
>> changed skb_orphan_frags_rx to always return false (gross hack).
>>
>> The flow with the larger packets is redirected through netem on ifb0:
>>
>>    modprobe ifb
>>    ip link set dev ifb0 up
>>    tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>
>>    tc qdisc add dev tap0 ingress
>>    tc filter add dev tap0 parent ffff: protocol ip \
>>        u32 match ip dport 8000 0xffff \
>>        action mirred egress redirect dev ifb0
>>
>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>
>> no filter
>>    rx.A: ~840,000
>>    rx.B: ~840,000
>>
>> limit 1
>>    rx.A: ~500,000
>>    rx.B: ~3100
>>    ifb0: 3273 sent, 371141 dropped
>>
>> limit 100
>>    rx.A: ~9000
>>    rx.B: ~4200
>>    ifb0: 4630 sent, 1491 dropped
>>
>> limit 1000
>>    rx.A: ~6800
>>    rx.B: ~4200
>>    ifb0: 4651 sent, 0 dropped
>>
>> Sender B is always correctly rate limited to 1 Mbit/s or less. With a
>> short queue, it ends up dropping a lot and sending even less.
>>
>> When a queue builds up for sender B, sender A throughput is strongly
>> correlated with queue length. With queue length 1, it can send almost
>> at unthrottled speed. But even at limit 100 its throughput is on the
>> same order as sender B.
>>
>> What is surprising to me is that this happens even though the number
>> of ubuf_info in use at limit 100 is around 100 at all times. In other words,
>> it does not exhaust the pool.
>>
>> When forcing zcopy_used to be false for all packets, this effect of
>> sender A throughput being correlated with sender B does not happen.
>>
>> no filter
>>    rx.A: ~850,000
>>    rx.B: ~850,000
>>
>> limit 100
>>    rx.A: ~850,000
>>    rx.B: ~4200
>>    ifb0: 4518 sent, 876182 dropped
>>
>> Also relevant is that with zerocopy, the sender processes back off
>> and report the same count as the receiver. Without zerocopy,
>> both senders send at full speed, even if only 4200 packets from flow
>> B arrive at the receiver.
>>
>> This is with the default virtio_net driver, so without napi-tx.
>>
>> It appears that the zerocopy notifications are pausing the guest.
>> Will look at that now.
> It was indeed as simple as that. With 256 descriptors, queuing even
> a hundred or so packets causes the guest to stall the device as soon
> as the qdisc is installed.
>
> Adding this check
>
> +                       in_use = nvq->upend_idx - nvq->done_idx;
> +                       if (nvq->upend_idx < nvq->done_idx)
> +                               in_use += UIO_MAXIOV;
> +
> +                       if (in_use > (vq->num >> 2))
> +                               zcopy_used = false;
>
> Has the desired behavior of reverting zerocopy requests to copying.
>
> Without this change, the result is, as previously reported, throughput
> dropping to hundreds of packets per second on both flows.
>
> With the change, pps as observed for a few seconds at handle_tx is
>
> zerocopy=165 copy=168435
> zerocopy=0 copy=168500
> zerocopy=65 copy=168535
>
> Both flows continue to send at more or less normal rate, with only
> sender B observing massive drops at the netem.
>
> With the queue removed the rate reverts to
>
> zerocopy=58878 copy=110239
> zerocopy=58833 copy=110207
>
> This is not a 50/50 split, which implies that some packets from the large
> packet flow are still converted to copying. Without the change the rate
> without queue was 80k zerocopy vs 80k copy, so this choice of
> (vq->num >> 2) appears too conservative.
>
> However, testing with (vq->num >> 1) was not as effective at mitigating
> stalls. I did not save that data, unfortunately. Can run more tests on fine
> tuning this variable, if the idea sounds good.
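
The occupancy check quoted above can be tried out in isolation. The following is a minimal userspace sketch, not the vhost code itself: `RING_SIZE` stands in for `UIO_MAXIOV` (1024 in the kernel headers), and the function names `zcopy_in_use` and `zcopy_allowed` are hypothetical.

```c
#include <stdbool.h>

/* Stand-in for UIO_MAXIOV: both indices advance modulo this value. */
#define RING_SIZE 1024

/* Zerocopy buffers submitted (upend_idx) but not yet completed (done_idx).
 * When upend_idx has wrapped and done_idx has not, add the ring size back.
 */
static int zcopy_in_use(int upend_idx, int done_idx)
{
	int in_use = upend_idx - done_idx;

	if (upend_idx < done_idx)
		in_use += RING_SIZE;
	return in_use;
}

/* Mirror of the quoted check: keep using zerocopy only while at most a
 * quarter of the virtqueue (vq_num entries) is pending completion.
 */
static bool zcopy_allowed(int upend_idx, int done_idx, int vq_num)
{
	return zcopy_in_use(upend_idx, done_idx) <= (vq_num >> 2);
}
```

With the 256 descriptors mentioned above, the cutoff is 64 pending buffers, so `zcopy_allowed(65, 0, 256)` is false. The wrapped case `zcopy_in_use(10, 1000)` yields 34.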

Looks like there are still two cases left:

1) sndbuf is not INT_MAX
2) tx napi is used for virtio-net

1) could be a corner case, and for 2) what you suggest here may not
solve the issue, since it still does in-order completion.

Thanks

^ permalink raw reply

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
From: Alexei Starovoitov @ 2017-09-01  3:10 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, David Miller, Willem de Bruijn
In-Reply-To: <CAF=yD-+HaGUYYitWYxVvYUyQUBerw-1YhTnKBdz+_qYJ_T=fdA@mail.gmail.com>

On Thu, Aug 31, 2017 at 11:04:41PM -0400, Willem de Bruijn wrote:
> On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
> >> From: Willem de Bruijn <willemb@google.com>
> >>
> >> Documentation for this feature was missing from the patchset.
> >> Copied a lot from the netdev 2.1 paper, addressing some small
> >> interface changes since then.
> >>
> >> Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ...
> >> +Notification Batching
> >> +~~~~~~~~~~~~~~~~~~~~~
> >> +
> >> +Multiple outstanding packets can be read at once using the recvmmsg
> >> +call. This is often not needed. In each message the kernel returns not
> >> +a single value, but a range. It coalesces consecutive notifications
> >> +while one is outstanding for reception on the error queue.
> >> +
> >> +When a new notification is about to be queued, it checks whether the
> >> +new value extends the range of the notification at the tail of the
> >> +queue. If so, it drops the new notification packet and instead increases
> >> +the range upper value of the outstanding notification.
> >
> > Would it make sense to mention that max notification range is 32-bit?
> > So each 4Gbyte of xmit bytes there will be a notification.
> > In modern 40Gbps NICs it's not a lot. Means that there will be
> > at least one notification every second.
> > Or I misread the code?
> 
> You're right. The doc does mention that the counter and range
> are 32-bit. I can state more explicitly that that bounds the working
> set size to 4GB. Do you expect this to be problematic? Processing
> a single notification per 4GB of data should not be a significant
> cost in itself.

I think 4GB is fine. Just there was an idea that in cases when
notification of transmission can be known by other means the user space
could have skipped reading errqueue completely, but looks like it
still needs to poll. That's fine.

> > Thanks for the doc!
> 
> Thanks for reviewing :)
> 
> >
> > Acked-by: Alexei Starovoitov <ast@kernel.org>
> >

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Jason Wang @ 2017-09-01  3:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization,
	Network Development
In-Reply-To: <CAF=yD-KUoW6hxZtpAmyVrJXCY+=Fq1FOcbD3h=HmDQaPoC1MLg@mail.gmail.com>



On 2017-08-30 11:11, Willem de Bruijn wrote:
> On Tue, Aug 29, 2017 at 9:45 PM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017-08-30 03:35, Willem de Bruijn wrote:
>>> On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
>>> <willemdebruijn.kernel@gmail.com> wrote:
>>>> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com>
>>>> wrote:
>>>>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:

[...]

>>> Incomplete results at this stage, but I do see this correlation between
>>> flows. It occurs even while not running out of zerocopy descriptors,
>>> which I cannot yet explain.
>>>
>>> Running two threads in a guest, each with a udp socket, each
>>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>>
>>> Sender A sends 1B datagrams.
>>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>>> to trigger zcopy_used in vhost net.
>>>
>>> A local receive process on the host receives both flows. To avoid
>>> a deep copy when looping the packet onto the receive path,
>>> changed skb_orphan_frags_rx to always return false (gross hack).
>>>
>>> The flow with the larger packets is redirected through netem on ifb0:
>>>
>>>     modprobe ifb
>>>     ip link set dev ifb0 up
>>>     tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>>
>>>     tc qdisc add dev tap0 ingress
>>>     tc filter add dev tap0 parent ffff: protocol ip \
>>>         u32 match ip dport 8000 0xffff \
>>>         action mirred egress redirect dev ifb0
>>>
>>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>>
>>> no filter
>>>     rx.A: ~840,000
>>>     rx.B: ~840,000
>>
>> Just to make sure I understand the case here. What did rx.B mean here? I
>> thought all traffic sent by Sender B has been redirected to ifb0?
> It has been, but the packet still arrives at the destination socket.
> IFB is a special virtual device that applies traffic shaping and
> then reinjects it back at the point it was intercepted by mirred.
>
> rx.B is indeed arrival rate at the receiver, similar to rx.A.
>

I see, then ifb looks like a good fit for the test.

^ permalink raw reply

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
From: Willem de Bruijn @ 2017-09-01  3:04 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Network Development, David Miller, Willem de Bruijn
In-Reply-To: <20170901021007.pkbb2gsipbprf4w7@ast-mbp>

On Thu, Aug 31, 2017 at 10:10 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <willemb@google.com>
>>
>> Documentation for this feature was missing from the patchset.
>> Copied a lot from the netdev 2.1 paper, addressing some small
>> interface changes since then.
>>
>> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ...
>> +Notification Batching
>> +~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Multiple outstanding packets can be read at once using the recvmmsg
>> +call. This is often not needed. In each message the kernel returns not
>> +a single value, but a range. It coalesces consecutive notifications
>> +while one is outstanding for reception on the error queue.
>> +
>> +When a new notification is about to be queued, it checks whether the
>> +new value extends the range of the notification at the tail of the
>> +queue. If so, it drops the new notification packet and instead increases
>> +the range upper value of the outstanding notification.
>
> Would it make sense to mention that max notification range is 32-bit?
> So each 4Gbyte of xmit bytes there will be a notification.
> In modern 40Gbps NICs it's not a lot. Means that there will be
> at least one notification every second.
> Or I misread the code?

You're right. The doc does mention that the counter and range
are 32-bit. I can state more explicitly that that bounds the working
set size to 4GB. Do you expect this to be problematic? Processing
a single notification per 4GB of data should not be a significant
cost in itself.
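
As a rough check of the numbers in this exchange, the time to push 4GB through a 40Gbps link can be computed directly. This is only a back-of-the-envelope sketch; the helper name `seconds_to_send` is hypothetical, and it ignores protocol overhead.

```c
#include <stdint.h>

/* Seconds needed to transmit `bytes` at `gbps` gigabits per second,
 * ignoring framing and protocol overhead.
 */
static double seconds_to_send(uint64_t bytes, double gbps)
{
	return (double)bytes * 8.0 / (gbps * 1e9);
}
```

For 2^32 bytes at 40 Gbit/s this comes to roughly 0.86 seconds, which matches the "at least one notification every second" estimate quoted above.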

> Thanks for the doc!

Thanks for reviewing :)

>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
>

^ permalink raw reply

* Re: [PATCH net-next v5 2/2] tcp_diag: report TCP MD5 signing keys and addresses
From: Eric Dumazet @ 2017-09-01  2:58 UTC (permalink / raw)
  To: Ivan Delalande; +Cc: David Miller, netdev
In-Reply-To: <20170831165939.5121-3-colona@arista.com>

On Thu, 2017-08-31 at 09:59 -0700, Ivan Delalande wrote:
> Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to
> processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is
> not possible to retrieve these from the kernel once they have been
> configured on sockets.
> 
> Signed-off-by: Ivan Delalande <colona@arista.com>
> ---
>  include/uapi/linux/inet_diag.h |   1 +
>  include/uapi/linux/tcp.h       |   9 ++++
>  net/ipv4/tcp_diag.c            | 109 ++++++++++++++++++++++++++++++++++++++---
>  3 files changed, 113 insertions(+), 6 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH net-next v5 1/2] inet_diag: allow protocols to provide additional data
From: Eric Dumazet @ 2017-09-01  2:57 UTC (permalink / raw)
  To: Ivan Delalande; +Cc: David Miller, netdev
In-Reply-To: <20170831165939.5121-2-colona@arista.com>

On Thu, 2017-08-31 at 09:59 -0700, Ivan Delalande wrote:
> Extend inet_diag_handler to allow individual protocols to report
> additional data on INET_DIAG_INFO through idiag_get_aux. The size
> can be dynamic and is computed by idiag_get_aux_size.
> 
> Signed-off-by: Ivan Delalande <colona@arista.com>
> ---
>  include/linux/inet_diag.h |  7 +++++++
>  net/ipv4/inet_diag.c      | 22 ++++++++++++++++++----
>  2 files changed, 25 insertions(+), 4 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: [PATCH] bnx2x: drop packets where gso_size is too big for hardware
From: Daniel Axtens @ 2017-09-01  2:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, tlfalcon, Yuval.Mintz, ariel.elior, everest-linux-l2,
	jay.vosburgh
In-Reply-To: <1504159341.15310.6.camel@edumazet-glaptop3.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> If you had this test in bnx2x_features_check(), packet could be
> segmented by core networking stack before reaching bnx2x_start_xmit() by
> clearing NETIF_F_GSO_MASK
>
> -> No drop would be involved.

Thanks for the pointer - networking code is all a bit new to me.

I'm just struggling at the moment to figure out the right way to
calculate the length. My original patch uses gso_size + hlen, but:

 - On reflection, while this solves the immediate bug, I'm not 100% sure
   this is the right thing to be calculating

 - If it is, then we have the problem that hlen is calculated in a bunch
   of weird and wonderful ways which make it a nightmare to extract.

Yuval (or anyone else who groks the driver properly) - what's the right
test to be doing here to make sure we don't write too much data to the
card?
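
To make the suggestion concrete, here is a userspace model of a features-check style test. It assumes gso_size + hlen is the quantity to compare, which the message above explicitly leaves open. `MODEL_GSO_MASK` and `MODEL_HW_MAX_LEN` are arbitrary placeholders, not the kernel's NETIF_F_GSO_MASK or any documented bnx2x limit, and `model_features_check` is a hypothetical name.

```c
#include <stdint.h>

/* Placeholder values for illustration only. */
#define MODEL_GSO_MASK   0x00ff0000ULL
#define MODEL_HW_MAX_LEN 9700u

/* If the segment plus headers would exceed what the hardware accepts,
 * clear the GSO feature bits so the stack segments the packet in software
 * before it reaches the driver's xmit routine. Nothing is dropped.
 */
static uint64_t model_features_check(uint32_t gso_size, uint32_t hlen,
				     uint64_t features)
{
	if (gso_size + hlen > MODEL_HW_MAX_LEN)
		features &= ~MODEL_GSO_MASK;
	return features;
}
```

The point of doing the test at this stage rather than in the xmit path is that an oversized packet falls back to software segmentation instead of being discarded.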

Regards,
Daniel

^ permalink raw reply

* RE: [RFC PATCH] net: frag limit checks need to use percpu_counter_compare
From: liujian (CE) @ 2017-09-01  2:25 UTC (permalink / raw)
  To: Michal Kubecek, Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, Florian Westphal
In-Reply-To: <20170831162349.k3qnkfgkygdh2zqw@unicorn.suse.cz>




Best Regards,
liujian


> -----Original Message-----
> From: Michal Kubecek [mailto:mkubecek@suse.cz]
> Sent: Friday, September 01, 2017 12:24 AM
> To: Jesper Dangaard Brouer
> Cc: liujian (CE); netdev@vger.kernel.org; Florian Westphal
> Subject: Re: [RFC PATCH] net: frag limit checks need to use
> percpu_counter_compare
> 
> On Thu, Aug 31, 2017 at 12:20:19PM +0200, Jesper Dangaard Brouer wrote:
> > To: Liujian can you please test this patch?
> >  I want to understand if using __percpu_counter_compare() solves  the
> > problem correctness wise (even-though this will be slower  than using
> > a simple atomic_t on your big system).

I have tested the patch, and it works.
1. make sure frag_mem_limit reaches thresh
  ===>FRAG: inuse 0 memory 0 frag_mem_limit 5386864
2. change NIC rx irq's affinity to a fixed CPU
3. iperf -u -c 9.83.1.41 -l 10000 -i 1 -t 1000 -P 10 -b 20M
  And check /proc/net/snmp, there are no ReasmFails.

And I think adding some counter sync points, as you said, is a better way.

> > Fix a bug in the fragmentation code's use of the percpu_counter API that
> > causes issues on systems with many CPUs.
> >
> > The frag_mem_limit() just reads the global counter (fbc->count),
> > without considering that other CPUs can each hold up to a batch size (130K)
> > that hasn't been subtracted yet.  Due to the 3MBytes lower thresh limit,
> > this becomes dangerous at >=24 CPUs (3*1024*1024/130000=24).
> >
> > The __percpu_counter_compare() does the right thing, and takes into
> > account the number of (online) CPUs and batch size, to account for
> > this and call __percpu_counter_sum() when needed.
> >
> > On systems with many CPUs this will unfortunately always result in the
> > heavier fully locked __percpu_counter_sum() which touches the
> > per_cpu_ptr of all (online) CPUs.
> >
> > On systems with a smaller number of CPUs this solution is also not
> > optimal, because __percpu_counter_compare()/__percpu_counter_sum()
> > doesn't help synchronize the global counter.
> >  Florian Westphal has an idea of adding some counter sync points,
> > which should help address this issue.
> > ---
> >  include/net/inet_frag.h  |   16 ++++++++++++++--
> >  net/ipv4/inet_fragment.c |    6 +++---
> >  2 files changed, 17 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index
> > 6fdcd2427776..b586e320783d 100644
> > --- a/include/net/inet_frag.h
> > +++ b/include/net/inet_frag.h
> > @@ -147,9 +147,21 @@ static inline bool inet_frag_evicting(struct
> inet_frag_queue *q)
> >   */
> >  static unsigned int frag_percpu_counter_batch = 130000;
> >
> > -static inline int frag_mem_limit(struct netns_frags *nf)
> > +static inline bool frag_mem_over_limit(struct netns_frags *nf, int
> > +thresh)
> >  {
> > -	return percpu_counter_read(&nf->mem);
> > +	/* When reading counter here, __percpu_counter_compare() call
> > +	 * will invoke __percpu_counter_sum() when needed.  This
> > +	 * depends on num_online_cpus()*batch size, as each CPU can
> > +	 * potentially hold a batch count.
> > +	 *
> > +	 * With many CPUs this heavier sum operation will
> > +	 * unfortunately always occur.
> > +	 */
> > +	if (__percpu_counter_compare(&nf->mem, thresh,
> > +				     frag_percpu_counter_batch) > 0)
> > +		return true;
> > +	else
> > +		return false;
> >  }
> >
> >  static inline void sub_frag_mem_limit(struct netns_frags *nf, int i)
> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index
> > 96e95e83cc61..ee2cf56900e6 100644
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -120,7 +120,7 @@ static void inet_frag_secret_rebuild(struct
> > inet_frags *f)  static bool inet_fragq_should_evict(const struct
> > inet_frag_queue *q)  {
> >  	return q->net->low_thresh == 0 ||
> > -	       frag_mem_limit(q->net) >= q->net->low_thresh;
> > +		frag_mem_over_limit(q->net, q->net->low_thresh);
> >  }
> >
> >  static unsigned int
> > @@ -355,7 +355,7 @@ static struct inet_frag_queue
> > *inet_frag_alloc(struct netns_frags *nf,  {
> >  	struct inet_frag_queue *q;
> >
> > -	if (!nf->high_thresh || frag_mem_limit(nf) > nf->high_thresh) {
> > +	if (!nf->high_thresh || frag_mem_over_limit(nf, nf->high_thresh)) {
> >  		inet_frag_schedule_worker(f);
> >  		return NULL;
> >  	}
> 
> If we go this way (which would IMHO require some benchmarks to make sure it
> doesn't harm performance too much) we can drop the explicit checks for zero
> thresholds which were added to work around the unreliability of fast checks of
> percpu counters (or at least the second one was, by commit
> 30759219f562 ("net: disable fragment reassembly if high_thresh is zero")).
> 
> Michal Kubecek
> 
> > @@ -396,7 +396,7 @@ struct inet_frag_queue *inet_frag_find(struct
> netns_frags *nf,
> >  	struct inet_frag_queue *q;
> >  	int depth = 0;
> >
> > -	if (frag_mem_limit(nf) > nf->low_thresh)
> > +	if (frag_mem_over_limit(nf, nf->low_thresh))
> >  		inet_frag_schedule_worker(f);
> >
> >  	hash &= (INETFRAGS_HASHSZ - 1);
> >
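
The failure mode described in the patch can be reproduced with a small userspace model. This is a sketch under assumptions, not the kernel's percpu_counter implementation: the struct, the fixed `MODEL_NCPUS`, and the `model_*` names are all hypothetical. It only mirrors the idea that the cheap read sees the global count while each CPU may hold up to one batch of unfolded deltas.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NCPUS 32

struct model_percpu_counter {
	int64_t count;               /* global, approximate value */
	int32_t local[MODEL_NCPUS];  /* per-CPU deltas not yet folded in */
};

/* Cheap read: global count only, as the old frag_mem_limit() did. */
static int64_t model_read(const struct model_percpu_counter *c)
{
	return c->count;
}

/* Exact value: global count plus every per-CPU delta. */
static int64_t model_sum(const struct model_percpu_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < MODEL_NCPUS; i++)
		sum += c->local[i];
	return sum;
}

/* Returns 1, 0 or -1 for counter >, == or < rhs. The cheap read is
 * trusted only when it is further from rhs than the worst-case error
 * of batch * ncpus; otherwise fall back to the exact sum.
 */
static int model_compare(const struct model_percpu_counter *c,
			 int64_t rhs, int32_t batch)
{
	int64_t count = model_read(c);

	if (llabs(count - rhs) > (int64_t)batch * MODEL_NCPUS)
		return count > rhs ? 1 : -1;

	count = model_sum(c);
	if (count > rhs)
		return 1;
	if (count < rhs)
		return -1;
	return 0;
}
```

With a global count of 4MB and each of 32 CPUs holding an unfolded delta of -129000, the cheap read is above the 3MB threshold even though the true sum is only 66304 bytes. The compare routine falls back to the exact sum and reports "below threshold".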

^ permalink raw reply

* Re: [PATCH net-next] doc: document MSG_ZEROCOPY
From: Alexei Starovoitov @ 2017-09-01  2:10 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem, Willem de Bruijn
In-Reply-To: <20170831210013.85220-1-willemdebruijn.kernel@gmail.com>

On Thu, Aug 31, 2017 at 05:00:13PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Documentation for this feature was missing from the patchset.
> Copied a lot from the netdev 2.1 paper, addressing some small
> interface changes since then.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
...
> +Notification Batching
> +~~~~~~~~~~~~~~~~~~~~~
> +
> +Multiple outstanding packets can be read at once using the recvmmsg
> +call. This is often not needed. In each message the kernel returns not
> +a single value, but a range. It coalesces consecutive notifications
> +while one is outstanding for reception on the error queue.
> +
> +When a new notification is about to be queued, it checks whether the
> +new value extends the range of the notification at the tail of the
> +queue. If so, it drops the new notification packet and instead increases
> +the range upper value of the outstanding notification.

Would it make sense to mention that max notification range is 32-bit?
So each 4Gbyte of xmit bytes there will be a notification.
In modern 40Gbps NICs it's not a lot. Means that there will be
at least one notification every second.
Or I misread the code?
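
The coalescing rule in the quoted documentation, together with the 32-bit bound discussed here, can be sketched as follows. This is an illustrative model rather than the kernel code: `struct zc_range` and `zc_range_extend` are hypothetical names, and the range is treated as an inclusive pair of 32-bit values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the notification at the tail of the error queue:
 * an inclusive range [lo, hi] of 32-bit completion values.
 */
struct zc_range {
	uint32_t lo;
	uint32_t hi;
};

/* Try to fold a new range starting at `lo` and covering `len` values
 * into the tail notification. Returns true if the tail was extended,
 * in which case the new notification can be dropped.
 */
static bool zc_range_extend(struct zc_range *tail, uint32_t lo, uint32_t len)
{
	/* Length of the merged range, computed in 64 bits to detect overflow. */
	uint64_t sum_len = (uint64_t)(uint32_t)(tail->hi - tail->lo) + 1 + len;

	if (len == 0)
		return false;
	if (sum_len > UINT32_MAX)            /* merged range must fit in 32 bits */
		return false;
	if (lo != (uint32_t)(tail->hi + 1))  /* only consecutive values coalesce */
		return false;

	tail->hi += len;
	return true;
}
```

A tail of [5, 9] extended by three values starting at 10 becomes [5, 12]. A non-consecutive start, or a merge that would exceed the 32-bit range, leaves the tail untouched, so a separate notification has to be queued.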

Thanks for the doc!

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH 2/3] security: bpf: Add eBPF LSM hooks and security field to eBPF map
From: Alexei Starovoitov @ 2017-09-01  2:05 UTC (permalink / raw)
  To: Chenbo Feng
  Cc: Daniel Borkmann, linux-security-module, Jeffrey Vander Stoep,
	netdev, SELinux, lorenzo, Chenbo Feng
In-Reply-To: <20170831205635.80256-3-chenbofeng.kernel@gmail.com>

On Thu, Aug 31, 2017 at 01:56:34PM -0700, Chenbo Feng wrote:
> From: Chenbo Feng <fengc@google.com>
> 
> Introduce a pointer into struct bpf_map to hold the security information
> about the map. The actual security struct varies based on the security
> models implemented. Place the LSM hooks before each of the unrestricted
> eBPF operations, the map_update_elem and map_delete_elem operations are
> checked by security_map_modify. The map_lookup_elem and map_get_next_key
> operations are checked by security_map_read.
> 
> Signed-off-by: Chenbo Feng <fengc@google.com>

...

> @@ -410,6 +418,10 @@ static int map_lookup_elem(union bpf_attr *attr)
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
>  
> +	err = security_map_read(map);
> +	if (err)
> +		return -EACCES;
> +
>  	key = memdup_user(ukey, map->key_size);
>  	if (IS_ERR(key)) {
>  		err = PTR_ERR(key);
> @@ -490,6 +502,10 @@ static int map_update_elem(union bpf_attr *attr)
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
>  
> +	err = security_map_modify(map);

I don't feel these extra hooks are really thought through.
With such hook you'll disallow map_update for given map. That's it.
The key/values etc won't be used in such security decision.
In such case you don't need such hooks in update/lookup at all.
Only in map_creation and object_get calls where FD can be received.
In other words I suggest to follow standard unix practices:
Do permissions checks in open() and allow read/write() if FD is valid.
Same here. Do permission checks in prog_load/map_create/obj_pin/get
and that will be enough to jail bpf subsystem.
bpf cmds that need to be fast (like lookup and update) should not
have security hooks.
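
The unix practice referred to above can be demonstrated from userspace: permissions are evaluated when the descriptor is obtained, and later reads through that descriptor are not re-checked against the file mode. A minimal sketch, assuming a writable /tmp; the function name `read_after_chmod` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns the number of bytes read through a descriptor whose file lost
 * all mode bits after it was opened, or -1 if the setup failed.
 */
static ssize_t read_after_chmod(void)
{
	char path[] = "/tmp/fd_perm_XXXXXX";
	char buf[8];
	ssize_t n = -1;
	int fd = mkstemp(path);   /* the permission check happens here */

	if (fd < 0)
		return -1;

	if (write(fd, "hello", 5) == 5 &&
	    fchmod(fd, 0) == 0 &&                /* revoke every mode bit */
	    lseek(fd, 0, SEEK_SET) == 0)
		n = read(fd, buf, sizeof(buf));  /* fd is still valid */

	close(fd);
	unlink(path);
	return n;
}
```

The read returns the 5 bytes written earlier even though the file mode is now 0, which is the model proposed here for bpf: check at prog_load/map_create/obj_pin/get, then let lookup and update run unchecked on a valid FD.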

^ permalink raw reply

* Re: [PATCH v3 net-next 7/7] samples/bpf: Update cgroup socket examples to use uid gid helper
From: Alexei Starovoitov @ 2017-09-01  1:41 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast
In-Reply-To: <1504217150-16151-8-git-send-email-dsahern@gmail.com>

On Thu, Aug 31, 2017 at 03:05:50PM -0700, David Ahern wrote:
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 net-next 6/7] samples/bpf: Update cgrp2 socket tests
From: Alexei Starovoitov @ 2017-09-01  1:40 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast
In-Reply-To: <1504217150-16151-7-git-send-email-dsahern@gmail.com>

On Thu, Aug 31, 2017 at 03:05:49PM -0700, David Ahern wrote:
> Update cgrp2 bpf sock tests to check that device, mark and priority
> can all be set on a socket via bpf programs attached to a cgroup.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 net-next 5/7] samples/bpf: Add option to dump socket settings
From: Alexei Starovoitov @ 2017-09-01  1:39 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast
In-Reply-To: <1504217150-16151-6-git-send-email-dsahern@gmail.com>

On Thu, Aug 31, 2017 at 03:05:48PM -0700, David Ahern wrote:
> Add option to dump socket settings. Will be used in the next patch
> to verify bpf programs are correctly setting mark, priority and
> device based on the cgroup attachment for the program run.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 net-next 4/7] samples/bpf: Add detach option to test_cgrp2_sock
From: Alexei Starovoitov @ 2017-09-01  1:39 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast
In-Reply-To: <1504217150-16151-5-git-send-email-dsahern@gmail.com>

On Thu, Aug 31, 2017 at 03:05:47PM -0700, David Ahern wrote:
> Add option to detach programs from a cgroup.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH v3 net-next 3/7] samples/bpf: Update sock test to allow setting mark and priority
From: Alexei Starovoitov @ 2017-09-01  1:38 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, daniel, ast
In-Reply-To: <1504217150-16151-4-git-send-email-dsahern@gmail.com>

On Thu, Aug 31, 2017 at 03:05:46PM -0700, David Ahern wrote:
> Update sock test to set mark and priority on socket create.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* [RFC iproute2 2/2] tc: Add support for the CBS qdisc
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Vinicius Costa Gomes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	andre.guedes, ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran
In-Reply-To: <20170901012646.14939-1-vinicius.gomes@intel.com>

The Credit Based Shaper (CBS) queueing discipline allows bandwidth
reservation with sub-millisecond precision. It is defined by the
802.1Q-2014 specification (section 8.6.8.2 and Annex L).

The syntax is:

tc qdisc add dev DEV parent NODE cbs locredit <LOCREDIT> hicredit
<HICREDIT> sendslope <SENDSLOPE> idleslope <IDLESLOPE>

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 tc/Makefile |   1 +
 tc/q_cbs.c  | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)
 create mode 100644 tc/q_cbs.c

diff --git a/tc/Makefile b/tc/Makefile
index a9b4b8e6..f0091217 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -73,6 +73,7 @@ TCMODULES += q_hhf.o
 TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
 TCMODULES += f_matchall.o
+TCMODULES += q_cbs.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_cbs.c b/tc/q_cbs.c
new file mode 100644
index 00000000..0120e838
--- /dev/null
+++ b/tc/q_cbs.c
@@ -0,0 +1,134 @@
+/*
+ * q_cbs.c		CBS.
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: ... cbs hicredit BYTES locredit BYTES sendslope BPS idleslope BPS\n");
+}
+
+static void explain1(const char *arg, const char *val)
+{
+	fprintf(stderr, "cbs: illegal value for \"%s\": \"%s\"\n", arg, val);
+}
+
+static int cbs_parse_opt(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n)
+{
+	int ok = 0;
+	struct tc_cbs_qopt opt = {};
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (matches(*argv, "hicredit") == 0) {
+			NEXT_ARG();
+			if (opt.hicredit) {
+				fprintf(stderr, "cbs: duplicate \"hicredit\" specification\n");
+				return -1;
+			}
+			if (get_s32(&opt.hicredit, *argv, 0)) {
+				explain1("hicredit", *argv);
+				return -1;
+			}
+			ok++;
+		} else if (matches(*argv, "locredit") == 0) {
+			NEXT_ARG();
+			if (opt.locredit) {
+				fprintf(stderr, "cbs: duplicate \"locredit\" specification\n");
+				return -1;
+			}
+			if (get_s32(&opt.locredit, *argv, 0)) {
+				explain1("locredit", *argv);
+				return -1;
+			}
+			ok++;
+		} else if (matches(*argv, "sendslope") == 0) {
+			NEXT_ARG();
+			if (opt.sendslope) {
+				fprintf(stderr, "cbs: duplicate \"sendslope\" specification\n");
+				return -1;
+			}
+			if (get_s32(&opt.sendslope, *argv, 0)) {
+				explain1("sendslope", *argv);
+				return -1;
+			}
+			ok++;
+		} else if (matches(*argv, "idleslope") == 0) {
+			NEXT_ARG();
+			if (opt.idleslope) {
+				fprintf(stderr, "cbs: duplicate \"idleslope\" specification\n");
+				return -1;
+			}
+			if (get_s32(&opt.idleslope, *argv, 0)) {
+				explain1("idleslope", *argv);
+				return -1;
+			}
+			ok++;
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "cbs: unknown parameter \"%s\"\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+	addattr_l(n, 2024, TCA_CBS_PARMS, &opt, sizeof(opt));
+	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
+	return 0;
+}
+
+static int cbs_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_CBS_MAX+1];
+	struct tc_cbs_qopt *qopt;
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_CBS_MAX, opt);
+
+	if (tb[TCA_CBS_PARMS] == NULL)
+		return -1;
+
+	qopt = RTA_DATA(tb[TCA_CBS_PARMS]);
+	if (RTA_PAYLOAD(tb[TCA_CBS_PARMS])  < sizeof(*qopt))
+		return -1;
+
+	fprintf(f, "hicredit %d ", qopt->hicredit);
+	fprintf(f, "locredit %d ", qopt->locredit);
+	fprintf(f, "sendslope %d ", qopt->sendslope);
+	fprintf(f, "idleslope %d ", qopt->idleslope);
+
+	return 0;
+}
+
+struct qdisc_util cbs_qdisc_util = {
+	.id		= "cbs",
+	.parse_qopt	= cbs_parse_opt,
+	.print_qopt	= cbs_print_opt,
+};
-- 
2.14.1

^ permalink raw reply related

* [RFC iproute2 1/2] update headers with CBS API [RFC]
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Vinicius Costa Gomes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	andre.guedes, ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 include/linux/pkt_sched.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 099bf552..ba6c9a54 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -871,4 +871,33 @@ struct tc_pie_xstats {
 	__u32 maxq;             /* maximum queue size */
 	__u32 ecn_mark;         /* packets marked with ecn*/
 };
+
+/* CBS */
+/* FIXME: this is only for usage with ndo_setup_tc(), this should be
+ * in another header someplace else. Is pkt_cls.h the right place?
+ */
+struct tc_cbs_qopt_offload {
+	__u8		enable;
+	__s32		queue;
+	__s32		hicredit;
+	__s32		locredit;
+	__s32		idleslope;
+	__s32		sendslope;
+};
+
+struct tc_cbs_qopt {
+	__s32		hicredit;
+	__s32		locredit;
+	__s32		idleslope;
+	__s32		sendslope;
+};
+
+enum {
+	TCA_CBS_UNSPEC,
+	TCA_CBS_PARMS,
+	__TCA_CBS_MAX,
+};
+
+#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
+
 #endif
-- 
2.14.1

* [RFC net-next 5/5] samples/tsn: Add script for calculating CBS config
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Andre Guedes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>

From: Andre Guedes <andre.guedes@intel.com>

Add a script that takes as input the parameters of the Credit-based
shaper used on FQTSS - link rate, max frame size of best effort
traffic, idleslope and maximum frame size of the time-sensitive
traffic class - for SR classes A and B, and calculates how the CBS
qdisc must be configured for each traffic class.

For example, if you want to have Class A with a bandwidth of 300 Mbps
and Class B of 200 Mbps, and the max frame size of both classes'
traffic is 1500 bytes:

$ ./calculate_cbs_params.py -A 300000 -a 1500 -B 200000 -b 1500

would give you the correct cbs qdisc config command lines to be used.
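
As a rough cross-check of what the script should print for this
invocation, the arithmetic can be redone by hand. The sketch below
assumes the script defaults (link speed of 1000000 kbps, non-SR frame
size of 1500 bytes); the helper name cbs_params is made up for
illustration and is not part of the script.

```python
import math

def cbs_params(link_speed, frame_non_sr, idle_a, frame_a, idle_b, frame_b):
    """Recompute the Class A / Class B values the same way the script does.

    Rates are in kbps, frame sizes in bytes. Hypothetical helper, for
    illustration only.
    """
    send_a = idle_a - link_speed
    send_b = idle_b - link_speed

    # Class A: hiCredit is bounded by one max-size non-SR frame.
    hi_a = math.ceil(idle_a * frame_non_sr / link_speed)
    lo_a = math.ceil(send_a * frame_a / link_speed)

    # Class B: interference is the non-SR frame plus a Class A burst.
    hi_b = math.ceil(idle_b * (frame_non_sr / (link_speed - idle_a)
                               + frame_a / link_speed))
    lo_b = math.ceil(send_b * frame_b / link_speed)

    return {"A": (lo_a, hi_a, send_a, idle_a),
            "B": (lo_b, hi_b, send_b, idle_b)}

params = cbs_params(1000000.0, 1500.0, 300000.0, 1500.0, 200000.0, 1500.0)
for cls, (lo, hi, send, idle) in params.items():
    print("Class %s: locredit %d hicredit %d sendslope %d idleslope %d"
          % (cls, lo, hi, send, idle))
```

With these inputs the sketch yields locredit -1050 / hicredit 450 for
Class A and locredit -1200 / hicredit 729 for Class B.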

This script is just a helper to ease testing of the TSN samples -
talker and listener - and shouldn't be taken as highly accurate.

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 samples/tsn/calculate_cbs_params.py | 112 ++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100755 samples/tsn/calculate_cbs_params.py

diff --git a/samples/tsn/calculate_cbs_params.py b/samples/tsn/calculate_cbs_params.py
new file mode 100755
index 000000000000..9c46210b699f
--- /dev/null
+++ b/samples/tsn/calculate_cbs_params.py
@@ -0,0 +1,112 @@
+#!/usr/bin/env python3
+#
+# Copyright (c) 2017, Intel Corporation
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+#     * Redistributions of source code must retain the above copyright notice,
+#       this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in the
+#       documentation and/or other materials provided with the distribution.
+#     * Neither the name of Intel Corporation nor the names of its contributors
+#       may be used to endorse or promote products derived from this software
+#       without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import argparse
+import math
+import sys
+
+def print_cbs_params_for_class_a(args):
+    idleslope = args.idleslope_a
+    sendslope = idleslope - args.link_speed
+
+    # According to 802.1Q-2014 spec, Annex L, hiCredit and
+    # loCredit for SR class A are calculated following the
+    # equations L-10 and L-12, respectively.
+    hicredit = math.ceil(idleslope * args.frame_non_sr / args.link_speed)
+    locredit = math.ceil(sendslope * args.frame_a / args.link_speed)
+
+    print("Class A --> # tc qdisc replace dev IFACE parent ID cbs " \
+          "locredit %d hicredit %d sendslope %d idleslope %d" % \
+          (locredit, hicredit, sendslope, idleslope))
+
+def print_cbs_params_for_class_b(args):
+    idleslope = args.idleslope_b
+    sendslope = idleslope - args.link_speed
+
+    # Annex L doesn't present a straightforward equation to
+    # calculate hiCredit for Class B so we have to derive it
+    # based on generic equations presented in that Annex.
+    #
+    # L-3 is the primary equation to calculate hiCredit. Section
+    # L.2 states that the 'maxInterferenceSize' for SR class B
+    # is the maximum burst size for SR class A plus the
+    # maxInterferenceSize from SR class A (which is equal to the
+    # maximum frame from non-SR traffic).
+    #
+    # The maximum burst size for SR class A equation is shown in
+    # L-16. Merging L-16 into L-3 we get the resulting equation
+    # which calculates hiCredit B (refer to section L.3 in case
+    # you're not familiar with the legend):
+    #
+    # hiCredit B = Rb * (     Mo         Ma   )
+    #                     ---------- + ------
+    #                      Ro - Ra       Ro
+    #
+    hicredit = math.ceil(idleslope * \
+               ((args.frame_non_sr / (args.link_speed - args.idleslope_a)) + \
+               (args.frame_a / args.link_speed)))
+
+    # loCredit B is calculated following equation L-2.
+    locredit = math.ceil(sendslope * args.frame_b / args.link_speed)
+
+    print("Class B --> # tc qdisc replace dev IFACE parent ID cbs " \
+          "locredit %d hicredit %d sendslope %d idleslope %d" % \
+          (locredit, hicredit, sendslope, idleslope))
+
+def main():
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument('-S', dest='link_speed', default=1000000.0, type=float,
+                        help='Link speed in kbps (default=1000000)')
+    parser.add_argument('-s', dest='frame_non_sr', default=1500.0, type=float,
+                        help='Maximum frame size from non-SR traffic (MTU size'
+                        ' usually, default=1500)')
+    parser.add_argument('-A', dest='idleslope_a', default=0, type=float,
+                        help='Idleslope for SR class A in kbps')
+    parser.add_argument('-a', dest='frame_a', default=0, type=float,
+                        help='Maximum frame size for SR class A traffic')
+    parser.add_argument('-B', dest='idleslope_b', default=0, type=float,
+                        help='Idleslope for SR class B in kbps')
+    parser.add_argument('-b', dest='frame_b', default=0, type=float,
+                        help='Maximum frame size for SR class B traffic')
+
+    args = parser.parse_args()
+
+    if not len(sys.argv) > 1:
+        parser.print_help()
+    else:
+        print("\nConfiguration lines to be used are:")
+
+    if args.idleslope_a > 0:
+        print_cbs_params_for_class_a(args)
+
+    if args.idleslope_b > 0:
+        print_cbs_params_for_class_b(args)
+
+
+if __name__ == "__main__":
+    main()
-- 
2.14.1

* [RFC net-next 4/5] sample: Add TSN Talker and Listener examples
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Jesus Sanchez-Palencia, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, andre.guedes, ivan.briano, boon.leong.ong,
	richardcochran, Vinicius Costa Gomes
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>

From: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>

Add two examples so one can easily test a 'TSN distributed system'
running with standard kernel interfaces. Both 'talker' and 'listener'
sides are provided, and use an AF_PACKET socket for Tx / Rx of frames.

Running the examples is rather simple.
For the talker, just the interface and SO_PRIORITY are expected as
parameters:

$ ./talker -i enp3s0 -p 3

For the listener, only the interface is needed:

$ ./listener -i enp3s0

The multicast MAC address being used is currently hardcoded on both
examples for simplicity.

Note that the listener side uses a BPF filter so only frames sent to
the correct "stream" destination address are received by the socket.
If you modify the address used by the talker, you must also adapt
the BPF filter otherwise no frames will be received by the socket.
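
To make that dependency concrete: the two compare constants in the
listener's filter are the destination MAC split into its first 32 bits
and its last 16 bits, read in network byte order. A minimal sketch of
deriving them for a given address, where bpf_mac_constants is a
hypothetical helper and not part of the samples:

```python
def bpf_mac_constants(mac):
    """Split a 6-byte MAC into the 32-bit and 16-bit values a classic BPF
    filter compares against after loading a word at offset 0 and a
    half-word at offset 4 of the frame."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    first32 = int.from_bytes(mac[:4], "big")
    last16 = int.from_bytes(mac[4:], "big")
    return first32, last16

# Address currently hardcoded in the talker and listener.
mac = bytes([0xBB, 0xAA, 0xBB, 0xAA, 0xBB, 0xAA])
first32, last16 = bpf_mac_constants(mac)
print("0x%08x 0x%04x" % (first32, last16))
```

For the hardcoded address this gives 0xbbaabbaa and 0xbbaa, which are
the values the filter compares against.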

The listener example prints the received data rate once per second.
This makes it easier to verify whether the bandwidth configured for a
traffic class is being respected.

Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: Iván Briano <ivan.briano@intel.com>
---
 samples/tsn/listener.c | 254 +++++++++++++++++++++++++++++++++++++++++++++++++
 samples/tsn/talker.c   | 136 ++++++++++++++++++++++++++
 2 files changed, 390 insertions(+)
 create mode 100644 samples/tsn/listener.c
 create mode 100644 samples/tsn/talker.c

diff --git a/samples/tsn/listener.c b/samples/tsn/listener.c
new file mode 100644
index 000000000000..2d17bdfbea99
--- /dev/null
+++ b/samples/tsn/listener.c
@@ -0,0 +1,254 @@
+/*
+ * Copyright (c) 2017, Intel Corporation
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ *     * Redistributions of source code must retain the above copyright notice,
+ *       this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in the
+ *       documentation and/or other materials provided with the distribution.
+ *     * Neither the name of Intel Corporation nor the names of its contributors
+ *       may be used to endorse or promote products derived from this software
+ *       without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <alloca.h>
+#include <argp.h>
+#include <arpa/inet.h>
+#include <inttypes.h>
+#include <linux/filter.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/timerfd.h>
+#include <unistd.h>
+
+#define MAX_FRAME_SIZE 1500
+
+/* XXX: If this address is changed, the BPF filter must be adjusted. */
+static uint8_t multicast_macaddr[] = { 0xBB, 0xAA, 0xBB, 0xAA, 0xBB, 0xAA };
+static char ifname[IFNAMSIZ];
+static uint64_t data_count;
+static int arg_count;
+
+/*
+ * BPF filter so we only receive frames sent to the destination MAC address of
+ * our SRP stream. This is hardcoded in multicast_macaddr[].
+ */
+static struct sock_filter dst_addr_filter[] = {
+	{ 0x20,  0,  0, 0000000000 }, /* Load DST address: first 32bits only */
+	{ 0x15,  0,  3, 0xbbaabbaa }, /* Compare with first 32bits from MAC */
+	{ 0x28,  0,  0, 0x00000004 }, /* Load DST address: remaining 16bits */
+	{ 0x15,  0,  1, 0x0000bbaa }, /* Compare with last 16bits from MAC */
+	{ 0x06,  0,  0, 0xffffffff },
+	{ 0x06,  0,  0, 0000000000 }, /* Ret 0. Jump here if any mismatches. */
+};
+
+/* BPF program */
+static struct sock_fprog bpf = {
+	.len = 6, /* Number of instructions on BPF filter */
+	.filter = dst_addr_filter,
+};
+
+static struct argp_option options[] = {
+	{"ifname", 'i', "IFNAME", 0, "Network Interface" },
+	{ 0 }
+};
+
+static error_t parser(int key, char *arg, struct argp_state *s)
+{
+	switch (key) {
+	case 'i':
+		strncpy(ifname, arg, sizeof(ifname) - 1);
+		arg_count++;
+		break;
+	case ARGP_KEY_END:
+		if (arg_count < 1)
+			argp_failure(s, 1, 0, "Options missing. Check --help");
+		break;
+	}
+
+	return 0;
+}
+
+static struct argp argp = { options, parser };
+
+static int setup_1s_timer(void)
+{
+	struct itimerspec tspec = { 0 };
+	int fd, res;
+
+	fd = timerfd_create(CLOCK_MONOTONIC, 0);
+	if (fd < 0) {
+		perror("Couldn't create timer");
+		return -1;
+	}
+
+	tspec.it_value.tv_sec = 1;
+	tspec.it_interval.tv_sec = 1;
+
+	res = timerfd_settime(fd, 0, &tspec, NULL);
+	if (res < 0) {
+		perror("Couldn't set timer");
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+static int setup_socket(void)
+{
+	struct sockaddr_ll sk_addr = {
+		.sll_family = AF_PACKET,
+		.sll_protocol = htons(ETH_P_TSN),
+	};
+	struct packet_mreq mreq;
+	struct ifreq req;
+	int fd, res;
+
+	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_TSN));
+	if (fd < 0) {
+		perror("Couldn't open socket");
+		return -1;
+	}
+
+	res = setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
+	if (res < 0) {
+		perror("Couldn't attach bpf filter");
+		goto err;
+	}
+
+	strncpy(req.ifr_name, ifname, sizeof(req.ifr_name));
+	res = ioctl(fd, SIOCGIFINDEX, &req);
+	if (res < 0) {
+		perror("Couldn't get interface index");
+		goto err;
+	}
+
+	sk_addr.sll_ifindex = req.ifr_ifindex;
+	res = bind(fd, (struct sockaddr *) &sk_addr, sizeof(sk_addr));
+	if (res < 0) {
+		perror("Couldn't bind() to interface");
+		goto err;
+	}
+
+	/* Use PACKET_ADD_MEMBERSHIP to add a binding to the Multicast Addr */
+	mreq.mr_ifindex = sk_addr.sll_ifindex;
+	mreq.mr_type = PACKET_MR_MULTICAST;
+	mreq.mr_alen = ETH_ALEN;
+	memcpy(&mreq.mr_address, multicast_macaddr, ETH_ALEN);
+
+	res = setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP,
+				&mreq, sizeof(struct packet_mreq));
+	if (res < 0) {
+		perror("Couldn't set PACKET_ADD_MEMBERSHIP");
+		goto err;
+	}
+
+	return fd;
+
+err:
+	close(fd);
+	return -1;
+}
+
+static void recv_packet(int fd)
+{
+	uint8_t *data = alloca(MAX_FRAME_SIZE);
+	ssize_t n = recv(fd, data, MAX_FRAME_SIZE, 0);
+
+	if (n < 0) {
+		perror("Failed to receive data");
+		return;
+	}
+
+	if (n != MAX_FRAME_SIZE)
+		printf("Size mismatch: expected %d, got %zd\n",
+		       MAX_FRAME_SIZE, n);
+
+	data_count += n;
+}
+
+static void report_bw(int fd)
+{
+	uint64_t expirations;
+	ssize_t n = read(fd, &expirations, sizeof(uint64_t));
+
+	if (n < 0) {
+		perror("Couldn't read timerfd");
+		return;
+	}
+
+	if (expirations != 1)
+		printf("Something went wrong with timerfd\n");
+
+	/* Report how much data was received in 1s. */
+	printf("Data rate: %" PRIu64 " kbps\n", (data_count * 8) / 1000);
+
+	data_count = 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int sk_fd, timer_fd, res;
+	struct pollfd fds[2];
+
+	argp_parse(&argp, argc, argv, 0, NULL, NULL);
+
+	sk_fd = setup_socket();
+	if (sk_fd < 0)
+		return 1;
+
+	timer_fd = setup_1s_timer();
+	if (timer_fd < 0) {
+		close(sk_fd);
+		return 1;
+	}
+
+	fds[0].fd = sk_fd;
+	fds[0].events = POLLIN;
+	fds[1].fd = timer_fd;
+	fds[1].events = POLLIN;
+
+	printf("Waiting for packets...\n");
+
+	while (1) {
+		res = poll(fds, 2, -1);
+		if (res < 0) {
+			perror("Error on poll()");
+			goto err;
+		}
+
+		if (fds[0].revents & POLLIN)
+			recv_packet(fds[0].fd);
+
+		if (fds[1].revents & POLLIN)
+			report_bw(fds[1].fd);
+	}
+
+err:
+	close(timer_fd);
+	close(sk_fd);
+	return 1;
+}
diff --git a/samples/tsn/talker.c b/samples/tsn/talker.c
new file mode 100644
index 000000000000..35e6f99b48f6
--- /dev/null
+++ b/samples/tsn/talker.c
@@ -0,0 +1,136 @@
+/*
+ * Copyright (c) 2017, Intel Corporation
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ *     * Redistributions of source code must retain the above copyright notice,
+ *       this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in the
+ *       documentation and/or other materials provided with the distribution.
+ *     * Neither the name of Intel Corporation nor the names of its contributors
+ *       may be used to endorse or promote products derived from this software
+ *       without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
+ * COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <alloca.h>
+#include <argp.h>
+#include <arpa/inet.h>
+#include <inttypes.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <unistd.h>
+
+#define MAX_FRAME_SIZE 1500
+
+static uint8_t multicast_macaddr[] = { 0xBB, 0xAA, 0xBB, 0xAA, 0xBB, 0xAA };
+static char ifname[IFNAMSIZ];
+static int prio = -1;
+static int arg_count;
+
+static struct argp_option options[] = {
+	{"ifname", 'i', "IFNAME", 0, "Network Interface" },
+	{"prio", 'p', "NUM", 0, "SO_PRIORITY to be set in socket" },
+	{ 0 }
+};
+
+static error_t parser(int key, char *arg, struct argp_state *s)
+{
+	switch (key) {
+	case 'i':
+		strncpy(ifname, arg, sizeof(ifname) - 1);
+		arg_count++;
+		break;
+	case 'p':
+		prio = atoi(arg);
+		if (prio < 0)
+			argp_failure(s, 1, 0, "Priority must be >=0\n");
+		arg_count++;
+		break;
+	case ARGP_KEY_END:
+		if (arg_count < 2)
+			argp_failure(s, 1, 0,
+				     "Options missing. Check --help\n");
+		break;
+	}
+
+	return 0;
+}
+
+static struct argp argp = { options, parser };
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_ll dst_ll_addr = {
+		.sll_family = AF_PACKET,
+		.sll_protocol = htons(ETH_P_TSN),
+		.sll_halen = ETH_ALEN,
+	};
+	struct ifreq req;
+	uint8_t *payload;
+	int fd, res;
+
+	argp_parse(&argp, argc, argv, 0, NULL, NULL);
+
+	fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_TSN));
+	if (fd < 0) {
+		perror("Couldn't open socket");
+		return 1;
+	}
+
+	strncpy(req.ifr_name, ifname, sizeof(req.ifr_name));
+	res = ioctl(fd, SIOCGIFINDEX, &req);
+	if (res < 0) {
+		perror("Couldn't get interface index");
+		goto err;
+	}
+
+	dst_ll_addr.sll_ifindex = req.ifr_ifindex;
+	memcpy(&dst_ll_addr.sll_addr, multicast_macaddr, ETH_ALEN);
+
+	res = setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));
+	if (res < 0) {
+		perror("Couldn't set priority");
+		goto err;
+	}
+
+	payload = alloca(MAX_FRAME_SIZE);
+	memset(payload, 0xBE, MAX_FRAME_SIZE);
+
+	printf("Sending packets...\n");
+
+	while (1) {
+		ssize_t n = sendto(fd, payload, MAX_FRAME_SIZE, 0,
+				(struct sockaddr *) &dst_ll_addr,
+				sizeof(dst_ll_addr));
+
+		if (n < 0)
+			perror("Failed to send data");
+
+		/* Sleep for 500us to avoid starvation from a 20Mbps stream. */
+		usleep(500);
+	}
+
+err:
+	close(fd);
+	return 1;
+}
-- 
2.14.1

* [RFC net-next 2/5] net/sched: Introduce Credit Based Shaper (CBS) qdisc
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Vinicius Costa Gomes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	andre.guedes, ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>

This queueing discipline implements the shaper algorithm defined by
the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.

Its primary use is to apply bandwidth reservation to user-defined
traffic classes, which are mapped to different queues via the
mqprio qdisc.
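
Because of the mqprio dependency, the hardware queue is derived from
the minor number of the parent class handle. A small sketch of that
arithmetic, assuming mqprio numbers its classes so that the first
num_tc minors are traffic classes and the per-queue classes follow;
the helper names are illustrative only:

```python
def tc_h_min(handle):
    """Minor part of a 32-bit tc handle (mirrors the TC_H_MIN macro)."""
    return handle & 0xFFFF

def cbs_queue_index(parent_handle, num_tc):
    """Hardware queue a CBS instance targets when attached under mqprio."""
    return tc_h_min(parent_handle) - 1 - num_tc

# Parent "100:5" on a device configured with 3 traffic classes.
parent = (0x100 << 16) | 0x5
print(cbs_queue_index(parent, 3))
```

Under that assumption, parent 100:5 with 3 traffic classes maps to
hardware queue 1.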

Initially, it only supports offloading the traffic shaping work to
supporting controllers.

Later, when a software implementation is added, the current dependency
on being installed "under" mqprio can be lifted.
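
For readers unfamiliar with the algorithm, its credit accounting can be
sketched in a few lines. This is only an illustration of the behaviour
under simplifying assumptions (a single always-backlogged queue, no
interfering traffic, slopes in kbit/s); it is not the code in this
patch, which only offloads to hardware, and simulate_cbs is a made-up
name.

```python
def simulate_cbs(link_kbps, idleslope_kbps, frame_bytes, n_frames):
    """Send n_frames back to back through an idealised credit-based shaper.

    Returns (achieved_kbps, lowest_credit_bytes). Assumes the queue is
    always backlogged and nothing else competes for the link, so credit
    never grows above zero.
    """
    sendslope_kbps = idleslope_kbps - link_kbps
    frame_bits = frame_bytes * 8
    credit_bits = 0.0
    lowest_bits = 0.0
    now = 0.0  # seconds

    for _ in range(n_frames):
        if credit_bits < 0:
            # Wait for credit to climb back to zero at idleslope.
            now += -credit_bits / (idleslope_kbps * 1000.0)
            credit_bits = 0.0
        # Transmit: credit drains at sendslope for the frame's wire time.
        tx_time = frame_bits / (link_kbps * 1000.0)
        credit_bits += sendslope_kbps * 1000.0 * tx_time
        now += tx_time
        lowest_bits = min(lowest_bits, credit_bits)

    # Include the final recovery so the rate covers whole cycles.
    now += -credit_bits / (idleslope_kbps * 1000.0)
    achieved_kbps = n_frames * frame_bits / now / 1000.0
    return achieved_kbps, lowest_bits / 8.0

rate, lowest = simulate_cbs(1000000.0, 300000.0, 1500, 1000)
print("achieved %.0f kbps, lowest credit %.0f bytes" % (rate, lowest))
```

With a 1 Gbps link, an idleslope of 300000 kbps and 1500-byte frames,
the long-run rate settles at roughly the idleslope and the credit dips
to about -1050 bytes after each frame.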

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 include/linux/netdevice.h |   1 +
 net/sched/Kconfig         |  11 ++
 net/sched/Makefile        |   1 +
 net/sched/sch_cbs.c       | 286 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 299 insertions(+)
 create mode 100644 net/sched/sch_cbs.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 35de8312e0b5..dd9a2ecd0c03 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -775,6 +775,7 @@ enum tc_setup_type {
 	TC_SETUP_CLSFLOWER,
 	TC_SETUP_CLSMATCHALL,
 	TC_SETUP_CLSBPF,
+	TC_SETUP_CBS,
 };
 
 /* These structures hold the attributes of xdp state that are being passed
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index e70ed26485a2..c03d86a7775e 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -172,6 +172,17 @@ config NET_SCH_TBF
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_tbf.
 
+config NET_SCH_CBS
+	tristate "Credit Based Shaper (CBS)"
+	---help---
+	  Say Y here if you want to use the Credit Based Shaper (CBS) packet
+	  scheduling algorithm.
+
+	  See the top of <file:net/sched/sch_cbs.c> for more details.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_cbs.
+
 config NET_SCH_GRED
 	tristate "Generic Random Early Detection (GRED)"
 	---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 7b915d226de7..80c8f92d162d 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
 obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)	+= sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
+obj-$(CONFIG_NET_SCH_CBS)	+= sch_cbs.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
new file mode 100644
index 000000000000..1c86a9e14150
--- /dev/null
+++ b/net/sched/sch_cbs.c
@@ -0,0 +1,286 @@
+/*
+ * net/sched/sch_cbs.c	Credit Based Shaper
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Vinicius Costa Gomes <vinicius.gomes@intel.com>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+
+struct cbs_sched_data {
+	struct Qdisc *qdisc; /* Inner qdisc, default - pfifo queue */
+	s32 queue;
+	s32 locredit;
+	s32 hicredit;
+	s32 sendslope;
+	s32 idleslope;
+};
+
+static int cbs_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+		       struct sk_buff **to_free)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+	int ret;
+
+	ret = qdisc_enqueue(skb, q->qdisc, to_free);
+	if (ret != NET_XMIT_SUCCESS) {
+		if (net_xmit_drop_count(ret))
+			qdisc_qstats_drop(sch);
+		return ret;
+	}
+
+	qdisc_qstats_backlog_inc(sch, skb);
+	sch->q.qlen++;
+	return NET_XMIT_SUCCESS;
+}
+
+static struct sk_buff *cbs_dequeue(struct Qdisc *sch)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	skb = q->qdisc->ops->peek(q->qdisc);
+	if (skb) {
+		skb = qdisc_dequeue_peeked(q->qdisc);
+		if (unlikely(!skb))
+			return NULL;
+
+		qdisc_qstats_backlog_dec(sch, skb);
+		sch->q.qlen--;
+		qdisc_bstats_update(sch, skb);
+
+		return skb;
+	}
+	return NULL;
+}
+
+static void cbs_reset(struct Qdisc *sch)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+
+	qdisc_reset(q->qdisc);
+}
+
+static const struct nla_policy cbs_policy[TCA_CBS_MAX + 1] = {
+	[TCA_CBS_PARMS]	= { .len = sizeof(struct tc_cbs_qopt) },
+};
+
+static int cbs_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+	struct tc_cbs_qopt_offload cbs = { };
+	struct nlattr *tb[TCA_CBS_MAX + 1];
+	const struct net_device_ops *ops;
+	struct tc_cbs_qopt *qopt;
+	struct net_device *dev;
+	int err;
+
+	err = nla_parse_nested(tb, TCA_CBS_MAX, opt, cbs_policy, NULL);
+	if (err < 0)
+		return err;
+
+	err = -EINVAL;
+	if (!tb[TCA_CBS_PARMS])
+		goto done;
+
+	qopt = nla_data(tb[TCA_CBS_PARMS]);
+
+	dev = qdisc_dev(sch);
+	ops = dev->netdev_ops;
+
+	/* FIXME: this means that we can only install this qdisc
+	 * "under" mqprio. Do we need a more generic way to retrieve
+	 * the queue, or do we pass the netdev_queue to the driver?
+	 */
+	cbs.queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
+
+	cbs.enable = 1;
+	cbs.hicredit = qopt->hicredit;
+	cbs.locredit = qopt->locredit;
+	cbs.idleslope = qopt->idleslope;
+	cbs.sendslope = qopt->sendslope;
+
+	err = -ENOTSUPP;
+	if (!ops->ndo_setup_tc)
+		goto done;
+
+	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_CBS, &cbs);
+	if (err < 0)
+		goto done;
+
+	q->queue = cbs.queue;
+	q->hicredit = cbs.hicredit;
+	q->locredit = cbs.locredit;
+	q->idleslope = cbs.idleslope;
+	q->sendslope = cbs.sendslope;
+
+done:
+	return err;
+}
+
+static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+
+	if (!opt)
+		return -EINVAL;
+
+	q->qdisc = fifo_create_dflt(sch, &pfifo_qdisc_ops, 1024);
+	qdisc_hash_add(q->qdisc, true);
+
+	return cbs_change(sch, opt);
+}
+
+static void cbs_destroy(struct Qdisc *sch)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+	struct tc_cbs_qopt_offload cbs = { };
+	struct net_device *dev;
+	int err;
+
+	q->hicredit = 0;
+	q->locredit = 0;
+	q->idleslope = 0;
+	q->sendslope = 0;
+
+	dev = qdisc_dev(sch);
+
+	cbs.queue = q->queue;
+	cbs.enable = 0;
+
+	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_CBS, &cbs);
+	if (err < 0)
+		pr_warn("Couldn't reset queue %d to default values\n",
+			cbs.queue);
+
+	qdisc_destroy(q->qdisc);
+}
+
+static int cbs_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+	struct nlattr *nest;
+	struct tc_cbs_qopt opt;
+
+	sch->qstats.backlog = q->qdisc->qstats.backlog;
+	nest = nla_nest_start(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+
+	opt.hicredit = q->hicredit;
+	opt.locredit = q->locredit;
+	opt.sendslope = q->sendslope;
+	opt.idleslope = q->idleslope;
+
+	if (nla_put(skb, TCA_CBS_PARMS, sizeof(opt), &opt))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, nest);
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -1;
+}
+
+static int cbs_dump_class(struct Qdisc *sch, unsigned long cl,
+			  struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+
+	tcm->tcm_handle |= TC_H_MIN(1);
+	tcm->tcm_info = q->qdisc->handle;
+
+	return 0;
+}
+
+static int cbs_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new,
+		     struct Qdisc **old)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+
+	if (!new)
+		new = &noop_qdisc;
+
+	*old = qdisc_replace(sch, new, &q->qdisc);
+	return 0;
+}
+
+static struct Qdisc *cbs_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct cbs_sched_data *q = qdisc_priv(sch);
+
+	return q->qdisc;
+}
+
+static unsigned long cbs_find(struct Qdisc *sch, u32 classid)
+{
+	return 1;
+}
+
+static int cbs_delete(struct Qdisc *sch, unsigned long arg)
+{
+	return 0;
+}
+
+static void cbs_walk(struct Qdisc *sch, struct qdisc_walker *walker)
+{
+	if (!walker->stop) {
+		if (walker->count >= walker->skip)
+			if (walker->fn(sch, 1, walker) < 0) {
+				walker->stop = 1;
+				return;
+			}
+		walker->count++;
+	}
+}
+
+static const struct Qdisc_class_ops cbs_class_ops = {
+	.graft		=	cbs_graft,
+	.leaf		=	cbs_leaf,
+	.find		=	cbs_find,
+	.delete		=	cbs_delete,
+	.walk		=	cbs_walk,
+	.dump		=	cbs_dump_class,
+};
+
+static struct Qdisc_ops cbs_qdisc_ops __read_mostly = {
+	.next		=	NULL,
+	.cl_ops		=	&cbs_class_ops,
+	.id		=	"cbs",
+	.priv_size	=	sizeof(struct cbs_sched_data),
+	.enqueue	=	cbs_enqueue,
+	.dequeue	=	cbs_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	cbs_init,
+	.reset		=	cbs_reset,
+	.destroy	=	cbs_destroy,
+	.change		=	cbs_change,
+	.dump		=	cbs_dump,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init cbs_module_init(void)
+{
+	return register_qdisc(&cbs_qdisc_ops);
+}
+
+static void __exit cbs_module_exit(void)
+{
+	unregister_qdisc(&cbs_qdisc_ops);
+}
+module_init(cbs_module_init)
+module_exit(cbs_module_exit)
+MODULE_LICENSE("GPL");
-- 
2.14.1

* [RFC net-next 3/5] igb: Add support for CBS offload
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Andre Guedes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>

From: Andre Guedes <andre.guedes@intel.com>

This patch adds support for offloading the Credit-Based Shaper (CBS)
qdisc from the Traffic Control system. This support enables us to
leverage the Forwarding and Queuing for Time-Sensitive Streams (FQTSS)
features of the Intel i210 Ethernet Controller. FQTSS is the former
802.1Qav standard, which was merged into 802.1Q in 2014. It enables
traffic prioritization and bandwidth reservation via the Credit-Based
Shaper, which is implemented in hardware by the i210 controller.

The patch introduces the igb_setup_tc() function which implements the
support for CBS qdisc hardware offload in the IGB driver. CBS offload
is the only traffic control offload supported by the driver at the
moment.

The FQTSS transmission mode of the i210 controller is automatically
enabled by the IGB driver when CBS is enabled for the first hardware
queue. Likewise, FQTSS mode is automatically disabled when CBS is
disabled for the last hardware queue. Changing the FQTSS mode requires
a NIC reset.

The FQTSS feature is supported by the i210 controller only.
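
The enable/disable rule above amounts to tracking which hardware queues
currently have CBS enabled. A toy model of that bookkeeping, with all
names invented for illustration (the driver itself is C and is not
necessarily structured this way):

```python
class FqtssModel:
    """Toy model: FQTSS mode follows the set of CBS-enabled queues."""

    def __init__(self):
        self.cbs_queues = set()
        self.fqtss_enabled = False
        self.resets = 0  # how many times a mode change forced a reset

    def set_cbs(self, queue, enable):
        if enable:
            self.cbs_queues.add(queue)
        else:
            self.cbs_queues.discard(queue)

        wanted = bool(self.cbs_queues)
        if wanted != self.fqtss_enabled:
            # Mode only flips on the first enable and the last disable.
            self.fqtss_enabled = wanted
            self.resets += 1

nic = FqtssModel()
nic.set_cbs(0, True)    # first queue: FQTSS turns on
nic.set_cbs(1, True)    # second queue: no mode change
nic.set_cbs(0, False)   # one queue left: still on
nic.set_cbs(1, False)   # last queue: FQTSS turns off
print(nic.fqtss_enabled, nic.resets)
```

In this model only the first enable and the last disable change the
mode, so the sequence above costs two resets rather than four.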

Signed-off-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
---
 drivers/net/ethernet/intel/igb/e1000_defines.h |  23 ++
 drivers/net/ethernet/intel/igb/e1000_regs.h    |   8 +
 drivers/net/ethernet/intel/igb/igb.h           |   6 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 349 +++++++++++++++++++++++++
 4 files changed, 386 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/e1000_defines.h b/drivers/net/ethernet/intel/igb/e1000_defines.h
index 1de82f247312..83cabff1e0ab 100644
--- a/drivers/net/ethernet/intel/igb/e1000_defines.h
+++ b/drivers/net/ethernet/intel/igb/e1000_defines.h
@@ -353,7 +353,18 @@
 #define E1000_RXPBS_CFG_TS_EN           0x80000000
 
 #define I210_RXPBSIZE_DEFAULT		0x000000A2 /* RXPBSIZE default */
+#define I210_RXPBSIZE_MASK		0x0000003F
+#define I210_RXPBSIZE_PB_32KB		0x00000020
 #define I210_TXPBSIZE_DEFAULT		0x04000014 /* TXPBSIZE default */
+#define I210_TXPBSIZE_MASK		0xC0FFFFFF
+#define I210_TXPBSIZE_PB0_8KB		(8 << 0)
+#define I210_TXPBSIZE_PB1_8KB		(8 << 6)
+#define I210_TXPBSIZE_PB2_4KB		(4 << 12)
+#define I210_TXPBSIZE_PB3_4KB		(4 << 18)
+
+#define I210_DTXMXPKTSZ_DEFAULT		0x00000098
+
+#define I210_SR_QUEUES_NUM		2
 
 /* SerDes Control */
 #define E1000_SCTL_DISABLE_SERDES_LOOPBACK 0x0400
@@ -1051,4 +1062,16 @@
 #define E1000_VLAPQF_P_VALID(_n)	(0x1 << (3 + (_n) * 4))
 #define E1000_VLAPQF_QUEUE_MASK	0x03
 
+/* TX Qav Control fields */
+#define E1000_TQAVCTRL_XMIT_MODE	BIT(0)
+#define E1000_TQAVCTRL_DATAFETCHARB	BIT(4)
+#define E1000_TQAVCTRL_DATATRANARB	BIT(8)
+
+/* TX Qav Credit Control fields */
+#define E1000_TQAVCC_IDLESLOPE_MASK	0xFFFF
+#define E1000_TQAVCC_QUEUEMODE		BIT(31)
+
+/* Transmit Descriptor Control fields */
+#define E1000_TXDCTL_PRIORITY		BIT(27)
+
 #endif
diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h b/drivers/net/ethernet/intel/igb/e1000_regs.h
index 58adbf234e07..8eee081d395f 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -421,6 +421,14 @@ do { \
 
 #define E1000_I210_FLA		0x1201C
 
+#define E1000_I210_DTXMXPKTSZ	0x355C
+
+#define E1000_I210_TXDCTL(_n)	(0x0E028 + ((_n) * 0x40))
+
+#define E1000_I210_TQAVCTRL	0x3570
+#define E1000_I210_TQAVCC(_n)	(0x3004 + ((_n) * 0x40))
+#define E1000_I210_TQAVHC(_n)	(0x300C + ((_n) * 0x40))
+
 #define E1000_INVM_DATA_REG(_n)	(0x12120 + 4*(_n))
 #define E1000_INVM_SIZE		64 /* Number of INVM Data Registers */
 
diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h
index 06ffb2bc713e..92845692087a 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -281,6 +281,11 @@ struct igb_ring {
 	u16 count;			/* number of desc. in the ring */
 	u8 queue_index;			/* logical index of the ring*/
 	u8 reg_idx;			/* physical index of the ring */
+	bool cbs_enable;		/* indicates if CBS is enabled */
+	s32 idleslope;			/* idleSlope in kbps */
+	s32 sendslope;			/* sendSlope in kbps */
+	s32 hicredit;			/* hiCredit in bytes */
+	s32 locredit;			/* loCredit in bytes */
 
 	/* everything past this point are written often */
 	u16 next_to_clean;
@@ -621,6 +626,7 @@ struct igb_adapter {
 #define IGB_FLAG_EEE			BIT(14)
 #define IGB_FLAG_VLAN_PROMISC		BIT(15)
 #define IGB_FLAG_RX_LEGACY		BIT(16)
+#define IGB_FLAG_FQTSS			BIT(17)
 
 /* Media Auto Sense */
 #define IGB_MAS_ENABLE_0		0X0001
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index fd4a46b03cc8..47cabca0c99a 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -62,6 +62,17 @@
 #define BUILD 0
 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) "." \
 __stringify(BUILD) "-k"
+
+enum queue_mode {
+	QUEUE_MODE_STRICT_PRIORITY,
+	QUEUE_MODE_STREAM_RESERVATION,
+};
+
+enum tx_queue_prio {
+	TX_QUEUE_PRIO_HIGH,
+	TX_QUEUE_PRIO_LOW,
+};
+
 char igb_driver_name[] = "igb";
 char igb_driver_version[] = DRV_VERSION;
 static const char igb_driver_string[] =
@@ -1271,6 +1282,12 @@ static int igb_alloc_q_vector(struct igb_adapter *adapter,
 		ring->count = adapter->tx_ring_count;
 		ring->queue_index = txr_idx;
 
+		ring->cbs_enable = false;
+		ring->idleslope = 0;
+		ring->sendslope = 0;
+		ring->hicredit = 0;
+		ring->locredit = 0;
+
 		u64_stats_init(&ring->tx_syncp);
 		u64_stats_init(&ring->tx_syncp2);
 
@@ -1598,6 +1615,292 @@ static void igb_get_hw_control(struct igb_adapter *adapter)
 			ctrl_ext | E1000_CTRL_EXT_DRV_LOAD);
 }
 
+static void enable_fqtss(struct igb_adapter *adapter, bool enable)
+{
+	struct net_device *netdev = adapter->netdev;
+
+	if (enable)
+		adapter->flags |= IGB_FLAG_FQTSS;
+	else
+		adapter->flags &= ~IGB_FLAG_FQTSS;
+
+	if (netif_running(netdev))
+		schedule_work(&adapter->reset_task);
+	else
+		igb_reset(adapter);
+}
+
+static bool is_fqtss_enabled(struct igb_adapter *adapter)
+{
+	return (adapter->flags & IGB_FLAG_FQTSS) ? true : false;
+}
+
+static int set_tx_desc_fetch_prio(struct e1000_hw *hw, int queue,
+				  enum tx_queue_prio prio)
+{
+	u32 val;
+
+	WARN_ON(hw->mac.type != e1000_i210);
+
+	if (queue < 0 || queue > 3)
+		return -EINVAL;
+
+	val = rd32(E1000_I210_TXDCTL(queue));
+
+	if (prio == TX_QUEUE_PRIO_HIGH)
+		val |= E1000_TXDCTL_PRIORITY;
+	else
+		val &= ~E1000_TXDCTL_PRIORITY;
+
+	wr32(E1000_I210_TXDCTL(queue), val);
+	return 0;
+}
+
+static int set_queue_mode(struct e1000_hw *hw, int queue, enum queue_mode mode)
+{
+	u32 val;
+
+	WARN_ON(hw->mac.type != e1000_i210);
+
+	/* Stream reservation is only supported for queue 0 and 1. */
+	if (queue < 0 || queue > 1)
+		return -EINVAL;
+
+	val = rd32(E1000_I210_TQAVCC(queue));
+
+	if (mode == QUEUE_MODE_STREAM_RESERVATION)
+		val |= E1000_TQAVCC_QUEUEMODE;
+	else
+		val &= ~E1000_TQAVCC_QUEUEMODE;
+
+	wr32(E1000_I210_TQAVCC(queue), val);
+	return 0;
+}
+
+/**
+ *  igb_configure_cbs - Configure Credit-Based Shaper (CBS)
+ *  @adapter: pointer to adapter struct
+ *  @queue: queue number
+ *  @enable: true = enable CBS, false = disable CBS
+ *  @idleslope: idleSlope in kbps
+ *  @sendslope: sendSlope in kbps
+ *  @hicredit: hiCredit in bytes
+ *  @locredit: loCredit in bytes
+ *
+ *  Configure CBS for a given hardware queue. When disabling, the
+ *  idleslope, sendslope, hicredit and locredit arguments are ignored.
+ *  This function does not return a value.
+ **/
+static void igb_configure_cbs(struct igb_adapter *adapter, int queue,
+			      bool enable, int idleslope, int sendslope,
+			      int hicredit, int locredit)
+{
+	struct net_device *netdev = adapter->netdev;
+	struct e1000_hw *hw = &adapter->hw;
+	u32 tqavcc;
+	u16 value;
+
+	WARN_ON(hw->mac.type != e1000_i210);
+	WARN_ON(queue < 0 || queue > 1);
+	WARN_ON(adapter->num_tx_queues < 2);
+
+	if (enable) {
+		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
+		set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
+
+		/* According to i210 datasheet section 7.2.7.7, we should set
+		 * the 'idleSlope' field from TQAVCC register following the
+		 * equation:
+		 *
+		 * For 100 Mbps link speed:
+		 *
+		 *     value = BW * 0x7735 * 0.2                          (E1)
+		 *
+		 * For 1000Mbps link speed:
+		 *
+		 *     value = BW * 0x7735 * 2                            (E2)
+		 *
+		 * E1 and E2 can be merged into one equation as shown below.
+		 * Note that 'link-speed' is in Mbps.
+		 *
+		 *     value = BW * 0x7735 * 2 * link-speed
+		 *                           --------------               (E3)
+		 *                                1000
+		 *
+		 * 'BW' is the percentage bandwidth out of full link speed
+		 * which can be found with the following equation. Note that
+		 * idleSlope here is the parameter from this function which
+		 * is in kbps.
+		 *
+		 *     BW =     idleSlope
+		 *          -----------------                             (E4)
+		 *          link-speed * 1000
+		 *
+		 * That said, we can come up with a generic equation to
+		 * calculate the value we should set in the TQAVCC register by
+		 * replacing 'BW' in E3 by E4. The resulting equation is:
+		 *
+		 * value =     idleSlope     * 0x7735 * 2 * link-speed
+		 *         -----------------            --------------    (E5)
+		 *         link-speed * 1000                 1000
+		 *
+		 * 'link-speed' appears in both numerator and denominator so
+		 * it cancels out. The final equation is the following:
+		 *
+		 *     value = idleSlope * 61034
+		 *             -----------------                          (E6)
+		 *                  1000000
+		 */
+		value = DIV_ROUND_UP_ULL(idleslope * 61034ULL, 1000000);
+
+		tqavcc = rd32(E1000_I210_TQAVCC(queue));
+		tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
+		tqavcc |= value;
+		wr32(E1000_I210_TQAVCC(queue), tqavcc);
+
+		wr32(E1000_I210_TQAVHC(queue), 0x80000000 + hicredit * 0x7735);
+	} else {
+		set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_LOW);
+		set_queue_mode(hw, queue, QUEUE_MODE_STRICT_PRIORITY);
+
+		/* Set idleSlope to zero. */
+		tqavcc = rd32(E1000_I210_TQAVCC(queue));
+		tqavcc &= ~E1000_TQAVCC_IDLESLOPE_MASK;
+		wr32(E1000_I210_TQAVCC(queue), tqavcc);
+
+		/* Set hiCredit to zero. */
+		wr32(E1000_I210_TQAVHC(queue), 0);
+	}
+
+	/* XXX: In i210 controller the sendSlope and loCredit
+	 * parameters from CBS are not configurable by software so we
+	 * don't do any 'controller configuration' in respect to these
+	 * parameters.
+	 */
+
+	netdev_dbg(netdev, "CBS %s: queue %d idleslope %d sendslope %d "
+		   "hiCredit %d locredit %d\n",
+		   (enable) ? "enabled" : "disabled", queue,
+		   idleslope, sendslope, hicredit, locredit);
+}
+
+static void igb_save_cbs_params(struct igb_adapter *adapter, int queue,
+				bool enable, int idleslope, int sendslope,
+				int hicredit, int locredit)
+{
+	struct e1000_hw *hw = &adapter->hw;
+	struct igb_ring *ring;
+
+	WARN_ON(hw->mac.type != e1000_i210);
+	WARN_ON(queue < 0 || queue > 1);
+	WARN_ON(adapter->num_tx_queues < 2);
+
+	ring = adapter->tx_ring[queue];
+
+	ring->cbs_enable = enable;
+	ring->idleslope = idleslope;
+	ring->sendslope = sendslope;
+	ring->hicredit = hicredit;
+	ring->locredit = locredit;
+}
+
+static bool is_any_cbs_enabled(struct igb_adapter *adapter)
+{
+	struct igb_ring *ring;
+	int i;
+
+	WARN_ON(adapter->num_tx_queues < 2);
+
+	for (i = 0; i < I210_SR_QUEUES_NUM; i++) {
+		ring = adapter->tx_ring[i];
+
+		if (ring->cbs_enable)
+			return true;
+	}
+
+	return false;
+}
+
+static void igb_setup_tx_mode(struct igb_adapter *adapter)
+{
+	struct net_device *netdev = adapter->netdev;
+	struct e1000_hw *hw = &adapter->hw;
+	u32 val;
+	int i;
+
+	/* Only i210 controller supports changing the transmission mode. */
+	if (hw->mac.type != e1000_i210)
+		return;
+
+	if (is_fqtss_enabled(adapter)) {
+		/* Configure TQAVCTRL register: set transmit mode to 'Qav',
+		 * set data fetch arbitration to 'round robin' and set data
+		 * transfer arbitration to 'credit shaper algorithm'.
+		 */
+		val = rd32(E1000_I210_TQAVCTRL);
+		val |= E1000_TQAVCTRL_XMIT_MODE | E1000_TQAVCTRL_DATATRANARB;
+		val &= ~E1000_TQAVCTRL_DATAFETCHARB;
+		wr32(E1000_I210_TQAVCTRL, val);
+
+		/* Configure Tx and Rx packet buffers sizes as described in
+		 * i210 datasheet section 7.2.7.7.
+		 */
+		val = rd32(E1000_TXPBS);
+		val &= ~I210_TXPBSIZE_MASK;
+		val |= I210_TXPBSIZE_PB0_8KB | I210_TXPBSIZE_PB1_8KB |
+			I210_TXPBSIZE_PB2_4KB | I210_TXPBSIZE_PB3_4KB;
+		wr32(E1000_TXPBS, val);
+
+		val = rd32(E1000_RXPBS);
+		val &= ~I210_RXPBSIZE_MASK;
+		val |= I210_RXPBSIZE_PB_32KB;
+		wr32(E1000_RXPBS, val);
+
+		/* Section 8.12.9 states that MAX_TPKT_SIZE from DTXMXPKTSZ
+		 * register should not exceed the buffer size programmed in
+		 * TXPBS. The smallest buffer size programmed in TXPBS is 4kB
+		 * so according to the datasheet we should set MAX_TPKT_SIZE to
+		 * 4kB / 64.
+		 *
+		 * However, when we do so, no frames from queues 2 and 3 are
+		 * transmitted.  It seems MAX_TPKT_SIZE should not be greater
+		 * than or _equal_ to the buffer size programmed in TXPBS. For
+		 * this reason, we set MAX_TPKT_SIZE to (4kB - 1) / 64.
+		 */
+		val = (4096 - 1) / 64;
+		wr32(E1000_I210_DTXMXPKTSZ, val);
+
+		/* Since FQTSS mode is enabled, apply any CBS configuration
+		 * previously set. If no previous CBS configuration has been
+		 * done, then the initial configuration is applied, which means
+		 * CBS is disabled. CBS configuration is only supported by SR
+		 * queues i.e. queue 0 and queue 1.
+		 */
+		for (i = 0; i < I210_SR_QUEUES_NUM; i++) {
+			struct igb_ring *ring = adapter->tx_ring[i];
+
+			igb_configure_cbs(adapter, i, ring->cbs_enable,
+					  ring->idleslope, ring->sendslope,
+					  ring->hicredit, ring->locredit);
+		}
+	} else {
+		wr32(E1000_RXPBS, I210_RXPBSIZE_DEFAULT);
+		wr32(E1000_TXPBS, I210_TXPBSIZE_DEFAULT);
+		wr32(E1000_I210_DTXMXPKTSZ, I210_DTXMXPKTSZ_DEFAULT);
+
+		val = rd32(E1000_I210_TQAVCTRL);
+		/* According to Section 8.12.21, the other flags we've set when
+		 * enabling FQTSS are not relevant when disabling FQTSS so we
+		 * don't change them here.
+		 */
+		val &= ~E1000_TQAVCTRL_XMIT_MODE;
+		wr32(E1000_I210_TQAVCTRL, val);
+	}
+
+	netdev_dbg(netdev, "FQTSS %s\n", (is_fqtss_enabled(adapter)) ?
+		   "enabled" : "disabled");
+}
+
 /**
  *  igb_configure - configure the hardware for RX and TX
  *  @adapter: private board structure
@@ -1609,6 +1912,7 @@ static void igb_configure(struct igb_adapter *adapter)
 
 	igb_get_hw_control(adapter);
 	igb_set_rx_mode(netdev);
+	igb_setup_tx_mode(adapter);
 
 	igb_restore_vlan(adapter);
 
@@ -2150,6 +2454,50 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	return features;
 }
 
+static int igb_setup_tc(struct net_device *dev, enum tc_setup_type type,
+			void *type_data)
+{
+	struct igb_adapter *adapter = netdev_priv(dev);
+	struct e1000_hw *hw = &adapter->hw;
+	struct tc_cbs_qopt_offload *cbs;
+
+	if (hw->mac.type != e1000_i210)
+		return -ENOTSUPP;
+
+	if (type != TC_SETUP_CBS)
+		return -ENOTSUPP;
+
+	/* In order to support FQTSS feature, we must have at least 2 Tx
+	 * queues enabled.
+	 */
+	if (adapter->num_tx_queues < 2)
+		return -ENOTSUPP;
+
+	cbs = type_data;
+
+	/* Only queues 0 and 1 support CBS configuration. */
+	if (cbs->queue < 0 || cbs->queue > 1)
+		return -EINVAL;
+
+	igb_save_cbs_params(adapter, cbs->queue, cbs->enable,
+			    cbs->idleslope, cbs->sendslope,
+			    cbs->hicredit, cbs->locredit);
+
+	if (is_fqtss_enabled(adapter)) {
+		igb_configure_cbs(adapter, cbs->queue, cbs->enable,
+				  cbs->idleslope, cbs->sendslope,
+				  cbs->hicredit, cbs->locredit);
+
+		if (!is_any_cbs_enabled(adapter))
+			enable_fqtss(adapter, false);
+
+	} else {
+		enable_fqtss(adapter, true);
+	}
+
+	return 0;
+}
+
 static const struct net_device_ops igb_netdev_ops = {
 	.ndo_open		= igb_open,
 	.ndo_stop		= igb_close,
@@ -2175,6 +2523,7 @@ static const struct net_device_ops igb_netdev_ops = {
 	.ndo_set_features	= igb_set_features,
 	.ndo_fdb_add		= igb_ndo_fdb_add,
 	.ndo_features_check	= igb_features_check,
+	.ndo_setup_tc		= igb_setup_tc,
 };
 
 /**
-- 
2.14.1
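
As a quick check of equation E6 in igb_configure_cbs() above, the
following Python sketch reproduces the conversion from an idleSlope in
kbps to the 16-bit TQAVCC field. It is only an illustration, not driver
code: the function name is hypothetical, and the range check is an
addition of this sketch that the patch itself does not perform. The
constant 0x7735 and the 0xFFFF mask are taken from the patch.

```python
# Sketch of equation E6 from igb_configure_cbs(); not driver code.
# tqavcc_idleslope() is a hypothetical helper name used for illustration.

IDLESLOPE_MASK = 0xFFFF   # E1000_TQAVCC_IDLESLOPE_MASK in the patch
SCALE = 0x7735 * 2        # 61034, the factor that appears in E6


def tqavcc_idleslope(idleslope_kbps: int) -> int:
    """Return the TQAVCC idleSlope field for an idleSlope given in kbps."""
    # Integer ceiling division, mirroring DIV_ROUND_UP_ULL(x, 1000000).
    value = (idleslope_kbps * SCALE + 999_999) // 1_000_000
    # Assumption of this sketch: reject values that overflow the field.
    if not 0 <= value <= IDLESLOPE_MASK:
        raise ValueError("idleSlope does not fit the 16-bit field")
    return value


if __name__ == "__main__":
    for kbps in (10_000, 20_000, 1_000_000):
        print(kbps, tqavcc_idleslope(kbps))
```

Note that an idleSlope equal to the full 1 Gbps link rate maps to
61034, which still fits in the 16-bit field.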


* [RFC net-next 1/5] net/sched: Introduce the user API for the CBS shaper
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Vinicius Costa Gomes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	andre.guedes, ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran
In-Reply-To: <20170901012625.14838-1-vinicius.gomes@intel.com>

Export the API necessary for configuring the CBS shaper (implemented
in the next patch) via the tc tool.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
 include/uapi/linux/pkt_sched.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 099bf5528fed..aa4a3e5421be 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -871,4 +871,33 @@ struct tc_pie_xstats {
 	__u32 maxq;             /* maximum queue size */
 	__u32 ecn_mark;         /* packets marked with ecn*/
 };
+
+/* CBS */
+/* FIXME: this is only for usage with ndo_setup_tc(), this should be
+ * in another header someplace else. Is pkt_cls.h the right place?
+ */
+struct tc_cbs_qopt_offload {
+	u8		enable;
+	s32		queue;
+	s32		hicredit;
+	s32		locredit;
+	s32		idleslope;
+	s32		sendslope;
+};
+
+struct tc_cbs_qopt {
+	__s32		hicredit;
+	__s32		locredit;
+	__s32		idleslope;
+	__s32		sendslope;
+};
+
+enum {
+	TCA_CBS_UNSPEC,
+	TCA_CBS_PARMS,
+	__TCA_CBS_MAX,
+};
+
+#define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
+
 #endif
-- 
2.14.1


* [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers
From: Vinicius Costa Gomes @ 2017-09-01  1:26 UTC (permalink / raw)
  To: netdev
  Cc: Vinicius Costa Gomes, jhs, xiyou.wangcong, jiri, intel-wired-lan,
	andre.guedes, ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran

Hi,

This patchset is an RFC on a proposal of how the Traffic Control subsystem can
be used to offload the configuration of traffic shapers into network devices
that provide support for them in HW. Our goal here is to start upstreaming
support for features related to the Time-Sensitive Networking (TSN) set of
standards into the kernel.

As part of this work, we've assessed previous public discussions related to TSN
enabling: patches from Henrik Austad (Cisco), the presentation from Eric Mann
at Linux Plumbers 2012, patches from Gangfeng Huang (National Instruments) and
the current state of the OpenAVNU project (https://github.com/AVnu/OpenAvnu/).

Please note that the patches provided as part of this RFC implement only what
is needed for 802.1Qav (FQTSS), but we'd like to take advantage of this
discussion and share our WIP ideas for the 802.1Qbv and 802.1Qbu interfaces
as well. The current patches only provide support for HW offload of the
configs.

Overview
========

Time-Sensitive Networking (TSN) is a set of standards that aim to address
resource availability in order to provide bandwidth reservation and bounded
latency on Ethernet-based LANs. The proposal described here aims to cover
mainly what is needed to enable the following standards: 802.1Qat, 802.1Qav,
802.1Qbv and 802.1Qbu.

The initial target of this work is the Intel i210 NIC, but other controllers'
datasheets were also taken into account, like the Renesas RZ/A1H RZ/A1M group
and the Synopsys DesignWare Ethernet QoS controller.

Proposal
========

Feature-wise, what is covered here are configuration interfaces for HW
implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
(802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
Qbv and Qbu must be configured per port, with the configuration covering all
queues. Given that these features are related to traffic shaping, and that the
traffic control subsystem already provides a queueing discipline that offloads
config into the device driver (i.e. mqprio), designing new qdiscs for the
specific purpose of offloading the config for each shaper seemed like a good
fit.

For steering traffic into the correct queues, we use the socket option
SO_PRIORITY and then a mechanism to map priority to traffic classes / Tx queues.
The qdisc mqprio is currently used in our tests.
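
As a minimal sketch of the steering step, the snippet below sets
SO_PRIORITY on a socket from Python. It assumes a Linux host (where
socket.SO_PRIORITY is defined) and uses a UDP socket purely for
illustration; the helper name is hypothetical and the sample talker in
this series may open its socket differently.

```python
import socket


def open_prio_socket(priority: int) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given priority.

    Hypothetical helper for illustration. Assumes Linux; priorities 0 to 6
    can be set without CAP_NET_ADMIN.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, priority)
    return sock


if __name__ == "__main__":
    sock = open_prio_socket(3)
    # Read the option back to confirm what the kernel stored.
    print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY))
    sock.close()
```

The priority set this way is what a qdisc such as mqprio then maps to a
traffic class and Tx queue.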

As for the shapers config interface:

 * CBS (802.1Qav)

   This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
   $ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
     idleslope I

   Note that the parameters for this qdisc are the ones defined by the
   802.1Q-2014 spec, so no hardware specific functionality is exposed here.

 * Time-aware shaper (802.1Qbv):

   The idea we are currently exploring is to add a "time-aware", priority based
   qdisc, that also exposes the Tx queues available and provides a mechanism for
   mapping priority <-> traffic class <-> Tx queues in a similar fashion as
   mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would be:

   $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4      \
     	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                         \
	   queues 0 1 2 3                                              \
     	   sched-file gates.sched [base-time <interval>]               \
           [cycle-time <interval>] [extension-time <interval>]

   <file> is multi-line, with each line being of the following format:
   <cmd> <gate mask> <interval in nanoseconds>

   Qbv only defines one <cmd>: "S" for 'SetGates'

   For example:

   S 0x01 300
   S 0x03 500

   This means that there are two intervals: the first has only the gate
   for traffic class 0 open for 300 nanoseconds, and the second has the
   gates for traffic classes 0 and 1 open for 500 nanoseconds.

   Additionally, an option to set just one entry of the gate control list will
   also be provided by 'taprio':

   $ tc qdisc (...) \
        sched-row <row number> <cmd> <gate mask> <interval>  \
        [base-time <interval>] [cycle-time <interval>] \
        [extension-time <interval>]

 * Frame Preemption (802.1Qbu):

   To control latency even further, it may prove useful to signal which
   traffic classes are marked as preemptable. For that, 'taprio' provides the
   preemption command, which sets each traffic class as preemptable or not:

   $ tc qdisc (...) \
        preemption 0 1 1 1

 * Time-aware shaper + Preemption:

   As an example of how Qbv and Qbu can be used together, we may specify
   both the schedule and the preempt-mask. The schedule may then also use
   the Set-Gates-and-Hold and Set-Gates-and-Release commands as
   specified in the Qbu spec:

   $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
     	   map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3                    \
	   queues 0 1 2 3                                         \
     	   preemption 0 1 1 1                                     \
	   sched-file preempt_gates.sched

    <file> is multi-line, with each line being of the following format:
    <cmd> <gate mask> <interval in nanoseconds>

    For this case, two new commands are introduced:

    "H" for 'set gates and hold'
    "R" for 'set gates and release'

    H 0x01 300
    R 0x03 500
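
The schedule file format described above can be sketched as a small
parser. This is only an illustration of the proposed (still WIP) format,
not code from the patchset: the type and function names are
hypothetical, and it assumes that bit n of the gate mask corresponds to
traffic class n, as the 0x01 example suggests.

```python
from typing import List, NamedTuple


class GateEntry(NamedTuple):
    cmd: str           # "S" (SetGates), "H" (hold) or "R" (release)
    gate_mask: int     # assumption: bit n set means the gate of TC n is open
    interval_ns: int   # duration of this entry in nanoseconds


def parse_schedule(text: str) -> List[GateEntry]:
    """Parse lines of the form '<cmd> <gate mask> <interval in ns>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cmd, mask, interval = line.split()
        if cmd not in ("S", "H", "R"):
            raise ValueError(f"unknown command: {cmd}")
        # int(..., 16) accepts the '0x' prefix used in the examples.
        entries.append(GateEntry(cmd, int(mask, 16), int(interval)))
    return entries


def open_classes(gate_mask: int, num_tc: int = 4) -> List[int]:
    """List the traffic classes whose gates are open for a given mask."""
    return [tc for tc in range(num_tc) if gate_mask & (1 << tc)]


if __name__ == "__main__":
    for entry in parse_schedule("S 0x01 300\nS 0x03 500\n"):
        print(entry.cmd, open_classes(entry.gate_mask), entry.interval_ns)
```

Summing interval_ns over all entries gives the total length of one pass
through the gate control list (800 ns for the Qbv example).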

Testing this RFC
================

For testing the patches of this RFC only, you can refer to the samples and
helper script being added to samples/tsn/ and then use the 'mqprio' qdisc to
set up the priority to Tx queue mapping, together with the 'cbs' qdisc to
configure the HW shaper of the i210 controller:

1) Set up the mapping from priorities to traffic classes to hardware queues
$ tc qdisc replace dev enp3s0 parent root mqprio num_tc 3 \
     map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

2) Check the scheme. You want to read the inner qdisc IDs from the bottom up
$ tc -g  class show dev enp3s0

Ex.:
+---(802a:3) mqprio
|    +---(802a:6) mqprio
|    +---(802a:7) mqprio
|
+---(802a:2) mqprio
|    +---(802a:5) mqprio
|
+---(802a:1) mqprio
     +---(802a:4) mqprio

 * Here '802a:4' is Tx Queue #0 and '802a:5' is Tx Queue #1.

3) Calculate CBS parameters for classes A and B, e.g. with a BW of 20Mbps
   for A and 10Mbps for B:
$ ./samples/tsn/calculate_cbs_params.py -A 20000 -a 1500 -B 10000 -b 1500

4) Configure CBS for traffic class A (priority 3) as provided by the script:
$ tc qdisc replace dev enp3s0 parent 802a:4 cbs locredit -1470 \
     hicredit 30 sendslope -980000 idleslope 20000

5) Configure CBS for traffic class B (priority 2):
$ tc qdisc replace dev enp3s0 parent 802a:5 cbs \
     locredit -1485 hicredit 31 sendslope -990000 idleslope 10000

6) Run Listener, compiled from samples/tsn/listener.c
$ ./listener -i enp3s0

7) Run Talker for class A (prio 3 here), compiled from samples/tsn/talker.c
$ ./talker -i enp3s0 -p 3

 * The bandwidth displayed on the listener output at this stage should be very
   close to the one configured for class A.

8) You can also run a Talker for class B (prio 2 here)
$ ./talker -i enp3s0 -p 2

 * The bandwidth displayed on the listener output now should increase to very
   close to the one configured for class A + class B.
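
The CBS parameters used in steps 4 and 5 above can be cross-checked with
a short sketch. This is not the calculate_cbs_params.py script from the
series; the function names are hypothetical, the formulas are the usual
credit-based shaper relations as we understand them from 802.1Q-2014,
and a 1 Gbps port rate is assumed. Only the class A hicredit is
reproduced: the class B value in step 5 presumably includes an extra
interference term from class A, which this sketch does not model.

```python
# Hypothetical cross-check of the tc 'cbs' parameters; slopes are in kbps,
# credits in bytes. Assumption: 1 Gbps port transmit rate.
PORT_RATE_KBPS = 1_000_000


def sendslope(idleslope_kbps: int, port_rate_kbps: int = PORT_RATE_KBPS) -> int:
    """sendSlope = idleSlope - portTransmitRate (always negative)."""
    return idleslope_kbps - port_rate_kbps


def locredit(idleslope_kbps: int, max_frame_bytes: int,
             port_rate_kbps: int = PORT_RATE_KBPS) -> int:
    """loCredit = maxFrameSize * sendSlope / portTransmitRate."""
    slope = sendslope(idleslope_kbps, port_rate_kbps)
    return max_frame_bytes * slope // port_rate_kbps


def hicredit(idleslope_kbps: int, max_interference_bytes: int,
             port_rate_kbps: int = PORT_RATE_KBPS) -> int:
    """hiCredit = maxInterferenceSize * idleSlope / portTransmitRate."""
    return max_interference_bytes * idleslope_kbps // port_rate_kbps


if __name__ == "__main__":
    # Class A: 20 Mbps reserved, 1500-byte frames.
    print(sendslope(20_000), locredit(20_000, 1500), hicredit(20_000, 1500))
    # Class B: 10 Mbps reserved, 1500-byte frames (hicredit not modelled).
    print(sendslope(10_000), locredit(10_000, 1500))
```

The integer division is exact for these inputs; how fractional credits
should be rounded is left open here.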

Authors
=======
 - Andre Guedes <andre.guedes@intel.com>
 - Ivan Briano <ivan.briano@intel.com>
 - Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
 - Vinicius Gomes <vinicius.gomes@intel.com>

Andre Guedes (2):
  igb: Add support for CBS offload
  samples/tsn: Add script for calculating CBS config

Jesus Sanchez-Palencia (1):
  sample: Add TSN Talker and Listener examples

Vinicius Costa Gomes (2):
  net/sched: Introduce the user API for the CBS shaper
  net/sched: Introduce Credit Based Shaper (CBS) qdisc

 drivers/net/ethernet/intel/igb/e1000_defines.h |  23 ++
 drivers/net/ethernet/intel/igb/e1000_regs.h    |   8 +
 drivers/net/ethernet/intel/igb/igb.h           |   6 +
 drivers/net/ethernet/intel/igb/igb_main.c      | 349 +++++++++++++++++++++++++
 include/linux/netdevice.h                      |   1 +
 include/uapi/linux/pkt_sched.h                 |  29 ++
 net/sched/Kconfig                              |  11 +
 net/sched/Makefile                             |   1 +
 net/sched/sch_cbs.c                            | 286 ++++++++++++++++++++
 samples/tsn/calculate_cbs_params.py            | 112 ++++++++
 samples/tsn/listener.c                         | 254 ++++++++++++++++++
 samples/tsn/talker.c                           | 136 ++++++++++
 12 files changed, 1216 insertions(+)
 create mode 100644 net/sched/sch_cbs.c
 create mode 100755 samples/tsn/calculate_cbs_params.py
 create mode 100644 samples/tsn/listener.c
 create mode 100644 samples/tsn/talker.c
