Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Waskiewicz Jr, Peter @ 2017-09-28 21:40 UTC (permalink / raw)
  To: John Fastabend, Andy Gospodarek
  Cc: Daniel Borkmann, davem@davemloft.net,
	alexei.starovoitov@gmail.com, jakub.kicinski@netronome.com,
	netdev@vger.kernel.org, mchan@broadcom.com
In-Reply-To: <bdce98b7-1d32-3cd9-1289-79807af8443f@gmail.com>

On 9/28/17 2:23 PM, John Fastabend wrote:
> [...]
> 
>> I'm pretty sure I misunderstood what you were going after with
>> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a
>> few weeks) will populate the headroom coming out of the driver to XDP,
>> and then once the XDP program extracts whatever hints it wants via
>> helpers, I fully expect that area in the headroom to get stomped by
>> something else.  If we want to send any of that hint data up farther,
>> we'll already have it extracted via the helpers, and the eBPF program
>> can happily assign it to wherever in the outbound metadata area.
> 
> In case its not obvious with the latest xdp metadata patches the outbound
> metadata can then be pushed into skb fields via a tc_cls program if needed.

Yes, that was what I was alluding to with "can happily assign it to 
wherever."  The patches we're working on are driver->XDP, then anything 
else using the latest meta-data patches would be XDP->anywhere else.  So 
I don't think we're going to step on any toes.

Thanks John,
-PJ

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Daniel Borkmann @ 2017-09-28 21:29 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter, Andy Gospodarek
  Cc: davem@davemloft.net, alexei.starovoitov@gmail.com,
	john.fastabend@gmail.com, jakub.kicinski@netronome.com,
	netdev@vger.kernel.org, mchan@broadcom.com
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F077E6438@ORSMSX103.amr.corp.intel.com>

On 09/28/2017 10:52 PM, Waskiewicz Jr, Peter wrote:
> On 9/28/17 12:59 PM, Andy Gospodarek wrote:
>> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
>> <peter.waskiewicz.jr@intel.com> wrote:
>>> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>>>> [...]
>>>>>> First, thanks for this detailed description.  It was helpful to read
>>>>>> along with the patches.
>>>>>>
>>>>>> My only concern about this area being generic is that you are now in a
>>>>>> state where any bpf program must know about all the bpf programs in the
>>>>>> receive pipeline before it can properly parse what is stored in the
>>>>>> meta-data and add it to an skb (or perform any other action).
>>>>>> Especially if each program adds it's own meta-data along the way.
>>>>>>
>>>>>> Maybe this isn't a big concern based on the number of users of this
>>>>>> today, but it just starts to seem like a concern as there are these
>>>>>> hints being passed between layers that are challenging to track due to a
>>>>>> lack of a standard format for passing data between.
>>>>>
>>>>> Btw, we do have similar kind of programmable scratch buffer also today
>>>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>>>> where you can transfer data between tail calls for example, then tail
>>>>> calls themselves that need to coordinate, or simply mangling of packets
>>>>> itself if you will, but more below to your use case ...
>>>>>
>>>>>> The main reason I bring this up is that Michael and I had discussed and
>>>>>> designed a way for drivers to communicate between each other that rx
>>>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>>>> action.  Much like this code, it involved adding an new element to
>>>>>> struct xdp_md that could point to the important information.  Now that
>>>>>> there is a generic way to handle this, it would seem nice to be able to
>>>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>>>> without the ability to mark it in some manner.
>>>>>>
>>>>>> For additional background, the minimum amount of data needed in the case
>>>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>>>> producer entry.  This data could be acessed by the driver processing the
>>>>>> tx completions and callback to the driver that received the frame off the wire
>>>>>> to perform any needed processing.  (For those curious this would also require a
>>>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>>>
>>>>> What you describe above doesn't seem to be fitting to the use-case of
>>>>> this set, meaning the area here is fully programmable out of the BPF
>>>>> program, the infrastructure you're describing is some sort of means of
>>>>> communication between drivers for the XDP_REDIRECT, and should be
>>>>> outside of the control of the BPF program to mangle.
>>>>
>>>> OK, I understand that perspective.  I think saying this is really meant
>>>> as a BPF<->BPF communication channel for now is fine.
>>>>
>>>>> You could probably reuse the base infra here and make a part of that
>>>>> inaccessible for the program with some sort of a fixed layout, but I
>>>>> haven't seen your code yet to be able to fully judge. Intention here
>>>>> is to allow for programmability within the BPF prog in a generic way,
>>>>> such that based on the use-case it can be populated in specific ways
>>>>> and propagated to the skb w/o having to define a fixed layout and
>>>>> bloat xdp_buff all the way to an skb while still retaining all the
>>>>> flexibility.
>>>>
>>>> Some level of reuse might be proper, but I'd rather it be explicit for
>>>> my use since it's not exclusively something that will need to be used by
>>>> a BPF prog, but rather the driver.  I'll produce some patches this week
>>>> for reference.
>>>
>>> Sorry for chiming in late, I've been offline.
>>>
>>> We're looking to add some functionality from driver to XDP inside this
>>> xdp_buff->data_meta region.  We want to assign it to an opaque
>>> structure, that would be specific per driver (think of a flex descriptor
>>> coming out of the hardware).  We'd like to pass these offloaded
>>> computations into XDP programs to help accelerate them, such as packet
>>> type, where headers are located, etc.  It's similar to Jesper's RFC
>>> patches back in May when passing through the mlx Rx descriptor to XDP.
>>>
>>> This is actually what a few of us are planning to present at NetDev 2.2
>>> in November.  If you're hoping to restrict this headroom in the xdp_buff
>>> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
>>> that further.
>>
>> No sweat, PJ, thanks for replying.  I saw the notes for your accepted
>> session and I'm looking forward to it.
>>
>> John's suggestion earlier in the thread was actually similar to the
>> conclusion I reached when thinking about Daniel's patch a bit more.
>> (I like John's better though as it doesn't get constrained by UAPI.)
>> Since redirect actions happen at a point where no other programs will
>> run on the buffer, that space can be used for this redirect data and
>> there are no conflicts.

Yep fully agree, it's not read anywhere else anymore or could go up
the stack where we'd read it out again, so that's the best solution
for your use-case moving forward, Andy. I do like that we don't expose
to uapi.

> Ah, yes, John and I spoke about this at Plumber's and this is basically
> what we came to as well.  A set of helpers that won't have to be in
> UAPI, but they will be potentially vendor-specific to extract the
> meta-data hints out for the XDP program to use.
>
>> It sounds like the idea behind your proposal includes populating some
>> data into the buffer before the XDP program is executed so that it can
>> be used by the program.  Would this data be useful later in the driver
>> or stack or are you just hoping to accelerate processing of frames in
>> the BPF program?
>
> Right now we're thinking it would only be useful for XDP programs to
> execute things quicker, i.e. not have to compute things that are already
> computed by the hardware (rxhash, ptype, header locations, etc.).  I
> don't have any plans to pass this data off elsewhere in the stack or
> back to the driver at this point.
>
>> If the headroom needed for redirect info was only added after it was
>> clear the redirect action was needed, would this conflict with the
>> information you are trying to provide?  I had planned to add this just
>> after the action was XDP_REDIRECT was selected or at the end of the
>> driver's ndo_xdp_xmit function -- it seems like it would not conflict.
>
> I'm pretty sure I misunderstood what you were going after with
> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a
> few weeks) will populate the headroom coming out of the driver to XDP,
> and then once the XDP program extracts whatever hints it wants via
> helpers, I fully expect that area in the headroom to get stomped by
> something else.  If we want to send any of that hint data up farther,
> we'll already have it extracted via the helpers, and the eBPF program
> can happily assign it to wherever in the outbound metadata area.

Sure, these two are compatible with each other; in your case it's
populated before the prog is called, and the prog can use it while
processing, in Andy's case it's populated after the prog was called
when we need to redirect, so both fine.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: John Fastabend @ 2017-09-28 21:22 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter, Andy Gospodarek
  Cc: Daniel Borkmann, davem@davemloft.net,
	alexei.starovoitov@gmail.com, jakub.kicinski@netronome.com,
	netdev@vger.kernel.org, mchan@broadcom.com
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F077E6438@ORSMSX103.amr.corp.intel.com>

[...]

> I'm pretty sure I misunderstood what you were going after with 
> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a 
> few weeks) will populate the headroom coming out of the driver to XDP, 
> and then once the XDP program extracts whatever hints it wants via 
> helpers, I fully expect that area in the headroom to get stomped by 
> something else.  If we want to send any of that hint data up farther, 
> we'll already have it extracted via the helpers, and the eBPF program 
> can happily assign it to wherever in the outbound metadata area.

In case its not obvious with the latest xdp metadata patches the outbound
metadata can then be pushed into skb fields via a tc_cls program if needed.

.John

> 
>> (There's also Jesper's series from today -- I've seen it but have not
>> had time to fully grok all of those changes.)
> 
> I'm also working through my inbox to get to that series.  I have some 
> email to catch up on...
> 
> Thanks Andy,
> -PJ
> 

^ permalink raw reply

* [PATCH net-next] ibmvnic: Set state UP
From: Mick Tarsel @ 2017-09-28 20:53 UTC (permalink / raw)
  To: netdev; +Cc: tlfalcon

State is initially reported as UNKNOWN. Before register call
netif_carrier_off(). Once the device is opened, call netif_carrier_on() in
order to set the state to UP.

Signed-off-by: Mick Tarsel <mjtarsel@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index cb8182f..4bc14a9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -927,6 +927,7 @@ static int ibmvnic_open(struct net_device *netdev)
 	}
 
 	rc = __ibmvnic_open(netdev);
+	netif_carrier_on(netdev);
 	mutex_unlock(&adapter->reset_lock);
 
 	return rc;
@@ -3899,6 +3900,7 @@ static int ibmvnic_probe(struct vio_dev *dev, const struct vio_device_id *id)
 	if (rc)
 		goto ibmvnic_init_fail;
 
+	netif_carrier_off(netdev);
 	rc = register_netdev(netdev);
 	if (rc) {
 		dev_err(&dev->dev, "failed to register netdev rc=%d\n", rc);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Waskiewicz Jr, Peter @ 2017-09-28 20:52 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Daniel Borkmann, davem@davemloft.net,
	alexei.starovoitov@gmail.com, john.fastabend@gmail.com,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	mchan@broadcom.com
In-Reply-To: <CAHashqBMfXp-uYH9ANfdaNfez9f4pcrOjnbX2WAFAdBwaJAtvw@mail.gmail.com>

On 9/28/17 12:59 PM, Andy Gospodarek wrote:
> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
> <peter.waskiewicz.jr@intel.com> wrote:
>> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>>> [...]
>>>>> First, thanks for this detailed description.  It was helpful to read
>>>>> along with the patches.
>>>>>
>>>>> My only concern about this area being generic is that you are now in a
>>>>> state where any bpf program must know about all the bpf programs in the
>>>>> receive pipeline before it can properly parse what is stored in the
>>>>> meta-data and add it to an skb (or perform any other action).
>>>>> Especially if each program adds it's own meta-data along the way.
>>>>>
>>>>> Maybe this isn't a big concern based on the number of users of this
>>>>> today, but it just starts to seem like a concern as there are these
>>>>> hints being passed between layers that are challenging to track due to a
>>>>> lack of a standard format for passing data between.
>>>>
>>>> Btw, we do have similar kind of programmable scratch buffer also today
>>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>>> where you can transfer data between tail calls for example, then tail
>>>> calls themselves that need to coordinate, or simply mangling of packets
>>>> itself if you will, but more below to your use case ...
>>>>
>>>>> The main reason I bring this up is that Michael and I had discussed and
>>>>> designed a way for drivers to communicate between each other that rx
>>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>>> action.  Much like this code, it involved adding an new element to
>>>>> struct xdp_md that could point to the important information.  Now that
>>>>> there is a generic way to handle this, it would seem nice to be able to
>>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>>> without the ability to mark it in some manner.
>>>>>
>>>>> For additional background, the minimum amount of data needed in the case
>>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>>> producer entry.  This data could be acessed by the driver processing the
>>>>> tx completions and callback to the driver that received the frame off the wire
>>>>> to perform any needed processing.  (For those curious this would also require a
>>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>>
>>>> What you describe above doesn't seem to be fitting to the use-case of
>>>> this set, meaning the area here is fully programmable out of the BPF
>>>> program, the infrastructure you're describing is some sort of means of
>>>> communication between drivers for the XDP_REDIRECT, and should be
>>>> outside of the control of the BPF program to mangle.
>>>
>>> OK, I understand that perspective.  I think saying this is really meant
>>> as a BPF<->BPF communication channel for now is fine.
>>>
>>>> You could probably reuse the base infra here and make a part of that
>>>> inaccessible for the program with some sort of a fixed layout, but I
>>>> haven't seen your code yet to be able to fully judge. Intention here
>>>> is to allow for programmability within the BPF prog in a generic way,
>>>> such that based on the use-case it can be populated in specific ways
>>>> and propagated to the skb w/o having to define a fixed layout and
>>>> bloat xdp_buff all the way to an skb while still retaining all the
>>>> flexibility.
>>>
>>> Some level of reuse might be proper, but I'd rather it be explicit for
>>> my use since it's not exclusively something that will need to be used by
>>> a BPF prog, but rather the driver.  I'll produce some patches this week
>>> for reference.
>>
>> Sorry for chiming in late, I've been offline.
>>
>> We're looking to add some functionality from driver to XDP inside this
>> xdp_buff->data_meta region.  We want to assign it to an opaque
>> structure, that would be specific per driver (think of a flex descriptor
>> coming out of the hardware).  We'd like to pass these offloaded
>> computations into XDP programs to help accelerate them, such as packet
>> type, where headers are located, etc.  It's similar to Jesper's RFC
>> patches back in May when passing through the mlx Rx descriptor to XDP.
>>
>> This is actually what a few of us are planning to present at NetDev 2.2
>> in November.  If you're hoping to restrict this headroom in the xdp_buff
>> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
>> that further.
>>
> 
> No sweat, PJ, thanks for replying.  I saw the notes for your accepted
> session and I'm looking forward to it.
> 
> John's suggestion earlier in the thread was actually similar to the
> conclusion I reached when thinking about Daniel's patch a bit more.
> (I like John's better though as it doesn't get constrained by UAPI.)
> Since redirect actions happen at a point where no other programs will
> run on the buffer, that space can be used for this redirect data and
> there are no conflicts.

Ah, yes, John and I spoke about this at Plumber's and this is basically 
what we came to as well.  A set of helpers that won't have to be in 
UAPI, but they will be potentially vendor-specific to extract the 
meta-data hints out for the XDP program to use.

> It sounds like the idea behind your proposal includes populating some
> data into the buffer before the XDP program is executed so that it can
> be used by the program.  Would this data be useful later in the driver
> or stack or are you just hoping to accelerate processing of frames in
> the BPF program?

Right now we're thinking it would only be useful for XDP programs to 
execute things quicker, i.e. not have to compute things that are already 
computed by the hardware (rxhash, ptype, header locations, etc.).  I 
don't have any plans to pass this data off elsewhere in the stack or 
back to the driver at this point.

> If the headroom needed for redirect info was only added after it was
> clear the redirect action was needed, would this conflict with the
> information you are trying to provide?  I had planned to add this just
> after the action was XDP_REDIRECT was selected or at the end of the
> driver's ndo_xdp_xmit function -- it seems like it would not conflict.

I'm pretty sure I misunderstood what you were going after with 
XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a 
few weeks) will populate the headroom coming out of the driver to XDP, 
and then once the XDP program extracts whatever hints it wants via 
helpers, I fully expect that area in the headroom to get stomped by 
something else.  If we want to send any of that hint data up farther, 
we'll already have it extracted via the helpers, and the eBPF program 
can happily assign it to wherever in the outbound metadata area.

> (There's also Jesper's series from today -- I've seen it but have not
> had time to fully grok all of those changes.)

I'm also working through my inbox to get to that series.  I have some 
email to catch up on...

Thanks Andy,
-PJ

^ permalink raw reply

* [PATCH net-next 10/10] sctp: introduce round robin stream scheduler
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

This patch introduces RFC Draft ndata section 3.2 Priority Based
Scheduler (SCTP_SS_RR).

Works by maintaining a list of enqueued streams and tracking the last
one used to send data. When the datamsg is done, it switches to the next
stream.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h |  11 +++
 include/uapi/linux/sctp.h  |   3 +-
 net/sctp/Makefile          |   3 +-
 net/sctp/stream_sched.c    |   2 +
 net/sctp/stream_sched_rr.c | 201 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 218 insertions(+), 2 deletions(-)
 create mode 100644 net/sctp/stream_sched_rr.c

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 40eb8d66a37c3ecee39141dc111663f7aac7326a..16f949eef52fdfd7c90fa15b44093334d1355aaf 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1348,6 +1348,10 @@ struct sctp_stream_out_ext {
 			struct list_head prio_list;
 			struct sctp_stream_priorities *prio_head;
 		};
+		/* Fields used by RR scheduler */
+		struct {
+			struct list_head rr_list;
+		};
 	};
 };
 
@@ -1374,6 +1378,13 @@ struct sctp_stream {
 			/* List of priorities scheduled */
 			struct list_head prio_list;
 		};
+		/* Fields used by RR scheduler */
+		struct {
+			/* List of streams scheduled */
+			struct list_head rr_list;
+			/* The next stream stream in line */
+			struct sctp_stream_out_ext *rr_next;
+		};
 	};
 };
 
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 850fa8b29d7e8163dc4ee88af192309bb2535ae9..6cd7d416ca406e59d3214976fc425bb805f5c6cc 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1100,7 +1100,8 @@ struct sctp_add_streams {
 enum sctp_sched_type {
 	SCTP_SS_FCFS,
 	SCTP_SS_PRIO,
-	SCTP_SS_MAX = SCTP_SS_PRIO
+	SCTP_SS_RR,
+	SCTP_SS_MAX = SCTP_SS_RR
 };
 
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 647c9cfd4e95be4429d25792e5832d7be2efc5c8..bf90c53977190ff563c2b43af31afb7c431d4534 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,8 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o stream_sched.o stream_sched_prio.o
+	  offload.o stream_sched.o stream_sched_prio.o \
+	  stream_sched_rr.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index 115ddb7651695cca7417cb63004a1a59c93523b8..03513a9fa110b5317af4502f98ab37702c1eddb9 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -122,10 +122,12 @@ static struct sctp_sched_ops sctp_sched_fcfs = {
 /* API to other parts of the stack */
 
 extern struct sctp_sched_ops sctp_sched_prio;
+extern struct sctp_sched_ops sctp_sched_rr;
 
 struct sctp_sched_ops *sctp_sched_ops[] = {
 	&sctp_sched_fcfs,
 	&sctp_sched_prio,
+	&sctp_sched_rr,
 };
 
 int sctp_sched_set_sched(struct sctp_association *asoc,
diff --git a/net/sctp/stream_sched_rr.c b/net/sctp/stream_sched_rr.c
new file mode 100644
index 0000000000000000000000000000000000000000..7612a438c5b939ae1c26c4acc06902749b601524
--- /dev/null
+++ b/net/sctp/stream_sched_rr.c
@@ -0,0 +1,201 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresched(es):
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *    Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <linux/list.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* Priority handling
+ * RFC DRAFT ndata section 3.2
+ */
+static void sctp_sched_rr_unsched_all(struct sctp_stream *stream);
+
+static void sctp_sched_rr_next_stream(struct sctp_stream *stream)
+{
+	struct list_head *pos;
+
+	pos = stream->rr_next->rr_list.next;
+	if (pos == &stream->rr_list)
+		pos = pos->next;
+	stream->rr_next = list_entry(pos, struct sctp_stream_out_ext, rr_list);
+}
+
+static void sctp_sched_rr_unsched(struct sctp_stream *stream,
+				  struct sctp_stream_out_ext *soute)
+{
+	if (stream->rr_next == soute)
+		/* Try to move to the next stream */
+		sctp_sched_rr_next_stream(stream);
+
+	list_del_init(&soute->rr_list);
+
+	/* If we have no other stream queued, clear next */
+	if (list_empty(&stream->rr_list))
+		stream->rr_next = NULL;
+}
+
+static void sctp_sched_rr_sched(struct sctp_stream *stream,
+				struct sctp_stream_out_ext *soute)
+{
+	if (!list_empty(&soute->rr_list))
+		/* Already scheduled. */
+		return;
+
+	/* Schedule the stream */
+	list_add_tail(&soute->rr_list, &stream->rr_list);
+
+	if (!stream->rr_next)
+		stream->rr_next = soute;
+}
+
+static int sctp_sched_rr_set(struct sctp_stream *stream, __u16 sid,
+			     __u16 prio, gfp_t gfp)
+{
+	return 0;
+}
+
+static int sctp_sched_rr_get(struct sctp_stream *stream, __u16 sid,
+			     __u16 *value)
+{
+	return 0;
+}
+
+static int sctp_sched_rr_init(struct sctp_stream *stream)
+{
+	INIT_LIST_HEAD(&stream->rr_list);
+	stream->rr_next = NULL;
+
+	return 0;
+}
+
+static int sctp_sched_rr_init_sid(struct sctp_stream *stream, __u16 sid,
+				  gfp_t gfp)
+{
+	INIT_LIST_HEAD(&stream->out[sid].ext->rr_list);
+
+	return 0;
+}
+
+static void sctp_sched_rr_free(struct sctp_stream *stream)
+{
+	sctp_sched_rr_unsched_all(stream);
+}
+
+static void sctp_sched_rr_enqueue(struct sctp_outq *q,
+				  struct sctp_datamsg *msg)
+{
+	struct sctp_stream *stream;
+	struct sctp_chunk *ch;
+	__u16 sid;
+
+	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
+	sid = sctp_chunk_stream_no(ch);
+	stream = &q->asoc->stream;
+	sctp_sched_rr_sched(stream, stream->out[sid].ext);
+}
+
+static struct sctp_chunk *sctp_sched_rr_dequeue(struct sctp_outq *q)
+{
+	struct sctp_stream *stream = &q->asoc->stream;
+	struct sctp_stream_out_ext *soute;
+	struct sctp_chunk *ch = NULL;
+
+	/* Bail out quickly if queue is empty */
+	if (list_empty(&q->out_chunk_list))
+		goto out;
+
+	/* Find which chunk is next */
+	if (stream->out_curr)
+		soute = stream->out_curr->ext;
+	else
+		soute = stream->rr_next;
+	ch = list_entry(soute->outq.next, struct sctp_chunk, stream_list);
+
+	sctp_sched_dequeue_common(q, ch);
+
+out:
+	return ch;
+}
+
+static void sctp_sched_rr_dequeue_done(struct sctp_outq *q,
+				       struct sctp_chunk *ch)
+{
+	struct sctp_stream_out_ext *soute;
+	__u16 sid;
+
+	/* Last chunk on that msg, move to the next stream */
+	sid = sctp_chunk_stream_no(ch);
+	soute = q->asoc->stream.out[sid].ext;
+
+	sctp_sched_rr_next_stream(&q->asoc->stream);
+
+	if (list_empty(&soute->outq))
+		sctp_sched_rr_unsched(&q->asoc->stream, soute);
+}
+
+static void sctp_sched_rr_sched_all(struct sctp_stream *stream)
+{
+	struct sctp_association *asoc;
+	struct sctp_stream_out_ext *soute;
+	struct sctp_chunk *ch;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+	list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list) {
+		__u16 sid;
+
+		sid = sctp_chunk_stream_no(ch);
+		soute = stream->out[sid].ext;
+		if (soute)
+			sctp_sched_rr_sched(stream, soute);
+	}
+}
+
+static void sctp_sched_rr_unsched_all(struct sctp_stream *stream)
+{
+	struct sctp_stream_out_ext *soute, *tmp;
+
+	list_for_each_entry_safe(soute, tmp, &stream->rr_list, rr_list)
+		sctp_sched_rr_unsched(stream, soute);
+}
+
+struct sctp_sched_ops sctp_sched_rr = {
+	.set = sctp_sched_rr_set,
+	.get = sctp_sched_rr_get,
+	.init = sctp_sched_rr_init,
+	.init_sid = sctp_sched_rr_init_sid,
+	.free = sctp_sched_rr_free,
+	.enqueue = sctp_sched_rr_enqueue,
+	.dequeue = sctp_sched_rr_dequeue,
+	.dequeue_done = sctp_sched_rr_dequeue_done,
+	.sched_all = sctp_sched_rr_sched_all,
+	.unsched_all = sctp_sched_rr_unsched_all,
+};
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 09/10] sctp: introduce priority based stream scheduler
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

This patch introduces RFC Draft ndata section 3.4 Priority Based
Scheduler (SCTP_SS_PRIO).

It works by having a struct sctp_stream_priority for each priority
configured. This struct is then enlisted on a queue ordered per priority
if, and only if, there is a stream with data queued, so that dequeueing
is very straightforward: either finish current datamsg or simply dequeue
from the highest priority queued, which is the next stream pointed, and
that's it.

If there are multiple streams assigned with the same priority and with
data queued, it will do round robin amongst them while respecting
datamsgs boundaries (when not using idata chunks), to be reasonably
fair.

We intentionally don't maintain a list of priorities nor a list of all
streams with the same priority to save memory. The first would mean at
least 2 other pointers per priority (which, for 1000 priorities, that
can mean 16kB) and the second would also mean 2 other pointers but per
stream. As SCTP supports up to 65535 streams on a given asoc, that's
1MB. This impacts when giving a priority to some stream, as we have to
find out if the new priority is already being used and if we can free
the old one, and also when tearing down.

The new fields in struct sctp_stream_out_ext and sctp_stream are added
under a union because that memory is to be shared with other schedulers.
It could be defined as an opaque area like skb->cb, but that would make
the list handling a nightmare.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h   |  24 +++
 include/uapi/linux/sctp.h    |   3 +-
 net/sctp/Makefile            |   2 +-
 net/sctp/stream_sched.c      |   3 +
 net/sctp/stream_sched_prio.c | 347 +++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 377 insertions(+), 2 deletions(-)
 create mode 100644 net/sctp/stream_sched_prio.c

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 3c22a30fd71b4ef87419a77cf69b00807a5986bb..40eb8d66a37c3ecee39141dc111663f7aac7326a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1328,10 +1328,27 @@ struct sctp_inithdr_host {
 	__u32 initial_tsn;
 };
 
+struct sctp_stream_priorities {
+	/* List of priorities scheduled */
+	struct list_head prio_sched;
+	/* List of streams scheduled */
+	struct list_head active;
+	/* The next stream stream in line */
+	struct sctp_stream_out_ext *next;
+	__u16 prio;
+};
+
 struct sctp_stream_out_ext {
 	__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
 	__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
 	struct list_head outq; /* chunks enqueued by this stream */
+	union {
+		struct {
+			/* Scheduled streams list */
+			struct list_head prio_list;
+			struct sctp_stream_priorities *prio_head;
+		};
+	};
 };
 
 struct sctp_stream_out {
@@ -1351,6 +1368,13 @@ struct sctp_stream {
 	__u16 incnt;
 	/* Current stream being sent, if any */
 	struct sctp_stream_out *out_curr;
+	union {
+		/* Fields used by priority scheduler */
+		struct {
+			/* List of priorities scheduled */
+			struct list_head prio_list;
+		};
+	};
 };
 
 #define SCTP_STREAM_CLOSED		0x00
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 00ac417d2c4f8468ea2aad32e59806be5c5aa08d..850fa8b29d7e8163dc4ee88af192309bb2535ae9 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1099,7 +1099,8 @@ struct sctp_add_streams {
 /* SCTP Stream schedulers */
 enum sctp_sched_type {
 	SCTP_SS_FCFS,
-	SCTP_SS_MAX = SCTP_SS_FCFS
+	SCTP_SS_PRIO,
+	SCTP_SS_MAX = SCTP_SS_PRIO
 };
 
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 0f6e6d1d69fd336b4a99f896851b0120f9a0d1e0..647c9cfd4e95be4429d25792e5832d7be2efc5c8 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o stream_sched.o
+	  offload.o stream_sched.o stream_sched_prio.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index 40a9a9de2b98a56786a4c8585f5ad514be9189af..115ddb7651695cca7417cb63004a1a59c93523b8 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -121,8 +121,11 @@ static struct sctp_sched_ops sctp_sched_fcfs = {
 
 /* API to other parts of the stack */
 
+extern struct sctp_sched_ops sctp_sched_prio;
+
 struct sctp_sched_ops *sctp_sched_ops[] = {
 	&sctp_sched_fcfs,
+	&sctp_sched_prio,
 };
 
 int sctp_sched_set_sched(struct sctp_association *asoc,
diff --git a/net/sctp/stream_sched_prio.c b/net/sctp/stream_sched_prio.c
new file mode 100644
index 0000000000000000000000000000000000000000..384dbf3c876096e2ad98a6b6185d9da5cc4145c6
--- /dev/null
+++ b/net/sctp/stream_sched_prio.c
@@ -0,0 +1,347 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresched(es):
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *    Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <linux/list.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* Priority handling
+ * RFC DRAFT ndata section 3.4
+ */
+
+static void sctp_sched_prio_unsched_all(struct sctp_stream *stream);
+
+static struct sctp_stream_priorities *sctp_sched_prio_new_head(
+			struct sctp_stream *stream, int prio, gfp_t gfp)
+{
+	struct sctp_stream_priorities *p;
+
+	p = kmalloc(sizeof(*p), gfp);
+	if (!p)
+		return NULL;
+
+	INIT_LIST_HEAD(&p->prio_sched);
+	INIT_LIST_HEAD(&p->active);
+	p->next = NULL;
+	p->prio = prio;
+
+	return p;
+}
+
+static struct sctp_stream_priorities *sctp_sched_prio_get_head(
+			struct sctp_stream *stream, int prio, gfp_t gfp)
+{
+	struct sctp_stream_priorities *p;
+	int i;
+
+	/* Look into scheduled priorities first, as they are sorted and
+	 * we can find it fast IF it's scheduled.
+	 */
+	list_for_each_entry(p, &stream->prio_list, prio_sched) {
+		if (p->prio == prio)
+			return p;
+		if (p->prio > prio)
+			break;
+	}
+
+	/* No luck. So we search on all streams now. */
+	for (i = 0; i < stream->outcnt; i++) {
+		if (!stream->out[i].ext)
+			continue;
+
+		p = stream->out[i].ext->prio_head;
+		if (!p)
+			/* Means all other streams won't be initialized
+			 * as well.
+			 */
+			break;
+		if (p->prio == prio)
+			return p;
+	}
+
+	/* If not even there, allocate a new one. */
+	return sctp_sched_prio_new_head(stream, prio, gfp);
+}
+
+static void sctp_sched_prio_next_stream(struct sctp_stream_priorities *p)
+{
+	struct list_head *pos;
+
+	pos = p->next->prio_list.next;
+	if (pos == &p->active)
+		pos = pos->next;
+	p->next = list_entry(pos, struct sctp_stream_out_ext, prio_list);
+}
+
+static bool sctp_sched_prio_unsched(struct sctp_stream_out_ext *soute)
+{
+	bool scheduled = false;
+
+	if (!list_empty(&soute->prio_list)) {
+		struct sctp_stream_priorities *prio_head = soute->prio_head;
+
+		/* Scheduled */
+		scheduled = true;
+
+		if (prio_head->next == soute)
+			/* Try to move to the next stream */
+			sctp_sched_prio_next_stream(prio_head);
+
+		list_del_init(&soute->prio_list);
+
+		/* Also unsched the priority if this was the last stream */
+		if (list_empty(&prio_head->active)) {
+			list_del_init(&prio_head->prio_sched);
+			/* If there is no stream left, clear next */
+			prio_head->next = NULL;
+		}
+	}
+
+	return scheduled;
+}
+
+static void sctp_sched_prio_sched(struct sctp_stream *stream,
+				  struct sctp_stream_out_ext *soute)
+{
+	struct sctp_stream_priorities *prio, *prio_head;
+
+	prio_head = soute->prio_head;
+
+	/* Nothing to do if already scheduled */
+	if (!list_empty(&soute->prio_list))
+		return;
+
+	/* Schedule the stream. If there is a next, we schedule the new
+	 * one before it, so it's the last in round robin order.
+	 * If there isn't, we also have to schedule the priority.
+	 */
+	if (prio_head->next) {
+		list_add(&soute->prio_list, prio_head->next->prio_list.prev);
+		return;
+	}
+
+	list_add(&soute->prio_list, &prio_head->active);
+	prio_head->next = soute;
+
+	list_for_each_entry(prio, &stream->prio_list, prio_sched) {
+		if (prio->prio > prio_head->prio) {
+			list_add(&prio_head->prio_sched, prio->prio_sched.prev);
+			return;
+		}
+	}
+
+	list_add_tail(&prio_head->prio_sched, &stream->prio_list);
+}
+
+static int sctp_sched_prio_set(struct sctp_stream *stream, __u16 sid,
+			       __u16 prio, gfp_t gfp)
+{
+	struct sctp_stream_out *sout = &stream->out[sid];
+	struct sctp_stream_out_ext *soute = sout->ext;
+	struct sctp_stream_priorities *prio_head, *old;
+	bool reschedule = false;
+	int i;
+
+	prio_head = sctp_sched_prio_get_head(stream, prio, gfp);
+	if (!prio_head)
+		return -ENOMEM;
+
+	reschedule = sctp_sched_prio_unsched(soute);
+	old = soute->prio_head;
+	soute->prio_head = prio_head;
+	if (reschedule)
+		sctp_sched_prio_sched(stream, soute);
+
+	if (!old)
+		/* Happens when we set the priority for the first time */
+		return 0;
+
+	for (i = 0; i < stream->outcnt; i++) {
+		soute = stream->out[i].ext;
+		if (soute && soute->prio_head == old)
+			/* It's still in use, nothing else to do here. */
+			return 0;
+	}
+
+	/* No hits, we are good to free it. */
+	kfree(old);
+
+	return 0;
+}
+
+static int sctp_sched_prio_get(struct sctp_stream *stream, __u16 sid,
+			       __u16 *value)
+{
+	*value = stream->out[sid].ext->prio_head->prio;
+	return 0;
+}
+
+static int sctp_sched_prio_init(struct sctp_stream *stream)
+{
+	INIT_LIST_HEAD(&stream->prio_list);
+
+	return 0;
+}
+
+static int sctp_sched_prio_init_sid(struct sctp_stream *stream, __u16 sid,
+				    gfp_t gfp)
+{
+	INIT_LIST_HEAD(&stream->out[sid].ext->prio_list);
+	return sctp_sched_prio_set(stream, sid, 0, gfp);
+}
+
+static void sctp_sched_prio_free(struct sctp_stream *stream)
+{
+	struct sctp_stream_priorities *prio, *n;
+	LIST_HEAD(list);
+	int i;
+
+	/* As we don't keep a list of priorities, to avoid multiple
+	 * frees we have to do it in 3 steps:
+	 *   1. unsched everyone, so the lists are free to use in 2.
+	 *   2. build the list of the priorities
+	 *   3. free the list
+	 */
+	sctp_sched_prio_unsched_all(stream);
+	for (i = 0; i < stream->outcnt; i++) {
+		if (!stream->out[i].ext)
+			continue;
+		prio = stream->out[i].ext->prio_head;
+		if (prio && list_empty(&prio->prio_sched))
+			list_add(&prio->prio_sched, &list);
+	}
+	list_for_each_entry_safe(prio, n, &list, prio_sched) {
+		list_del_init(&prio->prio_sched);
+		kfree(prio);
+	}
+}
+
+static void sctp_sched_prio_enqueue(struct sctp_outq *q,
+				    struct sctp_datamsg *msg)
+{
+	struct sctp_stream *stream;
+	struct sctp_chunk *ch;
+	__u16 sid;
+
+	ch = list_first_entry(&msg->chunks, struct sctp_chunk, frag_list);
+	sid = sctp_chunk_stream_no(ch);
+	stream = &q->asoc->stream;
+	sctp_sched_prio_sched(stream, stream->out[sid].ext);
+}
+
+static struct sctp_chunk *sctp_sched_prio_dequeue(struct sctp_outq *q)
+{
+	struct sctp_stream *stream = &q->asoc->stream;
+	struct sctp_stream_priorities *prio;
+	struct sctp_stream_out_ext *soute;
+	struct sctp_chunk *ch = NULL;
+
+	/* Bail out quickly if queue is empty */
+	if (list_empty(&q->out_chunk_list))
+		goto out;
+
+	/* Find which chunk is next. It's easy, it's either the current
+	 * one or the first chunk on the next active stream.
+	 */
+	if (stream->out_curr) {
+		soute = stream->out_curr->ext;
+	} else {
+		prio = list_entry(stream->prio_list.next,
+				  struct sctp_stream_priorities, prio_sched);
+		soute = prio->next;
+	}
+	ch = list_entry(soute->outq.next, struct sctp_chunk, stream_list);
+	sctp_sched_dequeue_common(q, ch);
+
+out:
+	return ch;
+}
+
+static void sctp_sched_prio_dequeue_done(struct sctp_outq *q,
+					 struct sctp_chunk *ch)
+{
+	struct sctp_stream_priorities *prio;
+	struct sctp_stream_out_ext *soute;
+	__u16 sid;
+
+	/* Last chunk on that msg, move to the next stream on
+	 * this priority.
+	 */
+	sid = sctp_chunk_stream_no(ch);
+	soute = q->asoc->stream.out[sid].ext;
+	prio = soute->prio_head;
+
+	sctp_sched_prio_next_stream(prio);
+
+	if (list_empty(&soute->outq))
+		sctp_sched_prio_unsched(soute);
+}
+
+static void sctp_sched_prio_sched_all(struct sctp_stream *stream)
+{
+	struct sctp_association *asoc;
+	struct sctp_stream_out *sout;
+	struct sctp_chunk *ch;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+	list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list) {
+		__u16 sid;
+
+		sid = sctp_chunk_stream_no(ch);
+		sout = &stream->out[sid];
+		if (sout->ext)
+			sctp_sched_prio_sched(stream, sout->ext);
+	}
+}
+
+static void sctp_sched_prio_unsched_all(struct sctp_stream *stream)
+{
+	struct sctp_stream_priorities *p, *tmp;
+	struct sctp_stream_out_ext *soute, *souttmp;
+
+	list_for_each_entry_safe(p, tmp, &stream->prio_list, prio_sched)
+		list_for_each_entry_safe(soute, souttmp, &p->active, prio_list)
+			sctp_sched_prio_unsched(soute);
+}
+
+struct sctp_sched_ops sctp_sched_prio = {
+	.set = sctp_sched_prio_set,
+	.get = sctp_sched_prio_get,
+	.init = sctp_sched_prio_init,
+	.init_sid = sctp_sched_prio_init_sid,
+	.free = sctp_sched_prio_free,
+	.enqueue = sctp_sched_prio_enqueue,
+	.dequeue = sctp_sched_prio_dequeue,
+	.dequeue_done = sctp_sched_prio_dequeue_done,
+	.sched_all = sctp_sched_prio_sched_all,
+	.unsched_all = sctp_sched_prio_unsched_all,
+};
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 08/10] sctp: add sockopt to get/set stream scheduler parameters
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

As defined per RFC Draft ndata Section 4.3.3, named as
SCTP_STREAM_SCHEDULER_VALUE.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/uapi/linux/sctp.h |  7 +++++
 net/sctp/socket.c         | 77 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 0050f10087d224bad87c8c54ad318003381aee12..00ac417d2c4f8468ea2aad32e59806be5c5aa08d 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -123,6 +123,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_ADD_STREAMS	121
 #define SCTP_SOCKOPT_PEELOFF_FLAGS 122
 #define SCTP_STREAM_SCHEDULER	123
+#define SCTP_STREAM_SCHEDULER_VALUE	124
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE	0x0000
@@ -815,6 +816,12 @@ struct sctp_assoc_value {
     uint32_t                assoc_value;
 };
 
+struct sctp_stream_value {
+	sctp_assoc_t assoc_id;
+	uint16_t stream_id;
+	uint16_t stream_value;
+};
+
 /*
  * 7.2.2 Peer Address Information
  *
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ae35dbf2810f78c71ce77115ffe4b0e27a672abc..88c28421ec151e83665efcbcbd8a6403b122205a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3945,6 +3945,34 @@ static int sctp_setsockopt_scheduler(struct sock *sk,
 	return retval;
 }
 
+static int sctp_setsockopt_scheduler_value(struct sock *sk,
+					   char __user *optval,
+					   unsigned int optlen)
+{
+	struct sctp_association *asoc;
+	struct sctp_stream_value params;
+	int retval = -EINVAL;
+
+	if (optlen < sizeof(params))
+		goto out;
+
+	optlen = sizeof(params);
+	if (copy_from_user(&params, optval, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc)
+		goto out;
+
+	retval = sctp_sched_set_value(asoc, params.stream_id,
+				      params.stream_value, GFP_KERNEL);
+
+out:
+	return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4129,6 +4157,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_STREAM_SCHEDULER:
 		retval = sctp_setsockopt_scheduler(sk, optval, optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER_VALUE:
+		retval = sctp_setsockopt_scheduler_value(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -6864,6 +6895,48 @@ static int sctp_getsockopt_scheduler(struct sock *sk, int len,
 	return retval;
 }
 
+static int sctp_getsockopt_scheduler_value(struct sock *sk, int len,
+					   char __user *optval,
+					   int __user *optlen)
+{
+	struct sctp_stream_value params;
+	struct sctp_association *asoc;
+	int retval = -EFAULT;
+
+	if (len < sizeof(params)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	len = sizeof(params);
+	if (copy_from_user(&params, optval, len))
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	retval = sctp_sched_get_value(asoc, params.stream_id,
+				      &params.stream_value);
+	if (retval)
+		goto out;
+
+	if (put_user(len, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	if (copy_to_user(optval, &params, len)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+out:
+	return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen)
 {
@@ -7050,6 +7123,10 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
 		retval = sctp_getsockopt_scheduler(sk, len, optval,
 						   optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER_VALUE:
+		retval = sctp_getsockopt_scheduler_value(sk, len, optval,
+							 optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 07/10] sctp: add sockopt to get/set stream scheduler
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

As defined per RFC Draft ndata Section 4.3.2, named as
SCTP_STREAM_SCHEDULER.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/uapi/linux/sctp.h |  1 +
 net/sctp/socket.c         | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 4487e7625ddbd48be1868a8292a807ecd0a314bc..0050f10087d224bad87c8c54ad318003381aee12 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -122,6 +122,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_RESET_ASSOC	120
 #define SCTP_ADD_STREAMS	121
 #define SCTP_SOCKOPT_PEELOFF_FLAGS 122
+#define SCTP_STREAM_SCHEDULER	123
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE	0x0000
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d207734326b085e60625e4333f74221481114892..ae35dbf2810f78c71ce77115ffe4b0e27a672abc 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -79,6 +79,7 @@
 #include <net/sock.h>
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
 
 /* Forward declarations for internal helper functions. */
 static int sctp_writeable(struct sock *sk);
@@ -3914,6 +3915,36 @@ static int sctp_setsockopt_add_streams(struct sock *sk,
 	return retval;
 }
 
+static int sctp_setsockopt_scheduler(struct sock *sk,
+				     char __user *optval,
+				     unsigned int optlen)
+{
+	struct sctp_association *asoc;
+	struct sctp_assoc_value params;
+	int retval = -EINVAL;
+
+	if (optlen < sizeof(params))
+		goto out;
+
+	optlen = sizeof(params);
+	if (copy_from_user(&params, optval, optlen)) {
+		retval = -EFAULT;
+		goto out;
+	}
+
+	if (params.assoc_value > SCTP_SS_MAX)
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc)
+		goto out;
+
+	retval = sctp_sched_set_sched(asoc, params.assoc_value);
+
+out:
+	return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4095,6 +4126,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
 	case SCTP_ADD_STREAMS:
 		retval = sctp_setsockopt_add_streams(sk, optval, optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER:
+		retval = sctp_setsockopt_scheduler(sk, optval, optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
@@ -6793,6 +6827,43 @@ static int sctp_getsockopt_enable_strreset(struct sock *sk, int len,
 	return retval;
 }
 
+static int sctp_getsockopt_scheduler(struct sock *sk, int len,
+				     char __user *optval,
+				     int __user *optlen)
+{
+	struct sctp_assoc_value params;
+	struct sctp_association *asoc;
+	int retval = -EFAULT;
+
+	if (len < sizeof(params)) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	len = sizeof(params);
+	if (copy_from_user(&params, optval, len))
+		goto out;
+
+	asoc = sctp_id2assoc(sk, params.assoc_id);
+	if (!asoc) {
+		retval = -EINVAL;
+		goto out;
+	}
+
+	params.assoc_value = sctp_sched_get_sched(asoc);
+
+	if (put_user(len, optlen))
+		goto out;
+
+	if (copy_to_user(optval, &params, len))
+		goto out;
+
+	retval = 0;
+
+out:
+	return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
 			   char __user *optval, int __user *optlen)
 {
@@ -6975,6 +7046,10 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
 		retval = sctp_getsockopt_enable_strreset(sk, len, optval,
 							 optlen);
 		break;
+	case SCTP_STREAM_SCHEDULER:
+		retval = sctp_getsockopt_scheduler(sk, len, optval,
+						   optlen);
+		break;
 	default:
 		retval = -ENOPROTOOPT;
 		break;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 06/10] sctp: introduce stream scheduler foundations
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

This patch introduces the hooks necessary to do stream scheduling, as
per RFC Draft ndata.  It also introduces the first scheduler, which is
what we do today but now factored out: first come first served (FCFS).

With stream scheduling now we have to track which chunk was enqueued on
which stream and be able to select another other than the in front of
the main outqueue. So we introduce a list on sctp_stream_out_ext
structure for this purpose.

We reuse sctp_chunk->transmitted_list space for the list above, as the
chunk cannot belong to the two lists at the same time. By using the
union in there, we can have distinct names for these moments.

sctp_sched_ops are the operations expected to be implemented by each
scheduler. The dequeueing is a bit particular to this implementation but
it is to match how we dequeue packets today. We first dequeue and then
check if it fits the packet and if not, we requeue it at head. Thus why
we don't have a peek operation but have dequeue_done instead, which is
called once the chunk can be safely considered as transmitted.

The check removed from sctp_outq_flush is now performed by
sctp_stream_outq_migrate, which is only called during assoc setup.
(sctp_sendmsg() also checks for it)

The only operation that is foreseen but not yet added here is a way to
signalize that a new packet is starting or that the packet is done, for
round robin scheduler per packet, but is intentionally left to the
patch that actually implements it.

Support for IDATA chunks, also described in this RFC, with user message
interleaving is straightforward as it just requires the schedulers to
probe for the feature and ignore datamsg boundaries when dequeueing.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/stream_sched.h |  72 +++++++++++
 include/net/sctp/structs.h      |  15 ++-
 include/uapi/linux/sctp.h       |   6 +
 net/sctp/Makefile               |   2 +-
 net/sctp/outqueue.c             |  59 +++++----
 net/sctp/sm_sideeffect.c        |   3 +
 net/sctp/stream.c               |  88 +++++++++++--
 net/sctp/stream_sched.c         | 270 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 477 insertions(+), 38 deletions(-)
 create mode 100644 include/net/sctp/stream_sched.h
 create mode 100644 net/sctp/stream_sched.c

diff --git a/include/net/sctp/stream_sched.h b/include/net/sctp/stream_sched.h
new file mode 100644
index 0000000000000000000000000000000000000000..c676550a4c7dd0ea27ac0e14437d0a2b451ef499
--- /dev/null
+++ b/include/net/sctp/stream_sched.h
@@ -0,0 +1,72 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * These are definitions used by the stream schedulers, defined in RFC
+ * draft ndata (https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-11)
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation  is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresses:
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *   Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#ifndef __sctp_stream_sched_h__
+#define __sctp_stream_sched_h__
+
+struct sctp_sched_ops {
+	/* Property handling for a given stream */
+	int (*set)(struct sctp_stream *stream, __u16 sid, __u16 value,
+		   gfp_t gfp);
+	int (*get)(struct sctp_stream *stream, __u16 sid, __u16 *value);
+
+	/* Init the specific scheduler */
+	int (*init)(struct sctp_stream *stream);
+	/* Init a stream */
+	int (*init_sid)(struct sctp_stream *stream, __u16 sid, gfp_t gfp);
+	/* Frees the entire thing */
+	void (*free)(struct sctp_stream *stream);
+
+	/* Enqueue a chunk */
+	void (*enqueue)(struct sctp_outq *q, struct sctp_datamsg *msg);
+	/* Dequeue a chunk */
+	struct sctp_chunk *(*dequeue)(struct sctp_outq *q);
+	/* Called only if the chunk fit the packet */
+	void (*dequeue_done)(struct sctp_outq *q, struct sctp_chunk *chunk);
+	/* Sched all chunks already enqueued */
+	void (*sched_all)(struct sctp_stream *steam);
+	/* Unched all chunks already enqueued */
+	void (*unsched_all)(struct sctp_stream *steam);
+};
+
+int sctp_sched_set_sched(struct sctp_association *asoc,
+			 enum sctp_sched_type sched);
+int sctp_sched_get_sched(struct sctp_association *asoc);
+int sctp_sched_set_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 value, gfp_t gfp);
+int sctp_sched_get_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 *value);
+void sctp_sched_dequeue_done(struct sctp_outq *q, struct sctp_chunk *ch);
+
+void sctp_sched_dequeue_common(struct sctp_outq *q, struct sctp_chunk *ch);
+int sctp_sched_init_sid(struct sctp_stream *stream, __u16 sid, gfp_t gfp);
+struct sctp_sched_ops *sctp_sched_ops_from_stream(struct sctp_stream *stream);
+
+#endif /* __sctp_stream_sched_h__ */
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index c48f7999fe9b80c5b5e41910a3608059b94140a7..3c22a30fd71b4ef87419a77cf69b00807a5986bb 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -84,7 +84,6 @@ struct sctp_ulpq;
 struct sctp_ep_common;
 struct crypto_shash;
 struct sctp_stream;
-struct sctp_stream_out;
 
 
 #include <net/sctp/tsnmap.h>
@@ -531,8 +530,12 @@ struct sctp_chunk {
 	/* How many times this chunk have been sent, for prsctp RTX policy */
 	int sent_count;
 
-	/* This is our link to the per-transport transmitted list.  */
-	struct list_head transmitted_list;
+	union {
+		/* This is our link to the per-transport transmitted list.  */
+		struct list_head transmitted_list;
+		/* List in specific stream outq */
+		struct list_head stream_list;
+	};
 
 	/* This field is used by chunks that hold fragmented data.
 	 * For the first fragment this is the list that holds the rest of
@@ -1019,6 +1022,9 @@ struct sctp_outq {
 	/* Data pending that has never been transmitted.  */
 	struct list_head out_chunk_list;
 
+	/* Stream scheduler being used */
+	struct sctp_sched_ops *sched;
+
 	unsigned int out_qlen;	/* Total length of queued data chunks. */
 
 	/* Error of send failed, may used in SCTP_SEND_FAILED event. */
@@ -1325,6 +1331,7 @@ struct sctp_inithdr_host {
 struct sctp_stream_out_ext {
 	__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
 	__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+	struct list_head outq; /* chunks enqueued by this stream */
 };
 
 struct sctp_stream_out {
@@ -1342,6 +1349,8 @@ struct sctp_stream {
 	struct sctp_stream_in *in;
 	__u16 outcnt;
 	__u16 incnt;
+	/* Current stream being sent, if any */
+	struct sctp_stream_out *out_curr;
 };
 
 #define SCTP_STREAM_CLOSED		0x00
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 6217ff8500a1d818fd1002fbd6f81c0c11974665..4487e7625ddbd48be1868a8292a807ecd0a314bc 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1088,4 +1088,10 @@ struct sctp_add_streams {
 	uint16_t sas_outstrms;
 };
 
+/* SCTP Stream schedulers */
+enum sctp_sched_type {
+	SCTP_SS_FCFS,
+	SCTP_SS_MAX = SCTP_SS_FCFS
+};
+
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 70f1b570bab9764d692f1c2e605d76d056cda2cd..0f6e6d1d69fd336b4a99f896851b0120f9a0d1e0 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  inqueue.o outqueue.o ulpqueue.o \
 	  tsnmap.o bind_addr.o socket.o primitive.o \
 	  output.o input.o debug.o stream.o auth.o \
-	  offload.o
+	  offload.o stream_sched.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 746b07b7937d8730824b9e09917d947aa7863ec6..4db012aa25f7a042f063bc17b56270effebc6cc6 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -50,6 +50,7 @@
 
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
 
 /* Declare internal functions here.  */
 static int sctp_acked(struct sctp_sackhdr *sack, __u32 tsn);
@@ -72,32 +73,38 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp);
 
 /* Add data to the front of the queue. */
 static inline void sctp_outq_head_data(struct sctp_outq *q,
-					struct sctp_chunk *ch)
+				       struct sctp_chunk *ch)
 {
+	struct sctp_stream_out_ext *oute;
+	__u16 stream;
+
 	list_add(&ch->list, &q->out_chunk_list);
 	q->out_qlen += ch->skb->len;
+
+	stream = sctp_chunk_stream_no(ch);
+	oute = q->asoc->stream.out[stream].ext;
+	list_add(&ch->stream_list, &oute->outq);
 }
 
 /* Take data from the front of the queue. */
 static inline struct sctp_chunk *sctp_outq_dequeue_data(struct sctp_outq *q)
 {
-	struct sctp_chunk *ch = NULL;
-
-	if (!list_empty(&q->out_chunk_list)) {
-		struct list_head *entry = q->out_chunk_list.next;
-
-		ch = list_entry(entry, struct sctp_chunk, list);
-		list_del_init(entry);
-		q->out_qlen -= ch->skb->len;
-	}
-	return ch;
+	return q->sched->dequeue(q);
 }
+
 /* Add data chunk to the end of the queue. */
 static inline void sctp_outq_tail_data(struct sctp_outq *q,
 				       struct sctp_chunk *ch)
 {
+	struct sctp_stream_out_ext *oute;
+	__u16 stream;
+
 	list_add_tail(&ch->list, &q->out_chunk_list);
 	q->out_qlen += ch->skb->len;
+
+	stream = sctp_chunk_stream_no(ch);
+	oute = q->asoc->stream.out[stream].ext;
+	list_add_tail(&ch->stream_list, &oute->outq);
 }
 
 /*
@@ -207,6 +214,7 @@ void sctp_outq_init(struct sctp_association *asoc, struct sctp_outq *q)
 	INIT_LIST_HEAD(&q->retransmit);
 	INIT_LIST_HEAD(&q->sacked);
 	INIT_LIST_HEAD(&q->abandoned);
+	sctp_sched_set_sched(asoc, SCTP_SS_FCFS);
 }
 
 /* Free the outqueue structure and any related pending chunks.
@@ -258,6 +266,7 @@ static void __sctp_outq_teardown(struct sctp_outq *q)
 
 	/* Throw away any leftover data chunks. */
 	while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
+		sctp_sched_dequeue_done(q, chunk);
 
 		/* Mark as send failure. */
 		sctp_chunk_fail(chunk, q->error);
@@ -391,13 +400,14 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 	struct sctp_outq *q = &asoc->outqueue;
 	struct sctp_chunk *chk, *temp;
 
+	q->sched->unsched_all(&asoc->stream);
+
 	list_for_each_entry_safe(chk, temp, &q->out_chunk_list, list) {
 		if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) ||
 		    chk->sinfo.sinfo_timetolive <= sinfo->sinfo_timetolive)
 			continue;
 
-		list_del_init(&chk->list);
-		q->out_qlen -= chk->skb->len;
+		sctp_sched_dequeue_common(q, chk);
 		asoc->sent_cnt_removable--;
 		asoc->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		if (chk->sinfo.sinfo_stream < asoc->stream.outcnt) {
@@ -415,6 +425,8 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 			break;
 	}
 
+	q->sched->sched_all(&asoc->stream);
+
 	return msg_len;
 }
 
@@ -1033,22 +1045,9 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 		while ((chunk = sctp_outq_dequeue_data(q)) != NULL) {
 			__u32 sid = ntohs(chunk->subh.data_hdr->stream);
 
-			/* RFC 2960 6.5 Every DATA chunk MUST carry a valid
-			 * stream identifier.
-			 */
-			if (chunk->sinfo.sinfo_stream >= asoc->stream.outcnt) {
-
-				/* Mark as failed send. */
-				sctp_chunk_fail(chunk, SCTP_ERROR_INV_STRM);
-				if (asoc->peer.prsctp_capable &&
-				    SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
-					asoc->sent_cnt_removable--;
-				sctp_chunk_free(chunk);
-				continue;
-			}
-
 			/* Has this chunk expired? */
 			if (sctp_chunk_abandoned(chunk)) {
+				sctp_sched_dequeue_done(q, chunk);
 				sctp_chunk_fail(chunk, 0);
 				sctp_chunk_free(chunk);
 				continue;
@@ -1070,6 +1069,7 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 				new_transport = asoc->peer.active_path;
 			if (new_transport->state == SCTP_UNCONFIRMED) {
 				WARN_ONCE(1, "Attempt to send packet on unconfirmed path.");
+				sctp_sched_dequeue_done(q, chunk);
 				sctp_chunk_fail(chunk, 0);
 				sctp_chunk_free(chunk);
 				continue;
@@ -1133,6 +1133,11 @@ static void sctp_outq_flush(struct sctp_outq *q, int rtx_timeout, gfp_t gfp)
 				else
 					asoc->stats.oodchunks++;
 
+				/* Only now it's safe to consider this
+				 * chunk as sent, sched-wise.
+				 */
+				sctp_sched_dequeue_done(q, chunk);
+
 				break;
 
 			default:
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index e6a2974e020e1a4232d94e6c2933eebff5f8acb4..402bfbb888cda53248dd192d3756a2f4db1d2a7f 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -50,6 +50,7 @@
 #include <net/sock.h>
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
 
 static int sctp_cmd_interpreter(enum sctp_event event_type,
 				union sctp_subtype subtype,
@@ -1089,6 +1090,8 @@ static void sctp_cmd_send_msg(struct sctp_association *asoc,
 
 	list_for_each_entry(chunk, &msg->chunks, frag_list)
 		sctp_outq_tail(&asoc->outqueue, chunk, gfp);
+
+	asoc->outqueue.sched->enqueue(&asoc->outqueue, msg);
 }
 
 
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 055ca25bbc91bf932db8048c72a1b11cc2214942..5ea33a2c453b4272c5c22fa61e8e8bec06001f8b 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -32,8 +32,61 @@
  *    Xin Long <lucien.xin@gmail.com>
  */
 
+#include <linux/list.h>
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* Migrates chunks from stream queues to new stream queues if needed,
+ * but not across associations. Also, removes those chunks to streams
+ * higher than the new max.
+ */
+static void sctp_stream_outq_migrate(struct sctp_stream *stream,
+				     struct sctp_stream *new, __u16 outcnt)
+{
+	struct sctp_association *asoc;
+	struct sctp_chunk *ch, *temp;
+	struct sctp_outq *outq;
+	int i;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+	outq = &asoc->outqueue;
+
+	list_for_each_entry_safe(ch, temp, &outq->out_chunk_list, list) {
+		__u16 sid = sctp_chunk_stream_no(ch);
+
+		if (sid < outcnt)
+			continue;
+
+		sctp_sched_dequeue_common(outq, ch);
+		/* No need to call dequeue_done here because
+		 * the chunks are not scheduled by now.
+		 */
+
+		/* Mark as failed send. */
+		sctp_chunk_fail(ch, SCTP_ERROR_INV_STRM);
+		if (asoc->peer.prsctp_capable &&
+		    SCTP_PR_PRIO_ENABLED(ch->sinfo.sinfo_flags))
+			asoc->sent_cnt_removable--;
+
+		sctp_chunk_free(ch);
+	}
+
+	if (new) {
+		/* Here we actually move the old ext stuff into the new
+		 * buffer, because we want to keep it. Then
+		 * sctp_stream_update will swap ->out pointers.
+		 */
+		for (i = 0; i < outcnt; i++) {
+			kfree(new->out[i].ext);
+			new->out[i].ext = stream->out[i].ext;
+			stream->out[i].ext = NULL;
+		}
+	}
+
+	for (i = outcnt; i < stream->outcnt; i++)
+		kfree(stream->out[i].ext);
+}
 
 static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 				 gfp_t gfp)
@@ -87,7 +140,8 @@ static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp)
 {
-	int i;
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+	int i, ret = 0;
 
 	gfp |= __GFP_NOWARN;
 
@@ -97,6 +151,11 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	if (outcnt == stream->outcnt)
 		goto in;
 
+	/* Filter out chunks queued on streams that won't exist anymore */
+	sched->unsched_all(stream);
+	sctp_stream_outq_migrate(stream, NULL, outcnt);
+	sched->sched_all(stream);
+
 	i = sctp_stream_alloc_out(stream, outcnt, gfp);
 	if (i)
 		return i;
@@ -105,20 +164,27 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	for (i = 0; i < stream->outcnt; i++)
 		stream->out[i].state = SCTP_STREAM_OPEN;
 
+	sched->init(stream);
+
 in:
 	if (!incnt)
-		return 0;
+		goto out;
 
 	i = sctp_stream_alloc_in(stream, incnt, gfp);
 	if (i) {
-		kfree(stream->out);
-		stream->out = NULL;
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto free;
 	}
 
 	stream->incnt = incnt;
+	goto out;
 
-	return 0;
+free:
+	sched->free(stream);
+	kfree(stream->out);
+	stream->out = NULL;
+out:
+	return ret;
 }
 
 int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
@@ -130,13 +196,15 @@ int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
 		return -ENOMEM;
 	stream->out[sid].ext = soute;
 
-	return 0;
+	return sctp_sched_init_sid(stream, sid, GFP_KERNEL);
 }
 
 void sctp_stream_free(struct sctp_stream *stream)
 {
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
 	int i;
 
+	sched->free(stream);
 	for (i = 0; i < stream->outcnt; i++)
 		kfree(stream->out[i].ext);
 	kfree(stream->out);
@@ -156,6 +224,10 @@ void sctp_stream_clear(struct sctp_stream *stream)
 
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new)
 {
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+
+	sched->unsched_all(stream);
+	sctp_stream_outq_migrate(stream, new, new->outcnt);
 	sctp_stream_free(stream);
 
 	stream->out = new->out;
@@ -163,6 +235,8 @@ void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new)
 	stream->outcnt = new->outcnt;
 	stream->incnt  = new->incnt;
 
+	sched->sched_all(stream);
+
 	new->out = NULL;
 	new->in  = NULL;
 }
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
new file mode 100644
index 0000000000000000000000000000000000000000..40a9a9de2b98a56786a4c8585f5ad514be9189af
--- /dev/null
+++ b/net/sctp/stream_sched.c
@@ -0,0 +1,270 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ *                 ************************
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * <http://www.gnu.org/licenses/>.
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresched(es):
+ *    lksctp developers <linux-sctp@vger.kernel.org>
+ *
+ * Written or modified by:
+ *    Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <linux/list.h>
+#include <net/sctp/sctp.h>
+#include <net/sctp/sm.h>
+#include <net/sctp/stream_sched.h>
+
+/* First Come First Serve (a.k.a. FIFO)
+ * RFC DRAFT ndata Section 3.1
+ */
+static int sctp_sched_fcfs_set(struct sctp_stream *stream, __u16 sid,
+			       __u16 value, gfp_t gfp)
+{
+	return 0;
+}
+
+static int sctp_sched_fcfs_get(struct sctp_stream *stream, __u16 sid,
+			       __u16 *value)
+{
+	*value = 0;
+	return 0;
+}
+
+static int sctp_sched_fcfs_init(struct sctp_stream *stream)
+{
+	return 0;
+}
+
+static int sctp_sched_fcfs_init_sid(struct sctp_stream *stream, __u16 sid,
+				    gfp_t gfp)
+{
+	return 0;
+}
+
+static void sctp_sched_fcfs_free(struct sctp_stream *stream)
+{
+}
+
+static void sctp_sched_fcfs_enqueue(struct sctp_outq *q,
+				    struct sctp_datamsg *msg)
+{
+}
+
+static struct sctp_chunk *sctp_sched_fcfs_dequeue(struct sctp_outq *q)
+{
+	struct sctp_stream *stream = &q->asoc->stream;
+	struct sctp_chunk *ch = NULL;
+	struct list_head *entry;
+
+	if (list_empty(&q->out_chunk_list))
+		goto out;
+
+	if (stream->out_curr) {
+		ch = list_entry(stream->out_curr->ext->outq.next,
+				struct sctp_chunk, stream_list);
+	} else {
+		entry = q->out_chunk_list.next;
+		ch = list_entry(entry, struct sctp_chunk, list);
+	}
+
+	sctp_sched_dequeue_common(q, ch);
+
+out:
+	return ch;
+}
+
+static void sctp_sched_fcfs_dequeue_done(struct sctp_outq *q,
+					 struct sctp_chunk *chunk)
+{
+}
+
+static void sctp_sched_fcfs_sched_all(struct sctp_stream *stream)
+{
+}
+
+static void sctp_sched_fcfs_unsched_all(struct sctp_stream *stream)
+{
+}
+
+static struct sctp_sched_ops sctp_sched_fcfs = {
+	.set = sctp_sched_fcfs_set,
+	.get = sctp_sched_fcfs_get,
+	.init = sctp_sched_fcfs_init,
+	.init_sid = sctp_sched_fcfs_init_sid,
+	.free = sctp_sched_fcfs_free,
+	.enqueue = sctp_sched_fcfs_enqueue,
+	.dequeue = sctp_sched_fcfs_dequeue,
+	.dequeue_done = sctp_sched_fcfs_dequeue_done,
+	.sched_all = sctp_sched_fcfs_sched_all,
+	.unsched_all = sctp_sched_fcfs_unsched_all,
+};
+
+/* API to other parts of the stack */
+
+struct sctp_sched_ops *sctp_sched_ops[] = {
+	&sctp_sched_fcfs,
+};
+
+int sctp_sched_set_sched(struct sctp_association *asoc,
+			 enum sctp_sched_type sched)
+{
+	struct sctp_sched_ops *n = sctp_sched_ops[sched];
+	struct sctp_sched_ops *old = asoc->outqueue.sched;
+	struct sctp_datamsg *msg = NULL;
+	struct sctp_chunk *ch;
+	int i, ret = 0;
+
+	if (old == n)
+		return ret;
+
+	if (sched > SCTP_SS_MAX)
+		return -EINVAL;
+
+	if (old) {
+		old->free(&asoc->stream);
+
+		/* Give the next scheduler a clean slate. */
+		for (i = 0; i < asoc->stream.outcnt; i++) {
+			void *p = asoc->stream.out[i].ext;
+
+			if (!p)
+				continue;
+
+			p += offsetofend(struct sctp_stream_out_ext, outq);
+			memset(p, 0, sizeof(struct sctp_stream_out_ext) -
+				     offsetofend(struct sctp_stream_out_ext, outq));
+		}
+	}
+
+	asoc->outqueue.sched = n;
+	n->init(&asoc->stream);
+	for (i = 0; i < asoc->stream.outcnt; i++) {
+		if (!asoc->stream.out[i].ext)
+			continue;
+
+		ret = n->init_sid(&asoc->stream, i, GFP_KERNEL);
+		if (ret)
+			goto err;
+	}
+
+	/* We have to requeue all chunks already queued. */
+	list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list) {
+		if (ch->msg == msg)
+			continue;
+		msg = ch->msg;
+		n->enqueue(&asoc->outqueue, msg);
+	}
+
+	return ret;
+
+err:
+	n->free(&asoc->stream);
+	asoc->outqueue.sched = &sctp_sched_fcfs; /* Always safe */
+
+	return ret;
+}
+
+int sctp_sched_get_sched(struct sctp_association *asoc)
+{
+	int i;
+
+	for (i = 0; i <= SCTP_SS_MAX; i++)
+		if (asoc->outqueue.sched == sctp_sched_ops[i])
+			return i;
+
+	return 0;
+}
+
+int sctp_sched_set_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 value, gfp_t gfp)
+{
+	if (sid >= asoc->stream.outcnt)
+		return -EINVAL;
+
+	if (!asoc->stream.out[sid].ext) {
+		int ret;
+
+		ret = sctp_stream_init_ext(&asoc->stream, sid);
+		if (ret)
+			return ret;
+	}
+
+	return asoc->outqueue.sched->set(&asoc->stream, sid, value, gfp);
+}
+
+int sctp_sched_get_value(struct sctp_association *asoc, __u16 sid,
+			 __u16 *value)
+{
+	if (sid >= asoc->stream.outcnt)
+		return -EINVAL;
+
+	if (!asoc->stream.out[sid].ext)
+		return 0;
+
+	return asoc->outqueue.sched->get(&asoc->stream, sid, value);
+}
+
+void sctp_sched_dequeue_done(struct sctp_outq *q, struct sctp_chunk *ch)
+{
+	if (!list_is_last(&ch->frag_list, &ch->msg->chunks)) {
+		struct sctp_stream_out *sout;
+		__u16 sid;
+
+		/* datamsg is not finish, so save it as current one,
+		 * in case application switch scheduler or a higher
+		 * priority stream comes in.
+		 */
+		sid = sctp_chunk_stream_no(ch);
+		sout = &q->asoc->stream.out[sid];
+		q->asoc->stream.out_curr = sout;
+		return;
+	}
+
+	q->asoc->stream.out_curr = NULL;
+	q->sched->dequeue_done(q, ch);
+}
+
+/* Auxiliary functions for the schedulers */
+void sctp_sched_dequeue_common(struct sctp_outq *q, struct sctp_chunk *ch)
+{
+	list_del_init(&ch->list);
+	list_del_init(&ch->stream_list);
+	q->out_qlen -= ch->skb->len;
+}
+
+int sctp_sched_init_sid(struct sctp_stream *stream, __u16 sid, gfp_t gfp)
+{
+	struct sctp_sched_ops *sched = sctp_sched_ops_from_stream(stream);
+
+	INIT_LIST_HEAD(&stream->out[sid].ext->outq);
+	return sched->init_sid(stream, sid, gfp);
+}
+
+struct sctp_sched_ops *sctp_sched_ops_from_stream(struct sctp_stream *stream)
+{
+	struct sctp_association *asoc;
+
+	asoc = container_of(stream, struct sctp_association, stream);
+
+	return asoc->outqueue.sched;
+}
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 05/10] sctp: introduce sctp_chunk_stream_no
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

Add a helper to fetch the stream number from a given chunk.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 9b2b30b3ba4dfd10c24c3e06ed80779180a06baf..c48f7999fe9b80c5b5e41910a3608059b94140a7 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -642,6 +642,11 @@ void sctp_init_addrs(struct sctp_chunk *, union sctp_addr *,
 		     union sctp_addr *);
 const union sctp_addr *sctp_source(const struct sctp_chunk *chunk);
 
+static inline __u16 sctp_chunk_stream_no(struct sctp_chunk *ch)
+{
+	return ntohs(ch->subh.data_hdr->stream);
+}
+
 enum {
 	SCTP_ADDR_NEW,		/* new address added to assoc/ep */
 	SCTP_ADDR_SRC,		/* address can be used as source */
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 04/10] sctp: introduce struct sctp_stream_out_ext
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

With the stream schedulers, sctp_stream_out will become too big to be
allocated by kmalloc and as we need to allocate with BH disabled, we
cannot use __vmalloc in sctp_stream_init().

This patch moves out the stats from sctp_stream_out to
sctp_stream_out_ext, which will be allocated only when the application
tries to sendmsg something on it.

Just the introduction of sctp_stream_out_ext would already fix the issue
described above by splitting the allocation in two. Moving the stats
to it also reduces the pressure on the allocator as we will ask for less
memory atomically when creating the socket and we will use GFP_KERNEL
later.

Then, for stream schedulers, we will just use sctp_stream_out_ext.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 include/net/sctp/structs.h | 10 ++++++++--
 net/sctp/chunk.c           |  6 +++---
 net/sctp/outqueue.c        |  4 ++--
 net/sctp/socket.c          | 27 +++++++++++++++++++++------
 net/sctp/stream.c          | 16 ++++++++++++++++
 5 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 0477945de1a3cf5c27348e99d9a30e02c491d1de..9b2b30b3ba4dfd10c24c3e06ed80779180a06baf 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -84,6 +84,7 @@ struct sctp_ulpq;
 struct sctp_ep_common;
 struct crypto_shash;
 struct sctp_stream;
+struct sctp_stream_out;
 
 
 #include <net/sctp/tsnmap.h>
@@ -380,6 +381,7 @@ struct sctp_sender_hb_info {
 
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp);
+int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid);
 void sctp_stream_free(struct sctp_stream *stream);
 void sctp_stream_clear(struct sctp_stream *stream);
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new);
@@ -1315,11 +1317,15 @@ struct sctp_inithdr_host {
 	__u32 initial_tsn;
 };
 
+struct sctp_stream_out_ext {
+	__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
+	__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+};
+
 struct sctp_stream_out {
 	__u16	ssn;
 	__u8	state;
-	__u64	abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
-	__u64	abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+	struct sctp_stream_out_ext *ext;
 };
 
 struct sctp_stream_in {
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 3afac275ee82dbec825dd71378dffe69a53718a7..7b261afc47b9d709fdd780a93aaba874f35d79be 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -311,10 +311,10 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 
 		if (chunk->sent_count) {
 			chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
-			streamout->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
+			streamout->ext->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
 		} else {
 			chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
-			streamout->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
+			streamout->ext->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
 		}
 		return 1;
 	} else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
@@ -323,7 +323,7 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 			&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
 
 		chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
-		streamout->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
+		streamout->ext->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
 		return 1;
 	} else if (!SCTP_PR_POLICY(chunk->sinfo.sinfo_flags) &&
 		   chunk->msg->expires_at &&
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 2966ff400755fe93e3658e09d3bb44b9d7d19d2e..746b07b7937d8730824b9e09917d947aa7863ec6 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -366,7 +366,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
 		streamout = &asoc->stream.out[chk->sinfo.sinfo_stream];
 		asoc->sent_cnt_removable--;
 		asoc->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
-		streamout->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
+		streamout->ext->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
 
 		if (!chk->tsn_gap_acked) {
 			if (chk->transport)
@@ -404,7 +404,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
 			struct sctp_stream_out *streamout =
 				&asoc->stream.out[chk->sinfo.sinfo_stream];
 
-			streamout->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
+			streamout->ext->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
 		}
 
 		msg_len -= SCTP_DATA_SNDSIZE(chk) +
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d4730ada7f3233367be7a0e3bb10e286a25602c8..d207734326b085e60625e4333f74221481114892 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1927,6 +1927,13 @@ static int sctp_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
 		goto out_free;
 	}
 
+	/* Allocate sctp_stream_out_ext if not already done */
+	if (unlikely(!asoc->stream.out[sinfo->sinfo_stream].ext)) {
+		err = sctp_stream_init_ext(&asoc->stream, sinfo->sinfo_stream);
+		if (err)
+			goto out_free;
+	}
+
 	if (sctp_wspace(asoc) < msg_len)
 		sctp_prsctp_prune(asoc, sinfo, msg_len - sctp_wspace(asoc));
 
@@ -6645,7 +6652,7 @@ static int sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
 					   char __user *optval,
 					   int __user *optlen)
 {
-	struct sctp_stream_out *streamout;
+	struct sctp_stream_out_ext *streamoute;
 	struct sctp_association *asoc;
 	struct sctp_prstatus params;
 	int retval = -EINVAL;
@@ -6668,21 +6675,29 @@ static int sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
 	if (!asoc || params.sprstat_sid >= asoc->stream.outcnt)
 		goto out;
 
-	streamout = &asoc->stream.out[params.sprstat_sid];
+	streamoute = asoc->stream.out[params.sprstat_sid].ext;
+	if (!streamoute) {
+		/* Not allocated yet, means all stats are 0 */
+		params.sprstat_abandoned_unsent = 0;
+		params.sprstat_abandoned_sent = 0;
+		retval = 0;
+		goto out;
+	}
+
 	if (policy == SCTP_PR_SCTP_NONE) {
 		params.sprstat_abandoned_unsent = 0;
 		params.sprstat_abandoned_sent = 0;
 		for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
 			params.sprstat_abandoned_unsent +=
-				streamout->abandoned_unsent[policy];
+				streamoute->abandoned_unsent[policy];
 			params.sprstat_abandoned_sent +=
-				streamout->abandoned_sent[policy];
+				streamoute->abandoned_sent[policy];
 		}
 	} else {
 		params.sprstat_abandoned_unsent =
-			streamout->abandoned_unsent[__SCTP_PR_INDEX(policy)];
+			streamoute->abandoned_unsent[__SCTP_PR_INDEX(policy)];
 		params.sprstat_abandoned_sent =
-			streamout->abandoned_sent[__SCTP_PR_INDEX(policy)];
+			streamoute->abandoned_sent[__SCTP_PR_INDEX(policy)];
 	}
 
 	if (put_user(len, optlen) || copy_to_user(optval, &params, len)) {
diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 952437d656cc71ad1c133a736c539eff9a8d80c2..055ca25bbc91bf932db8048c72a1b11cc2214942 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -121,8 +121,24 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	return 0;
 }
 
+int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid)
+{
+	struct sctp_stream_out_ext *soute;
+
+	soute = kzalloc(sizeof(*soute), GFP_KERNEL);
+	if (!soute)
+		return -ENOMEM;
+	stream->out[sid].ext = soute;
+
+	return 0;
+}
+
 void sctp_stream_free(struct sctp_stream *stream)
 {
+	int i;
+
+	for (i = 0; i < stream->outcnt; i++)
+		kfree(stream->out[i].ext);
 	kfree(stream->out);
 	kfree(stream->in);
 }
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 03/10] sctp: factor out stream->in allocation
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

Same idea as previous patch.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/stream.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 6d0e997d301f89e165367106c02e82f8a6c3a877..952437d656cc71ad1c133a736c539eff9a8d80c2 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -59,6 +59,31 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
 	return 0;
 }
 
+static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
+				gfp_t gfp)
+{
+	struct sctp_stream_in *in;
+
+	in = kmalloc_array(incnt, sizeof(*stream->in), gfp);
+
+	if (!in)
+		return -ENOMEM;
+
+	if (stream->in) {
+		memcpy(in, stream->in, min(incnt, stream->incnt) *
+				       sizeof(*in));
+		kfree(stream->in);
+	}
+
+	if (incnt > stream->incnt)
+		memset(in + stream->incnt, 0,
+		       (incnt - stream->incnt) * sizeof(*in));
+
+	stream->in = in;
+
+	return 0;
+}
+
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp)
 {
@@ -84,8 +109,8 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	if (!incnt)
 		return 0;
 
-	stream->in = kcalloc(incnt, sizeof(*stream->in), gfp);
-	if (!stream->in) {
+	i = sctp_stream_alloc_in(stream, incnt, gfp);
+	if (i) {
 		kfree(stream->out);
 		stream->out = NULL;
 		return -ENOMEM;
@@ -623,7 +648,6 @@ struct sctp_chunk *sctp_process_strreset_addstrm_out(
 	struct sctp_strreset_addstrm *addstrm = param.v;
 	struct sctp_stream *stream = &asoc->stream;
 	__u32 result = SCTP_STRRESET_DENIED;
-	struct sctp_stream_in *streamin;
 	__u32 request_seq, incnt;
 	__u16 in, i;
 
@@ -670,13 +694,9 @@ struct sctp_chunk *sctp_process_strreset_addstrm_out(
 	if (!in || incnt > SCTP_MAX_STREAM)
 		goto out;
 
-	streamin = krealloc(stream->in, incnt * sizeof(*streamin),
-			    GFP_ATOMIC);
-	if (!streamin)
+	if (sctp_stream_alloc_in(stream, incnt, GFP_ATOMIC))
 		goto out;
 
-	memset(streamin + stream->incnt, 0, in * sizeof(*streamin));
-	stream->in = streamin;
 	stream->incnt = incnt;
 
 	result = SCTP_STRRESET_PERFORMED;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 02/10] sctp: factor out stream->out allocation
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

There is 1 place allocating it and 2 other reallocating. Move such
procedures to a common function.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/stream.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 1afa9555808390d5fc736727422d9700a3855613..6d0e997d301f89e165367106c02e82f8a6c3a877 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -35,6 +35,30 @@
 #include <net/sctp/sctp.h>
 #include <net/sctp/sm.h>
 
+static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
+				 gfp_t gfp)
+{
+	struct sctp_stream_out *out;
+
+	out = kmalloc_array(outcnt, sizeof(*out), gfp);
+	if (!out)
+		return -ENOMEM;
+
+	if (stream->out) {
+		memcpy(out, stream->out, min(outcnt, stream->outcnt) *
+					 sizeof(*out));
+		kfree(stream->out);
+	}
+
+	if (outcnt > stream->outcnt)
+		memset(out + stream->outcnt, 0,
+		       (outcnt - stream->outcnt) * sizeof(*out));
+
+	stream->out = out;
+
+	return 0;
+}
+
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 		     gfp_t gfp)
 {
@@ -48,11 +72,9 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	if (outcnt == stream->outcnt)
 		goto in;
 
-	kfree(stream->out);
-
-	stream->out = kcalloc(outcnt, sizeof(*stream->out), gfp);
-	if (!stream->out)
-		return -ENOMEM;
+	i = sctp_stream_alloc_out(stream, outcnt, gfp);
+	if (i)
+		return i;
 
 	stream->outcnt = outcnt;
 	for (i = 0; i < stream->outcnt; i++)
@@ -276,15 +298,9 @@ int sctp_send_add_streams(struct sctp_association *asoc,
 	}
 
 	if (out) {
-		struct sctp_stream_out *streamout;
-
-		streamout = krealloc(stream->out, outcnt * sizeof(*streamout),
-				     GFP_KERNEL);
-		if (!streamout)
+		retval = sctp_stream_alloc_out(stream, outcnt, GFP_KERNEL);
+		if (retval)
 			goto out;
-
-		memset(streamout + stream->outcnt, 0, out * sizeof(*streamout));
-		stream->out = streamout;
 	}
 
 	chunk = sctp_make_strreset_addstrm(asoc, out, in);
@@ -682,10 +698,10 @@ struct sctp_chunk *sctp_process_strreset_addstrm_in(
 	struct sctp_strreset_addstrm *addstrm = param.v;
 	struct sctp_stream *stream = &asoc->stream;
 	__u32 result = SCTP_STRRESET_DENIED;
-	struct sctp_stream_out *streamout;
 	struct sctp_chunk *chunk = NULL;
 	__u32 request_seq, outcnt;
 	__u16 out, i;
+	int ret;
 
 	request_seq = ntohl(addstrm->request_seq);
 	if (TSN_lt(asoc->strreset_inseq, request_seq) ||
@@ -714,14 +730,10 @@ struct sctp_chunk *sctp_process_strreset_addstrm_in(
 	if (!out || outcnt > SCTP_MAX_STREAM)
 		goto out;
 
-	streamout = krealloc(stream->out, outcnt * sizeof(*streamout),
-			     GFP_ATOMIC);
-	if (!streamout)
+	ret = sctp_stream_alloc_out(stream, outcnt, GFP_ATOMIC);
+	if (ret)
 		goto out;
 
-	memset(streamout + stream->outcnt, 0, out * sizeof(*streamout));
-	stream->out = streamout;
-
 	chunk = sctp_make_strreset_addstrm(asoc, out, 0);
 	if (!chunk)
 		goto out;
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 01/10] sctp: silence warns on sctp_stream_init allocations
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight
In-Reply-To: <cover.1506536044.git.marcelo.leitner@gmail.com>

As SCTP supports up to 65535 streams, that can lead to very large
allocations in sctp_stream_init(). As Xin Long noticed, systems with
small amounts of memory are more prone to not have enough memory and
dump warnings on dmesg initiated by user actions. Thus, silence them.

Also, if the reallocation of stream->out is not necessary, skip it and
keep the memory we already have.

Reported-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/stream.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 63ea1550371493ec8863627c7a43f46a22f4a4c9..1afa9555808390d5fc736727422d9700a3855613 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -40,9 +40,14 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 {
 	int i;
 
+	gfp |= __GFP_NOWARN;
+
 	/* Initial stream->out size may be very big, so free it and alloc
-	 * a new one with new outcnt to save memory.
+	 * a new one with new outcnt to save memory if needed.
 	 */
+	if (outcnt == stream->outcnt)
+		goto in;
+
 	kfree(stream->out);
 
 	stream->out = kcalloc(outcnt, sizeof(*stream->out), gfp);
@@ -53,6 +58,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 	for (i = 0; i < stream->outcnt; i++)
 		stream->out[i].state = SCTP_STREAM_OPEN;
 
+in:
 	if (!incnt)
 		return 0;
 
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 00/10] Introduce SCTP Stream Schedulers
From: Marcelo Ricardo Leitner @ 2017-09-28 20:25 UTC (permalink / raw)
  To: netdev; +Cc: linux-sctp, Neil Horman, Vlad Yasevich, Xin Long, David Laight

This patchset introduces the SCTP Stream Schedulers are defined by
https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13

It provides 3 schedulers at the moment: FCFS, Priority and Round Robin.
The other 3, Round Robin per packet, Fair Capacity and Weighted Fair
Capacity will be added later. More specifically, WFQ is required by
WebRTC Datachannels.

The draft also defines the idata chunk, allowing a usermsg to be
interrupted by another piece of idata from another stream. This patchset
*doesn't* include it. It will be posted later by Xin Long.  Its
integration with this patchset is very simple and it basically only
requires a tweak in sctp_sched_dequeue_done(), to ignore datamsg
boundaries.

The first 5 patches are a preparation for the next ones. The most
relevant patches are the 4th and 6th ones. More details are available on
each patch.

Marcelo Ricardo Leitner (10):
  sctp: silence warns on sctp_stream_init allocations
  sctp: factor out stream->out allocation
  sctp: factor out stream->in allocation
  sctp: introduce struct sctp_stream_out_ext
  sctp: introduce sctp_chunk_stream_no
  sctp: introduce stream scheduler foundations
  sctp: add sockopt to get/set stream scheduler
  sctp: add sockopt to get/set stream scheduler parameters
  sctp: introduce priority based stream scheduler
  sctp: introduce round robin stream scheduler

 include/net/sctp/stream_sched.h |  72 +++++++++
 include/net/sctp/structs.h      |  63 +++++++-
 include/uapi/linux/sctp.h       |  16 ++
 net/sctp/Makefile               |   3 +-
 net/sctp/chunk.c                |   6 +-
 net/sctp/outqueue.c             |  63 ++++----
 net/sctp/sm_sideeffect.c        |   3 +
 net/sctp/socket.c               | 179 ++++++++++++++++++++-
 net/sctp/stream.c               | 196 +++++++++++++++++++----
 net/sctp/stream_sched.c         | 275 +++++++++++++++++++++++++++++++
 net/sctp/stream_sched_prio.c    | 347 ++++++++++++++++++++++++++++++++++++++++
 net/sctp/stream_sched_rr.c      | 201 +++++++++++++++++++++++
 12 files changed, 1347 insertions(+), 77 deletions(-)
 create mode 100644 include/net/sctp/stream_sched.h
 create mode 100644 net/sctp/stream_sched.c
 create mode 100644 net/sctp/stream_sched_prio.c
 create mode 100644 net/sctp/stream_sched_rr.c

-- 
2.13.5

^ permalink raw reply

* Re: [PATCH 1/4] dt-bindings: net: ravb: Document optional reset-gpios property
From: Sergei Shtylyov @ 2017-09-28 20:07 UTC (permalink / raw)
  To: Geert Uytterhoeven, David S . Miller, Simon Horman, Magnus Damm
  Cc: Andrew Lunn, Florian Fainelli, Niklas Söderlund, netdev,
	linux-renesas-soc, devicetree
In-Reply-To: <1506614014-4398-2-git-send-email-geert+renesas@glider.be>

Hello!

On 09/28/2017 06:53 PM, Geert Uytterhoeven wrote:

> The optional "reset-gpios" property (part of the generic MDIO bus
> properties) lets us describe the GPIO used for resetting the Ethernet
> PHY.
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> ---
>   Documentation/devicetree/bindings/net/renesas,ravb.txt | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt
> index c902261893b913f5..4a6ec1ba32d0bf16 100644
> --- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
> +++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
> @@ -52,6 +52,7 @@ Optional properties:
>   			 AVB_LINK signal.
>   - renesas,ether-link-active-low: boolean, specify when the AVB_LINK signal is
>   				 active-low instead of normal active-high.
> +- reset-gpios: see mdio.txt in the same directory.

    Sigh, I can only repeat that was a terrible prop name choice -- when 
applied to a MAC node... what reset does it mean? MAC?

MBR, Sergei

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Andy Gospodarek @ 2017-09-28 19:58 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: Daniel Borkmann, davem@davemloft.net,
	alexei.starovoitov@gmail.com, john.fastabend@gmail.com,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	mchan@broadcom.com
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F077E53D6@ORSMSX103.amr.corp.intel.com>

On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
<peter.waskiewicz.jr@intel.com> wrote:
> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>> [...]
>>>> First, thanks for this detailed description.  It was helpful to read
>>>> along with the patches.
>>>>
>>>> My only concern about this area being generic is that you are now in a
>>>> state where any bpf program must know about all the bpf programs in the
>>>> receive pipeline before it can properly parse what is stored in the
>>>> meta-data and add it to an skb (or perform any other action).
>>>> Especially if each program adds it's own meta-data along the way.
>>>>
>>>> Maybe this isn't a big concern based on the number of users of this
>>>> today, but it just starts to seem like a concern as there are these
>>>> hints being passed between layers that are challenging to track due to a
>>>> lack of a standard format for passing data between.
>>>
>>> Btw, we do have similar kind of programmable scratch buffer also today
>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>> where you can transfer data between tail calls for example, then tail
>>> calls themselves that need to coordinate, or simply mangling of packets
>>> itself if you will, but more below to your use case ...
>>>
>>>> The main reason I bring this up is that Michael and I had discussed and
>>>> designed a way for drivers to communicate between each other that rx
>>>> resources could be freed after a tx completion on an XDP_REDIRECT
>>>> action.  Much like this code, it involved adding an new element to
>>>> struct xdp_md that could point to the important information.  Now that
>>>> there is a generic way to handle this, it would seem nice to be able to
>>>> leverage it, but I'm not sure how reliable this meta-data area would be
>>>> without the ability to mark it in some manner.
>>>>
>>>> For additional background, the minimum amount of data needed in the case
>>>> Michael and I were discussing was really 2 words.  One to serve as a
>>>> pointer to an rx_ring structure and one to have a counter to the rx
>>>> producer entry.  This data could be acessed by the driver processing the
>>>> tx completions and callback to the driver that received the frame off the wire
>>>> to perform any needed processing.  (For those curious this would also require a
>>>> new callback/netdev op to act on this data stored in the XDP buffer.)
>>>
>>> What you describe above doesn't seem to be fitting to the use-case of
>>> this set, meaning the area here is fully programmable out of the BPF
>>> program, the infrastructure you're describing is some sort of means of
>>> communication between drivers for the XDP_REDIRECT, and should be
>>> outside of the control of the BPF program to mangle.
>>
>> OK, I understand that perspective.  I think saying this is really meant
>> as a BPF<->BPF communication channel for now is fine.
>>
>>> You could probably reuse the base infra here and make a part of that
>>> inaccessible for the program with some sort of a fixed layout, but I
>>> haven't seen your code yet to be able to fully judge. Intention here
>>> is to allow for programmability within the BPF prog in a generic way,
>>> such that based on the use-case it can be populated in specific ways
>>> and propagated to the skb w/o having to define a fixed layout and
>>> bloat xdp_buff all the way to an skb while still retaining all the
>>> flexibility.
>>
>> Some level of reuse might be proper, but I'd rather it be explicit for
>> my use since it's not exclusively something that will need to be used by
>> a BPF prog, but rather the driver.  I'll produce some patches this week
>> for reference.
>
> Sorry for chiming in late, I've been offline.
>
> We're looking to add some functionality from driver to XDP inside this
> xdp_buff->data_meta region.  We want to assign it to an opaque
> structure, that would be specific per driver (think of a flex descriptor
> coming out of the hardware).  We'd like to pass these offloaded
> computations into XDP programs to help accelerate them, such as packet
> type, where headers are located, etc.  It's similar to Jesper's RFC
> patches back in May when passing through the mlx Rx descriptor to XDP.
>
> This is actually what a few of us are planning to present at NetDev 2.2
> in November.  If you're hoping to restrict this headroom in the xdp_buff
> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
> that further.
>

No sweat, PJ, thanks for replying.  I saw the notes for your accepted
session and I'm looking forward to it.

John's suggestion earlier in the thread was actually similar to the
conclusion I reached when thinking about Daniel's patch a bit more.
(I like John's better though as it doesn't get constrained by UAPI.)
Since redirect actions happen at a point where no other programs will
run on the buffer, that space can be used for this redirect data and
there are no conflicts.

It sounds like the idea behind your proposal includes populating some
data into the buffer before the XDP program is executed so that it can
be used by the program.  Would this data be useful later in the driver
or stack or are you just hoping to accelerate processing of frames in
the BPF program?

If the headroom needed for redirect info was only added after it was
clear the redirect action was needed, would this conflict with the
information you are trying to provide?  I had planned to add this just
after the action was XDP_REDIRECT was selected or at the end of the
driver's ndo_xdp_xmit function -- it seems like it would not conflict.

(There's also Jesper's series from today -- I've seen it but have not
had time to fully grok all of those changes.)

Thoughts?

^ permalink raw reply

* Re: [PATCH net-next RFC 3/9] net: dsa: mv88e6xxx: add support for GPIO configuration
From: Vivien Didelot @ 2017-09-28 19:57 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli
  Cc: Brandon Streiff, netdev, linux-kernel, David S. Miller,
	Richard Cochran, Erik Hons
In-Reply-To: <20170928180111.GF14940@lunn.ch>

Hi Brandon,

>> Would there be any value in implementing a proper gpiochip structure
>> here such that other pieces of SW can see this GPIO controller as a
>> provider and you can reference it from e.g: Device Tree using GPIO
>> descriptors?
>
> That would be my preference as well, or maybe a pinctrl driver.

Indeed seeing a gpio_chip or a pinctrl controller registered from a
gpio.c or pinctrl.c file in a separate patchset would be great.


Thanks,

        Vivien

^ permalink raw reply

* Re: [PATCH/RFC net-next] ravb: RX checksum offload
From: Sergei Shtylyov @ 2017-09-28 19:56 UTC (permalink / raw)
  To: Simon Horman; +Cc: David Miller, Magnus Damm, netdev, linux-renesas-soc
In-Reply-To: <20170928104918.GA11212@verge.net.au>

Hello!

On 09/28/2017 01:49 PM, Simon Horman wrote:

>>> Add support for RX checksum offload. This is enabled by default and
>>> may be disabled and re-enabled using ethtool:
>>>
>>>   # ethtool -K eth0 rx off
>>>   # ethtool -K eth0 rx on
>>>
>>> The RAVB provides a simple checksumming scheme which appears to be
>>> completely compatible with CHECKSUM_COMPLETE: a 1's complement sum of
>>
>>     Hm, the gen2/3 manuals say calculation doesn't involve bit inversion...
> 
> Yes, I believe that matches my observation of the values supplied by
> the hardware. Empirically this appears to be what the kernel expects.

    Then why you talk of 1's complement here?

>>> all packet data after the L2 header is appended to packet data; this may
>>> be trivially read by the driver and used to update the skb accordingly.
>>>
>>> In terms of performance throughput is close to gigabit line-rate both with
>>> and without RX checksum offload enabled. Perf output, however, appears to
>>> indicate that significantly less time is spent in do_csum(). This is as
>>> expected.
>>
>> [...]
>>
>>> By inspection this also appears to be compatible with the ravb found
>>> on R-Car Gen 2 SoCs, however, this patch is currently untested on such
>>> hardware.
>>
>>     I probably won't be able to test it on gen2 too...
>>
>>> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
>>
>>     I'm generally OK with the patch but have some questions/comments below...
> 
> Thanks, I will try to address them.
> 
>>> ---
>>>   drivers/net/ethernet/renesas/ravb_main.c | 58 +++++++++++++++++++++++++++++++-
>>>   1 file changed, 57 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index fdf30bfa403b..7c6438cd7de7 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>> [...]
>>> @@ -1842,6 +1859,41 @@ static int ravb_do_ioctl(struct net_device *ndev, struct ifreq *req, int cmd)
>>>   	return phy_mii_ioctl(phydev, req, cmd);
>>>   }
>>> +static void ravb_set_rx_csum(struct net_device *ndev, bool enable)
>>> +{
>>> +	struct ravb_private *priv = netdev_priv(ndev);
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&priv->lock, flags);
>>> +
>>> +	/* Disable TX and RX */
>>> +	ravb_rcv_snd_disable(ndev);
>>> +
>>> +	/* Modify RX Checksum setting */
>>> +	if (enable)
>>> +		ravb_modify(ndev, ECMR, 0, ECMR_RCSC);
>>
>>     Please use ECMR_RCSC as the 3rd argument too to conform the common driver
>> style.
>>
>>> +	else
>>> +		ravb_modify(ndev, ECMR, ECMR_RCSC, 0);
>>
>>     This *if* can easily be folded into a single ravb_modify() call...
> 
> Thanks, something like this?
> 
> 	ravb_modify(ndev, ECMR, ECMR_RCSC, enable ? ECMR_RCSC : 0);

    Yes, exactly! :-)

>> [...]
>>> @@ -2004,6 +2057,9 @@ static int ravb_probe(struct platform_device *pdev)
>>>   	if (!ndev)
>>>   		return -ENOMEM;
>>> +	ndev->features |= NETIF_F_RXCSUM;
>>> +	ndev->hw_features |= ndev->features;
>>
>>     Hum, both fields are 0 before this? Then why not use '=' instead of '|='?
>> Even if not, why not just use the same value as both the rvalues?
> 
> I don't feel strongly about this, how about?
> 
> 	ndev->features = NETIF_F_RXCSUM;
> 	ndev->hw_features = NETIF_F_RXCSUM;

    Yes, I think it should work...

MBR, Sergei

^ permalink raw reply

* Re: [PATCH RFC 3/5] Add KSZ8795 switch driver
From: Andrew Lunn @ 2017-09-28 19:34 UTC (permalink / raw)
  To: Tristram.Ha
  Cc: muvarov, pavel, nathan.leigh.conrad, vivien.didelot, f.fainelli,
	netdev, linux-kernel, Woojung.Huh
In-Reply-To: <93AF473E2DA327428DE3D46B72B1E9FD41124D5A@CHN-SV-EXMX02.mchp-main.com>

On Mon, Sep 18, 2017 at 08:27:13PM +0000, Tristram.Ha@microchip.com wrote:
> > > +/**
> > > + * Some counters do not need to be read too often because they are less
> > likely
> > > + * to increase much.
> > > + */
> > 
> > What does comment mean? Are you caching statistics, and updating
> > different values at different rates?
> > 
> 
> There are 34 counters.  In normal case using generic bus I/O or PCI to read them
> is very quick, but the switch is mostly accessed using SPI, or even I2C.  As the SPI
> access is very slow.

How slow is it? The Marvell switches all use MDIO. It is probably a
bit faster than I2C, but it is a lot slower than MMIO or PCI.

ethtool -S lan0 takes about 25ms.

No other driver does caching. So i'm hesitant to add one which does.

>  These accesses can be getting 1588 PTP timestamps and opening/closing ports.

You could drop the mutex between each statistic read, so allowing
something else access to the switch. That should reduce the jitter PTP
experiences.

	Andrew

^ permalink raw reply

* Re: [PATCH net-next RFC 5/9] net: dsa: forward hardware timestamping ioctls to switch driver
From: Vivien Didelot @ 2017-09-28 19:31 UTC (permalink / raw)
  To: Brandon Streiff, netdev
  Cc: linux-kernel, David S. Miller, Florian Fainelli, Andrew Lunn,
	Richard Cochran, Erik Hons, Brandon Streiff
In-Reply-To: <1506612341-18061-6-git-send-email-brandon.streiff@ni.com>

Hi Brandon,

Brandon Streiff <brandon.streiff@ni.com> writes:

>  static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>  {
> +	struct dsa_slave_priv *p = netdev_priv(dev);
> +	struct dsa_switch *ds = p->dp->ds;
> +	int port = p->dp->index;
> +
>  	if (!dev->phydev)
>  		return -ENODEV;

Move this check below:

>  
> -	return phy_mii_ioctl(dev->phydev, ifr, cmd);
> +	switch (cmd) {
> +	case SIOCGMIIPHY:
> +	case SIOCGMIIREG:
> +	case SIOCSMIIREG:
> +		if (dev->phydev)
> +			return phy_mii_ioctl(dev->phydev, ifr, cmd);
> +		else
> +			return -EOPNOTSUPP;

                if (!dev->phydev)
                        return -ENODEV;

                return phy_mii_ioctl(dev->phydev, ifr, cmd);

> +	case SIOCGHWTSTAMP:
> +		if (ds->ops->port_hwtstamp_get)
> +			return ds->ops->port_hwtstamp_get(ds, port, ifr);
> +		else
> +			return -EOPNOTSUPP;

Here you can replace the else statement with break;

> +	case SIOCSHWTSTAMP:
> +		if (ds->ops->port_hwtstamp_set)
> +			return ds->ops->port_hwtstamp_set(ds, port, ifr);
> +		else
> +			return -EOPNOTSUPP;

Same here;

> +	default:
> +		return -EOPNOTSUPP;
> +	}

Then drop the default case and return -EOPNOTSUPP after the switch.

>  }


Thanks,

        Vivien

^ permalink raw reply

* Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower
From: Nambiar, Amritha @ 2017-09-28 19:22 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: intel-wired-lan, jeffrey.t.kirsher, alexander.h.duyck, netdev,
	mlxsw, alexander.duyck@gmail.com, Jamal Hadi Salim, Cong Wang
In-Reply-To: <82e0a065-c7d6-8fe6-aedc-154dd0dd88d6@intel.com>

On 9/14/2017 1:00 AM, Nambiar, Amritha wrote:
> On 9/13/2017 6:26 AM, Jiri Pirko wrote:
>> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.nambiar@intel.com wrote:
>>> This patch enables tc-flower based hardware offloads. tc flower
>>> filter provided by the kernel is configured as driver specific
>>> cloud filter. The patch implements functions and admin queue
>>> commands needed to support cloud filters in the driver and
>>> adds cloud filters to configure these tc-flower filters.
>>>
>>> The only action supported is to redirect packets to a traffic class
>>> on the same device.
>>
>> So basically you are not doing redirect, you are just setting tclass for
>> matched packets, right? Why you use mirred for this? I think that
>> you might consider extending g_act for that:
>>
>> # tc filter add dev eth0 protocol ip ingress \
>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>   action tclass 0
>>
> Yes, this doesn't work like a typical egress redirect, but is aimed at
> forwarding the matched packets to a different queue-group/traffic class
> on the same device, so some sort-of ingress redirect in the hardware. I
> possibly may not need the mirred-redirect as you say, I'll look into the
> g_act way of doing this with a new gact tc action.
> 

I was looking at introducing a new gact tclass action to TC. In the HW
offload path, this sets a traffic class value for certain matched
packets so they will be processed in a queue belonging to the traffic class.

# tc filter add dev eth0 protocol ip parent ffff:\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw\
  action tclass 2

But, I'm having trouble defining what this action means in the kernel
datapath. For ingress, this action could just take the default path and
do nothing and only have meaning in the HW offloaded path. For egress,
certain qdiscs like 'multiq' and 'prio' could use this 'tclass' value
for band selection, while the 'mqprio' qdisc selects the traffic class
based on the skb priority in netdev_pick_tx(), so what would this action
mean for the 'mqprio' qdisc?

It looks like the 'prio' qdisc uses band selection based on the
'classid', so I was thinking of using the 'classid' through the cls
flower filter and offload it to HW for the traffic class index, this way
we would have the same behavior in HW offload and SW fallback and there
would be no need for a separate tc action.

In HW:
# tc filter add dev eth0 protocol ip parent ffff:\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw classid 1:2\

filter pref 2 flower chain 0
filter pref 2 flower chain 0 handle 0x1 classid 1:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.3.5
  dst_port 25
  skip_sw
  in_hw

This will be used to route packets to traffic class 2.

In SW:
# tc filter add dev eth0 protocol ip parent ffff:\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_hw classid 1:2

filter pref 2 flower chain 0
filter pref 2 flower chain 0 handle 0x1 classid 1:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.3.5
  dst_port 25
  skip_hw
  not_in_hw

>>
>>>
>>> # tc qdisc add dev eth0 ingress
>>> # ethtool -K eth0 hw-tc-offload on
>>>
>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 0
>>>
>>> # tc filter add dev eth0 protocol ip parent ffff:\
>>>  prio 2 flower dst_ip 192.168.3.5/32\
>>>  ip_proto udp dst_port 25 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 1
>>>
>>> # tc filter add dev eth0 protocol ipv6 parent ffff:\
>>>  prio 3 flower dst_ip fe8::200:1\
>>>  ip_proto udp dst_port 66 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 1
>>>
>>> Delete tc flower filter:
>>> Example:
>>>
>>> # tc filter del dev eth0 parent ffff: prio 3 handle 0x1 flower
>>> # tc filter del dev eth0 parent ffff:
>>>
>>> Flow Director Sideband is disabled while configuring cloud filters
>>> via tc-flower and until any cloud filter exists.
>>>
>>> Unsupported matches when cloud filters are added using enhanced
>>> big buffer cloud filter mode of underlying switch include:
>>> 1. source port and source IP
>>> 2. Combined MAC address and IP fields.
>>> 3. Not specifying L4 port
>>>
>>> These filter matches can however be used to redirect traffic to
>>> the main VSI (tc 0) which does not require the enhanced big buffer
>>> cloud filter support.
>>>
>>> v3: Cleaned up some lengthy function names. Changed ipv6 address to
>>> __be32 array instead of u8 array. Used macro for IP version. Minor
>>> formatting changes.
>>> v2:
>>> 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>>> 2. Moved dev_info for add/deleting cloud filters in else condition
>>> 3. Fixed some format specifier in dev_err logs
>>> 4. Refactored i40e_get_capabilities to take an additional
>>>   list_type parameter and use it to query device and function
>>>   level capabilities.
>>> 5. Fixed parsing tc redirect action to check for the is_tcf_mirred_tc()
>>>   to verify if redirect to a traffic class is supported.
>>> 6. Added comments for Geneve fix in cloud filter big buffer AQ
>>>   function definitions.
>>> 7. Cleaned up setup_tc interface to rebase and work with Jiri's
>>>   updates, separate function to process tc cls flower offloads.
>>> 8. Changes to make Flow Director Sideband and Cloud filters mutually
>>>   exclusive.
>>>
>>> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
>>> Signed-off-by: Kiran Patil <kiran.patil@intel.com>
>>> Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
>>> Signed-off-by: Jingjing Wu <jingjing.wu@intel.com>
>>> ---
>>> drivers/net/ethernet/intel/i40e/i40e.h             |   49 +
>>> drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h  |    3 
>>> drivers/net/ethernet/intel/i40e/i40e_common.c      |  189 ++++
>>> drivers/net/ethernet/intel/i40e/i40e_main.c        |  971 +++++++++++++++++++-
>>> drivers/net/ethernet/intel/i40e/i40e_prototype.h   |   16 
>>> drivers/net/ethernet/intel/i40e/i40e_type.h        |    1 
>>> .../net/ethernet/intel/i40evf/i40e_adminq_cmd.h    |    3 
>>> 7 files changed, 1202 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
>>> index 6018fb6..b110519 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>>> @@ -55,6 +55,8 @@
>>> #include <linux/net_tstamp.h>
>>> #include <linux/ptp_clock_kernel.h>
>>> #include <net/pkt_cls.h>
>>> +#include <net/tc_act/tc_gact.h>
>>> +#include <net/tc_act/tc_mirred.h>
>>> #include "i40e_type.h"
>>> #include "i40e_prototype.h"
>>> #include "i40e_client.h"
>>> @@ -252,9 +254,52 @@ struct i40e_fdir_filter {
>>> 	u32 fd_id;
>>> };
>>>
>>> +#define IPV4_VERSION 4
>>> +#define IPV6_VERSION 6
>>> +
>>> +#define I40E_CLOUD_FIELD_OMAC	0x01
>>> +#define I40E_CLOUD_FIELD_IMAC	0x02
>>> +#define I40E_CLOUD_FIELD_IVLAN	0x04
>>> +#define I40E_CLOUD_FIELD_TEN_ID	0x08
>>> +#define I40E_CLOUD_FIELD_IIP	0x10
>>> +
>>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC	I40E_CLOUD_FIELD_OMAC
>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC	I40E_CLOUD_FIELD_IMAC
>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN	(I40E_CLOUD_FIELD_IMAC | \
>>> +						 I40E_CLOUD_FIELD_IVLAN)
>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID	(I40E_CLOUD_FIELD_IMAC | \
>>> +						 I40E_CLOUD_FIELD_TEN_ID)
>>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
>>> +						  I40E_CLOUD_FIELD_IMAC | \
>>> +						  I40E_CLOUD_FIELD_TEN_ID)
>>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>>> +						   I40E_CLOUD_FIELD_IVLAN | \
>>> +						   I40E_CLOUD_FIELD_TEN_ID)
>>> +#define I40E_CLOUD_FILTER_FLAGS_IIP	I40E_CLOUD_FIELD_IIP
>>> +
>>> struct i40e_cloud_filter {
>>> 	struct hlist_node cloud_node;
>>> 	unsigned long cookie;
>>> +	/* cloud filter input set follows */
>>> +	u8 dst_mac[ETH_ALEN];
>>> +	u8 src_mac[ETH_ALEN];
>>> +	__be16 vlan_id;
>>> +	__be32 dst_ip;
>>> +	__be32 src_ip;
>>> +	__be32 dst_ipv6[4];
>>> +	__be32 src_ipv6[4];
>>> +	__be16 dst_port;
>>> +	__be16 src_port;
>>> +	u32 ip_version;
>>> +	u8 ip_proto;	/* IPPROTO value */
>>> +	/* L4 port type: src or destination port */
>>> +#define I40E_CLOUD_FILTER_PORT_SRC	0x01
>>> +#define I40E_CLOUD_FILTER_PORT_DEST	0x02
>>> +	u8 port_type;
>>> +	u32 tenant_id;
>>> +	u8 flags;
>>> +#define I40E_CLOUD_TNL_TYPE_NONE	0xff
>>> +	u8 tunnel_type;
>>> 	u16 seid;	/* filter control */
>>> };
>>>
>>> @@ -491,6 +536,8 @@ struct i40e_pf {
>>> #define I40E_FLAG_LINK_DOWN_ON_CLOSE_ENABLED	BIT(27)
>>> #define I40E_FLAG_SOURCE_PRUNING_DISABLED	BIT(28)
>>> #define I40E_FLAG_TC_MQPRIO			BIT(29)
>>> +#define I40E_FLAG_FD_SB_INACTIVE		BIT(30)
>>> +#define I40E_FLAG_FD_SB_TO_CLOUD_FILTER		BIT(31)
>>>
>>> 	struct i40e_client_instance *cinst;
>>> 	bool stat_offsets_loaded;
>>> @@ -573,6 +620,8 @@ struct i40e_pf {
>>> 	u16 phy_led_val;
>>>
>>> 	u16 override_q_count;
>>> +	u16 last_sw_conf_flags;
>>> +	u16 last_sw_conf_valid_flags;
>>> };
>>>
>>> /**
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>> index 2e567c2..feb3d42 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>>> @@ -1392,6 +1392,9 @@ struct i40e_aqc_cloud_filters_element_data {
>>> 		struct {
>>> 			u8 data[16];
>>> 		} v6;
>>> +		struct {
>>> +			__le16 data[8];
>>> +		} raw_v6;
>>> 	} ipaddr;
>>> 	__le16	flags;
>>> #define I40E_AQC_ADD_CLOUD_FILTER_SHIFT			0
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
>>> index 9567702..d9c9665 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_common.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
>>> @@ -5434,5 +5434,194 @@ i40e_add_pinfo_to_list(struct i40e_hw *hw,
>>>
>>> 	status = i40e_aq_write_ppp(hw, (void *)sec, sec->data_end,
>>> 				   track_id, &offset, &info, NULL);
>>> +
>>> +	return status;
>>> +}
>>> +
>>> +/**
>>> + * i40e_aq_add_cloud_filters
>>> + * @hw: pointer to the hardware structure
>>> + * @seid: VSI seid to add cloud filters from
>>> + * @filters: Buffer which contains the filters to be added
>>> + * @filter_count: number of filters contained in the buffer
>>> + *
>>> + * Set the cloud filters for a given VSI.  The contents of the
>>> + * i40e_aqc_cloud_filters_element_data are filled in by the caller
>>> + * of the function.
>>> + *
>>> + **/
>>> +enum i40e_status_code
>>> +i40e_aq_add_cloud_filters(struct i40e_hw *hw, u16 seid,
>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>> +			  u8 filter_count)
>>> +{
>>> +	struct i40e_aq_desc desc;
>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>> +	enum i40e_status_code status;
>>> +	u16 buff_len;
>>> +
>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>> +					  i40e_aqc_opc_add_cloud_filters);
>>> +
>>> +	buff_len = filter_count * sizeof(*filters);
>>> +	desc.datalen = cpu_to_le16(buff_len);
>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>> +	cmd->num_filters = filter_count;
>>> +	cmd->seid = cpu_to_le16(seid);
>>> +
>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>> +
>>> +	return status;
>>> +}
>>> +
>>> +/**
>>> + * i40e_aq_add_cloud_filters_bb
>>> + * @hw: pointer to the hardware structure
>>> + * @seid: VSI seid to add cloud filters from
>>> + * @filters: Buffer which contains the filters in big buffer to be added
>>> + * @filter_count: number of filters contained in the buffer
>>> + *
>>> + * Set the big buffer cloud filters for a given VSI.  The contents of the
>>> + * i40e_aqc_cloud_filters_element_bb are filled in by the caller of the
>>> + * function.
>>> + *
>>> + **/
>>> +i40e_status
>>> +i40e_aq_add_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>> +			     u8 filter_count)
>>> +{
>>> +	struct i40e_aq_desc desc;
>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>> +	i40e_status status;
>>> +	u16 buff_len;
>>> +	int i;
>>> +
>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>> +					  i40e_aqc_opc_add_cloud_filters);
>>> +
>>> +	buff_len = filter_count * sizeof(*filters);
>>> +	desc.datalen = cpu_to_le16(buff_len);
>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>> +	cmd->num_filters = filter_count;
>>> +	cmd->seid = cpu_to_le16(seid);
>>> +	cmd->big_buffer_flag = I40E_AQC_ADD_CLOUD_CMD_BB;
>>> +
>>> +	for (i = 0; i < filter_count; i++) {
>>> +		u16 tnl_type;
>>> +		u32 ti;
>>> +
>>> +		tnl_type = (le16_to_cpu(filters[i].element.flags) &
>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_MASK) >>
>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT;
>>> +
>>> +		/* For Geneve, the VNI should be placed in offset shifted by a
>>> +		 * byte than the offset for the Tenant ID for rest of the
>>> +		 * tunnels.
>>> +		 */
>>> +		if (tnl_type == I40E_AQC_ADD_CLOUD_TNL_TYPE_GENEVE) {
>>> +			ti = le32_to_cpu(filters[i].element.tenant_id);
>>> +			filters[i].element.tenant_id = cpu_to_le32(ti << 8);
>>> +		}
>>> +	}
>>> +
>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>> +
>>> +	return status;
>>> +}
>>> +
>>> +/**
>>> + * i40e_aq_rem_cloud_filters
>>> + * @hw: pointer to the hardware structure
>>> + * @seid: VSI seid to remove cloud filters from
>>> + * @filters: Buffer which contains the filters to be removed
>>> + * @filter_count: number of filters contained in the buffer
>>> + *
>>> + * Remove the cloud filters for a given VSI.  The contents of the
>>> + * i40e_aqc_cloud_filters_element_data are filled in by the caller
>>> + * of the function.
>>> + *
>>> + **/
>>> +enum i40e_status_code
>>> +i40e_aq_rem_cloud_filters(struct i40e_hw *hw, u16 seid,
>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>> +			  u8 filter_count)
>>> +{
>>> +	struct i40e_aq_desc desc;
>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>> +	enum i40e_status_code status;
>>> +	u16 buff_len;
>>> +
>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>> +					  i40e_aqc_opc_remove_cloud_filters);
>>> +
>>> +	buff_len = filter_count * sizeof(*filters);
>>> +	desc.datalen = cpu_to_le16(buff_len);
>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>> +	cmd->num_filters = filter_count;
>>> +	cmd->seid = cpu_to_le16(seid);
>>> +
>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>> +
>>> +	return status;
>>> +}
>>> +
>>> +/**
>>> + * i40e_aq_rem_cloud_filters_bb
>>> + * @hw: pointer to the hardware structure
>>> + * @seid: VSI seid to remove cloud filters from
>>> + * @filters: Buffer which contains the filters in big buffer to be removed
>>> + * @filter_count: number of filters contained in the buffer
>>> + *
>>> + * Remove the big buffer cloud filters for a given VSI.  The contents of the
>>> + * i40e_aqc_cloud_filters_element_bb are filled in by the caller of the
>>> + * function.
>>> + *
>>> + **/
>>> +i40e_status
>>> +i40e_aq_rem_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>> +			     u8 filter_count)
>>> +{
>>> +	struct i40e_aq_desc desc;
>>> +	struct i40e_aqc_add_remove_cloud_filters *cmd =
>>> +	(struct i40e_aqc_add_remove_cloud_filters *)&desc.params.raw;
>>> +	i40e_status status;
>>> +	u16 buff_len;
>>> +	int i;
>>> +
>>> +	i40e_fill_default_direct_cmd_desc(&desc,
>>> +					  i40e_aqc_opc_remove_cloud_filters);
>>> +
>>> +	buff_len = filter_count * sizeof(*filters);
>>> +	desc.datalen = cpu_to_le16(buff_len);
>>> +	desc.flags |= cpu_to_le16((u16)(I40E_AQ_FLAG_BUF | I40E_AQ_FLAG_RD));
>>> +	cmd->num_filters = filter_count;
>>> +	cmd->seid = cpu_to_le16(seid);
>>> +	cmd->big_buffer_flag = I40E_AQC_ADD_CLOUD_CMD_BB;
>>> +
>>> +	for (i = 0; i < filter_count; i++) {
>>> +		u16 tnl_type;
>>> +		u32 ti;
>>> +
>>> +		tnl_type = (le16_to_cpu(filters[i].element.flags) &
>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_MASK) >>
>>> +			   I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT;
>>> +
>>> +		/* For Geneve, the VNI should be placed in offset shifted by a
>>> +		 * byte than the offset for the Tenant ID for rest of the
>>> +		 * tunnels.
>>> +		 */
>>> +		if (tnl_type == I40E_AQC_ADD_CLOUD_TNL_TYPE_GENEVE) {
>>> +			ti = le32_to_cpu(filters[i].element.tenant_id);
>>> +			filters[i].element.tenant_id = cpu_to_le32(ti << 8);
>>> +		}
>>> +	}
>>> +
>>> +	status = i40e_asq_send_command(hw, &desc, filters, buff_len, NULL);
>>> +
>>> 	return status;
>>> }
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> index afcf08a..96ee608 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>>> @@ -69,6 +69,15 @@ static int i40e_reset(struct i40e_pf *pf);
>>> static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired);
>>> static void i40e_fdir_sb_setup(struct i40e_pf *pf);
>>> static int i40e_veb_get_bw_info(struct i40e_veb *veb);
>>> +static int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>>> +				     struct i40e_cloud_filter *filter,
>>> +				     bool add);
>>> +static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>>> +					     struct i40e_cloud_filter *filter,
>>> +					     bool add);
>>> +static int i40e_get_capabilities(struct i40e_pf *pf,
>>> +				 enum i40e_admin_queue_opc list_type);
>>> +
>>>
>>> /* i40e_pci_tbl - PCI Device ID Table
>>>  *
>>> @@ -5478,7 +5487,11 @@ int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate)
>>>  **/
>>> static void i40e_remove_queue_channels(struct i40e_vsi *vsi)
>>> {
>>> +	enum i40e_admin_queue_err last_aq_status;
>>> +	struct i40e_cloud_filter *cfilter;
>>> 	struct i40e_channel *ch, *ch_tmp;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	struct hlist_node *node;
>>> 	int ret, i;
>>>
>>> 	/* Reset rss size that was stored when reconfiguring rss for
>>> @@ -5519,6 +5532,29 @@ static void i40e_remove_queue_channels(struct i40e_vsi *vsi)
>>> 				 "Failed to reset tx rate for ch->seid %u\n",
>>> 				 ch->seid);
>>>
>>> +		/* delete cloud filters associated with this channel */
>>> +		hlist_for_each_entry_safe(cfilter, node,
>>> +					  &pf->cloud_filter_list, cloud_node) {
>>> +			if (cfilter->seid != ch->seid)
>>> +				continue;
>>> +
>>> +			hash_del(&cfilter->cloud_node);
>>> +			if (cfilter->dst_port)
>>> +				ret = i40e_add_del_cloud_filter_big_buf(vsi,
>>> +									cfilter,
>>> +									false);
>>> +			else
>>> +				ret = i40e_add_del_cloud_filter(vsi, cfilter,
>>> +								false);
>>> +			last_aq_status = pf->hw.aq.asq_last_status;
>>> +			if (ret)
>>> +				dev_info(&pf->pdev->dev,
>>> +					 "Failed to delete cloud filter, err %s aq_err %s\n",
>>> +					 i40e_stat_str(&pf->hw, ret),
>>> +					 i40e_aq_str(&pf->hw, last_aq_status));
>>> +			kfree(cfilter);
>>> +		}
>>> +
>>> 		/* delete VSI from FW */
>>> 		ret = i40e_aq_delete_element(&vsi->back->hw, ch->seid,
>>> 					     NULL);
>>> @@ -5970,6 +6006,74 @@ static bool i40e_setup_channel(struct i40e_pf *pf, struct i40e_vsi *vsi,
>>> }
>>>
>>> /**
>>> + * i40e_validate_and_set_switch_mode - sets up switch mode correctly
>>> + * @vsi: ptr to VSI which has PF backing
>>> + * @l4type: true for TCP ond false for UDP
>>> + * @port_type: true if port is destination and false if port is source
>>> + *
>>> + * Sets up switch mode correctly if it needs to be changed and perform
>>> + * what are allowed modes.
>>> + **/
>>> +static int i40e_validate_and_set_switch_mode(struct i40e_vsi *vsi, bool l4type,
>>> +					     bool port_type)
>>> +{
>>> +	u8 mode;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	struct i40e_hw *hw = &pf->hw;
>>> +	int ret;
>>> +
>>> +	ret = i40e_get_capabilities(pf, i40e_aqc_opc_list_dev_capabilities);
>>> +	if (ret)
>>> +		return -EINVAL;
>>> +
>>> +	if (hw->dev_caps.switch_mode) {
>>> +		/* if switch mode is set, support mode2 (non-tunneled for
>>> +		 * cloud filter) for now
>>> +		 */
>>> +		u32 switch_mode = hw->dev_caps.switch_mode &
>>> +							I40E_SWITCH_MODE_MASK;
>>> +		if (switch_mode >= I40E_NVM_IMAGE_TYPE_MODE1) {
>>> +			if (switch_mode == I40E_NVM_IMAGE_TYPE_MODE2)
>>> +				return 0;
>>> +			dev_err(&pf->pdev->dev,
>>> +				"Invalid switch_mode (%d), only non-tunneled mode for cloud filter is supported\n",
>>> +				hw->dev_caps.switch_mode);
>>> +			return -EINVAL;
>>> +		}
>>> +	}
>>> +
>>> +	/* port_type: true for destination port and false for source port
>>> +	 * For now, supports only destination port type
>>> +	 */
>>> +	if (!port_type) {
>>> +		dev_err(&pf->pdev->dev, "src port type not supported\n");
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	/* Set Bit 7 to be valid */
>>> +	mode = I40E_AQ_SET_SWITCH_BIT7_VALID;
>>> +
>>> +	/* Set L4type to both TCP and UDP support */
>>> +	mode |= I40E_AQ_SET_SWITCH_L4_TYPE_BOTH;
>>> +
>>> +	/* Set cloud filter mode */
>>> +	mode |= I40E_AQ_SET_SWITCH_MODE_NON_TUNNEL;
>>> +
>>> +	/* Prep mode field for set_switch_config */
>>> +	ret = i40e_aq_set_switch_config(hw, pf->last_sw_conf_flags,
>>> +					pf->last_sw_conf_valid_flags,
>>> +					mode, NULL);
>>> +	if (ret && hw->aq.asq_last_status != I40E_AQ_RC_ESRCH)
>>> +		dev_err(&pf->pdev->dev,
>>> +			"couldn't set switch config bits, err %s aq_err %s\n",
>>> +			i40e_stat_str(hw, ret),
>>> +			i40e_aq_str(hw,
>>> +				    hw->aq.asq_last_status));
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +/**
>>>  * i40e_create_queue_channel - function to create channel
>>>  * @vsi: VSI to be configured
>>>  * @ch: ptr to channel (it contains channel specific params)
>>> @@ -6735,13 +6839,726 @@ static int i40e_setup_tc(struct net_device *netdev, void *type_data)
>>> 	return ret;
>>> }
>>>
>>> +/**
>>> + * i40e_set_cld_element - sets cloud filter element data
>>> + * @filter: cloud filter rule
>>> + * @cld: ptr to cloud filter element data
>>> + *
>>> + * This is helper function to copy data into cloud filter element
>>> + **/
>>> +static inline void
>>> +i40e_set_cld_element(struct i40e_cloud_filter *filter,
>>> +		     struct i40e_aqc_cloud_filters_element_data *cld)
>>> +{
>>> +	int i, j;
>>> +	u32 ipa;
>>> +
>>> +	memset(cld, 0, sizeof(*cld));
>>> +	ether_addr_copy(cld->outer_mac, filter->dst_mac);
>>> +	ether_addr_copy(cld->inner_mac, filter->src_mac);
>>> +
>>> +	if (filter->ip_version == IPV6_VERSION) {
>>> +#define IPV6_MAX_INDEX	(ARRAY_SIZE(filter->dst_ipv6) - 1)
>>> +		for (i = 0, j = 0; i < 4; i++, j += 2) {
>>> +			ipa = be32_to_cpu(filter->dst_ipv6[IPV6_MAX_INDEX - i]);
>>> +			ipa = cpu_to_le32(ipa);
>>> +			memcpy(&cld->ipaddr.raw_v6.data[j], &ipa, 4);
>>> +		}
>>> +	} else {
>>> +		ipa = be32_to_cpu(filter->dst_ip);
>>> +		memcpy(&cld->ipaddr.v4.data, &ipa, 4);
>>> +	}
>>> +
>>> +	cld->inner_vlan = cpu_to_le16(ntohs(filter->vlan_id));
>>> +
>>> +	/* tenant_id is not supported by FW now, once the support is enabled
>>> +	 * fill the cld->tenant_id with cpu_to_le32(filter->tenant_id)
>>> +	 */
>>> +	if (filter->tenant_id)
>>> +		return;
>>> +}
>>> +
>>> +/**
>>> + * i40e_add_del_cloud_filter - Add/del cloud filter
>>> + * @vsi: pointer to VSI
>>> + * @filter: cloud filter rule
>>> + * @add: if true, add, if false, delete
>>> + *
>>> + * Add or delete a cloud filter for a specific flow spec.
>>> + * Returns 0 if the filter were successfully added.
>>> + **/
>>> +static int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
>>> +				     struct i40e_cloud_filter *filter, bool add)
>>> +{
>>> +	struct i40e_aqc_cloud_filters_element_data cld_filter;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	int ret;
>>> +	static const u16 flag_table[128] = {
>>> +		[I40E_CLOUD_FILTER_FLAGS_OMAC]  =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_OMAC,
>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC]  =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC,
>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN]  =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_IVLAN,
>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID] =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_TEN_ID,
>>> +		[I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC] =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC,
>>> +		[I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID] =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_IMAC_IVLAN_TEN_ID,
>>> +		[I40E_CLOUD_FILTER_FLAGS_IIP] =
>>> +			I40E_AQC_ADD_CLOUD_FILTER_IIP,
>>> +	};
>>> +
>>> +	if (filter->flags >= ARRAY_SIZE(flag_table))
>>> +		return I40E_ERR_CONFIG;
>>> +
>>> +	/* copy element needed to add cloud filter from filter */
>>> +	i40e_set_cld_element(filter, &cld_filter);
>>> +
>>> +	if (filter->tunnel_type != I40E_CLOUD_TNL_TYPE_NONE)
>>> +		cld_filter.flags = cpu_to_le16(filter->tunnel_type <<
>>> +					     I40E_AQC_ADD_CLOUD_TNL_TYPE_SHIFT);
>>> +
>>> +	if (filter->ip_version == IPV6_VERSION)
>>> +		cld_filter.flags |= cpu_to_le16(flag_table[filter->flags] |
>>> +						I40E_AQC_ADD_CLOUD_FLAGS_IPV6);
>>> +	else
>>> +		cld_filter.flags |= cpu_to_le16(flag_table[filter->flags] |
>>> +						I40E_AQC_ADD_CLOUD_FLAGS_IPV4);
>>> +
>>> +	if (add)
>>> +		ret = i40e_aq_add_cloud_filters(&pf->hw, filter->seid,
>>> +						&cld_filter, 1);
>>> +	else
>>> +		ret = i40e_aq_rem_cloud_filters(&pf->hw, filter->seid,
>>> +						&cld_filter, 1);
>>> +	if (ret)
>>> +		dev_dbg(&pf->pdev->dev,
>>> +			"Failed to %s cloud filter using l4 port %u, err %d aq_err %d\n",
>>> +			add ? "add" : "delete", filter->dst_port, ret,
>>> +			pf->hw.aq.asq_last_status);
>>> +	else
>>> +		dev_info(&pf->pdev->dev,
>>> +			 "%s cloud filter for VSI: %d\n",
>>> +			 add ? "Added" : "Deleted", filter->seid);
>>> +	return ret;
>>> +}
>>> +
>>> +/**
>>> + * i40e_add_del_cloud_filter_big_buf - Add/del cloud filter using big_buf
>>> + * @vsi: pointer to VSI
>>> + * @filter: cloud filter rule
>>> + * @add: if true, add, if false, delete
>>> + *
>>> + * Add or delete a cloud filter for a specific flow spec using big buffer.
>>> + * Returns 0 if the filter were successfully added.
>>> + **/
>>> +static int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
>>> +					     struct i40e_cloud_filter *filter,
>>> +					     bool add)
>>> +{
>>> +	struct i40e_aqc_cloud_filters_element_bb cld_filter;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	int ret;
>>> +
>>> +	/* Both (Outer/Inner) valid mac_addr are not supported */
>>> +	if (is_valid_ether_addr(filter->dst_mac) &&
>>> +	    is_valid_ether_addr(filter->src_mac))
>>> +		return -EINVAL;
>>> +
>>> +	/* Make sure port is specified, otherwise bail out, for channel
>>> +	 * specific cloud filter needs 'L4 port' to be non-zero
>>> +	 */
>>> +	if (!filter->dst_port)
>>> +		return -EINVAL;
>>> +
>>> +	/* adding filter using src_port/src_ip is not supported at this stage */
>>> +	if (filter->src_port || filter->src_ip ||
>>> +	    !ipv6_addr_any((struct in6_addr *)&filter->src_ipv6))
>>> +		return -EINVAL;
>>> +
>>> +	/* copy element needed to add cloud filter from filter */
>>> +	i40e_set_cld_element(filter, &cld_filter.element);
>>> +
>>> +	if (is_valid_ether_addr(filter->dst_mac) ||
>>> +	    is_valid_ether_addr(filter->src_mac) ||
>>> +	    is_multicast_ether_addr(filter->dst_mac) ||
>>> +	    is_multicast_ether_addr(filter->src_mac)) {
>>> +		/* MAC + IP : unsupported mode */
>>> +		if (filter->dst_ip)
>>> +			return -EINVAL;
>>> +
>>> +		/* since we validated that L4 port must be valid before
>>> +		 * we get here, start with respective "flags" value
>>> +		 * and update if vlan is present or not
>>> +		 */
>>> +		cld_filter.element.flags =
>>> +			cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT);
>>> +
>>> +		if (filter->vlan_id) {
>>> +			cld_filter.element.flags =
>>> +			cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT);
>>> +		}
>>> +
>>> +	} else if (filter->dst_ip || filter->ip_version == IPV6_VERSION) {
>>> +		cld_filter.element.flags =
>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FILTER_IP_PORT);
>>> +		if (filter->ip_version == IPV6_VERSION)
>>> +			cld_filter.element.flags |=
>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FLAGS_IPV6);
>>> +		else
>>> +			cld_filter.element.flags |=
>>> +				cpu_to_le16(I40E_AQC_ADD_CLOUD_FLAGS_IPV4);
>>> +	} else {
>>> +		dev_err(&pf->pdev->dev,
>>> +			"either mac or ip has to be valid for cloud filter\n");
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	/* Now copy L4 port in Byte 6..7 in general fields */
>>> +	cld_filter.general_fields[I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0] =
>>> +						be16_to_cpu(filter->dst_port);
>>> +
>>> +	if (add) {
>>> +		bool proto_type, port_type;
>>> +
>>> +		proto_type = (filter->ip_proto == IPPROTO_TCP) ? true : false;
>>> +		port_type = (filter->port_type & I40E_CLOUD_FILTER_PORT_DEST) ?
>>> +			     true : false;
>>> +
>>> +		/* For now, src port based cloud filter for channel is not
>>> +		 * supported
>>> +		 */
>>> +		if (!port_type) {
>>> +			dev_err(&pf->pdev->dev,
>>> +				"unsupported port type (src port)\n");
>>> +			return -EOPNOTSUPP;
>>> +		}
>>> +
>>> +		/* Validate current device switch mode, change if necessary */
>>> +		ret = i40e_validate_and_set_switch_mode(vsi, proto_type,
>>> +							port_type);
>>> +		if (ret) {
>>> +			dev_err(&pf->pdev->dev,
>>> +				"failed to set switch mode, ret %d\n",
>>> +				ret);
>>> +			return ret;
>>> +		}
>>> +
>>> +		ret = i40e_aq_add_cloud_filters_bb(&pf->hw, filter->seid,
>>> +						   &cld_filter, 1);
>>> +	} else {
>>> +		ret = i40e_aq_rem_cloud_filters_bb(&pf->hw, filter->seid,
>>> +						   &cld_filter, 1);
>>> +	}
>>> +
>>> +	if (ret)
>>> +		dev_dbg(&pf->pdev->dev,
>>> +			"Failed to %s cloud filter(big buffer) err %d aq_err %d\n",
>>> +			add ? "add" : "delete", ret, pf->hw.aq.asq_last_status);
>>> +	else
>>> +		dev_info(&pf->pdev->dev,
>>> +			 "%s cloud filter for VSI: %d, L4 port: %d\n",
>>> +			 add ? "add" : "delete", filter->seid,
>>> +			 ntohs(filter->dst_port));
>>> +	return ret;
>>> +}
>>> +
>>> +/**
>>> + * i40e_parse_cls_flower - Parse tc flower filters provided by kernel
>>> + * @vsi: Pointer to VSI
>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>> + * @filter: Pointer to cloud filter structure
>>> + *
>>> + **/
>>> +static int i40e_parse_cls_flower(struct i40e_vsi *vsi,
>>> +				 struct tc_cls_flower_offload *f,
>>> +				 struct i40e_cloud_filter *filter)
>>> +{
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	u16 addr_type = 0;
>>> +	u8 field_flags = 0;
>>> +
>>> +	if (f->dissector->used_keys &
>>> +	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_VLAN) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
>>> +	      BIT(FLOW_DISSECTOR_KEY_ENC_KEYID))) {
>>> +		dev_err(&pf->pdev->dev, "Unsupported key used: 0x%x\n",
>>> +			f->dissector->used_keys);
>>> +		return -EOPNOTSUPP;
>>> +	}
>>> +
>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_KEYID)) {
>>> +		struct flow_dissector_key_keyid *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_ENC_KEYID,
>>> +						  f->key);
>>> +
>>> +		struct flow_dissector_key_keyid *mask =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_ENC_KEYID,
>>> +						  f->mask);
>>> +
>>> +		if (mask->keyid != 0)
>>> +			field_flags |= I40E_CLOUD_FIELD_TEN_ID;
>>> +
>>> +		filter->tenant_id = be32_to_cpu(key->keyid);
>>> +	}
>>> +
>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_BASIC)) {
>>> +		struct flow_dissector_key_basic *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_BASIC,
>>> +						  f->key);
>>> +
>>> +		filter->ip_proto = key->ip_proto;
>>> +	}
>>> +
>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>>> +		struct flow_dissector_key_eth_addrs *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>>> +						  f->key);
>>> +
>>> +		struct flow_dissector_key_eth_addrs *mask =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_ETH_ADDRS,
>>> +						  f->mask);
>>> +
>>> +		/* use is_broadcast and is_zero to check for all 0xf or 0 */
>>> +		if (!is_zero_ether_addr(mask->dst)) {
>>> +			if (is_broadcast_ether_addr(mask->dst)) {
>>> +				field_flags |= I40E_CLOUD_FIELD_OMAC;
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad ether dest mask %pM\n",
>>> +					mask->dst);
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +
>>> +		if (!is_zero_ether_addr(mask->src)) {
>>> +			if (is_broadcast_ether_addr(mask->src)) {
>>> +				field_flags |= I40E_CLOUD_FIELD_IMAC;
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad ether src mask %pM\n",
>>> +					mask->src);
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +		ether_addr_copy(filter->dst_mac, key->dst);
>>> +		ether_addr_copy(filter->src_mac, key->src);
>>> +	}
>>> +
>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_VLAN)) {
>>> +		struct flow_dissector_key_vlan *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_VLAN,
>>> +						  f->key);
>>> +		struct flow_dissector_key_vlan *mask =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_VLAN,
>>> +						  f->mask);
>>> +
>>> +		if (mask->vlan_id) {
>>> +			if (mask->vlan_id == VLAN_VID_MASK) {
>>> +				field_flags |= I40E_CLOUD_FIELD_IVLAN;
>>> +
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad vlan mask 0x%04x\n",
>>> +					mask->vlan_id);
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +
>>> +		filter->vlan_id = cpu_to_be16(key->vlan_id);
>>> +	}
>>> +
>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_CONTROL)) {
>>> +		struct flow_dissector_key_control *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_CONTROL,
>>> +						  f->key);
>>> +
>>> +		addr_type = key->addr_type;
>>> +	}
>>> +
>>> +	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
>>> +		struct flow_dissector_key_ipv4_addrs *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
>>> +						  f->key);
>>> +		struct flow_dissector_key_ipv4_addrs *mask =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_IPV4_ADDRS,
>>> +						  f->mask);
>>> +
>>> +		if (mask->dst) {
>>> +			if (mask->dst == cpu_to_be32(0xffffffff)) {
>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad ip dst mask 0x%08x\n",
>>> +					be32_to_cpu(mask->dst));
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +
>>> +		if (mask->src) {
>>> +			if (mask->src == cpu_to_be32(0xffffffff)) {
>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad ip src mask 0x%08x\n",
>>> +					be32_to_cpu(mask->dst));
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +
>>> +		if (field_flags & I40E_CLOUD_FIELD_TEN_ID) {
>>> +			dev_err(&pf->pdev->dev, "Tenant id not allowed for ip filter\n");
>>> +			return I40E_ERR_CONFIG;
>>> +		}
>>> +		filter->dst_ip = key->dst;
>>> +		filter->src_ip = key->src;
>>> +		filter->ip_version = IPV4_VERSION;
>>> +	}
>>> +
>>> +	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
>>> +		struct flow_dissector_key_ipv6_addrs *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
>>> +						  f->key);
>>> +		struct flow_dissector_key_ipv6_addrs *mask =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_IPV6_ADDRS,
>>> +						  f->mask);
>>> +
>>> +		/* src and dest IPV6 address should not be LOOPBACK
>>> +		 * (0:0:0:0:0:0:0:1), which can be represented as ::1
>>> +		 */
>>> +		if (ipv6_addr_loopback(&key->dst) ||
>>> +		    ipv6_addr_loopback(&key->src)) {
>>> +			dev_err(&pf->pdev->dev,
>>> +				"Bad ipv6, addr is LOOPBACK\n");
>>> +			return I40E_ERR_CONFIG;
>>> +		}
>>> +		if (!ipv6_addr_any(&mask->dst) || !ipv6_addr_any(&mask->src))
>>> +			field_flags |= I40E_CLOUD_FIELD_IIP;
>>> +
>>> +		memcpy(&filter->src_ipv6, &key->src.s6_addr32,
>>> +		       sizeof(filter->src_ipv6));
>>> +		memcpy(&filter->dst_ipv6, &key->dst.s6_addr32,
>>> +		       sizeof(filter->dst_ipv6));
>>> +
>>> +		/* mark it as IPv6 filter, to be used later */
>>> +		filter->ip_version = IPV6_VERSION;
>>> +	}
>>> +
>>> +	if (dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_PORTS)) {
>>> +		struct flow_dissector_key_ports *key =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_PORTS,
>>> +						  f->key);
>>> +		struct flow_dissector_key_ports *mask =
>>> +			skb_flow_dissector_target(f->dissector,
>>> +						  FLOW_DISSECTOR_KEY_PORTS,
>>> +						  f->mask);
>>> +
>>> +		if (mask->src) {
>>> +			if (mask->src == cpu_to_be16(0xffff)) {
>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad src port mask 0x%04x\n",
>>> +					be16_to_cpu(mask->src));
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +
>>> +		if (mask->dst) {
>>> +			if (mask->dst == cpu_to_be16(0xffff)) {
>>> +				field_flags |= I40E_CLOUD_FIELD_IIP;
>>> +			} else {
>>> +				dev_err(&pf->pdev->dev, "Bad dst port mask 0x%04x\n",
>>> +					be16_to_cpu(mask->dst));
>>> +				return I40E_ERR_CONFIG;
>>> +			}
>>> +		}
>>> +
>>> +		filter->dst_port = key->dst;
>>> +		filter->src_port = key->src;
>>> +
>>> +		/* For now, only supports destination port*/
>>> +		filter->port_type |= I40E_CLOUD_FILTER_PORT_DEST;
>>> +
>>> +		switch (filter->ip_proto) {
>>> +		case IPPROTO_TCP:
>>> +		case IPPROTO_UDP:
>>> +			break;
>>> +		default:
>>> +			dev_err(&pf->pdev->dev,
>>> +				"Only UDP and TCP transport are supported\n");
>>> +			return -EINVAL;
>>> +		}
>>> +	}
>>> +	filter->flags = field_flags;
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>> + * i40e_handle_redirect_action: Forward to a traffic class on the device
>>> + * @vsi: Pointer to VSI
>>> + * @ifindex: ifindex of the device to forwared to
>>> + * @tc: traffic class index on the device
>>> + * @filter: Pointer to cloud filter structure
>>> + *
>>> + **/
>>> +static int i40e_handle_redirect_action(struct i40e_vsi *vsi, int ifindex, u8 tc,
>>> +				       struct i40e_cloud_filter *filter)
>>> +{
>>> +	struct i40e_channel *ch, *ch_tmp;
>>> +
>>> +	/* redirect to a traffic class on the same device */
>>> +	if (vsi->netdev->ifindex == ifindex) {
>>> +		if (tc == 0) {
>>> +			filter->seid = vsi->seid;
>>> +			return 0;
>>> +		} else if (vsi->tc_config.enabled_tc & BIT(tc)) {
>>> +			if (!filter->dst_port) {
>>> +				dev_err(&vsi->back->pdev->dev,
>>> +					"Specify destination port to redirect to traffic class that is not default\n");
>>> +				return -EINVAL;
>>> +			}
>>> +			if (list_empty(&vsi->ch_list))
>>> +				return -EINVAL;
>>> +			list_for_each_entry_safe(ch, ch_tmp, &vsi->ch_list,
>>> +						 list) {
>>> +				if (ch->seid == vsi->tc_seid_map[tc])
>>> +					filter->seid = ch->seid;
>>> +			}
>>> +			return 0;
>>> +		}
>>> +	}
>>> +	return -EINVAL;
>>> +}
>>> +
>>> +/**
>>> + * i40e_parse_tc_actions - Parse tc actions
>>> + * @vsi: Pointer to VSI
>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>> + * @filter: Pointer to cloud filter structure
>>> + *
>>> + **/
>>> +static int i40e_parse_tc_actions(struct i40e_vsi *vsi, struct tcf_exts *exts,
>>> +				 struct i40e_cloud_filter *filter)
>>> +{
>>> +	const struct tc_action *a;
>>> +	LIST_HEAD(actions);
>>> +	int err;
>>> +
>>> +	if (!tcf_exts_has_actions(exts))
>>> +		return -EINVAL;
>>> +
>>> +	tcf_exts_to_list(exts, &actions);
>>> +	list_for_each_entry(a, &actions, list) {
>>> +		/* Drop action */
>>> +		if (is_tcf_gact_shot(a)) {
>>> +			dev_err(&vsi->back->pdev->dev,
>>> +				"Cloud filters do not support the drop action.\n");
>>> +			return -EOPNOTSUPP;
>>> +		}
>>> +
>>> +		/* Redirect to a traffic class on the same device */
>>> +		if (!is_tcf_mirred_egress_redirect(a) && is_tcf_mirred_tc(a)) {
>>> +			int ifindex = tcf_mirred_ifindex(a);
>>> +			u8 tc = tcf_mirred_tc(a);
>>> +
>>> +			err = i40e_handle_redirect_action(vsi, ifindex, tc,
>>> +							  filter);
>>> +			if (err == 0)
>>> +				return err;
>>> +		}
>>> +	}
>>> +	return -EINVAL;
>>> +}
>>> +
>>> +/**
>>> + * i40e_configure_clsflower - Configure tc flower filters
>>> + * @vsi: Pointer to VSI
>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>> + *
>>> + **/
>>> +static int i40e_configure_clsflower(struct i40e_vsi *vsi,
>>> +				    struct tc_cls_flower_offload *cls_flower)
>>> +{
>>> +	struct i40e_cloud_filter *filter = NULL;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	int err = 0;
>>> +
>>> +	if (test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state) ||
>>> +	    test_bit(__I40E_RESET_INTR_RECEIVED, pf->state))
>>> +		return -EBUSY;
>>> +
>>> +	if (pf->fdir_pf_active_filters ||
>>> +	    (!hlist_empty(&pf->fdir_filter_list))) {
>>> +		dev_err(&vsi->back->pdev->dev,
>>> +			"Flow Director Sideband filters exists, turn ntuple off to configure cloud filters\n");
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	if (vsi->back->flags & I40E_FLAG_FD_SB_ENABLED) {
>>> +		dev_err(&vsi->back->pdev->dev,
>>> +			"Disable Flow Director Sideband, configuring Cloud filters via tc-flower\n");
>>> +		vsi->back->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>> +		vsi->back->flags |= I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>> +	}
>>> +
>>> +	filter = kzalloc(sizeof(*filter), GFP_KERNEL);
>>> +	if (!filter)
>>> +		return -ENOMEM;
>>> +
>>> +	filter->cookie = cls_flower->cookie;
>>> +
>>> +	err = i40e_parse_cls_flower(vsi, cls_flower, filter);
>>> +	if (err < 0)
>>> +		goto err;
>>> +
>>> +	err = i40e_parse_tc_actions(vsi, cls_flower->exts, filter);
>>> +	if (err < 0)
>>> +		goto err;
>>> +
>>> +	/* Add cloud filter */
>>> +	if (filter->dst_port)
>>> +		err = i40e_add_del_cloud_filter_big_buf(vsi, filter, true);
>>> +	else
>>> +		err = i40e_add_del_cloud_filter(vsi, filter, true);
>>> +
>>> +	if (err) {
>>> +		dev_err(&pf->pdev->dev,
>>> +			"Failed to add cloud filter, err %s\n",
>>> +			i40e_stat_str(&pf->hw, err));
>>> +		err = i40e_aq_rc_to_posix(err, pf->hw.aq.asq_last_status);
>>> +		goto err;
>>> +	}
>>> +
>>> +	/* add filter to the ordered list */
>>> +	INIT_HLIST_NODE(&filter->cloud_node);
>>> +
>>> +	hlist_add_head(&filter->cloud_node, &pf->cloud_filter_list);
>>> +
>>> +	pf->num_cloud_filters++;
>>> +
>>> +	return err;
>>> +err:
>>> +	kfree(filter);
>>> +	return err;
>>> +}
>>> +
>>> +/**
>>> + * i40e_find_cloud_filter - Find the could filter in the list
>>> + * @vsi: Pointer to VSI
>>> + * @cookie: filter specific cookie
>>> + *
>>> + **/
>>> +static struct i40e_cloud_filter *i40e_find_cloud_filter(struct i40e_vsi *vsi,
>>> +							unsigned long *cookie)
>>> +{
>>> +	struct i40e_cloud_filter *filter = NULL;
>>> +	struct hlist_node *node2;
>>> +
>>> +	hlist_for_each_entry_safe(filter, node2,
>>> +				  &vsi->back->cloud_filter_list, cloud_node)
>>> +		if (!memcmp(cookie, &filter->cookie, sizeof(filter->cookie)))
>>> +			return filter;
>>> +	return NULL;
>>> +}
>>> +
>>> +/**
>>> + * i40e_delete_clsflower - Remove tc flower filters
>>> + * @vsi: Pointer to VSI
>>> + * @cls_flower: Pointer to struct tc_cls_flower_offload
>>> + *
>>> + **/
>>> +static int i40e_delete_clsflower(struct i40e_vsi *vsi,
>>> +				 struct tc_cls_flower_offload *cls_flower)
>>> +{
>>> +	struct i40e_cloud_filter *filter = NULL;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	int err = 0;
>>> +
>>> +	filter = i40e_find_cloud_filter(vsi, &cls_flower->cookie);
>>> +
>>> +	if (!filter)
>>> +		return -EINVAL;
>>> +
>>> +	hash_del(&filter->cloud_node);
>>> +
>>> +	if (filter->dst_port)
>>> +		err = i40e_add_del_cloud_filter_big_buf(vsi, filter, false);
>>> +	else
>>> +		err = i40e_add_del_cloud_filter(vsi, filter, false);
>>> +	if (err) {
>>> +		kfree(filter);
>>> +		dev_err(&pf->pdev->dev,
>>> +			"Failed to delete cloud filter, err %s\n",
>>> +			i40e_stat_str(&pf->hw, err));
>>> +		return i40e_aq_rc_to_posix(err, pf->hw.aq.asq_last_status);
>>> +	}
>>> +
>>> +	kfree(filter);
>>> +	pf->num_cloud_filters--;
>>> +
>>> +	if (!pf->num_cloud_filters)
>>> +		if ((pf->flags & I40E_FLAG_FD_SB_TO_CLOUD_FILTER) &&
>>> +		    !(pf->flags & I40E_FLAG_FD_SB_INACTIVE)) {
>>> +			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>> +			pf->flags &= ~I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>> +			pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>> +		}
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>> + * i40e_setup_tc_cls_flower - flower classifier offloads
>>> + * @netdev: net device to configure
>>> + * @type_data: offload data
>>> + **/
>>> +static int i40e_setup_tc_cls_flower(struct net_device *netdev,
>>> +				    struct tc_cls_flower_offload *cls_flower)
>>> +{
>>> +	struct i40e_netdev_priv *np = netdev_priv(netdev);
>>> +	struct i40e_vsi *vsi = np->vsi;
>>> +
>>> +	if (!is_classid_clsact_ingress(cls_flower->common.classid) ||
>>> +	    cls_flower->common.chain_index)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	switch (cls_flower->command) {
>>> +	case TC_CLSFLOWER_REPLACE:
>>> +		return i40e_configure_clsflower(vsi, cls_flower);
>>> +	case TC_CLSFLOWER_DESTROY:
>>> +		return i40e_delete_clsflower(vsi, cls_flower);
>>> +	case TC_CLSFLOWER_STATS:
>>> +		return -EOPNOTSUPP;
>>> +	default:
>>> +		return -EINVAL;
>>> +	}
>>> +}
>>> +
>>> static int __i40e_setup_tc(struct net_device *netdev, enum tc_setup_type type,
>>> 			   void *type_data)
>>> {
>>> -	if (type != TC_SETUP_MQPRIO)
>>> +	switch (type) {
>>> +	case TC_SETUP_MQPRIO:
>>> +		return i40e_setup_tc(netdev, type_data);
>>> +	case TC_SETUP_CLSFLOWER:
>>> +		return i40e_setup_tc_cls_flower(netdev, type_data);
>>> +	default:
>>> 		return -EOPNOTSUPP;
>>> -
>>> -	return i40e_setup_tc(netdev, type_data);
>>> +	}
>>> }
>>>
>>> /**
>>> @@ -6939,6 +7756,13 @@ static void i40e_cloud_filter_exit(struct i40e_pf *pf)
>>> 		kfree(cfilter);
>>> 	}
>>> 	pf->num_cloud_filters = 0;
>>> +
>>> +	if ((pf->flags & I40E_FLAG_FD_SB_TO_CLOUD_FILTER) &&
>>> +	    !(pf->flags & I40E_FLAG_FD_SB_INACTIVE)) {
>>> +		pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>> +		pf->flags &= ~I40E_FLAG_FD_SB_TO_CLOUD_FILTER;
>>> +		pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>> +	}
>>> }
>>>
>>> /**
>>> @@ -8046,7 +8870,8 @@ static int i40e_reconstitute_veb(struct i40e_veb *veb)
>>>  * i40e_get_capabilities - get info about the HW
>>>  * @pf: the PF struct
>>>  **/
>>> -static int i40e_get_capabilities(struct i40e_pf *pf)
>>> +static int i40e_get_capabilities(struct i40e_pf *pf,
>>> +				 enum i40e_admin_queue_opc list_type)
>>> {
>>> 	struct i40e_aqc_list_capabilities_element_resp *cap_buf;
>>> 	u16 data_size;
>>> @@ -8061,9 +8886,8 @@ static int i40e_get_capabilities(struct i40e_pf *pf)
>>>
>>> 		/* this loads the data into the hw struct for us */
>>> 		err = i40e_aq_discover_capabilities(&pf->hw, cap_buf, buf_len,
>>> -					    &data_size,
>>> -					    i40e_aqc_opc_list_func_capabilities,
>>> -					    NULL);
>>> +						    &data_size, list_type,
>>> +						    NULL);
>>> 		/* data loaded, buffer no longer needed */
>>> 		kfree(cap_buf);
>>>
>>> @@ -8080,26 +8904,44 @@ static int i40e_get_capabilities(struct i40e_pf *pf)
>>> 		}
>>> 	} while (err);
>>>
>>> -	if (pf->hw.debug_mask & I40E_DEBUG_USER)
>>> -		dev_info(&pf->pdev->dev,
>>> -			 "pf=%d, num_vfs=%d, msix_pf=%d, msix_vf=%d, fd_g=%d, fd_b=%d, pf_max_q=%d num_vsi=%d\n",
>>> -			 pf->hw.pf_id, pf->hw.func_caps.num_vfs,
>>> -			 pf->hw.func_caps.num_msix_vectors,
>>> -			 pf->hw.func_caps.num_msix_vectors_vf,
>>> -			 pf->hw.func_caps.fd_filters_guaranteed,
>>> -			 pf->hw.func_caps.fd_filters_best_effort,
>>> -			 pf->hw.func_caps.num_tx_qp,
>>> -			 pf->hw.func_caps.num_vsis);
>>> -
>>> +	if (pf->hw.debug_mask & I40E_DEBUG_USER) {
>>> +		if (list_type == i40e_aqc_opc_list_func_capabilities) {
>>> +			dev_info(&pf->pdev->dev,
>>> +				 "pf=%d, num_vfs=%d, msix_pf=%d, msix_vf=%d, fd_g=%d, fd_b=%d, pf_max_q=%d num_vsi=%d\n",
>>> +				 pf->hw.pf_id, pf->hw.func_caps.num_vfs,
>>> +				 pf->hw.func_caps.num_msix_vectors,
>>> +				 pf->hw.func_caps.num_msix_vectors_vf,
>>> +				 pf->hw.func_caps.fd_filters_guaranteed,
>>> +				 pf->hw.func_caps.fd_filters_best_effort,
>>> +				 pf->hw.func_caps.num_tx_qp,
>>> +				 pf->hw.func_caps.num_vsis);
>>> +		} else if (list_type == i40e_aqc_opc_list_dev_capabilities) {
>>> +			dev_info(&pf->pdev->dev,
>>> +				 "switch_mode=0x%04x, function_valid=0x%08x\n",
>>> +				 pf->hw.dev_caps.switch_mode,
>>> +				 pf->hw.dev_caps.valid_functions);
>>> +			dev_info(&pf->pdev->dev,
>>> +				 "SR-IOV=%d, num_vfs for all function=%u\n",
>>> +				 pf->hw.dev_caps.sr_iov_1_1,
>>> +				 pf->hw.dev_caps.num_vfs);
>>> +			dev_info(&pf->pdev->dev,
>>> +				 "num_vsis=%u, num_rx:%u, num_tx=%u\n",
>>> +				 pf->hw.dev_caps.num_vsis,
>>> +				 pf->hw.dev_caps.num_rx_qp,
>>> +				 pf->hw.dev_caps.num_tx_qp);
>>> +		}
>>> +	}
>>> +	if (list_type == i40e_aqc_opc_list_func_capabilities) {
>>> #define DEF_NUM_VSI (1 + (pf->hw.func_caps.fcoe ? 1 : 0) \
>>> 		       + pf->hw.func_caps.num_vfs)
>>> -	if (pf->hw.revision_id == 0 && (DEF_NUM_VSI > pf->hw.func_caps.num_vsis)) {
>>> -		dev_info(&pf->pdev->dev,
>>> -			 "got num_vsis %d, setting num_vsis to %d\n",
>>> -			 pf->hw.func_caps.num_vsis, DEF_NUM_VSI);
>>> -		pf->hw.func_caps.num_vsis = DEF_NUM_VSI;
>>> +		if (pf->hw.revision_id == 0 &&
>>> +		    (pf->hw.func_caps.num_vsis < DEF_NUM_VSI)) {
>>> +			dev_info(&pf->pdev->dev,
>>> +				 "got num_vsis %d, setting num_vsis to %d\n",
>>> +				 pf->hw.func_caps.num_vsis, DEF_NUM_VSI);
>>> +			pf->hw.func_caps.num_vsis = DEF_NUM_VSI;
>>> +		}
>>> 	}
>>> -
>>> 	return 0;
>>> }
>>>
>>> @@ -8141,6 +8983,7 @@ static void i40e_fdir_sb_setup(struct i40e_pf *pf)
>>> 		if (!vsi) {
>>> 			dev_info(&pf->pdev->dev, "Couldn't create FDir VSI\n");
>>> 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>> 			return;
>>> 		}
>>> 	}
>>> @@ -8163,6 +9006,48 @@ static void i40e_fdir_teardown(struct i40e_pf *pf)
>>> }
>>>
>>> /**
>>> + * i40e_rebuild_cloud_filters - Rebuilds cloud filters for VSIs
>>> + * @vsi: PF main vsi
>>> + * @seid: seid of main or channel VSIs
>>> + *
>>> + * Rebuilds cloud filters associated with main VSI and channel VSIs if they
>>> + * existed before reset
>>> + **/
>>> +static int i40e_rebuild_cloud_filters(struct i40e_vsi *vsi, u16 seid)
>>> +{
>>> +	struct i40e_cloud_filter *cfilter;
>>> +	struct i40e_pf *pf = vsi->back;
>>> +	struct hlist_node *node;
>>> +	i40e_status ret;
>>> +
>>> +	/* Add cloud filters back if they exist */
>>> +	if (hlist_empty(&pf->cloud_filter_list))
>>> +		return 0;
>>> +
>>> +	hlist_for_each_entry_safe(cfilter, node, &pf->cloud_filter_list,
>>> +				  cloud_node) {
>>> +		if (cfilter->seid != seid)
>>> +			continue;
>>> +
>>> +		if (cfilter->dst_port)
>>> +			ret = i40e_add_del_cloud_filter_big_buf(vsi, cfilter,
>>> +								true);
>>> +		else
>>> +			ret = i40e_add_del_cloud_filter(vsi, cfilter, true);
>>> +
>>> +		if (ret) {
>>> +			dev_dbg(&pf->pdev->dev,
>>> +				"Failed to rebuild cloud filter, err %s aq_err %s\n",
>>> +				i40e_stat_str(&pf->hw, ret),
>>> +				i40e_aq_str(&pf->hw,
>>> +					    pf->hw.aq.asq_last_status));
>>> +			return ret;
>>> +		}
>>> +	}
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>>  * i40e_rebuild_channels - Rebuilds channel VSIs if they existed before reset
>>>  * @vsi: PF main vsi
>>>  *
>>> @@ -8199,6 +9084,13 @@ static int i40e_rebuild_channels(struct i40e_vsi *vsi)
>>> 						I40E_BW_CREDIT_DIVISOR,
>>> 				ch->seid);
>>> 		}
>>> +		ret = i40e_rebuild_cloud_filters(vsi, ch->seid);
>>> +		if (ret) {
>>> +			dev_dbg(&vsi->back->pdev->dev,
>>> +				"Failed to rebuild cloud filters for channel VSI %u\n",
>>> +				ch->seid);
>>> +			return ret;
>>> +		}
>>> 	}
>>> 	return 0;
>>> }
>>> @@ -8365,7 +9257,7 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
>>> 		i40e_verify_eeprom(pf);
>>>
>>> 	i40e_clear_pxe_mode(hw);
>>> -	ret = i40e_get_capabilities(pf);
>>> +	ret = i40e_get_capabilities(pf, i40e_aqc_opc_list_func_capabilities);
>>> 	if (ret)
>>> 		goto end_core_reset;
>>>
>>> @@ -8482,6 +9374,10 @@ static void i40e_rebuild(struct i40e_pf *pf, bool reinit, bool lock_acquired)
>>> 			goto end_unlock;
>>> 	}
>>>
>>> +	ret = i40e_rebuild_cloud_filters(vsi, vsi->seid);
>>> +	if (ret)
>>> +		goto end_unlock;
>>> +
>>> 	/* PF Main VSI is rebuild by now, go ahead and rebuild channel VSIs
>>> 	 * for this main VSI if they exist
>>> 	 */
>>> @@ -9404,6 +10300,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
>>> 	    (pf->num_fdsb_msix == 0)) {
>>> 		dev_info(&pf->pdev->dev, "Sideband Flowdir disabled, not enough MSI-X vectors\n");
>>> 		pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>> 	}
>>> 	if ((pf->flags & I40E_FLAG_VMDQ_ENABLED) &&
>>> 	    (pf->num_vmdq_msix == 0)) {
>>> @@ -9521,6 +10418,7 @@ static int i40e_init_interrupt_scheme(struct i40e_pf *pf)
>>> 				       I40E_FLAG_FD_SB_ENABLED	|
>>> 				       I40E_FLAG_FD_ATR_ENABLED	|
>>> 				       I40E_FLAG_VMDQ_ENABLED);
>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>>
>>> 			/* rework the queue expectations without MSIX */
>>> 			i40e_determine_queue_usage(pf);
>>> @@ -10263,9 +11161,13 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
>>> 		/* Enable filters and mark for reset */
>>> 		if (!(pf->flags & I40E_FLAG_FD_SB_ENABLED))
>>> 			need_reset = true;
>>> -		/* enable FD_SB only if there is MSI-X vector */
>>> -		if (pf->num_fdsb_msix > 0)
>>> +		/* enable FD_SB only if there is MSI-X vector and no cloud
>>> +		 * filters exist
>>> +		 */
>>> +		if (pf->num_fdsb_msix > 0 && !pf->num_cloud_filters) {
>>> 			pf->flags |= I40E_FLAG_FD_SB_ENABLED;
>>> +			pf->flags &= ~I40E_FLAG_FD_SB_INACTIVE;
>>> +		}
>>> 	} else {
>>> 		/* turn off filters, mark for reset and clear SW filter list */
>>> 		if (pf->flags & I40E_FLAG_FD_SB_ENABLED) {
>>> @@ -10274,6 +11176,8 @@ bool i40e_set_ntuple(struct i40e_pf *pf, netdev_features_t features)
>>> 		}
>>> 		pf->flags &= ~(I40E_FLAG_FD_SB_ENABLED |
>>> 			       I40E_FLAG_FD_SB_AUTO_DISABLED);
>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>> +
>>> 		/* reset fd counters */
>>> 		pf->fd_add_err = 0;
>>> 		pf->fd_atr_cnt = 0;
>>> @@ -10857,7 +11761,8 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
>>> 		netdev->hw_features |= NETIF_F_NTUPLE;
>>> 	hw_features = hw_enc_features		|
>>> 		      NETIF_F_HW_VLAN_CTAG_TX	|
>>> -		      NETIF_F_HW_VLAN_CTAG_RX;
>>> +		      NETIF_F_HW_VLAN_CTAG_RX	|
>>> +		      NETIF_F_HW_TC;
>>>
>>> 	netdev->hw_features |= hw_features;
>>>
>>> @@ -12159,8 +13064,10 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
>>> 	*/
>>>
>>> 	if ((pf->hw.pf_id == 0) &&
>>> -	    !(pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT))
>>> +	    !(pf->flags & I40E_FLAG_TRUE_PROMISC_SUPPORT)) {
>>> 		flags = I40E_AQ_SET_SWITCH_CFG_PROMISC;
>>> +		pf->last_sw_conf_flags = flags;
>>> +	}
>>>
>>> 	if (pf->hw.pf_id == 0) {
>>> 		u16 valid_flags;
>>> @@ -12176,6 +13083,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, bool reinit)
>>> 					     pf->hw.aq.asq_last_status));
>>> 			/* not a fatal problem, just keep going */
>>> 		}
>>> +		pf->last_sw_conf_valid_flags = valid_flags;
>>> 	}
>>>
>>> 	/* first time setup */
>>> @@ -12273,6 +13181,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>> 			       I40E_FLAG_DCB_ENABLED	|
>>> 			       I40E_FLAG_SRIOV_ENABLED	|
>>> 			       I40E_FLAG_VMDQ_ENABLED);
>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>> 	} else if (!(pf->flags & (I40E_FLAG_RSS_ENABLED |
>>> 				  I40E_FLAG_FD_SB_ENABLED |
>>> 				  I40E_FLAG_FD_ATR_ENABLED |
>>> @@ -12287,6 +13196,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>> 			       I40E_FLAG_FD_ATR_ENABLED	|
>>> 			       I40E_FLAG_DCB_ENABLED	|
>>> 			       I40E_FLAG_VMDQ_ENABLED);
>>> +		pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>> 	} else {
>>> 		/* Not enough queues for all TCs */
>>> 		if ((pf->flags & I40E_FLAG_DCB_CAPABLE) &&
>>> @@ -12310,6 +13220,7 @@ static void i40e_determine_queue_usage(struct i40e_pf *pf)
>>> 			queues_left -= 1; /* save 1 queue for FD */
>>> 		} else {
>>> 			pf->flags &= ~I40E_FLAG_FD_SB_ENABLED;
>>> +			pf->flags |= I40E_FLAG_FD_SB_INACTIVE;
>>> 			dev_info(&pf->pdev->dev, "not enough queues for Flow Director. Flow Director feature is disabled\n");
>>> 		}
>>> 	}
>>> @@ -12613,7 +13524,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>>> 		dev_warn(&pdev->dev, "This device is a pre-production adapter/LOM. Please be aware there may be issues with your hardware. If you are experiencing problems please contact your Intel or hardware representative who provided you with this hardware.\n");
>>>
>>> 	i40e_clear_pxe_mode(hw);
>>> -	err = i40e_get_capabilities(pf);
>>> +	err = i40e_get_capabilities(pf, i40e_aqc_opc_list_func_capabilities);
>>> 	if (err)
>>> 		goto err_adminq_setup;
>>>
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>> index 92869f5..3bb6659 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
>>> @@ -283,6 +283,22 @@ i40e_status i40e_aq_query_switch_comp_bw_config(struct i40e_hw *hw,
>>> 		struct i40e_asq_cmd_details *cmd_details);
>>> i40e_status i40e_aq_resume_port_tx(struct i40e_hw *hw,
>>> 				   struct i40e_asq_cmd_details *cmd_details);
>>> +i40e_status
>>> +i40e_aq_add_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>> +			     u8 filter_count);
>>> +enum i40e_status_code
>>> +i40e_aq_add_cloud_filters(struct i40e_hw *hw, u16 vsi,
>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>> +			  u8 filter_count);
>>> +enum i40e_status_code
>>> +i40e_aq_rem_cloud_filters(struct i40e_hw *hw, u16 vsi,
>>> +			  struct i40e_aqc_cloud_filters_element_data *filters,
>>> +			  u8 filter_count);
>>> +i40e_status
>>> +i40e_aq_rem_cloud_filters_bb(struct i40e_hw *hw, u16 seid,
>>> +			     struct i40e_aqc_cloud_filters_element_bb *filters,
>>> +			     u8 filter_count);
>>> i40e_status i40e_read_lldp_cfg(struct i40e_hw *hw,
>>> 			       struct i40e_lldp_variables *lldp_cfg);
>>> /* i40e_common */
>>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h b/drivers/net/ethernet/intel/i40e/i40e_type.h
>>> index c019f46..af38881 100644
>>> --- a/drivers/net/ethernet/intel/i40e/i40e_type.h
>>> +++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
>>> @@ -287,6 +287,7 @@ struct i40e_hw_capabilities {
>>> #define I40E_NVM_IMAGE_TYPE_MODE1	0x6
>>> #define I40E_NVM_IMAGE_TYPE_MODE2	0x7
>>> #define I40E_NVM_IMAGE_TYPE_MODE3	0x8
>>> +#define I40E_SWITCH_MODE_MASK		0xF
>>>
>>> 	u32  management_mode;
>>> 	u32  mng_protocols_over_mctp;
>>> diff --git a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>> index b8c78bf..4fe27f0 100644
>>> --- a/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>> +++ b/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h
>>> @@ -1360,6 +1360,9 @@ struct i40e_aqc_cloud_filters_element_data {
>>> 		struct {
>>> 			u8 data[16];
>>> 		} v6;
>>> +		struct {
>>> +			__le16 data[8];
>>> +		} raw_v6;
>>> 	} ipaddr;
>>> 	__le16	flags;
>>> #define I40E_AQC_ADD_CLOUD_FILTER_SHIFT			0
>>>

^ permalink raw reply

* Re: [PATCH 2/4] ravb: Add optional PHY reset during system resume
From: Florian Fainelli @ 2017-09-28 19:21 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Geert Uytterhoeven, David S . Miller, Simon Horman, Magnus Damm,
	Sergei Shtylyov, Andrew Lunn, Niklas Söderlund,
	netdev@vger.kernel.org, Linux-Renesas, devicetree@vger.kernel.org
In-Reply-To: <CAMuHMdWOAnT3xONPfU8pJi9fbAgtWL2GyRbooAxrfGDb=bsB_A@mail.gmail.com>

On 09/28/2017 11:45 AM, Geert Uytterhoeven wrote:
> Hi Florian,
> 
> On Thu, Sep 28, 2017 at 7:22 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
>> On 09/28/2017 08:53 AM, Geert Uytterhoeven wrote:
>>> If the optional "reset-gpios" property is specified in DT, the generic
>>> MDIO bus code takes care of resetting the PHY during device probe.
>>> However, the PHY may still have to be reset explicitly after system
>>> resume.
>>>
>>> This allows to restore Ethernet operation after resume from s2ram on
>>> Salvator-XS, where the enable pin of the regulator providing PHY power
>>> is connected to PRESETn, and PSCI suspend powers down the SoC.
>>>
>>> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
>>> ---
>>>  drivers/net/ethernet/renesas/ravb_main.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index fdf30bfa403bf416..96d1d48e302f8c9a 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>> @@ -19,6 +19,7 @@
>>>  #include <linux/etherdevice.h>
>>>  #include <linux/ethtool.h>
>>>  #include <linux/if_vlan.h>
>>> +#include <linux/gpio/consumer.h>
>>>  #include <linux/kernel.h>
>>>  #include <linux/list.h>
>>>  #include <linux/module.h>
>>> @@ -2268,6 +2269,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>>  {
>>>       struct net_device *ndev = dev_get_drvdata(dev);
>>>       struct ravb_private *priv = netdev_priv(ndev);
>>> +     struct mii_bus *bus = priv->mii_bus;
>>>       int ret = 0;
>>>
>>>       if (priv->wol_enabled) {
>>> @@ -2302,6 +2304,13 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>>        * reopen device if it was running before system suspended.
>>>        */
>>>
>>> +     /* PHY reset */
>>> +     if (bus->reset_gpiod) {
>>> +             gpiod_set_value_cansleep(bus->reset_gpiod, 1);
>>> +             udelay(bus->reset_delay_us);
>>> +             gpiod_set_value_cansleep(bus->reset_gpiod, 0);
>>> +     }
>>
>> This is a clever hack, but unfortunately this is also misusing the MDIO
>> bus reset line into a PHY reset line. As commented in patch 3, if this
>> reset line is tied to the PHY, then this should be a PHY property and
> 
> OK.
> 
>> you cannot (ab)use the MDIO bus GPIO reset logic anymore...
> 
> And then I should add reset-gpios support to drivers/net/phy/micrel.c?
> Or is there already generic code to handle per-PHY reset? I couldn't find it.

There is not such a thing unfortunately, but it would presumably be
called within drivers/net/phy/mdio_bus.c during bus->reset() time
because you need the PHY reset to be deasserted before you can
successfully read/write from the PHY, and if you can't read/write from
the PHY, the MDIO bus layer cannot read the PHY ID, and therefore cannot
match a PHY device with its driver, so things don't work.

NB: you could move this entirely to the Micrel PHY driver if you specify
a compatible string that has a the PHY OUI in it, because that bypasses
the need to match the PHY driver with the PHY device, but this may not
be an acceptable solution for non-DT platforms or other platforms where
the PHY can't be determined based on the board DTS.

I was going to suggest writing some sort of generic helper that walks
the list of child nodes from a MDIO bus device node and deassert reset
lines and enables clocks, but there is absolutely nothing generic about
that. Things like which of the reset should come first, and if there are
multiple, in which order, etc.

> 
>> Should not you also try to manage this reset line during ravb_open() to
>> achiever better power savings?
> 
> I don't know. The Micrel KSZ9031RNXVA datasheet doesn't mention if it's
> safe or not to assert reset for a prolonged time.
> 
> Thanks!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH RFC 3/5] Add KSZ8795 switch driver
From: Florian Fainelli @ 2017-09-28 18:45 UTC (permalink / raw)
  To: Pavel Machek, Tristram.Ha
  Cc: andrew, muvarov, nathan.leigh.conrad, vivien.didelot, netdev,
	linux-kernel, Woojung.Huh
In-Reply-To: <20170928184059.GA2825@amd>

On 09/28/2017 11:40 AM, Pavel Machek wrote:
> Hi!
> 
> On Mon 2017-09-18 20:27:13, Tristram.Ha@microchip.com wrote:
>>>> +/**
>>>> + * Some counters do not need to be read too often because they are less
>>> likely
>>>> + * to increase much.
>>>> + */
>>>
>>> What does comment mean? Are you caching statistics, and updating
>>> different values at different rates?
>>>
>>
>> There are 34 counters.  In normal case using generic bus I/O or PCI to read them
>> is very quick, but the switch is mostly accessed using SPI, or even I2C.  As the SPI
>> access is very slow and cannot run in interrupt context I keep worrying reading
>> the MIB counters in a loop for 5 or more ports will prevent other critical hardware
>> access from executing soon enough.  These accesses can be getting 1588 PTP
>> timestamps and opening/closing ports.  (RSTP Conformance Test sends test traffic
>> to port supposed to be closed/opened after receiving specific RSTP
>> BPDU.)
> 
> Hmm. Ok, interesting.
> 
> I wonder how well this is going to work if userspace actively 'does
> something' with the switch.
> 
> It seems to me that even if your statistics code is careful not to do
> 'a lot' of accesses at the same time, userspace can use other parts of
> the driver to do the same, and thus cause same unwanted effects...

A few switches have a MIB snapshot feature that is implemented such that
accessing the snapshot does not hog the remainder of the switch
registers, is this something possible on KSZ switches?

Tangential: net-next is currently open, so now would be a good time to
send a revised version of your patch series to target possibly 4.15 with
an initial implementation. Please fix the cover-letter and patch
threading such that they look like the following:

[PATCH 0/X]
   [PATCH 1/X]
   [PATCH 2/X]
   etc..

Right now this shows up as separate emails/patches and this is very
annoying to follow as a thread.

Thank you
-- 
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox