Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 2/2] flow_dissector: dissect tunnel info
From: Jiri Pirko @ 2017-10-04 19:17 UTC (permalink / raw)
  To: Simon Horman
  Cc: Tom Herbert, David Miller, Jiri Pirko, Jamal Hadi Salim,
	Cong Wang, Linux Kernel Network Developers, oss-drivers
In-Reply-To: <20171004190716.GA22135@netronome.com>

Wed, Oct 04, 2017 at 09:07:17PM CEST, simon.horman@netronome.com wrote:
>On Wed, Oct 04, 2017 at 08:07:15PM +0200, Jiri Pirko wrote:
>> Wed, Oct 04, 2017 at 05:52:54PM CEST, tom@herbertland.com wrote:
>> >On Wed, Oct 4, 2017 at 1:15 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> >> Wed, Oct 04, 2017 at 10:08:57AM CEST, simon.horman@netronome.com wrote:
>> >>>On Tue, Oct 03, 2017 at 11:17:46AM -0700, Tom Herbert wrote:
>> >>>> On Tue, Oct 3, 2017 at 2:40 AM, Simon Horman <simon.horman@netronome.com> wrote:
>> >>>> > On Mon, Oct 02, 2017 at 01:37:55PM -0700, Tom Herbert wrote:
>> >>>> >> On Mon, Oct 2, 2017 at 1:41 AM, Simon Horman <simon.horman@netronome.com> wrote:
>> >>>> >> > Move dissection of tunnel info from the flower classifier to the flow
>> >>>> >> > dissector where all other dissection occurs.  This should not have any
>> >>>> >> > behavioural affect on other users of the flow dissector.
>> >>>> >
>> >>>> > ...
>> >>>>
>> >>>> > I feel that we are circling back the perennial issue of flower using the
>> >>>> > flow dissector in a somewhat broader/different way than many/all other
>> >>>> > users of the flow dissector.
>> >>>> >
>> >>>> Simon,
>> >>>>
>> >>>> It's more like __skb_flow_dissect is already an incredibly complex
>> >>>> function and because of that it's difficult to maintain. We need to
>> >>>> measure changes against that fact. For this patch, there is precisely
>> >>>> one user (cls_flower.c) and it's not at all clear to me if there will
>> >>>> be ever any more (e.g. for hashing we don't need tunnel info). IMO, it
>> >>>> should be just as easy and less convolution for everyone to have
>> >>>> flower call __skb_flow_dissect_tunnel_info directly and not call if
>> >>>> from __skb_flow_dissect.
>> >>>
>> >>>Hi Tom,
>> >>>
>> >>>my original suggestion was just that, but Jiri indicated a strong preference
>> >>>for the approach taken by this patch. I think we need to widen the
>> >>>participants in this discussion.
>> >>
>> >> I like the __skb_flow_dissect to be the function to call and it will do
>> >> the job according to the configuration. I don't like to split in
>> >> multiple calls.
>> >
>> >Those are not technical arguments. As I already mentioned, I don't
>> >like it when we add stuff for the benefit of a 1% use case that
>> >negatively impacts the rest of the 99% cases which is what I believe
>> >is happening here.
>> 
>> Yeah. I just wanted the flow dissector to stay compact. But if needed,
>> could be split. I just fear that it will become a mess that's all.
>> 
>> 
>> >
>> >> Does not make sense in the most of the cases as the
>> >> dissection state would have to be carried in between calls.
>> >
>> >Please elaborate. This code is being moved into __skb_flow_dissect, so
>> >the functionality was already there. I don't see any description in
>> >this discussion that things were broken and that this patch is a
>> >necessary fix.
>> 
>> Yeah, you are right.
>
>Hi Tom, Hi Jiri,
>
>I'm happy to make a patch to move the call to
>__skb_flow_dissect_tunnel_info() from __skb_flow_dissect() to
>fl_classify(). It seems that approach has been agreed on above.

If the consensus is that the right way is to cut-out flow dissector,
so be it. But first, I believe it is reasonable to request to see some
numbers that would indicate that it actually resolves anything.

^ permalink raw reply

* Re: [PATCH 1/3] wireless: iwlwifi: use bool instead of int
From: Luciano Coelho @ 2017-10-04 19:18 UTC (permalink / raw)
  To: Joe Perches, Christoph Böhmwalder, johannes.berg,
	emmanuel.grumbach, kvalo
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <1507139722.4434.12.camel@perches.com>

On Wed, 2017-10-04 at 10:55 -0700, Joe Perches wrote:
> On Wed, 2017-10-04 at 19:39 +0300, Luciano Coelho wrote:
> > On Wed, 2017-10-04 at 09:26 -0700, Joe Perches wrote:
> 
> []
> > > This might be more intelligble as separate tests
> > > 
> > > static bool is_valid_channel(u16 ch_id)
> > > {
> > > 	if (ch_id <= 14)
> > > 		return true;
> > > 
> > > 	if ((ch_id % 4 == 0) &&
> > > 	    ((ch_id >= 36 && ch_id <= 64) ||
> > > 	     (ch_id >= 100 && ch_id <= 140)))
> > > 		return true;
> > > 
> > > 	if ((ch_id % 4 == 1) &&
> > > 	    (chid >= 145 && ch_id <= 165))
> > > 		return true;
> > > 
> > > 	return false;
> > > }
> > > 
> > > The compiler should produce the same object code.
> > 
> > Yeah, it may be a bit easier to read, but I don't want to start
> > getting
> > "fixes" to working and reasonable code.  There's nothing wrong with
> > the
> > existing function (except maybe for the int vs. boolean) so let's
> > not
> > change it.
> > 
> > A good time to change this would be the next time someone adds yet
> > another range of valid channels here. ;)
> 
> <shrug>  Your choice.
> 
> I like code I can read and understand at a glance.

I do too, but I don't think the original is that hard to read, really. 
Each "if" you add is already corresponding to one separate line in the
original code...


> At case somebody needs to add channels, likely nobody
> would do the change suggested but would just add
> another test to the already odd looking block.

Yeah, that would most likely be the case, but if I saw that and thought
there was a better way to write it, believe me, I would definitely
nitpick the patch and ask the author to reorg the code so it would look
nicer.


> And constants should be on the right side of the tests.

Sure, in a new patch, we would definitely pay attention to that.  But
now, is it worth having one more patch go through the entire machinery
to change a relatively clear, extremely simple function just because
you could write it in a different way? My answer is a resounding no,
sorry.

--
Luca.

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size
From: Daniel Borkmann @ 2017-10-04 19:27 UTC (permalink / raw)
  To: Craig Gallek
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, David S . Miller,
	Chonggang Li, netdev
In-Reply-To: <CAEfhGixoRM_wRy_e9reCGq69XUM1w-1kKGErCUty=gEB8DbzfA@mail.gmail.com>

On 10/04/2017 03:59 PM, Craig Gallek wrote:
> On Tue, Oct 3, 2017 at 10:39 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 10/03/2017 01:07 AM, Alexei Starovoitov wrote:
>>> On 10/2/17 9:41 AM, Craig Gallek wrote:
>>>>
>>>> +    /* Assume equally sized map definitions */
>>>> +    map_def_sz = data->d_size / nr_maps;
>>>> +    if (!data->d_size || (data->d_size % nr_maps) != 0) {
>>>> +        pr_warning("unable to determine map definition size "
>>>> +               "section %s, %d maps in %zd bytes\n",
>>>> +               obj->path, nr_maps, data->d_size);
>>>> +        return -EINVAL;
>>>> +    }
>>>
>>> this approach is not as flexible as done by samples/bpf/bpf_load.c
>>> where it looks at every map independently by walking symtab,
>>> but I guess it's ok.
>>
>> Regarding different map spec structs in a single prog: unless
>> we have a good use case why we would need it (and I'm not aware
>> of anything in particular), I would just go with a fixed size.
>> I did kind of similar sanity checks in bpf_fetch_maps_end() in
>> iproute2 loader as well.
>
> Split vote?  I'm happy to try to make this work with varying sizes,
> but I agree the usefulness seems low.  It would imply building map
> sections with different definition structures and we would then need a
> way to differentiate them.  If the goal is to allow for different map
> definition structures, I don't think size alone is a sufficient
> differentiator.

They would still all need to go to maps section, but you would end
up going with different structs for map specs, where they all would
need to overlap for the members in order for this to work. E.g. you
would have maps of both structs sitting in map section of a single
prog ...

struct bpf_map_spec_v1 {
	__u32 type;
	__u32 size_key;
	__u32 size_value;
	__u32 max_elem;
};

struct bpf_map_spec_v2 {
	__u32 type;
	__u32 size_key;
	__u32 size_value;
	__u32 max_elem;
	__u32 flags;
};

... rather than just using single spec everywhere consistently, and
have the members zeroed that are unused or such, so if there's no
real use-case for different structs (cannot think of any right now),
lets not add complexity for it until it's really required.

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] libbpf: parse maps sections of varying size
From: Alexei Starovoitov @ 2017-10-04 19:54 UTC (permalink / raw)
  To: Daniel Borkmann, Craig Gallek
  Cc: Jesper Dangaard Brouer, David S . Miller, Chonggang Li, netdev
In-Reply-To: <59D53611.5020907@iogearbox.net>

On 10/4/17 12:27 PM, Daniel Borkmann wrote:
> On 10/04/2017 03:59 PM, Craig Gallek wrote:
>> On Tue, Oct 3, 2017 at 10:39 AM, Daniel Borkmann
>> <daniel@iogearbox.net> wrote:
>>> On 10/03/2017 01:07 AM, Alexei Starovoitov wrote:
>>>> On 10/2/17 9:41 AM, Craig Gallek wrote:
>>>>>
>>>>> +    /* Assume equally sized map definitions */
>>>>> +    map_def_sz = data->d_size / nr_maps;
>>>>> +    if (!data->d_size || (data->d_size % nr_maps) != 0) {
>>>>> +        pr_warning("unable to determine map definition size "
>>>>> +               "section %s, %d maps in %zd bytes\n",
>>>>> +               obj->path, nr_maps, data->d_size);
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>
>>>> this approach is not as flexible as done by samples/bpf/bpf_load.c
>>>> where it looks at every map independently by walking symtab,
>>>> but I guess it's ok.
>>>
>>> Regarding different map spec structs in a single prog: unless
>>> we have a good use case why we would need it (and I'm not aware
>>> of anything in particular), I would just go with a fixed size.
>>> I did kind of similar sanity checks in bpf_fetch_maps_end() in
>>> iproute2 loader as well.
>>
>> Split vote?  I'm happy to try to make this work with varying sizes,
>> but I agree the usefulness seems low.  It would imply building map
>> sections with different definition structures and we would then need a
>> way to differentiate them.  If the goal is to allow for different map
>> definition structures, I don't think size alone is a sufficient
>> differentiator.
>
> They would still all need to go to maps section, but you would end
> up going with different structs for map specs, where they all would
> need to overlap for the members in order for this to work. E.g. you
> would have maps of both structs sitting in map section of a single
> prog ...
>
> struct bpf_map_spec_v1 {
>     __u32 type;
>     __u32 size_key;
>     __u32 size_value;
>     __u32 max_elem;
> };
>
> struct bpf_map_spec_v2 {
>     __u32 type;
>     __u32 size_key;
>     __u32 size_value;
>     __u32 max_elem;
>     __u32 flags;
> };
>
> ... rather than just using single spec everywhere consistently, and
> have the members zeroed that are unused or such, so if there's no
> real use-case for different structs (cannot think of any right now),
> lets not add complexity for it until it's really required.

makes sense to me :)

^ permalink raw reply

* [PATCH] net/ipv4: Remove unused variable in route.c
From: Tim Hansen @ 2017-10-04 19:59 UTC (permalink / raw)
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, alexander.levin,
	devtimhansen

int rc is unmodified after initalization in net/ipv4/route.c, this patch simply cleans up that variable and returns 0.

This was found with coccicheck M=net/ipv4/ on linus' tree.

Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
---
 net/ipv4/route.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 94d4cd2..02b6be9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -3032,7 +3032,6 @@ struct ip_rt_acct __percpu *ip_rt_acct __read_mostly;
 
 int __init ip_rt_init(void)
 {
-	int rc = 0;
 	int cpu;
 
 	ip_idents = kmalloc(IP_IDENTS_SZ * sizeof(*ip_idents), GFP_KERNEL);
@@ -3089,7 +3088,7 @@ int __init ip_rt_init(void)
 #endif
 	register_pernet_subsys(&rt_genid_ops);
 	register_pernet_subsys(&ipv4_inetpeer_ops);
-	return rc;
+	return 0;
 }
 
 #ifdef CONFIG_SYSCTL
-- 
2.1.4

^ permalink raw reply related

* [PATCH net-next 0/3] tcp: improving RACK cpu performance
From: Yuchung Cheng @ 2017-10-04 19:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng


This patch set improves the CPU consumption of the RACK TCP loss
recovery algorithm, in particular for high-speed networks. Currently,
for every ACK in recovery RACK can potentially iterate over all sent
packets in the write queue. On large BDP networks with non-trivial
losses the RACK write queue walk CPU usage becomes unreasonably high.

This patch introduces a new queue in TCP that keeps only skbs sent and
not yet (s)acked or marked lost, in time order instead of sequence
order.  With that, RACK can examine this time-sorted list and only
check packets that were sent recently, within the reordering window,
per ACK. This is the fastest way without any write queue walks. The
number of skbs examined per ACK is reduced by orders of magnitude.


Eric Dumazet (1):
  tcp: new list for sent but unacked skbs for RACK recovery

Yuchung Cheng (2):
  tcp: more efficient RACK loss detection
  tcp: a small refactor of RACK loss detection

 include/linux/skbuff.h   | 11 ++++++++--
 include/linux/tcp.h      |  1 +
 include/net/tcp.h        | 24 +++++++++++++++++++++-
 net/ipv4/tcp.c           |  2 ++
 net/ipv4/tcp_input.c     |  9 +++++++--
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv4/tcp_output.c    | 42 ++++++++++++++++++++++++++++----------
 net/ipv4/tcp_recovery.c  | 52 ++++++++++++++++++------------------------------
 8 files changed, 93 insertions(+), 49 deletions(-)

-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply

* [PATCH net-next 1/3] tcp: new list for sent but unacked skbs for RACK recovery
From: Yuchung Cheng @ 2017-10-04 19:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Neal Cardwell
In-Reply-To: <20171004200000.39257-1-ycheng@google.com>

From: Eric Dumazet <edumazet@google.com>

This patch adds a new queue (list) that tracks the sent but not yet
acked or SACKed skbs for a TCP connection. The list is chronologically
ordered by skb->skb_mstamp (the head is the oldest sent skb).

This list will be used to optimize TCP Rack recovery, which checks
an skb's timestamp to judge if it has been lost and needs to be
retransmitted. Since TCP write queue is ordered by sequence instead
of sent time, RACK has to scan over the write queue to catch all
eligible packets to detect lost retransmission, and iterates through
SACKed skbs repeatedly.

Special cares for rare events:
1. TCP repair fakes skb transmission so the send queue needs adjusted
2. SACK reneging would require re-inserting SACKed skbs into the
   send queue. For now I believe it's not worth the complexity to
   make RACK work perfectly on SACK reneging, so we do nothing here.
3. Fast Open: currently for non-TFO, send-queue correctly queues
   the pure SYN packet. For TFO which queues a pure SYN and
   then a data packet, send-queue only queues the data packet but
   not the pure SYN due to the structure of TFO code. This is okay
   because the SYN receiver would never respond with a SACK on a
   missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).

In order to not grow sk_buff, we use an union for the new list and
_skb_refdst/destructor fields. This is a bit complicated because
we need to make sure _skb_refdst and destructor are properly zeroed
before skb is cloned/copied at transmit, and before being freed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 include/linux/skbuff.h   | 11 +++++++++--
 include/linux/tcp.h      |  1 +
 include/net/tcp.h        | 24 +++++++++++++++++++++++-
 net/ipv4/tcp.c           |  2 ++
 net/ipv4/tcp_input.c     |  9 +++++++--
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv4/tcp_output.c    | 42 +++++++++++++++++++++++++++++++-----------
 7 files changed, 74 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ada821466e88..01a985937867 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -617,6 +617,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@nf_trace: netfilter packet trace flag
  *	@protocol: Packet protocol from driver
  *	@destructor: Destruct function
+ *	@tcp_tsorted_anchor: list structure for TCP (tp->tsorted_sent_queue)
  *	@_nfct: Associated connection, if any (with nfctinfo bits)
  *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
  *	@skb_iif: ifindex of device we arrived on
@@ -686,8 +687,14 @@ struct sk_buff {
 	 */
 	char			cb[48] __aligned(8);
 
-	unsigned long		_skb_refdst;
-	void			(*destructor)(struct sk_buff *skb);
+	union {
+		struct {
+			unsigned long	_skb_refdst;
+			void		(*destructor)(struct sk_buff *skb);
+		};
+		struct list_head	tcp_tsorted_anchor;
+	};
+
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4aa40ef02d32..1d2c44e09e31 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -191,6 +191,7 @@ struct tcp_sock {
 	u32	tsoffset;	/* timestamp offset */
 
 	struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
+	struct list_head tsorted_sent_queue; /* time-sorted sent but un-SACKed skbs */
 
 	u32	snd_wl1;	/* Sequence for window update		*/
 	u32	snd_wnd;	/* The window we expect to receive	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6d25d8305054..c39bcc222c9b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1588,14 +1588,34 @@ enum tcp_chrono {
 void tcp_chrono_start(struct sock *sk, const enum tcp_chrono type);
 void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type);
 
+/* This helper is needed, because skb->tcp_tsorted_anchor uses
+ * the same memory storage than skb->destructor/_skb_refdst
+ */
+static inline void tcp_skb_tsorted_anchor_cleanup(struct sk_buff *skb)
+{
+	skb->destructor = NULL;
+	skb->_skb_refdst = 0UL;
+}
+
+#define tcp_skb_tsorted_save(skb) {		\
+	unsigned long _save = skb->_skb_refdst;	\
+	skb->_skb_refdst = 0UL;
+
+#define tcp_skb_tsorted_restore(skb)		\
+	skb->_skb_refdst = _save;		\
+}
+
 /* write queue abstraction */
 static inline void tcp_write_queue_purge(struct sock *sk)
 {
 	struct sk_buff *skb;
 
 	tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
-	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
+	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL) {
+		tcp_skb_tsorted_anchor_cleanup(skb);
 		sk_wmem_free_skb(sk, skb);
+	}
+	INIT_LIST_HEAD(&tcp_sk(sk)->tsorted_sent_queue);
 	sk_mem_reclaim(sk);
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
 }
@@ -1710,6 +1730,8 @@ static inline void tcp_insert_write_queue_before(struct sk_buff *new,
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
+	list_del(&skb->tcp_tsorted_anchor);
+	tcp_skb_tsorted_anchor_cleanup(skb);
 	__skb_unlink(skb, &sk->sk_write_queue);
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 23225c98d287..6d25008be84b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -415,6 +415,7 @@ void tcp_init_sock(struct sock *sk)
 	tp->out_of_order_queue = RB_ROOT;
 	tcp_init_xmit_timers(sk);
 	INIT_LIST_HEAD(&tp->tsq_node);
+	INIT_LIST_HEAD(&tp->tsorted_sent_queue);
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
@@ -869,6 +870,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
 			 * available to the caller, no more, no less.
 			 */
 			skb->reserved_tailroom = skb->end - skb->tail - size;
+			INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
 			return skb;
 		}
 		__kfree_skb(skb);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index db9bb46b5776..f0402bd9fd7e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1593,6 +1593,8 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 						tcp_skb_pcount(skb),
 						skb->skb_mstamp);
 			tcp_rate_skb_delivered(sk, skb, state->rate);
+			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
+				list_del_init(&skb->tcp_tsorted_anchor);
 
 			if (!before(TCP_SKB_CB(skb)->seq,
 				    tcp_highest_sack_seq(tp)))
@@ -3054,8 +3056,11 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 
 	shinfo = skb_shinfo(skb);
 	if (!before(shinfo->tskey, prior_snd_una) &&
-	    before(shinfo->tskey, tcp_sk(sk)->snd_una))
-		__skb_tstamp_tx(skb, NULL, sk, SCM_TSTAMP_ACK);
+	    before(shinfo->tskey, tcp_sk(sk)->snd_una)) {
+		tcp_skb_tsorted_save(skb) {
+			__skb_tstamp_tx(skb, NULL, sk, SCM_TSTAMP_ACK);
+		} tcp_skb_tsorted_restore(skb);
+	}
 }
 
 /* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 188a6f31356d..2341b9f857b6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -446,6 +446,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 		newtp->snd_nxt = newtp->snd_up = treq->snt_isn + 1;
 
 		INIT_LIST_HEAD(&newtp->tsq_node);
+		INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
 
 		tcp_init_wl(newtp, treq->rcv_isn);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bc9e46a5369..8162e2880178 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -971,6 +971,12 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 		      HRTIMER_MODE_ABS_PINNED);
 }
 
+static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	skb->skb_mstamp = tp->tcp_mstamp;
+	list_move_tail(&skb->tcp_tsorted_anchor, &tp->tsorted_sent_queue);
+}
+
 /* This routine actually transmits TCP packets queued in by
  * tcp_do_sendmsg().  This is used by both the initial
  * transmission and possible later retransmissions.
@@ -1003,10 +1009,14 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		TCP_SKB_CB(skb)->tx.in_flight = TCP_SKB_CB(skb)->end_seq
 			- tp->snd_una;
 		oskb = skb;
-		if (unlikely(skb_cloned(skb)))
-			skb = pskb_copy(skb, gfp_mask);
-		else
-			skb = skb_clone(skb, gfp_mask);
+
+		tcp_skb_tsorted_save(oskb) {
+			if (unlikely(skb_cloned(oskb)))
+				skb = pskb_copy(oskb, gfp_mask);
+			else
+				skb = skb_clone(oskb, gfp_mask);
+		} tcp_skb_tsorted_restore(oskb);
+
 		if (unlikely(!skb))
 			return -ENOBUFS;
 	}
@@ -1127,7 +1137,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 		err = net_xmit_eval(err);
 	}
 	if (!err && oskb) {
-		oskb->skb_mstamp = tp->tcp_mstamp;
+		tcp_update_skb_after_send(tp, oskb);
 		tcp_rate_skb_sent(sk, oskb);
 	}
 	return err;
@@ -1328,6 +1338,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
 	/* Link BUFF into the send queue. */
 	__skb_header_release(buff);
 	tcp_insert_write_queue_after(skb, buff, sk);
+	list_add(&buff->tcp_tsorted_anchor, &skb->tcp_tsorted_anchor);
 
 	return 0;
 }
@@ -2260,7 +2271,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
 			/* "skb_mstamp" is used as a start point for the retransmit timer */
-			skb->skb_mstamp = tp->tcp_mstamp;
+			tcp_update_skb_after_send(tp, skb);
 			goto repair; /* Skip network transmission */
 		}
 
@@ -2838,11 +2849,14 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 		     skb_headroom(skb) >= 0xFFFF)) {
 		struct sk_buff *nskb;
 
-		nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
-		err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
-			     -ENOBUFS;
+		tcp_skb_tsorted_save(skb) {
+			nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
+			err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
+				     -ENOBUFS;
+		} tcp_skb_tsorted_restore(skb);
+
 		if (!err)
-			skb->skb_mstamp = tp->tcp_mstamp;
+			tcp_update_skb_after_send(tp, skb);
 	} else {
 		err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
 	}
@@ -3023,6 +3037,7 @@ void tcp_send_fin(struct sock *sk)
 				goto coalesce;
 			return;
 		}
+		INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
 		skb_reserve(skb, MAX_TCP_HEADER);
 		sk_forced_mem_schedule(sk, skb->truesize);
 		/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
@@ -3078,9 +3093,14 @@ int tcp_send_synack(struct sock *sk)
 	}
 	if (!(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK)) {
 		if (skb_cloned(skb)) {
-			struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC);
+			struct sk_buff *nskb;
+
+			tcp_skb_tsorted_save(skb) {
+				nskb = skb_copy(skb, GFP_ATOMIC);
+			} tcp_skb_tsorted_restore(skb);
 			if (!nskb)
 				return -ENOMEM;
+			INIT_LIST_HEAD(&nskb->tcp_tsorted_anchor);
 			tcp_unlink_write_queue(skb, sk);
 			__skb_header_release(nskb);
 			__tcp_add_write_queue_head(sk, nskb);
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related

* [PATCH net-next 2/3] tcp: more efficient RACK loss detection
From: Yuchung Cheng @ 2017-10-04 19:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet
In-Reply-To: <20171004200000.39257-1-ycheng@google.com>

Use the new time-ordered list to speed up RACK. The detection
logic is identical. But since the list is chronologically ordered
by skb_mstamp and contains only skbs not yet acked or sacked,
RACK can abort the loop upon hitting skbs that were sent more
recently. On YouTube servers this patch reduces the iterations on
write queue by 40x. The improvement is even bigger with large
BDP networks.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_recovery.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 449cd914d58e..8aa56caefde8 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -45,7 +45,7 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
 static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
+	struct sk_buff *skb, *n;
 	u32 reo_wnd;
 
 	*reo_timeout = 0;
@@ -58,17 +58,10 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 	if ((tp->rack.reord || !tp->lost_out) && tcp_min_rtt(tp) != ~0U)
 		reo_wnd = max(tcp_min_rtt(tp) >> 2, reo_wnd);
 
-	tcp_for_write_queue(skb, sk) {
+	list_for_each_entry_safe(skb, n, &tp->tsorted_sent_queue,
+				 tcp_tsorted_anchor) {
 		struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
 
-		if (skb == tcp_send_head(sk))
-			break;
-
-		/* Skip ones already (s)acked */
-		if (!after(scb->end_seq, tp->snd_una) ||
-		    scb->sacked & TCPCB_SACKED_ACKED)
-			continue;
-
 		if (tcp_rack_sent_after(tp->rack.mstamp, skb->skb_mstamp,
 					tp->rack.end_seq, scb->end_seq)) {
 			/* Step 3 in draft-cheng-tcpm-rack-00.txt:
@@ -81,6 +74,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 
 			if (remaining < 0) {
 				tcp_rack_mark_skb_lost(sk, skb);
+				list_del_init(&skb->tcp_tsorted_anchor);
 				continue;
 			}
 
@@ -91,11 +85,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 
 			/* Record maximum wait time (+1 to avoid 0) */
 			*reo_timeout = max_t(u32, *reo_timeout, 1 + remaining);
-
-		} else if (!(scb->sacked & TCPCB_RETRANS)) {
-			/* Original data are sent sequentially so stop early
-			 * b/c the rest are all sent after rack_sent
-			 */
+		} else {
 			break;
 		}
 	}
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related

* [PATCH net-next 3/3] tcp: a small refactor of RACK loss detection
From: Yuchung Cheng @ 2017-10-04 20:00 UTC (permalink / raw)
  To: davem; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Eric Dumazet
In-Reply-To: <20171004200000.39257-1-ycheng@google.com>

Refactor the RACK loop to improve readability and speed up the checks.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_recovery.c | 40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 8aa56caefde8..cda6074a429a 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -61,32 +61,28 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 	list_for_each_entry_safe(skb, n, &tp->tsorted_sent_queue,
 				 tcp_tsorted_anchor) {
 		struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
+		s32 remaining;
 
-		if (tcp_rack_sent_after(tp->rack.mstamp, skb->skb_mstamp,
-					tp->rack.end_seq, scb->end_seq)) {
-			/* Step 3 in draft-cheng-tcpm-rack-00.txt:
-			 * A packet is lost if its elapsed time is beyond
-			 * the recent RTT plus the reordering window.
-			 */
-			u32 elapsed = tcp_stamp_us_delta(tp->tcp_mstamp,
-							 skb->skb_mstamp);
-			s32 remaining = tp->rack.rtt_us + reo_wnd - elapsed;
-
-			if (remaining < 0) {
-				tcp_rack_mark_skb_lost(sk, skb);
-				list_del_init(&skb->tcp_tsorted_anchor);
-				continue;
-			}
-
-			/* Skip ones marked lost but not yet retransmitted */
-			if ((scb->sacked & TCPCB_LOST) &&
-			    !(scb->sacked & TCPCB_SACKED_RETRANS))
-				continue;
+		/* Skip ones marked lost but not yet retransmitted */
+		if ((scb->sacked & TCPCB_LOST) &&
+		    !(scb->sacked & TCPCB_SACKED_RETRANS))
+			continue;
 
+		if (!tcp_rack_sent_after(tp->rack.mstamp, skb->skb_mstamp,
+					 tp->rack.end_seq, scb->end_seq))
+			break;
+
+		/* A packet is lost if it has not been s/acked beyond
+		 * the recent RTT plus the reordering window.
+		 */
+		remaining = tp->rack.rtt_us + reo_wnd -
+			    tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+		if (remaining < 0) {
+			tcp_rack_mark_skb_lost(sk, skb);
+			list_del_init(&skb->tcp_tsorted_anchor);
+		} else {
 			/* Record maximum wait time (+1 to avoid 0) */
 			*reo_timeout = max_t(u32, *reo_timeout, 1 + remaining);
-		} else {
-			break;
 		}
 	}
 }
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related

* Re: [PATCH net-next v3 3/3] tools: bpftool: add documentation
From: Jakub Kicinski @ 2017-10-04 20:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, alexei.starovoitov, daniel, dsahern, oss-drivers,
	David Beckett, linux-doc@vger.kernel.org
In-Reply-To: <20171004203642.699f5f1a@redhat.com>

On Wed, 4 Oct 2017 20:36:42 +0200, Jesper Dangaard Brouer wrote:
> On Wed,  4 Oct 2017 08:40:32 -0700
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> > Add documentation for bpftool.  Separate files for each subcommand.
> > Use rst format.  Documentation is compiled into man pages using
> > rst2man.
> > 
> > Signed-off-by: David Beckett <david.beckett@netronome.com>
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > Acked-by: Alexei Starovoitov <ast@kernel.org>
> > Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> > ---
> >  tools/bpf/bpftool/Documentation/Makefile         |  34 +++++++
> >  tools/bpf/bpftool/Documentation/bpftool-map.txt  | 110 +++++++++++++++++++++++
> >  tools/bpf/bpftool/Documentation/bpftool-prog.txt |  79 ++++++++++++++++
> >  tools/bpf/bpftool/Documentation/bpftool.txt      |  34 +++++++  
> 
> RST-format files are usually called .rst and not .txt
> 
> This is useful when people happen to browse the code via github, then they get formatted nicely e.g.:
>  https://github.com/torvalds/linux/blob/master/samples/bpf/README.rst

I was following perf's example.  Are perf's docs not RST?

^ permalink raw reply

* RE
From: Stefan E Persson @ 2017-10-04 20:05 UTC (permalink / raw)


Your email has been randonly selected for my foundation Donation reply  
to email: eriling.persson089@swedenmail.com for more details

^ permalink raw reply

* Re: [PATCH net-next 0/3] A own subdirectory for shared TCP code
From: Andrew Lunn @ 2017-10-04 20:27 UTC (permalink / raw)
  To: Richard Siegfried; +Cc: David Miller, netdev
In-Reply-To: <d7237d11-ec83-0ef6-d201-da8b99c94b88@systemli.org>

On Wed, Oct 04, 2017 at 08:54:17PM +0200, Richard Siegfried wrote:
> On 04/10/17 01:03, David Miller wrote:
> > As someone who has to do backports regularly to -stable, there is no way
> > I am applying this.
> > 
> > Sorry.
> Okay, I see.
> 
> Is grouping files into subdirectories something generally
> unwanted/unlikely to be applied or is this specific to TCP / networking?
> 
> Because there are several other places in the source tree where I would
> like to group things.

Hi Richard

It is generally unwanted.

Have you tried back porting patches when the directory structure has
changed? Files have moved around? It makes it a lot harder to
do. Meaning patches are going to be back ported less often. Fixes
which could be security relevant might not get back ported, etc.

Kernel 4.4 is going to be supported until 2022. So moving files around
is going to make Greg Kroah-Hartman life more difficult for the next 5
years.

	Andrew

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Add erspan tunnel support.
From: Pravin Shelar @ 2017-10-04 21:31 UTC (permalink / raw)
  To: William Tu; +Cc: Linux Kernel Network Developers
In-Reply-To: <1507118559-13774-1-git-send-email-u9012063@gmail.com>

On Wed, Oct 4, 2017 at 5:02 AM, William Tu <u9012063@gmail.com> wrote:
> Add type II erspan vport implementation.  Since erspan protocol is
> on top of the GRE header, the implementation is extended from the
> existing gre implementation.
>
> Signed-off-by: William Tu <u9012063@gmail.com>
> Cc: Pravin B Shelar <pshelar@ovn.org>

Why are you adding ERSPAN support to compat code. Isn't this supported
over OVS netlink-rtnl (dpif-netlink-rtnl)?

^ permalink raw reply

* Re: [RFC] bpf: remove global verifier state
From: Jakub Kicinski @ 2017-10-04 21:57 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Eric Dumazet, dsahern, netdev, oss-drivers,
	david.beckett
In-Reply-To: <59D532EB.4000104@iogearbox.net>

On Wed, 04 Oct 2017 21:13:47 +0200, Daniel Borkmann wrote:
> On 10/04/2017 05:43 AM, Alexei Starovoitov wrote:
> > On Tue, Oct 03, 2017 at 08:24:06PM -0700, Eric Dumazet wrote:  
> >> On Tue, 2017-10-03 at 19:52 -0700, Alexei Starovoitov wrote:
> >>  
> >>> yep. looks great.
> >>> Please test it and submit officially :)
> >>> The commit aafe6ae9cee3 ("bpf: dynamically allocate digest scratch buffer")
> >>> fixed the other case where we were relying on the above mutex.
> >>> The only other spot to be adjusted is to add spin_lock/mutex or DO_ONCE() to
> >>> bpf_get_skb_set_tunnel_proto() to protect md_dst init.
> >>> imo that would be it.
> >>> Daniel, anything else comes to mind?  
> 
> Yes, this should be all. DO_ONCE() for the tunnel proto seems a
> good choice.

Hm.  I actually did:

if (!dst) {
	tmp = alloc();
	if (!tmp)
		return;
	if (cmpxchg(&dst, NULL, tmp))
		free(tmp);
}

I don't like how DO_ONCE() doesn't handle errors from the init
function :(

> >> 16 MB of log (unswappable kernel memory) per active checker.
> >>
> >> We might offer a way to oom hosts.  
> >
> > right. good point!
> > we need to switch to continuous copy_to_user() after a page or so.
> > Can even do it after every vscnprintf()
> > but page at a time is probably faster.  
> 
> Also worst case upper limits on verification side for holding state
> aside from the log would need to be checked in terms of how much mem
> we end up holding that is not accounted against any process (and not
> really "rate-limited" anymore once we drop the mutex).

^ permalink raw reply

* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: Jakub Kicinski @ 2017-10-04 22:22 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: Joe Perches, David S . Miller, Simon Horman, Dirk van der Merwe,
	oss-drivers, netdev, linux-kernel, Renato Golin, Manoj Gupta,
	Guenter Roeck, Doug Anderson
In-Reply-To: <20171004184957.GO173745@google.com>

On Wed, 4 Oct 2017 11:49:57 -0700, Matthias Kaehlcke wrote:
> Hi Joe,
> 
> El Wed, Oct 04, 2017 at 11:07:19AM -0700 Joe Perches ha dit:
> 
> > On Tue, 2017-10-03 at 13:05 -0700, Matthias Kaehlcke wrote:  
> > > nfp_eth_set_bit_config() is marked as __always_inline to allow gcc to
> > > identify the 'mask' parameter as known to be constant at compile time,
> > > which is required to use the FIELD_GET() macro.
> > > 
> > > The forced inlining does the trick for gcc, but for kernel builds with
> > > clang it results in undefined symbols:  
> > 
> > Can't you use local different FIELD_PREP/FIELD_GET macros
> > with a different name without the BUILD_BUG tests?
> > 
> > i.e.:
> > 
> > #define NFP_FIELD_PREP(_mask, _val)				\
> > ({								\
> > 	((typeof(_mask))(_val) << __bf_shf(_mask)) & (_mask);	\
> > })
> > 
> > #define NFP_FIELD_GET(_mask, _reg)				\
> > ({								\
> > 	(typeof(_mask))(((_reg) & (_mask)) >> __bf_shf(_mask));	\
> > })
> > 
> > Then the __always_inline can be removed from
> > nfp_eth_set_bit_config too.  
> 
> Thanks for the suggestion. This seems a viable alternative if David
> and the NFP owners can live without the extra checking provided by
> __BF_FIELD_CHECK.

The reason the __BF_FIELD_CHECK refuses to compile non-constant masks
is that it will require runtime ffs on the mask, which is potentially
costly.  I would also feel quite stupid adding those macros to the nfp
driver, given that I specifically created the bitfield.h header to not
have to reimplement these in every driver I write/maintain.

Can you please test the patch I provided in the other reply?

^ permalink raw reply

* Re: [RFC] bpf: remove global verifier state
From: Alexei Starovoitov @ 2017-10-04 22:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Eric Dumazet, dsahern, netdev, oss-drivers,
	david.beckett
In-Reply-To: <20171004145744.27c08d4a@cakuba.netronome.com>

On Wed, Oct 04, 2017 at 02:57:44PM -0700, Jakub Kicinski wrote:
> On Wed, 04 Oct 2017 21:13:47 +0200, Daniel Borkmann wrote:
> > On 10/04/2017 05:43 AM, Alexei Starovoitov wrote:
> > > On Tue, Oct 03, 2017 at 08:24:06PM -0700, Eric Dumazet wrote:  
> > >> On Tue, 2017-10-03 at 19:52 -0700, Alexei Starovoitov wrote:
> > >>  
> > >>> yep. looks great.
> > >>> Please test it and submit officially :)
> > >>> The commit aafe6ae9cee3 ("bpf: dynamically allocate digest scratch buffer")
> > >>> fixed the other case where we were relying on the above mutex.
> > >>> The only other spot to be adjusted is to add spin_lock/mutex or DO_ONCE() to
> > >>> bpf_get_skb_set_tunnel_proto() to protect md_dst init.
> > >>> imo that would be it.
> > >>> Daniel, anything else comes to mind?  
> > 
> > Yes, this should be all. DO_ONCE() for the tunnel proto seems a
> > good choice.
> 
> Hm.  I actually did:
> 
> if (!dst) {
> 	tmp = alloc();
> 	if (!tmp)
> 		return;
> 	if (cmpxchg(&dst, NULL, tmp))
> 		free(tmp);
> }
> 
> I don't like how DO_ONCE() doesn't handle errors from the init
> function :(

yeah. good point.
Above looks good to me.

^ permalink raw reply

* RE: [PATCH 2/3 v2] net: phy: DP83822 initial driver submission
From: Woojung.Huh @ 2017-10-04 22:44 UTC (permalink / raw)
  To: dmurphy, andrew, f.fainelli; +Cc: netdev, afd
In-Reply-To: <20171004182031.13794-2-dmurphy@ti.com>

> +static int dp83822_suspend(struct phy_device *phydev)
> +{
> +	int value;
> +
> +	mutex_lock(&phydev->lock);
> +	value = phy_read_mmd(phydev, DP83822_DEVADDR,
> MII_DP83822_WOL_CFG);
> +	mutex_unlock(&phydev->lock);
Would we need mutex to access phy_read_mmd()?
phy_read_mmd() has mdio_lock for indirect access.

> +	if (!(value & DP83822_WOL_EN))
> +		genphy_suspend(phydev);
> +
> +	return 0;
> +}
> +
> +static int dp83822_resume(struct phy_device *phydev)
> +{
> +	int value;
> +
> +	genphy_resume(phydev);
> +
> +	mutex_lock(&phydev->lock);
> +	value = phy_read_mmd(phydev, DP83822_DEVADDR,
> MII_DP83822_WOL_CFG);
> +
> +	phy_write_mmd(phydev, DP83822_DEVADDR,
> MII_DP83822_WOL_CFG, value |
> +		      DP83822_WOL_CLR_INDICATION);
> +
> +	mutex_unlock(&phydev->lock);
Same here.

Woojung

^ permalink raw reply

* Re: [PATCH net-next v4 1/3] bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood
From: David Miller @ 2017-10-04 22:52 UTC (permalink / raw)
  To: roopa; +Cc: netdev, nikolay, stephen, bridge
In-Reply-To: <1507093953-59929-2-git-send-email-roopa@cumulusnetworks.com>

From: Roopa Prabhu <roopa@cumulusnetworks.com>
Date: Tue,  3 Oct 2017 22:12:31 -0700

> BR_ARP_PROXY flag but has a few semantic differences to conform

This should be "BR_PROXYARP".

Otherwise this series looks fine to me, but I see there will be
a v5 respin.

^ permalink raw reply

* Re: [PATCH v2 4/5] VSOCK: add sock_diag interface
From: David Miller @ 2017-10-04 22:58 UTC (permalink / raw)
  To: stefanha; +Cc: netdev, jhansen, decui
In-Reply-To: <20171004163716.3964-5-stefanha@redhat.com>

From: Stefan Hajnoczi <stefanha@redhat.com>
Date: Wed,  4 Oct 2017 12:37:15 -0400

> This patch adds the sock_diag interface for querying sockets from
> userspace.  Tools like ss(8) and netstat(8) can use this interface to
> list open sockets.
> 
> The userspace ABI is defined in <linux/vm_sockets_diag.h> and includes
> netlink request and response structs.  The request can query sockets
> based on their sk_state (e.g. listening sockets only) and the response
> contains socket information fields including the local/remote addresses,
> inode number, etc.
> 
> This patch does not dump VMCI pending sockets because I have only tested
> the virtio transport, which does not use pending sockets.  Support can
> be added later by extending vsock_diag_dump() if needed by VMCI users.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Please post new feature patches against net-next.

> diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
> index 09fc2eb29dc8..e5dbf153aff0 100644
> --- a/net/vmw_vsock/Makefile
> +++ b/net/vmw_vsock/Makefile
> @@ -1,10 +1,13 @@
>  obj-$(CONFIG_VSOCKETS) += vsock.o
> +obj-$(CONFIG_VSOCKETS_DIAG) += vsock_diag.o
>  obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
>  obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
>  obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
>  

This hunk fails to apply to the net-next tree, the context looks
different.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v6 0/4] bpf: add two helpers to read perf event enabled/running time
From: David Miller @ 2017-10-04 23:00 UTC (permalink / raw)
  To: yhs; +Cc: peterz, rostedt, ast, daniel, netdev, kernel-team
In-Reply-To: <20171002224218.3181418-1-yhs@fb.com>

From: Yonghong Song <yhs@fb.com>
Date: Mon, 2 Oct 2017 15:42:14 -0700

> [Dave, Peter,
> 
>  Previous communcation shows that this patch may potentially have
>  merge conflict with upcoming tip changes in the next merge window.
> 
>  Could you advise how this patch should proceed?
> 
>  Thanks!
> ]

Indeed, Peter how do you want to handle this?

Thanks.

^ permalink raw reply

* Re: [PATCH] PCI: Check/Set ARI capability before setting numVFs
From: Bjorn Helgaas @ 2017-10-04 23:01 UTC (permalink / raw)
  To: Tony Nguyen
  Cc: linux-pci, intel-wired-lan, linux-kernel, netdev, bhelgaas,
	Alexander Duyck, Emil Tantilov
In-Reply-To: <20171004155258.35634-1-anthony.l.nguyen@intel.com>

On Wed, Oct 04, 2017 at 08:52:58AM -0700, Tony Nguyen wrote:
> This fixes a bug that can occur if an AER error is encountered while SRIOV
> devices are present.
> 
> This issue was seen by doing the following. Inject an AER error to a device
> that has SRIOV devices.  After the device has recovered, remove the driver.
> Reload the driver and enable SRIOV which causes the following crash to
> occur:
> 
> kernel BUG at drivers/pci/iov.c:157!
> invalid opcode: 0000 [#1] SMP
> CPU: 36 PID: 2295 Comm: bash Not tainted 4.14.0-rc1+ #74
> Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.0a 04/29/2014
> task: ffff9fa41cd45a00 task.stack: ffffb4b2036e8000
> RIP: 0010:pci_iov_add_virtfn+0x2eb/0x350
> RSP: 0018:ffffb4b2036ebcb8 EFLAGS: 00010286
> RAX: 00000000fffffff0 RBX: ffff9fa42c1c8800 RCX: ffff9fa421ce2388
> RDX: 00000000df900000 RSI: ffff9fa8214fb388 RDI: 00000000df903fff
> RBP: ffffb4b2036ebd18 R08: ffff9fa421ce23b8 R09: ffffb4b2036ebc2c
> R10: ffff9fa42c1a5548 R11: 000000000000058e R12: ffff9fa8214fb000
> R13: ffff9fa42c1a5000 R14: ffff9fa8214fb388 R15: 0000000000000000
> FS:  00007f60724b6700(0000) GS:ffff9fa82f300000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000559eca8b0f40 CR3: 0000000864146000 CR4: 00000000001606e0
> Call Trace:
>  pci_enable_sriov+0x353/0x440
>  ixgbe_pci_sriov_configure+0xd5/0x1f0 [ixgbe]
>  sriov_numvfs_store+0xf7/0x170
>  dev_attr_store+0x18/0x30
>  sysfs_kf_write+0x37/0x40
>  kernfs_fop_write+0x120/0x1b0
>  __vfs_write+0x37/0x170
>  ? __alloc_fd+0x3f/0x170
>  ? set_close_on_exec+0x30/0x70
>  vfs_write+0xb5/0x1a0
>  SyS_write+0x55/0xc0
>  entry_SYSCALL_64_fastpath+0x1a/0xa5
> RIP: 0033:0x7f6071bafc20
> RSP: 002b:00007ffe7d42ba48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 0000559eca8b0f30 RCX: 00007f6071bafc20
> RDX: 0000000000000002 RSI: 0000559eca961f60 RDI: 0000000000000001
> RBP: 00007f6071e78ae0 R08: 00007f6071e7a740 R09: 00007f60724b6700
> R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000559eca892170
> RIP: pci_iov_add_virtfn+0x2eb/0x350 RSP: ffffb4b2036ebcb8
> 
> The occurs since during AER recovery the ARI Capable Hierarchy bit,
> which can affect the values for First VF Offset and VF Stride, is not set
> until after pci_iov_set_numvfs() is called.  

Can you elaborate on where exactly this happens?  The only place we
explicitly set PCI_SRIOV_CTRL_ARI is in sriov_init(), which is only
called at enumeration-time.  So I'm guessing you're talking about this
path:

  ixgbe_io_slot_reset
    pci_restore_state
      pci_restore_iov_state
	sriov_restore_state
	  pci_iov_set_numvfs

where we don't set PCI_SRIOV_CTRL_ARI at all.  The fact that you say
PCI_SRIOV_CTRL_ARI isn't set until *after* pci_iov_set_numvfs() is
called suggests that it is being set *somewhere*, but I don't know
where.

> This can cause the iov
> structure to be populated with values that are incorrect if the bit is
> later set.   Check and set this bit, if needed, before calling
> pci_iov_set_numvfs() so that the values being populated properly take
> the ARI bit into account.
> 
> CC: Alexander Duyck <alexander.h.duyck@intel.com>
> CC: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> ---
>  drivers/pci/iov.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 7492a65..a8896c7 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -497,6 +497,10 @@ static void sriov_restore_state(struct pci_dev *dev)
>  	if (ctrl & PCI_SRIOV_CTRL_VFE)
>  		return;
>  
> +	if ((iov->ctrl & PCI_SRIOV_CTRL_ARI) && !(ctrl & PCI_SRIOV_CTRL_ARI))
> +		pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL,
> +				      ctrl | PCI_SRIOV_CTRL_ARI);
> +
>  	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
>  		pci_update_resource(dev, i);
>  
> -- 
> 2.9.5
> 

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Add erspan tunnel support.
From: William Tu @ 2017-10-04 23:02 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAOrHB_DNkfnUty=QzO5d6SAyscLXBP-teE_j0RvDaJiMHqRaaw@mail.gmail.com>

On Wed, Oct 4, 2017 at 2:31 PM, Pravin Shelar <pshelar@ovn.org> wrote:
> On Wed, Oct 4, 2017 at 5:02 AM, William Tu <u9012063@gmail.com> wrote:
>> Add type II erspan vport implementation.  Since erspan protocol is
>> on top of the GRE header, the implementation is extended from the
>> existing gre implementation.
>>
>> Signed-off-by: William Tu <u9012063@gmail.com>
>> Cc: Pravin B Shelar <pshelar@ovn.org>
>
> Why are you adding ERSPAN support to compat code. Isn't this supported
> over OVS netlink-rtnl (dpif-netlink-rtnl)?

You're right. I will remove it and submit v2. Thanks for the review.

William

^ permalink raw reply

* Re: [PATCH v2 net-next 0/8] bpf: muli prog support for cgroup-bpf
From: David Miller @ 2017-10-04 23:05 UTC (permalink / raw)
  To: ast; +Cc: daniel, tj, dsa, netdev, kernel-team
In-Reply-To: <20171003055028.1294791-1-ast@fb.com>

From: Alexei Starovoitov <ast@fb.com>
Date: Mon, 2 Oct 2017 22:50:20 -0700

> v1->v2:
> - fixed accidentally swapped two lines which caused static_key not going to zero
> - addressed Martin's feedback and changed prog_query to be consistent
>   with verifier output: return -enospc and fill supplied buffer instead
>   of just returning -enospc when buffer is too small to fit all prog_ids
> 
> v1:
> cgroup-bpf use cases are getting more advanced and running only
> one program per cgroup is no longer enough. Therefore introduce
> support for attaching multiple programs per cgroup and running
> a set of effective programs.
> 
> These patches introduces BPF_F_ALLOW_MULTI flag for BPF_PROG_ATTACH cmd.
> The default is still NONE and behavior of BPF_F_ALLOW_OVERRIDE flag
> is unchanged.
> The difference between three possible flags for BPF_PROG_ATTACH command:
> - NONE(default): No further bpf programs allowed in the subtree.
> - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
>   the program in this cgroup yields to sub-cgroup program.
> - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
>   that cgroup program gets run in addition to the program in this cgroup.
> 
> Most of the logic is in patch 1. Even when cgroup doesn't have
> any programs attached its set of effective program can be non-empty.
> To quickly execute them and avoid penalizing cgroups without
> any effective programs introduce 'struct bpf_prog_array'
> which has an optimization for cgroups with zero effective programs.
> 
> Patch 2 introduces BPF_PROG_QUERY command for introspection
> Patch 3 makes verifier more strict for cgroup-bpf program types.
> Patch 4+ are tests.
> 
> More details in individual patches

Looks good, series applied, thanks!

^ permalink raw reply

* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: Matthias Kaehlcke @ 2017-10-04 23:16 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Joe Perches, David S . Miller, Simon Horman, Dirk van der Merwe,
	oss-drivers, netdev, linux-kernel, Renato Golin, Manoj Gupta,
	Guenter Roeck, Doug Anderson
In-Reply-To: <20171004152203.2a4f564d@cakuba.netronome.com>

Hi Jakub,

El Wed, Oct 04, 2017 at 03:22:03PM -0700 Jakub Kicinski ha dit:

> On Wed, 4 Oct 2017 11:49:57 -0700, Matthias Kaehlcke wrote:
> > Hi Joe,
> > 
> > El Wed, Oct 04, 2017 at 11:07:19AM -0700 Joe Perches ha dit:
> > 
> > > On Tue, 2017-10-03 at 13:05 -0700, Matthias Kaehlcke wrote:  
> > > > nfp_eth_set_bit_config() is marked as __always_inline to allow gcc to
> > > > identify the 'mask' parameter as known to be constant at compile time,
> > > > which is required to use the FIELD_GET() macro.
> > > > 
> > > > The forced inlining does the trick for gcc, but for kernel builds with
> > > > clang it results in undefined symbols:  
> > > 
> > > Can't you use local different FIELD_PREP/FIELD_GET macros
> > > with a different name without the BUILD_BUG tests?
> > > 
> > > i.e.:
> > > 
> > > #define NFP_FIELD_PREP(_mask, _val)				\
> > > ({								\
> > > 	((typeof(_mask))(_val) << __bf_shf(_mask)) & (_mask);	\
> > > })
> > > 
> > > #define NFP_FIELD_GET(_mask, _reg)				\
> > > ({								\
> > > 	(typeof(_mask))(((_reg) & (_mask)) >> __bf_shf(_mask));	\
> > > })
> > > 
> > > Then the __always_inline can be removed from
> > > nfp_eth_set_bit_config too.  
> > 
> > Thanks for the suggestion. This seems a viable alternative if David
> > and the NFP owners can live without the extra checking provided by
> > __BF_FIELD_CHECK.
> 
> The reason the __BF_FIELD_CHECK refuses to compile non-constant masks
> is that it will require runtime ffs on the mask, which is potentially
> costly.  I would also feel quite stupid adding those macros to the nfp
> driver, given that I specifically created the bitfield.h header to not
> have to reimplement these in every driver I write/maintain.

That make sense, thanks for providing more context.

> Can you please test the patch I provided in the other reply?

With this patch there are no errors when building the kernel with
clang.

Thanks!

Matthias

^ permalink raw reply

* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: Jakub Kicinski @ 2017-10-04 23:25 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: Joe Perches, David S . Miller, Simon Horman, Dirk van der Merwe,
	oss-drivers, netdev, linux-kernel, Renato Golin, Manoj Gupta,
	Guenter Roeck, Doug Anderson
In-Reply-To: <20171004231649.GP173745@google.com>

On Wed, 4 Oct 2017 16:16:49 -0700, Matthias Kaehlcke wrote:
> > > Thanks for the suggestion. This seems a viable alternative if David
> > > and the NFP owners can live without the extra checking provided by
> > > __BF_FIELD_CHECK.  
> > 
> > The reason the __BF_FIELD_CHECK refuses to compile non-constant masks
> > is that it will require runtime ffs on the mask, which is potentially
> > costly.  I would also feel quite stupid adding those macros to the nfp
> > driver, given that I specifically created the bitfield.h header to not
> > have to reimplement these in every driver I write/maintain.  
> 
> That make sense, thanks for providing more context.
> 
> > Can you please test the patch I provided in the other reply?  
> 
> With this patch there are no errors when building the kernel with
> clang.

Cool, thanks for checking!  I will run it through full tests and queue
for upstreaming :)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox