Netdev List
 help / color / mirror / Atom feed
* [PATCH v2 net-next] tcp: internal implementation for pacing
From: Eric Dumazet @ 2017-05-16 11:24 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
	Van Jacobson, Jerry Chu

BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
---
v2: added one missing kdoc for sk_pacing_status (kbuild test robot report)
 include/linux/tcp.h   |  2 ++
 include/net/sock.h    |  9 +++++-
 include/net/tcp.h     |  3 ++
 net/core/sock.c       |  4 +++
 net/ipv4/tcp_bbr.c    |  9 +++---
 net/ipv4/tcp_output.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_timer.c  |  3 ++
 net/sched/sch_fq.c    |  8 ++++++
 8 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b6d5adcee8fcb611de202993623cc80274d262e4..22854f0284347a3bb047709478525ee5a9dd9b36 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -293,6 +293,8 @@ struct tcp_sock {
 	u32	sacked_out;	/* SACK'd packets			*/
 	u32	fackets_out;	/* FACK'd packets			*/
 
+	struct hrtimer	pacing_timer;
+
 	/* from STCP, retrans queue hinting */
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
diff --git a/include/net/sock.h b/include/net/sock.h
index f33e3d134e0b7f66329f2122d7acc8b396c1787b..28d33b08524fda31d50f97366e2a4fc6f9bff402 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -253,6 +253,7 @@ struct sock_common {
   *	@sk_ll_usec: usecs to busypoll when there is no data
   *	@sk_allocation: allocation mode
   *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
+  *	@sk_pacing_status: Pacing status (requested, handled by sch_fq)
   *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *	@sk_sndbuf: size of send buffer in bytes
   *	@sk_padding: unused element for alignment
@@ -396,7 +397,7 @@ struct sock {
 	__s32			sk_peek_off;
 	int			sk_write_pending;
 	__u32			sk_dst_pending_confirm;
-	/* Note: 32bit hole on 64bit arches */
+	u32			sk_pacing_status; /* see enum sk_pacing */
 	long			sk_sndtimeo;
 	struct timer_list	sk_timer;
 	__u32			sk_priority;
@@ -475,6 +476,12 @@ struct sock {
 	struct rcu_head		sk_rcu;
 };
 
+enum sk_pacing {
+	SK_PACING_NONE		= 0,
+	SK_PACING_NEEDED	= 1,
+	SK_PACING_FQ		= 2,
+};
+
 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
 
 #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 38a7427ae902e35973a8b7fa0e95ff602ede0e87..b4dc93dae98c2d175ccadce150083705d237555e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -574,6 +574,7 @@ void tcp_fin(struct sock *sk);
 void tcp_init_xmit_timers(struct sock *);
 static inline void tcp_clear_xmit_timers(struct sock *sk)
 {
+	hrtimer_cancel(&tcp_sk(sk)->pacing_timer);
 	inet_csk_clear_xmit_timers(sk);
 }
 
@@ -1945,4 +1946,6 @@ static inline void tcp_listendrop(const struct sock *sk)
 	__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
 }
 
+enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
+
 #endif	/* _TCP_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index e43e71d7856b385111cd4c4b1bd835a78c670c60..93d011e35b8349954db6918055c2f90ae473d254 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1041,6 +1041,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 #endif
 
 	case SO_MAX_PACING_RATE:
+		if (val != ~0U)
+			cmpxchg(&sk->sk_pacing_status,
+				SK_PACING_NONE,
+				SK_PACING_NEEDED);
 		sk->sk_max_pacing_rate = val;
 		sk->sk_pacing_rate = min(sk->sk_pacing_rate,
 					 sk->sk_max_pacing_rate);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index b89bce4c721eed530f5cfc725b759147b38cef42..92b045c72163def1c1d6aa0f2002760186aa5dc3 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -52,10 +52,9 @@
  * There is a public e-mail list for discussing BBR development and testing:
  *   https://groups.google.com/forum/#!forum/bbr-dev
  *
- * NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled,
- * since pacing is integral to the BBR design and implementation.
- * BBR without pacing would not function properly, and may incur unnecessary
- * high packet loss rates.
+ * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
+ * otherwise TCP stack falls back to an internal pacing using one high
+ * resolution timer per TCP socket and may use more resources.
  */
 #include <linux/module.h>
 #include <net/tcp.h>
@@ -830,6 +829,8 @@ static void bbr_init(struct sock *sk)
 	bbr->cycle_idx = 0;
 	bbr_reset_lt_bw_sampling(sk);
 	bbr_reset_startup_mode(sk);
+
+	cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
 }
 
 static u32 bbr_sndbuf_expand(struct sock *sk)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4858e190f6ac130c9441f58cb8944cc82bf67270..a32172d69a03cbe76b45ec3094222f6c3a73e27d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -904,6 +904,72 @@ void tcp_wfree(struct sk_buff *skb)
 	sk_free(sk);
 }
 
+/* Note: Called under hard irq.
+ * We can not call TCP stack right away.
+ */
+enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
+{
+	struct tcp_sock *tp = container_of(timer, struct tcp_sock, pacing_timer);
+	struct sock *sk = (struct sock *)tp;
+	unsigned long nval, oval;
+
+	for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
+		struct tsq_tasklet *tsq;
+		bool empty;
+
+		if (oval & TSQF_QUEUED)
+			break;
+
+		nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
+		nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
+		if (nval != oval)
+			continue;
+
+		if (!atomic_inc_not_zero(&sk->sk_wmem_alloc))
+			break;
+		/* queue this socket to tasklet queue */
+		tsq = this_cpu_ptr(&tsq_tasklet);
+		empty = list_empty(&tsq->head);
+		list_add(&tp->tsq_node, &tsq->head);
+		if (empty)
+			tasklet_schedule(&tsq->tasklet);
+		break;
+	}
+	return HRTIMER_NORESTART;
+}
+
+/* BBR congestion control needs pacing.
+ * Same remark for SO_MAX_PACING_RATE.
+ * sch_fq packet scheduler is efficiently handling pacing,
+ * but is not always installed/used.
+ * Return true if TCP stack should pace packets itself.
+ */
+static bool tcp_needs_internal_pacing(const struct sock *sk)
+{
+	return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
+}
+
+static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
+{
+	u64 len_ns;
+	u32 rate;
+
+	if (!tcp_needs_internal_pacing(sk))
+		return;
+	rate = sk->sk_pacing_rate;
+	if (!rate || rate == ~0U)
+		return;
+
+	/* Should account for header sizes as sch_fq does,
+	 * but lets make things simple.
+	 */
+	len_ns = (u64)skb->len * NSEC_PER_SEC;
+	do_div(len_ns, rate);
+	hrtimer_start(&tcp_sk(sk)->pacing_timer,
+		      ktime_add_ns(ktime_get(), len_ns),
+		      HRTIMER_MODE_ABS_PINNED);
+}
+
 /* This routine actually transmits TCP packets queued in by
  * tcp_do_sendmsg().  This is used by both the initial
  * transmission and possible later retransmissions.
@@ -1034,6 +1100,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	if (skb->len != tcp_header_size) {
 		tcp_event_data_sent(tp, sk);
 		tp->data_segs_out += tcp_skb_pcount(skb);
+		tcp_internal_pacing(sk, skb);
 	}
 
 	if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
@@ -2086,6 +2153,12 @@ static int tcp_mtu_probe(struct sock *sk)
 	return -1;
 }
 
+static bool tcp_pacing_check(const struct sock *sk)
+{
+	return tcp_needs_internal_pacing(sk) &&
+	       hrtimer_active(&tcp_sk(sk)->pacing_timer);
+}
+
 /* TCP Small Queues :
  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
  * (These limits are doubled for retransmits)
@@ -2210,6 +2283,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
+		if (tcp_pacing_check(sk))
+			break;
+
 		tso_segs = tcp_init_tso_segs(skb, mss_now);
 		BUG_ON(!tso_segs);
 
@@ -2878,6 +2954,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 
 		if (skb == tcp_send_head(sk))
 			break;
+
+		if (tcp_pacing_check(sk))
+			break;
+
 		/* we could do better than to assign each time */
 		if (!hole)
 			tp->retransmit_skb_hint = skb;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 14672543cf0bd27bc59976d5cec38d2d3bbcdd2c..86934bcf685a65ec3af3d22f1801ffa33eea76e2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -710,4 +710,7 @@ void tcp_init_xmit_timers(struct sock *sk)
 {
 	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
 				  &tcp_keepalive_timer);
+	hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED);
+	tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
 }
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index b488721a0059adb24aea47240afa0164a6e467a9..147fde73a0f566e8f6a26718adf176ef3943afa0 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -390,9 +390,17 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		q->stat_tcp_retrans++;
 	qdisc_qstats_backlog_inc(sch, skb);
 	if (fq_flow_is_detached(f)) {
+		struct sock *sk = skb->sk;
+
 		fq_flow_add_tail(&q->new_flows, f);
 		if (time_after(jiffies, f->age + q->flow_refill_delay))
 			f->credit = max_t(u32, f->credit, q->quantum);
+		if (sk && q->rate_enable) {
+			if (unlikely(smp_load_acquire(&sk->sk_pacing_status) !=
+				     SK_PACING_FQ))
+				smp_store_release(&sk->sk_pacing_status,
+						  SK_PACING_FQ);
+		}
 		q->inactive_flows--;
 	}
 
-- 
2.13.0.303.g4ebf302169-goog

^ permalink raw reply related

* Re: [PATCH] ravb: add wake-on-lan support via magic packet
From: Simon Horman @ 2017-05-16 11:01 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Niklas Söderlund, Sergei Shtylyov, netdev@vger.kernel.org,
	Linux-Renesas
In-Reply-To: <CAMuHMdVkitTgVmUrPu5+z2JwnwHop4bJ5anMq1SmCeFfV=OxQg@mail.gmail.com>

On Tue, May 16, 2017 at 11:07:34AM +0200, Geert Uytterhoeven wrote:
> Hi Niklas,
> 
> On Tue, May 16, 2017 at 11:02 AM, Niklas Söderlund
> <niklas.soderlund@ragnatech.se> wrote:
> >> > Whit all this being said I still like to withdraw this patch as I found
> >> > another fault with it, ravb_wol_restore() will unconditionally be called
> >> > while ravb_wol_setup() will only be called if netif_running(ndev). This
> >> > is en easy fix and I will send out a v2 once we figure out what to do
> >> > about the clock.
> >>
> >> The clock issue is external to the ravb driver. If it works with
> >> s2idle, it should
> >> be OK.
> >
> > Do you think it's a good idea to move ahead with a v2 of the ravb WoL
> > patch to fix the unrelated issue and aim for it to be picked up prior to
> > suspend/resume support is added to the CPG/MSSR?
> 
> Sure.
> 
> It can still be used on R-Car Gen2, where we're not s*d by mandatory PSCI.

Is there some way for - e.g. the driver - to not enable WoL on Gen3 SoCs
until the clock issues is sorted out? I'm quite happy to enable features
where they work; not so much where they don't.

^ permalink raw reply

* Re: [ISSUE: sky2 - rx error] Link stops working under heavy traffic load connected to a mv88e6176
From: Rafa Corvillo @ 2017-05-16 10:50 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Stephen Hemminger, netdev
In-Reply-To: <20170508123852.GI27709@lunn.ch>

On 08/05/17 14:38, Andrew Lunn wrote:
>>> static unsigned sky2_get_rx_threshold(struct sky2_port *sky2)
>>> {
>>>          unsigned size;
>>>
>>>          /* Space needed for frame data + headers rounded up */
>>>          size = roundup(sky2->netdev->mtu + ETH_HLEN + VLAN_HLEN, 8);
>>>
>>>          /* Stopping point for hardware truncation */
>>>          return (size - 8) / sizeof(u32);
>>> }
>>>
>>> This is not going to be big enough for a frame with a DSA header.
>>>
>>
>> Then, would be a good fix add 8 bytes to the size variable in this function?
>
> Yes. Also look at the transmit code, is there again a limit based on
> the MTU.

Hi Andrew,

Adding 8 bytes (sky2->netdev->mtu + ETH_HLEN + VLAN_HLEN + 8 
(EDSA_HLEN)) does not fix the error, because the interface keep having a 
maximum length of 1518 bytes (sky2->netdev->mtu + ETH_HLEN + VLAN_HLEN).

The polling function of sky2 driver (sky2_poll) calls to the function 
sky2_status_intr (with the parameter struct sky2_hw *hw). The 
sky2_status_intr function gets the status of the list elements (struct 
sky2_status_le) from the sky2_hw parameter and, from the sky2_status_le, 
gets the maximum length (1518) and the status code (0x5f20010). When the 
latter function (sky2_status_intr) calls to the sky2_receive function 
with the parameters length and status, it reports about the error (rx 
error, status 0x5f20010
length 1518).

I don't know who sets the maximum length (1518) and the status code 
(0x5f20010) of the packets. Is it possible that these values to be set 
outside the sky2 code?

Thanks,

Rafa

>
>> Settings for marvell:
>>          Supported ports: [ TP ]
>>          Supported link modes:   10baseT/Half 10baseT/Full
>>                                  100baseT/Half 100baseT/Full
>>                                  1000baseT/Half 1000baseT/Full
>>          Supported pause frame use: No
>>          Supports auto-negotiation: Yes
>>          Advertised link modes:  10baseT/Half 10baseT/Full
>>                                  100baseT/Half 100baseT/Full
>>                                  1000baseT/Half 1000baseT/Full
>>          Advertised pause frame use: No
>>          Advertised auto-negotiation: No
>>          Speed: 1000Mb/s
>>          Duplex: Full
>>          Port: Twisted Pair
>>          PHYAD: 0
>>          Transceiver: internal
>>          Auto-negotiation: on
>>          MDI-X: Unknown
>>          Supports Wake-on: pg
>>          Wake-on: d
>>          Current message level: 0x000000ff (255)
>>                                 drv probe link timer ifdown ifup
>> rx_err tx_err
>>          Link detected: yes
>>
>
> So this suggests there is a real PHY there, and it is
> auto-negotiating.
>
> What we cannot see is the status for the PHY it connects to. But since
> this PHY has established a link, the other PHY is probably O.K. It is
> just a bit unsafe, since you are relying on reset behaviour. There is
> nothing in software configuring the second PHY to make it
> auto-negotiate.
>
> 	Andrew
>
>

^ permalink raw reply

* Re: [PATCH v2 net-next 3/7] net: add function to retrieve original skb device using NAPI ID
From: Miroslav Lichvar @ 2017-05-16  9:51 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Richard Cochran, Willem de Bruijn,
	Soheil Hassas Yeganeh, Keller, Jacob E, Denny Page, Jiri Benc
In-Reply-To: <CAF=yD-+n-KF3r1zDiKEQCQsPMkF_DxGZG3AEpzbX4jXFUXcW-Q@mail.gmail.com>

On Tue, May 02, 2017 at 12:16:13PM -0400, Willem de Bruijn wrote:
> On Tue, May 2, 2017 at 6:10 AM, Miroslav Lichvar <mlichvar@redhat.com> wrote:
> >  /**
> > + *     dev_get_by_napi_id - find a device by napi_id
> > + *     @napi_id: ID of the NAPI struct
> > + *
> > + *     Search for an interface by NAPI ID. Returns %NULL if the device
> > + *     is not found or a pointer to the device. The device has not had
> > + *     its reference counter increased so the caller must be careful
> > + *     about locking. The caller must hold RCU lock.
> 
> Instead of a comment, can check with
> 
>   WARN_ON_ONCE(!rcu_read_lock_held());

The other dev_get_* functions have the same comment, so I think it's
better to keep it for consistency. I'll add the warning and sent
a new series with the other changes you have suggested.

Thanks,

-- 
Miroslav Lichvar

^ permalink raw reply

* Re: [PATCH v2 0/8] NFC: fix device allocation and nfcmrvl crashes
From: Johan Hovold @ 2017-05-16  9:42 UTC (permalink / raw)
  To: Samuel Ortiz; +Cc: Johan Hovold, linux-wireless, netdev, linux-kernel
In-Reply-To: <20170426224238.GB3717@zurbaran.ger.intel.com>

Hi Samuel,

On Thu, Apr 27, 2017 at 12:42:38AM +0200, Samuel Ortiz wrote:
> Hi Johan,
> 
> On Thu, Mar 30, 2017 at 12:15:34PM +0200, Johan Hovold wrote:
> > This started out with the observation that the nfcmrvl_uart driver
> > unconditionally dereferenced the tty class device despite the fact that
> > not every tty has an associated struct device (Unix98 ptys). Some
> > further changes were needed in the common nfcmrvl code to fully address
> > this, some of which also incidentally fixed a few related bugs (e.g.
> > resource leaks in error paths).
> > 
> > While fixing this I stumbled over a regression in NFC core that lead to
> > broken registration error paths and misnamed workqueues.
> > 
> > Note that this has only been tested by configuring the n_hci line
> > discipline for different ttys without any actual NFC hardware connected.
> > 
> > Johan
> > 
> > 
> > Changes in v2
> >  - fix typo in commit message (1/8)
> >  - release reset gpio in error paths (3/8)
> >  - fix description of patch impact (3/8)
> >  - allow gpio 0 to be used for reset signalling (8/8, new)
> > 
> > 
> > Johan Hovold (8):
> >   NFC: fix broken device allocation
> >   NFC: nfcmrvl_uart: add missing tty-device sanity check
> >   NFC: nfcmrvl: do not use device-managed resources
> >   NFC: nfcmrvl: use nfc-device for firmware download
> >   NFC: nfcmrvl: fix firmware-management initialisation
> >   NFC: nfcmrvl_uart: fix device-node leak during probe
> >   NFC: nfcmrvl_usb: use interface as phy device
> >   NFC: nfcmrvl: allow gpio 0 for reset signalling
> Applied, thanks.

These never made it into net-next and 4.12-rc1, so will you be sending
them on as fixes for 4.12-rc instead?

Thanks,
Johan

^ permalink raw reply

* Re: [PATCH] xfrm: use memdup_user
From: Steffen Klassert @ 2017-05-16  9:30 UTC (permalink / raw)
  To: Geliang Tang; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel
In-Reply-To: <7d89949ab688323cf05273b7dfaca94348118594.1493780592.git.geliangtang@gmail.com>

On Sat, May 06, 2017 at 11:42:21PM +0800, Geliang Tang wrote:
> Use memdup_user() helper instead of open-coding to simplify the code.
> 
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Applied to ipsec-next, thanks!

^ permalink raw reply

* [PATCH net-next v2 3/3] udp: keep the sk_receive_queue held when splicing
From: Paolo Abeni @ 2017-05-16  9:20 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494881617.git.pabeni@redhat.com>

On packet reception, when we are forced to splice the
sk_receive_queue, we can keep the related lock held, so
that we can avoid re-acquiring it, if fwd memory
scheduling is required.

v1 -> v2:
  the rx_queue_lock_held param in udp_rmem_release() is
  now a bool

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/ipv4/udp.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 492c76b..7bd56c9 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1164,7 +1164,8 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 
 /* fully reclaim rmem/fwd memory allocated for skb */
-static void udp_rmem_release(struct sock *sk, int size, int partial)
+static void udp_rmem_release(struct sock *sk, int size, int partial,
+			     bool rx_queue_lock_held)
 {
 	struct udp_sock *up = udp_sk(sk);
 	struct sk_buff_head *sk_queue;
@@ -1181,9 +1182,13 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 	}
 	up->forward_deficit = 0;
 
-	/* acquire the sk_receive_queue for fwd allocated memory scheduling */
+	/* acquire the sk_receive_queue for fwd allocated memory scheduling,
+	 * if the called don't held it already
+	 */
 	sk_queue = &sk->sk_receive_queue;
-	spin_lock(&sk_queue->lock);
+	if (!rx_queue_lock_held)
+		spin_lock(&sk_queue->lock);
+
 
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
@@ -1197,7 +1202,8 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 	/* this can save us from acquiring the rx queue lock on next receive */
 	skb_queue_splice_tail_init(sk_queue, &up->reader_queue);
 
-	spin_unlock(&sk_queue->lock);
+	if (!rx_queue_lock_held)
+		spin_unlock(&sk_queue->lock);
 }
 
 /* Note: called with reader_queue.lock held.
@@ -1207,10 +1213,16 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
  */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->dev_scratch, 1);
+	udp_rmem_release(sk, skb->dev_scratch, 1, false);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
+/* as above, but the caller held the rx queue lock, too */
+void udp_skb_dtor_locked(struct sock *sk, struct sk_buff *skb)
+{
+	udp_rmem_release(sk, skb->dev_scratch, 1, true);
+}
+
 /* Idea of busylocks is to let producers grab an extra spinlock
  * to relieve pressure on the receive_queue spinlock shared by consumer.
  * Under flood, this means that only one producer can be in line
@@ -1325,7 +1337,7 @@ void udp_destruct_sock(struct sock *sk)
 		total += skb->truesize;
 		kfree_skb(skb);
 	}
-	udp_rmem_release(sk, total, 0);
+	udp_rmem_release(sk, total, 0, true);
 
 	inet_sock_destruct(sk);
 }
@@ -1397,7 +1409,7 @@ static int first_packet_length(struct sock *sk)
 	}
 	res = skb ? skb->len : -1;
 	if (total)
-		udp_rmem_release(sk, total, 1);
+		udp_rmem_release(sk, total, 1, false);
 	spin_unlock_bh(&rcvq->lock);
 	return res;
 }
@@ -1471,16 +1483,20 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
 				goto busy_check;
 			}
 
-			/* refill the reader queue and walk it again */
+			/* refill the reader queue and walk it again
+			 * keep both queues locked to avoid re-acquiring
+			 * the sk_receive_queue lock if fwd memory scheduling
+			 * is needed.
+			 */
 			_off = *off;
 			spin_lock(&sk_queue->lock);
 			skb_queue_splice_tail_init(sk_queue, queue);
-			spin_unlock(&sk_queue->lock);
 
 			skb = __skb_try_recv_from_queue(sk, queue, flags,
-							udp_skb_destructor,
+							udp_skb_dtor_locked,
 							peeked, &_off, err,
 							&last);
+			spin_unlock(&sk_queue->lock);
 			spin_unlock_bh(&queue->lock);
 			if (skb) {
 				*off = _off;
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v2 2/3] udp: use a separate rx queue for packet reception
From: Paolo Abeni @ 2017-05-16  9:20 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494881617.git.pabeni@redhat.com>

under udp flood the sk_receive_queue spinlock is heavily contended.
This patch try to reduce the contention on such lock adding a
second receive queue to the udp sockets; recvmsg() looks first
in such queue and, only if empty, tries to fetch the data from
sk_receive_queue. The latter is spliced into the newly added
queue every time the receive path has to acquire the
sk_receive_queue lock.

The accounting of forward allocated memory is still protected with
the sk_receive_queue lock, so udp_rmem_release() needs to acquire
both locks when the forward deficit is flushed.

On specific scenarios we can end up acquiring and releasing the
sk_receive_queue lock multiple times; that will be covered by
the next patch

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/udp.h   |   3 ++
 include/net/udp.h     |   9 +---
 include/net/udplite.h |   2 +-
 net/ipv4/udp.c        | 138 ++++++++++++++++++++++++++++++++++++++++++++------
 net/ipv6/udp.c        |   3 +-
 5 files changed, 131 insertions(+), 24 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 6cb4061..eaea63b 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -80,6 +80,9 @@ struct udp_sock {
 						struct sk_buff *skb,
 						int nhoff);
 
+	/* udp_recvmsg try to use this before splicing sk_receive_queue */
+	struct sk_buff_head	reader_queue ____cacheline_aligned_in_smp;
+
 	/* This field is dirtied by udp_recvmsg() */
 	int		forward_deficit;
 };
diff --git a/include/net/udp.h b/include/net/udp.h
index 3391dbd..1468dbd 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -249,13 +249,8 @@ void udp_destruct_sock(struct sock *sk);
 void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len);
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb);
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb);
-static inline struct sk_buff *
-__skb_recv_udp(struct sock *sk, unsigned int flags, int noblock, int *peeked,
-	       int *off, int *err)
-{
-	return __skb_recv_datagram(sk, flags | (noblock ? MSG_DONTWAIT : 0),
-				   udp_skb_destructor, peeked, off, err);
-}
+struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
+			       int noblock, int *peeked, int *off, int *err);
 static inline struct sk_buff *skb_recv_udp(struct sock *sk, unsigned int flags,
 					   int noblock, int *err)
 {
diff --git a/include/net/udplite.h b/include/net/udplite.h
index ea34052..b7a18f6 100644
--- a/include/net/udplite.h
+++ b/include/net/udplite.h
@@ -26,8 +26,8 @@ static __inline__ int udplite_getfrag(void *from, char *to, int  offset,
 /* Designate sk as UDP-Lite socket */
 static inline int udplite_sk_init(struct sock *sk)
 {
+	udp_init_sock(sk);
 	udp_sk(sk)->pcflag = UDPLITE_BIT;
-	sk->sk_destruct = udp_destruct_sock;
 	return 0;
 }
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ea6e4cf..492c76b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1167,19 +1167,24 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
 	struct udp_sock *up = udp_sk(sk);
+	struct sk_buff_head *sk_queue;
 	int amt;
 
 	if (likely(partial)) {
 		up->forward_deficit += size;
 		size = up->forward_deficit;
 		if (size < (sk->sk_rcvbuf >> 2) &&
-		    !skb_queue_empty(&sk->sk_receive_queue))
+		    !skb_queue_empty(&up->reader_queue))
 			return;
 	} else {
 		size += up->forward_deficit;
 	}
 	up->forward_deficit = 0;
 
+	/* acquire the sk_receive_queue for fwd allocated memory scheduling */
+	sk_queue = &sk->sk_receive_queue;
+	spin_lock(&sk_queue->lock);
+
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
 	sk->sk_forward_alloc -= amt;
@@ -1188,9 +1193,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
 
 	atomic_sub(size, &sk->sk_rmem_alloc);
+
+	/* this can save us from acquiring the rx queue lock on next receive */
+	skb_queue_splice_tail_init(sk_queue, &up->reader_queue);
+
+	spin_unlock(&sk_queue->lock);
 }
 
-/* Note: called with sk_receive_queue.lock held.
+/* Note: called with reader_queue.lock held.
  * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch
  * This avoids a cache line miss while receive_queue lock is held.
  * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
@@ -1306,10 +1316,12 @@ EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
 void udp_destruct_sock(struct sock *sk)
 {
 	/* reclaim completely the forward allocated memory */
+	struct udp_sock *up = udp_sk(sk);
 	unsigned int total = 0;
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+	skb_queue_splice_tail_init(&sk->sk_receive_queue, &up->reader_queue);
+	while ((skb = __skb_dequeue(&up->reader_queue)) != NULL) {
 		total += skb->truesize;
 		kfree_skb(skb);
 	}
@@ -1321,6 +1333,7 @@ EXPORT_SYMBOL_GPL(udp_destruct_sock);
 
 int udp_init_sock(struct sock *sk)
 {
+	skb_queue_head_init(&udp_sk(sk)->reader_queue);
 	sk->sk_destruct = udp_destruct_sock;
 	return 0;
 }
@@ -1338,6 +1351,26 @@ void skb_consume_udp(struct sock *sk, struct sk_buff *skb, int len)
 }
 EXPORT_SYMBOL_GPL(skb_consume_udp);
 
+static struct sk_buff *__first_packet_length(struct sock *sk,
+					     struct sk_buff_head *rcvq,
+					     int *total)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_peek(rcvq)) != NULL &&
+	       udp_lib_checksum_complete(skb)) {
+		__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
+				IS_UDPLITE(sk));
+		__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
+				IS_UDPLITE(sk));
+		atomic_inc(&sk->sk_drops);
+		__skb_unlink(skb, rcvq);
+		*total += skb->truesize;
+		kfree_skb(skb);
+	}
+	return skb;
+}
+
 /**
  *	first_packet_length	- return length of first packet in receive queue
  *	@sk: socket
@@ -1347,22 +1380,20 @@ EXPORT_SYMBOL_GPL(skb_consume_udp);
  */
 static int first_packet_length(struct sock *sk)
 {
-	struct sk_buff_head *rcvq = &sk->sk_receive_queue;
+	struct sk_buff_head *rcvq = &udp_sk(sk)->reader_queue;
+	struct sk_buff_head *sk_queue = &sk->sk_receive_queue;
 	struct sk_buff *skb;
 	int total = 0;
 	int res;
 
 	spin_lock_bh(&rcvq->lock);
-	while ((skb = skb_peek(rcvq)) != NULL &&
-		udp_lib_checksum_complete(skb)) {
-		__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS,
-				IS_UDPLITE(sk));
-		__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
-				IS_UDPLITE(sk));
-		atomic_inc(&sk->sk_drops);
-		__skb_unlink(skb, rcvq);
-		total += skb->truesize;
-		kfree_skb(skb);
+	skb = __first_packet_length(sk, rcvq, &total);
+	if (!skb && !skb_queue_empty(sk_queue)) {
+		spin_lock(&sk_queue->lock);
+		skb_queue_splice_tail_init(sk_queue, rcvq);
+		spin_unlock(&sk_queue->lock);
+
+		skb = __first_packet_length(sk, rcvq, &total);
 	}
 	res = skb ? skb->len : -1;
 	if (total)
@@ -1400,6 +1431,79 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(udp_ioctl);
 
+struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
+			       int noblock, int *peeked, int *off, int *err)
+{
+	struct sk_buff_head *sk_queue = &sk->sk_receive_queue;
+	struct sk_buff_head *queue;
+	struct sk_buff *last;
+	long timeo;
+	int error;
+
+	queue = &udp_sk(sk)->reader_queue;
+	flags |= noblock ? MSG_DONTWAIT : 0;
+	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+	do {
+		struct sk_buff *skb;
+
+		error = sock_error(sk);
+		if (error)
+			break;
+
+		error = -EAGAIN;
+		*peeked = 0;
+		do {
+			int _off = *off;
+
+			spin_lock_bh(&queue->lock);
+			skb = __skb_try_recv_from_queue(sk, queue, flags,
+							udp_skb_destructor,
+							peeked, &_off, err,
+							&last);
+			if (skb) {
+				spin_unlock_bh(&queue->lock);
+				*off = _off;
+				return skb;
+			}
+
+			if (skb_queue_empty(sk_queue)) {
+				spin_unlock_bh(&queue->lock);
+				goto busy_check;
+			}
+
+			/* refill the reader queue and walk it again */
+			_off = *off;
+			spin_lock(&sk_queue->lock);
+			skb_queue_splice_tail_init(sk_queue, queue);
+			spin_unlock(&sk_queue->lock);
+
+			skb = __skb_try_recv_from_queue(sk, queue, flags,
+							udp_skb_destructor,
+							peeked, &_off, err,
+							&last);
+			spin_unlock_bh(&queue->lock);
+			if (skb) {
+				*off = _off;
+				return skb;
+			}
+
+busy_check:
+			if (!sk_can_busy_loop(sk))
+				break;
+
+			sk_busy_loop(sk, flags & MSG_DONTWAIT);
+		} while (!skb_queue_empty(sk_queue));
+
+		/* sk_queue is empty, reader_queue may contain peeked packets */
+	} while (timeo &&
+		 !__skb_wait_for_more_packets(sk, &error, &timeo,
+					      (struct sk_buff *)sk_queue));
+
+	*err = error;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(__skb_recv_udp);
+
 /*
  * 	This should be easy, if there is something there we
  * 	return it, otherwise we block.
@@ -1490,7 +1594,8 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 	return err;
 
 csum_copy_err:
-	if (!__sk_queue_drop_skb(sk, skb, flags, udp_skb_destructor)) {
+	if (!__sk_queue_drop_skb(sk, &udp_sk(sk)->reader_queue, skb, flags,
+				 udp_skb_destructor)) {
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 		UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
 	}
@@ -2325,6 +2430,9 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	unsigned int mask = datagram_poll(file, sock, wait);
 	struct sock *sk = sock->sk;
 
+	if (!skb_queue_empty(&udp_sk(sk)->reader_queue))
+		mask |= POLLIN | POLLRDNORM;
+
 	sock_rps_record_flow(sk);
 
 	/* Check for false positives due to checksum errors */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 04862ab..f78fdf8 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -455,7 +455,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	return err;
 
 csum_copy_err:
-	if (!__sk_queue_drop_skb(sk, skb, flags, udp_skb_destructor)) {
+	if (!__sk_queue_drop_skb(sk, &udp_sk(sk)->reader_queue, skb, flags,
+				 udp_skb_destructor)) {
 		if (is_udp4) {
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_CSUMERRORS, is_udplite);
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v2 1/3] net/sock: factor out dequeue/peek with offset code
From: Paolo Abeni @ 2017-05-16  9:20 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <cover.1494881617.git.pabeni@redhat.com>

And update __sk_queue_drop_skb() to work on the specified queue.
This will help the udp protocol to use an additional private
rx queue in a later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  7 ++++
 include/net/sock.h     |  4 +--
 net/core/datagram.c    | 90 ++++++++++++++++++++++++++++----------------------
 3 files changed, 60 insertions(+), 41 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95..bfc7892 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff *skb)
 
 int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
 				const struct sk_buff *skb);
+struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
+					  struct sk_buff_head *queue,
+					  unsigned int flags,
+					  void (*destructor)(struct sock *sk,
+							   struct sk_buff *skb),
+					  int *peeked, int *off, int *err,
+					  struct sk_buff **last);
 struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
 					void (*destructor)(struct sock *sk,
 							   struct sk_buff *skb),
diff --git a/include/net/sock.h b/include/net/sock.h
index 66349e4..49d226f 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct timer_list *timer,
 
 void sk_stop_timer(struct sock *sk, struct timer_list *timer);
 
-int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-			unsigned int flags,
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
+			struct sk_buff *skb, unsigned int flags,
 			void (*destructor)(struct sock *sk,
 					   struct sk_buff *skb));
 int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index db1866f2..a4592b4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff *skb)
 	return skb;
 }
 
+struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
+					  struct sk_buff_head *queue,
+					  unsigned int flags,
+					  void (*destructor)(struct sock *sk,
+							   struct sk_buff *skb),
+					  int *peeked, int *off, int *err,
+					  struct sk_buff **last)
+{
+	struct sk_buff *skb;
+
+	*last = queue->prev;
+	skb_queue_walk(queue, skb) {
+		if (flags & MSG_PEEK) {
+			if (*off >= skb->len && (skb->len || *off ||
+						 skb->peeked)) {
+				*off -= skb->len;
+				continue;
+			}
+			if (!skb->len) {
+				skb = skb_set_peeked(skb);
+				if (unlikely(IS_ERR(skb))) {
+					*err = PTR_ERR(skb);
+					return skb;
+				}
+			}
+			*peeked = 1;
+			atomic_inc(&skb->users);
+		} else {
+			__skb_unlink(skb, queue);
+			if (destructor)
+				destructor(sk, skb);
+		}
+		return skb;
+	}
+	return NULL;
+}
+
 /**
  *	__skb_try_recv_datagram - Receive a datagram skbuff
  *	@sk: socket
@@ -216,46 +253,20 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 
 	*peeked = 0;
 	do {
+		int _off = *off;
+
 		/* Again only user level code calls this function, so nothing
 		 * interrupt level will suddenly eat the receive_queue.
 		 *
 		 * Look at current nfs client by the way...
 		 * However, this function was correct in any case. 8)
 		 */
-		int _off = *off;
-
-		*last = (struct sk_buff *)queue;
 		spin_lock_irqsave(&queue->lock, cpu_flags);
-		skb_queue_walk(queue, skb) {
-			*last = skb;
-			if (flags & MSG_PEEK) {
-				if (_off >= skb->len && (skb->len || _off ||
-							 skb->peeked)) {
-					_off -= skb->len;
-					continue;
-				}
-				if (!skb->len) {
-					skb = skb_set_peeked(skb);
-					if (IS_ERR(skb)) {
-						error = PTR_ERR(skb);
-						spin_unlock_irqrestore(&queue->lock,
-								       cpu_flags);
-						goto no_packet;
-					}
-				}
-				*peeked = 1;
-				atomic_inc(&skb->users);
-			} else {
-				__skb_unlink(skb, queue);
-				if (destructor)
-					destructor(sk, skb);
-			}
-			spin_unlock_irqrestore(&queue->lock, cpu_flags);
-			*off = _off;
-			return skb;
-		}
-
+		skb = __skb_try_recv_from_queue(sk, queue, flags, destructor,
+						peeked, &_off, err, last);
 		spin_unlock_irqrestore(&queue->lock, cpu_flags);
+		if (skb)
+			return skb;
 
 		if (!sk_can_busy_loop(sk))
 			break;
@@ -335,8 +346,8 @@ void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len)
 }
 EXPORT_SYMBOL(__skb_free_datagram_locked);
 
-int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
-			unsigned int flags,
+int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
+			struct sk_buff *skb, unsigned int flags,
 			void (*destructor)(struct sock *sk,
 					   struct sk_buff *skb))
 {
@@ -344,15 +355,15 @@ int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
 
 	if (flags & MSG_PEEK) {
 		err = -ENOENT;
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		if (skb == skb_peek(&sk->sk_receive_queue)) {
-			__skb_unlink(skb, &sk->sk_receive_queue);
+		spin_lock_bh(&sk_queue->lock);
+		if (skb == skb_peek(sk_queue)) {
+			__skb_unlink(skb, sk_queue);
 			atomic_dec(&skb->users);
 			if (destructor)
 				destructor(sk, skb);
 			err = 0;
 		}
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
+		spin_unlock_bh(&sk_queue->lock);
 	}
 
 	atomic_inc(&sk->sk_drops);
@@ -383,7 +394,8 @@ EXPORT_SYMBOL(__sk_queue_drop_skb);
 
 int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags)
 {
-	int err = __sk_queue_drop_skb(sk, skb, flags, NULL);
+	int err = __sk_queue_drop_skb(sk, &sk->sk_receive_queue, skb, flags,
+				      NULL);
 
 	kfree_skb(skb);
 	sk_mem_reclaim_partial(sk);
-- 
2.9.3

^ permalink raw reply related

* [PATCH net-next v2 0/3] udp: scalability improvements
From: Paolo Abeni @ 2017-05-16  9:20 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet

This patch series implement an idea suggested by Eric Dumazet to
reduce the contention of the udp sk_receive_queue lock when the socket is
under flood.

An ancillary queue is added to the udp socket, and the socket always
tries first to read packets from such queue. If it's empty, we splice
the content from sk_receive_queue into the ancillary queue.

The first patch introduces some helpers to keep the udp code small, and the
following two implement the ancillary queue strategy. The code is split
to hopefully help the reviewing process.

The measured overall gain under udp flood is up to the 30% depending on
the numa layout and the number of ingress queue used by the relevant nic.

The performance numbers have been gathered using pktgen as sender, with 64
bytes packets, random src port on a host b2b connected via a 10Gbs link
with the dut.

The receiver used the udp_sink program by Jesper [1] and an h/w l4 rx hash on
the ingress nic, so that the number of ingress nic rx queues hit by the udp
traffic could be controlled via ethtool -L.

The udp_sink program was bound to the first idle cpu, to get more
stable numbers.

On a single numa node receiver:

nic rx queues           vanilla                 patched kernel
1                       1820 kpps               1900 kpps
2                       1950 kpps               2500 kpps
16                      1670 kpps               2120 kpps

When using a single nic rx queue, busy polling was also enabled,
elsewhere, in the above scenario, the bh processing becomes the bottle-neck
and this produces large artifacts in the measured performances (e.g.
improving the udp sink run time, decreases the overall tput, since more
action from the scheduler comes into play).

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

v1 -> v2:
  Patches 1/3 and 2/3 are unchanged, in patch 3/3 the rx_queue_lock_held param
  of udp_rmem_release() is now a bool.

Paolo Abeni (3):
  net/sock: factor out dequeue/peek with offset code
  udp: use a separate rx queue for packet reception
  udp: keep the sk_receive_queue held when splicing

 include/linux/skbuff.h |   7 +++
 include/linux/udp.h    |   3 +
 include/net/sock.h     |   4 +-
 include/net/udp.h      |   9 +--
 include/net/udplite.h  |   2 +-
 net/core/datagram.c    |  90 +++++++++++++++------------
 net/ipv4/udp.c         | 162 +++++++++++++++++++++++++++++++++++++++++++------
 net/ipv6/udp.c         |   3 +-
 8 files changed, 211 insertions(+), 69 deletions(-)

-- 
2.9.3

^ permalink raw reply

* Re: [PATCH] ravb: add wake-on-lan support via magic packet
From: Geert Uytterhoeven @ 2017-05-16  9:07 UTC (permalink / raw)
  To: Niklas Söderlund
  Cc: Sergei Shtylyov, netdev@vger.kernel.org, Linux-Renesas
In-Reply-To: <20170516090214.GA30768@bigcity.dyn.berto.se>

Hi Niklas,

On Tue, May 16, 2017 at 11:02 AM, Niklas Söderlund
<niklas.soderlund@ragnatech.se> wrote:
>> > Whit all this being said I still like to withdraw this patch as I found
>> > another fault with it, ravb_wol_restore() will unconditionally be called
>> > while ravb_wol_setup() will only be called if netif_running(ndev). This
>> > is en easy fix and I will send out a v2 once we figure out what to do
>> > about the clock.
>>
>> The clock issue is external to the ravb driver. If it works with
>> s2idle, it should
>> be OK.
>
> Do you think it's a good idea to move ahead with a v2 of the ravb WoL
> patch to fix the unrelated issue and aim for it to be picked up prior to
> suspend/resume support is added to the CPG/MSSR?

Sure.

It can still be used on R-Car Gen2, where we're not s*d by mandatory PSCI.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH] ravb: add wake-on-lan support via magic packet
From: Niklas Söderlund @ 2017-05-16  9:02 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Sergei Shtylyov, netdev@vger.kernel.org, Linux-Renesas
In-Reply-To: <CAMuHMdWZ8XLX0Zu0D54Ex+3RaFNm_fJyFQvyo+qz10HpB2nGqQ@mail.gmail.com>

Hi Geert,

Thanks for your feedback.

On 2017-05-16 09:48:51 +0200, Geert Uytterhoeven wrote:
> Hi Niklas,
> 
> On Fri, May 12, 2017 at 11:12 PM, Niklas Söderlund
> <niklas.soderlund@ragnatech.se> wrote:
> > On 2017-05-12 16:43:55 +0200, Geert Uytterhoeven wrote:
> >> On Fri, May 12, 2017 at 4:32 PM, Niklas Söderlund
> >> <niklas.soderlund@ragnatech.se> wrote:
> >> > On 2017-05-12 14:58:53 +0200, Niklas Söderlund wrote:
> >> >> On 2017-05-12 14:47:53 +0200, Geert Uytterhoeven wrote:
> >> >> > On Fri, May 12, 2017 at 12:27 AM, Niklas Söderlund
> >> >> > <niklas.soderlund+renesas@ragnatech.se> wrote:
> >> >> > When using PSCI suspend/resume, and Wake-on-LAN cannot work due to
> >> >> > PSCI firmware issues, Ethernet fails to come up afterwards:
> >> >> >
> >> >> >     ravb e6800000.ethernet eth0: failed to switch device to config mode
> >> >> >     ravb e6800000.ethernet eth0: device will be stopped after h/w
> >> >> > processes are done.
> >> >> >     ravb e6800000.ethernet eth0: failed to switch device to config mode
> >> >> >     dpm_run_callback(): ravb_resume+0x0/0x148 returns -110
> >> >> >     PM: Device e6800000.ethernet failed to resume: error -110
> >> >> >
> >> >> > Your resume routine cannot assume RAVB is in a sane mode, as it will
> >> >> > have been reset if PSCI suspend was used.
> >> >>
> >> >> Ouch, yes this is true thanks for reporting will look in to it.
> >> >>
> >> >> The problem is that in the resume handler if WoL is enabled it will try
> >> >> to close the device before reinitializing it from reset state. If WoL is
> >> >> not enabled the device will be closed at suspend time so no need to
> >> >> close it before restoring operation from reset in the resume handler.
> 
> >> With renesas-drivers, it will refuse to use PSCI suspend if any other wake-up
> >> sources are configured, i.e. try
> >>
> >>     echo disabled > /sys/devices/platform/soc/e6800000.ethernet/power/wakeup
> >>
> >> Good luck, and have a nice weekend!
> >
> > Thanks this allowed me to reproduce the same error as you. And after
> > future digging I don't believe the problem being in the logic of the
> > ravb suspend/resume functions. The problem is that the module clock is
> > never turned on after PCSI system suspend if its usage count is above 0
> > at suspend time, so the errors we both now observe are due to the module
> > clock being disabled.
> >
> > If I set DEBUG in renesas-cpg-mssr.c and observe when each module clock
> > is turned OFF and ON, the fault is clear. If WoL is enabled the clock is
> > never turned on when the system is resuming, while if WoL is disabled it
> > is. I verified this by removing the calls to clk_enable() and
> > clk_disable() from this patch, and by doing so the PCSI system suspend
> > works perfect with WoL enabled and the ravb comes up fine after toggling
> > SW23 (while ofc WoL no longer works in s2idle due to the module clock is
> > switched off at suspend time).
> >
> > I checked drivers/clk/renesas and I can't find a suspend/resume handler
> > for the clock driver, how is this intended to work? If a clock have a
> > usage count higher then 0 when the system is PSCI System Suspended it
> > seems like it won't be turned back on when the system is resumed from
> > this sleep stage. I might have misunderstood something and I need to
> > alter the logic in the ravb driver to let the clock driver know it
> > should turn on the clock at resume time?
> 
> Ah, you found a real use case for suspend/resume support in the clock
> drivers ;-)

Happy to be of service ;-)

> 
> Due to PSCI system suspend powering down the whole SoC, all clock
> settings are lost.
> 
> Thanks, I will look into this...

Thanks!

> 
> > Whit all this being said I still like to withdraw this patch as I found
> > another fault with it, ravb_wol_restore() will unconditionally be called
> > while ravb_wol_setup() will only be called if netif_running(ndev). This
> > is en easy fix and I will send out a v2 once we figure out what to do
> > about the clock.
> 
> The clock issue is external to the ravb driver. If it works with
> s2idle, it should
> be OK.

Do you think it's a good idea to move ahead with a v2 of the ravb WoL 
patch to fix the unrelated issue and aim for it to be picked up prior to
suspend/resume support is added to the CPG/MSSR?

> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds

-- 
Regards,
Niklas Söderlund

^ permalink raw reply

* [PATCH v3] bridge: netlink: check vlan_default_pvid range
From: Tobias Jungel @ 2017-05-16  8:48 UTC (permalink / raw)
  To: Sabrina Dubroca, Nikolay Aleksandrov, Stephen Hemminger,
	David S. Miller, netdev

Currently it is allowed to set the default pvid of a bridge to a value
above VLAN_VID_MASK (0xfff). This patch adds a check to br_validate and
returns -EINVAL in case the pvid is out of bounds.

Reproduce by calling:

[root@test ~]# ip l a type bridge
[root@test ~]# ip l a type dummy
[root@test ~]# ip l s bridge0 type bridge vlan_filtering 1
[root@test ~]# ip l s bridge0 type bridge vlan_default_pvid 9999
[root@test ~]# ip l s dummy0 master bridge0
[root@test ~]# bridge vlan
port	vlan ids
bridge0	 9999 PVID Egress Untagged

dummy0	 9999 PVID Egress Untagged

Fixes: 0f963b7592ef ("bridge: netlink: add support for default_pvid")
Signed-off-by: Tobias Jungel <tobias.jungel@bisdn.de>
---
 net/bridge/br_netlink.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index c5ce774..47cb95b 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -835,6 +835,12 @@ static int br_validate(struct nlattr *tb[], struct nlattr *data[])
 			return -EPROTONOSUPPORT;
 		}
 	}
+
+	if (data[IFLA_BR_VLAN_DEFAULT_PVID]) {
+		__u16 defpvid = nla_get_u16(data[IFLA_BR_VLAN_DEFAULT_PVID]);
+		if (defpvid >= VLAN_VID_MASK)
+			return -EINVAL;
+	}
 #endif
 
 	return 0;
-- 
2.9.4

^ permalink raw reply related

* Re: [PATCH 2/6] wl1251: Use request_firmware_prefer_user() for loading NVS calibration data
From: Arend Van Spriel @ 2017-05-16  8:41 UTC (permalink / raw)
  To: Luis R. Rodriguez, Johannes Berg
  Cc: Pavel Machek, Daniel Wagner, Tom Gundersen, Pali Rohár,
	Greg Kroah-Hartman, Kalle Valo, David Gnedt, Tony Lindgren,
	Sebastian Reichel, Ivaylo Dimitrov, Aaro Koskinen, Takashi Iwai,
	AKASHI Takahiro, David Woodhouse, Bjorn Andersson,
	Grazvydas Ignotas, linux-kernel, linux-wireless, netdev,
	Michał Kazior
In-Reply-To: <20170515231339.GF17314@wotan.suse.de>

On 16-5-2017 1:13, Luis R. Rodriguez wrote:
> On Fri, May 12, 2017 at 11:02:26PM +0200, Arend Van Spriel wrote:
>> try again.. replacing email address from Michał
>> On 12-5-2017 22:55, Arend Van Spriel wrote:
>>> Let me explain the idea to refresh your memory (and mine). It started
>>> when we were working on adding driver support for OpenWrt in brcmfmac.
>>> The driver requests for firmware calibration data, but on routers it is
>>> stored in flash. So after failing on the firmware request we now call a
>>> platform specific API. That was my itch, but it was not bad enough to go
>>> and scratch. Now for N900 case there is a similar scenario alhtough it
>>> has additional requirement to go to user-space due to need to use a
>>> proprietary library to obtain the NVS calibration data. My thought: Why
>>> should firmware_class care?
> 
> Agreed.
> 
>>> So the idea is that firmware_class provides
>>> a registry for modules that can produce a certain firmware "file". Those
>>> modules can do whatever is needed. If they need to use umh so be it.
>>> They would only register themselves with firmware_class on platforms
>>> that need them. It would basically be replacing the fallback mechanism
>>> and only be effective on certain platforms.
> 
> Sure, so it sounds like the work that Daniel Wagner and Tom Gundersen worked
> [0] on which provides a firmwared with two modes: best-effort, and final-mode,
> would address what you are looking for but without requiring any upstream
> changes, *and* it also helps solve the rootfs race remote-proc folks had
> concerns over.
> 
> The other added gain over this solution is if folks need their own proprietary
> concoction they can just fork firmwared and have that do whatever it needs
> for the specific device on the specific rootfs. That is, firmwared can be the
> upstream solution if folks need it, but if folks need something custom they can
> just mimic the implementation: best-effort, and and final-mode.
> 
> Yet another added gain over this solution we can do *not* support the
> custom fallback mechanism as its not needed, the udev event should suffice
> to let userspace do what it needs.
> 
> Lastly, if we did not want to deal with timeouts for the way the driver data
> API implements it I think we might be able to do away with them for for async
> requests if we assume there will be a daemon that spawns in final-mode eventually,
> and since it *knows* when the rootfs is ready it should be able to do a final
> lookup, if it returns -ENOENT; then indeed we know we can give up. Now, perhaps
> how and if we want to deal with timeouts when using the driver data API for
> the fallback mechanism is worth considering given it does not have a fallback
> mechanism support yet. If we *add* them it would seem this would also put an
> implicit race against userspace finishing initialization and running firmwared
> in final-mode.

Just to be clear. When you are saying "rootfs" in this story, you mean
any (mounted) file-system which may hold the firmware. At least that was
one of the arguments. In kernel space we can not know how the system is
setup in terms of mount points, let alone on which mounted file-system
the firmware resides.

> Johannes, do you recall the corner cases we spoke about regarding timeouts?
> Does this match what we spoke about?
> 
>>> Let me know if this idea is still of interest and I will rebase what I
>>> have for an RFC round.
> 
> Since no upstream delta is needed for firmwared I'd like to first encourage
> evaluating the above. While distributions don't carry it yet that may be seen as
> an issue but since what we are looking for are corner cases, only folks needing
> to deploy a specific solution would need it or a custom proprietary solution.

Ok. I will go try and run firmwared in OpenWrt on a router platform.
Have to steal one from a colleague :-p Will study firmwared.

> [0] https://github.com/teg/firmwared.git
> 
> PS.
> 
> Note that firmware signing will require an additional file, the detached
> signature. The driver data API does not currently support the fallback
> mechanism so we would not have to worry about that yet but once we add
> fallback support we'd need to consider this.

Do you have references to the firmware signing design. Is the idea to
have one signature and all "firmware files" need to be signed with it?

Thanks,
Arend

^ permalink raw reply

* Re: [PATCH iproute2] ip: add support for more MPLS labels
From: Simon Horman @ 2017-05-16  8:32 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, stephen
In-Reply-To: <20170514012702.1083-1-dsahern@gmail.com>

On Sat, May 13, 2017 at 07:27:02PM -0600, David Ahern wrote:
> Kernel now supports up to 30 labels but not defined as part of the uapi.
> iproute2 handles up to 8 labels but in a non-consistent way. Update ip
> to handle more labels, but in a more programmatic way.
> 
> For the MPLS address family, the data field in inet_prefix is used for
> labels.  Increase that field to 64 u32's -- 64 as nothing more than a
> convenient power of 2 number.
> 
> Update mpls_pton to take the length of the address field, convert that
> length to number of labels and add better error handling to the parsing
> of the user supplied string.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>
> ---
>  include/utils.h |  7 ++-----
>  lib/mpls_pton.c | 11 +++++++----
>  lib/utils.c     |  7 +++++--
>  3 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/include/utils.h b/include/utils.h
> index 8c12e1e2a60c..bfbc9e6dff55 100644
> --- a/include/utils.h
> +++ b/include/utils.h
> @@ -61,7 +61,7 @@ typedef struct
>  	__s16 bitlen;
>  	/* These next two fields match rtvia */
>  	__u16 family;
> -	__u32 data[8];
> +	__u32 data[64];
>  } inet_prefix;
>  
>  #define PREFIXLEN_SPECIFIED 1
> @@ -88,9 +88,6 @@ struct ipx_addr {
>  # define AF_MPLS 28
>  #endif
>  
> -/* Maximum number of labels the mpls helpers support */
> -#define MPLS_MAX_LABELS 8
> -
>  __u32 get_addr32(const char *name);
>  int get_addr_1(inet_prefix *dst, const char *arg, int family);
>  int get_prefix_1(inet_prefix *dst, char *arg, int family);
> @@ -155,7 +152,7 @@ const char *ipx_ntop(int af, const void *addr, char *str, size_t len);
>  int ipx_pton(int af, const char *src, void *addr);
>  
>  const char *mpls_ntop(int af, const void *addr, char *str, size_t len);
> -int mpls_pton(int af, const char *src, void *addr);
> +int mpls_pton(int af, const char *src, void *addr, size_t alen);
>  
>  extern int __iproute2_hz_internal;
>  int __get_hz(void);
> diff --git a/lib/mpls_pton.c b/lib/mpls_pton.c
> index bd448cfcf14a..6d2e6a69436a 100644
> --- a/lib/mpls_pton.c
> +++ b/lib/mpls_pton.c
> @@ -7,12 +7,13 @@
>  #include "utils.h"
>  
>  
> -static int mpls_pton1(const char *name, struct mpls_label *addr)
> +static int mpls_pton1(const char *name, struct mpls_label *addr,
> +		      unsigned int maxlabels)
>  {
>  	char *endp;
>  	unsigned count;
>  
> -	for (count = 0; count < MPLS_MAX_LABELS; count++) {
> +	for (count = 0; count < maxlabels; count++) {
>  		unsigned long label;
>  
>  		label = strtoul(name, &endp, 0);
> @@ -37,17 +38,19 @@ static int mpls_pton1(const char *name, struct mpls_label *addr)
>  		addr += 1;
>  	}
>  	/* The address was too long */
> +	fprintf(stderr, "Error: too many labels.\n");
>  	return 0;
>  }
>  
> -int mpls_pton(int af, const char *src, void *addr)
> +int mpls_pton(int af, const char *src, void *addr, size_t alen)
>  {
> +	unsigned int maxlabels = alen / sizeof(struct mpls_label);

Could ARRAY_SIZE be used?

Also, I know its only calculated twice, but might it be nicer to
provide a function or macro to do so? It seems like an ugly thing
to open code.

Cosmetic points above notwithstanding:

Reviewed-by: Simon Horman <simon.horman@netronome.com>

>  	int err;
>  
>  	switch(af) {
>  	case AF_MPLS:
>  		errno = 0;
> -		err = mpls_pton1(src, (struct mpls_label *)addr);
> +		err = mpls_pton1(src, (struct mpls_label *)addr, maxlabels);
>  		break;
>  	default:
>  		errno = EAFNOSUPPORT;
> diff --git a/lib/utils.c b/lib/utils.c
> index 6d5642f4f1f3..e77bd302530b 100644
> --- a/lib/utils.c
> +++ b/lib/utils.c
> @@ -518,15 +518,18 @@ int get_addr_1(inet_prefix *addr, const char *name, int family)
>  	}
>  
>  	if (family == AF_MPLS) {
> +		unsigned int maxlabels;
>  		int i;
>  
>  		addr->family = AF_MPLS;
> -		if (mpls_pton(AF_MPLS, name, addr->data) <= 0)
> +		if (mpls_pton(AF_MPLS, name, addr->data,
> +			      sizeof(addr->data)) <= 0)
>  			return -1;
>  		addr->bytelen = 4;
>  		addr->bitlen = 20;
>  		/* How many bytes do I need? */
> -		for (i = 0; i < 8; i++) {
> +		maxlabels = sizeof(addr->data) / sizeof(struct mpls_label);
> +		for (i = 0; i < maxlabels; i++) {
>  			if (ntohl(addr->data[i]) & MPLS_LS_S_MASK) {
>  				addr->bytelen = (i + 1)*4;
>  				break;
> -- 
> 2.11.0 (Apple Git-81)
> 

^ permalink raw reply

* Re: Advice on user space application integration with tc
From: Simon Horman @ 2017-05-16  8:25 UTC (permalink / raw)
  To: Morgan Yang; +Cc: Cong Wang, Linux Kernel Network Developers
In-Reply-To: <CAHV_CwbvSbLXy7gfJdTWrrimk+MZP6r+fDZjqc1L8rkUA8p2Xw@mail.gmail.com>

On Mon, May 15, 2017 at 01:39:27PM -0700, Morgan Yang wrote:
> I tried on both stock CentOS 7.3 and Ubuntu 16.04 and tc-skbmod was
> not support (I built tc from the latest versions of iproute2). For
> tc-pedit, examples from man tc-pedit such as "pedit ex munge" were not
> supported, but "pedit munge offset" is.

Please don't top-post on netdev.

Unless I understand things "pedit ex munge" uses new features of ped which
were introduced in v4.11. So yes, you would need a new kernel - or a
backport to an old one - in order to use that feature.

^ permalink raw reply

* I NEED YOUR REPLY,
From: Joy Julian Zengo @ 2017-05-16  8:17 UTC (permalink / raw)


Good day,

How are you today, I hope that everything is OK with you as it is my great pleasure to contact you in having communication with you today after seeing your email My name is Princess Joy J. Zengo.i am 24 yrs old,I am contacting you for a very important issue because i think you are the right person which i can work with.I will tell you more about me when i get a reply from you.
 
I am awaiting to hear from you
yours Princess Joy J.Zengo
Call me +229 64 89 03 05

^ permalink raw reply

* Re: [oss-drivers] [PATCH net-next 8/9] nfp: add a helper for wrapping descriptor index
From: Simon Horman @ 2017-05-16  8:22 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170516003857.52c3f31b@cakuba.netronome.com>

On Tue, May 16, 2017 at 12:38:57AM -0700, Jakub Kicinski wrote:
> On Tue, 16 May 2017 09:08:08 +0200, Simon Horman wrote:
> > On Mon, May 15, 2017 at 05:55:22PM -0700, Jakub Kicinski wrote:
> > > We have a number of places where we calculate the descriptor
> > > index based on a value which may have overflown.  Create a
> > > macro for masking with the ring size.
> > > 
> > > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > > ---
> > >  drivers/net/ethernet/netronome/nfp/nfp_net.h        |  3 +++
> > >  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 21 ++++++++++-----------
> > >  2 files changed, 13 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> > > index 66319a1026bb..7b9518cbe965 100644
> > > --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
> > > +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> > > @@ -117,6 +117,9 @@ struct nfp_eth_table_port;
> > >  struct nfp_net;
> > >  struct nfp_net_r_vector;
> > >  
> > > +/* Convenience macro for wrapping descriptor index on ring size */
> > > +#define D_IDX(ring, idx)	((idx) & ((ring)->cnt - 1))  
> > 
> > Any reason not to make this a function?
> 
> This is to be able to use the same macro for both RX and TX rings.  If
> you look below in the code I have a macro for setting dma addresses on
> descriptors also regardless of the type.
> 
> It makes the code a tiny bit cleaner IMHO and should be acceptable for
> trivial macros.

Thanks for the clarification, that sounds reasonable to me.

> > That notwithstanding:
> > 
> > Reviewed-by: Simon Horman <simon.horman@netronome.com>
> 
> Thanks for the reviews!

^ permalink raw reply

* Re: [PATCH v4 1/4] can: m_can: move Message RAM initialization to function
From: Alexandre Belloni @ 2017-05-16  8:00 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: Marc Kleine-Budde, Quentin Schulz, wg, mario.huettel, linux-can,
	netdev, linux-kernel, thomas.petazzoni
In-Reply-To: <1d41efe1-d24f-4700-8986-0d5dc848ea23@hartkopp.net>

Hi,

On 15/05/2017 at 20:51:30 -0700, Oliver Hartkopp wrote:
> On 05/15/2017 06:50 AM, Marc Kleine-Budde wrote:
> > On 05/12/2017 08:37 AM, Quentin Schulz wrote:
> > > On 05/05/2017 15:50, Quentin Schulz wrote:
> > > > To avoid possible ECC/parity checksum errors when reading an
> > > > uninitialized buffer, the entire Message RAM is initialized when probing
> > > > the driver. This initialization is done in the same function reading the
> > > > Device Tree properties.
> > > > 
> > > > This patch moves the RAM initialization to a separate function so it can
> > > > be called separately from device initialization from Device Tree.
> > > > 
> > > > Signed-off-by: Quentin Schulz <quentin.schulz@free-electrons.com>
> > > 
> > > It's been a week since I sent this patch series. Any comments?
> > 
> > Looks good, added to linux-can-next.
> 
> Isn't this a fix for linux-can instead?
> 
> At least it would make no sense to me to have the upgraded M_CAN driver in
> Linux 4.12 without this fix.
> 

The related suspend mode on the sama5d2 is not present in v4.12 so I
think this can wait v4.13.

-- 
Alexandre Belloni, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [patch net-next v2 10/10] net: sched: add termination action to allow goto chain
From: Daniel Borkmann @ 2017-05-16  7:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jhs, xiyou.wangcong, dsa, edumazet, stephen,
	alexander.h.duyck, simon.horman, mlxsw, alexei.starovoitov
In-Reply-To: <20170516044319.GA24493@nanopsycho>

On 05/16/2017 06:43 AM, Jiri Pirko wrote:
> Mon, May 15, 2017 at 10:02:08PM CEST, daniel@iogearbox.net wrote:
>> On 05/15/2017 10:38 AM, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>>
>>> Introduce new type of termination action called "goto_chain". This allows
>>> user to specify a chain to be processed. This action type is
>>> then processed as a return value in tcf_classify loop in similar
>>> way as "reclassify" is, only it does not reset to the first filter
>>> in chain but rather reset to the first filter of the desired chain.
>>>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> [...]
>>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>>> index 1112a2b..98cc689 100644
>>> --- a/net/sched/cls_api.c
>>> +++ b/net/sched/cls_api.c
>>> @@ -304,10 +304,14 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>>>    			continue;
>>>
>>>    		err = tp->classify(skb, tp, res);
>>> -		if (unlikely(err == TC_ACT_RECLASSIFY && !compat_mode))
>>> +		if (err == TC_ACT_RECLASSIFY && !compat_mode) {
>>>    			goto reset;
>>> -		if (err >= 0)
>>> +		} else if (TC_ACT_EXT_CMP(err, TC_ACT_GOTO_CHAIN)) {
>>> +			old_tp = res->goto_tp;
>>> +			goto reset;
>>> +		} else if (err >= 0) {
>>>    			return err;
>>> +		}
>>
>> Given this goto chain feature is pretty much only interesting for hw
>> offloads, can we move this further away from the sw fast path to not
>> add up to the cost per packet? (I doubt anyone is using TC_ACT_RECLASSIFY
>> in sw as well ...)
>
> I don't think so. First of all, the whole thing would be broken then in
> sw. It is useful to have it in sw, at least for testing reasons.
> So I would leave the unlikely and add it to the second check as well.

Ok, lets go with that then, thanks!

^ permalink raw reply

* Re: [PATCH] ravb: add wake-on-lan support via magic packet
From: Geert Uytterhoeven @ 2017-05-16  7:48 UTC (permalink / raw)
  To: Niklas Söderlund
  Cc: Sergei Shtylyov, netdev@vger.kernel.org, Linux-Renesas
In-Reply-To: <20170512211222.GI31437@bigcity.dyn.berto.se>

Hi Niklas,

On Fri, May 12, 2017 at 11:12 PM, Niklas Söderlund
<niklas.soderlund@ragnatech.se> wrote:
> On 2017-05-12 16:43:55 +0200, Geert Uytterhoeven wrote:
>> On Fri, May 12, 2017 at 4:32 PM, Niklas Söderlund
>> <niklas.soderlund@ragnatech.se> wrote:
>> > On 2017-05-12 14:58:53 +0200, Niklas Söderlund wrote:
>> >> On 2017-05-12 14:47:53 +0200, Geert Uytterhoeven wrote:
>> >> > On Fri, May 12, 2017 at 12:27 AM, Niklas Söderlund
>> >> > <niklas.soderlund+renesas@ragnatech.se> wrote:
>> >> > When using PSCI suspend/resume, and Wake-on-LAN cannot work due to
>> >> > PSCI firmware issues, Ethernet fails to come up afterwards:
>> >> >
>> >> >     ravb e6800000.ethernet eth0: failed to switch device to config mode
>> >> >     ravb e6800000.ethernet eth0: device will be stopped after h/w
>> >> > processes are done.
>> >> >     ravb e6800000.ethernet eth0: failed to switch device to config mode
>> >> >     dpm_run_callback(): ravb_resume+0x0/0x148 returns -110
>> >> >     PM: Device e6800000.ethernet failed to resume: error -110
>> >> >
>> >> > Your resume routine cannot assume RAVB is in a sane mode, as it will
>> >> > have been reset if PSCI suspend was used.
>> >>
>> >> Ouch, yes this is true thanks for reporting will look in to it.
>> >>
>> >> The problem is that in the resume handler if WoL is enabled it will try
>> >> to close the device before reinitializing it from reset state. If WoL is
>> >> not enabled the device will be closed at suspend time so no need to
>> >> close it before restoring operation from reset in the resume handler.

>> With renesas-drivers, it will refuse to use PSCI suspend if any other wake-up
>> sources are configured, i.e. try
>>
>>     echo disabled > /sys/devices/platform/soc/e6800000.ethernet/power/wakeup
>>
>> Good luck, and have a nice weekend!
>
> Thanks this allowed me to reproduce the same error as you. And after
> future digging I don't believe the problem being in the logic of the
> ravb suspend/resume functions. The problem is that the module clock is
> never turned on after PCSI system suspend if its usage count is above 0
> at suspend time, so the errors we both now observe are due to the module
> clock being disabled.
>
> If I set DEBUG in renesas-cpg-mssr.c and observe when each module clock
> is turned OFF and ON, the fault is clear. If WoL is enabled the clock is
> never turned on when the system is resuming, while if WoL is disabled it
> is. I verified this by removing the calls to clk_enable() and
> clk_disable() from this patch, and by doing so the PCSI system suspend
> works perfect with WoL enabled and the ravb comes up fine after toggling
> SW23 (while ofc WoL no longer works in s2idle due to the module clock is
> switched off at suspend time).
>
> I checked drivers/clk/renesas and I can't find a suspend/resume handler
> for the clock driver, how is this intended to work? If a clock have a
> usage count higher then 0 when the system is PSCI System Suspended it
> seems like it won't be turned back on when the system is resumed from
> this sleep stage. I might have misunderstood something and I need to
> alter the logic in the ravb driver to let the clock driver know it
> should turn on the clock at resume time?

Ah, you found a real use case for suspend/resume support in the clock
drivers ;-)

Due to PSCI system suspend powering down the whole SoC, all clock
settings are lost.

Thanks, I will look into this...

> Whit all this being said I still like to withdraw this patch as I found
> another fault with it, ravb_wol_restore() will unconditionally be called
> while ravb_wol_setup() will only be called if netif_running(ndev). This
> is en easy fix and I will send out a v2 once we figure out what to do
> about the clock.

The clock issue is external to the ravb driver. If it works with
s2idle, it should
be OK.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [Patch net] ipv4: restore rt->fi for reference counting
From: Julian Anastasov @ 2017-05-16  7:46 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, David Miller, Linux Kernel Network Developers,
	Andrey Konovalov, Eric Dumazet
In-Reply-To: <CAM_iQpWAjTJL54h_Pdg3DigjEK4kQdSz0pmAks6ByoP67ROOFg@mail.gmail.com>


	Hello,

On Mon, 15 May 2017, Cong Wang wrote:

> On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov <ja@ssi.bg> wrote:
> >         Any user that does not set FIB_LOOKUP_NOREF
> > will need nh_dev refcounts. The assumption is that the
> > NHs are accessed, who knows, may be even after RCU grace
> > period. As result, we can not use dev_put on NETDEV_UNREGISTER.
> > So, we should check if there are users that do not
> > set FIB_LOOKUP_NOREF, at first look, I don't see such ones
> > for IPv4.
> 
> I see, although we do have FIB_LOOKUP_NOREF set all the times,
> there are other places we hold fib_clntref too, for example
> mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too...
> 
> So I am afraid moving dev_put() to fib_release_info() is not a solution
> here. I have to rethink about it.

	At first look, they use fib_info_hold() to get fib_clntref 
reference from places where fib_treeref is not fatally decreased
to 0 but later a work is used which finishes the job. I guess, we
can convert such places to use just a fib_treeref reference.
They can use such new method instead of fib_info_hold:

void fib_treeref_get(struct fib_info *fi)
{
	spin_lock_bh(&fib_info_lock);
	fi->fib_treeref++;
	spin_unlock_bh(&fib_info_lock);
}

They will use fib_release_info() to put the reference. But
on FIB_EVENT_ENTRY_DEL there is a small window where the
scheduled work delays the unlink of fib info from the
hash tables, i.e. there is a risk fib_find_info to reuse
a dead fib info.

	May be we can add a fi->fib_flags & RTNH_F_DEAD
check there but the problem is that it is set also on
NETDEV_DOWN. While attempts to add route to device with
!(dev->flags & IFF_UP) is rejected by fib_check_nh(),
fib_create_info still can create routes when
cfg->fc_scope == RT_SCOPE_HOST. So, RTNH_F_DEAD check
in fib_find_info can avoid the reuse of fib info for
host routes while device is down but not unregistered.
As result, many fib infos can be created instead of one
that is reused. Adding new RTNH_F_* flag to properly handle
this does not look good...

Regards

^ permalink raw reply

* stmmac dtc warnings 4.12-rc1
From: Niklas Cassel @ 2017-05-16  7:46 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Joao Pinto, Thierry Reding,
	Mark Rutland, Rob Herring, David Miller
  Cc: devicetree, netdev

Hello


With the dtc in linux 4.12-rc1 we get the following
warning when compiling our stmmac node:

Warning (simple_bus_reg): Node /amba/stmmac-axi-config missing or empty reg/ranges property

Note that we do follow the example binding for stmmac:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/net/stmmac.txt

The binding does not specify a reg/ranges property for the stmmac-axi-config node.
(Neither does the mtl_rx_setup/mtl_tx_setup nodes.)

We are building with a local patch for dtc that fails on warnings (like -Werror).
However, I figure that there shouldn't be any warnings when compiling a dtc,
so what's the best solution here?
Add a reg = <0> property to the stmmac-axi-config/mtl_rx_setup/mtl_tx_setup nodes?


Regards,
Niklas

^ permalink raw reply

* Re: [oss-drivers] [PATCH net-next 8/9] nfp: add a helper for wrapping descriptor index
From: Jakub Kicinski @ 2017-05-16  7:38 UTC (permalink / raw)
  To: Simon Horman; +Cc: netdev, oss-drivers
In-Reply-To: <20170516070806.GC11965@vergenet.net>

On Tue, 16 May 2017 09:08:08 +0200, Simon Horman wrote:
> On Mon, May 15, 2017 at 05:55:22PM -0700, Jakub Kicinski wrote:
> > We have a number of places where we calculate the descriptor
> > index based on a value which may have overflown.  Create a
> > macro for masking with the ring size.
> > 
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > ---
> >  drivers/net/ethernet/netronome/nfp/nfp_net.h        |  3 +++
> >  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 21 ++++++++++-----------
> >  2 files changed, 13 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> > index 66319a1026bb..7b9518cbe965 100644
> > --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
> > +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> > @@ -117,6 +117,9 @@ struct nfp_eth_table_port;
> >  struct nfp_net;
> >  struct nfp_net_r_vector;
> >  
> > +/* Convenience macro for wrapping descriptor index on ring size */
> > +#define D_IDX(ring, idx)	((idx) & ((ring)->cnt - 1))  
> 
> Any reason not to make this a function?

This is to be able to use the same macro for both RX and TX rings.  If
you look below in the code I have a macro for setting dma addresses on
descriptors also regardless of the type.

It makes the code a tiny bit cleaner IMHO and should be acceptable for
trivial macros.

> That notwithstanding:
> 
> Reviewed-by: Simon Horman <simon.horman@netronome.com>

Thanks for the reviews!

^ permalink raw reply

* Re: [oss-drivers] [PATCH net-next 8/9] nfp: add a helper for wrapping descriptor index
From: Simon Horman @ 2017-05-16  7:08 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170516005523.26124-9-jakub.kicinski@netronome.com>

On Mon, May 15, 2017 at 05:55:22PM -0700, Jakub Kicinski wrote:
> We have a number of places where we calculate the descriptor
> index based on a value which may have overflown.  Create a
> macro for masking with the ring size.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_net.h        |  3 +++
>  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 21 ++++++++++-----------
>  2 files changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> index 66319a1026bb..7b9518cbe965 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> @@ -117,6 +117,9 @@ struct nfp_eth_table_port;
>  struct nfp_net;
>  struct nfp_net_r_vector;
>  
> +/* Convenience macro for wrapping descriptor index on ring size */
> +#define D_IDX(ring, idx)	((idx) & ((ring)->cnt - 1))

Any reason not to make this a function?

That notwithstanding:

Reviewed-by: Simon Horman <simon.horman@netronome.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox