Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
From: Lawrence Brakmo @ 2018-06-29 18:55 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Yuchung Cheng, Matt Mathis, Netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Eric Dumazet, Wei Wang, Steve Ibanez,
	Yousuk Seung
In-Reply-To: <CADVnQy=MsiEBCr+Mnp97mp0MxDqrA+_KiZEQehgcDfe9L-hghQ@mail.gmail.com>

On 6/28/18, 1:48 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:

    On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > I just looked at 4.18 traces and the behavior is as follows:
    >
    >    Host A sends the last packets of the request
    >
    >    Host B receives them, and the last packet is marked with congestion (CE)
    >
    >    Host B sends ACKs for packets not marked with congestion
    >
    >    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
    >
    >    Host A receives ACKs with no ECE flag
    >
    >    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
    >
    >    Host A sends 1st data packet of the next request with TCP flag CWR
    >
    >    Host B receives the packet (as seen in tcpdump at B), no CE flag
    >
    >    Host B sends a dup ACK that also has the TCP ECE flag
    >
    >    Host A RTO timer fires!
    >
    >    Host A to send the next packet
    >
    >    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
    >
    >    Host A send more packets…
    
    Thanks, Larry! This is very interesting. I don't know the cause, but
    this reminds me of an issue  Steve Ibanez raised on the netdev list
    last December, where he was seeing cases with DCTCP where a CWR packet
    would be received and buffered by Host B but not ACKed by Host B. This
    was the thread "Re: Linux ECN Handling", starting around December 5. I
    have cc-ed Steve.
    
    I wonder if this may somehow be related to the DCTCP logic to rewind
    tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
    DCTCP notices that the incoming CE bits have been changed while the
    receiver thinks it is holding on to a delayed ACK (in
    dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
    "synthetic" call to tcp_send_ack() somehow has side effects in the
    delayed ACK state machine that can cause the connection to forget that
    it still needs to fire a delayed ACK, even though it just sent an ACK
    just now.
    
    neal
    
You were right Neal, one of the bugs is related to this and is caused by a lack of state update to DCTCP. DCTCP is first informed that the ACK was delayed but it is not updated when the ACK is sent with a data packet.

I am working on a patch to fix this which hopefully should be out today. Thanks everyone for the great feedback! 

^ permalink raw reply

* Re: [PATCH] cfg80211: use IDA to allocate wiphy indeces
From: Ben Greear @ 2018-06-29 19:01 UTC (permalink / raw)
  To: Brian Norris, Johannes Berg; +Cc: linux-kernel, linux-wireless, netdev
In-Reply-To: <20180629184847.GA251207@ban.mtv.corp.google.com>

On 06/29/2018 11:48 AM, Brian Norris wrote:
> Hi Johannes,
>
> On Fri, Jun 29, 2018 at 09:42:20AM +0200, Johannes Berg wrote:
>> On Wed, 2018-06-20 at 18:29 -0700, Brian Norris wrote:
>>> It's annoying to see the phy index increase arbitrarily, just because a
>>> device got removed and re-probed (e.g., during a device reset, or due to
>>> probe testing). We can use the in-kernel index allocator for this,
>>> instead of just an increasing counter.
>>
>> I can understand that it's somewhat annoying to people, but it was
>> actually done on purpose to avoid userspace talking to the wrong device.
>
> Hmm, interesting. I'm not dead-set on this patch, so if there are good
> reasons to reject it, I won't fret.
>
>> Imagine you have some userspace process running that has remembered the
>> wiphy index to use it to talk to nl80211, and now underneath the device
>> goes away and reappears. This process should understand that situation,
>> and handle it accordingly, rather than being blind to the reset.
>
> How is this different from the wlan (netdev) device naming? We allow
> 'wlan0' to leave and return under the same name. Isn't the right answer
> that user space should be listening for udev and/or netlink events?
>
> Brian
>

For what it is worth, we use udev to rename the phyX to wiphyZ devices based on
their MAC address, and that seems to work OK.

I can't think of any reason why user-space would need the phy index number
to increase as modules are loaded/unloaded though.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* [PATCH v3 net-next 0/9] Handle multiple received packets at each stage
From: Edward Cree @ 2018-06-29 19:27 UTC (permalink / raw)
  To: davem; +Cc: netdev

This patch series adds the capability for the network stack to receive a
 list of packets and process them as a unit, rather than handling each
 packet singly in sequence.  This is done by factoring out the existing
 datapath code at each layer and wrapping it in list handling code.

The motivation for this change is twofold:
* Instruction cache locality.  Currently, running the entire network
  stack receive path on a packet involves more code than will fit in the
  lowest-level icache, meaning that when the next packet is handled, the
  code has to be reloaded from more distant caches.  By handling packets
  in "row-major order", we ensure that the code at each layer is hot for
  most of the list.  (There is a corresponding downside in _data_ cache
  locality, since we are now touching every packet at every layer, but in
  practice there is easily enough room in dcache to hold one cacheline of
  each of the 64 packets in a NAPI poll.)
* Reduction of indirect calls.  Owing to Spectre mitigations, indirect
  function calls are now more expensive than ever; they are also heavily
  used in the network stack's architecture (see [1]).  By replacing 64
  indirect calls to the next-layer per-packet function with a single
  indirect call to the next-layer list function, we can save CPU cycles.

Drivers pass an SKB list to the stack at the end of the NAPI poll; this
 gives a natural batch size (the NAPI poll weight) and avoids waiting at
 the software level for further packets to make a larger batch (which
 would add latency).  It also means that the batch size is automatically
 tuned by the existing interrupt moderation mechanism.
The stack then runs each layer of processing over all the packets in the
 list before proceeding to the next layer.  Where the 'next layer' (or
 the context in which it must run) differs among the packets, the stack
 splits the list; this 'late demux' means that packets which differ only
 in later headers (e.g. same L2/L3 but different L4) can traverse the
 early part of the stack together.
Also, where the next layer is not (yet) list-aware, the stack can revert
 to calling the rest of the stack in a loop; this allows gradual/creeping
 listification, with no 'flag day' patch needed to listify everything.

Patches 1-2 simply place received packets on a list during the event
 processing loop on the sfc EF10 architecture, then call the normal stack
 for each packet singly at the end of the NAPI poll.  (Analogues of patch
 #2 for other NIC drivers should be fairly straightforward.)
Patches 3-9 extend the list processing as far as the IP receive handler.

Patches 1-2 alone give about a 10% improvement in packet rate in the
 baseline test; adding patches 3-9 raises this to around 25%.

Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
 packets and a single core to handle interrupts on the RX side; this was
 in order to measure as simply as possible the packet rate handled by a
 single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
 setup was tuned for maximum reproducibility, rather than raw performance.
 Full details and more results (both with and without retpolines) from a
 previous version of the patch series are presented in [2].

The baseline test uses four streams, and multiple RXQs all bound to a
 single CPU (the netperf binary is bound to a neighbouring CPU).  These
 tests were run with retpolines.
net-next: 6.91 Mb/s (datum)
 after 9: 8.46 Mb/s (+22.5%)
Note however that these results are not robust; changes in the parameters
 of the test sometimes shrink the gain to single-digit percentages.  For
 instance, when using only a single RXQ, only a 4% gain was seen.

One test variation was the use of software filtering/firewall rules.
 Adding a single iptables rule (UDP port drop on a port range not matching
 the test traffic), thus making the netfilter hook have work to do,
 reduced baseline performance but showed a similar gain from the patches:
net-next: 5.02 Mb/s (datum)
 after 9: 6.78 Mb/s (+35.1%)

Similarly, testing with a set of TC flower filters (kindly supplied by
 Cong Wang) gave the following:
net-next: 6.83 Mb/s (datum)
 after 9: 8.86 Mb/s (+29.7%)

These data suggest that the batching approach remains effective in the
 presence of software switching rules, and perhaps even improves the
 performance of those rules by allowing them and their codepaths to stay
 in cache between packets.

Changes from v2:
* Used standard list handling (and skb->list) instead of the skb-queue
  functions (that use skb->next, skb->prev).
  - As part of this, changed from a "dequeue, process, enqueue" model to
    using list_for_each_safe, list_del, and (new) list_cut_before.
* Altered __netif_receive_skb_core() changes in patch 6 as per Willem de
  Bruijn's suggestions (separate **ppt_prev from *pt_prev; renaming).
* Removed patches to Generic XDP, since they were producing no benefit.
  I may revisit them later.
* Removed RFC tags.

Changes from v1:
* Rebased across 2 years' net-next movement (surprisingly straightforward).
  - Added Generic XDP handling to netif_receive_skb_list_internal()
  - Dealt with changes to PFMEMALLOC setting APIs
* General cleanup of code and comments.
* Skipped function calls for empty lists at various points in the stack
  (patch #9).
* Added listified Generic XDP handling (patches 10-12), though it doesn't
  seem to help (see above).
* Extended testing to cover software firewalls / netfilter etc.

[1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
[2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf

Edward Cree (9):
  net: core: trivial netif_receive_skb_list() entry point
  sfc: batch up RX delivery
  net: core: unwrap skb list receive slightly further
  net: core: Another step of skb receive list processing
  net: core: another layer of lists, around PF_MEMALLOC skb handling
  net: core: propagate SKB lists through packet_type lookup
  net: ipv4: listified version of ip_rcv
  net: ipv4: listify ip_rcv_finish
  net: don't bother calling list RX functions on empty lists

 drivers/net/ethernet/sfc/efx.c        |  12 +++
 drivers/net/ethernet/sfc/net_driver.h |   3 +
 drivers/net/ethernet/sfc/rx.c         |   7 +-
 include/linux/list.h                  |  30 ++++++
 include/linux/netdevice.h             |   4 +
 include/linux/netfilter.h             |  25 +++++
 include/net/ip.h                      |   2 +
 include/trace/events/net.h            |   7 ++
 net/core/dev.c                        | 174 ++++++++++++++++++++++++++++++++--
 net/ipv4/af_inet.c                    |   1 +
 net/ipv4/ip_input.c                   | 114 ++++++++++++++++++++--
 11 files changed, 363 insertions(+), 16 deletions(-)

^ permalink raw reply

* [PATCH v3 net-next 1/9] net: core: trivial netif_receive_skb_list() entry point
From: Edward Cree @ 2018-06-29 19:29 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

Just calls netif_receive_skb() in a loop.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c6b377a15869..e104b2e4a735 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3365,6 +3365,7 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
+void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
diff --git a/net/core/dev.c b/net/core/dev.c
index dffed642e686..110c8dfebc01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4792,6 +4792,25 @@ int netif_receive_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
+/**
+ *	netif_receive_skb_list - process many receive buffers from network
+ *	@head: list of skbs to process.
+ *
+ *	For now, just calls netif_receive_skb() in a loop, ignoring the
+ *	return value.
+ *
+ *	This function may only be called from softirq context and interrupts
+ *	should be enabled.
+ */
+void netif_receive_skb_list(struct list_head *head)
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list)
+		netif_receive_skb(skb);
+}
+EXPORT_SYMBOL(netif_receive_skb_list);
+
 DEFINE_PER_CPU(struct work_struct, flush_works);
 
 /* Network device is going away, flush any packets still pending */

^ permalink raw reply related

* [PATCH v3 net-next 3/9] net: core: unwrap skb list receive slightly further
From: Edward Cree @ 2018-06-29 19:30 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/trace/events/net.h | 7 +++++++
 net/core/dev.c             | 4 +++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 9c886739246a..00aa72ce0e7c 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -223,6 +223,13 @@ DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_entry,
 	TP_ARGS(skb)
 );
 
+DEFINE_EVENT(net_dev_rx_verbose_template, netif_receive_skb_list_entry,
+
+	TP_PROTO(const struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
 DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
 
 	TP_PROTO(const struct sk_buff *skb),
diff --git a/net/core/dev.c b/net/core/dev.c
index 110c8dfebc01..99167ff83919 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4806,8 +4806,10 @@ void netif_receive_skb_list(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
+	list_for_each_entry(skb, head, list)
+		trace_netif_receive_skb_list_entry(skb);
 	list_for_each_entry_safe(skb, next, head, list)
-		netif_receive_skb(skb);
+		netif_receive_skb_internal(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 

^ permalink raw reply related

* [PATCH v3 net-next 2/9] sfc: batch up RX delivery
From: Edward Cree @ 2018-06-29 19:29 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-net-drivers
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

Improves packet rate of 1-byte UDP receives by up to 10%.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c        | 12 ++++++++++++
 drivers/net/ethernet/sfc/net_driver.h |  3 +++
 drivers/net/ethernet/sfc/rx.c         |  7 ++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 570ec72266f3..b24c2e21db8e 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -264,11 +264,17 @@ static int efx_check_disabled(struct efx_nic *efx)
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
 	struct efx_tx_queue *tx_queue;
+	struct list_head rx_list;
 	int spent;
 
 	if (unlikely(!channel->enabled))
 		return 0;
 
+	/* Prepare the batch receive list */
+	EFX_WARN_ON_PARANOID(channel->rx_list != NULL);
+	INIT_LIST_HEAD(&rx_list);
+	channel->rx_list = &rx_list;
+
 	efx_for_each_channel_tx_queue(tx_queue, channel) {
 		tx_queue->pkts_compl = 0;
 		tx_queue->bytes_compl = 0;
@@ -291,6 +297,10 @@ static int efx_process_channel(struct efx_channel *channel, int budget)
 		}
 	}
 
+	/* Receive any packets we queued up */
+	netif_receive_skb_list(channel->rx_list);
+	channel->rx_list = NULL;
+
 	return spent;
 }
 
@@ -555,6 +565,8 @@ static int efx_probe_channel(struct efx_channel *channel)
 			goto fail;
 	}
 
+	channel->rx_list = NULL;
+
 	return 0;
 
 fail:
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 65568925c3ef..961b92979640 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -448,6 +448,7 @@ enum efx_sync_events_state {
  *	__efx_rx_packet(), or zero if there is none
  * @rx_pkt_index: Ring index of first buffer for next packet to be delivered
  *	by __efx_rx_packet(), if @rx_pkt_n_frags != 0
+ * @rx_list: list of SKBs from current RX, awaiting processing
  * @rx_queue: RX queue for this channel
  * @tx_queue: TX queues for this channel
  * @sync_events_state: Current state of sync events on this channel
@@ -500,6 +501,8 @@ struct efx_channel {
 	unsigned int rx_pkt_n_frags;
 	unsigned int rx_pkt_index;
 
+	struct list_head *rx_list;
+
 	struct efx_rx_queue rx_queue;
 	struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];
 
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index d2e254f2f72b..396ff01298cd 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -634,7 +634,12 @@ static void efx_rx_deliver(struct efx_channel *channel, u8 *eh,
 			return;
 
 	/* Pass the packet up */
-	netif_receive_skb(skb);
+	if (channel->rx_list != NULL)
+		/* Add to list, will pass up later */
+		list_add_tail(&skb->list, channel->rx_list);
+	else
+		/* No list, so pass it up now */
+		netif_receive_skb(skb);
 }
 
 /* Handle a received packet.  Second half: Touches packet payload. */

^ permalink raw reply related

* [PATCH v3 net-next 4/9] net: core: Another step of skb receive list processing
From: Edward Cree @ 2018-06-29 19:30 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

netif_receive_skb_list_internal() now processes a list and hands it
 on to the next function.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 56 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 99167ff83919..d7f2a880aeed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4729,6 +4729,14 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
+static void __netif_receive_skb_list(struct list_head *head)
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list)
+		__netif_receive_skb(skb);
+}
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4769,6 +4777,50 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	return ret;
 }
 
+static void netif_receive_skb_list_internal(struct list_head *head)
+{
+	struct bpf_prog *xdp_prog = NULL;
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		net_timestamp_check(netdev_tstamp_prequeue, skb);
+		if (skb_defer_rx_timestamp(skb))
+			/* Handled, remove from list */
+			list_del(&skb->list);
+	}
+
+	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		preempt_disable();
+		rcu_read_lock();
+		list_for_each_entry_safe(skb, next, head, list) {
+			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+			if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
+				/* Dropped, remove from list */
+				list_del(&skb->list);
+		}
+		rcu_read_unlock();
+		preempt_enable();
+	}
+
+	rcu_read_lock();
+#ifdef CONFIG_RPS
+	if (static_key_false(&rps_needed)) {
+		list_for_each_entry_safe(skb, next, head, list) {
+			struct rps_dev_flow voidflow, *rflow = &voidflow;
+			int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+
+			if (cpu >= 0) {
+				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+				/* Handled, remove from list */
+				list_del(&skb->list);
+			}
+		}
+	}
+#endif
+	__netif_receive_skb_list(head);
+	rcu_read_unlock();
+}
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -4796,20 +4848,19 @@ EXPORT_SYMBOL(netif_receive_skb);
  *	netif_receive_skb_list - process many receive buffers from network
  *	@head: list of skbs to process.
  *
- *	For now, just calls netif_receive_skb() in a loop, ignoring the
- *	return value.
+ *	Since return value of netif_receive_skb() is normally ignored, and
+ *	wouldn't be meaningful for a list, this function returns void.
  *
  *	This function may only be called from softirq context and interrupts
  *	should be enabled.
  */
 void netif_receive_skb_list(struct list_head *head)
 {
-	struct sk_buff *skb, *next;
+	struct sk_buff *skb;
 
 	list_for_each_entry(skb, head, list)
 		trace_netif_receive_skb_list_entry(skb);
-	list_for_each_entry_safe(skb, next, head, list)
-		netif_receive_skb_internal(skb);
+	netif_receive_skb_list_internal(head);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 

^ permalink raw reply related

* [PATCH v3 net-next 5/9] net: core: another layer of lists, around PF_MEMALLOC skb handling
From: Edward Cree @ 2018-06-29 19:30 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

First example of a layer splitting the list (rather than merely taking
 individual packets off it).
Involves new list.h function, list_cut_before(), like list_cut_position()
 but cuts on the other side of the given entry.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/list.h | 30 ++++++++++++++++++++++++++++++
 net/core/dev.c       | 44 ++++++++++++++++++++++++++++++++++++--------
 2 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 4b129df4d46b..de04cc5ed536 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -285,6 +285,36 @@ static inline void list_cut_position(struct list_head *list,
 		__list_cut_position(list, head, entry);
 }
 
+/**
+ * list_cut_before - cut a list into two, before given entry
+ * @list: a new list to add all removed entries
+ * @head: a list with entries
+ * @entry: an entry within head, could be the head itself
+ *
+ * This helper moves the initial part of @head, up to but
+ * excluding @entry, from @head to @list.  You should pass
+ * in @entry an element you know is on @head.  @list should
+ * be an empty list or a list you do not care about losing
+ * its data.
+ * If @entry == @head, all entries on @head are moved to
+ * @list.
+ */
+static inline void list_cut_before(struct list_head *list,
+				   struct list_head *head,
+				   struct list_head *entry)
+{
+	if (head->next == entry) {
+		INIT_LIST_HEAD(list);
+		return;
+	}
+	list->next = head->next;
+	list->next->prev = list;
+	list->prev = entry->prev;
+	list->prev->next = list;
+	head->next = entry;
+	entry->prev = head;
+}
+
 static inline void __list_splice(const struct list_head *list,
 				 struct list_head *prev,
 				 struct list_head *next)
diff --git a/net/core/dev.c b/net/core/dev.c
index d7f2a880aeed..d2454678bc82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4670,6 +4670,14 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
+static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list)
+		__netif_receive_skb_core(skb, pfmemalloc);
+}
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	int ret;
@@ -4695,6 +4703,34 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
+static void __netif_receive_skb_list(struct list_head *head)
+{
+	unsigned long noreclaim_flag = 0;
+	struct sk_buff *skb, *next;
+	bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
+			struct list_head sublist;
+
+			/* Handle the previous sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			pfmemalloc = !pfmemalloc;
+			/* See comments in __netif_receive_skb */
+			if (pfmemalloc)
+				noreclaim_flag = memalloc_noreclaim_save();
+			else
+				memalloc_noreclaim_restore(noreclaim_flag);
+		}
+	}
+	/* Handle the remaining sublist */
+	__netif_receive_skb_list_core(head, pfmemalloc);
+	/* Restore pflags */
+	if (pfmemalloc)
+		memalloc_noreclaim_restore(noreclaim_flag);
+}
+
 static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
@@ -4729,14 +4765,6 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 	return ret;
 }
 
-static void __netif_receive_skb_list(struct list_head *head)
-{
-	struct sk_buff *skb, *next;
-
-	list_for_each_entry_safe(skb, next, head, list)
-		__netif_receive_skb(skb);
-}
-
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;

^ permalink raw reply related

* [PATCH v3 net-next 6/9] net: core: propagate SKB lists through packet_type lookup
From: Edward Cree @ 2018-06-29 19:31 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

__netif_receive_skb_core() does a depressingly large amount of per-packet
 work that can't easily be listified, because the another_round looping
 makes it nontrivial to slice up into smaller functions.
Fortunately, most of that work disappears in the fast path:
 * Hardware devices generally don't have an rx_handler
 * Unless you're tcpdumping or something, there is usually only one ptype
 * VLAN processing comes before the protocol ptype lookup, so doesn't force
   a pt_prev deliver
 so normally, __netif_receive_skb_core() will run straight through and pass
 back the one ptype found in ptype_base[hash of skb->protocol].

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d2454678bc82..edd67b1f1e12 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4494,7 +4494,8 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
 	return 0;
 }
 
-static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
+				    struct packet_type **ppt_prev)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
@@ -4624,8 +4625,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	if (pt_prev) {
 		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 			goto drop;
-		else
-			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		*ppt_prev = pt_prev;
 	} else {
 drop:
 		if (!deliver_exact)
@@ -4643,6 +4643,18 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	return ret;
 }
 
+static int __netif_receive_skb_one_core(struct sk_buff *skb, bool pfmemalloc)
+{
+	struct net_device *orig_dev = skb->dev;
+	struct packet_type *pt_prev = NULL;
+	int ret;
+
+	ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+	if (pt_prev)
+		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	return ret;
+}
+
 /**
  *	netif_receive_skb_core - special purpose version of netif_receive_skb
  *	@skb: buffer to process
@@ -4663,19 +4675,63 @@ int netif_receive_skb_core(struct sk_buff *skb)
 	int ret;
 
 	rcu_read_lock();
-	ret = __netif_receive_skb_core(skb, false);
+	ret = __netif_receive_skb_one_core(skb, false);
 	rcu_read_unlock();
 
 	return ret;
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
-static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+static inline void __netif_receive_skb_list_ptype(struct list_head *head,
+						  struct packet_type *pt_prev,
+						  struct net_device *orig_dev)
 {
 	struct sk_buff *skb, *next;
 
+	if (!pt_prev)
+		return;
+	if (list_empty(head))
+		return;
+
 	list_for_each_entry_safe(skb, next, head, list)
-		__netif_receive_skb_core(skb, pfmemalloc);
+		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+}
+
+static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+{
+	/* Fast-path assumptions:
+	 * - There is no RX handler.
+	 * - Only one packet_type matches.
+	 * If either of these fails, we will end up doing some per-packet
+	 * processing in-line, then handling the 'last ptype' for the whole
+	 * sublist.  This can't cause out-of-order delivery to any single ptype,
+	 * because the 'last ptype' must be constant across the sublist, and all
+	 * other ptypes are handled per-packet.
+	 */
+	/* Current (common) ptype of sublist */
+	struct packet_type *pt_curr = NULL;
+	/* Current (common) orig_dev of sublist */
+	struct net_device *od_curr = NULL;
+	struct list_head sublist;
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *orig_dev = skb->dev;
+		struct packet_type *pt_prev = NULL;
+
+		__netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+		if (pt_curr != pt_prev || od_curr != orig_dev) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+			/* start new sublist */
+			pt_curr = pt_prev;
+			od_curr = orig_dev;
+		}
+	}
+
+	/* dispatch final sublist */
+	__netif_receive_skb_list_ptype(head, pt_curr, od_curr);
 }
 
 static int __netif_receive_skb(struct sk_buff *skb)
@@ -4695,10 +4751,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
 		 * context down to all allocation sites.
 		 */
 		noreclaim_flag = memalloc_noreclaim_save();
-		ret = __netif_receive_skb_core(skb, true);
+		ret = __netif_receive_skb_one_core(skb, true);
 		memalloc_noreclaim_restore(noreclaim_flag);
 	} else
-		ret = __netif_receive_skb_core(skb, false);
+		ret = __netif_receive_skb_one_core(skb, false);
 
 	return ret;
 }

^ permalink raw reply related

* [PATCH v3 net-next 7/9] net: ipv4: listified version of ip_rcv
From: Edward Cree @ 2018-06-29 19:31 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

Also involved adding a way to run a netfilter hook over a list of packets.
 Rather than attempting to make netfilter know about lists (which would be
 a major project in itself) we just let it call the regular okfn (in this
 case ip_rcv_finish()) for any packets it steals, and have it give us back
 a list of packets it's synchronously accepted (which normally NF_HOOK
 would automatically call okfn() on, but we want to be able to potentially
 pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  3 +++
 include/linux/netfilter.h | 25 +++++++++++++++++
 include/net/ip.h          |  2 ++
 net/core/dev.c            |  8 +++---
 net/ipv4/af_inet.c        |  1 +
 net/ipv4/ip_input.c       | 68 ++++++++++++++++++++++++++++++++++++++++++-----
 6 files changed, 97 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e104b2e4a735..fe81a2bfcd08 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2291,6 +2291,9 @@ struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
+	void			(*list_func) (struct list_head *,
+					      struct packet_type *,
+					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
 					    struct sock *sk);
 	void			*af_packet_priv;
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index dd2052f0efb7..2a41f53be1ce 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -288,6 +288,20 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk, struct
 	return ret;
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	struct sk_buff *skb, *next;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		int ret = nf_hook(pf, hook, net, sk, skb, in, out, okfn);
+		if (ret != 1)
+			list_del(&skb->list);
+	}
+}
+
 /* Call setsockopt() */
 int nf_setsockopt(struct sock *sk, u_int8_t pf, int optval, char __user *opt,
 		  unsigned int len);
@@ -369,6 +383,17 @@ NF_HOOK(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 	return okfn(net, sk, skb);
 }
 
+static inline void
+NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
+	     struct list_head *head, struct list_head *sublist,
+	     struct net_device *in, struct net_device *out,
+	     int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+{
+	INIT_LIST_HEAD(sublist);
+	/* Move everything to the sublist */
+	list_splice_init(head, sublist);
+}
+
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 			  struct sock *sk, struct sk_buff *skb,
 			  struct net_device *indev, struct net_device *outdev,
diff --git a/include/net/ip.h b/include/net/ip.h
index 0d2281b4b27a..1de72f9cb23c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -138,6 +138,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index edd67b1f1e12..4c5ebfab9bc8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4692,9 +4692,11 @@ static inline void __netif_receive_skb_list_ptype(struct list_head *head,
 		return;
 	if (list_empty(head))
 		return;
-
-	list_for_each_entry_safe(skb, next, head, list)
-		pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+	if (pt_prev->list_func != NULL)
+		pt_prev->list_func(head, pt_prev, orig_dev);
+	else
+		list_for_each_entry_safe(skb, next, head, list)
+			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
 static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 06b218a2870f..3ff7659c9afd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1882,6 +1882,7 @@ fs_initcall(ipv4_offload_init);
 static struct packet_type ip_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_IP),
 	.func = ip_rcv,
+	.list_func = ip_list_rcv,
 };
 
 static int __init inet_init(void)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713dd18f..914240830bdf 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -408,10 +408,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 /*
  * 	Main IP Receive routine.
  */
-int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net)
 {
 	const struct iphdr *iph;
-	struct net *net;
 	u32 len;
 
 	/* When the interface is in promisc. mode, drop all the crap
@@ -421,7 +420,6 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		goto drop;
 
 
-	net = dev_net(dev);
 	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
@@ -489,9 +487,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
 
-	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
-		       net, NULL, skb, dev, NULL,
-		       ip_rcv_finish);
+	return skb;
 
 csum_error:
 	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
@@ -500,5 +496,63 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 drop:
 	kfree_skb(skb);
 out:
-	return NET_RX_DROP;
+	return NULL;
+}
+
+/*
+ * IP receive entry point
+ */
+int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
+	   struct net_device *orig_dev)
+{
+	struct net *net = dev_net(dev);
+
+	skb = ip_rcv_core(skb, net);
+	if (skb == NULL)
+		return NET_RX_DROP;
+	return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
+		       net, NULL, skb, dev, NULL,
+		       ip_rcv_finish);
+}
+
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
+{
+	struct sk_buff *skb, *next;
+
+	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
+		     head, dev, NULL, ip_rcv_finish);
+	list_for_each_entry_safe(skb, next, head, list)
+		ip_rcv_finish(net, NULL, skb);
+}
+
+/* Receive a list of IP packets */
+void ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		 struct net_device *orig_dev)
+{
+	struct net_device *curr_dev = NULL;
+	struct net *curr_net = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct net_device *dev = skb->dev;
+		struct net *net = dev_net(dev);
+
+		skb = ip_rcv_core(skb, net);
+		if (skb == NULL)
+			continue;
+
+		if (curr_dev != dev || curr_net != net) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv(&sublist, dev, net);
+			/* start new sublist */
+			curr_dev = dev;
+			curr_net = net;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv(head, curr_dev, curr_net);
 }

^ permalink raw reply related

* [PATCH v3 net-next 8/9] net: ipv4: listify ip_rcv_finish
From: Edward Cree @ 2018-06-29 19:31 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
 demux or route lookup.  The last step, calling dst_input(skb), is left to
 the caller; in the listified case, we split to form sublists with a common
 dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
 to struct dst_entry.

Early demux is an indirect call based on iph->protocol; this is another
 opportunity for listification which is not taken here (it would require
 slicing up ip_rcv_finish_core() to allow splitting on protocol changes).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/ipv4/ip_input.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 914240830bdf..24b9b0210aeb 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -307,7 +307,8 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	return true;
 }
 
-static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int ip_rcv_finish_core(struct net *net, struct sock *sk,
+			      struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	int (*edemux)(struct sk_buff *skb);
@@ -393,7 +394,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	return dst_input(skb);
+	return NET_RX_SUCCESS;
 
 drop:
 	kfree_skb(skb);
@@ -405,6 +406,15 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	goto drop;
 }
 
+static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret = ip_rcv_finish_core(net, sk, skb);
+
+	if (ret != NET_RX_DROP)
+		ret = dst_input(skb);
+	return ret;
+}
+
 /*
  * 	Main IP Receive routine.
  */
@@ -515,15 +525,47 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 }
 
-static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
-			   struct net *net)
+static void ip_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
 
+	list_for_each_entry_safe(skb, next, head, list)
+		dst_input(skb);
+}
+
+static void ip_list_rcv_finish(struct net *net, struct sock *sk,
+			       struct list_head *head)
+{
+	struct dst_entry *curr_dst = NULL;
+	struct sk_buff *skb, *next;
+	struct list_head sublist;
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		struct dst_entry *dst;
+
+		if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
+			continue;
+
+		dst = skb_dst(skb);
+		if (curr_dst != dst) {
+			/* dispatch old sublist */
+			list_cut_before(&sublist, head, &skb->list);
+			if (!list_empty(&sublist))
+				ip_sublist_rcv_finish(&sublist);
+			/* start new sublist */
+			curr_dst = dst;
+		}
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv_finish(head);
+}
+
+static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
+{
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     head, dev, NULL, ip_rcv_finish);
-	list_for_each_entry_safe(skb, next, head, list)
-		ip_rcv_finish(net, NULL, skb);
+	ip_list_rcv_finish(net, NULL, head);
 }
 
 /* Receive a list of IP packets */

^ permalink raw reply related

* [PATCH v3 net-next 9/9] net: don't bother calling list RX functions on empty lists
From: Edward Cree @ 2018-06-29 19:32 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <bec738bd-c3fa-ba54-9473-84c8366c5699@solarflare.com>

Generally the check should be very cheap, as the sk_buff_head is in cache.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4c5ebfab9bc8..d6084b0cd9ce 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4773,7 +4773,8 @@ static void __netif_receive_skb_list(struct list_head *head)
 
 			/* Handle the previous sublist */
 			list_cut_before(&sublist, head, &skb->list);
-			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			if (!list_empty(&sublist))
+				__netif_receive_skb_list_core(&sublist, pfmemalloc);
 			pfmemalloc = !pfmemalloc;
 			/* See comments in __netif_receive_skb */
 			if (pfmemalloc)
@@ -4783,7 +4784,8 @@ static void __netif_receive_skb_list(struct list_head *head)
 		}
 	}
 	/* Handle the remaining sublist */
-	__netif_receive_skb_list_core(head, pfmemalloc);
+	if (!list_empty(head))
+		__netif_receive_skb_list_core(head, pfmemalloc);
 	/* Restore pflags */
 	if (pfmemalloc)
 		memalloc_noreclaim_restore(noreclaim_flag);
@@ -4944,6 +4946,8 @@ void netif_receive_skb_list(struct list_head *head)
 {
 	struct sk_buff *skb;
 
+	if (list_empty(head))
+		return;
 	list_for_each_entry(skb, head, list)
 		trace_netif_receive_skb_list_entry(skb);
 	netif_receive_skb_list_internal(head);

^ permalink raw reply related

* Re: [PATCH net] net: fib_rules: add protocol check in rule_find
From: Roopa Prabhu @ 2018-06-29 19:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <CAJieiUhsvn8s_-EFjJREM0dS7Jg=jw_uLLvLWOOkNfBjdkMdsA@mail.gmail.com>

On Thu, Jun 28, 2018 at 9:59 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> On Wed, Jun 27, 2018 at 6:27 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>
>> After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule
>> delrule msgs into fib_nl2rule"), rule_find is strict about checking
>> for an existing rule. rule_find must check against all
>> user given attributes, else it may match against a subset
>> of attributes and return an existing rule.
>>
>> In the below case, without support for protocol match, rule_find
>> will match only against 'table main' and return an existing rule.
>>
>> $ip -4 rule add table main protocol boot
>> RTNETLINK answers: File exists
>>
>> This patch adds protocol support to rule_find, forcing it to
>> check protocol match if given by the user.
>>
>> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>> I spent some time looking at all match keys today and protocol
>> was the only missing one (protocol is not in a released kernel yet).
>> The only way this could be avoided is to move back to the old loose
>> rule_find. I am worried about this new strict checking surprising users,
>> but going back to the previous loose checking does not seem right either.
>> If there is a reason to believe that users did rely on the previous
>> behaviour, I will be happy to revert. Here is another example of old and
>> new behaviour.
>>
>> old rule_find behaviour:
>> $ip -4 rule add table main protocol boot
>> $ip -4 rule add table main protocol boot
>> $ip -4 rule add table main protocol boot
>> $ip rule show
>> 0:      from all lookup local
>> 32763:  from all lookup main  proto boot
>> 32764:  from all lookup main  proto boot
>> 32765:  from all lookup main  proto boot
>> 32766:  from all lookup main
>> 32767:  from all lookup default
>>
>> new rule_find behaviour (after this patch):
>> $ip -4 rule add table main protocol boot
>> $ip -4 rule add table main protocol boot
>> RTNETLINK answers: File exists
>>
>
> I found the case where the new rule_find breaks for add.
> $ip -4 rule add table main tos 10 fwmark 1
> $ip -4 rule add table main tos 10
> RTNETLINK answers: File exists
>
> The key masks in the new and old rule need to be compared .
> And it cannot be easily compared today without an elaborate if-else block.
> Its best to introduce key masks for easier and accurate rule comparison.
> But this is best done in net-next. I will submit an incremental patch
> tomorrow to
> restore previous rule_exists for the add case and the rest should be good.
>
> The current patch in context is needed regardless.
>
> Thanks (and sorry about the iterations).

as I write the commit msg for the new incremental patch, it seems
better to merge this one with the new one.

Please ignore this patch, I will send an updated patch in a bit. thanks.

^ permalink raw reply

* Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
From: Ilpo Järvinen @ 2018-06-29 19:52 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Netdev, Yuchung Cheng, Eric Dumazet, Michal Kubecek
In-Reply-To: <CADVnQy=1UWUNk-J5=ytVLUoivL7OG0uYh_gJfL1mNLcTr+zuBg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6700 bytes --]

On Fri, 29 Jun 2018, Neal Cardwell wrote:

> On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > If SACK is not enabled and the first cumulative ACK after the RTO
> > retransmission covers more than the retransmitted skb, a spurious
> > FRTO undo will trigger (assuming FRTO is enabled for that RTO).
> > The reason is that any non-retransmitted segment acknowledged will
> > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
> > no indication that it would have been delivered for real (the
> > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
> > case so the check for that bit won't help like it does with SACK).
> > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
> > in tcp_process_loss.
> >
> > We need to use more strict condition for non-SACK case and check
> > that none of the cumulatively ACKed segments were retransmitted
> > to prove that progress is due to original transmissions. Only then
> > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
> > non-SACK case.
> >
> > (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
> > to better indicate its purpose but to keep this change minimal, it
> > will be done in another patch).
> >
> > Besides burstiness and congestion control violations, this problem
> > can result in RTO loop: When the loss recovery is prematurely
> > undoed, only new data will be transmitted (if available) and
> > the next retransmission can occur only after a new RTO which in case
> > of multiple losses (that are not for consecutive packets) requires
> > one RTO per loss to recover.
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> > ---
> >  net/ipv4/tcp_input.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 045d930..8e5522c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >
> >                 if (tcp_is_reno(tp)) {
> >                         tcp_remove_reno_sacks(sk, pkts_acked);
> > +
> > +                       /* If any of the cumulatively ACKed segments was
> > +                        * retransmitted, non-SACK case cannot confirm that
> > +                        * progress was due to original transmission due to
> > +                        * lack of TCPCB_SACKED_ACKED bits even if some of
> > +                        * the packets may have been never retransmitted.
> > +                        */
> > +                       if (flag & FLAG_RETRANS_DATA_ACKED)
> > +                               flag &= ~FLAG_ORIG_SACK_ACKED;
> >                 } else {
> >                         int delta;
> 
> Thanks, Ilpo. I ran this through our TCP packetdrill tests and only
> got one failure, which detected the change you made, so that on the
> first ACK in a non-SACK F-RTO case we stay in CA_Loss.

Great, thanks for testing.

> However, depending on the exact scenario we are concerned about here,
> this may not be a complete solution. Consider the packetdrill test I
> cooked below, which is for a scenario where we imagine a non-SACK
> connection, with data packet #1 and all dupacks lost. With your patch
> we don't have an F-RTO undo on the first cumulative ACK, which is a
> definite improvement. However, we end up with an F-RTO undo on the
> *second* cumulative ACK, because the second ACK didn't cover any
> retransmitted data. That means that we end up in the second round trip
> back in the initial slow-start mode with a very high cwnd and infinite
> ssthresh, even though data was actually lost in the first round trip.
>
> I'm not sure how much we care about all these cases. Perhaps we should
> just disable F-RTO if the connection has no SACK enabled? These
> non-SACK connections are just such a rare case at this point that I
> would propose it's not worth spending too much time on this.
> 
> The following packetdrill script passes when Ilpo's patch is applied.
> This documents the behavior, which I think is questionable:
> 
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
> 
>    +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
>    +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
>  +.02 < . 1:1(0) ack 1 win 257
>    +0 accept(3, ..., ...) = 4
>    +0 write(4, ..., 27000) = 27000
>    +0 > . 1:10001(10000) ack 1
> 
> // Suppose 1:1001 is lost and all dupacks are lost.
> 
> // RTO and retransmit head. This fills a real hole.
>  +.22 > . 1:1001(1000) ack 1
> 
> +.005 < . 1:1(0) ack 2001 win 257

Why did the receiver send a cumulative ACK only for 2001?

>    +0 > . 10001:13001(3000) ack 1
>    +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
>    +0 %{ assert tcpi_snd_cwnd == 3, tcpi_snd_cwnd }%
>    +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
> 
> +.002 < . 1:1(0) ack 10001 win 257

...And then magically all packets were received up to 10001 here as we
get cumulative ACK that far? None were lost I guess? Were they perhaps 
reordered past RTO?

>    +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
>    +0 %{ assert tcpi_snd_cwnd == 18, tcpi_snd_cwnd }%
>    +0 %{ assert tcpi_snd_ssthresh == 2147483647, tcpi_snd_ssthresh }%
>    +0 > . 13001:15001(2000) ack 1

This looks to me just another "lets defeat the heuristics" pattern rather 
than what a realistic receiver / properly functioning "middlebox" should 
do, or do I miss something? I think that the power/ability with 
packetdrill sets a trap for us we need to be careful not to step into: It 
is very possible to overdo scenarios with packetdrill to a point where 
they stop being realistic.


When we go to this road, we might also look into FRTO with SACK with >1 
hole and all ACKs lost. I think it will trigger exactly the same issue my 
patch attempts to fix for non-SACK cases because scoreboard won't get 
S-bits from those non-arriving ACKs (it could also trigger the scenario 
you describe). ...I guess it could be fixed by limiting the undo check 
really into step 3 rather than possibly invoking it already in step 2, 
which is a subtle difference with the current Linux implementation and 
FRTO RFC. But I want to state (now that I bring this scenario up), that I 
think this is much much less likely to trigger than the issue my fix 
addresses (very possible to be true even considering the huge bias in # of 
SACK enabled vs non-SACK flow).

...Beyond that one, I'm yet to see/come up a convincing scenario due to 
FRTO heuristics.


-- 
 i.

^ permalink raw reply

* Re: [PATCH v4 08/18] net: davinci_emac: potentially get the MAC address from MTD
From: David Lechner @ 2018-06-29 20:09 UTC (permalink / raw)
  To: Bartosz Golaszewski, Sekhar Nori, Kevin Hilman, Russell King,
	Grygorii Strashko, David S . Miller, Srinivas Kandagatla,
	Lukas Wunner, Rob Herring, Florian Fainelli, Dan Carpenter,
	Ivan Khoronzhuk, Greg Kroah-Hartman, Andrew Lunn, Jonathan Corbet
  Cc: linux-arm-kernel, linux-kernel, linux-omap, netdev,
	Bartosz Golaszewski
In-Reply-To: <20180629094039.7543-9-brgl@bgdev.pl>

On 06/29/2018 04:40 AM, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> 
> On da850-evm board we can read the MAC address from MTD. It's currently
> done in the relevant board file, but we want to get rid of all the MAC
> reading callbacks from the board file (SPI and NAND). Move the reading
> of the MAC address from SPI to the emac driver's probe function.
> 
> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
> ---
>   drivers/net/ethernet/ti/davinci_emac.c | 20 ++++++++++++++++++--
>   1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
> index a1a6445b5a7e..48e6a7755811 100644
> --- a/drivers/net/ethernet/ti/davinci_emac.c
> +++ b/drivers/net/ethernet/ti/davinci_emac.c
> @@ -67,7 +67,7 @@
>   #include <linux/of_irq.h>
>   #include <linux/of_net.h>
>   #include <linux/mfd/syscon.h>
> -
> +#include <linux/mtd/mtd.h>
>   #include <asm/irq.h>
>   #include <asm/page.h>
>   
> @@ -1783,7 +1783,10 @@ static int davinci_emac_probe(struct platform_device *pdev)
>   	struct cpdma_params dma_params;
>   	struct clk *emac_clk;
>   	unsigned long emac_bus_frequency;
> -
> +#ifdef CONFIG_MTD
> +	size_t mac_addr_len;
> +	struct mtd_info *mtd;
> +#endif /* CONFIG_MTD */
>   
>   	/* obtain emac clock from kernel */
>   	emac_clk = devm_clk_get(&pdev->dev, NULL);
> @@ -1815,6 +1818,19 @@ static int davinci_emac_probe(struct platform_device *pdev)
>   		goto err_free_netdev;
>   	}
>   
> +#ifdef CONFIG_MTD

What about the case when MTD is compiled as a module?

> +	mtd = get_mtd_device_nm("MAC-Address");

What about the case when PTR_ERR(mtd) == -EPROBE_DEFER?

> +	if (!IS_ERR(mtd)) {
> +		rc = mtd_read(mtd, 0, ETH_ALEN,
> +			      &mac_addr_len, priv->mac_addr);
> +		if (rc == 0)
> +			dev_info(&pdev->dev,
> +				 "Read MAC addr from SPI Flash: %pM\n",
> +				 priv->mac_addr);
> +		put_mtd_device(mtd);
> +	}
> +#endif /* CONFIG_MTD */
> +
>   	/* MAC addr and PHY mask , RMII enable info from platform_data */
>   	memcpy(priv->mac_addr, pdata->mac_addr, ETH_ALEN);
>   	priv->phy_id = pdata->phy_id;
> 

^ permalink raw reply

* Re: [PATCH v1 net-next 14/14] net/sched: Make etf report drops on error_queue
From: Jesus Sanchez-Palencia @ 2018-06-29 20:14 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, Thomas Gleixner, jan.altenberg,
	Vinicius Gomes, kurt.kanzenbach, Henrik Austad, Richard Cochran,
	ilias.apalodimas, ivan.khoronzhuk, Miroslav Lichvar,
	Willem de Bruijn, Jamal Hadi Salim, Cong Wang,
	Jiří Pírko, Alexander Duyck
In-Reply-To: <CAF=yD-KqkuZsstXK2-s9_ttuCKx=wucHS9zEmkdH=BsgMY=a0Q@mail.gmail.com>




On 06/29/2018 11:49 AM, Willem de Bruijn wrote:
>>>> diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
>>>> +static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
>>>> +{
>>>> +       struct sock_exterr_skb *serr;
>>>> +       ktime_t txtime = skb->tstamp;
>>>> +
>>>> +       if (!skb->sk || !(skb->sk->sk_txtime_flags & SK_TXTIME_RECV_ERR_MASK))
>>>> +               return;
>>>> +
>>>> +       skb = skb_clone_sk(skb);
>>>> +       if (!skb)
>>>> +               return;
>>>> +
>>>> +       sock_hold(skb->sk);
>>>
>>> Why take an extra reference? The skb holds a ref on the sk.
>>
>>
>> Yes, the cloned skb holds a ref on the socket, but the documentation of
>> skb_clone_sk() makes this explicit suggestion:
>>
>> (...)
>>  * When passing buffers allocated with this function to sock_queue_err_skb
>>  * it is necessary to wrap the call with sock_hold/sock_put in order to
>>  * prevent the socket from being released prior to being enqueued on
>>  * the sk_error_queue.
>>  */
>>
>> which I believe is here just so we are protected against a possible race after
>> skb_orphan() is called from sock_queue_err_skb(). Please let me know if I'm
>> misreading anything.
> 
> Yes, indeed. Code only has to worry about that if there are no
> concurrent references
> on the socket.
> 
> I may be mistaken, but I believe that this complicated logic exists
> only for cases where
> the timestamp may be queued after the original skb has been released.
> Specifically,
> when a tx timestamp is returned from a hardware device after transmission of the
> original skb. Then the cloned timestamp skb needs its own reference on
> the sk while
> it is waiting for the timestamp data (i.e., until the device
> completion arrives) and then
> we need a temporary extra ref to work around the skb_orphan in
> sock_queue_err_skb.
> 
> Compare skb_complete_tx_timestamp with skb_tstamp_tx. The second is used in
> the regular datapath to clone an skb and queue it on the error queue
> immediately,
> while holding the original skb. This does not call skb_clone_sk and
> does not need the
> extra sock_hold. This should be good enough for this code path, too.
> As kb holds a
> ref on skb->sk, the socket cannot go away in the middle of report_sock_error.


Oh, that makes sense. Great, I will give this a try and add it to the v2.

Thanks,
Jesus

^ permalink raw reply

* Re: [PATCH v4 08/18] net: davinci_emac: potentially get the MAC address from MTD
From: David Lechner @ 2018-06-29 20:35 UTC (permalink / raw)
  To: Bartosz Golaszewski, Sekhar Nori, Kevin Hilman, Russell King,
	Grygorii Strashko, David S . Miller, Srinivas Kandagatla,
	Lukas Wunner, Rob Herring, Florian Fainelli, Dan Carpenter,
	Ivan Khoronzhuk, Greg Kroah-Hartman, Andrew Lunn, Jonathan Corbet
  Cc: linux-arm-kernel, linux-kernel, linux-omap, netdev,
	Bartosz Golaszewski
In-Reply-To: <e2579b81-7c66-8be6-fcad-ab7a3a5b0765@lechnology.com>

On 06/29/2018 03:09 PM, David Lechner wrote:
> On 06/29/2018 04:40 AM, Bartosz Golaszewski wrote:
>> From: Bartosz Golaszewski <bgolaszewski@baylibre.com>
>>
>> On da850-evm board we can read the MAC address from MTD. It's currently
>> done in the relevant board file, but we want to get rid of all the MAC
>> reading callbacks from the board file (SPI and NAND). Move the reading
>> of the MAC address from SPI to the emac driver's probe function.
>>
>> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
>> ---
>>   drivers/net/ethernet/ti/davinci_emac.c | 20 ++++++++++++++++++--
>>   1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
>> index a1a6445b5a7e..48e6a7755811 100644
>> --- a/drivers/net/ethernet/ti/davinci_emac.c
>> +++ b/drivers/net/ethernet/ti/davinci_emac.c
>> @@ -67,7 +67,7 @@
>>   #include <linux/of_irq.h>
>>   #include <linux/of_net.h>
>>   #include <linux/mfd/syscon.h>
>> -
>> +#include <linux/mtd/mtd.h>
>>   #include <asm/irq.h>
>>   #include <asm/page.h>
>> @@ -1783,7 +1783,10 @@ static int davinci_emac_probe(struct platform_device *pdev)
>>       struct cpdma_params dma_params;
>>       struct clk *emac_clk;
>>       unsigned long emac_bus_frequency;
>> -
>> +#ifdef CONFIG_MTD
>> +    size_t mac_addr_len;
>> +    struct mtd_info *mtd;
>> +#endif /* CONFIG_MTD */
>>       /* obtain emac clock from kernel */
>>       emac_clk = devm_clk_get(&pdev->dev, NULL);
>> @@ -1815,6 +1818,19 @@ static int davinci_emac_probe(struct platform_device *pdev)
>>           goto err_free_netdev;
>>       }
>> +#ifdef CONFIG_MTD
> 
> What about the case when MTD is compiled as a module?
> 
>> +    mtd = get_mtd_device_nm("MAC-Address");
> 
> What about the case when PTR_ERR(mtd) == -EPROBE_DEFER?

To answer my own question: because get_mtd_device_nm() doesn't
ever return -EPROBE_DEFER.

I'm trying to make this work on LCDK, but the emac driver probes
before any mtd device is registered, so, I just get -ENODEV even
though I've added a partition to the device tree labeled
"MAC-Address". You can see in the kernel messages that MTD is
not probed until later.

> 
>> +    if (!IS_ERR(mtd)) {
>> +        rc = mtd_read(mtd, 0, ETH_ALEN,
>> +                  &mac_addr_len, priv->mac_addr);
>> +        if (rc == 0)
>> +            dev_info(&pdev->dev,
>> +                 "Read MAC addr from SPI Flash: %pM\n",
>> +                 priv->mac_addr);
>> +        put_mtd_device(mtd);
>> +    }
>> +#endif /* CONFIG_MTD */
>> +
>>       /* MAC addr and PHY mask , RMII enable info from platform_data */
>>       memcpy(priv->mac_addr, pdata->mac_addr, ETH_ALEN);
>>       priv->phy_id = pdata->phy_id;
>>
> 

^ permalink raw reply

* Re: [PATCH v6 2/3] drivers core: prepare device_shutdown for multi-threading
From: Andy Shevchenko @ 2018-06-29 20:38 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: Steven Sistare, Daniel Jordan, Linux Kernel Mailing List,
	Kirsher, Jeffrey T, intel-wired-lan, netdev, Greg Kroah-Hartman,
	Alexander Duyck, tobin
In-Reply-To: <20180629182541.6735-3-pasha.tatashin@oracle.com>

On Fri, Jun 29, 2018 at 9:25 PM, Pavel Tatashin
<pasha.tatashin@oracle.com> wrote:
> Do all the necessary refactoring to prepare device_shutdown() logic to
> be multi-threaded.
>
> Which includes:
> 1. Change the direction of traversing the list instead of going backward,
> we now go forward.
> 2. Children are shutdown recursively for each root device from bottom-up.
> 3. Functions that can be multi-threaded have _task() in their name.

> +/*
> + * device_children_count - device children count
> + * @parent: parent struct device.
> + *
> + * Returns number of children for this device or 0 if none.
> + */
> +static int device_children_count(struct device *parent)
> +{
> +       struct klist_iter i;
> +       int children = 0;
> +
> +       if (!parent->p)
> +               return 0;
> +
> +       klist_iter_init(&parent->p->klist_children, &i);
> +       while (next_device(&i))
> +               children++;
> +       klist_iter_exit(&i);
> +
> +       return children;
> +}

Looking at another patch in LKML (for gluedir children counting) I
would suggest to consider to cache children value.
Would it  make sense?

(So, basically you would update that value on children registered / removed)

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* [Patch net] net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
From: Cong Wang @ 2018-06-29 20:42 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, Cong Wang

As noticed by Eric, we need to switch to the helper
dev_change_tx_queue_len() for SIOCSIFTXQLEN call path too,
otheriwse still miss dev_qdisc_change_tx_queue_len().

Fixes: 6a643ddb5624 ("net: introduce helper dev_change_tx_queue_len()")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/core/dev_ioctl.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
index a04e1e88bf3a..50537ff961a7 100644
--- a/net/core/dev_ioctl.c
+++ b/net/core/dev_ioctl.c
@@ -285,16 +285,9 @@ static int dev_ifsioc(struct net *net, struct ifreq *ifr, unsigned int cmd)
 		if (ifr->ifr_qlen < 0)
 			return -EINVAL;
 		if (dev->tx_queue_len ^ ifr->ifr_qlen) {
-			unsigned int orig_len = dev->tx_queue_len;
-
-			dev->tx_queue_len = ifr->ifr_qlen;
-			err = call_netdevice_notifiers(
-					NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-			err = notifier_to_errno(err);
-			if (err) {
-				dev->tx_queue_len = orig_len;
+			err = dev_change_tx_queue_len(dev, ifr->ifr_qlen);
+			if (err)
 				return err;
-			}
 		}
 		return 0;
 
-- 
2.14.4

^ permalink raw reply related

* [PATCH net] hv_netvsc: split sub-channel setup into async and sync
From: Haiyang Zhang @ 2018-06-29 21:07 UTC (permalink / raw)
  To: davem, netdev
  Cc: haiyangz, kys, sthemmin, olaf, vkuznets, devel, linux-kernel

From: Stephen Hemminger <sthemmin@microsoft.com>

When doing device hotplug the sub channel must be async to avoid
deadlock issues because device is discovered in softirq context.

When doing changes to MTU and number of channels, the setup
must be synchronous to avoid races such as when MTU and device
settings are done in a single ip command.

Reported-by: Thomas Walker <Thomas.Walker@twosigma.com>
Fixes: 8195b1396ec8 ("hv_netvsc: fix deadlock on hotplug")
Fixes: 732e49850c5e ("netvsc: fix race on sub channel creation")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h   |  2 +-
 drivers/net/hyperv/netvsc.c       | 37 ++++++++++++++++++-
 drivers/net/hyperv/netvsc_drv.c   | 17 ++++++++-
 drivers/net/hyperv/rndis_filter.c | 61 ++++++-------------------------
 4 files changed, 65 insertions(+), 52 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 1a924b867b07..4b6e308199d2 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -210,7 +210,7 @@ int netvsc_recv_callback(struct net_device *net,
 void netvsc_channel_cb(void *context);
 int netvsc_poll(struct napi_struct *napi, int budget);
 
-void rndis_set_subchannel(struct work_struct *w);
+int rndis_set_subchannel(struct net_device *ndev, struct netvsc_device *nvdev);
 int rndis_filter_open(struct netvsc_device *nvdev);
 int rndis_filter_close(struct netvsc_device *nvdev);
 struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 5d5bd513847f..8e9d0ee1572b 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -65,6 +65,41 @@ void netvsc_switch_datapath(struct net_device *ndev, bool vf)
 			       VM_PKT_DATA_INBAND, 0);
 }
 
+/* Worker to setup sub channels on initial setup
+ * Initial hotplug event occurs in softirq context
+ * and can't wait for channels.
+ */
+static void netvsc_subchan_work(struct work_struct *w)
+{
+	struct netvsc_device *nvdev =
+		container_of(w, struct netvsc_device, subchan_work);
+	struct rndis_device *rdev;
+	int i, ret;
+
+	/* Avoid deadlock with device removal already under RTNL */
+	if (!rtnl_trylock()) {
+		schedule_work(w);
+		return;
+	}
+
+	rdev = nvdev->extension;
+	if (rdev) {
+		ret = rndis_set_subchannel(rdev->ndev, nvdev);
+		if (ret == 0) {
+			netif_device_attach(rdev->ndev);
+		} else {
+			/* fallback to only primary channel */
+			for (i = 1; i < nvdev->num_chn; i++)
+				netif_napi_del(&nvdev->chan_table[i].napi);
+
+			nvdev->max_chn = 1;
+			nvdev->num_chn = 1;
+		}
+	}
+
+	rtnl_unlock();
+}
+
 static struct netvsc_device *alloc_net_device(void)
 {
 	struct netvsc_device *net_device;
@@ -81,7 +116,7 @@ static struct netvsc_device *alloc_net_device(void)
 
 	init_completion(&net_device->channel_init_wait);
 	init_waitqueue_head(&net_device->subchan_open);
-	INIT_WORK(&net_device->subchan_work, rndis_set_subchannel);
+	INIT_WORK(&net_device->subchan_work, netvsc_subchan_work);
 
 	return net_device;
 }
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index fe2256bf1d13..dd1d6e115145 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -905,8 +905,20 @@ static int netvsc_attach(struct net_device *ndev,
 	if (IS_ERR(nvdev))
 		return PTR_ERR(nvdev);
 
-	/* Note: enable and attach happen when sub-channels setup */
+	if (nvdev->num_chn > 1) {
+		ret = rndis_set_subchannel(ndev, nvdev);
+
+		/* if unavailable, just proceed with one queue */
+		if (ret) {
+			nvdev->max_chn = 1;
+			nvdev->num_chn = 1;
+		}
+	}
+
+	/* In any case device is now ready */
+	netif_device_attach(ndev);
 
+	/* Note: enable and attach happen when sub-channels setup */
 	netif_carrier_off(ndev);
 
 	if (netif_running(ndev)) {
@@ -2089,6 +2101,9 @@ static int netvsc_probe(struct hv_device *dev,
 
 	memcpy(net->dev_addr, device_info.mac_adr, ETH_ALEN);
 
+	if (nvdev->num_chn > 1)
+		schedule_work(&nvdev->subchan_work);
+
 	/* hw_features computed in rndis_netdev_set_hwcaps() */
 	net->features = net->hw_features |
 		NETIF_F_HIGHDMA | NETIF_F_SG |
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 5428bb261102..9b4e3c3787e5 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -1062,29 +1062,15 @@ static void netvsc_sc_open(struct vmbus_channel *new_sc)
  * This breaks overlap of processing the host message for the
  * new primary channel with the initialization of sub-channels.
  */
-void rndis_set_subchannel(struct work_struct *w)
+int rndis_set_subchannel(struct net_device *ndev, struct netvsc_device *nvdev)
 {
-	struct netvsc_device *nvdev
-		= container_of(w, struct netvsc_device, subchan_work);
 	struct nvsp_message *init_packet = &nvdev->channel_init_pkt;
-	struct net_device_context *ndev_ctx;
-	struct rndis_device *rdev;
-	struct net_device *ndev;
-	struct hv_device *hv_dev;
+	struct net_device_context *ndev_ctx = netdev_priv(ndev);
+	struct hv_device *hv_dev = ndev_ctx->device_ctx;
+	struct rndis_device *rdev = nvdev->extension;
 	int i, ret;
 
-	if (!rtnl_trylock()) {
-		schedule_work(w);
-		return;
-	}
-
-	rdev = nvdev->extension;
-	if (!rdev)
-		goto unlock;	/* device was removed */
-
-	ndev = rdev->ndev;
-	ndev_ctx = netdev_priv(ndev);
-	hv_dev = ndev_ctx->device_ctx;
+	ASSERT_RTNL();
 
 	memset(init_packet, 0, sizeof(struct nvsp_message));
 	init_packet->hdr.msg_type = NVSP_MSG5_TYPE_SUBCHANNEL;
@@ -1100,13 +1086,13 @@ void rndis_set_subchannel(struct work_struct *w)
 			       VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED);
 	if (ret) {
 		netdev_err(ndev, "sub channel allocate send failed: %d\n", ret);
-		goto failed;
+		return ret;
 	}
 
 	wait_for_completion(&nvdev->channel_init_wait);
 	if (init_packet->msg.v5_msg.subchn_comp.status != NVSP_STAT_SUCCESS) {
 		netdev_err(ndev, "sub channel request failed\n");
-		goto failed;
+		return -EIO;
 	}
 
 	nvdev->num_chn = 1 +
@@ -1125,21 +1111,7 @@ void rndis_set_subchannel(struct work_struct *w)
 	for (i = 0; i < VRSS_SEND_TAB_SIZE; i++)
 		ndev_ctx->tx_table[i] = i % nvdev->num_chn;
 
-	netif_device_attach(ndev);
-	rtnl_unlock();
-	return;
-
-failed:
-	/* fallback to only primary channel */
-	for (i = 1; i < nvdev->num_chn; i++)
-		netif_napi_del(&nvdev->chan_table[i].napi);
-
-	nvdev->max_chn = 1;
-	nvdev->num_chn = 1;
-
-	netif_device_attach(ndev);
-unlock:
-	rtnl_unlock();
+	return 0;
 }
 
 static int rndis_netdev_set_hwcaps(struct rndis_device *rndis_device,
@@ -1360,21 +1332,12 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 		netif_napi_add(net, &net_device->chan_table[i].napi,
 			       netvsc_poll, NAPI_POLL_WEIGHT);
 
-	if (net_device->num_chn > 1)
-		schedule_work(&net_device->subchan_work);
+	return net_device;
 
 out:
-	/* if unavailable, just proceed with one queue */
-	if (ret) {
-		net_device->max_chn = 1;
-		net_device->num_chn = 1;
-	}
-
-	/* No sub channels, device is ready */
-	if (net_device->num_chn == 1)
-		netif_device_attach(net);
-
-	return net_device;
+	/* setting up multiple channels failed */
+	net_device->max_chn = 1;
+	net_device->num_chn = 1;
 
 err_dev_remv:
 	rndis_filter_device_remove(dev, net_device);
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH v6 2/3] drivers core: prepare device_shutdown for multi-threading
From: Pavel Tatashin @ 2018-06-29 21:20 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Steven Sistare, Daniel Jordan, Linux Kernel Mailing List,
	Kirsher, Jeffrey T, intel-wired-lan, netdev, Greg Kroah-Hartman,
	Alexander Duyck, tobin
In-Reply-To: <CAHp75VdQsrxS5=Jb_+Wfu=JL0-AR1B=p8Jp3psEqH5ApmpewGA@mail.gmail.com>

On 06/29/2018 04:38 PM, Andy Shevchenko wrote:
> On Fri, Jun 29, 2018 at 9:25 PM, Pavel Tatashin
> <pasha.tatashin@oracle.com> wrote:
>> Do all the necessary refactoring to prepare device_shutdown() logic to
>> be multi-threaded.
>>
>> Which includes:
>> 1. Change the direction of traversing the list instead of going backward,
>> we now go forward.
>> 2. Children are shutdown recursively for each root device from bottom-up.
>> 3. Functions that can be multi-threaded have _task() in their name.
> 
>> +/*
>> + * device_children_count - device children count
>> + * @parent: parent struct device.
>> + *
>> + * Returns number of children for this device or 0 if none.
>> + */
>> +static int device_children_count(struct device *parent)
>> +{
>> +       struct klist_iter i;
>> +       int children = 0;
>> +
>> +       if (!parent->p)
>> +               return 0;
>> +
>> +       klist_iter_init(&parent->p->klist_children, &i);
>> +       while (next_device(&i))
>> +               children++;
>> +       klist_iter_exit(&i);
>> +
>> +       return children;
>> +}
> 
> Looking at another patch in LKML (for gluedir children counting) I
> would suggest to consider to cache children value.
> Would it  make sense?
> 
> (So, basically you would update that value on children registered / removed)
> 

Hi Andy,

Thank you for looking at this patch.

For this particular series, this function is called only once per device during shutdown, when nothing else is running on the machine. So, caching won't really provide us any improvement.

If, in the future this function is going to be called from other sites during run-time of the machine, then yes, that would mean that caching might help.

Thank you,
Pavel

^ permalink raw reply

* [PATCH net] net: fib_rules: bring back rule_exists to match rule during add
From: Roopa Prabhu @ 2018-06-29 21:32 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsa

From: Roopa Prabhu <roopa@cumulusnetworks.com>

After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule
delrule msgs into fib_nl2rule"), rule_exists got replaced by rule_find
for existing rule lookup in both the add and del paths. While this
is good for the delete path, it solves a few problems but opens up
a few invalid key matches in the add path.

$ip -4 rule add table main tos 10 fwmark 1
$ip -4 rule add table main tos 10
RTNETLINK answers: File exists

The problem here is rule_find does not check if the key masks in
the new and old rule are the same and hence ends up matching a more
secific rule. Rule key masks cannot be easily compared today without
an elaborate if-else block. Its best to introduce key masks for easier
and accurate rule comparison in the future. Until then, due to fear of
regressions this patch re-introduces older loose rule_exists during add.
Also fixes both rule_exists and rule_find to cover missing attributes.

Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
---
This does bring back old problems with rule_exists. But at this hour, old problems
are better than new problems!. The pref compare in rule_exists can be fixed, but its best
done in net-next. I will look at adding more coverage in fib rule selftests. Thanks.

 net/core/fib_rules.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 71 insertions(+), 1 deletion(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index bc8425d..f64aa13 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -444,6 +444,9 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops,
 		if (rule->ip_proto && r->ip_proto != rule->ip_proto)
 			continue;
 
+		if (rule->proto && r->proto != rule->proto)
+			continue;
+
 		if (fib_rule_port_range_set(&rule->sport_range) &&
 		    !fib_rule_port_range_compare(&r->sport_range,
 						 &rule->sport_range))
@@ -653,6 +656,73 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return err;
 }
 
+static int rule_exists(struct fib_rules_ops *ops, struct fib_rule_hdr *frh,
+		       struct nlattr **tb, struct fib_rule *rule)
+{
+	struct fib_rule *r;
+
+	list_for_each_entry(r, &ops->rules_list, list) {
+		if (r->action != rule->action)
+			continue;
+
+		if (r->table != rule->table)
+			continue;
+
+		if (r->pref != rule->pref)
+			continue;
+
+		if (memcmp(r->iifname, rule->iifname, IFNAMSIZ))
+			continue;
+
+		if (memcmp(r->oifname, rule->oifname, IFNAMSIZ))
+			continue;
+
+		if (r->mark != rule->mark)
+			continue;
+
+		if (r->suppress_ifgroup != rule->suppress_ifgroup)
+			continue;
+
+		if (r->suppress_prefixlen != rule->suppress_prefixlen)
+			continue;
+
+		if (r->mark_mask != rule->mark_mask)
+			continue;
+
+		if (r->tun_id != rule->tun_id)
+			continue;
+
+		if (r->fr_net != rule->fr_net)
+			continue;
+
+		if (r->l3mdev != rule->l3mdev)
+			continue;
+
+		if (!uid_eq(r->uid_range.start, rule->uid_range.start) ||
+		    !uid_eq(r->uid_range.end, rule->uid_range.end))
+			continue;
+
+		if (r->ip_proto != rule->ip_proto)
+			continue;
+
+		if (r->proto != rule->proto)
+			continue;
+
+		if (!fib_rule_port_range_compare(&r->sport_range,
+						 &rule->sport_range))
+			continue;
+
+		if (!fib_rule_port_range_compare(&r->dport_range,
+						 &rule->dport_range))
+			continue;
+
+		if (!ops->compare(r, frh, tb))
+			continue;
+		return 1;
+	}
+	return 0;
+}
+
 int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 		   struct netlink_ext_ack *extack)
 {
@@ -687,7 +757,7 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh,
 		goto errout;
 
 	if ((nlh->nlmsg_flags & NLM_F_EXCL) &&
-	    rule_find(ops, frh, tb, rule, user_priority)) {
+	    rule_exists(ops, frh, tb, rule)) {
 		err = -EEXIST;
 		goto errout_free;
 	}
-- 
2.1.4

^ permalink raw reply related

* Re: [Patch net] net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
From: Eric Dumazet @ 2018-06-29 21:36 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: edumazet
In-Reply-To: <20180629204248.28919-1-xiyou.wangcong@gmail.com>



On 06/29/2018 01:42 PM, Cong Wang wrote:
> As noticed by Eric, we need to switch to the helper
> dev_change_tx_queue_len() for SIOCSIFTXQLEN call path too,
> otheriwse still miss dev_qdisc_change_tx_queue_len().
> 
> Fixes: 6a643ddb5624 ("net: introduce helper dev_change_tx_queue_len()")
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* Re: [PATCH] atmel: using strlcpy() to avoid possible buffer overflows
From: Andy Shevchenko @ 2018-06-29 21:47 UTC (permalink / raw)
  To: YueHaibing
  Cc: simon, Kalle Valo, Linux Kernel Mailing List, netdev,
	open list:TI WILINK WIRELES..., David S. Miller
In-Reply-To: <20180629025135.12236-1-yuehaibing@huawei.com>

On Fri, Jun 29, 2018 at 5:51 AM, YueHaibing <yuehaibing@huawei.com> wrote:
> 'firmware' is a module param which may been longer than firmware_id,
> so using strlcpy() to guard against overflows

strncat() is against overflow, this does a bit more.

>         priv->firmware_id[0] = '\0';
...
>         if (firmware) /* module parameter */
> -               strcpy(priv->firmware_id, firmware);
> +               strlcpy(priv->firmware_id, firmware, sizeof(priv->firmware_id));

In either case the above '\0' is not needed.
But it looks like the intention was to use strncat() / strlcat().

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* Re: [PATCH] atmel: using strlcpy() to avoid possible buffer overflows
From: Andy Shevchenko @ 2018-06-29 21:51 UTC (permalink / raw)
  To: YueHaibing
  Cc: simon, Kalle Valo, Linux Kernel Mailing List, netdev,
	open list:TI WILINK WIRELES..., David S. Miller
In-Reply-To: <CAHp75Vet-vTz_Ld=Lturmm8NSpK1etvoj-7Wzn4-h=bM2FQgZg@mail.gmail.com>

On Sat, Jun 30, 2018 at 12:47 AM, Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
> On Fri, Jun 29, 2018 at 5:51 AM, YueHaibing <yuehaibing@huawei.com> wrote:
>> 'firmware' is a module param which may been longer than firmware_id,
>> so using strlcpy() to guard against overflows
>
> strncat() is against overflow, this does a bit more.
>
>>         priv->firmware_id[0] = '\0';
> ...
>>         if (firmware) /* module parameter */
>> -               strcpy(priv->firmware_id, firmware);
>> +               strlcpy(priv->firmware_id, firmware, sizeof(priv->firmware_id));
>
> In either case the above '\0' is not needed.
> But it looks like the intention was to use strncat() / strlcat().

Ah, this is under condition, yes. If no parameter supplied, this needs
to be clean, but
priv is allocated with zeroed memory
https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L8369

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox