Netdev List
 help / color / mirror / Atom feed
* Re: [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Cong Wang @ 2017-09-25 23:13 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925190031.GB1936@nanopsycho.orion>

On Mon, Sep 25, 2017 at 12:00 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Interesting, any idea why this is not 0x7FFFFFFF as well?
>
> I wonder if we could have 0x7FFFFFFF magic defined somewhere.

I have no idea, it just exists for a rather long time. Probably too late
to change, or at least requires a separate patch to change it.

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Cong Wang @ 2017-09-25 23:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Linux Kernel Network Developers, Chris Mi, Jamal Hadi Salim
In-Reply-To: <59C97244.30301@iogearbox.net>

On Mon, Sep 25, 2017 at 2:16 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 09/25/2017 07:13 PM, Cong Wang wrote:
>>         ret = cls_bpf_offload(tp, prog, oldprog);
>>         if (ret) {
>> +               if (!oldprog)
>> +                       idr_remove_ext(&head->handle_idr, prog->handle);
>
>
> Shouldn't we also call idr_remove_ext() when there was an
> oldprog, but we didn't care about reusing the same handle,
> so it was handle == 0 initially?

When oldprog is non-NULL, we are replacing the oldprog with
a new one, therefore we should call idr_replace_ext() which
happens after this. So no need to call idr_remove_ext() at
this point.



>
> There's this condition in the code before above idr allocations,
> I think also in other classifiers:
>
>         if (oldprog) {
>                 if (handle && oldprog->handle != handle) {
>                         ret = -EINVAL;
>                         goto errout;
>                 }
>         }

Sure. If we use handle to find oldprog, it should have the
same handle. cls_bpf_get() guarantees it. This check is
redundant.

>
>>                 __cls_bpf_delete_prog(prog);
>>                 return ret;
>>         }
>> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct
>> sk_buff *in_skb,
>>                 prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>>
>>         if (oldprog) {
>> +               idr_replace_ext(&head->handle_idr, prog, handle);
>
>
> And here, we should probably use prog->handle for the above
> mentioned case as well, no?

Since are replacing oldprog with a new one, prog->handle is
same with handle.


>
> Would be great if all this (and e.g. the fact that we use idr itself)
> could optionally be hidden behind some handle generator api given
> we could reuse that api also for cls_basic and cls_u32. Could also
> be followed-up perhaps.
>

Yeah, the idr_alloc_ext(.., handle, handle+1,) is ugly. Ideally we should
specify the range during initialization rather than in each idr_alloc_ext().
Commit c15ab236d69d already did the same thing. We can refactor
this later.

^ permalink raw reply

* [PATCH net] net: dsa: Fix network device registration order
From: Florian Fainelli @ 2017-09-25 22:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli, Andrew Lunn, Vivien Didelot, open list

We cannot be registering the network device first, then setting its
carrier off and finally connecting it to a PHY, doing that leaves a
window during which the carrier is at best inconsistent, and at worse
the device is not usable without a down/up sequence since the network
device is visible to user space with possibly no PHY device attached.

Re-order steps so that they make logical sense. This fixes some devices
where the port was not usable after e.g: an unbind then bind of the
driver.

Fixes: 0071f56e46da ("dsa: Register netdev before phy")
Fixes: 91da11f870f0 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 net/dsa/slave.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 2afa99506f8b..865e29e62bad 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1301,28 +1301,33 @@ int dsa_slave_create(struct dsa_port *port, const char *name)
 	p->old_duplex = -1;
 
 	port->netdev = slave_dev;
-	ret = register_netdev(slave_dev);
-	if (ret) {
-		netdev_err(master, "error %d registering interface %s\n",
-			   ret, slave_dev->name);
-		port->netdev = NULL;
-		free_percpu(p->stats64);
-		free_netdev(slave_dev);
-		return ret;
-	}
 
 	netif_carrier_off(slave_dev);
 
 	ret = dsa_slave_phy_setup(p, slave_dev);
 	if (ret) {
 		netdev_err(master, "error %d setting up slave phy\n", ret);
-		unregister_netdev(slave_dev);
-		free_percpu(p->stats64);
-		free_netdev(slave_dev);
-		return ret;
+		goto out_free;
+	}
+
+	ret = register_netdev(slave_dev);
+	if (ret) {
+		netdev_err(master, "error %d registering interface %s\n",
+			   ret, slave_dev->name);
+		goto out_phy;
 	}
 
 	return 0;
+
+out_phy:
+	phy_disconnect(p->phy);
+	if (of_phy_is_fixed_link(p->dp->dn))
+		of_phy_deregister_fixed_link(p->dp->dn);
+out_free:
+	free_percpu(p->stats64);
+	free_netdev(slave_dev);
+	port->netdev = NULL;
+	return ret;
 }
 
 void dsa_slave_destroy(struct net_device *slave_dev)
-- 
2.9.3

^ permalink raw reply related

* [jkirsher/next-queue PATCH] ixgbe: Update adaptive ITR algorithm
From: Alexander Duyck @ 2017-09-25 21:55 UTC (permalink / raw)
  To: netdev, intel-wired-lan; +Cc: john.fastabend, brouer

From: Alexander Duyck <alexander.h.duyck@intel.com>

The following change is meant to update the adaptive ITR algorithm to
better support the needs of the network. Specifically with this change what
I have done is make it so that our ITR algorithm will try to prevent either
starving a socket buffer for memory in the case of Tx, or overruing an Rx
socket buffer on receive.

In addition a side effect of the calculations used is that we should
function better with new features such as XDP which can handle small
packets at high rates without needing to lock us into NAPI polling mode.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

So I am putting this out to a wider distribution list than normal for a
patch like this in order to get feedback on if there are any areas I may
have overlooked. With this patch is should address many of the performance
limitations seen with pktgen and XDP in terms of workloads that the old
adaptive scheme wasn't handling.

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    7 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |   11 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  215 +++++++++++++++++++------
 3 files changed, 178 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 56039d04b38d..555eb80d8a08 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -435,8 +435,15 @@ static inline unsigned int ixgbe_rx_pg_order(struct ixgbe_ring *ring)
 }
 #define ixgbe_rx_pg_size(_ring) (PAGE_SIZE << ixgbe_rx_pg_order(_ring))
 
+#define IXGBE_ITR_ADAPTIVE_MIN_INC	2
+#define IXGBE_ITR_ADAPTIVE_MIN_USECS	10
+#define IXGBE_ITR_ADAPTIVE_MAX_USECS	126
+#define IXGBE_ITR_ADAPTIVE_LATENCY	0x80
+#define IXGBE_ITR_ADAPTIVE_BULK		0x00
+
 struct ixgbe_ring_container {
 	struct ixgbe_ring *ring;	/* pointer to linked list of rings */
+	unsigned long next_update;	/* jiffies value of last update */
 	unsigned int total_bytes;	/* total bytes processed this int */
 	unsigned int total_packets;	/* total packets processed this int */
 	u16 work_limit;			/* total work allowed per interrupt */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f1bfae0c41d0..8e2a957aca18 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -806,6 +806,7 @@ static void ixgbe_add_ring(struct ixgbe_ring *ring,
 	ring->next = head->ring;
 	head->ring = ring;
 	head->count++;
+	head->next_update = jiffies + 1;
 }
 
 /**
@@ -879,8 +880,11 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	/* initialize work limits */
 	q_vector->tx.work_limit = adapter->tx_work_limit;
 
-	/* initialize pointer to rings */
-	ring = q_vector->ring;
+	/* Initialize setting for adaptive ITR */
+	q_vector->tx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	q_vector->rx.itr = IXGBE_ITR_ADAPTIVE_MAX_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
 
 	/* intialize ITR */
 	if (txr_count && !rxr_count) {
@@ -897,6 +901,9 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 			q_vector->itr = adapter->rx_itr_setting;
 	}
 
+	/* initialize pointer to rings */
+	ring = q_vector->ring;
+
 	while (txr_count) {
 		/* assign generic ring traits */
 		ring->dev = &adapter->pdev->dev;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3d3739f103af..44a96878075b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2517,50 +2517,174 @@ enum latency_range {
 static void ixgbe_update_itr(struct ixgbe_q_vector *q_vector,
 			     struct ixgbe_ring_container *ring_container)
 {
-	int bytes = ring_container->total_bytes;
-	int packets = ring_container->total_packets;
-	u32 timepassed_us;
-	u64 bytes_perint;
-	u8 itr_setting = ring_container->itr;
+	unsigned int itr = IXGBE_ITR_ADAPTIVE_MIN_USECS |
+			   IXGBE_ITR_ADAPTIVE_LATENCY;
+	unsigned int avg_wire_size, packets, bytes;
+	unsigned long next_update = jiffies;
 
-	if (packets == 0)
+	/* If we don't have any rings just leave ourselves set for maximum
+	 * possible latency so we take ourselves out of the equation.
+	 */
+	if (!ring_container->ring)
 		return;
 
-	/* simple throttlerate management
-	 *   0-10MB/s   lowest (100000 ints/s)
-	 *  10-20MB/s   low    (20000 ints/s)
-	 *  20-1249MB/s bulk   (12000 ints/s)
+	/* If we didn't update within up to 1 - 2 jiffies we can assume
+	 * that either packets are coming in so slow there hasn't been
+	 * any work, or that there is so much work that NAPI is dealing
+	 * with interrupt moderation and we don't need to do anything.
 	 */
-	/* what was last interrupt timeslice? */
-	timepassed_us = q_vector->itr >> 2;
-	if (timepassed_us == 0)
-		return;
+	if (time_after(next_update, ring_container->next_update))
+		goto clear_counts;
 
-	bytes_perint = bytes / timepassed_us; /* bytes/usec */
+	packets = ring_container->total_packets;
 
-	switch (itr_setting) {
-	case lowest_latency:
-		if (bytes_perint > 10)
-			itr_setting = low_latency;
-		break;
-	case low_latency:
-		if (bytes_perint > 20)
-			itr_setting = bulk_latency;
-		else if (bytes_perint <= 10)
-			itr_setting = lowest_latency;
+	/* We have no packets to actually measure against. This means
+	 * either one of the other queues on this vector is active or
+	 * we are a Tx queue doing TSO with too high of an interrupt rate.
+	 *
+	 * When this occurs just tick up our delay by the minimum value
+	 * and hope that this extra delay will prevent us from being called
+	 * without any work on our queue.
+	 */
+	if (!packets) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		itr += ring_container->itr & IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto clear_counts;
+	}
+
+	bytes = ring_container->total_bytes;
+
+	/* If packets are less than 4 or bytes are less than 9000 assume
+	 * insufficient data to use bulk rate limiting approach. We are
+	 * likely latency driven.
+	 */
+	if (packets < 4 && bytes < 9000) {
+		itr = IXGBE_ITR_ADAPTIVE_LATENCY;
+		goto adjust_by_size;
+	}
+
+	/* Between 4 and 48 we can assume that our current interrupt delay
+	 * is only slightly too low. As such we should increase it by a small
+	 * fixed amount.
+	 */
+	if (packets < 48) {
+		itr = (q_vector->itr >> 2) + IXGBE_ITR_ADAPTIVE_MIN_INC;
+		if (itr > IXGBE_ITR_ADAPTIVE_MAX_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MAX_USECS;
+		goto clear_counts;
+	}
+
+	/* Between 48 and 96 is our "goldilocks" zone where we are working
+	 * out "just right". Just report that our current ITR is good for us.
+	 */
+	if (packets < 96) {
+		itr = q_vector->itr >> 2;
+		goto clear_counts;
+	}
+
+	/* If packet count is 96 or greater we are likely looking at a slight
+	 * overrun of the delay we want. Try halving our delay to see if that
+	 * will cut the number of packets in half per interrupt.
+	 */
+	if (packets < 256) {
+		itr = q_vector->itr >> 3;
+		if (itr < IXGBE_ITR_ADAPTIVE_MIN_USECS)
+			itr = IXGBE_ITR_ADAPTIVE_MIN_USECS;
+		goto clear_counts;
+	}
+
+	/* The paths below assume we are dealing with a bulk ITR since number
+	 * of packets is 256 or greater. We are just going to have to compute
+	 * a value and try to bring the count under control, though for smaller
+	 * packet sizes there isn't much we can do as NAPI polling will likely
+	 * be kicking in sooner rather than later.
+	 */
+	itr = IXGBE_ITR_ADAPTIVE_BULK;
+
+adjust_by_size:
+	/* If packet counts are 256 or greater we can assume we have a gross
+	 * overestimation of what the rate should be. Instead of trying to fine
+	 * tune it just use the formula below to try and dial in an exact value
+	 * give the current packet size of the frame.
+	 */
+	avg_wire_size = bytes / packets;
+
+	/* The following is a crude approximation of:
+	 *  wmem_default / (size + overhead) = desired_pkts_per_int
+	 *  rate / bits_per_byte / (size + ethernet overhead) = pkt_rate
+	 *  (desired_pkt_rate / pkt_rate) * usecs_per_sec = ITR value
+	 *
+	 * Assuming wmem_default is 212992 and overhead is 640 bytes per
+	 * packet, (256 skb, 64 headroom, 320 shared info), we can reduce the
+	 * formula down to
+	 *
+	 *  (170 * (size + 24)) / (size + 640) = ITR
+	 *
+	 * We first do some math on the packet size and then finally bitshift
+	 * by 8 after rounding up. We also have to account for PCIe link speed
+	 * difference as ITR scales based on this.
+	 */
+	if (avg_wire_size <= 60) {
+		/* Start at 50k ints/sec */
+		avg_wire_size = 5120;
+	} else if (avg_wire_size <= 316) {
+		/* 50K ints/sec to 16K ints/sec */
+		avg_wire_size *= 40;
+		avg_wire_size += 2720;
+	} else if (avg_wire_size <= 1084) {
+		/* 16K ints/sec to 9.2K ints/sec */
+		avg_wire_size *= 15;
+		avg_wire_size += 11452;
+	} else if (avg_wire_size <= 1980) {
+		/* 9.2K ints/sec to 8K ints/sec */
+		avg_wire_size *= 5;
+		avg_wire_size += 22420;
+	} else {
+		/* plateau at a limit of 8K ints/sec */
+		avg_wire_size = 32256;
+	}
+
+	/* If we are in low latency mode half our delay which doubles the rate
+	 * to somewhere between 100K to 16K ints/sec
+	 */
+	if (itr & IXGBE_ITR_ADAPTIVE_LATENCY)
+		avg_wire_size >>= 1;
+
+	/* Resultant value is 256 times larger than it needs to be. This
+	 * gives us room to adjust the value as needed to either increase
+	 * or decrease the value based on link speeds of 10G, 2.5G, 1G, etc.
+	 *
+	 * Use addition as we have already recorded the new latency flag
+	 * for the ITR value.
+	 */
+	switch (q_vector->adapter->link_speed) {
+	case IXGBE_LINK_SPEED_10GB_FULL:
+	case IXGBE_LINK_SPEED_100_FULL:
+	default:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 256) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
-	case bulk_latency:
-		if (bytes_perint <= 20)
-			itr_setting = low_latency;
+	case IXGBE_LINK_SPEED_2_5GB_FULL:
+	case IXGBE_LINK_SPEED_1GB_FULL:
+	case IXGBE_LINK_SPEED_10_FULL:
+		itr += DIV_ROUND_UP(avg_wire_size,
+				    IXGBE_ITR_ADAPTIVE_MIN_INC * 64) *
+		       IXGBE_ITR_ADAPTIVE_MIN_INC;
 		break;
 	}
 
-	/* clear work counters since we have the values we need */
+clear_counts:
+	/* write back value */
+	ring_container->itr = itr;
+
+	/* next update should occur within next jiffy */
+	ring_container->next_update = next_update + 1;
+
 	ring_container->total_bytes = 0;
 	ring_container->total_packets = 0;
-
-	/* write updated itr to ring container */
-	ring_container->itr = itr_setting;
 }
 
 /**
@@ -2602,34 +2726,19 @@ void ixgbe_write_eitr(struct ixgbe_q_vector *q_vector)
 
 static void ixgbe_set_itr(struct ixgbe_q_vector *q_vector)
 {
-	u32 new_itr = q_vector->itr;
-	u8 current_itr;
+	u32 new_itr;
 
 	ixgbe_update_itr(q_vector, &q_vector->tx);
 	ixgbe_update_itr(q_vector, &q_vector->rx);
 
-	current_itr = max(q_vector->rx.itr, q_vector->tx.itr);
+	/* use the smallest value of new ITR delay calculations */
+	new_itr = min(q_vector->rx.itr, q_vector->tx.itr);
 
-	switch (current_itr) {
-	/* counts and packets in update_itr are dependent on these numbers */
-	case lowest_latency:
-		new_itr = IXGBE_100K_ITR;
-		break;
-	case low_latency:
-		new_itr = IXGBE_20K_ITR;
-		break;
-	case bulk_latency:
-		new_itr = IXGBE_12K_ITR;
-		break;
-	default:
-		break;
-	}
+	/* Clear latency flag if set, shift into correct position */
+	new_itr &= ~IXGBE_ITR_ADAPTIVE_LATENCY;
+	new_itr <<= 2;
 
 	if (new_itr != q_vector->itr) {
-		/* do an exponential smoothing */
-		new_itr = (10 * new_itr * q_vector->itr) /
-			  ((9 * new_itr) + q_vector->itr);
-
 		/* save the algorithm value here */
 		q_vector->itr = new_itr;
 

^ permalink raw reply related

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Ahern @ 2017-09-25 21:52 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, stephen
In-Reply-To: <20170925.131101.478664536826772174.davem@davemloft.net>

On 9/25/17 2:11 PM, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Mon, 25 Sep 2017 10:14:23 -0600
> 
>> I made a simple program this morning and ran it under perf.
> 
> If possible please submit this for selftests.
> 

It is more of a microbenchmark of options to flush an rbtree than a
self-test. Further, it relies on the tools/lib/rbtree.c versus
lib/rbtree.c. The tools/lib version was imported by Arnaldo in July 2015
and is a out of date, though it is good enough to show the intent w.r.t.
flushing options.

^ permalink raw reply

* Re: [PATCH net 0/2] l2tp: fix some races in session deletion
From: David Miller @ 2017-09-25 21:45 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, tparkin, sd
In-Reply-To: <cover.1506086081.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 22 Sep 2017 15:39:22 +0200

> L2TP provides several interfaces for deleting sessions. Using two of
> them concurrently can lead to use-after-free bugs.
> 
> Patch #2 uses a flag to prevent double removal of L2TP sessions.
> Patch #1 fixes a bug found in the way. Fixing this bug is also
> necessary for patch #2 to handle all cases.
> 
> This issue is similar to the tunnel deletion bug being worked on by
> Sabrina: https://patchwork.ozlabs.org/patch/814173/

Series applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] bpf/verifier: improve disassembly of BPF_END instructions
From: Daniel Borkmann @ 2017-09-25 21:44 UTC (permalink / raw)
  To: Alexei Starovoitov, Y Song
  Cc: Edward Cree, David Miller, netdev, Jiong Wang, Jakub Kicinski
In-Reply-To: <20170924055016.w6x5tj6kjxjbocpl@ast-mbp>

On 09/24/2017 07:50 AM, Alexei Starovoitov wrote:
> On Fri, Sep 22, 2017 at 09:49:10PM -0700, Y Song wrote:
>> On Fri, Sep 22, 2017 at 9:23 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> On 22/09/17 16:16, Alexei Starovoitov wrote:
>>>> looks like we're converging on
>>>> "be16/be32/be64/le16/le32/le64 #register" for BPF_END.
>>>> I guess it can live with that. I would prefer more C like syntax
>>>> to match the rest, but llvm parsing point is a strong one.
>>> Yep, agreed.  I'll post a v2 once we've settled BPF_NEG.
>>>> For BPG_NEG I prefer to do it in C syntax like interpreter does:
>>>>          ALU_NEG:
>>>>                  DST = (u32) -DST;
>>>>          ALU64_NEG:
>>>>                  DST = -DST;
>>>> Yonghong, does it mean that asmparser will equally suffer?
>>> Correction to my earlier statements: verifier will currently disassemble
>>>   neg as:
>>> (87) r0 neg 0
>>> (84) (u32) r0 neg (u32) 0
>>>   because it pretends 'neg' is a compound-assignment operator like +=.
>>> The analogy with be16 and friends would be to use
>>>      neg64 r0
>>>      neg32 r0
>>>   whereas the analogy with everything else would be
>>>      r0 = -r0
>>>      r0 = (u32) -r0
>>>   as Alexei says.
>>> I'm happy to go with Alexei's version if it doesn't cause problems for llvm.
>>
>> I got some time to do some prototyping in llvm and it looks like that
>> I am able to
>> resolve the issue and we are able to use more C-like syntax. That is:
>> for bswap:
>>       r1 = (be16) (u16) r1
>>       or
>>       r1 = (be16) r1
>>       or
>>       r1 = be16 r1
>> for neg:
>>       r0 = -r0
>>       (for 32bit support, llvm may output "w0 = -w0" in the future. But
>> since it is not
>>        enabled yet, you can continue to output "r0 = (u32) -r0".)
>>
>> Not sure which syntax is best for bswap. The "r1 = (be16) (u16) r1" is most
>> explicit in its intention.
>>
>> Attaching my llvm patch as well and cc'ing Jiong and Jakub so they can see my
>> implementation and the relative discussion here. (In this patch, I did
>> not implement
>> bswap for little endian yet.) Maybe they can provide additional comments.
>
> This is awesome. In such case I'd like to swing back to the C syntax for bpf_end :)
> Any of these
>    r1 = (be16) (u16) r1
>    or
>    r1 = (be16) r1
>    or
>    r1 = be16 r1
> are better than just
>    be16 r1
> I like 1st the most, since it's explicit in terms of what happens with upper bits,
> but 2nd is also ok. 3rd is not quite C-like.

But above cast to be16 also doesn't seem quite C-like in terms
of what we're actually doing... 3rd option would be my personal
preference even if it doesn't look C-like, but otoh we also have
'call' etc which is neither.

^ permalink raw reply

* Re: [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Vivien Didelot @ 2017-09-25 21:34 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: netdev, Andrew Lunn
In-Reply-To: <1506375140-2853-1-git-send-email-andrew@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> Ports with the same VLAN must all be in the same bridge. However the
> CPU and DSA ports need to be in multiple VLANs spread over multiple
> bridges. So exclude them when performing this test.
>
> Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

^ permalink raw reply

* Re: [PATCH net-next v2] bpf: Optimize lpm trie delete
From: David Miller @ 2017-09-25 21:38 UTC (permalink / raw)
  To: kraigatgoog; +Cc: daniel, ast, daniel, netdev
In-Reply-To: <20170921224329.101928-1-kraigatgoog@gmail.com>

From: Craig Gallek <kraigatgoog@gmail.com>
Date: Thu, 21 Sep 2017 18:43:29 -0400

> From: Craig Gallek <kraig@google.com>
> 
> Before the delete operator was added, this datastructure maintained
> an invariant that intermediate nodes were only present when necessary
> to build the tree.  This patch updates the delete operation to reinstate
> that invariant by removing unnecessary intermediate nodes after a node is
> removed and thus keeping the tree structure at a minimal size.
> 
> Suggested-by: Daniel Mack <daniel@zonque.org>
> Signed-off-by: Craig Gallek <kraig@google.com>

Applied, thank you.

^ permalink raw reply

* [patch net] net: dsa: mv88e6xxx: Allow dsa and cpu ports in multiple vlans
From: Andrew Lunn @ 2017-09-25 21:32 UTC (permalink / raw)
  To: David Miller; +Cc: Vivien Didelot, netdev, Andrew Lunn

Ports with the same VLAN must all be in the same bridge. However the
CPU and DSA ports need to be in multiple VLANs spread over multiple
bridges. So exclude them when performing this test.

Fixes: b2f81d304cee ("net: dsa: add CPU and DSA ports as VLAN members")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c6678aa9b4ef..674dab71d71c 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1100,6 +1100,10 @@ static int mv88e6xxx_port_check_hw_vlan(struct dsa_switch *ds, int port,
 	};
 	int i, err;
 
+	/* DSA and CPU ports have to be members of multiple vlans */
+	if (dsa_is_dsa_port(ds, port) || dsa_is_cpu_port(ds, port))
+		return 0;
+
 	if (!vid_begin)
 		return -EOPNOTSUPP;
 
-- 
2.14.1

^ permalink raw reply related

* WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
From: Richard Weinberger @ 2017-09-25 21:23 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linux-kernel

Hi!

While playing with bcc's opensnoop tool on Linux 4.14-rc2 I managed to trigger 
this splat:

[  297.629773] WARNING: kernel stack frame pointer at ffff880156a5fea0 in 
bash:2103 has bad value 00007ffec7d87e50
[  297.629777] unwind stack type:0 next_sp:          (null) mask:0x6 
graph_idx:0
[  297.629783] ffff88015b207ae0: ffff88015b207b68 (0xffff88015b207b68)
[  297.629790] ffff88015b207ae8: ffffffffb163c00e (__save_stack_trace+0x6e/
0xd0)
[  297.629792] ffff88015b207af0: 0000000000000000 ...
[  297.629795] ffff88015b207af8: ffff880156a58000 (0xffff880156a58000)
[  297.629799] ffff88015b207b00: ffff880156a60000 (0xffff880156a60000)
[  297.629800] ffff88015b207b08: 0000000000000000 ...
[  297.629803] ffff88015b207b10: 0000000000000006 (0x6)
[  297.629806] ffff88015b207b18: ffff880151b02700 (0xffff880151b02700)
[  297.629809] ffff88015b207b20: 0000010100000000 (0x10100000000)
[  297.629812] ffff88015b207b28: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.629815] ffff88015b207b30: ffff88015b207ae0 (0xffff88015b207ae0)
[  297.629818] ffff88015b207b38: ffffffffc0050282 (0xffffffffc0050282)
[  297.629819] ffff88015b207b40: 0000000000000000 ...
[  297.629822] ffff88015b207b48: 0000000001000000 (0x1000000)
[  297.629825] ffff88015b207b50: ffff880157b98280 (0xffff880157b98280)
[  297.629828] ffff88015b207b58: ffff880157b98380 (0xffff880157b98380)
[  297.629831] ffff88015b207b60: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.629834] ffff88015b207b68: ffff88015b207b78 (0xffff88015b207b78)
[  297.629838] ffff88015b207b70: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629841] ffff88015b207b78: ffff88015b207da8 (0xffff88015b207da8)
[  297.629847] ffff88015b207b80: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629850] ffff88015b207b88: 000000400000000c (0x400000000c)
[  297.629852] ffff88015b207b90: ffff88015b207ba0 (0xffff88015b207ba0)
[  297.629855] ffff88015b207b98: ffff880100000000 (0xffff880100000000)
[  297.629859] ffff88015b207ba0: ffffffffb163c086 (save_stack_trace+0x16/0x20)
[  297.629864] ffff88015b207ba8: ffffffffb18a8ed6 (save_stack+0x46/0xd0)
[  297.629868] ffff88015b207bb0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.629873] ffff88015b207bb8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.629879] ffff88015b207bc0: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.629886] ffff88015b207bc8: ffffffffb172580c (rcu_process_callbacks
+0x2dc/0xcd0)
[  297.629892] ffff88015b207bd0: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.629897] ffff88015b207bd8: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.629902] ffff88015b207be0: ffffffffb2646446 (smp_apic_timer_interrupt
+0x86/0x1a0)
[  297.629907] ffff88015b207be8: ffffffffb26452f3 (apic_timer_interrupt
+0x93/0xa0)
[  297.629913] ffff88015b207bf0: ffffffffb1667417 (optimized_callback
+0x67/0x100)
[  297.629916] ffff88015b207bf8: ffffffffc0050282 (0xffffffffc0050282)
[  297.629918] ffff88015b207c00: 0000000000000000 ...
[  297.629921] ffff88015b207c08: ffff88015a77e24c (0xffff88015a77e24c)
[  297.629924] ffff88015b207c10: ffff88015b207c38 (0xffff88015b207c38)
[  297.629927] ffff88015b207c18: ffff88015b207c38 (0xffff88015b207c38)
[  297.629929] ffff88015b207c20: 0000000000000086 (0x86)
[  297.629932] ffff88015b207c28: ffff88015a77db00 (0xffff88015a77db00)
[  297.629935] ffff88015b207c30: 1ffff1002b640f91 (0x1ffff1002b640f91)
[  297.629938] ffff88015b207c38: ffff88015b207d10 (0xffff88015b207d10)
[  297.629945] ffff88015b207c40: ffffffffb16c9f60 (try_to_wake_up+0xb0/0x710)
[  297.629947] ffff88015b207c48: 0000000000000000 ...
[  297.629952] ffff88015b207c50: ffffffffb2dfd3c0 (machine_ops+0x40/0x40)
[  297.629954] ffff88015b207c58: ffff88015a77df94 (0xffff88015a77df94)
[  297.629957] ffff88015b207c60: 0000000000023540 (0x23540)
[  297.629960] ffff88015b207c68: ffff88015b215c38 (0xffff88015b215c38)
[  297.629963] ffff88015b207c70: ffff88015b200000 (0xffff88015b200000)
[  297.629965] ffff88015b207c78: 0000000000000086 (0x86)
[  297.629968] ffff88015b207c80: 0000000100000000 (0x100000000)
[  297.629971] ffff88015b207c88: 0000000041b58ab3 (0x41b58ab3)
[  297.629975] ffff88015b207c90: ffffffffb2d919f2 (.LC2+0x6e0e/0x83b5)
[  297.629981] ffff88015b207c98: ffffffffb16c9eb0 (migrate_swap_stop
+0x2e0/0x2e0)
[  297.629986] ffff88015b207ca0: ffffffffb16d0f73 (account_entity_dequeue
+0x73/0x110)
[  297.629989] ffff88015b207ca8: 0000000000100000 (0x100000)
[  297.629992] ffff88015b207cb0: ffff88015b2235a0 (0xffff88015b2235a0)
[  297.629994] ffff88015b207cb8: ffff88015061e280 (0xffff88015061e280)
[  297.629997] ffff88015b207cc0: ffff88015b207ce8 (0xffff88015b207ce8)
[  297.630003] ffff88015b207cc8: ffffffffb16c87ed (sched_avg_update+0x2d/0x90)
[  297.630005] ffff88015b207cd0: 0000000000000005 (0x5)
[  297.630008] ffff88015b207cd8: ffff88015b223570 (0xffff88015b223570)
[  297.630010] ffff88015b207ce0: 00000000000000dd (0xdd)
[  297.630013] ffff88015b207ce8: ffff88015a017ea0 (0xffff88015a017ea0)
[  297.630021] ffff88015b207cf0: ffffffffb30b7128 (rcu_sched_state
+0x928/0xaa0)
[  297.630024] ffff88015b207cf8: ffff880151b02700 (0xffff880151b02700)
[  297.630026] ffff88015b207d00: 0000000000000001 (0x1)
[  297.630031] ffff88015b207d08: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630034] ffff88015b207d10: ffff88015b207d20 (0xffff88015b207d20)
[  297.630040] ffff88015b207d18: ffffffffb16ca5d0 (wake_up_process+0x10/0x20)
[  297.630043] ffff88015b207d20: ffff88015b207d48 (0xffff88015b207d48)
[  297.630045] ffff88015b207d28: ffff88015b207d48 (0xffff88015b207d48)
[  297.630048] ffff88015b207d30: 0000000000000202 (0x202)
[  297.630053] ffff88015b207d38: ffffffffb30b7120 (rcu_sched_state
+0x920/0xaa0)
[  297.630056] ffff88015b207d40: 0000000000000202 (0x202)
[  297.630059] ffff88015b207d48: ffff88015b207d68 (0xffff88015b207d68)
[  297.630063] ffff88015b207d50: ffffffffb16ee225 (swake_up+0x25/0x30)
[  297.630069] ffff88015b207d58: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630072] ffff88015b207d60: ffff88015a77db00 (0xffff88015a77db00)
[  297.630074] ffff88015b207d68: ffff88015b207d90 (0xffff88015b207d90)
[  297.630079] ffff88015b207d70: ffffffffb1720016 (rcu_gp_kthread_wake
+0x56/0x60)
[  297.630082] ffff88015b207d78: 0000000000000002 (0x2)
[  297.630087] ffff88015b207d80: ffffffffb30b7138 (rcu_sched_state
+0x938/0xaa0)
[  297.630092] ffff88015b207d88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630095] ffff88015b207d90: ffff88015b207e18 (0xffff88015b207e18)
[  297.630099] ffff88015b207d98: ffffffffb1720521 (rcu_report_qs_rnp
+0x2f1/0x310)
[  297.630102] ffff88015b207da0: ffff88015ad2b500 (0xffff88015ad2b500)
[  297.630105] ffff88015b207da8: ffff88015b207dd0 (0xffff88015b207dd0)
[  297.630110] ffff88015b207db0: ffffffffb18a9752 (kasan_slab_free+0x72/0xc0)
[  297.630113] ffff88015b207db8: ffff880157b98280 (0xffff880157b98280)
[  297.630116] ffff88015b207dc0: ffffea00055ee600 (0xffffea00055ee600)
[  297.630121] ffff88015b207dc8: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630124] ffff88015b207dd0: ffff88015b207e00 (0xffff88015b207e00)
[  297.630128] ffff88015b207dd8: ffffffffb18a5e90 (kmem_cache_free+0x70/0x190)
[  297.630131] ffff88015b207de0: ffff880157b98280 (0xffff880157b98280)
[  297.630135] ffff88015b207de8: ffffffffb18b7e60 (get_max_files+0x10/0x10)
[  297.630141] ffff88015b207df0: ffffffffb30b72a0 (rcu_sched_state
+0xaa0/0xaa0)
[  297.630143] ffff88015b207df8: 000000000000000f (0xf)
[  297.630146] ffff88015b207e00: ffff88015b207e18 (0xffff88015b207e18)
[  297.630150] ffff88015b207e08: ffffffffb18b7e94 (file_free_rcu+0x34/0x40)
[  297.630153] ffff88015b207e10: ffff880157b98280 (0xffff880157b98280)
[  297.630156] ffff88015b207e18: ffff88015b207f30 (0xffff88015b207f30)
[  297.630161] ffff88015b207e20: ffffffffb172580c (rcu_process_callbacks
+0x2dc/0xcd0)
[  297.630164] ffff88015b207e28: ffff88015b21b000 (0xffff88015b21b000)
[  297.630167] ffff88015b207e30: ffff88015b21b070 (0xffff88015b21b070)
[  297.630170] ffff88015b207e38: 1ffff1002b640fd5 (0x1ffff1002b640fd5)
[  297.630173] ffff88015b207e40: ffff880151b02700 (0xffff880151b02700)
[  297.630176] ffff88015b207e48: ffff88015b224200 (0xffff88015b224200)
[  297.630178] ffff88015b207e50: ffff88015b224280 (0xffff88015b224280)
[  297.630181] ffff88015b207e58: ffff88015b2242b0 (0xffff88015b2242b0)
[  297.630184] ffff88015b207e60: ffff88015b207f08 (0xffff88015b207f08)
[  297.630187] ffff88015b207e68: ffff880151b0274c (0xffff880151b0274c)
[  297.630190] ffff88015b207e70: ffff880151b02700 (0xffff880151b02700)
[  297.630195] ffff88015b207e78: ffffffffb30b7258 (rcu_sched_state
+0xa58/0xaa0)
[  297.630198] ffff88015b207e80: ffff880157b98288 (0xffff880157b98288)
[  297.630203] ffff88015b207e88: ffffffffb30b6800 (rcu_bh_varname+0x60/0x60)
[  297.630206] ffff88015b207e90: ffff88015b224238 (0xffff88015b224238)
[  297.630209] ffff88015b207e98: ffff88015b207ec8 (0xffff88015b207ec8)
[  297.630211] ffff88015b207ea0: 000000000000000a (0xa)
[  297.630214] ffff88015b207ea8: 0000000041b58ab3 (0x41b58ab3)
[  297.630218] ffff88015b207eb0: ffffffffb2d944f5 (.LC0+0x155c/0xa3a6)
[  297.630223] ffff88015b207eb8: ffffffffb1725530 (note_gp_changes+0xe0/0xe0)
[  297.630226] ffff88015b207ec0: ffff88015b215740 (0xffff88015b215740)
[  297.630229] ffff88015b207ec8: ffff880157b983c0 (0xffff880157b983c0)
[  297.630231] ffff88015b207ed0: ffff88014ac19eb0 (0xffff88014ac19eb0)
[  297.630234] ffff88015b207ed8: ffffffffffffffff (0xffffffffffffffff)
[  297.630236] ffff88015b207ee0: 0000000000000000 ...
[  297.630239] ffff88015b207ee8: 0000004552dda1c0 (0x4552dda1c0)
[  297.630240] ffff88015b207ef0: 0000000000000000 ...
[  297.630243] ffff88015b207ef8: ffff88015b207f20 (0xffff88015b207f20)
[  297.630249] ffff88015b207f00: ffffffffb174a0a8 (tick_program_event
+0x48/0x80)
[  297.630252] ffff88015b207f08: 0000000000000009 (0x9)
[  297.630259] ffff88015b207f10: ffffffffb3009148 (softirq_vec+0x48/0x80)
[  297.630261] ffff88015b207f18: 0000000000000009 (0x9)
[  297.630263] ffff88015b207f20: 0000000000000008 (0x8)
[  297.630265] ffff88015b207f28: 0000000000000009 (0x9)
[  297.630268] ffff88015b207f30: ffff88015b207fa8 (0xffff88015b207fa8)
[  297.630273] ffff88015b207f38: ffffffffb2646cbc (__do_softirq+0x12c/0x343)
[  297.630276] ffff88015b207f40: 0000000a00404100 (0xa00404100)
[  297.630279] ffff88015b207f48: ffff880151b02700 (0xffff880151b02700)
[  297.630282] ffff88015b207f50: 00000000fffff730 (0xfffff730)
[  297.630284] ffff88015b207f58: 0000000000000009 (0x9)
[  297.630286] ffff88015b207f60: 0000000000000040 (0x40)
[  297.630289] ffff88015b207f68: 000001005b21c294 (0x1005b21c294)
[  297.630294] ffff88015b207f70: ffffffffb3009110 (softirq_vec+0x10/0x80)
[  297.630297] ffff88015b207f78: 0000008000000008 (0x8000000008)
[  297.630300] ffff88015b207f80: ffff88015a77ce00 (0xffff88015a77ce00)
[  297.630303] ffff88015b207f88: ffff88015b215840 (0xffff88015b215840)
[  297.630304] ffff88015b207f90: 0000000000000000 ...
[  297.630307] ffff88015b207f98: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630311] ffff88015b207fa0: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630314] ffff88015b207fa8: ffff88015b207fc0 (0xffff88015b207fc0)
[  297.630318] ffff88015b207fb0: ffffffffb1692304 (irq_exit+0xe4/0xf0)
[  297.630321] ffff88015b207fb8: ffff88015b215740 (0xffff88015b215740)
[  297.630324] ffff88015b207fc0: ffff88015b207fe8 (0xffff88015b207fe8)
[  297.630329] ffff88015b207fc8: ffffffffb2646446 (smp_apic_timer_interrupt
+0x86/0x1a0)
[  297.630332] ffff88015b207fd0: ffff88015104d500 (0xffff88015104d500)
[  297.630335] ffff88015b207fd8: ffff88015b215840 (0xffff88015b215840)
[  297.630338] ffff88015b207fe0: 0000000000000246 (0x246)
[  297.630341] ffff88015b207fe8: ffff880156a5fdc9 (0xffff880156a5fdc9)
[  297.630345] ffff88015b207ff0: ffffffffb26452f3 (apic_timer_interrupt
+0x93/0xa0)
[  297.630348] ffff88015b207ff8: ffff880156a5fdc8 (0xffff880156a5fdc8)
[  297.630352] ffff880156a5fdc8: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630355] ffff880156a5fdd0: ffff880156a5feb0 (0xffff880156a5feb0)
[  297.630357] ffff880156a5fdd8: 0000000000000246 (0x246)
[  297.630360] ffff880156a5fde0: ffff88015b215840 (0xffff88015b215840)
[  297.630363] ffff880156a5fde8: ffff880156a5fea0 (0xffff880156a5fea0)
[  297.630366] ffff880156a5fdf0: ffff88015104d500 (0xffff88015104d500)
[  297.630369] ffff880156a5fdf8: fffff52000140c08 (0xfffff52000140c08)
[  297.630372] ffff880156a5fe00: ffffc90000a0603f (0xffffc90000a0603f)
[  297.630375] ffff880156a5fe08: fffff52000140c07 (0xfffff52000140c07)
[  297.630378] ffff880156a5fe10: fffff52000140c08 (0xfffff52000140c08)
[  297.630379] ffff880156a5fe18: 0000000000000000 ...
[  297.630385] ffff880156a5fe20: ffffffffb178d9eb (opt_pre_handler+0x6b/0x80)
[  297.630388] ffff880156a5fe28: dffffc0000000000 (0xdffffc0000000000)
[  297.630391] ffff880156a5fe30: dffffc0000000000 (0xdffffc0000000000)
[  297.630393] ffff880156a5fe38: 0000000000000246 (0x246)
[  297.630396] ffff880156a5fe40: ffffffffffffff10 (0xffffffffffffff10)
[  297.630401] ffff880156a5fe48: ffffffffb1667417 (optimized_callback
+0x67/0x100)
[  297.630404] ffff880156a5fe50: 0000000000000010 (0x10)
[  297.630406] ffff880156a5fe58: 0000000000000246 (0x246)
[  297.630409] ffff880156a5fe60: ffff880156a5fe78 (0xffff880156a5fe78)
[  297.630412] ffff880156a5fe68: 0000000000000018 (0x18)
[  297.630414] ffff880156a5fe70: 0000000000000246 (0x246)
[  297.630417] ffff880156a5fe78: 00000000026aed08 (0x26aed08)
[  297.630419] ffff880156a5fe80: 0000000000000005 (0x5)
[  297.630421] ffff880156a5fe88: 0000000000000003 (0x3)
[  297.630423] ffff880156a5fe90: 0000000000000000 ...
[  297.630425] ffff880156a5fe98: 00000000025d1568 (0x25d1568)
[  297.630428] ffff880156a5fea0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630431] ffff880156a5fea8: ffffffffc0050282 (0xffffffffc0050282)
[  297.630433] ffff880156a5feb0: 00000000025d1568 (0x25d1568)
[  297.630435] ffff880156a5feb8: 0000000000000000 ...
[  297.630437] ffff880156a5fec0: 0000000000000003 (0x3)
[  297.630440] ffff880156a5fec8: 0000000000000005 (0x5)
[  297.630442] ffff880156a5fed0: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630445] ffff880156a5fed8: 00000000026aed08 (0x26aed08)
[  297.630448] ffff880156a5fee0: ffff880151b02700 (0xffff880151b02700)
[  297.630450] ffff880156a5fee8: 0000000002675e00 (0x2675e00)
[  297.630453] ffff880156a5fef0: 0000000000000001 (0x1)
[  297.630455] ffff880156a5fef8: 0000000000000002 (0x2)
[  297.630457] ffff880156a5ff00: 0000000000000002 (0x2)
[  297.630460] ffff880156a5ff08: 0000000002675e00 (0x2675e00)
[  297.630462] ffff880156a5ff10: 0000000000000180 (0x180)
[  297.630464] ffff880156a5ff18: 0000000000000000 ...
[  297.630466] ffff880156a5ff20: 000000000272a008 (0x272a008)
[  297.630469] ffff880156a5ff28: ffffffffffffffff (0xffffffffffffffff)
[  297.630473] ffff880156a5ff30: ffffffffb18b23b1 (SyS_open+0x1/0x20)
[  297.630475] ffff880156a5ff38: 0000000000000010 (0x10)
[  297.630478] ffff880156a5ff40: 0000000000000293 (0x293)
[  297.630481] ffff880156a5ff48: ffff880156a5ff50 (0xffff880156a5ff50)
[  297.630485] ffff880156a5ff50: ffffffffb1665770 (copy_oldmem_page+0x90/0x90)
[  297.630488] ffff880156a5ff58: 00000000025d1b28 (0x25d1b28)
[  297.630489] ffff880156a5ff60: 0000000000000000 ...
[  297.630492] ffff880156a5ff68: 0000000000000003 (0x3)
[  297.630494] ffff880156a5ff70: 0000000000000005 (0x5)
[  297.630497] ffff880156a5ff78: 00007ffec7d87e50 (0x7ffec7d87e50)
[  297.630499] ffff880156a5ff80: 00000000026aed08 (0x26aed08)
[  297.630502] ffff880156a5ff88: 0000000000000246 (0x246)
[  297.630504] ffff880156a5ff90: 0000000002675e00 (0x2675e00)
[  297.630506] ffff880156a5ff98: 0000000000000001 (0x1)
[  297.630509] ffff880156a5ffa0: 0000000000000002 (0x2)
[  297.630511] ffff880156a5ffa8: ffffffffffffffda (0xffffffffffffffda)
[  297.630514] ffff880156a5ffb0: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630517] ffff880156a5ffb8: 0000000000000180 (0x180)
[  297.630518] ffff880156a5ffc0: 0000000000000000 ...
[  297.630521] ffff880156a5ffc8: 000000000272a008 (0x272a008)
[  297.630523] ffff880156a5ffd0: 0000000000000002 (0x2)
[  297.630526] ffff880156a5ffd8: 00007f3d3f7be4e0 (0x7f3d3f7be4e0)
[  297.630528] ffff880156a5ffe0: 0000000000000033 (0x33)
[  297.630530] ffff880156a5ffe8: 0000000000000246 (0x246)
[  297.630533] ffff880156a5fff0: 00007ffec7d87db8 (0x7ffec7d87db8)
[  297.630535] ffff880156a5fff8: 000000000000002b (0x2b)

opensnoop(pythong) itself blocks too:

root@test:~# cat /proc/2075/stack
[<ffffffffb79a0a07>] ring_buffer_wait+0x167/0x2e0
[<ffffffffb79a34e7>] wait_on_pipe+0x77/0x80
[<ffffffffb79aa7a1>] tracing_wait_pipe.isra.69+0x51/0xf0
[<ffffffffb79abdf9>] tracing_read_pipe+0x1c9/0x500
[<ffffffffb7ab5e62>] __vfs_read+0xd2/0x370
[<ffffffffb7ab61b7>] vfs_read+0xb7/0x1a0
[<ffffffffb7ab6bd0>] SyS_read+0xa0/0x120
[<ffffffffb8843c37>] entry_SYSCALL_64_fastpath+0x1a/0xa5
[<ffffffffffffffff>] 0xffffffffffffffff

Thanks,
//richard

-- 
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate bpf filter handles
From: Daniel Borkmann @ 2017-09-25 21:16 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-1-xiyou.wangcong@gmail.com>

On 09/25/2017 07:13 PM, Cong Wang wrote:
> Instead of calling cls_bpf_get() in a loop to find
> a unused handle, just switch to idr API to allocate
> new handles.
>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Chris Mi <chrism@mellanox.com>
> Cc: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
[...]
> @@ -476,21 +462,30 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		}
>   	}
>
> -	if (handle == 0)
> -		prog->handle = cls_bpf_grab_new_handle(tp, head);
> -	else
> +	if (handle == 0) {
> +		ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +				    1, 0x7FFFFFFF, GFP_KERNEL);
> +		if (ret)
> +			goto errout;
> +		prog->handle = idr_index;
> +	} else {
> +		if (!oldprog) {
> +			ret = idr_alloc_ext(&head->handle_idr, prog, &idr_index,
> +					    handle, handle + 1, GFP_KERNEL);
> +			if (ret)
> +				goto errout;
> +		}
>   		prog->handle = handle;
> -	if (prog->handle == 0) {
> -		ret = -EINVAL;
> -		goto errout;
>   	}
>
>   	ret = cls_bpf_set_parms(net, tp, prog, base, tb, tca[TCA_RATE], ovr);
>   	if (ret < 0)
> -		goto errout;
> +		goto errout_idr;
>
>   	ret = cls_bpf_offload(tp, prog, oldprog);
>   	if (ret) {
> +		if (!oldprog)
> +			idr_remove_ext(&head->handle_idr, prog->handle);

Shouldn't we also call idr_remove_ext() when there was an
oldprog, but we didn't care about reusing the same handle,
so it was handle == 0 initially?

There's this condition in the code before above idr allocations,
I think also in other classifiers:

         if (oldprog) {
                 if (handle && oldprog->handle != handle) {
                         ret = -EINVAL;
                         goto errout;
                 }
         }

>   		__cls_bpf_delete_prog(prog);
>   		return ret;
>   	}
> @@ -499,6 +494,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   		prog->gen_flags |= TCA_CLS_FLAGS_NOT_IN_HW;
>
>   	if (oldprog) {
> +		idr_replace_ext(&head->handle_idr, prog, handle);

And here, we should probably use prog->handle for the above
mentioned case as well, no?

Would be great if all this (and e.g. the fact that we use idr itself)
could optionally be hidden behind some handle generator api given
we could reuse that api also for cls_basic and cls_u32. Could also
be followed-up perhaps.

>   		list_replace_rcu(&oldprog->link, &prog->link);
>   		tcf_unbind_filter(tp, &oldprog->res);
>   		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
> @@ -509,6 +505,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
>   	*arg = prog;
>   	return 0;
>
> +errout_idr:
> +	if (!oldprog)
> +		idr_remove_ext(&head->handle_idr, prog->handle);

(Likewise as the failing cls_bpf_offload().)

>   errout:
>   	tcf_exts_destroy(&prog->exts);
>   	kfree(prog);
>

^ permalink raw reply

* [PATCH net-next] net: ipv6: send NS for DAD when link operationally up
From: Mike Manning @ 2017-09-25 21:01 UTC (permalink / raw)
  To: netdev

The NS for DAD are sent on admin up as long as a valid qdisc is found.
A race condition exists by which these packets will not egress the
interface if the operational state of the lower device is not yet up.
The solution is to delay DAD until the link is operationally up
according to RFC2863. Rather than only doing this, follow the existing
code checks by deferring IPv6 device initialization altogether. The fix
allows DAD on devices like tunnels that are controlled by userspace
control plane. The fix has no impact on regular deployments, but means
that there is no IPv6 connectivity until the port has been opened in
the case of port-based network access control, which should be
desirable.

Signed-off-by: Mike Manning <mmanning@brocade.com>
---
 net/ipv6/addrconf.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index c2e2a78..dffbf3b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -303,10 +303,10 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
 	.disable_policy		= 0,
 };
 
-/* Check if a valid qdisc is available */
-static inline bool addrconf_qdisc_ok(const struct net_device *dev)
+/* Check if link is ready: is it up and is a valid qdisc available */
+static inline bool addrconf_link_ready(const struct net_device *dev)
 {
-	return !qdisc_tx_is_noop(dev);
+	return netif_oper_up(dev) && !qdisc_tx_is_noop(dev);
 }
 
 static void addrconf_del_rs_timer(struct inet6_dev *idev)
@@ -451,7 +451,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
 
 	ndev->token = in6addr_any;
 
-	if (netif_running(dev) && addrconf_qdisc_ok(dev))
+	if (netif_running(dev) && addrconf_link_ready(dev))
 		ndev->if_flags |= IF_READY;
 
 	ipv6_mc_init_dev(ndev);
@@ -3393,7 +3393,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 			/* restore routes for permanent addresses */
 			addrconf_permanent_addr(dev);
 
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is not ready yet. */
 				pr_info("ADDRCONF(NETDEV_UP): %s: link is not ready\n",
 					dev->name);
@@ -3408,7 +3408,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 				run_pending = 1;
 			}
 		} else if (event == NETDEV_CHANGE) {
-			if (!addrconf_qdisc_ok(dev)) {
+			if (!addrconf_link_ready(dev)) {
 				/* device is still not ready. */
 				break;
 			}
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH v2 0/2] enable hires timer to timeout datagram socket
From: Thomas Gleixner @ 2017-09-25 20:42 UTC (permalink / raw)
  To: Vallish Vaidyeshwara
  Cc: Eric Dumazet, Eduardo Valentin, David Miller, dwmw2, shuah,
	richardcochran, xiyou.wangcong, netdev, linux-kernel, anchalag,
	dwmw
In-Reply-To: <20170920224816.GA73561@amazon.com>

On Wed, 20 Sep 2017, Vallish Vaidyeshwara wrote:
> On Sat, Sep 16, 2017 at 11:47:56AM +0200, Thomas Gleixner wrote:
> > > So if we need to replace all 'legacy' timers to high resolution timer,
> > > because some application was _relying_ on jiffies being kind of precise,
> > > maybe it is better to revert the change done on legacy timers.
> > 
> > Which would be a major step back in terms of timer performance and system
> > disturbance caused by massive recascading operations.
> > 
> > > Or continue the migration and make them use high res internally.
> > > 
> > > select() and poll() are the standard way to have precise timeouts,
> > > it is silly we have to maintain a timeout handling in the datagram fast
> > > path.
> > 
> > A few years ago we switched select/poll over to use hrtimers because the
> > wheel timers were too inaccurate for some operations, so it feels
> > consequent to switch the timeout in the datagram rcv path over as well. I
> > agree that the whole timeout magic there feels silly, but unfortunately
> > it's a documented property of sockets.
> > 
> 
> Thanks for your comments. This patch has been NACK'ed by David Miller. Is
> there any other approach to solve this problem with out application code
> being recompiled?

We have only three options here:

   1) Do a massive revert of the timer wheel changes and lose all the
      benefits of that rework.

   2) Make that timer list -> hrtimer change in the datagram code

   3) Ignore it

#1 Would be pretty ironic as networking would take the biggest penalty of
   the revert.

#2 Is IMO the proper solution as it cures a user space visible regression,
   though the patch itself could be made way simpler

#3 Shrug

Dave, Eric?

Thanks,

	tglx

^ permalink raw reply

* Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets
From: Paolo Abeni @ 2017-09-25 20:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Pablo Neira Ayuso, Florian Westphal,
	Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:
> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:
> > This series refactor the UDP early demux code so that:
> > 
> > * full socket lookup is performed for unicast packets
> > * a sk is grabbed even for unconnected socket match
> > * a dst cache is used even in such scenario
> > 
> > To perform this tasks a couple of facilities are added:
> > 
> > * noref socket references, scoped inside the current RCU section, to be
> >   explicitly cleared before leaving such section
> > * a dst cache inside the inet and inet6 local addresses tables, caching the
> >   related local dst entry
> > 
> > The measured performance gain under small packet UDP flood is as follow:
> > 
> > ingress NIC	vanilla		patched		delta
> > rx queues	(kpps)		(kpps)		(%)
> > [ipv4]
> > 1		2177		2414		10
> > 2		2527		2892		14
> > 3		3050		3733		22
> 
> 
> This is a clear sign your program is not using latest SO_REUSEPORT +
> [ec]BPF filter [1]
> 
> return socket[RX_QUEUE# | or CPU#];
> 
> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is
> based on a lazy hash, meaning that you do not have proper siloing.
> 
> return socket[hash(skb)];
> 
> Multiple cpus can then :
>  - compete on grabbing same socket refcount
>  - compete on grabbing the receive queue lock
>  - compete for releasing lock and socket refcount
>  - skb freeing done on different cpus than where allocated.
> 
> You are adding complexity to the kernel because you are using a
> sub-optimal user space program, favoring false sharing.
> 
> First solve the false sharing issue.
> 
> Performance with 2 rx queues should be almost twice the performance with
> 1 rx queue.
> 
> Then we can see if the gains you claim are still applicable.

Here are the performance results using a BPF filter to distribute the
ingress packet to the reuseport socket with the same id of the ingress
CPU - we have 1 to 1 mapping between the ingress receive queue and the
destination socket:

ingress NIC     vanilla         patched         delta
rx queues       (kpps)          (kpps)          (%)
[ipv4]
2               3020                3663                21
3               4352                5179                19
4               5318                6194                16
5               6258                7583                21
6               7376                8558                16

[ipv6]
2               2446                3949                61
3               3099                5092                64
4               3698                6611                78
5               4382                7852                79
6               5116                8851                73

Sone notes:

- figures obtained with: 

ethtool  -L em2 combined $n
MASK=1
for I in `seq 0 $((n - 1))`; do
        [ $I -eq 0 ] && USE_BPF="--use_bpf" || USE_BPF=""
        udp_sink  --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &
        taskset -p $((MASK << ($I + $n) )) $!
done

- in the IPv6 routing code we currently have a relevant bottle-neck in
ip6_pol_route(), I see a lot of contention on a dst refcount, so
without early demux the performances do not scale well there.

- For maximum performances BH and user space sink need to run on
difference CPUs - yes we have some more cacheline misses and a little
contention on the receive queue spin lock, but a lot less icache misses
and more CPU cycles available, the overall tput is a lot higher than
binding on the same CPU where the BH is running.

> PS: Wei Wan is about to release the IPV6 changes so that the big
> differences you showed are going to disappear soon.

Interesting, looking forward to that!

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH v2 4/4] net: nfc: llcp_core: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-4-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:05 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 3/4] net: nfc: hci: llc_shdlc: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-3-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:04 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 2/4] net: nfc: hci: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-2-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:03 +0530

> Use setup_timer function instead of initializing timer with the
>    function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: [PATCH v2 1/4] net: af_packet: use setup_timer() helper.
From: David Miller @ 2017-09-25 20:19 UTC (permalink / raw)
  To: allen.lkml; +Cc: netdev, sameo
In-Reply-To: <1506324605-10160-1-git-send-email-allen.lkml@gmail.com>

From: Allen Pais <allen.lkml@gmail.com>
Date: Mon, 25 Sep 2017 13:00:02 +0530

> Use setup_timer function instead of initializing timer with the
> function and data fields.
> 
> Signed-off-by: Allen Pais <allen.lkml@gmail.com>
> ---
> v2: rebased to latest net-next.

Applied.

^ permalink raw reply

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-09-25 20:18 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <3f824b0e-65f9-c69c-5421-2c5f6b349b09@redhat.com>

On 09/22/2017 12:03 AM, Jason Wang wrote:
> 
> 
> On 2017年09月21日 03:38, Matthew Rosato wrote:
>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>> patch goes even further which disables rx polling during processing tx.
>>> Please try it to see if it has any difference.
>> Unfortunately, this patch doesn't seem to have made a difference.  I
>> tried runs with both this patch and the previous patch applied, as well
>> as only this patch applied for comparison (numbers from vhost thread of
>> sending VM):
>>
>> 4.12    4.13     patch1   patch2   patch1+2
>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>
>> In each case, the regression in throughput was still present.
> 
> This probably means some other cases of the wakeups were missed. Could
> you please record the callers of __wake_up_sync_key()?
> 

Hi Jason,

With your 2 previous patches applied, every call to __wake_up_sync_key
(for both sender and server vhost threads) shows the following stack trace:

     vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
<-sock_def_readable
     vhost-11478-11520 [002] ....   312.927230: <stack trace>
 => dev_hard_start_xmit
 => sch_direct_xmit
 => __dev_queue_xmit
 => br_dev_queue_push_xmit
 => br_forward_finish
 => __br_forward
 => br_handle_frame_finish
 => br_handle_frame
 => __netif_receive_skb_core
 => netif_receive_skb_internal
 => tun_get_user
 => tun_sendmsg
 => handle_tx
 => vhost_worker
 => kthread
 => kernel_thread_starter
 => kernel_thread_starter

>>
>>> And two questions:
>>> - Is the issue existed if you do uperf between 2VMs (instead of 4VMs)
>> Verified that the second set of guests are not actually required, I can
>> see the regression with only 2 VMs.
>>
>>> - Can enable batching in the tap of sending VM improve the performance
>>> (ethtool -C $tap rx-frames 64)
>> I tried this, but it did not help (actually seemed to make things a
>> little worse)
>>
> 
>  I still can't see a reason that can lead more wakeups, will take more
> time to look at this issue and keep you posted.
> 
> Thanks
> 

^ permalink raw reply

* Re: [PATCH] r8152: add Linksys USB3GIGV1 id
From: Grant Grundler @ 2017-09-25 20:17 UTC (permalink / raw)
  To: Oliver Neukum
  Cc: Grant Grundler, Hayes Wang, David S . Miller, LKML, linux-usb,
	netdev
In-Reply-To: <1506351074.7827.13.camel@suse.com>

[grrhmail...sorry! resending as plain text]

Hallo Oliver!

On Mon, Sep 25, 2017 at 7:51 AM, Oliver Neukum <oneukum@suse.com> wrote:
> Am Freitag, den 22.09.2017, 12:06 -0700 schrieb Grant Grundler:
> > This Linksys dongle by default comes up in cdc_ether mode.
> > This patch allows r8152 to claim the device:
> >    Bus 002 Device 002: ID 13b1:0041 Linksys
>
> Hi,
>
> have you tested this in case cdc_ether is for some reason already
> loaded?

I did not consider testing this case since it's not possible on a
normal chromeos system (the entire root file system is signed for
normal users and get's rebooted after an update).  I could test this
in developer mode of course.

Did you expect both driver probe routines to claim the device and
wreak havoc with the device?

> The patch seems to enable r8152 but does not disable cdc_ether.

Correct. r8152 happens to claim the device before cdc_ether does - I
thought because cdc_ether is a class driver and only gets picked up
after vendor specific drivers are probed.  Is that correct?

I didn't realize cdc_ether has a blacklist to make sure
RTL8152|RTL8153 devices are not picked up by cdc_ether. Would you
prefer I add this device to the blacklist in the same patch?

cheers,
grant

>
>         Regards
>                 Oliver
>

^ permalink raw reply

* Re: [PATCH net-next] sch_netem: faster rb tree removal
From: David Miller @ 2017-09-25 20:11 UTC (permalink / raw)
  To: dsahern; +Cc: eric.dumazet, netdev, stephen
In-Reply-To: <ce328dfd-0b54-883a-36a8-91ec344f86ad@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Mon, 25 Sep 2017 10:14:23 -0600

> I made a simple program this morning and ran it under perf.

If possible please submit this for selftests.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: John Fastabend @ 2017-09-25 19:47 UTC (permalink / raw)
  To: Daniel Borkmann, Andy Gospodarek
  Cc: davem, alexei.starovoitov, peter.waskiewicz.jr, jakub.kicinski,
	netdev, mchan
In-Reply-To: <59C94FF4.8070900@iogearbox.net>

On 09/25/2017 11:50 AM, Daniel Borkmann wrote:
> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
> [...]
>> First, thanks for this detailed description.  It was helpful to read
>> along with the patches.
>>
>> My only concern about this area being generic is that you are now in a
>> state where any bpf program must know about all the bpf programs in the
>> receive pipeline before it can properly parse what is stored in the
>> meta-data and add it to an skb (or perform any other action).
>> Especially if each program adds it's own meta-data along the way.
>>
>> Maybe this isn't a big concern based on the number of users of this
>> today, but it just starts to seem like a concern as there are these
>> hints being passed between layers that are challenging to track due to a
>> lack of a standard format for passing data between.
> 
> Btw, we do have similar kind of programmable scratch buffer also today
> wrt skb cb[] that you can program from tc side, the perf ring buffer,
> which doesn't have any fixed layout for the slots, or a per-cpu map
> where you can transfer data between tail calls for example, then tail
> calls themselves that need to coordinate, or simply mangling of packets
> itself if you will, but more below to your use case ...
> 
>> The main reason I bring this up is that Michael and I had discussed and
>> designed a way for drivers to communicate between each other that rx
>> resources could be freed after a tx completion on an XDP_REDIRECT
>> action.  Much like this code, it involved adding an new element to
>> struct xdp_md that could point to the important information.  Now that
>> there is a generic way to handle this, it would seem nice to be able to
>> leverage it, but I'm not sure how reliable this meta-data area would be
>> without the ability to mark it in some manner.
>>
>> For additional background, the minimum amount of data needed in the case
>> Michael and I were discussing was really 2 words.  One to serve as a
>> pointer to an rx_ring structure and one to have a counter to the rx
>> producer entry.  This data could be acessed by the driver processing the
>> tx completions and callback to the driver that received the frame off
>> the wire
>> to perform any needed processing.  (For those curious this would also
>> require a
>> new callback/netdev op to act on this data stored in the XDP buffer.)
> 
> What you describe above doesn't seem to be fitting to the use-case of
> this set, meaning the area here is fully programmable out of the BPF
> program, the infrastructure you're describing is some sort of means of
> communication between drivers for the XDP_REDIRECT, and should be
> outside of the control of the BPF program to mangle.
> 
> You could probably reuse the base infra here and make a part of that
> inaccessible for the program with some sort of a fixed layout, but I
> haven't seen your code yet to be able to fully judge. Intention here
> is to allow for programmability within the BPF prog in a generic way,
> such that based on the use-case it can be populated in specific ways
> and propagated to the skb w/o having to define a fixed layout and
> bloat xdp_buff all the way to an skb while still retaining all the
> flexibility.
> 
> Thanks,
> Daniel

Hi Andy,

I'm guessing this data needs to be passed from the input dev to the
output dev based on your description. If the driver data
is pushed after the BPF program is run but before the xdp_do_flush_map
call no other BPF programs can be run on that xdp_buff. It
should be safe at this point to use the metadata region directly
from the driver. We would just need to add a few helpers for the
drivers to use for this maybe, xdp_metadata_write_drv,
xdp_meadata_read_drv. I think this would work for your use case?
The data structure would have to be agreed upon by all the drivers
but would not be UAPI because it would only be exposed in the
driver. So we would be free to change/update it as needed.

Thanks,
John

^ permalink raw reply

* Re: [ovs-dev] [PATCH net-next v9] openvswitch: enable NSH support
From: Eric Garver @ 2017-09-25 19:28 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, jbenc, davem
In-Reply-To: <1506348969-6233-1-git-send-email-yi.y.yang@intel.com>

On Mon, Sep 25, 2017 at 10:16:09PM +0800, Yi Yang wrote:
> v8->v9
>  - Fix build error reported by daily intel build
>    because nsh module isn't selected by openvswitch
> 
> v7->v8
>  - Rework nested value and mask for OVS_KEY_ATTR_NSH
>  - Change pop_nsh to adapt to nsh kernel module
>  - Fix many issues per comments from Jiri Benc
> 
> v6->v7
>  - Remove NSH GSO patches in v6 because Jiri Benc
>    reworked it as another patch series and they have
>    been merged.
>  - Change it to adapt to nsh kernel module added by NSH
>    GSO patch series
> 
> v5->v6
>  - Fix the rest comments for v4.
>  - Add NSH GSO support for VxLAN-gpe + NSH and
>    Eth + NSH.
> 
> v4->v5
>  - Fix many comments by Jiri Benc and Eric Garver
>    for v4.
> 
> v3->v4
>  - Add new NSH match field ttl
>  - Update NSH header to the latest format
>    which will be final format and won't change
>    per its author's confirmation.
>  - Fix comments for v3.
> 
> v2->v3
>  - Change OVS_KEY_ATTR_NSH to nested key to handle
>    length-fixed attributes and length-variable
>    attriubte more flexibly.
>  - Remove struct ovs_action_push_nsh completely
>  - Add code to handle nested attribute for SET_MASKED
>  - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
>    to transfer NSH header data.
>  - Fix comments and coding style issues by Jiri and Eric
> 
> v1->v2
>  - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
>  - Dynamically allocate struct ovs_action_push_nsh for
>    length-variable metadata.
> 
> OVS master and 2.8 branch has merged NSH userspace
> patch series, this patch is to enable NSH support
> in kernel data path in order that OVS can support
> NSH in compat mode by porting this.
> 
> Signed-off-by: Yi Yang <yi.y.yang@intel.com>
> ---
>  include/net/nsh.h                |   3 +
>  include/uapi/linux/openvswitch.h |  29 ++++
>  net/nsh/nsh.c                    |  53 +++++++
>  net/openvswitch/Kconfig          |   1 +
>  net/openvswitch/actions.c        | 112 ++++++++++++++
>  net/openvswitch/flow.c           |  51 ++++++
>  net/openvswitch/flow.h           |  11 ++
>  net/openvswitch/flow_netlink.c   | 324 ++++++++++++++++++++++++++++++++++++++-
>  net/openvswitch/flow_netlink.h   |   5 +
>  9 files changed, 588 insertions(+), 1 deletion(-)
> 
[...]
> diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
> index 58fb827..54334ca 100644
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -14,6 +14,59 @@
>  #include <net/nsh.h>
>  #include <net/tun_proto.h>
>  
> +int skb_push_nsh(struct sk_buff *skb, const struct nshhdr *src_nsh_hdr)
> +{
> +	struct nshhdr *nsh_hdr;
> +	size_t length = nsh_hdr_len(src_nsh_hdr);
> +	u8 next_proto;
> +
> +	if (skb->mac_len) {
> +		next_proto = TUN_P_ETHERNET;
> +	} else {
> +		next_proto = tun_p_from_eth_p(skb->protocol);
> +		if (!next_proto)
> +			return -EAFNOSUPPORT;
> +	}
> +
> +	/* Add the NSH header */
> +	if (skb_cow_head(skb, length) < 0)
> +		return -ENOMEM;
> +
> +	skb_push(skb, length);
> +	nsh_hdr = (struct nshhdr *)(skb->data);
> +	memcpy(nsh_hdr, src_nsh_hdr, length);
> +	nsh_hdr->np = next_proto;
> +
> +	skb->protocol = htons(ETH_P_NSH);
> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

I think you mean to reset network_header before mac_len.

> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_push_nsh);
> +
> +int skb_pop_nsh(struct sk_buff *skb)
> +{
> +	struct nshhdr *nsh_hdr = (struct nshhdr *)(skb->data);
> +	size_t length;
> +	u16 inner_proto;
> +
> +	inner_proto = tun_p_to_eth_p(nsh_hdr->np);
> +	if (!inner_proto)
> +		return -EAFNOSUPPORT;
> +
> +	length = nsh_hdr_len(nsh_hdr);
> +	skb_pull(skb, length);

Do you need to verify you can actually pull length bytes? I don't see
any guarantee.

> +	skb_reset_mac_header(skb);
> +	skb_reset_mac_len(skb);
> +	skb_reset_network_header(skb);

Reset network before mac_len.

> +	skb->protocol = inner_proto;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(skb_pop_nsh);
> +
>  static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
>  				       netdev_features_t features)
>  {
[...]
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index a54a556..d026b85 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
[...]
> @@ -602,6 +642,59 @@ static int set_ipv6(struct sk_buff *skb, struct sw_flow_key *flow_key,
>  	return 0;
>  }
>  
> +static int set_nsh(struct sk_buff *skb, struct sw_flow_key *flow_key,
> +		   const struct nlattr *a)
> +{
> +	struct nshhdr *nsh_hdr;
> +	int err;
> +	u8 flags;
> +	u8 ttl;
> +	int i;
> +
> +	struct ovs_key_nsh key;
> +	struct ovs_key_nsh mask;
> +
> +	err = nsh_key_from_nlattr(a, &key, &mask);
> +	if (err)
> +		return err;
> +
> +	err = skb_ensure_writable(skb, skb_network_offset(skb) +
> +				  sizeof(struct nshhdr));

This calls pskb_may_pull(), but you're not pulling any data here.

> +	if (unlikely(err))
> +		return err;
> +
> +	nsh_hdr = (struct nshhdr *)skb_network_header(skb);
> +
> +	flags = nsh_get_flags(nsh_hdr);
> +	flags = OVS_MASKED(flags, key.flags, mask.flags);
> +	flow_key->nsh.flags = flags;
> +	ttl = nsh_get_ttl(nsh_hdr);
> +	ttl = OVS_MASKED(ttl, key.ttl, mask.ttl);
> +	flow_key->nsh.ttl = ttl;
> +	nsh_set_flags_and_ttl(nsh_hdr, flags, ttl);
> +	nsh_hdr->path_hdr = OVS_MASKED(nsh_hdr->path_hdr, key.path_hdr,
> +				       mask.path_hdr);
> +	flow_key->nsh.path_hdr = nsh_hdr->path_hdr;
> +	switch (nsh_hdr->mdtype) {
> +	case NSH_M_TYPE1:
> +		for (i = 0; i < NSH_MD1_CONTEXT_SIZE; i++) {
> +			nsh_hdr->md1.context[i] =
> +			    OVS_MASKED(nsh_hdr->md1.context[i], key.context[i],
> +				       mask.context[i]);
> +		}
> +		memcpy(flow_key->nsh.context, nsh_hdr->md1.context,
> +		       sizeof(nsh_hdr->md1.context));
> +		break;
> +	case NSH_M_TYPE2:
> +		memset(flow_key->nsh.context, 0,
> +		       sizeof(flow_key->nsh.context));
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
>  /* Must follow skb_ensure_writable() since that can move the skb data. */
>  static void set_tp_port(struct sk_buff *skb, __be16 *port,
>  			__be16 new_port, __sum16 *check)
[...]

^ permalink raw reply

* Re: [Patch net-next v2] net_sched: use idr to allocate u32 filter handles
From: Jiri Pirko @ 2017-09-25 19:00 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, Chris Mi, Jamal Hadi Salim
In-Reply-To: <20170925171351.4956-3-xiyou.wangcong@gmail.com>

Mon, Sep 25, 2017 at 07:13:51PM CEST, xiyou.wangcong@gmail.com wrote:
>Instead of calling u32_lookup_ht() in a loop to find
>a unused handle, just switch to idr API to allocate
>new handles. u32 filters are special as the handle
>could contain a hash table id and a key id, so we
>need two IDR to allocate each of them.
>
>Cc: Chris Mi <chrism@mellanox.com>
>Cc: Jamal Hadi Salim <jhs@mojatatu.com>
>Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>---

[...]

>@@ -311,19 +313,19 @@ static void *u32_get(struct tcf_proto *tp, u32 handle)
> 	return u32_lookup_key(ht, handle);
> }
> 
>-static u32 gen_new_htid(struct tc_u_common *tp_c)
>+static u32 gen_new_htid(struct tc_u_common *tp_c, struct tc_u_hnode *ptr)
> {
>-	int i = 0x800;
>+	unsigned long idr_index;
>+	int err;
> 
>-	/* hgenerator only used inside rtnl lock it is safe to increment
>+	/* This is only used inside rtnl lock it is safe to increment
> 	 * without read _copy_ update semantics
> 	 */
>-	do {
>-		if (++tp_c->hgenerator == 0x7FF)
>-			tp_c->hgenerator = 1;
>-	} while (--i > 0 && u32_lookup_ht(tp_c, (tp_c->hgenerator|0x800)<<20));
>-
>-	return i > 0 ? (tp_c->hgenerator|0x800)<<20 : 0;
>+	err = idr_alloc_ext(&tp_c->handle_idr, ptr, &idr_index,
>+			    1, 0x7FF, GFP_KERNEL);

Interesting, any idea why this is not 0x7FFFFFFF as well?

I wonder if we could have 0x7FFFFFFF magic defined somewhere.

Otherwise, "patchset" looks good. Thank you for taking care of this!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox