Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] pktgen: add a new sample script for 40G and above link testing
From: Waskiewicz Jr, Peter @ 2017-08-25 15:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Hu, Robert, robert.hu@linux.intel.com, netdev@vger.kernel.org
In-Reply-To: <20170825165929.0d840ab2@redhat.com>

On 8/25/17 10:59 AM, Jesper Dangaard Brouer wrote:
> On Fri, 25 Aug 2017 14:24:28 +0000
> "Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:
> 
>> On 8/25/17 5:19 AM, Jesper Dangaard Brouer wrote:
>>>>
>>>> Tested with Intel XL710 NIC with Cisco 3172 switch.
>>>>
>>>> It would be even slightly better if the irqbalance service is turned
>>>> off outside.
>>>
>>> Yes, if you don't turn-off (kill) irqbalance it will move around the
>>> IRQs behind your back...
>>
>> Or you can use the --banirq option to irqbalance to ignore your device's
>> interrupts as targets for balancing.
> 
> It might be worth mentioning that --banirq=X is specified for each IRQ
> that you want to exclude, and --banirq is simply specified multiple
> times on the command line.
> 
> Is it possible to tell a running irqbalance that I want to excluded an
> extra IRQ? (just before I do my manual adjustment).

It isn't possible today, since we don't have a way to attach a 
foreground/oneshot irqbalance run to a currently-running daemon.  That's 
an interesting feature enhancement...I can add it to our list as a 
feature request so I don't forget about it.  That way I can also get 
Neil's thoughts on this.

-PJ

^ permalink raw reply

* Re: [PATCH] net: stmmac: dwmac-sun8i: Use reset exclusive
From: Corentin Labbe @ 2017-08-25 15:17 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: peppe.cavallaro, alexandre.torgue, wens, netdev, linux-arm-kernel,
	linux-kernel
In-Reply-To: <20170825144832.3w6izhri5vstbndp@flea.lan>

On Fri, Aug 25, 2017 at 04:48:32PM +0200, Maxime Ripard wrote:
> On Fri, Aug 25, 2017 at 04:38:05PM +0200, Corentin Labbe wrote:
> > The current dwmac_sun8i module cannot be rmmod/modprobe due to that
> > the reset controller was not released when removed.
> > 
> > This patch remove ambiguity, by using of_reset_control_get_exclusive and
> > add the missing reset_control_put().
> > 
> > Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> > index fffd6d5fc907..675a09629d85 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> > @@ -782,6 +782,7 @@ static int sun8i_dwmac_unpower_internal_phy(struct sunxi_priv_data *gmac)
> >  
> >  	clk_disable_unprepare(gmac->ephy_clk);
> >  	reset_control_assert(gmac->rst_ephy);
> > +	reset_control_put(gmac->rst_ephy);
> >  	return 0;
> >  }
> >  
> > @@ -942,7 +943,7 @@ static int sun8i_dwmac_probe(struct platform_device *pdev)
> >  			return -EINVAL;
> >  		}
> >  
> > -		gmac->rst_ephy = of_reset_control_get(plat_dat->phy_node, NULL);
> > +		gmac->rst_ephy = of_reset_control_get_exclusive(plat_dat->phy_node, NULL);
> 
> Why not just use devm_reset_control_get?
> 

Because there no devm_ functions with of_

Regards

^ permalink raw reply

* [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Jacob Keller @ 2017-08-25 15:24 UTC (permalink / raw)
  To: netdev; +Cc: Jacob Keller

Under some circumstances, such as with many stacked devices, it is
possible that dev_hard_start_xmit will bundle many packets together, and
mark them all with xmit_more.

Most drivers respond to xmit_more by skipping tail bumps on packet
rings, or similar behavior as long as xmit_more is set. This is
a performance win since it means drivers can avoid notifying hardware of
new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
bandwidth.

This use of xmit_more comes with a trade off because bundling too many
packets can increase latency of the Tx packets. To avoid this, we should
limit the maximum number of packets with xmit_more.

Driver authors could modify their drivers to check for some determined
limit, but this requires all drivers to be modified in order to gain
advantage.

Instead, add a sysctl "xmit_more_max" which can be used to configure the
maximum number of xmit_more skbs to send in a sequence. This ensures
that all drivers benefit, and allows system administrators the option to
tune the value to their environment.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---

Stray thoughts and further questions....

Is this the right approach? Did I miss any other places where we should
limit? Does the limit make sense? Should it instead be a per-device
tuning nob instead of a global? Is 32 a good default?

 Documentation/sysctl/net.txt |  6 ++++++
 include/linux/netdevice.h    |  2 ++
 net/core/dev.c               | 10 +++++++++-
 net/core/sysctl_net_core.c   |  7 +++++++
 4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index b67044a2575f..3d995e8f4448 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -230,6 +230,12 @@ netdev_max_backlog
 Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
 receives packets faster than kernel can process them.
 
+xmit_more_max
+-------------
+
+Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
+indicates no limit.
+
 netdev_rss_key
 --------------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c5475b37a631..6341452aed09 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3321,6 +3321,8 @@ void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev);
 extern int		netdev_budget;
 extern unsigned int	netdev_budget_usecs;
 
+extern unsigned int sysctl_xmit_more_max;
+
 /* Called by rtnetlink.c:rtnl_unlock() */
 void netdev_run_todo(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 270b54754821..d9946d29c3a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2983,12 +2983,19 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *de
 {
 	struct sk_buff *skb = first;
 	int rc = NETDEV_TX_OK;
+	int xmit_count = 0;
+	bool more = true;
 
 	while (skb) {
 		struct sk_buff *next = skb->next;
 
+		if (sysctl_xmit_more_max)
+			more = xmit_count++ < sysctl_xmit_more_max;
+		if (!more)
+			xmit_count = 0;
+
 		skb->next = NULL;
-		rc = xmit_one(skb, dev, txq, next != NULL);
+		rc = xmit_one(skb, dev, txq, more && next != NULL);
 		if (unlikely(!dev_xmit_complete(rc))) {
 			skb->next = next;
 			goto out;
@@ -3523,6 +3530,7 @@ EXPORT_SYMBOL(netdev_max_backlog);
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 unsigned int __read_mostly netdev_budget_usecs = 2000;
+unsigned int __read_mostly sysctl_xmit_more_max = 32;
 int weight_p __read_mostly = 64;           /* old backlog weight */
 int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
 int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index b7cd9aafe99e..6950e702e101 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -460,6 +460,13 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "xmit_more_max",
+		.data		= &sysctl_xmit_more_max,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.prox_handler	= proc_dointvec
+	},
 	{ }
 };
 
-- 
2.14.1.436.g33e61a4f0239

^ permalink raw reply related

* Re: XDP redirect measurements, gotchas and tracepoints
From: Michael Chan @ 2017-08-25 15:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, Alexander Duyck, Duyck, Alexander H,
	pstaszewski@itcare.pl, netdev@vger.kernel.org,
	xdp-newbies@vger.kernel.org, andy@greyhouse.net,
	borkmann@iogearbox.net
In-Reply-To: <59A03DF5.7070806@gmail.com>

On Fri, Aug 25, 2017 at 8:10 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 08/25/2017 05:45 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 24 Aug 2017 20:36:28 -0700
>> Michael Chan <michael.chan@broadcom.com> wrote:
>>
>>> On Wed, Aug 23, 2017 at 1:29 AM, Jesper Dangaard Brouer
>>> <brouer@redhat.com> wrote:
>>>> On Tue, 22 Aug 2017 23:59:05 -0700
>>>> Michael Chan <michael.chan@broadcom.com> wrote:
>>>>
>>>>> On Tue, Aug 22, 2017 at 6:06 PM, Alexander Duyck
>>>>> <alexander.duyck@gmail.com> wrote:
>>>>>> On Tue, Aug 22, 2017 at 1:04 PM, Michael Chan <michael.chan@broadcom.com> wrote:
>>>>>>>
>>>>>>> Right, but it's conceivable to add an API to "return" the buffer to
>>>>>>> the input device, right?
>>>>
>>>> Yes, I would really like to see an API like this.
>>>>
>>>>>>
>>>>>> You could, it is just added complexity. "just free the buffer" in
>>>>>> ixgbe usually just amounts to one atomic operation to decrement the
>>>>>> total page count since page recycling is already implemented in the
>>>>>> driver. You still would have to unmap the buffer regardless of if you
>>>>>> were recycling it or not so all you would save is 1.000015259 atomic
>>>>>> operations per packet. The fraction is because once every 64K uses we
>>>>>> have to bulk update the count on the page.
>>>>>>
>>>>>
>>>>> If the buffer is returned to the input device, the input device can
>>>>> keep the DMA mapping.  All it needs to do is to dma_sync it back to
>>>>> the input device when the buffer is returned.
>>>>
>>>> Yes, exactly, return to the input device. I really think we should
>>>> work on a solution where we can keep the DMA mapping around.  We have
>>>> an opportunity here to make ndo_xdp_xmit TX queues use a specialized
>>>> page return call, to achieve this. (I imagine other arch's have a high
>>>> DMA overhead than Intel)
>>>>
>>>> I'm not sure how the API should look.  The ixgbe recycle mechanism and
>>>> splitting the page (into two packets) actually complicates things, and
>>>> tie us into a page-refcnt based model.  We could get around this by
>>>> each driver implementing a page-return-callback, that allow us to
>>>> return the page to the input device?  Then, drivers implementing the
>>>> 1-packet-per-page can simply check/read the page-refcnt, and if it is
>>>> "1" DMA-sync and reuse it in the RX queue.
>>>>
>>>
>>> Yeah, based on Alex' description, it's not clear to me whether ixgbe
>>> redirecting to a non-intel NIC or vice versa will actually work.  It
>>> sounds like the output device has to make some assumptions about how
>>> the page was allocated by the input device.
>>
>> Yes, exactly. We are tied into a page refcnt based scheme.
>>
>> Besides the ixgbe page recycle scheme (which keeps the DMA RX-mapping)
>> is also tied to the RX queue size, plus how fast the pages are returned.
>> This makes it very hard to tune.  As I demonstrated, default ixgbe
>> settings does not work well with XDP_REDIRECT.  I needed to increase
>> TX-ring size, but it broke page recycling (dropping perf from 13Mpps to
>> 10Mpps) so I also needed it increase RX-ring size.  But perf is best if
>> RX-ring size is smaller, thus two contradicting tuning needed.
>>
>
> The changes to decouple the ixgbe page recycle scheme (1pg per descriptor
> split into two halves being the default) from the number of descriptors
> doesn't look too bad IMO. It seems like it could be done by having some
> extra pages allocated upfront and pulling those in when we need another
> page.
>
> This would be a nice iterative step we could take on the existing API.
>
>>
>>> With buffer return API,
>>> each driver can cleanly recycle or free its own buffers properly.
>>
>> Yes, exactly. And RX-driver can implement a special memory model for
>> this queue.  E.g. RX-driver can know this is a dedicated XDP RX-queue
>> which is never used for SKBs, thus opening for new RX memory models.
>>
>> Another advantage of a return API.  There is also an opportunity for
>> avoiding the DMA map on TX. As we need to know the from-device.  Thus,
>> we can add a DMA API, where we can query if the two devices uses the
>> same DMA engine, and can reuse the same DMA address the RX-side already
>> knows.
>>
>>
>>> Let me discuss this further with Andy to see if we can come up with a
>>> good scheme.
>>
>> Sound good, looking forward to hear what you come-up with :-)
>>
>
> I guess by this thread we will see a broadcom nic with redirect support
> soon ;)

Yes, Andy actually has finished the coding for XDP_REDIRECT, but the
buffer recycling scheme has some problems.  We can make it work for
Broadcom to Broadcom only, but we want a better solution.

^ permalink raw reply

* Re: [PATCH 5/7] RDS: make rhashtable_params const
From: Santosh Shilimkar @ 2017-08-25 15:31 UTC (permalink / raw)
  To: Bhumika Goyal, julia.lawall-L2FTfq7BK8M,
	marcel-kz+m5ild9QBg9hUCZPvPmw, gustavo-THi1TnShQwVAfugRpC6u6w,
	johan.hedberg-Re5JQEeQqe8AvxtiuMwx3w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	kadlec-K40Dz/62t/MgiyqX0sVFJYdd74u8MsAO,
	fw-HFFVJYpyMKqzQB+pC5nmwQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	alex.aring-Re5JQEeQqe8AvxtiuMwx3w, stefan-JPH+aEBZ4P+UEJcrhfAQsw,
	kuznet-v/Mj1YrvjDBInbfyfbPRSQ, yoshfuji-VfPWfsRibaP+Ru+s062T9g,
	trond.myklebust-7I+n7zu2hftEKMMhf/gKZA,
	anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA,
	bfields-uC3wQj2KruNg9hUCZPvPmw, jlayton-vpEMnDpepFuMZCB2o+C8xQ,
	linux-bluetooth-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netfilter-devel-u79uwXL29TY76Z2rM5mHXA,
	coreteam-Cap9r6Oaw4JrovVCs/uTlw,
	bridge-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-wpan-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	rds-devel-N0ozoZBvEnrZJqsBc5GL+g,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1503670907-23221-6-git-send-email-bhumirks-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

8/25/2017 7:21 AM, Bhumika Goyal wrote:
> Make this const as it is either used during a copy operation or passed
> to a const argument of the function rhltable_init
> 
> Signed-off-by: Bhumika Goyal <bhumirks-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
Acked-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply

* Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Waskiewicz Jr, Peter @ 2017-08-25 15:36 UTC (permalink / raw)
  To: Keller, Jacob E, netdev@vger.kernel.org
In-Reply-To: <20170825152449.29790-1-jacob.e.keller@intel.com>

On 8/25/17 11:25 AM, Jacob Keller wrote:
> Under some circumstances, such as with many stacked devices, it is
> possible that dev_hard_start_xmit will bundle many packets together, and
> mark them all with xmit_more.
> 
> Most drivers respond to xmit_more by skipping tail bumps on packet
> rings, or similar behavior as long as xmit_more is set. This is
> a performance win since it means drivers can avoid notifying hardware of
> new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
> bandwidth.
> 
> This use of xmit_more comes with a trade off because bundling too many
> packets can increase latency of the Tx packets. To avoid this, we should
> limit the maximum number of packets with xmit_more.
> 
> Driver authors could modify their drivers to check for some determined
> limit, but this requires all drivers to be modified in order to gain
> advantage.
> 
> Instead, add a sysctl "xmit_more_max" which can be used to configure the
> maximum number of xmit_more skbs to send in a sequence. This ensures
> that all drivers benefit, and allows system administrators the option to
> tune the value to their environment.
> 
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
> 
> Stray thoughts and further questions....
> 
> Is this the right approach? Did I miss any other places where we should
> limit? Does the limit make sense? Should it instead be a per-device
> tuning nob instead of a global? Is 32 a good default?

I actually like the idea of a per-device knob.  A xmit_more_max that's 
global in a system with 1GbE devices along with a 25/50GbE or more just 
doesn't make much sense to me.  Or having heterogeneous vendor devices 
in the same system that have different HW behaviors could mask issues 
with latency.

This seems like another incarnation of possible buffer-bloat if the max 
is too high...

> 
>   Documentation/sysctl/net.txt |  6 ++++++
>   include/linux/netdevice.h    |  2 ++
>   net/core/dev.c               | 10 +++++++++-
>   net/core/sysctl_net_core.c   |  7 +++++++
>   4 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> index b67044a2575f..3d995e8f4448 100644
> --- a/Documentation/sysctl/net.txt
> +++ b/Documentation/sysctl/net.txt
> @@ -230,6 +230,12 @@ netdev_max_backlog
>   Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
>   receives packets faster than kernel can process them.
>   
> +xmit_more_max
> +-------------
> +
> +Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
> +indicates no limit.

What defines "packet?"  MTU-sized packets, or payloads coming down from 
the stack (e.g. TSO's)?

-PJ

^ permalink raw reply

* Re: [PATCH net v2 1/4] net: mvpp2: fix the mac address used when using PPv2.2
From: Andrew Lunn @ 2017-08-25 15:42 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, thomas.petazzoni, gregory.clement, nadavh, linux,
	linux-kernel, mw, stefanc, netdev
In-Reply-To: <20170825142928.GB31512@kwain>

> So probably the best way to handle this would have been to send 1/4 to
> net and 2-4/4 to net-next

Correct. 

> (but then there's a dependency between the two series).

Dave merges net into net-next every so often. So you just need to wait
a bit before submitting the net-next parts.

  Andrew

^ permalink raw reply

* Re: [PATCH] strparser: initialize all callbacks
From: Tom Herbert @ 2017-08-25 15:50 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Linux Kernel Network Developers, David S . Miller, linux-kernel,
	Eric Biggers, Dmitry Vyukov
In-Reply-To: <20170824213851.57601-1-ebiggers3@gmail.com>

On Thu, Aug 24, 2017 at 2:38 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> From: Eric Biggers <ebiggers@google.com>
>
> commit bbb03029a899 ("strparser: Generalize strparser") added more
> function pointers to 'struct strp_callbacks'; however, kcm_attach() was
> not updated to initialize them.  This could cause the ->lock() and/or
> ->unlock() function pointers to be set to garbage values, causing a
> crash in strp_work().
>
> Fix the bug by moving the callback structs into static memory, so
> unspecified members are zeroed.  Also constify them while we're at it.
>
> This bug was found by syzkaller, which encountered the following splat:
>
>     IP: 0x55
>     PGD 3b1ca067
>     P4D 3b1ca067
>     PUD 3b12f067
>     PMD 0
>
>     Oops: 0010 [#1] SMP KASAN
>     Dumping ftrace buffer:
>        (ftrace buffer empty)
>     Modules linked in:
>     CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
>     Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>     Workqueue: kstrp strp_work
>     task: ffff88006bb0e480 task.stack: ffff88006bb10000
>     RIP: 0010:0x55
>     RSP: 0018:ffff88006bb17540 EFLAGS: 00010246
>     RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000
>     RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48
>     RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000
>     R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48
>     R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000
>     FS:  0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
>     CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>     CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0
>     DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>     DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>     Call Trace:
>      process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
>      worker_thread+0x223/0x1860 kernel/workqueue.c:2233
>      kthread+0x35e/0x430 kernel/kthread.c:231
>      ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
>     Code:  Bad RIP value.
>     RIP: 0x55 RSP: ffff88006bb17540
>     CR2: 0000000000000055
>     ---[ end trace f0e4920047069cee ]---
>
> Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
> CONFIG_AF_KCM=y):
>
>     #include <linux/bpf.h>
>     #include <linux/kcm.h>
>     #include <linux/types.h>
>     #include <stdint.h>
>     #include <sys/ioctl.h>
>     #include <sys/socket.h>
>     #include <sys/syscall.h>
>     #include <unistd.h>
>
>     static const struct bpf_insn bpf_insns[3] = {
>         { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
>         { .code = 0x95 }, /* BPF_EXIT_INSN() */
>     };
>
>     static const union bpf_attr bpf_attr = {
>         .prog_type = 1,
>         .insn_cnt = 2,
>         .insns = (uintptr_t)&bpf_insns,
>         .license = (uintptr_t)"",
>     };
>
>     int main(void)
>     {
>         int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
>                              &bpf_attr, sizeof(bpf_attr));
>         int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
>         int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);
>
>         ioctl(kcm_fd, SIOCKCMATTACH,
>               &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
>     }
>
> Fixes: bbb03029a899 ("strparser: Generalize strparser")
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Cc: Tom Herbert <tom@quantonium.net>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  Documentation/networking/strparser.txt |  2 +-
>  include/net/strparser.h                |  2 +-
>  kernel/bpf/sockmap.c                   | 10 +++++-----
>  net/kcm/kcmsock.c                      | 11 +++++------
>  net/strparser/strparser.c              |  2 +-
>  5 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt
> index fe01302471ae..13081b3decef 100644
> --- a/Documentation/networking/strparser.txt
> +++ b/Documentation/networking/strparser.txt
> @@ -35,7 +35,7 @@ Functions
>  =========
>
>  strp_init(struct strparser *strp, struct sock *sk,
> -         struct strp_callbacks *cb)
> +         const struct strp_callbacks *cb)
>
>       Called to initialize a stream parser. strp is a struct of type
>       strparser that is allocated by the upper layer. sk is the TCP
> diff --git a/include/net/strparser.h b/include/net/strparser.h
> index 4fe966a0ad92..7dc131d62ad5 100644
> --- a/include/net/strparser.h
> +++ b/include/net/strparser.h
> @@ -138,7 +138,7 @@ void strp_done(struct strparser *strp);
>  void strp_stop(struct strparser *strp);
>  void strp_check_rcv(struct strparser *strp);
>  int strp_init(struct strparser *strp, struct sock *sk,
> -             struct strp_callbacks *cb);
> +             const struct strp_callbacks *cb);
>  void strp_data_ready(struct strparser *strp);
>  int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
>                  unsigned int orig_offset, size_t orig_len,
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index 78b2bb9370ac..617c239590c2 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -368,12 +368,12 @@ static int smap_read_sock_done(struct strparser *strp, int err)
>  static int smap_init_sock(struct smap_psock *psock,
>                           struct sock *sk)
>  {
> -       struct strp_callbacks cb;
> +       static const struct strp_callbacks cb = {
> +               .rcv_msg = smap_read_sock_strparser,
> +               .parse_msg = smap_parse_func_strparser,
> +               .read_sock_done = smap_read_sock_done,
> +       };
>
> -       memset(&cb, 0, sizeof(cb));
> -       cb.rcv_msg = smap_read_sock_strparser;
> -       cb.parse_msg = smap_parse_func_strparser;
> -       cb.read_sock_done = smap_read_sock_done;
>         return strp_init(&psock->strp, sk, &cb);
>  }
>
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index 88ce73288247..48e993b2dbcf 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -1376,7 +1376,11 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
>         struct kcm_psock *psock = NULL, *tpsock;
>         struct list_head *head;
>         int index = 0;
> -       struct strp_callbacks cb;
> +       static const struct strp_callbacks cb = {
> +               .rcv_msg = kcm_rcv_strparser,
> +               .parse_msg = kcm_parse_func_strparser,
> +               .read_sock_done = kcm_read_sock_done,
> +       };
>         int err;
>
>         csk = csock->sk;
> @@ -1391,11 +1395,6 @@ static int kcm_attach(struct socket *sock, struct socket *csock,
>         psock->sk = csk;
>         psock->bpf_prog = prog;
>
> -       cb.rcv_msg = kcm_rcv_strparser;
> -       cb.abort_parser = NULL;
> -       cb.parse_msg = kcm_parse_func_strparser;
> -       cb.read_sock_done = kcm_read_sock_done;
> -
>         err = strp_init(&psock->strp, csk, &cb);
>         if (err) {
>                 kmem_cache_free(kcm_psockp, psock);
> diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
> index 434aa6637a52..d4ea46a5f233 100644
> --- a/net/strparser/strparser.c
> +++ b/net/strparser/strparser.c
> @@ -472,7 +472,7 @@ static void strp_sock_unlock(struct strparser *strp)
>  }
>
>  int strp_init(struct strparser *strp, struct sock *sk,
> -             struct strp_callbacks *cb)
> +             const struct strp_callbacks *cb)
>  {
>
>         if (!cb || !cb->rcv_msg || !cb->parse_msg)
> --
> 2.14.1.342.g6490525c54-goog
>

 Acked-by: Tom Herbert <tom@quantonium.net>

Thanks!

^ permalink raw reply

* Re: [PATCH net-next v2 00/14] net: mvpp2: comphy configuration
From: Andrew Lunn @ 2017-08-25 15:51 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170825144821.31129-1-antoine.tenart@free-electrons.com>

>   - Checked if the carrier_on/off functions were needed. They are.

Hi Antoine

Could you explain the situations they are needed in?

Quite a few drivers do this, so i'm not saying it is wrong.  But it
would be nice to understand if we are missing something in phylib.

      Andrew

^ permalink raw reply

* Re: [PATCH net v2 1/4] net: mvpp2: fix the mac address used when using PPv2.2
From: Antoine Tenart @ 2017-08-25 15:54 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, thomas.petazzoni, gregory.clement, nadavh,
	linux, linux-kernel, mw, stefanc, netdev
In-Reply-To: <20170825154247.GC30922@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 598 bytes --]

On Fri, Aug 25, 2017 at 05:42:47PM +0200, Andrew Lunn wrote:
> > So probably the best way to handle this would have been to send 1/4 to
> > net and 2-4/4 to net-next
> 
> Correct. 
> 
> > (but then there's a dependency between the two series).
> 
> Dave merges net into net-next every so often. So you just need to wait
> a bit before submitting the net-next parts.

I see. So Dave can take patch 1/4 and I'll respin the others once it's
merged into net-next.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v2 00/14] net: mvpp2: comphy configuration
From: Antoine Tenart @ 2017-08-25 15:56 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux, linux-kernel,
	mw, stefanc, miquel.raynal, netdev
In-Reply-To: <20170825155111.GD30922@lunn.ch>

[-- Attachment #1: Type: text/plain, Size: 574 bytes --]

Hi Andrew,

On Fri, Aug 25, 2017 at 05:51:11PM +0200, Andrew Lunn wrote:
> >   - Checked if the carrier_on/off functions were needed. They are.
> 
> Could you explain the situations they are needed in?
> 
> Quite a few drivers do this, so i'm not saying it is wrong.  But it
> would be nice to understand if we are missing something in phylib.

I know it's not working without these calls, but I can't tell why (yet).
I can try to dig into this.

Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Stephen Hemminger @ 2017-08-25 15:58 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter; +Cc: Keller, Jacob E, netdev@vger.kernel.org
In-Reply-To: <E0D909EE5BB15A4699798539EA149D7F07793B3E@ORSMSX103.amr.corp.intel.com>

On Fri, 25 Aug 2017 15:36:22 +0000
"Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:

> On 8/25/17 11:25 AM, Jacob Keller wrote:
> > Under some circumstances, such as with many stacked devices, it is
> > possible that dev_hard_start_xmit will bundle many packets together, and
> > mark them all with xmit_more.
> > 
> > Most drivers respond to xmit_more by skipping tail bumps on packet
> > rings, or similar behavior as long as xmit_more is set. This is
> > a performance win since it means drivers can avoid notifying hardware of
> > new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
> > bandwidth.
> > 
> > This use of xmit_more comes with a trade off because bundling too many
> > packets can increase latency of the Tx packets. To avoid this, we should
> > limit the maximum number of packets with xmit_more.
> > 
> > Driver authors could modify their drivers to check for some determined
> > limit, but this requires all drivers to be modified in order to gain
> > advantage.
> > 
> > Instead, add a sysctl "xmit_more_max" which can be used to configure the
> > maximum number of xmit_more skbs to send in a sequence. This ensures
> > that all drivers benefit, and allows system administrators the option to
> > tune the value to their environment.
> > 
> > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> > ---
> > 
> > Stray thoughts and further questions....
> > 
> > Is this the right approach? Did I miss any other places where we should
> > limit? Does the limit make sense? Should it instead be a per-device
> > tuning nob instead of a global? Is 32 a good default?  
> 
> I actually like the idea of a per-device knob.  A xmit_more_max that's 
> global in a system with 1GbE devices along with a 25/50GbE or more just 
> doesn't make much sense to me.  Or having heterogeneous vendor devices 
> in the same system that have different HW behaviors could mask issues 
> with latency.
> 
> This seems like another incarnation of possible buffer-bloat if the max 
> is too high...
> 
> > 
> >   Documentation/sysctl/net.txt |  6 ++++++
> >   include/linux/netdevice.h    |  2 ++
> >   net/core/dev.c               | 10 +++++++++-
> >   net/core/sysctl_net_core.c   |  7 +++++++
> >   4 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
> > index b67044a2575f..3d995e8f4448 100644
> > --- a/Documentation/sysctl/net.txt
> > +++ b/Documentation/sysctl/net.txt
> > @@ -230,6 +230,12 @@ netdev_max_backlog
> >   Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
> >   receives packets faster than kernel can process them.
> >   
> > +xmit_more_max
> > +-------------
> > +
> > +Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
> > +indicates no limit.  
> 
> What defines "packet?"  MTU-sized packets, or payloads coming down from 
> the stack (e.g. TSO's)?

xmit_more is only a hint to the device. The device driver should ignore it unless
there are hardware advantages. The device driver is the place with HW specific
knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on this device).

Anything that pushes that optimization out to the user is only useful for benchmarks
and embedded devices.

^ permalink raw reply

* Re: [PATCH 3/7] ieee802154: 6lowpan: make header_ops const
From: Marcel Holtmann @ 2017-08-25 16:17 UTC (permalink / raw)
  To: Bhumika Goyal
  Cc: julia.lawall, Gustavo F. Padovan, Johan Hedberg, David S. Miller,
	Pablo Neira Ayuso, kadlec, Florian Westphal, Stephen Hemminger,
	Alexander Aring, Stefan Schmidt, Alexey Kuznetsov,
	Hideaki YOSHIFUJI, santosh.shilimkar, trond.myklebust,
	anna.schumaker, bfields, jlayton, open list:BLUETOOTH DRIVERS,
	Network Development, LKML, netfilter-devel
In-Reply-To: <1503670907-23221-4-git-send-email-bhumirks@gmail.com>

Hi Bhumika,

> Make this const as it is only stored as a reference in a const field of
> a net_device structure.
> 
> Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
> ---
> net/ieee802154/6lowpan/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

patch has been applied to bluetooth-next tree.

Regards

Marcel

^ permalink raw reply

* [PATCH net-next 0/3] gre: add collect_md mode for ERSPAN tunnel
From: William Tu @ 2017-08-25 16:21 UTC (permalink / raw)
  To: netdev

This patch series provide collect_md mode for ERSPAN tunnel.  The fist patch
refactors the existing gre_fb_xmit function by exacting the route cache
portion into a new function called prepare_fb_xmit.  The second patch
introduces the collect_md mode for ERSPAN tunnel, by calling the
prepare_fb_xmit function and adding ERSPAN specific logic.  The final patch
adds the test case using bpf_skb_{set,get}_tunnel_{key,opt}.

Thank you

William Tu (3):
  gre: refactor the gre_fb_xmit
  gre: add collect_md mode to ERSPAN tunnel
  samples/bpf: extend test_tunnel_bpf.sh with ERSPAN

 include/net/ip_tunnels.h       |   4 +-
 net/ipv4/ip_gre.c              | 157 ++++++++++++++++++++++++++++++++++++-----
 samples/bpf/tcbpf2_kern.c      |  63 ++++++++++++++++-
 samples/bpf/test_tunnel_bpf.sh |  29 ++++++++
 4 files changed, 232 insertions(+), 21 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next 1/3] gre: refactor the gre_fb_xmit
From: William Tu @ 2017-08-25 16:21 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1503678089-27131-1-git-send-email-u9012063@gmail.com>

The patch refactors the gre_fb_xmit function, by creating
prepare_fb_xmit function for later ERSPAN collect_md mode patch.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 net/ipv4/ip_gre.c | 55 ++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index f70674799fdd..453b7925b940 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -432,39 +432,33 @@ static struct rtable *gre_get_rt(struct sk_buff *skb,
 	return ip_route_output_key(net, fl);
 }
 
-static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
-			__be16 proto)
+static struct rtable *prepare_fb_xmit(struct sk_buff *skb,
+				      struct net_device *dev,
+				      struct flowi4 *fl,
+				      int tunnel_hlen)
 {
 	struct ip_tunnel_info *tun_info;
 	const struct ip_tunnel_key *key;
 	struct rtable *rt = NULL;
-	struct flowi4 fl;
 	int min_headroom;
-	int tunnel_hlen;
-	__be16 df, flags;
 	bool use_cache;
 	int err;
 
 	tun_info = skb_tunnel_info(skb);
-	if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
-		     ip_tunnel_info_af(tun_info) != AF_INET))
-		goto err_free_skb;
-
 	key = &tun_info->key;
 	use_cache = ip_tunnel_dst_cache_usable(skb, tun_info);
+
 	if (use_cache)
-		rt = dst_cache_get_ip4(&tun_info->dst_cache, &fl.saddr);
+		rt = dst_cache_get_ip4(&tun_info->dst_cache, &fl->saddr);
 	if (!rt) {
-		rt = gre_get_rt(skb, dev, &fl, key);
+		rt = gre_get_rt(skb, dev, fl, key);
 		if (IS_ERR(rt))
-				goto err_free_skb;
+			goto err_free_skb;
 		if (use_cache)
 			dst_cache_set_ip4(&tun_info->dst_cache, &rt->dst,
-					  fl.saddr);
+					  fl->saddr);
 	}
 
-	tunnel_hlen = gre_calc_hlen(key->tun_flags);
-
 	min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
 			+ tunnel_hlen + sizeof(struct iphdr);
 	if (skb_headroom(skb) < min_headroom || skb_header_cloned(skb)) {
@@ -476,6 +470,37 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (unlikely(err))
 			goto err_free_rt;
 	}
+	return rt;
+
+err_free_rt:
+	ip_rt_put(rt);
+err_free_skb:
+	kfree_skb(skb);
+	dev->stats.tx_dropped++;
+	return NULL;
+}
+
+static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
+			__be16 proto)
+{
+	struct ip_tunnel_info *tun_info;
+	const struct ip_tunnel_key *key;
+	struct rtable *rt = NULL;
+	struct flowi4 fl;
+	int tunnel_hlen;
+	__be16 df, flags;
+
+	tun_info = skb_tunnel_info(skb);
+	if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
+		     ip_tunnel_info_af(tun_info) != AF_INET))
+		goto err_free_skb;
+
+	key = &tun_info->key;
+	tunnel_hlen = gre_calc_hlen(key->tun_flags);
+
+	rt = prepare_fb_xmit(skb, dev, &fl, tunnel_hlen);
+	if (!rt)
+		return;
 
 	/* Push Tunnel header. */
 	if (gre_handle_offloads(skb, !!(tun_info->key.tun_flags & TUNNEL_CSUM)))
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 2/3] gre: add collect_md mode to ERSPAN tunnel
From: William Tu @ 2017-08-25 16:21 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1503678089-27131-1-git-send-email-u9012063@gmail.com>

Similar to gre, vxlan, geneve, ipip tunnels, allow ERSPAN tunnels to
operate in 'collect metadata' mode.  bpf_skb_[gs]et_tunnel_key() helpers
can make use of it right away.  OVS can use it as well in the future.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 include/net/ip_tunnels.h |   4 +-
 net/ipv4/ip_gre.c        | 102 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 625c29329372..992652856fe8 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -154,8 +154,10 @@ struct ip_tunnel {
 #define TUNNEL_GENEVE_OPT	__cpu_to_be16(0x0800)
 #define TUNNEL_VXLAN_OPT	__cpu_to_be16(0x1000)
 #define TUNNEL_NOCACHE		__cpu_to_be16(0x2000)
+#define TUNNEL_ERSPAN_OPT	__cpu_to_be16(0x4000)
 
-#define TUNNEL_OPTIONS_PRESENT	(TUNNEL_GENEVE_OPT | TUNNEL_VXLAN_OPT)
+#define TUNNEL_OPTIONS_PRESENT \
+		(TUNNEL_GENEVE_OPT | TUNNEL_VXLAN_OPT | TUNNEL_ERSPAN_OPT)
 
 struct tnl_ptk_info {
 	__be16 flags;
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 453b7925b940..0162fb955b33 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -113,6 +113,8 @@ MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
 
 static struct rtnl_link_ops ipgre_link_ops __read_mostly;
 static int ipgre_tunnel_init(struct net_device *dev);
+static void erspan_build_header(struct sk_buff *skb,
+				__be32 id, u32 index, bool truncate);
 
 static unsigned int ipgre_net_id __read_mostly;
 static unsigned int gre_tap_net_id __read_mostly;
@@ -287,7 +289,33 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 					   false, false) < 0)
 			goto drop;
 
-		tunnel->index = ntohl(index);
+		if (tunnel->collect_md) {
+			struct ip_tunnel_info *info;
+			struct erspan_metadata *md;
+			__be64 tun_id;
+			__be16 flags;
+
+			tpi->flags |= TUNNEL_KEY;
+			flags = tpi->flags;
+			tun_id = key32_to_tunnel_id(tpi->key);
+
+			tun_dst = ip_tun_rx_dst(skb, flags,
+						tun_id, sizeof(*md));
+			if (!tun_dst)
+				return PACKET_REJECT;
+
+			md = ip_tunnel_info_opts(&tun_dst->u.tun_info);
+			if (!md)
+				return PACKET_REJECT;
+
+			md->index = index;
+			info = &tun_dst->u.tun_info;
+			info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
+			info->options_len = sizeof(*md);
+		} else {
+			tunnel->index = ntohl(index);
+		}
+
 		skb_reset_mac_header(skb);
 		ip_tunnel_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
 		return PACKET_RCVD;
@@ -523,6 +551,64 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 	dev->stats.tx_dropped++;
 }
 
+static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
+			   __be16 proto)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+	struct ip_tunnel_info *tun_info;
+	const struct ip_tunnel_key *key;
+	struct erspan_metadata *md;
+	struct rtable *rt = NULL;
+	bool truncate = false;
+	struct flowi4 fl;
+	int tunnel_hlen;
+	__be16 df;
+
+	tun_info = skb_tunnel_info(skb);
+	if (unlikely(!tun_info || !(tun_info->mode & IP_TUNNEL_INFO_TX) ||
+		     ip_tunnel_info_af(tun_info) != AF_INET))
+		goto err_free_skb;
+
+	key = &tun_info->key;
+
+	/* ERSPAN has fixed 8 byte GRE header */
+	tunnel_hlen = 8 + sizeof(struct erspanhdr);
+
+	rt = prepare_fb_xmit(skb, dev, &fl, tunnel_hlen);
+	if (!rt)
+		return;
+
+	if (gre_handle_offloads(skb, false))
+		goto err_free_rt;
+
+	if (skb->len > dev->mtu) {
+		pskb_trim(skb, dev->mtu);
+		truncate = true;
+	}
+
+	md = ip_tunnel_info_opts(tun_info);
+	if (!md)
+		goto err_free_rt;
+
+	erspan_build_header(skb, tunnel_id_to_key32(key->tun_id),
+			    ntohl(md->index), truncate);
+
+	gre_build_header(skb, 8, TUNNEL_SEQ,
+			 htons(ETH_P_ERSPAN), 0, htonl(tunnel->o_seqno++));
+
+	df = key->tun_flags & TUNNEL_DONT_FRAGMENT ?  htons(IP_DF) : 0;
+
+	iptunnel_xmit(skb->sk, rt, skb, fl.saddr, key->u.ipv4.dst, IPPROTO_GRE,
+		      key->tos, key->ttl, df, false);
+	return;
+
+err_free_rt:
+	ip_rt_put(rt);
+err_free_skb:
+	kfree_skb(skb);
+	dev->stats.tx_dropped++;
+}
+
 static int gre_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
 {
 	struct ip_tunnel_info *info = skb_tunnel_info(skb);
@@ -636,6 +722,11 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
 	struct ip_tunnel *tunnel = netdev_priv(dev);
 	bool truncate = false;
 
+	if (tunnel->collect_md) {
+		erspan_fb_xmit(skb, dev, skb->protocol);
+		return NETDEV_TX_OK;
+	}
+
 	if (gre_handle_offloads(skb, false))
 		goto free_skb;
 
@@ -998,9 +1089,12 @@ static int erspan_validate(struct nlattr *tb[], struct nlattr *data[],
 		return ret;
 
 	/* ERSPAN should only have GRE sequence and key flag */
-	flags |= nla_get_be16(data[IFLA_GRE_OFLAGS]);
-	flags |= nla_get_be16(data[IFLA_GRE_IFLAGS]);
-	if (flags != (GRE_SEQ | GRE_KEY))
+	if (data[IFLA_GRE_OFLAGS])
+		flags |= nla_get_be16(data[IFLA_GRE_OFLAGS]);
+	if (data[IFLA_GRE_IFLAGS])
+		flags |= nla_get_be16(data[IFLA_GRE_IFLAGS]);
+	if (!data[IFLA_GRE_COLLECT_METADATA] &&
+	    flags != (GRE_SEQ | GRE_KEY))
 		return -EINVAL;
 
 	/* ERSPAN Session ID only has 10-bit. Since we reuse
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next 3/3] samples/bpf: extend test_tunnel_bpf.sh with ERSPAN
From: William Tu @ 2017-08-25 16:21 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1503678089-27131-1-git-send-email-u9012063@gmail.com>

Extend existing tests for vxlan, gre, geneve, ipip to
include ERSPAN tunnel.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/tcbpf2_kern.c      | 63 +++++++++++++++++++++++++++++++++++++++++-
 samples/bpf/test_tunnel_bpf.sh | 29 +++++++++++++++++++
 2 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 270edcc149a1..370b749f5ee6 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -17,6 +17,7 @@
 #include <uapi/linux/pkt_cls.h>
 #include <net/ipv6.h>
 #include "bpf_helpers.h"
+#include "bpf_endian.h"
 
 #define _htonl __builtin_bswap32
 #define ERROR(ret) do {\
@@ -38,6 +39,10 @@ struct vxlan_metadata {
 	u32     gbp;
 };
 
+struct erspan_metadata {
+	__be32 index;
+};
+
 SEC("gre_set_tunnel")
 int _gre_set_tunnel(struct __sk_buff *skb)
 {
@@ -76,6 +81,63 @@ int _gre_get_tunnel(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
+SEC("erspan_set_tunnel")
+int _erspan_set_tunnel(struct __sk_buff *skb)
+{
+	struct bpf_tunnel_key key;
+	struct erspan_metadata md;
+	int ret;
+
+	__builtin_memset(&key, 0x0, sizeof(key));
+	key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
+	key.tunnel_id = 2;
+	key.tunnel_tos = 0;
+	key.tunnel_ttl = 64;
+
+	ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), BPF_F_ZERO_CSUM_TX);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	md.index = htonl(123);
+	ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	return TC_ACT_OK;
+}
+
+SEC("erspan_get_tunnel")
+int _erspan_get_tunnel(struct __sk_buff *skb)
+{
+	char fmt[] = "key %d remote ip 0x%x erspan index 0x%x\n";
+	struct bpf_tunnel_key key;
+	struct erspan_metadata md;
+	u32 index;
+	int ret;
+
+	ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	ret = bpf_skb_get_tunnel_opt(skb, &md, sizeof(md));
+	if (ret < 0) {
+		ERROR(ret);
+		return TC_ACT_SHOT;
+	}
+
+	index = bpf_ntohl(md.index);
+	bpf_trace_printk(fmt, sizeof(fmt),
+			key.tunnel_id, key.remote_ipv4, index);
+
+	return TC_ACT_OK;
+}
+
 SEC("vxlan_set_tunnel")
 int _vxlan_set_tunnel(struct __sk_buff *skb)
 {
@@ -378,5 +440,4 @@ int _ip6ip6_get_tunnel(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
-
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index a70d2ea90313..410052d9fc37 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -32,6 +32,19 @@ function add_gre_tunnel {
 	ip addr add dev $DEV 10.1.1.200/24
 }
 
+function add_erspan_tunnel {
+	# in namespace
+	ip netns exec at_ns0 \
+		ip link add dev $DEV_NS type $TYPE seq key 2 local 172.16.1.100 remote 172.16.1.200 erspan 123
+	ip netns exec at_ns0 ip link set dev $DEV_NS up
+	ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
+
+	# out of namespace
+	ip link add dev $DEV type $TYPE external
+	ip link set dev $DEV up
+	ip addr add dev $DEV 10.1.1.200/24
+}
+
 function add_vxlan_tunnel {
 	# Set static ARP entry here because iptables set-mark works
 	# on L3 packet, as a result not applying to ARP packets,
@@ -99,6 +112,18 @@ function test_gre {
 	cleanup
 }
 
+function test_erspan {
+	TYPE=erspan
+	DEV_NS=erspan00
+	DEV=erspan11
+	config_device
+	add_erspan_tunnel
+	attach_bpf $DEV erspan_set_tunnel erspan_get_tunnel
+	ping -c 1 10.1.1.100
+	ip netns exec at_ns0 ping -c 1 10.1.1.200
+	cleanup
+}
+
 function test_vxlan {
 	TYPE=vxlan
 	DEV_NS=vxlan00
@@ -151,14 +176,18 @@ function cleanup {
 	ip link del gretap11
 	ip link del vxlan11
 	ip link del geneve11
+	ip link del erspan11
 	pkill tcpdump
 	pkill cat
 	set -ex
 }
 
+trap cleanup 0 2 3 6 9
 cleanup
 echo "Testing GRE tunnel..."
 test_gre
+echo "Testing ERSPAN tunnel..."
+test_erspan
 echo "Testing VXLAN tunnel..."
 test_vxlan
 echo "Testing GENEVE tunnel..."
-- 
2.7.4

^ permalink raw reply related

* RE: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Keller, Jacob E @ 2017-08-25 16:24 UTC (permalink / raw)
  To: Stephen Hemminger, Waskiewicz Jr, Peter; +Cc: netdev@vger.kernel.org
In-Reply-To: <20170825085816.3425a70c@xeon-e3>

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, August 25, 2017 8:58 AM
> To: Waskiewicz Jr, Peter <peter.waskiewicz.jr@intel.com>
> Cc: Keller, Jacob E <jacob.e.keller@intel.com>; netdev@vger.kernel.org
> Subject: Re: [RFC PATCH] net: limit maximum number of packets to mark with
> xmit_more
> 
> xmit_more is only a hint to the device. The device driver should ignore it unless
> there are hardware advantages. The device driver is the place with HW specific
> knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on this
> device).
> 
> Anything that pushes that optimization out to the user is only useful for
> benchmarks
> and embedded devices.

Right so most drivers I've seen simply take it as a "avoid bumping tail of a ring" whenever they see xmit_more. But unfortunately in some circumstances, this results in potentially several hundred packets being set with xmit_more in a row, and then the driver doesn't bump the tail for a long time, resulting in high latency spikes..

I was trying to find a way to fix this potentially in multiple drivers, rather than just a single driver, since I figured the same sort of code might need to be needed.

So you're suggesting we should just perform some check in the device driver, even if it might be duplication?

We could also instead make it a setting in the netdev struct or something which would be set by the driver and then tell stack code to limit how many it sends at once (so that we don't need to duplicate that checking code in every driver?)

Thanks,
Jake

^ permalink raw reply

* Re: [PATCH net-next v6 2/3] net: gso: Add GSO support for NSH
From: Jiri Benc @ 2017-08-25 16:25 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, e, blp, jan.scheurich
In-Reply-To: <1503670805-31051-3-git-send-email-yi.y.yang@intel.com>

On Fri, 25 Aug 2017 22:20:04 +0800, Yi Yang wrote:
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -766,7 +766,7 @@ struct sk_buff {
>  	__u8			ndisc_nodetype:2;
>  #endif
>  	__u8			ipvs_property:1;
> -	__u8			inner_protocol_type:1;
> +	__u8			inner_protocol_type:2;

Adding anything to sk_buff is pretty much forbidden. You can't add more
bytes to it and there are no more free bits to use, either.

Luckily, we still have one byte hole next to inner_ipproto that we can
use. What is needed is renaming of ENCAP_TYPE_IPPROTO to ENCAP_TYPE_L3
and storing the L3 type in the unused byte. It's not beautiful (would
be better to use ethertype than a custom enum) but it will work.

While looking at this, I realized that GSO for VXLAN-GPE is broken,
too. Let me fix it by implementing what I described above which will
make your patch much easier.

 Jiri

^ permalink raw reply

* Re: [PATCH] net: stmmac: Handle possible fixed-link with need_mdio_ids
From: Andrew Lunn @ 2017-08-25 16:28 UTC (permalink / raw)
  To: Corentin Labbe; +Cc: peppe.cavallaro, alexandre.torgue, netdev, linux-kernel
In-Reply-To: <20170825144208.24503-1-clabbe.montjoie@gmail.com>

On Fri, Aug 25, 2017 at 04:42:08PM +0200, Corentin Labbe wrote:
> In case of fixed link, there are no mdio node.
> This patch add a test for fixed-link for bypassing MDIO node register.

The two are not mutually exclusive. E.g.
vf610-zii-dev.dtsi/vf610-zii-dev-rev-b.dts.  It has a fixed-link on
the FEC ethernet controller, and an Ethernet switch on the MDIO bus.

If anybody ever wants to use a switch with the stmmac, this will be
required.

	Andrew

^ permalink raw reply

* Re: [PATCH net-next v2] net: mv643xx_eth: Be drop monitor friendly
From: Florian Fainelli @ 2017-08-25 16:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, andrew, Sebastian Hesselbarth, open list
In-Reply-To: <1503635699.2499.100.camel@edumazet-glaptop3.roam.corp.google.com>

On 08/24/2017 09:34 PM, Eric Dumazet wrote:
> On Thu, 2017-08-24 at 20:55 -0700, Florian Fainelli wrote:
>> txq_reclaim() does the normal transmit queue reclamation and
>> rxq_deinit() does the RX ring cleanup, none of these are packet drops,
>> so use dev_consume_skb() for both locations.
>>
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>>  drivers/net/ethernet/marvell/mv643xx_eth.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
>> index fb2d533ae4ef..81c1fac00d33 100644
>> --- a/drivers/net/ethernet/marvell/mv643xx_eth.c
>> +++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
>> @@ -1121,7 +1121,7 @@ static int txq_reclaim(struct tx_queue *txq, int budget, int force)
>>  			struct sk_buff *skb = __skb_dequeue(&txq->tx_skb);
>>  
>>  			if (!WARN_ON(!skb))
>> -				dev_kfree_skb(skb);
>> +				dev_consume_skb_any(skb);
>>  		}
>>  
>>  		if (cmd_sts & ERROR_SUMMARY) {
>> @@ -2024,7 +2024,7 @@ static void rxq_deinit(struct rx_queue *rxq)
>>  
>>  	for (i = 0; i < rxq->rx_ring_size; i++) {
>>  		if (rxq->rx_skb[i]) {
>> -			dev_kfree_skb(rxq->rx_skb[i]);
>> +			dev_consume_skb_any(rxq->rx_skb[i]);
>>  			rxq->rx_desc_count--;
>>  		}
>>  	}
> 
> 
> I do not believe this patch is needed.
> 
> dev_kfree_skb() is an alias of consume_skb(), which is already drop
> monitor ready ;)

You are right, this patch is/was not needed, I have been too trigger
happy with the dev_kfree_skb*() replacement, but only
dev_kfree_skb_{any,irq} are valid candidates for replacement by
dev_consume_skb_{any,irq}. I will re-audit my other two submissions to
e1000e and r8169. Thanks Eric.
-- 
Florian

^ permalink raw reply

* Re: [PATCH] net: stmmac: Handle possible fixed-link with need_mdio_ids
From: Florian Fainelli @ 2017-08-25 16:45 UTC (permalink / raw)
  To: Andrew Lunn, Corentin Labbe
  Cc: peppe.cavallaro, alexandre.torgue, netdev, linux-kernel
In-Reply-To: <20170825162827.GA4207@lunn.ch>

On 08/25/2017 09:28 AM, Andrew Lunn wrote:
> On Fri, Aug 25, 2017 at 04:42:08PM +0200, Corentin Labbe wrote:
>> In case of fixed link, there are no mdio node.
>> This patch add a test for fixed-link for bypassing MDIO node register.
> 
> The two are not mutually exclusive. E.g.
> vf610-zii-dev.dtsi/vf610-zii-dev-rev-b.dts.  It has a fixed-link on
> the FEC ethernet controller, and an Ethernet switch on the MDIO bus.
> 
> If anybody ever wants to use a switch with the stmmac, this will be
> required.

This is already done in the Lamobo R1 DTS file so it would be nice not
to break this use case:

&gmac {
        pinctrl-names = "default";
        pinctrl-0 = <&gmac_pins_rgmii_a>;
        phy-mode = "rgmii";
        phy-supply = <&reg_gmac_3v3>;
        status = "okay";

        fixed-link {
                speed = <1000>;
                full-duplex;
        };

        mdio {
                compatible = "snps,dwmac-mdio";
                #address-cells = <1>;
                #size-cells = <0>;

                switch: ethernet-switch@1e {
                        compatible = "brcm,bcm53125";
                        reg = <30>;
                        #address-cells = <1>;
                        #size-cells = <0>;


> 
> 	Andrew
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v2] e1000e: Be drop monitor friendly
From: Florian Fainelli @ 2017-08-25 16:47 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, Jeff Kirsher,
	moderated list:INTEL ETHERNET DRIVERS, open list
In-Reply-To: <20170825035824.26935-1-f.fainelli@gmail.com>

On 08/24/2017 08:58 PM, Florian Fainelli wrote:
> e1000_put_txbuf() cleans up the successfully transmitted TX packets,
> e1000e_tx_hwtstamp_work() also does the successfully completes the
> timestamped TX packets, e1000_clean_rx_ring() cleans up the RX ring and
> e1000_remove() cleans up the timestampted packets. None of these
> functions should be reporting dropped packets, so make them use
> dev_consume_skb_any() to be drop monitor friendly.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 327dfe5bedc0..a90e459c5b8a 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -1085,7 +1085,7 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
>  		buffer_info->dma = 0;
>  	}
>  	if (buffer_info->skb) {
> -		dev_kfree_skb_any(buffer_info->skb);
> +		dev_consume_skb_any(buffer_info->skb);
>  		buffer_info->skb = NULL;
>  	}
>  	buffer_info->time_stamp = 0;
> @@ -1199,7 +1199,7 @@ static void e1000e_tx_hwtstamp_work(struct work_struct *work)
>  		wmb(); /* force write prior to skb_tstamp_tx */
>  
>  		skb_tstamp_tx(skb, &shhwtstamps);
> -		dev_kfree_skb_any(skb);
> +		dev_consume_skb_any(skb);
>  	} else if (time_after(jiffies, adapter->tx_hwtstamp_start
>  			      + adapter->tx_timeout_factor * HZ)) {
>  		dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
> @@ -1716,7 +1716,7 @@ static void e1000_clean_rx_ring(struct e1000_ring *rx_ring)
>  		}
>  
>  		if (buffer_info->skb) {
> -			dev_kfree_skb(buffer_info->skb);
> +			dev_consume_skb_any(buffer_info->skb);

This one is not actually needed since dev_kfree_skb() is an alias for
consume_skb().

>  			buffer_info->skb = NULL;
>  		}
>  
> @@ -1734,7 +1734,7 @@ static void e1000_clean_rx_ring(struct e1000_ring *rx_ring)
>  
>  	/* there also may be some cached data from a chained receive */
>  	if (rx_ring->rx_skb_top) {
> -		dev_kfree_skb(rx_ring->rx_skb_top);
> +		dev_consume_skb_any(rx_ring->rx_skb_top);

Same here.

I will make a repost of this patch later, Jeff please drop it if you
have it queued already, thanks!
-- 
Florian

^ permalink raw reply

* Re: Possible race in nsc-ircc.ko
From: Jean Tourrilhes @ 2017-08-25 16:48 UTC (permalink / raw)
  To: Anton Volkov
  Cc: dagb, samuel, netdev, linux-kernel, ldv-project,
	Alexey Khoroshilov
In-Reply-To: <4a33a281-f8bd-305b-e580-0e594feea799@ispras.ru>

On Fri, Aug 25, 2017 at 05:05:25PM +0300, Anton Volkov wrote:
> Hello.
> 
> While searching for races in the Linux kernel I've come across
> "drivers/net/irda/nsc-ircc.ko" module. Here is a question that I came up
> with while analyzing results. Lines are given using the info from Linux
> v4.12.
> 
> Consider the following case:
> 
> Thread 1:                   Thread 2:
> nsc_ircc_init
> ->nsc_ircc_open
>     self = netdev_priv(dev)
>     register_netdev(dev)
>                             nsc_ircc_net_ioctl
>                             ->nsc_ircc_change_speed
>     self->dongle_id = ...       <READ self->io.dongle_id>
>     (nsc-ircc.c: line 485)      (nsc-ircc.c: line 1318)
>     platform_device_register_simple
> 
> Before the initialization of self->dongle_id in msc_ircc_open() its value is
> 0. Thus if read access to its value in nsc_ircc_change_speed occurs before
> the initialization there will be an attempt to change speed of dongle with
> undesired id (if the dongle with id 0 exists). Is this case feasible from
> your point of view?
> 
> Thank you for your time.
> 
> -- Anton Volkov

	A first glance, that seems like a valid race. I'm not sure if
there is a netdev lock/status to protect the driver, because it looks
like doing any operation on a device before "open" has completed would
be dangerous for most drivers.
	I don't have time to check the code paths, as I have not
looked at that code in ages.
	Good luck !

	Jean

^ permalink raw reply

* Re: [PATCH net-next v2 08/14] net: mvpp2: check the netif is running in the link_event function
From: Florian Fainelli @ 2017-08-25 16:49 UTC (permalink / raw)
  To: Antoine Tenart, davem, kishon, andrew, jason,
	sebastian.hesselbarth, gregory.clement
  Cc: thomas.petazzoni, nadavh, linux, linux-kernel, mw, stefanc,
	miquel.raynal, netdev
In-Reply-To: <20170825144821.31129-9-antoine.tenart@free-electrons.com>

On 08/25/2017 07:48 AM, Antoine Tenart wrote:
> This patch adds an extra check when the link_event function is called,
> so that it won't do anything when the netif isn't running.

Why is this needed? Are you possibly starting the PHY state machine
earlier than your ndo_open() call? Looking quickly through the driver
does not suggest this is going on since you properly connect to the PHY
in mvpp2_open() and start the PHY there.

> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index 1b26f5ed994f..49a6789a4142 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -5741,6 +5741,9 @@ static void mvpp2_link_event(struct net_device *dev)
>  	struct mvpp2_port *port = netdev_priv(dev);
>  	struct phy_device *phydev = dev->phydev;
>  
> +	if (!netif_running(dev))
> +		return;
> +
>  	if (phydev->link) {
>  		if ((port->speed != phydev->speed) ||
>  		    (port->duplex != phydev->duplex)) {
> 


-- 
Florian

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox