Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH v9 0/8] thunderbolt: Introducing Thunderbolt(TM) Networking
From: Levy, Amir (Jer) @ 2016-11-20  6:30 UTC (permalink / raw)
  To: gregkh@linuxfoundation.org
  Cc: Simon Guinot, andreas.noever@gmail.com, bhelgaas@google.com,
	corbet@lwn.net, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, netdev@vger.kernel.org,
	linux-doc@vger.kernel.org, mario_limonciello@dell.com,
	thunderbolt-linux, Westerberg, Mika, Winkler, Tomas,
	Zhang, Xiong Y, Jamet, Michael, remi.rerolle@seagate.com
In-Reply-To: <20161118100700.GA7126@kroah.com>

On Fri, Nov 18 2016, 12:07 PM, gregkh@linuxfoundation.org wrote:
> On Fri, Nov 18, 2016 at 08:48:36AM +0000, Levy, Amir (Jer) wrote:
> > > BTW, it is quite a shame that the Thunderbolt firmware version can't
> > > be read from Linux.
> > >
> >
> > This is WIP, once this patch will be upstream, we will be able to
> > focus more on aligning Linux with the Thunderbolt features that we have
> for windows.
> 
> Why is this patch somehow holding that work back?  You aren't just sitting
> around waiting for people to review this and not doing anything else, right?
> Is there some basic building block in these patches that your firmware
> download code is going to rely on?
> 
> confused,
> 
> greg k-h

All the Thunderbolt SW features (including networking and FW update) depend 
on the communication with FW, which is patch 3/8 in the series.
The patch also sets up a generic netlink for user space communication.
So yes, the communication with Thunderbolt FW is a basic building block.

^ permalink raw reply

* RE: [PATCH] net: fec: Detect and recover receive queue hangs
From: Andy Duan @ 2016-11-20  6:18 UTC (permalink / raw)
  To: Chris Lesiak
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Jaccon Bastiaansen
In-Reply-To: <b420fcd7-4171-f5c7-9d0c-cd8809e31bbc@licor.com>

From: Chris Lesiak <chris.lesiak@licor.com> Sent: Friday, November 18, 2016 10:37 PM
 >To: Andy Duan <fugang.duan@nxp.com>
 >Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jaccon
 >Bastiaansen <jaccon.bastiaansen@gmail.com>
 >Subject: Re: [PATCH] net: fec: Detect and recover receive queue hangs
 >
 >On 11/18/2016 12:44 AM, Andy Duan wrote:
 >> From: Chris Lesiak <chris.lesiak@licor.com> Sent: Friday, November 18,
 >> 2016 5:15 AM
 >>  >To: Andy Duan <fugang.duan@nxp.com>
 >>  >Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Jaccon
 >> >Bastiaansen <jaccon.bastiaansen@gmail.com>; chris.lesiak@licor.com
 >>  >Subject: [PATCH] net: fec: Detect and recover receive queue hangs  >
 >> >This corrects a problem that appears to be similar to ERR006358.  But
 >> while
 >>  >ERR006358 is a race when the tx queue transitions from empty to not
 >> empty,  >this problem is a race when the rx queue transitions from full to
 >not full.
 >>  >
 >>  >The symptom is a receive queue that is stuck.  The ENET_RDAR
 >> register will  >read 0, indicating that there are no empty receive
 >> descriptors in the receive  >ring.  Since no additional frames can be queued,
 >no RXF interrupts occur.
 >>  >
 >>  >This problem can be triggered with a 1 Gb link and about 400 Mbps of
 >traffic.
 >
 >I can cause the error by running the following on an imx6q: iperf -s -u And
 >sending packets from the other end of a 1 Gbps link:
 >iperf -c $IPADDR -u -b40000pps
 >
 >A few others have seen this problem.
 >See: https://community.nxp.com/thread/322882
 >
 >>  >
 >>  >This patch detects this condition, sets the work_rx bit, and
 >> reschedules the  >poll method.
 >>  >
 >>  >Signed-off-by: Chris Lesiak <chris.lesiak@licor.com>
 >>  >---
 >>  > drivers/net/ethernet/freescale/fec_main.c | 31
 >> >+++++++++++++++++++++++++++++++  > 1 file changed, 31 insertions(+)
 >> > Firstly, how to reproduce the issue, pls list the reproduce steps.
 >> Thanks.
 >> Secondly, pls check below comments.
 >>
 >>  >diff --git a/drivers/net/ethernet/freescale/fec_main.c
 >>  >b/drivers/net/ethernet/freescale/fec_main.c
 >>  >index fea0f33..8a87037 100644
 >>  >--- a/drivers/net/ethernet/freescale/fec_main.c
 >>  >+++ b/drivers/net/ethernet/freescale/fec_main.c
 >>  >@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
 >>  > 	return ret;
 >>  > }
 >>  >
 >>  >+static inline bool
 >>  >+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id) {
 >>  >+	int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
 >>  >+
 >>  >+	if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
 >> If rx ring is really empty in slight throughput cases,  rdar is always cleared,
 >then there always do napi reschedule.
 >
 >I think that you are concerned that if rdar is zero due to this hardware
 >problem, but the rx ring is actually empty, then fec_enet_rx_queue will
 >never do a write to rdar so that it can be non-zero.  That will cause napi to
 >always be resceduled.
 >
 >I suppose that might be the case with zero rx traffic, and I was concerned
 >that it might be true even when there was rx traffic.  I suspected that the
 >hardware, seeing that rdar is zero, would never queue another packet, even
 >if there were in fact empty descriptors.  But it doesn't seem to be the case.  It
 >does reschedule multiple times, but eventually sees some packets in the rx
 >ring and recovers.
 >
 >I admit that I do not completely understand how that can happen.  I did
 >confirm that fec_enet_active_rxring is not being called.
 >
 >Maybe someone with a deeper understanding of the fec than I can provide
 >an explanation.
 >
The patch needs to hold on for some time (days), I will reserve time to investigate the issue.
Thanks.

 >>
 >>  >+		return false;
 >>  >+
 >>  >+	dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
 >>  >+
 >>  >+	fep->work_rx |= 1 << work_bit;
 >>  >+
 >>  >+	return true;
 >>  >+}
 >>  >+
 >>  >+static inline bool fec_enet_recover_rxqs(struct fec_enet_private
 >> *fep)  >+{
 >>  >+	unsigned int q;
 >>  >+	bool ret = false;
 >>  >+
 >>  >+	for (q = 0; q < fep->num_rx_queues; q++) {
 >>  >+		if (fec_enet_recover_rxq(fep, q))
 >>  >+			ret = true;
 >>  >+	}
 >>  >+
 >>  >+	return ret;
 >>  >+}
 >>  >+
 >>  > static int fec_enet_rx_napi(struct napi_struct *napi, int budget)  {
 >>  > 	struct net_device *ndev = napi->dev;
 >>  >@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct
 >> *napi,  >int budget)
 >>  > 	if (pkts < budget) {
 >>  > 		napi_complete(napi);
 >>  > 		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
 >>  >+
 >>  >+		if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
 >>  >+			writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
 >>  > 	}
 >>  > 	return pkts;
 >>  > }
 >>  >--
 >>  >2.5.5
 >>
 >
 >
 >--
 >Chris Lesiak
 >Principal Design Engineer, Software
 >LI-COR Biosciences
 >chris.lesiak@licor.com
 >
 >Any opinions expressed are those of the author and do not necessarily
 >represent those of his employer.
 >

^ permalink raw reply

* Re: [PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: Alexei Starovoitov @ 2016-11-20  6:07 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Daniel Mack, htejun-b10kYP2dOMg, daniel-FeC+5ew28dpmcu3hnIyYJQ,
	ast-b10kYP2dOMg, davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, harald-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA, sargun-GaZTRHToo+CzQB+pC5nmwQ,
	cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161118174405.GA9678@salvia>

On Fri, Nov 18, 2016 at 06:44:05PM +0100, Pablo Neira Ayuso wrote:
> On Fri, Nov 18, 2016 at 09:17:18AM -0800, Alexei Starovoitov wrote:
> > On Fri, Nov 18, 2016 at 01:37:32PM +0100, Pablo Neira Ayuso wrote:
> > > On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
> > > [...]
> > > > @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > > >  	skb->dev = dev;
> > > >  	skb->protocol = htons(ETH_P_IP);
> > > >  
> > > > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > > +	if (ret) {
> > > > +		kfree_skb(skb);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > >  	/*
> > > >  	 *	Multicasts are looped back for other local users
> > > >  	 */
> > > > @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > > >  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > > >  {
> > > >  	struct net_device *dev = skb_dst(skb)->dev;
> > > > +	int ret;
> > > >  
> > > >  	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
> > > >  
> > > >  	skb->dev = dev;
> > > >  	skb->protocol = htons(ETH_P_IP);
> > > >  
> > > > +	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > > +	if (ret) {
> > > > +		kfree_skb(skb);
> > > > +		return ret;
> > > > +	}
> > > > +
> > > >  	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
> > > >  			    net, sk, skb, NULL, dev,
> > > >  			    ip_finish_output,
> > > 
> > > Please, place this after the netfilter hook.
> > > 
> > > Since this new hook may mangle output packets, any mangling
> > > potentially interfers and breaks conntrack.
> > 
> > actually this hook cannot mangle the packets, so no conntrack
> > concerns.  Also this was brought up by Lorenzo earlier and consensus
> > was that it's cleaner to leave it in this order.
> 
> Not yet probably, but this could be used to implement snat at some
> point, you have potentially the infrastructure to do so in place
> already.
> 
> > My reply:
> > http://www.spinics.net/lists/cgroups/msg16675.html
> > and Daniel's:
> > http://www.spinics.net/lists/cgroups/msg16677.html
> > and the rest of that thread.
> 
> Please place this afterwards since I don't want to update Netfilter
> documentation to indicate that there is a new spot to debug before
> POSTROUTING that may drop packets. People are used to debugging things
> in a certain way, if packets are dropped after POSTROUTING, then
> netfilter tracing will indicate the packet has successfully left our
> framework and people will notice that packets are dropped somewhere
> else, so they have a clue probably is this new layer.
> 
> Actually I remember you mentioned in a previous email that this hook
> can be placed anywhere, and that they don't really need a fixed
> location, if so, then it should not be much of a problem to change
> this.

correct. As I said in the link above I don't mind the order.
We discussed it in private and (my personal interpretation) that
the opinions are 60% to keep it as-is and 40% to go with your
proposal.
It is certainly not a big deal to go one way or the other.
I think only you have a strong opinion about the order whereas
myself and Daneil B and Daniel M are exhuasted enough, so we will
take it either way.
So let me recoup the pro and con and let you decide, since it seems
doing it after nf hook on tx will actually make them much harder to
use together and I think you wouldn't want it in the first place.
Since RX hook is so late in the receive path there is no
practical use case to mangle the packet at this stage.
This set of patches doesn't allow to change the packet either.
After this rx hook the socket can only receive the packet and
messing up l3/l4 headers won't have any effect. Furthermore since
the hooks have to be symmetrical it doesn't make sense to mangle
the packet on TX either. In tcp case the socket is an established
connection, so allowing mangling won't do any good to the remote side.
Hence this cgroup+bpf+inet_rx/tx thingy is only useful for monitoring
of what applications are doing and high level application firewalling.
Now imagine some policy framework that prevents a websever
in a cgroup talk to a database using this facility. If nf hook
runs first on tx, this policy framework won't be usable, since l3/l4
can be mangled, so application enforcement is not possible unless
iptables are disabled. I don't want such "either iptables or cgroup+bpf"
scenario and I don't think you want that either.
At the same time if iptables do traditional nat and firewall
_after_ this tx hook then this policy framework can coexist
with iptables based security. So imo this order of hooks
is a better way forward.
But if you strongly disagree, I'm ok to change it,
though I think long term it will be a loosing proposition
for both frameworks. So your call.

Thanks

^ permalink raw reply

* Re: [PATCH] VSOCK: add loopback to virtio_transport
From: David Miller @ 2016-11-20  3:16 UTC (permalink / raw)
  To: john.fastabend; +Cc: stefanha, netdev, cavery, imbrenda, jhansen
In-Reply-To: <58311507.7070707@gmail.com>

From: John Fastabend <john.fastabend@gmail.com>
Date: Sat, 19 Nov 2016 19:14:15 -0800

> On 16-11-19 07:08 PM, David Miller wrote:
>> From: Stefan Hajnoczi <stefanha@redhat.com>
>> Date: Thu, 17 Nov 2016 15:49:23 +0000
>> 
>>> @@ -159,6 +199,10 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
>>>  		return -ENODEV;
>>>  	}
>>>  
>>> +	if (le32_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
>>> +		return virtio_transport_send_pkt_loopback(vsock, pkt);
>>> +	}
>>> +
>> 
>> A single-statement basic block should get surrounding curly
>> braces.
>> 
> 
> Should _not_ get curly braces in case it wasn't obvious ;)

Indeed, thanks for the correction :)

^ permalink raw reply

* Re: [PATCH net 1/1] tipc: eliminate obsolete socket locking policy description
From: David Miller @ 2016-11-20  3:15 UTC (permalink / raw)
  To: jon.maloy
  Cc: netdev, paul.gortmaker, parthasarathy.bhuvaragan, ying.xue, maloy,
	tipc-discussion
In-Reply-To: <1479584827-21772-1-git-send-email-jon.maloy@ericsson.com>

From: Jon Maloy <jon.maloy@ericsson.com>
Date: Sat, 19 Nov 2016 14:47:07 -0500

> The comment block in socket.c describing the locking policy is
> obsolete, and does not reflect current reality. We remove it in this
> commit.
> 
> Since the current locking policy is much simpler and follows a
> mainstream approach, we see no need to add a new description.
> 
> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Applied, thanks Jon.

^ permalink raw reply

* Re: [net-next] rtnl: fix the loop index update error in rtnl_dump_ifinfo()
From: David Miller @ 2016-11-20  3:15 UTC (permalink / raw)
  To: zhangshengju; +Cc: netdev, dsa
In-Reply-To: <1479569312-15224-1-git-send-email-zhangshengju@cmss.chinamobile.com>

From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Date: Sat, 19 Nov 2016 23:28:32 +0800

> If the link is filtered out, loop index should also be updated. If not,
> loop index will not be correct.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>

Applied to 'net' and queued up for -stable.

Bug fixes should always be targetted at 'net' not 'net-next'.

^ permalink raw reply

* Re: [PATCH] VSOCK: add loopback to virtio_transport
From: John Fastabend @ 2016-11-20  3:14 UTC (permalink / raw)
  To: David Miller, stefanha; +Cc: netdev, cavery, imbrenda, jhansen
In-Reply-To: <20161119.220804.755313313976863187.davem@davemloft.net>

On 16-11-19 07:08 PM, David Miller wrote:
> From: Stefan Hajnoczi <stefanha@redhat.com>
> Date: Thu, 17 Nov 2016 15:49:23 +0000
> 
>> @@ -159,6 +199,10 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
>>  		return -ENODEV;
>>  	}
>>  
>> +	if (le32_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
>> +		return virtio_transport_send_pkt_loopback(vsock, pkt);
>> +	}
>> +
> 
> A single-statement basic block should get surrounding curly
> braces.
> 

Should _not_ get curly braces in case it wasn't obvious ;)

^ permalink raw reply

* Re: [PATCH] net: fix bogus cast in skb_pagelen() and use unsigned variables
From: David Miller @ 2016-11-20  3:12 UTC (permalink / raw)
  To: adobriyan; +Cc: netdev
In-Reply-To: <20161119010808.GF1200@avx2>

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Sat, 19 Nov 2016 04:08:08 +0300

> 1) cast to "int" is unnecessary:
>    u8 will be promoted to int before decrementing,
>    small positive numbers fit into "int", so their values won't be changed
>    during promotion.
> 
>    Once everything is int including loop counters, signedness doesn't
>    matter: 32-bit operations will stay 32-bit operations.
> 
>    But! Someone tried to make this loop smart by making everything of
>    the same type apparently in an attempt to optimise it.
>    Do the optimization, just differently.
>    Do the cast where it matters. :^)
> 
> 2) frag size is unsigned entity and sum of fragments sizes is also
>    unsigned.
> 
> Make everything unsigned, leave no MOVSX instruction behind.
 ...
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH] netlink: smaller nla_attr_minlen table
From: David Miller @ 2016-11-20  3:12 UTC (permalink / raw)
  To: adobriyan; +Cc: netdev
In-Reply-To: <20161119005907.GE1200@avx2>

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Sat, 19 Nov 2016 03:59:07 +0300

> Length of a netlink attribute may be u16 but lengths of basic attributes
> are much smaller, so small we can save 16 bytes of .rodata and pocket
> change inside .text.
> 
> 16-bit is worse on x86-64 than 8-bit because of operand size override prefix.
 ...
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH] netlink: use "unsigned int" in nla_next()
From: David Miller @ 2016-11-20  3:11 UTC (permalink / raw)
  To: adobriyan; +Cc: netdev
In-Reply-To: <20161119005435.GD1200@avx2>

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Sat, 19 Nov 2016 03:54:35 +0300

> ->nla_len is unsigned entity (it's length after all) and u16,
> thus it can't overflow when being aligned into int/unsigned int.
> 
> (nlmsg_next has the same code, but I didn't yet convince myself
> it is correct to do so).
> 
> There is pointer arithmetic in this function and offset being
> unsigned is better:
 ...
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH] net: make struct napi_alloc_cache::skb_count unsigned int
From: David Miller @ 2016-11-20  3:11 UTC (permalink / raw)
  To: adobriyan; +Cc: netdev
In-Reply-To: <20161119004756.GC1200@avx2>

From: Alexey Dobriyan <adobriyan@gmail.com>
Date: Sat, 19 Nov 2016 03:47:56 +0300

> size_t is way too much for an integer not exceeding 64.
> 
> Space savings: 10 bytes!
> 
> 	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-10 (-10)
> 	function                                     old     new   delta
> 	napi_consume_skb                             165     163      -2
> 	__kfree_skb_flush                             56      53      -3
> 	__kfree_skb_defer                             97      92      -5
> 	Total: Before=154865639, After=154865629, chg -0.00%
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Applied to net-next.

^ permalink raw reply

* Re: [PATCH net] l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
From: David Miller @ 2016-11-20  3:10 UTC (permalink / raw)
  To: g.nault; +Cc: netdev, jchapman, celston, sploving1, andreyknvl
In-Reply-To: <f04ad12d6f2a433d60baf5720b2f7a6d685d37c6.1479503186.git.g.nault@alphalink.fr>

From: Guillaume Nault <g.nault@alphalink.fr>
Date: Fri, 18 Nov 2016 22:13:00 +0100

> Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
> Without lock, a concurrent call could modify the socket flags between
> the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
> a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
> would then leave a stale pointer there, generating use-after-free
> errors when walking through the list or modifying adjacent entries.
 ...
> The same issue exists with l2tp_ip_bind() and l2tp_ip_bind_table.
> 
> Fixes: c51ce49735c1 ("l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case")
> Reported-by: Baozeng Ding <sploving1@gmail.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Tested-by: Baozeng Ding <sploving1@gmail.com>
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH] VSOCK: add loopback to virtio_transport
From: David Miller @ 2016-11-20  3:08 UTC (permalink / raw)
  To: stefanha; +Cc: netdev, cavery, imbrenda, jhansen
In-Reply-To: <1479397763-22319-1-git-send-email-stefanha@redhat.com>

From: Stefan Hajnoczi <stefanha@redhat.com>
Date: Thu, 17 Nov 2016 15:49:23 +0000

> @@ -159,6 +199,10 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
>  		return -ENODEV;
>  	}
>  
> +	if (le32_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
> +		return virtio_transport_send_pkt_loopback(vsock, pkt);
> +	}
> +

A single-statement basic block should get surrounding curly
braces.

^ permalink raw reply

* [net-next PATCH v2 5/5] virtio_net: add XDP_TX support
From: John Fastabend @ 2016-11-20  2:51 UTC (permalink / raw)
  To: daniel, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.fastabend, john.r.fastabend, brouer, tgraf
In-Reply-To: <20161120024710.19187.31037.stgit@john-Precision-Tower-5810>

This adds support for the XDP_TX action to virtio_net. When an XDP
program is run and returns the XDP_TX action the virtio_net XDP
implementation will transmit the packet on a TX queue that aligns
with the current CPU that the XDP packet was processed on.

Before sending the packet the header is zeroed.  Also XDP is expected
to handle checksum correctly so no checksum offload  support is
provided.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   57 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 80a426c..c127494 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -330,12 +330,40 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
+static void virtnet_xdp_xmit(struct virtnet_info *vi,
+			     unsigned int qnum, struct xdp_buff *xdp)
+{
+	struct send_queue *sq = &vi->sq[qnum];
+	struct virtio_net_hdr_mrg_rxbuf *hdr;
+	unsigned int num_sg, len;
+	void *xdp_sent;
+
+	/* Free up any pending old buffers before queueing new ones. */
+	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		struct page *page = virt_to_head_page(xdp_sent);
+
+		put_page(page);
+	}
+
+	/* Zero header and leave csum up to XDP layers */
+	hdr = xdp->data;
+	memset(hdr, 0, vi->hdr_len);
+	hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+	hdr->hdr.flags = VIRTIO_NET_HDR_F_DATA_VALID;
+
+	num_sg = 1;
+	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+	virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, xdp->data, GFP_ATOMIC);
+	virtqueue_kick(sq->vq);
+}
+
 static u32 do_xdp_prog(struct virtnet_info *vi,
 		       struct bpf_prog *xdp_prog,
 		       struct page *page, int offset, int len)
 {
 	int hdr_padded_len;
 	struct xdp_buff xdp;
+	unsigned int qp;
 	u32 act;
 	u8 *buf;
 
@@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
 	switch (act) {
 	case XDP_PASS:
 		return XDP_PASS;
+	case XDP_TX:
+		qp = vi->curr_queue_pairs -
+			vi->xdp_queue_pairs +
+			smp_processor_id();
+		xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
+		virtnet_xdp_xmit(vi, qp, &xdp);
+		return XDP_TX;
 	default:
 		bpf_warn_invalid_xdp_action(act);
-	case XDP_TX:
 	case XDP_ABORTED:
 	case XDP_DROP:
 		return XDP_DROP;
@@ -386,8 +420,15 @@ static struct sk_buff *receive_big(struct net_device *dev,
 	if (xdp_prog) {
 		u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
 
-		if (act == XDP_DROP)
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			goto xdp_xmit;
+		case XDP_DROP:
+		default:
 			goto err;
+		}
 	}
 
 	skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
@@ -399,6 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
 err:
 	dev->stats.rx_dropped++;
 	give_pages(rq, page);
+xdp_xmit:
 	return NULL;
 }
 
@@ -417,6 +459,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	struct sk_buff *head_skb, *curr_skb;
 	struct bpf_prog *xdp_prog;
 
+	head_skb = NULL;
 	xdp_prog = rcu_dereference_bh(rq->xdp_prog);
 	if (xdp_prog) {
 		u32 act;
@@ -427,8 +470,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 
 		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
-		if (act == XDP_DROP)
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			goto xdp_xmit;
+		case XDP_DROP:
+		default:
 			goto err_skb;
+		}
 	}
 
 	head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
@@ -502,6 +552,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 err_buf:
 	dev->stats.rx_dropped++;
 	dev_kfree_skb(head_skb);
+xdp_xmit:
 	return NULL;
 }
 

^ permalink raw reply related

* [net-next PATCH v2 4/5] virtio_net: add dedicated XDP transmit queues
From: John Fastabend @ 2016-11-20  2:51 UTC (permalink / raw)
  To: daniel, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.fastabend, john.r.fastabend, brouer, tgraf
In-Reply-To: <20161120024710.19187.31037.stgit@john-Precision-Tower-5810>

XDP requires using isolated transmit queues to avoid interference
with normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
adds a XDP queue per cpu when a XDP program is loaded and does not
expose the queues to the OS via the normal API call to
netif_set_real_num_tx_queues(). This way the stack will never push
an skb to these queues.

However virtio/vhost/qemu implementation only allows for creating
TX/RX queue pairs at this time so creating only TX queues was not
possible. And because the associated RX queues are being created I
went ahead and exposed these to the stack and let the backend use
them. This creates more RX queues visible to the network stack than
TX queues which is worth mentioning but does not cause any issues as
far as I can tell.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8f99a53..80a426c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -114,6 +114,9 @@ struct virtnet_info {
 	/* # of queue pairs currently used by the driver */
 	u16 curr_queue_pairs;
 
+	/* # of XDP queue pairs currently used by the driver */
+	u16 xdp_queue_pairs;
+
 	/* I like... big packets and I cannot lie! */
 	bool big_packets;
 
@@ -1525,7 +1528,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct bpf_prog *old_prog;
-	int i;
+	u16 xdp_qp = 0, curr_qp;
+	int err, i;
 
 	if ((dev->features & NETIF_F_LRO) && prog) {
 		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
@@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		return -EINVAL;
 	}
 
+	curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
+	if (prog)
+		xdp_qp = nr_cpu_ids;
+
+	/* XDP requires extra queues for XDP_TX */
+	if (curr_qp + xdp_qp > vi->max_queue_pairs) {
+		netdev_warn(dev, "request %i queues but max is %i\n",
+			    curr_qp + xdp_qp, vi->max_queue_pairs);
+		return -ENOMEM;
+	}
+
+	err = virtnet_set_queues(vi, curr_qp + xdp_qp);
+	if (err) {
+		dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
+		return err;
+	}
+
 	if (prog) {
-		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
-		if (IS_ERR(prog))
+		prog = bpf_prog_add(prog, vi->max_queue_pairs);
+		if (IS_ERR(prog)) {
+			virtnet_set_queues(vi, curr_qp);
 			return PTR_ERR(prog);
+		}
 	}
 
+	vi->xdp_queue_pairs = xdp_qp;
+	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
+
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
 		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);

^ permalink raw reply related

* [net-next PATCH v2 3/5] virtio_net: Add XDP support
From: John Fastabend @ 2016-11-20  2:50 UTC (permalink / raw)
  To: daniel, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.fastabend, john.r.fastabend, brouer, tgraf
In-Reply-To: <20161120024710.19187.31037.stgit@john-Precision-Tower-5810>

From: Shrijeet Mukherjee <shrijeet@gmail.com>

This adds XDP support to virtio_net. Some requirements must be
met for XDP to be enabled depending on the mode. First it will
only be supported with LRO disabled so that data is not pushed
across multiple buffers. The MTU must be less than a page size
to avoid having to handle XDP across multiple pages.

If mergeable receive is enabled this first series only supports
the case where header and data are in the same buf which we can
check when a packet is received by looking at num_buf. If the
num_buf is greater than 1 and a XDP program is loaded the packet
is dropped and a warning is thrown. When any_header_sg is set this
does not happen and both header and data is put in a single buffer
as expected so we check this when XDP programs are loaded. Note I
have only tested this with Linux vhost backend.

If big packets mode is enabled and MTU/LRO conditions above are
met then XDP is allowed.

A follow on patch can be generated to solve the mergeable receive
case with num_bufs equal to 2. Buffers greater than two may not
be handled has easily.

Suggested-by: Shrijeet Mukherjee <shrijeet@gmail.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |  146 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 142 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8189e5b..8f99a53 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_net.h>
+#include <linux/bpf.h>
 #include <linux/scatterlist.h>
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
@@ -81,6 +82,8 @@ struct receive_queue {
 
 	struct napi_struct napi;
 
+	struct bpf_prog __rcu *xdp_prog;
+
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
@@ -324,6 +327,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
 	return skb;
 }
 
+static u32 do_xdp_prog(struct virtnet_info *vi,
+		       struct bpf_prog *xdp_prog,
+		       struct page *page, int offset, int len)
+{
+	int hdr_padded_len;
+	struct xdp_buff xdp;
+	u32 act;
+	u8 *buf;
+
+	buf = page_address(page) + offset;
+
+	if (vi->mergeable_rx_bufs)
+		hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_padded_len = sizeof(struct padded_vnet_hdr);
+
+	xdp.data = buf + hdr_padded_len;
+	xdp.data_end = xdp.data + (len - vi->hdr_len);
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	switch (act) {
+	case XDP_PASS:
+		return XDP_PASS;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_TX:
+	case XDP_ABORTED:
+	case XDP_DROP:
+		return XDP_DROP;
+	}
+}
+
 static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, unsigned int len)
 {
 	struct sk_buff * skb = buf;
@@ -340,9 +375,19 @@ static struct sk_buff *receive_big(struct net_device *dev,
 				   void *buf,
 				   unsigned int len)
 {
+	struct bpf_prog *xdp_prog;
 	struct page *page = buf;
-	struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
+	struct sk_buff *skb;
 
+	xdp_prog = rcu_dereference_bh(rq->xdp_prog);
+	if (xdp_prog) {
+		u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
+
+		if (act == XDP_DROP)
+			goto err;
+	}
+
+	skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
 	if (unlikely(!skb))
 		goto err;
 
@@ -366,10 +411,25 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
 	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+	struct sk_buff *head_skb, *curr_skb;
+	struct bpf_prog *xdp_prog;
 
-	struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
-					       truesize);
-	struct sk_buff *curr_skb = head_skb;
+	xdp_prog = rcu_dereference_bh(rq->xdp_prog);
+	if (xdp_prog) {
+		u32 act;
+
+		if (num_buf > 1) {
+			bpf_warn_invalid_xdp_buffer();
+			goto err_skb;
+		}
+
+		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
+		if (act == XDP_DROP)
+			goto err_skb;
+	}
+
+	head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
+	curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
@@ -1328,6 +1388,13 @@ static int virtnet_set_channels(struct net_device *dev,
 	if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
 		return -EINVAL;
 
+	/* For now we don't support modifying channels while XDP is loaded
+	 * also when XDP is loaded all RX queues have XDP programs so we only
+	 * need to check a single RX queue.
+	 */
+	if (vi->rq[0].xdp_prog)
+		return -EINVAL;
+
 	get_online_cpus();
 	err = virtnet_set_queues(vi, queue_pairs);
 	if (!err) {
@@ -1454,6 +1521,68 @@ static int virtnet_set_features(struct net_device *netdev,
 	return 0;
 }
 
+static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+	int i;
+
+	if ((dev->features & NETIF_F_LRO) && prog) {
+		netdev_warn(dev, "can't set XDP while LRO is on, disable LRO first\n");
+		return -EINVAL;
+	}
+
+	if (vi->mergeable_rx_bufs && !vi->any_header_sg) {
+		netdev_warn(dev, "XDP expects header/data in single page\n");
+		return -EINVAL;
+	}
+
+	if (dev->mtu > PAGE_SIZE) {
+		netdev_warn(dev, "XDP requires MTU less than %lu\n", PAGE_SIZE);
+		return -EINVAL;
+	}
+
+	if (prog) {
+		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
+		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+	}
+
+	return 0;
+}
+
+static bool virtnet_xdp_query(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		if (vi->rq[i].xdp_prog)
+			return true;
+	}
+	return false;
+}
+
+static int virtnet_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return virtnet_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = virtnet_xdp_query(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -1471,6 +1600,7 @@ static int virtnet_set_features(struct net_device *netdev,
 	.ndo_busy_poll		= virtnet_busy_poll,
 #endif
 	.ndo_set_features	= virtnet_set_features,
+	.ndo_xdp		= virtnet_xdp,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -1527,12 +1657,20 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 
 static void free_receive_bufs(struct virtnet_info *vi)
 {
+	struct bpf_prog *old_prog;
 	int i;
 
+	rtnl_lock();
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		while (vi->rq[i].pages)
 			__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
+
+		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
+		RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
+		if (old_prog)
+			bpf_prog_put(old_prog);
 	}
+	rtnl_unlock();
 }
 
 static void free_receive_page_frags(struct virtnet_info *vi)

^ permalink raw reply related

* [net-next PATCH v2 2/5] net: xdp: add invalid buffer warning
From: John Fastabend @ 2016-11-20  2:50 UTC (permalink / raw)
  To: daniel, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.fastabend, john.r.fastabend, brouer, tgraf
In-Reply-To: <20161120024710.19187.31037.stgit@john-Precision-Tower-5810>

This adds a warning for drivers to use when encountering an invalid
buffer for XDP. For normal cases this should not happen but to catch
this in virtual/qemu setups that I may not have expected from the
emulation layer having a standard warning is useful.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/filter.h |    1 +
 net/core/filter.c      |    6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..0c79004 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -595,6 +595,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 void bpf_warn_invalid_xdp_action(u32 act);
+void bpf_warn_invalid_xdp_buffer(void);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/net/core/filter.c b/net/core/filter.c
index dece94f..b777730 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2774,6 +2774,12 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+void bpf_warn_invalid_xdp_buffer(void)
+{
+	WARN_ONCE(1, "Illegal XDP buffer encountered, expect packet loss\n");
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer);
+
 static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 					int src_reg, int ctx_off,
 					struct bpf_insn *insn_buf,

^ permalink raw reply related

* [net-next PATCH v2 1/5] net: virtio dynamically disable/enable LRO
From: John Fastabend @ 2016-11-20  2:49 UTC (permalink / raw)
  To: daniel, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.fastabend, john.r.fastabend, brouer, tgraf
In-Reply-To: <20161120024710.19187.31037.stgit@john-Precision-Tower-5810>

This adds support for dynamically setting the LRO feature flag. The
message to control guest features in the backend uses the
CTRL_GUEST_OFFLOADS msg type.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/virtio_net.c |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ca5239a..8189e5b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1419,6 +1419,41 @@ static void virtnet_init_settings(struct net_device *dev)
 	.set_settings = virtnet_set_settings,
 };
 
+static int virtnet_set_features(struct net_device *netdev,
+				netdev_features_t features)
+{
+	struct virtnet_info *vi = netdev_priv(netdev);
+	struct virtio_device *vdev = vi->vdev;
+	struct scatterlist sg;
+	u64 offloads = 0;
+
+	if (features & NETIF_F_LRO)
+		offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
+			    (1 << VIRTIO_NET_F_GUEST_TSO6);
+
+	if (features & NETIF_F_RXCSUM)
+		offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
+
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
+		sg_init_one(&sg, &offloads, sizeof(uint64_t));
+		if (!virtnet_send_command(vi,
+					  VIRTIO_NET_CTRL_GUEST_OFFLOADS,
+					  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
+					  &sg)) {
+			dev_warn(&netdev->dev,
+				 "Failed to set guest offloads by virtnet command.\n");
+			return -EINVAL;
+		}
+	} else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
+		   !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+		dev_warn(&netdev->dev,
+			 "No support for setting offloads pre version_1.\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
@@ -1435,6 +1470,7 @@ static void virtnet_init_settings(struct net_device *dev)
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	.ndo_busy_poll		= virtnet_busy_poll,
 #endif
+	.ndo_set_features	= virtnet_set_features,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -1810,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
 		dev->features |= NETIF_F_RXCSUM;
 
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
+	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+		dev->features |= NETIF_F_LRO;
+		dev->hw_features |= NETIF_F_LRO;
+	}
+
 	dev->vlan_features = dev->features;
 
 	/* MTU range: 68 - 65535 */
@@ -2047,7 +2089,8 @@ static int virtnet_restore(struct virtio_device *vdev)
 	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
-	VIRTIO_NET_F_MTU
+	VIRTIO_NET_F_MTU, \
+	VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,

^ permalink raw reply related

* [net-next PATCH v2 0/5] XDP for virtio_net
From: John Fastabend @ 2016-11-20  2:49 UTC (permalink / raw)
  To: daniel, eric.dumazet, mst, kubakici, shm, davem,
	alexei.starovoitov
  Cc: netdev, bblanco, john.fastabend, john.r.fastabend, brouer, tgraf

This implements virtio_net for the mergeable buffers and big_packet
modes. I tested this with vhost_net running on qemu and did not see
any issues.

There are some restrictions for XDP to be enabled (see patch 3) for
more details.

  1. LRO must be off
  2. MTU must be less than PAGE_SIZE
  3. queues must be available to dedicate to XDP
  4. num_bufs received in mergeable buffers must be 1
  5. big_packet mode must have all data on single page

Please review any comments/feedback welcome as always.

v2, fixes rcu usage throughout thanks to Eric and the use of
num_online_cpus() usage thanks to Jakub.

Thanks,
John

---

John Fastabend (4):
      net: virtio dynamically disable/enable LRO
      net: xdp: add invalid buffer warning
      virtio_net: add dedicated XDP transmit queues
      virtio_net: add XDP_TX support

Shrijeet Mukherjee (1):
      virtio_net: Add XDP support


 drivers/net/virtio_net.c |  268 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/filter.h   |    1 
 net/core/filter.c        |    6 +
 3 files changed, 270 insertions(+), 5 deletions(-)

^ permalink raw reply

* Re: [PATCH] i40e: remove unnecessary __packed
From: David Miller @ 2016-11-20  2:47 UTC (permalink / raw)
  To: tushar.n.dave; +Cc: jeffrey.t.kirsher, intel-wired-lan, netdev
In-Reply-To: <1479592438-17078-1-git-send-email-tushar.n.dave@oracle.com>

From: Tushar Dave <tushar.n.dave@oracle.com>
Date: Sat, 19 Nov 2016 13:53:58 -0800

> 'struct i40e_dma_mem' defined with 'packed' directive causing kernel
> unaligned errors on sparc.

I really wish people would get out of the habit of just slapping
the __packed attribute onto structures seemingly by default.

^ permalink raw reply

* Re: [net-next] rtnl: fix the loop index update error in rtnl_dump_ifinfo()
From: David Ahern @ 2016-11-20  2:45 UTC (permalink / raw)
  To: Zhang Shengju, netdev
In-Reply-To: <1479569312-15224-1-git-send-email-zhangshengju@cmss.chinamobile.com>

On 11/19/16 10:28 AM, Zhang Shengju wrote:
> If the link is filtered out, loop index should also be updated. If not,
> loop index will not be correct.
> 
> Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> ---
>  net/core/rtnetlink.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index db313ec..f4b9350 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1606,7 +1606,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
>  		head = &net->dev_index_head[h];
>  		hlist_for_each_entry(dev, head, index_hlist) {
>  			if (link_dump_filtered(dev, master_idx, kind_ops))
> -				continue;
> +				goto cont;
>  			if (idx < s_idx)
>  				goto cont;
>  			err = rtnl_fill_ifinfo(skb, dev, RTM_NEWLINK,
> 

Fixes: dc599f76c22b0 ("net: Add support for filtering link dump by master device and kind")

Acked-by: David Ahern <dsa@cumulusnetworks.com>

^ permalink raw reply

* [PATCH net-next] bnx2: use READ_ONCE() instead of barrier()
From: Eric Dumazet @ 2016-11-19 22:57 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Rasesh Mody, Harish Patil

From: Eric Dumazet <edumazet@google.com>

barrier() is a big hammer compared to READ_ONCE(),
and requires comments explaining what is protected.

READ_ONCE() is more precise and compiler should generate
better overall code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/ethernet/broadcom/bnx2.c |   17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 9960a9249dc0d0c8fb3494989e882a4c96e64970..d5d1026be4b70a320c48af7545a1891c74e19a47 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -254,13 +254,10 @@ static inline u32 bnx2_tx_avail(struct bnx2 *bp, struct bnx2_tx_ring_info *txr)
 {
 	u32 diff;
 
-	/* Tell compiler to fetch tx_prod and tx_cons from memory. */
-	barrier();
-
 	/* The ring uses 256 indices for 255 entries, one of them
 	 * needs to be skipped.
 	 */
-	diff = txr->tx_prod - txr->tx_cons;
+	diff = READ_ONCE(txr->tx_prod) - READ_ONCE(txr->tx_cons);
 	if (unlikely(diff >= BNX2_TX_DESC_CNT)) {
 		diff &= 0xffff;
 		if (diff == BNX2_TX_DESC_CNT)
@@ -2839,10 +2836,8 @@ bnx2_get_hw_tx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	/* Tell compiler that status block fields can change. */
-	barrier();
-	cons = *bnapi->hw_tx_cons_ptr;
-	barrier();
+	cons = READ_ONCE(*bnapi->hw_tx_cons_ptr);
+
 	if (unlikely((cons & BNX2_MAX_TX_DESC_CNT) == BNX2_MAX_TX_DESC_CNT))
 		cons++;
 	return cons;
@@ -3141,10 +3136,8 @@ bnx2_get_hw_rx_cons(struct bnx2_napi *bnapi)
 {
 	u16 cons;
 
-	/* Tell compiler that status block fields can change. */
-	barrier();
-	cons = *bnapi->hw_rx_cons_ptr;
-	barrier();
+	cons = READ_ONCE(*bnapi->hw_rx_cons_ptr);
+
 	if (unlikely((cons & BNX2_MAX_RX_DESC_CNT) == BNX2_MAX_RX_DESC_CNT))
 		cons++;
 	return cons;

^ permalink raw reply related

* [PATCH] i40e: remove unnecessary __packed
From: Tushar Dave @ 2016-11-19 21:53 UTC (permalink / raw)
  To: jeffrey.t.kirsher, intel-wired-lan, netdev

'struct i40e_dma_mem' defined with 'packed' directive causing kernel
unaligned errors on sparc.

e.g.
i40e: Intel(R) Ethernet Connection XL710 Network Driver - version
1.6.16-k
i40e: Copyright (c) 2013 - 2014 Intel Corporation.
Kernel unaligned access at TPC[44894c] dma_4v_alloc_coherent+0x1ac/0x300
Kernel unaligned access at TPC[44894c] dma_4v_alloc_coherent+0x1ac/0x300
Kernel unaligned access at TPC[44894c] dma_4v_alloc_coherent+0x1ac/0x300
Kernel unaligned access at TPC[44894c] dma_4v_alloc_coherent+0x1ac/0x300
Kernel unaligned access at TPC[44894c] dma_4v_alloc_coherent+0x1ac/0x300
i40e 0000:03:00.0: fw 5.1.40981 api 1.5 nvm 5.04 0x80002548 0.0.0

This can be fixed with get_unaligned/put_unaligned(). However no
reference in driver shows that 'struct i40e_dma_mem' directly shoved
into NIC hardware. But instead fields of the struct are being read and
used for hardware. Therefore, __packed is unnecessary for 'struct
i40e_dma_mem'.

In addition, although 'struct i40e_virt_mem' doesn't cause any
unaligned access, keeping it packed is unnecessary as well because
of aforementioned reason.

This change make 'struct i40e_dma_mem' and 'struct i40e_virt_mem'
unpacked.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
 drivers/net/ethernet/intel/i40e/i40e_osdep.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
index 5b6feb7..be74bcf 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
@@ -55,7 +55,7 @@ struct i40e_dma_mem {
 	void *va;
 	dma_addr_t pa;
 	u32 size;
-} __packed;
+};
 
 #define i40e_allocate_dma_mem(h, m, unused, s, a) \
 			i40e_allocate_dma_mem_d(h, m, s, a)
@@ -64,7 +64,7 @@ struct i40e_dma_mem {
 struct i40e_virt_mem {
 	void *va;
 	u32 size;
-} __packed;
+};
 
 #define i40e_allocate_virt_mem(h, m, s) i40e_allocate_virt_mem_d(h, m, s)
 #define i40e_free_virt_mem(h, m) i40e_free_virt_mem_d(h, m)
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues
From: John Fastabend @ 2016-11-19 21:33 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: tgraf, shm, alexei.starovoitov, daniel, davem, john.r.fastabend,
	netdev, bblanco, brouer
In-Reply-To: <20161118192317.6c283867@laptop>

On 16-11-18 07:23 PM, Jakub Kicinski wrote:
> On Fri, 18 Nov 2016 19:20:58 -0800, Eric Dumazet wrote:
>> On Fri, 2016-11-18 at 18:57 -0800, Jakub Kicinski wrote:
>>> On Fri, 18 Nov 2016 18:43:55 -0800, John Fastabend wrote:  
>>>> On 16-11-18 06:10 PM, Jakub Kicinski wrote:  
>>  [...]  
>>>>
>>>> Seem like a valid concerns to me how about num_possible_cpus() instead.  
>>>
>>> That would solve problem 1, but could cpu_possible_mask still be sparse
>>> on strange setups?  Let me try to dig into this, I recall someone
>>> (Eric?) was fixing similar problems some time ago.  
>>
>> nr_cpu_ids is probably what you want ;)
> 
> Thank you :)
> 

Yep poked around a bit and the common pattern seems to be to use
nr_cpu_ids to build a cpu array and then index it with
smp_processor_id(). So I'll do this as well.

Although I'm not sure I entirely follow on the x86 platforms at
least how/if nr_cpu_ids != num_possible_cpus().

Nice catch Jakub.

Thanks,
John

^ permalink raw reply

* [PATCH net 1/1] tipc: eliminate obsolete socket locking policy description
From: Jon Maloy @ 2016-11-19 19:47 UTC (permalink / raw)
  To: davem
  Cc: netdev, Paul Gortmaker, parthasarathy.bhuvaragan, ying.xue, maloy,
	tipc-discussion, Jon Maloy

The comment block in socket.c describing the locking policy is
obsolete, and does not reflect current reality. We remove it in this
commit.

Since the current locking policy is much simpler and follows a
mainstream approach, we see no need to add a new description.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/socket.c | 48 +-----------------------------------------------
 1 file changed, 1 insertion(+), 47 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 22d92f0..4916d8f 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -1,7 +1,7 @@
 /*
  * net/tipc/socket.c: TIPC socket API
  *
- * Copyright (c) 2001-2007, 2012-2015, Ericsson AB
+ * Copyright (c) 2001-2007, 2012-2016, Ericsson AB
  * Copyright (c) 2004-2008, 2010-2013, Wind River Systems
  * All rights reserved.
  *
@@ -127,54 +127,8 @@ static const struct proto_ops packet_ops;
 static const struct proto_ops stream_ops;
 static const struct proto_ops msg_ops;
 static struct proto tipc_proto;
-
 static const struct rhashtable_params tsk_rht_params;
 
-/*
- * Revised TIPC socket locking policy:
- *
- * Most socket operations take the standard socket lock when they start
- * and hold it until they finish (or until they need to sleep).  Acquiring
- * this lock grants the owner exclusive access to the fields of the socket
- * data structures, with the exception of the backlog queue.  A few socket
- * operations can be done without taking the socket lock because they only
- * read socket information that never changes during the life of the socket.
- *
- * Socket operations may acquire the lock for the associated TIPC port if they
- * need to perform an operation on the port.  If any routine needs to acquire
- * both the socket lock and the port lock it must take the socket lock first
- * to avoid the risk of deadlock.
- *
- * The dispatcher handling incoming messages cannot grab the socket lock in
- * the standard fashion, since invoked it runs at the BH level and cannot block.
- * Instead, it checks to see if the socket lock is currently owned by someone,
- * and either handles the message itself or adds it to the socket's backlog
- * queue; in the latter case the queued message is processed once the process
- * owning the socket lock releases it.
- *
- * NOTE: Releasing the socket lock while an operation is sleeping overcomes
- * the problem of a blocked socket operation preventing any other operations
- * from occurring.  However, applications must be careful if they have
- * multiple threads trying to send (or receive) on the same socket, as these
- * operations might interfere with each other.  For example, doing a connect
- * and a receive at the same time might allow the receive to consume the
- * ACK message meant for the connect.  While additional work could be done
- * to try and overcome this, it doesn't seem to be worthwhile at the present.
- *
- * NOTE: Releasing the socket lock while an operation is sleeping also ensures
- * that another operation that must be performed in a non-blocking manner is
- * not delayed for very long because the lock has already been taken.
- *
- * NOTE: This code assumes that certain fields of a port/socket pair are
- * constant over its lifetime; such fields can be examined without taking
- * the socket lock and/or port lock, and do not need to be re-read even
- * after resuming processing after waiting.  These fields include:
- *   - socket type
- *   - pointer to socket sk structure (aka tipc_sock structure)
- *   - pointer to port structure
- *   - port reference
- */
-
 static u32 tsk_own_node(struct tipc_sock *tsk)
 {
 	return msg_prevnode(&tsk->phdr);
-- 
2.7.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox