Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-16 17:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <543FFC03.1060207@redhat.com>

On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:

> So if a page is used twice we are double counting the page size for the 
> socket then, is that correct?  I just want to make sure because prior to 
> this patch both flows did the same thing and counted the portion of the 
> page used in this pass, now with this change for PAGE_SIZE of 4K we 
> count the entire page, and for all other cases we count the portion of 
> the page used.

When a page is split in 2 parts only, probability that a segment holds
the 4K page is quite high (There is a single half page)

When we split say 64KB in 42 segments, the probability a single segment
hold the full 64KB block is very low, so we can almost be safe when we
consider 'truesize = 1536'

Of course there are pathological cases, but attacker has to be quite
smart.

I am just saying that counting 2048 might have a big impact on memory
consumption if all these incoming segments are stored a long time in
receive queues (TCP receive queues or out of order queues) : We might be
off by a factor of 2 on the real memory usage, and delay the TCP
collapsing too much.



^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Lluís Batlle i Rossell @ 2014-10-16 17:46 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: Eric Dumazet, linux-kernel, netdev, Carles Pagès,
	linux-arm-kernel
In-Reply-To: <20141016173401.GA16515@f1.synalogic.ca>

On Thu, Oct 16, 2014 at 10:34:01AM -0700, Benjamin Poirier wrote:
> On 2014/10/15 15:45, Eric Dumazet wrote:
> > On Wed, 2014-10-15 at 14:57 -0700, Benjamin Poirier wrote:
> > > On 2014/10/13 12:52, Lluís Batlle i Rossell wrote:
> > > > Hello,
> > > > 
> > > > on the 7th of January 2014 ths patch was applied:
> > > > https://lkml.org/lkml/2014/1/7/307
> > > > 
> > > > [PATCH v2] net: Do not enable tx-nocache-copy by default
> > > >         
> > > > In the Sheevaplug (ARM Feroceon 88FR131 from Marvell) this made packets to be
> > > > sent corrupted. I think this machine has something special about the cache.
> > > > 
> > > > Enabling back this tx-nocache-copy (as it used to be before the patch) the
> > > > transfers work fine again. I think that most people, encountering this problem,
> > > > completely disable the tx offload instead of enabling back this setting.
> > > > 
> > > > Is this an ARM kernel problem regarding this platform?
> > > 
> > > This is odd, only x86 defines ARCH_HAS_NOCACHE_UACCESS. On arm,
> > > skb_do_copy_data_nocache() should end up using __copy_from_user()
> > > regardless of tx-nocache-copy.
> > 
> >  kmap_atomic()/kunmap_atomic() is missing, so we lack
> > __cpuc_flush_dcache_area() operations.
> > 
> 
> You lost me there.
> 1) I don't see the link
> 2) It seems kmap_atomic and so on are there:
> $ grep kmap_atomic System.map-3.16-2-kirkwood
> c0014838 T kmap_atomic
> c001491c T kmap_atomic_pfn
> c00149a4 T kmap_atomic_to_page
> 
> MACH_KIRKWOOD selects CPU_FEROCEON which has
> __cpuc_flush_dcache_area ->
> 	cpu_cache.flush_kern_dcache_area ->
> 		feroceon_flush_kern_dcache_area

Hello all,

it seems I was a bit wrong - although enabling back tx-nocache-copy makes the
tx-errors happen much less often (ssh complaining about HMAC), they still
happen. It seems that something was introduced in some recent kernels that broke
the tx offload.

I have no idea what it can be, but since 2.6 until at least 3.10 the network
driver worked fine with tx offload in this sheevaplug board.

Regards,
Lluís.

^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Eric Dumazet @ 2014-10-16 17:48 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: Lluís Batlle i Rossell, linux-kernel, netdev,
	Carles Pagès, linux-arm-kernel
In-Reply-To: <20141016173401.GA16515@f1.synalogic.ca>

On Thu, 2014-10-16 at 10:34 -0700, Benjamin Poirier wrote:
> On 2014/10/15 15:45, Eric Dumazet wrote:

> >  kmap_atomic()/kunmap_atomic() is missing, so we lack
> > __cpuc_flush_dcache_area() operations.
> > 
> 
> You lost me there.
> 1) I don't see the link
> 2) It seems kmap_atomic and so on are there:
> $ grep kmap_atomic System.map-3.16-2-kirkwood
> c0014838 T kmap_atomic
> c001491c T kmap_atomic_pfn
> c00149a4 T kmap_atomic_to_page
> 
> MACH_KIRKWOOD selects CPU_FEROCEON which has
> __cpuc_flush_dcache_area ->
> 	cpu_cache.flush_kern_dcache_area ->
> 		feroceon_flush_kern_dcache_area

I meant to put a '?' instead of a '.'

Note that tcp does a copy, using :

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-16 18:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413481529.28798.29.camel@edumazet-glaptop2.roam.corp.google.com>


On 10/16/2014 10:45 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:
>
>> So if a page is used twice we are double counting the page size for the
>> socket then, is that correct?  I just want to make sure because prior to
>> this patch both flows did the same thing and counted the portion of the
>> page used in this pass, now with this change for PAGE_SIZE of 4K we
>> count the entire page, and for all other cases we count the portion of
>> the page used.
> When a page is split in 2 parts only, probability that a segment holds
> the 4K page is quite high (There is a single half page)

Actually the likelihood of anything holding onto the 4K page for very 
long doesn't seem to occur, at least from the drivers perspective.  It 
is one of the reasons why I went for the page reuse approach rather than 
just partitioning a single large page.  It allows us to avoid having to 
call IOMMU map/unmap for the pages since the entire page is usually back 
in the driver ownership before we need to reuse the portion given to the 
stack.

> When we split say 64KB in 42 segments, the probability a single segment
> hold the full 64KB block is very low, so we can almost be safe when we
> consider 'truesize = 1536'

Yes, but the likelihood that only a few segments are holding the page is 
still very high.  So you might not have one segment holding the 64K 
page, but I find it very difficult to believe that all 42 would be 
holding it at the same time.  In that case should we be adding some 
portion of the 64K to the truesize for all frames to account for this?

> Of course there are pathological cases, but attacker has to be quite
> smart.
>
> I am just saying that counting 2048 might have a big impact on memory
> consumption if all these incoming segments are stored a long time in
> receive queues (TCP receive queues or out of order queues) : We might be
> off by a factor of 2 on the real memory usage, and delay the TCP
> collapsing too much.

My concern would be that we are off by a factor of 2 and prematurely 
collapse the TCP too soon with this change.  For example if you are 
looking at a socket that is holding pages for a long period of time 
there would be a good chance of it ending up with both halves of the 
page.  In this case is it fair to charge it for 8K or memory use when in 
reality it is only using 4K?

Thanks,

Alex


^ permalink raw reply

* Re: [net 0/4][pull request] Intel Wired LAN Driver Updates 2014-10-16
From: David Miller @ 2014-10-16 18:43 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <1413452187-2950-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 16 Oct 2014 02:36:23 -0700

> This series contains updates to fm10k and ixgbe.
> 
> Matthew provides two fixes for fm10k, first sets the flag to fetch the
> host state before kicking off the service task that reads the host
> state when bringing the interface up.  The second makes sure that we
> release the mailbox lock after detecting an error and before we return
> the error code.
> 
> Andy Zhou provides a compile fix for fm10k, when the driver is compiled
> into the kernel and the VXLAN driver is compiled as a module.
> 
> Emil provides a fix for ixgbe to prevent against a panic by trying
> to dereference a NULL pointer in ixgbe_ndo_set_vf_spoofchk().

Series applied, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net] netlink: fix description of portid
From: David Miller @ 2014-10-16 18:53 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <1413467271-6385-1-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 16 Oct 2014 15:47:51 +0200

> Avoid confusion between pid and portid.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied, thanks Nicolas.

We've been plagued by this confusion forever, because the value used to
even be stored in struct members and variables named 'pid'.

^ permalink raw reply

* Re: [PATCH] [trivial] treewide: Fix company name in module descriptions
From: Chris Snook @ 2014-10-16 19:09 UTC (permalink / raw)
  To: Masanari Iida
  Cc: LKML, mturquette, trivial, linus.walleij, gnurou, James Cliburn,
	viresh.linux, inki.dae, dh09.lee, kgene.kim, rdunlap,
	netdev@vger.kernel.org
In-Reply-To: <1413472164-22366-1-git-send-email-standby24x7@gmail.com>

On Thu, Oct 16, 2014 at 8:09 AM, Masanari Iida <standby24x7@gmail.com> wrote:
> This patch fix company name's spelling typo in module descriptions
> and a Kconfig.
>
> Signed-off-by: Masanari Iida <standby24x7@gmail.com>

> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index 72fb86b..c9946c6 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -48,7 +48,7 @@ MODULE_DEVICE_TABLE(pci, atl1c_pci_tbl);
>
>  MODULE_AUTHOR("Jie Yang");
>  MODULE_AUTHOR("Qualcomm Atheros Inc., <nic-devel@qualcomm.com>");
> -MODULE_DESCRIPTION("Qualcom Atheros 100/1000M Ethernet Network Driver");
> +MODULE_DESCRIPTION("Qualcomm Atheros 100/1000M Ethernet Network Driver");
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(ATL1C_DRV_VERSION);
>

Acked-by: Chris Snook <chris.snook@gmail.com>

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-16 21:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <54400C6C.7010405@redhat.com>

On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:

> My concern would be that we are off by a factor of 2 and prematurely 
> collapse the TCP too soon with this change. 

That is the opposite actually. We can consume 4K but we pretend we
consume 2K in some worst cases.

>  For example if you are 
> looking at a socket that is holding pages for a long period of time 
> there would be a good chance of it ending up with both halves of the 
> page.  In this case is it fair to charge it for 8K or memory use when in 
> reality it is only using 4K?

Its better to collapse too soon than too late.

If you want to avoid collapses because one host has plenty of memory,
all you need to do is increase tcp_rmem.

Why are you referring to 8K ? PAGE_SIZE is 4K



^ permalink raw reply

* [PATCH net-next, v3] hyperv: Add handling of IP header with option field in netvsc_set_hash()
From: Haiyang Zhang @ 2014-10-16 21:47 UTC (permalink / raw)
  To: davem, netdev; +Cc: olaf, jasowang, driverdev-devel, linux-kernel, haiyangz

In case that the IP header has optional field at the end, this patch will
get the port numbers after that field, and compute the hash. The general
parser skb_flow_dissect() is used here.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c |   26 ++++++++++----------------
 1 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 0fcb5e7..9e17d1a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -162,7 +162,7 @@ union sub_key {
  * data: network byte order
  * return: host byte order
  */
-static u32 comp_hash(u8 *key, int klen, u8 *data, int dlen)
+static u32 comp_hash(u8 *key, int klen, void *data, int dlen)
 {
 	union sub_key subk;
 	int k_next = 4;
@@ -176,7 +176,7 @@ static u32 comp_hash(u8 *key, int klen, u8 *data, int dlen)
 	for (i = 0; i < dlen; i++) {
 		subk.kb = key[k_next];
 		k_next = (k_next + 1) % klen;
-		dt = data[i];
+		dt = ((u8 *)data)[i];
 		for (j = 0; j < 8; j++) {
 			if (dt & 0x80)
 				ret ^= subk.ka;
@@ -190,26 +190,20 @@ static u32 comp_hash(u8 *key, int klen, u8 *data, int dlen)
 
 static bool netvsc_set_hash(u32 *hash, struct sk_buff *skb)
 {
-	struct iphdr *iphdr;
+	struct flow_keys flow;
 	int data_len;
-	bool ret = false;
 
-	if (eth_hdr(skb)->h_proto != htons(ETH_P_IP))
+	if (!skb_flow_dissect(skb, &flow) || flow.n_proto != htons(ETH_P_IP))
 		return false;
 
-	iphdr = ip_hdr(skb);
+	if (flow.ip_proto == IPPROTO_TCP)
+		data_len = 12;
+	else
+		data_len = 8;
 
-	if (iphdr->version == 4) {
-		if (iphdr->protocol == IPPROTO_TCP)
-			data_len = 12;
-		else
-			data_len = 8;
-		*hash = comp_hash(netvsc_hash_key, HASH_KEYLEN,
-				  (u8 *)&iphdr->saddr, data_len);
-		ret = true;
-	}
+	*hash = comp_hash(netvsc_hash_key, HASH_KEYLEN, &flow, data_len);
 
-	return ret;
+	return true;
 }
 
 static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-16 22:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413495626.28798.38.camel@edumazet-glaptop2.roam.corp.google.com>


On 10/16/2014 02:40 PM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:
>
>> My concern would be that we are off by a factor of 2 and prematurely
>> collapse the TCP too soon with this change.
> That is the opposite actually. We can consume 4K but we pretend we
> consume 2K in some worst cases.

The only case where we consume the full 4K but only list it as 2K should 
be if we have memory from the wrong node and we want to flush it from 
the descriptor queue.  For all other cases we should be using the page 
at least twice per buffer.  So the the first page that was assigned for 
an Rx descriptor might be flushed but then after that reuse should take 
hold and stay in place as long as the NAPI poll doesn't change NUMA nodes.

That should be no worse than the case where the remaining space in a 
large page is not large enough to use as a buffer. You still use the 
current size as your truesize, you don't include the overhead of the 
unused space in your calculation.

>>   For example if you are
>> looking at a socket that is holding pages for a long period of time
>> there would be a good chance of it ending up with both halves of the
>> page.  In this case is it fair to charge it for 8K or memory use when in
>> reality it is only using 4K?
> Its better to collapse too soon than too late.
>
> If you want to avoid collapses because one host has plenty of memory,
> all you need to do is increase tcp_rmem.
>
> Why are you referring to 8K ? PAGE_SIZE is 4K

The truesize would be reported as 8K vs 4K for 2 half pages with your 
change if we were to hand off both halves of a page to the same socket.

The 2K value makes sense and is consistent with how we handle this in 
other cases where we are partitioning pages for use as network buffers.  
I think increasing this to 4K is just going to cause performance issues 
as flows are going to get choked off prematurely for memory usage that 
they aren't actually getting.

Part of my hesitation is that I spent the last couple of years 
explaining to our performance testing team and customers that they need 
to adjust tcp_rmem with all of the changes that have been made to 
truesize and the base network drivers, and I think I would prefer it if 
I didn't have to go another round of it.  Then again I probably won't 
have to anyway since I am not doing drivers for Intel any more, but 
still my reaction to this kind of change is what it is.

Thanks,

Alex





^ permalink raw reply

* RE: tcpdump's capture filter: "vlan" doesn't match
From: Lukas Tribus @ 2014-10-16 23:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev@vger.kernel.org, John Fastabend, Michał Mirosław,
	Jiri Pirko, Ben Hutchings, Atzm Watanabe, Patrick McHardy,
	Jesse Gross, Michael Richardson, Ani Sinha
In-Reply-To: <543F616C.5040801@redhat.com>

>> Isn't disabling rx-vlan-offloading supposed to remedy those problems?
>
> There were some discussions on this in the past e.g. [1]. We have
> SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT for the BPF filter on
> this, but libpcap is currently not making use of any of them.
>
> [1] http://thread.gmane.org/gmane.linux.network/247947

Thanks for the link. I see the situation is unfortunate and although those
new BPF filters in the kernel may fix the actual filtering problem, one
thing seems to remain impossible: disabling all this kernel magic and
passing the frame as-is to libpcap without interception (avoiding any
kind of artificial header reconstruction).

How is the situation with netsniff-ng anyway? Does it use vlan BPF filter
in the kernel?



Regards,

Lukas

 		 	   		  

^ permalink raw reply

* Re: tcpdump's capture filter: "vlan" doesn't match
From: Ani Sinha @ 2014-10-16 23:39 UTC (permalink / raw)
  To: Lukas Tribus
  Cc: Daniel Borkmann, netdev@vger.kernel.org, John Fastabend,
	Michał Mirosław, Jiri Pirko, Ben Hutchings,
	Atzm Watanabe, Patrick McHardy, Jesse Gross, Michael Richardson,
	Ani Sinha, fenner
In-Reply-To: <DUB123-W25C5DA9BEDC92CD2524B68EDAB0@phx.gbl>

+fenner.

I had spent some considerable time in the past looking into this and
proposing a patch :

http://seclists.org/tcpdump/2013/q1/0

However, there was no feedback and I got sucked into a different
project and this work fell through the cracks. If someone else picks
it up, I will be glad to help/lend a hand.

Cheers,
ani


On Thu, Oct 16, 2014 at 4:25 PM, Lukas Tribus <luky-37@hotmail.com> wrote:
>>> Isn't disabling rx-vlan-offloading supposed to remedy those problems?
>>
>> There were some discussions on this in the past e.g. [1]. We have
>> SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT for the BPF filter on
>> this, but libpcap is currently not making use of any of them.
>>
>> [1] http://thread.gmane.org/gmane.linux.network/247947
>
> Thanks for the link. I see the situation is unfortunate and although those
> new BPF filters in the kernel may fix the actual filtering problem, one
> thing seems to remain impossible: disabling all this kernel magic and
> passing the frame as-is to libpcap without interception (avoiding any
> kind of artificial header reconstruction).
>
> How is the situation with netsniff-ng anyway? Does it use vlan BPF filter
> in the kernel?
>
>
>
> Regards,
>
> Lukas
>
>

^ permalink raw reply

* [PATCH net] net: dsa: remove phy.h and phy_fixed.h inclusions
From: Florian Fainelli @ 2014-10-17  0:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli

There is no need to include phy.h nor phy_fixed.h when we can use
forward declarations instead to keep the include chain smaller.

Doing this unveiled that we were implicitely getting the definitions for
struct ethtool_eee and struct ethtool_wolinfo, and that net/dsa/slave.c
was missing an include of phy_fixed.h.

Fixes: ec9436baedb6 ("net: dsa: allow drivers to do link adjustment")
Fixes: ce31b31c68e7 ("net: dsa: allow updating fixed PHY link information")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/net/dsa.h | 7 +++++--
 net/dsa/slave.c   | 1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 58ad8c6492db..0c38b083e6eb 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -16,8 +16,11 @@
 #include <linux/timer.h>
 #include <linux/workqueue.h>
 #include <linux/of.h>
-#include <linux/phy.h>
-#include <linux/phy_fixed.h>
+
+struct phy_device;
+struct fixed_phy_status;
+struct ethtool_eee;
+struct ethtool_wolinfo;
 
 enum dsa_tag_protocol {
 	DSA_TAG_PROTO_NONE = 0,
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 8030489d9cbe..a851e9f14118 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -11,6 +11,7 @@
 #include <linux/list.h>
 #include <linux/etherdevice.h>
 #include <linux/phy.h>
+#include <linux/phy_fixed.h>
 #include <linux/of_net.h>
 #include <linux/of_mdio.h>
 #include "dsa_priv.h"
-- 
1.9.1

^ permalink raw reply related

* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: Jiafei.Pan @ 2014-10-17  2:35 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: David Miller, jkosina@suse.cz, netdev@vger.kernel.org,
	LeoLi@freescale.com, linux-doc@vger.kernel.org,
	Jiafei.Pan@freescale.com
In-Reply-To: <543FE413.6030406@redhat.com>


> -----Original Message-----
> From: Alexander Duyck [mailto:alexander.h.duyck@redhat.com]
> Sent: Thursday, October 16, 2014 11:28 PM
> To: Pan Jiafei-B37022; Eric Dumazet
> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> 
> On 10/15/2014 10:15 PM, Jiafei.Pan@freescale.com wrote:
> >> -----Original Message-----
> >> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> >> Sent: Thursday, October 16, 2014 12:15 PM
> >> To: Pan Jiafei-B37022
> >> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> >> linux-doc@vger.kernel.org
> >> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> >>
> >> On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:
> >>
> >>> Thanks for your comments and suggestion. In my case, I want to build skb
> >>> from hardware block specified memory, I only can see two ways, one is
> modified
> >>> net card driver replace common skb allocation function with my specially
> >>> functions, another way is to hack common skb allocation function in which
> >>> redirect to my specially functions. My patch is just for the second way.
> >>> Except these two ways, would you please give me some advice for some other
> >>> ways for my case? Thanks
> >> I suggest you read drivers/net/ethernet numerous examples.
> >>
> >> No need to change anything  in net/* or include/*, really.
> >>
> >> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
> >>
> >> Mentioning 'hack' in your mails simply should hint you are doing
> >> something very wrong.
> >>
> >> What makes you think your hardware is so special ?
> >>
> > In fact, I am developing a bridge driver, it can bridge between any other the
> > third party net card and my own net card. My target is to let any other the
> > third party net card can directly use my own net card specified buffer, then
> > there will be no memory copy in the whole bridge process.
> > By the way, I don’t see any similar between igb_main.c and my case. And also
> > My bridge also can’t implemented with "skb frag" in order to aim at zero
> memory
> > copy.
> 
> I think the part you are not getting is that is how buffers are
> essentially handled now.  

[Pan Jiafei] Hi, Alex, thanks for your comments. I don’t confirm that
you have catch my concerns. For example, I want to add igb net card 
into my bridge, and want to igb net driver allocate skb by using
my specified memory address, but I don’t want to modify igb net driver
directly, how to do this in my bridge drivers?

Thanks,
Jiafei.

So for example in the case if igb the only
> part we have copied out is usually the header, or the entire frame in
> the case of small packets.  This has to happen in order to allow for
> changes to the header for routing and such.  Beyond that the frags that
> are passed are the buffers that igb is still holding onto.  So
> effectively what the other device transmits in a bridging/routing
> scenario is my own net card specified buffer plus the copied/modified
> header.
> 
> For a brief period igb used build_skb but that isn't valid on most
> systems as memory mapped for a device can be overwritten if the page is
> unmapped resulting in any changes to the header for routing/bridging
> purposes being invalidated.  Thus we cannot use the buffers for both the
> skb->data header which may be changed and Rx DMA simultaneously.
> 
> Thanks,
> 
> Alex

^ permalink raw reply

* [RFC PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Erik Kline @ 2014-10-17  4:31 UTC (permalink / raw)
  To: netdev; +Cc: hannes, lorenzo, edumazet, Erik Kline

Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes.  Preferred addresses
will still take precendence over optimistic addresses, subject
to other ranking in the source address selection algorithm.

This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).

The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s), but not in the multiple distinct
networks case.

For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.

Also: add an entry in ip-sysctl.txt for optimistic_dad.

Signed-off-by: Erik Kline <ek@google.com>
---
 Documentation/networking/ip-sysctl.txt | 13 +++++++++++++
 include/linux/ipv6.h                   |  1 +
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 29 ++++++++++++++++++++++++++++-
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7a81ac..9bf32bc 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1458,6 +1458,19 @@ suppress_frag_ndisc - INTEGER
 	1 - (default) discard fragmented neighbor discovery packets
 	0 - allow fragmented neighbor discovery packets
 
+optimistic_dad - BOOLEAN
+	Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+		0: disabled (default)
+		1: enabled
+
+use_optimistic - BOOLEAN
+	If enabled, do not classify optimistic addresses as deprecated during
+	source address selection.  Preferred addresses will still be chosen
+	before optimistic addresses, subject to other ranking in the source
+	address selection algorithm.
+		0: disabled (default)
+		1: enabled
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ff56053..7121a2e 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -42,6 +42,7 @@ struct ipv6_devconf {
 	__s32		accept_ra_from_local;
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 	__s32		optimistic_dad;
+	__s32		use_optimistic;
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 	__s32		mc_forwarding;
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index efa2666..5b2c6d9 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -154,6 +154,7 @@ enum {
 	DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
 	DEVCONF_PROXY_NDP,
 	DEVCONF_OPTIMISTIC_DAD,
+	DEVCONF_USE_OPTIMISTIC,
 	DEVCONF_ACCEPT_SOURCE_ROUTE,
 	DEVCONF_MC_FORWARDING,
 	DEVCONF_DISABLE_IPV6,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 725c763..fc8c08d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1169,6 +1169,9 @@ enum {
 	IPV6_SADDR_RULE_LABEL,
 	IPV6_SADDR_RULE_PRIVACY,
 	IPV6_SADDR_RULE_ORCHID,
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	IPV6_SADDR_RULE_NOT_OPTIMISTIC,
+#endif
 	IPV6_SADDR_RULE_PREFIX,
 	IPV6_SADDR_RULE_MAX
 };
@@ -1257,10 +1260,17 @@ static int ipv6_get_saddr_eval(struct net *net,
 		score->scopedist = ret;
 		break;
 	case IPV6_SADDR_RULE_PREFERRED:
+	    {
 		/* Rule 3: Avoid deprecated and optimistic addresses */
+		u8 avoid = IFA_F_DEPRECATED;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+		if (!score->ifa->idev->cnf.use_optimistic)
+			avoid |= IFA_F_OPTIMISTIC;
+#endif
 		ret = ipv6_saddr_preferred(score->addr_type) ||
-		      !(score->ifa->flags & (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC));
+		      !(score->ifa->flags & avoid);
 		break;
+	    }
 #ifdef CONFIG_IPV6_MIP6
 	case IPV6_SADDR_RULE_HOA:
 	    {
@@ -1299,6 +1309,14 @@ static int ipv6_get_saddr_eval(struct net *net,
 		ret = !(ipv6_addr_orchid(&score->ifa->addr) ^
 			ipv6_addr_orchid(dst->addr));
 		break;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	case IPV6_SADDR_RULE_NOT_OPTIMISTIC:
+		/* Optimistic addresses still have lower precedence than other
+		 * preferred addresses.
+		 */
+		ret = !(score->ifa->flags & IFA_F_OPTIMISTIC);
+		break;
+#endif
 	case IPV6_SADDR_RULE_PREFIX:
 		/* Rule 8: Use longest matching prefix */
 		ret = ipv6_addr_diff(&score->ifa->addr, dst->addr);
@@ -4330,6 +4348,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 	array[DEVCONF_ACCEPT_SOURCE_ROUTE] = cnf->accept_source_route;
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 	array[DEVCONF_OPTIMISTIC_DAD] = cnf->optimistic_dad;
+	array[DEVCONF_USE_OPTIMISTIC] = cnf->use_optimistic;
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 	array[DEVCONF_MC_FORWARDING] = cnf->mc_forwarding;
@@ -5155,6 +5174,14 @@ static struct addrconf_sysctl_table
 			.proc_handler   = proc_dointvec,
 
 		},
+		{
+			.procname       = "use_optimistic",
+			.data           = &ipv6_devconf.use_optimistic,
+			.maxlen         = sizeof(int),
+			.mode           = 0644,
+			.proc_handler   = proc_dointvec,
+
+		},
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 		{
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [RFC PATCH net-next 1/6] virtio: make sure used event never go backwards
From: Jason Wang @ 2014-10-17  5:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, netdev, linux-kernel, virtualization, davem
In-Reply-To: <20141015113859.GA26918@redhat.com>

On 10/15/2014 07:38 PM, Michael S. Tsirkin wrote:
> On Wed, Oct 15, 2014 at 06:44:41PM +0800, Jason Wang wrote:
>> On 10/15/2014 06:32 PM, Michael S. Tsirkin wrote:
>>> On Wed, Oct 15, 2014 at 06:13:19PM +0800, Jason Wang wrote:
>>>> On 10/15/2014 05:34 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Oct 15, 2014 at 03:25:25PM +0800, Jason Wang wrote:
>>>>>> This patch checks the new event idx to make sure used event idx never
>>>>>> goes back. This is used to synchronize the calls between
>>>>>> virtqueue_enable_cb_delayed() and virtqueue_enable_cb().
>>>>>>
>>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> the implication being that moving event idx back might cause some race
>>>>> condition?  
>>>> This will cause race condition when tx interrupt is enabled. Consider
>>>> the following cases
>>>>
>>>> 1) tx napi was scheduled
>>>> 2) start_xmit() call virtqueue_enable_cb_delayed() and disable cb, [used
>>>> event is vq->last_used_idx + 3/4 pendg bufs]
>>>> 3) tx napi enable the callback by virtqueue_enable_cb() [ used event is
>>>> vq->last_used_idx ]
>>>>  
>>>> After step 3, used event was moved back, unnecessary tx interrupt was
>>>> triggered.
>>> Well unnecessary interrupts are safe.
>> But it that is what we want to reduce.
> It's all about correctness. I don't think mixing enable_cb
> and enable_cb_delayed makes sense, let's just make
> virtio behave correctly if that happens, no need to
> optimize for that.

Then as you said, need document or add WARN_ON() or BUG() in case both
of the two are used.
>
>
>>> With your patch caller of virtqueue_enable_cb will not get an
>>> interrupt on the next buffer which is not safe.
>>>
>>> If you don't want an interrupt on the next buffer, don't
>>> call virtqueue_enable_cb.
>> So something like this patch should be done in virtio core somewhere
>> else. Virtio-net can not do this since it does not have the knowledge of
>> event index.
> Take a look at my patch - no calls to enable_cb, only
> enable_cb_delayed, so we should be fine.
>
>>>>> If yes but please describe the race explicitly.
>>>>> Is there a bug we need to fix on stable?
>>>> Looks not, current code does not have such race condition.
>>>>> Please also explicitly describe a configuration that causes event idx
>>>>> to go back.
>>>>>
>>>>> All this info should go in the commit log.
>>>> Will do this.
>>>>>> ---
>>>>>>  drivers/virtio/virtio_ring.c |    7 +++++--
>>>>>>  1 files changed, 5 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 3b1f89b..1b3929f 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -559,14 +559,17 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>>>>>>  	u16 last_used_idx;
>>>>>>  
>>>>>>  	START_USE(vq);
>>>>>> -
>>>>>> +	last_used_idx = vq->last_used_idx;
>>>>>>  	/* We optimistically turn back on interrupts, then check if there was
>>>>>>  	 * more to do. */
>>>>>>  	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
>>>>>>  	 * either clear the flags bit or point the event index at the next
>>>>>>  	 * entry. Always do both to keep code simple. */
>>>>>>  	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
>>>>>> -	vring_used_event(&vq->vring) = last_used_idx = vq->last_used_idx;
>>>>>> +	/* Make sure used event never go backwards */
>>>>> s/go/goes/
>>>>>
>>>>>> +	if (!vring_need_event(vring_used_event(&vq->vring),
>>>>>> +			      vq->vring.avail->idx, last_used_idx))
>>>>>> +		vring_used_event(&vq->vring) = last_used_idx;
>>>>> The result will be that driver will *not* get an interrupt
>>>>> on the next consumed buffer, which is likely not what driver
>>>>> intended when it called virtqueue_enable_cb.
>>>> This will only happen when we want to delay the interrupt for next few
>>>> consumed buffers (virtqueue_enable_cb_delayed() was called). For the
>>>> other case, vq->last_used_idx should be ahead of previous used event. Do
>>>> you see any other case?
>>> Call virtqueue_enable_cb_delayed, later call virtqueue_enable_cb.  If
>>> event index is not updated in virtqueue_enable_cb, driver will not get
>>> an interrupt on the next buffer.
>> This is just what we want I think. The interrupt was not lost but fired
>> after 3/4 pending buffers were consumed. Do you see any real issue on this?
> Yes, this violates the API. For example device might never
> consume the rest of buffers.

Then it should be a bug of device which is out of the control of guest.
If not, device might never also consume 3/4 rest of buffers.
>
>>>>> Instead, how about we simply document the requirement that drivers either
>>>>> always call virtqueue_enable_cb_delayed or virtqueue_enable_cb
>>>>> but not both?
>>>> We need call them both when tx interrupt is enabled I believe.
>>> Can you pls reply to my patch and document issues you see?
>>>
>> In the previous reply you said you're using
>> virtuqueue_enable_cb_delayed(), so no race in your patch.
> OK so you think my patch is also correct, but that yours gives better
> efficiency?
>

Need some benchmark to see the difference I think.

^ permalink raw reply

* Re: [PATCH RFC v2 1/3] virtio_net: enable tx interrupt
From: Jason Wang @ 2014-10-17  5:11 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1413383116-30977-2-git-send-email-mst@redhat.com>

On 10/15/2014 10:32 PM, Michael S. Tsirkin wrote:
> On newer hosts that support delayed tx interrupts,
> we probably don't have much to gain from orphaning
> packets early.
>
> Based on patch by Jason Wang.
>
> Note: this might degrade performance for
> hosts without event idx support.
> Should be addressed by the next patch.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/net/virtio_net.c | 137 ++++++++++++++++++++++++++++++++---------------
>  1 file changed, 94 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 13d0a8b..a9bf178 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>  
>  	/* Name of the send queue: output.$index */
>  	char name[40];
> +
> +	struct napi_struct napi;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,37 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>  
> +static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> +	struct sk_buff *skb;
> +	unsigned int len;
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> +	unsigned int packets = 0;
> +
> +	while (packets < budget &&
> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		pr_debug("Sent skb %p\n", skb);
> +
> +		u64_stats_update_begin(&stats->tx_syncp);
> +		stats->tx_bytes += skb->len;
> +		stats->tx_packets++;
> +		u64_stats_update_end(&stats->tx_syncp);
> +
> +		dev_kfree_skb_any(skb);
> +		packets++;
> +	}
> +
> +	return packets;
> +}
> +
>  static void skb_xmit_done(struct virtqueue *vq)
>  {
>  	struct virtnet_info *vi = vq->vdev->priv;
> +	struct send_queue *sq = &vi->sq[vq2txq(vq)];
>  
> -	/* Suppress further interrupts. */
> -	virtqueue_disable_cb(vq);
> -
> -	/* We were probably waiting for more output buffers. */
> -	netif_wake_subqueue(vi->dev, vq2txq(vq));
> +	if (napi_schedule_prep(&sq->napi))
> +		__napi_schedule(&sq->napi);
>  }
>  
>  static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
> @@ -774,6 +798,37 @@ again:
>  	return received;
>  }
>  
> +static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> +{
> +	struct send_queue *sq =
> +		container_of(napi, struct send_queue, napi);
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> +	unsigned int sent = 0;
> +	bool enable_done;
> +
> +again:
> +	__netif_tx_lock(txq, smp_processor_id());
> +	virtqueue_disable_cb(sq->vq);
> +	sent += free_old_xmit_skbs(sq, budget - sent);
> +
> +	if (sent < budget) {
> +		enable_done = virtqueue_enable_cb_delayed(sq->vq);
> +		napi_complete(napi);
> +		__netif_tx_unlock(txq);
> +		if (unlikely(enable_done) && napi_schedule_prep(napi)) {

I think you mean unlikely(!enable_done) here?
> +			virtqueue_disable_cb(sq->vq);
> +			__napi_schedule(napi);
> +			goto again;
> +		}
> +	} else {
> +		__netif_tx_unlock(txq);
> +	}
> +
> +	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
> +	return sent;
> +}
> +
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>  /* must be called with local_bh_disable()d */
>  static int virtnet_busy_poll(struct napi_struct *napi)
> @@ -822,30 +877,12 @@ static int virtnet_open(struct net_device *dev)
>  			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>  				schedule_delayed_work(&vi->refill, 0);
>  		virtnet_napi_enable(&vi->rq[i]);
> +		napi_enable(&vi->sq[i].napi);
>  	}
>  
>  	return 0;
>  }
>  
> -static void free_old_xmit_skbs(struct send_queue *sq)
> -{
> -	struct sk_buff *skb;
> -	unsigned int len;
> -	struct virtnet_info *vi = sq->vq->vdev->priv;
> -	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> -
> -	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		pr_debug("Sent skb %p\n", skb);
> -
> -		u64_stats_update_begin(&stats->tx_syncp);
> -		stats->tx_bytes += skb->len;
> -		stats->tx_packets++;
> -		u64_stats_update_end(&stats->tx_syncp);
> -
> -		dev_kfree_skb_any(skb);
> -	}
> -}
> -
>  static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  {
>  	struct skb_vnet_hdr *hdr;
> @@ -911,7 +948,9 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  		sg_set_buf(sq->sg, hdr, hdr_len);
>  		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
>  	}
> -	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> +
> +	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
> +				    GFP_ATOMIC);
>  }
>  
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> @@ -919,12 +958,16 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int qnum = skb_get_queue_mapping(skb);
>  	struct send_queue *sq = &vi->sq[qnum];
> -	int err;
> +	int err, qsize = virtqueue_get_vring_size(sq->vq);
>  	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>  	bool kick = !skb->xmit_more;
> +	bool stopped;
> +
> +	virtqueue_disable_cb(sq->vq);
>  
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(sq);
> +	/* We are going to push one skb.
> +	 * Try to pop one off to free space for it. */
> +	free_old_xmit_skbs(sq, 1);

Looks like qsize instead of 1 is better? The more skbs freed in
ndo_start_xmit() the less chance tx napi will be scheduled.
>  
>  	/* Try to transmit */
>  	err = xmit_skb(sq, skb);
> @@ -940,27 +983,25 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  		return NETDEV_TX_OK;
>  	}
>  
> -	/* Don't wait up for transmitted skbs to be freed. */
> -	skb_orphan(skb);
> -	nf_reset(skb);
> -
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>  		netif_stop_subqueue(dev, qnum);
> -		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> -			/* More just got used, free them then recheck. */
> -			free_old_xmit_skbs(sq);
> -			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> -				netif_start_subqueue(dev, qnum);
> -				virtqueue_disable_cb(sq->vq);
> -			}
> -		}
> +		stopped = true;
> +	} else {
> +		stopped = false;
>  	}
>  
>  	if (kick || netif_xmit_stopped(txq))
>  		virtqueue_kick(sq->vq);

Looks like we'd better move this in the end in case the queue could be
restarted by following lines?
>  
> +	if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> +		/* More just got used, free them then recheck. */
> +		free_old_xmit_skbs(sq, qsize);
> +		if (stopped && sq->vq->num_free >= 2+MAX_SKB_FRAGS)
> +			netif_start_subqueue(dev, qnum);

Why drop virtqueue_disable_cb() here?
> +	}
> +
>  	return NETDEV_TX_OK;
>  }
>  
> @@ -1137,8 +1178,10 @@ static int virtnet_close(struct net_device *dev)
>  	/* Make sure refill_work doesn't re-enable napi! */
>  	cancel_delayed_work_sync(&vi->refill);
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		napi_disable(&vi->rq[i].napi);
> +		napi_disable(&vi->sq[i].napi);
> +	}
>  
>  	return 0;
>  }
> @@ -1457,8 +1500,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>  {
>  	int i;
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		netif_napi_del(&vi->rq[i].napi);
> +		netif_napi_del(&vi->sq[i].napi);
> +	}
>  
>  	kfree(vi->rq);
>  	kfree(vi->sq);
> @@ -1612,6 +1657,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
>  			       napi_weight);
>  		napi_hash_add(&vi->rq[i].napi);
> +		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
> +			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
>  		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
> @@ -1916,8 +1963,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  	if (netif_running(vi->dev)) {
>  		for (i = 0; i < vi->max_queue_pairs; i++) {
>  			napi_disable(&vi->rq[i].napi);
> +			napi_disable(&vi->sq[i].napi);
>  			napi_hash_del(&vi->rq[i].napi);
>  			netif_napi_del(&vi->rq[i].napi);
> +			netif_napi_del(&vi->sq[i].napi);
>  		}
>  	}
>  
> @@ -1942,8 +1991,10 @@ static int virtnet_restore(struct virtio_device *vdev)
>  			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>  				schedule_delayed_work(&vi->refill, 0);
>  
> -		for (i = 0; i < vi->max_queue_pairs; i++)
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
>  			virtnet_napi_enable(&vi->rq[i]);
> +			napi_enable(&vi->sq[i].napi);
> +		}
>  	}
>  
>  	netif_device_attach(vi->dev);

^ permalink raw reply

* Re: [PATCH RFC v2 2/3] virtio_net: bql
From: Jason Wang @ 2014-10-17  5:16 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1413383116-30977-3-git-send-email-mst@redhat.com>

On 10/15/2014 10:32 PM, Michael S. Tsirkin wrote:
> Improve tx batching using byte queue limits.
> Should be especially effective for MQ.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/net/virtio_net.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a9bf178..8dea411 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -219,13 +219,15 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>  
> -static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
> +				       struct send_queue *sq, int budget)
>  {
>  	struct sk_buff *skb;
>  	unsigned int len;
>  	struct virtnet_info *vi = sq->vq->vdev->priv;
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  	unsigned int packets = 0;
> +	unsigned int bytes = 0;
>  
>  	while (packets < budget &&
>  	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> @@ -233,6 +235,7 @@ static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
>  
>  		u64_stats_update_begin(&stats->tx_syncp);
>  		stats->tx_bytes += skb->len;
> +		bytes += skb->len;
>  		stats->tx_packets++;
>  		u64_stats_update_end(&stats->tx_syncp);
>  
> @@ -240,6 +243,8 @@ static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
>  		packets++;
>  	}
>  
> +	netdev_tx_completed_queue(txq, packets, bytes);
> +
>  	return packets;
>  }
>  
> @@ -810,7 +815,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>  again:
>  	__netif_tx_lock(txq, smp_processor_id());
>  	virtqueue_disable_cb(sq->vq);
> -	sent += free_old_xmit_skbs(sq, budget - sent);
> +	sent += free_old_xmit_skbs(txq, sq, budget - sent);
>  
>  	if (sent < budget) {
>  		enable_done = virtqueue_enable_cb_delayed(sq->vq);
> @@ -962,12 +967,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>  	bool kick = !skb->xmit_more;
>  	bool stopped;
> +	unsigned int bytes = skb->len;
>  
>  	virtqueue_disable_cb(sq->vq);
>  
>  	/* We are going to push one skb.
>  	 * Try to pop one off to free space for it. */
> -	free_old_xmit_skbs(sq, 1);
> +	free_old_xmit_skbs(txq, sq, 1);
>  
>  	/* Try to transmit */
>  	err = xmit_skb(sq, skb);
> @@ -983,6 +989,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  		return NETDEV_TX_OK;
>  	}
>  
> +	netdev_tx_sent_queue(txq, bytes);
> +
> +	/* Kick early so device can process descriptors in parallel with us. */
> +	if (kick)
> +		virtqueue_kick(sq->vq);

Haven't figured out how this will help for bql, consider only a
netif_stop_subqueue() may be called during two possible kicks. And since
we don't add any buffer between the two kicks, the send kick is almost
useless.
> +
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> @@ -997,7 +1009,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  
>  	if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>  		/* More just got used, free them then recheck. */
> -		free_old_xmit_skbs(sq, qsize);
> +		free_old_xmit_skbs(txq, sq, qsize);
>  		if (stopped && sq->vq->num_free >= 2+MAX_SKB_FRAGS)
>  			netif_start_subqueue(dev, qnum);
>  	}

^ permalink raw reply

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Jason Wang @ 2014-10-17  5:23 UTC (permalink / raw)
  To: Rusty Russell, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm
In-Reply-To: <877g01c2ml.fsf@rustcorp.com.au>

On 10/15/2014 01:40 PM, Rusty Russell wrote:
> Jason Wang <jasowang@redhat.com> writes:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
> networking (which tends to take packets in order) couldn't we just set
> the event counter to give us a tx interrupt at the packet we want?
>
> Cheers,
> Rusty.

Yes, we could. Recent RFC of enabling tx interrupt use this.

^ permalink raw reply

* Mail Administrator
From: Webmail Administrator  @ 2014-10-17  5:58 UTC (permalink / raw)


Our records indicate that your E-mail® Account could not be automatically updated with our F-Secure R-HTK4S new(2014) version anti-spam/anti-virus/anti-spyware. Please click this link below to update manually

http://www.contactme.com/543903bb8c730e0002008ccf

We Are Sorry For Any Inconvenience.

Regards, 
Technical Support Team
Copyright © 2014. All Rights Reserved

^ permalink raw reply

* [PATCH net] r8152: use cancel_delayed_work for runtime suspend
From: Hayes Wang @ 2014-10-17  5:55 UTC (permalink / raw)
  To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang

It would cause dead lock for runtime suspend, when the workqueue
is running and runtime suspend occurs before the workqueue wakes
up the device. The rtl8152_suspend() waits the workqueue to finish
because of calling cancel_delayed_work_sync(). The workqueue waits
the suspend function to finish for waking up the device because of
calling usb_autopm_get_interface().

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
 drivers/net/usb/r8152.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 864159e..7d4e55a 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3200,12 +3200,13 @@ static int rtl8152_suspend(struct usb_interface *intf, pm_message_t message)
 	if (netif_running(tp->netdev)) {
 		clear_bit(WORK_ENABLE, &tp->flags);
 		usb_kill_urb(tp->intr_urb);
-		cancel_delayed_work_sync(&tp->schedule);
 		tasklet_disable(&tp->tl);
 		if (test_bit(SELECTIVE_SUSPEND, &tp->flags)) {
+			cancel_delayed_work(&tp->schedule);
 			rtl_stop_rx(tp);
 			rtl_runtime_suspend_enable(tp, true);
 		} else {
+			cancel_delayed_work_sync(&tp->schedule);
 			tp->rtl_ops.down(tp);
 		}
 		tasklet_enable(&tp->tl);
-- 
1.9.3

^ permalink raw reply related

* [PATCH] openvswitch: fix a use after free
From: roy.qing.li @ 2014-10-17  6:03 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, jesse

From: Li RongQing <roy.qing.li@gmail.com>

pskb_may_pull() called by arphdr_ok can change skb->data, so put the arp
setting after arphdr_ok to avoid the use the freed memory

Fixes: 0714812134d7d ("openvswitch: Eliminate memset() from flow_extract.")
Cc: Jesse Gross <jesse@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 net/openvswitch/flow.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 62db02b..c5cfc72 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -557,10 +557,11 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
 	} else if (key->eth.type == htons(ETH_P_ARP) ||
 		   key->eth.type == htons(ETH_P_RARP)) {
 		struct arp_eth_header *arp;
+		bool arp_available = arphdr_ok(skb);
 
 		arp = (struct arp_eth_header *)skb_network_header(skb);
 
-		if (arphdr_ok(skb) &&
+		if (arp_available &&
 		    arp->ar_hrd == htons(ARPHRD_ETHER) &&
 		    arp->ar_pro == htons(ETH_P_IP) &&
 		    arp->ar_hln == ETH_ALEN &&
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2] vxlan: remove the dead codes
From: roy.qing.li @ 2014-10-17  6:04 UTC (permalink / raw)
  To: netdev

From: Li RongQing <roy.qing.li@gmail.com>

remove the dead codes, no user uses vxlan_salt

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 drivers/net/vxlan.c |    4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 77ab844..855a81d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -152,8 +152,6 @@ struct vxlan_dev {
 	struct hlist_head fdb_head[FDB_HASH_SIZE];
 };
 
-/* salt for hash table */
-static u32 vxlan_salt __read_mostly;
 static struct workqueue_struct *vxlan_wq;
 
 static void vxlan_sock_work(struct work_struct *work);
@@ -2797,8 +2795,6 @@ static int __init vxlan_init_module(void)
 	if (!vxlan_wq)
 		return -ENOMEM;
 
-	get_random_bytes(&vxlan_salt, sizeof(vxlan_salt));
-
 	rc = register_pernet_subsys(&vxlan_net_ops);
 	if (rc)
 		goto out1;
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH v2] vxlan: fix a free after use
From: roy.qing.li @ 2014-10-17  6:06 UTC (permalink / raw)
  To: netdev; +Cc: edumazet

From: Li RongQing <roy.qing.li@gmail.com>

pskb_may_pull maybe change skb->data and make eth pointer oboslete,
so eth needs to reload

Fixes: 91269e390d062 ("vxlan: using pskb_may_pull as early as possible")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
 drivers/net/vxlan.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 855a81d..fabd514 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1885,6 +1885,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 				    msg->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION)
 					return neigh_reduce(dev, skb);
 		}
+		eth = eth_hdr(skb);
 #endif
 	}
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH net-next] ipvs: Avoid null-pointer deref in debug code
From: Simon Horman @ 2014-10-17  6:27 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Alex Gartrell, dan.carpenter, lvs-devel, netdev, kernel-team
In-Reply-To: <alpine.LFD.2.11.1410062159290.2258@ja.home.ssi.bg>

On Mon, Oct 06, 2014 at 10:01:08PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Mon, 6 Oct 2014, Alex Gartrell wrote:
> 
> > Use daddr instead of reaching into dest.
> > 
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Signed-off-by: Alex Gartrell <agartrell@fb.com>
> 
> 	Thanks!
> 
> Acked-by: Julian Anastasov <ja@ssi.bg>

Thanks, I have applied this to the ipvs tree and will see about
both getting it included in v3.18 and v3.17-stable.

> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 91f17c1..437a366 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -316,7 +316,7 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
> >  	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
> >  						  local))) {
> >  		IP_VS_DBG_RL("We are crossing local and non-local addresses"
> > -			     " daddr=%pI4\n", &dest->addr.ip);
> > +			     " daddr=%pI4\n", &daddr);
> >  		goto err_put;
> >  	}
> >  
> > @@ -458,7 +458,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
> >  	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
> >  						  local))) {
> >  		IP_VS_DBG_RL("We are crossing local and non-local addresses"
> > -			     " daddr=%pI6\n", &dest->addr.in6);
> > +			     " daddr=%pI6\n", daddr);
> >  		goto err_put;
> >  	}
> >  
> > -- 
> > Alex Gartrell <agartrell@fb.com>
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox