Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH net-next 1/6] virtio: make sure used event never go backwards
From: Jason Wang @ 2014-10-17  5:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: eric.dumazet, netdev, linux-kernel, virtualization, davem
In-Reply-To: <20141015113859.GA26918@redhat.com>

On 10/15/2014 07:38 PM, Michael S. Tsirkin wrote:
> On Wed, Oct 15, 2014 at 06:44:41PM +0800, Jason Wang wrote:
>> On 10/15/2014 06:32 PM, Michael S. Tsirkin wrote:
>>> On Wed, Oct 15, 2014 at 06:13:19PM +0800, Jason Wang wrote:
>>>> On 10/15/2014 05:34 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Oct 15, 2014 at 03:25:25PM +0800, Jason Wang wrote:
>>>>>> This patch checks the new event idx to make sure used event idx never
>>>>>> goes back. This is used to synchronize the calls between
>>>>>> virtqueue_enable_cb_delayed() and virtqueue_enable_cb().
>>>>>>
>>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> the implication being that moving event idx back might cause some race
>>>>> condition?  
>>>> This will cause race condition when tx interrupt is enabled. Consider
>>>> the following cases
>>>>
>>>> 1) tx napi was scheduled
>>>> 2) start_xmit() call virtqueue_enable_cb_delayed() and disable cb, [used
>>>> event is vq->last_used_idx + 3/4 pendg bufs]
>>>> 3) tx napi enable the callback by virtqueue_enable_cb() [ used event is
>>>> vq->last_used_idx ]
>>>>  
>>>> After step 3, used event was moved back, unnecessary tx interrupt was
>>>> triggered.
>>> Well unnecessary interrupts are safe.
>> But it that is what we want to reduce.
> It's all about correctness. I don't think mixing enable_cb
> and enable_cb_delayed makes sense, let's just make
> virtio behave correctly if that happens, no need to
> optimize for that.

Then as you said, need document or add WARN_ON() or BUG() in case both
of the two are used.
>
>
>>> With your patch caller of virtqueue_enable_cb will not get an
>>> interrupt on the next buffer which is not safe.
>>>
>>> If you don't want an interrupt on the next buffer, don't
>>> call virtqueue_enable_cb.
>> So something like this patch should be done in virtio core somewhere
>> else. Virtio-net can not do this since it does not have the knowledge of
>> event index.
> Take a look at my patch - no calls to enable_cb, only
> enable_cb_delayed, so we should be fine.
>
>>>>> If yes but please describe the race explicitly.
>>>>> Is there a bug we need to fix on stable?
>>>> Looks not, current code does not have such race condition.
>>>>> Please also explicitly describe a configuration that causes event idx
>>>>> to go back.
>>>>>
>>>>> All this info should go in the commit log.
>>>> Will do this.
>>>>>> ---
>>>>>>  drivers/virtio/virtio_ring.c |    7 +++++--
>>>>>>  1 files changed, 5 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 3b1f89b..1b3929f 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -559,14 +559,17 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>>>>>>  	u16 last_used_idx;
>>>>>>  
>>>>>>  	START_USE(vq);
>>>>>> -
>>>>>> +	last_used_idx = vq->last_used_idx;
>>>>>>  	/* We optimistically turn back on interrupts, then check if there was
>>>>>>  	 * more to do. */
>>>>>>  	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
>>>>>>  	 * either clear the flags bit or point the event index at the next
>>>>>>  	 * entry. Always do both to keep code simple. */
>>>>>>  	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
>>>>>> -	vring_used_event(&vq->vring) = last_used_idx = vq->last_used_idx;
>>>>>> +	/* Make sure used event never go backwards */
>>>>> s/go/goes/
>>>>>
>>>>>> +	if (!vring_need_event(vring_used_event(&vq->vring),
>>>>>> +			      vq->vring.avail->idx, last_used_idx))
>>>>>> +		vring_used_event(&vq->vring) = last_used_idx;
>>>>> The result will be that driver will *not* get an interrupt
>>>>> on the next consumed buffer, which is likely not what driver
>>>>> intended when it called virtqueue_enable_cb.
>>>> This will only happen when we want to delay the interrupt for next few
>>>> consumed buffers (virtqueue_enable_cb_delayed() was called). For the
>>>> other case, vq->last_used_idx should be ahead of previous used event. Do
>>>> you see any other case?
>>> Call virtqueue_enable_cb_delayed, later call virtqueue_enable_cb.  If
>>> event index is not updated in virtqueue_enable_cb, driver will not get
>>> an interrupt on the next buffer.
>> This is just what we want I think. The interrupt was not lost but fired
>> after 3/4 pending buffers were consumed. Do you see any real issue on this?
> Yes, this violates the API. For example device might never
> consume the rest of buffers.

Then it should be a bug of device which is out of the control of guest.
If not, device might never also consume 3/4 rest of buffers.
>
>>>>> Instead, how about we simply document the requirement that drivers either
>>>>> always call virtqueue_enable_cb_delayed or virtqueue_enable_cb
>>>>> but not both?
>>>> We need call them both when tx interrupt is enabled I believe.
>>> Can you pls reply to my patch and document issues you see?
>>>
>> In the previous reply you said you're using
>> virtuqueue_enable_cb_delayed(), so no race in your patch.
> OK so you think my patch is also correct, but that yours gives better
> efficiency?
>

Need some benchmark to see the difference I think.

^ permalink raw reply

* [RFC PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Erik Kline @ 2014-10-17  4:31 UTC (permalink / raw)
  To: netdev; +Cc: hannes, lorenzo, edumazet, Erik Kline

Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes.  Preferred addresses
will still take precendence over optimistic addresses, subject
to other ranking in the source address selection algorithm.

This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).

The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s), but not in the multiple distinct
networks case.

For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.

Also: add an entry in ip-sysctl.txt for optimistic_dad.

Signed-off-by: Erik Kline <ek@google.com>
---
 Documentation/networking/ip-sysctl.txt | 13 +++++++++++++
 include/linux/ipv6.h                   |  1 +
 include/uapi/linux/ipv6.h              |  1 +
 net/ipv6/addrconf.c                    | 29 ++++++++++++++++++++++++++++-
 4 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7a81ac..9bf32bc 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1458,6 +1458,19 @@ suppress_frag_ndisc - INTEGER
 	1 - (default) discard fragmented neighbor discovery packets
 	0 - allow fragmented neighbor discovery packets
 
+optimistic_dad - BOOLEAN
+	Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+		0: disabled (default)
+		1: enabled
+
+use_optimistic - BOOLEAN
+	If enabled, do not classify optimistic addresses as deprecated during
+	source address selection.  Preferred addresses will still be chosen
+	before optimistic addresses, subject to other ranking in the source
+	address selection algorithm.
+		0: disabled (default)
+		1: enabled
+
 icmp/*:
 ratelimit - INTEGER
 	Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index ff56053..7121a2e 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -42,6 +42,7 @@ struct ipv6_devconf {
 	__s32		accept_ra_from_local;
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 	__s32		optimistic_dad;
+	__s32		use_optimistic;
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 	__s32		mc_forwarding;
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index efa2666..5b2c6d9 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -154,6 +154,7 @@ enum {
 	DEVCONF_ACCEPT_RA_RT_INFO_MAX_PLEN,
 	DEVCONF_PROXY_NDP,
 	DEVCONF_OPTIMISTIC_DAD,
+	DEVCONF_USE_OPTIMISTIC,
 	DEVCONF_ACCEPT_SOURCE_ROUTE,
 	DEVCONF_MC_FORWARDING,
 	DEVCONF_DISABLE_IPV6,
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 725c763..fc8c08d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1169,6 +1169,9 @@ enum {
 	IPV6_SADDR_RULE_LABEL,
 	IPV6_SADDR_RULE_PRIVACY,
 	IPV6_SADDR_RULE_ORCHID,
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	IPV6_SADDR_RULE_NOT_OPTIMISTIC,
+#endif
 	IPV6_SADDR_RULE_PREFIX,
 	IPV6_SADDR_RULE_MAX
 };
@@ -1257,10 +1260,17 @@ static int ipv6_get_saddr_eval(struct net *net,
 		score->scopedist = ret;
 		break;
 	case IPV6_SADDR_RULE_PREFERRED:
+	    {
 		/* Rule 3: Avoid deprecated and optimistic addresses */
+		u8 avoid = IFA_F_DEPRECATED;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+		if (!score->ifa->idev->cnf.use_optimistic)
+			avoid |= IFA_F_OPTIMISTIC;
+#endif
 		ret = ipv6_saddr_preferred(score->addr_type) ||
-		      !(score->ifa->flags & (IFA_F_DEPRECATED|IFA_F_OPTIMISTIC));
+		      !(score->ifa->flags & avoid);
 		break;
+	    }
 #ifdef CONFIG_IPV6_MIP6
 	case IPV6_SADDR_RULE_HOA:
 	    {
@@ -1299,6 +1309,14 @@ static int ipv6_get_saddr_eval(struct net *net,
 		ret = !(ipv6_addr_orchid(&score->ifa->addr) ^
 			ipv6_addr_orchid(dst->addr));
 		break;
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+	case IPV6_SADDR_RULE_NOT_OPTIMISTIC:
+		/* Optimistic addresses still have lower precedence than other
+		 * preferred addresses.
+		 */
+		ret = !(score->ifa->flags & IFA_F_OPTIMISTIC);
+		break;
+#endif
 	case IPV6_SADDR_RULE_PREFIX:
 		/* Rule 8: Use longest matching prefix */
 		ret = ipv6_addr_diff(&score->ifa->addr, dst->addr);
@@ -4330,6 +4348,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
 	array[DEVCONF_ACCEPT_SOURCE_ROUTE] = cnf->accept_source_route;
 #ifdef CONFIG_IPV6_OPTIMISTIC_DAD
 	array[DEVCONF_OPTIMISTIC_DAD] = cnf->optimistic_dad;
+	array[DEVCONF_USE_OPTIMISTIC] = cnf->use_optimistic;
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 	array[DEVCONF_MC_FORWARDING] = cnf->mc_forwarding;
@@ -5155,6 +5174,14 @@ static struct addrconf_sysctl_table
 			.proc_handler   = proc_dointvec,
 
 		},
+		{
+			.procname       = "use_optimistic",
+			.data           = &ipv6_devconf.use_optimistic,
+			.maxlen         = sizeof(int),
+			.mode           = 0644,
+			.proc_handler   = proc_dointvec,
+
+		},
 #endif
 #ifdef CONFIG_IPV6_MROUTE
 		{
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: Jiafei.Pan @ 2014-10-17  2:35 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: David Miller, jkosina@suse.cz, netdev@vger.kernel.org,
	LeoLi@freescale.com, linux-doc@vger.kernel.org,
	Jiafei.Pan@freescale.com
In-Reply-To: <543FE413.6030406@redhat.com>


> -----Original Message-----
> From: Alexander Duyck [mailto:alexander.h.duyck@redhat.com]
> Sent: Thursday, October 16, 2014 11:28 PM
> To: Pan Jiafei-B37022; Eric Dumazet
> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> 
> On 10/15/2014 10:15 PM, Jiafei.Pan@freescale.com wrote:
> >> -----Original Message-----
> >> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> >> Sent: Thursday, October 16, 2014 12:15 PM
> >> To: Pan Jiafei-B37022
> >> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> >> linux-doc@vger.kernel.org
> >> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> >>
> >> On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:
> >>
> >>> Thanks for your comments and suggestion. In my case, I want to build skb
> >>> from hardware block specified memory, I only can see two ways, one is
> modified
> >>> net card driver replace common skb allocation function with my specially
> >>> functions, another way is to hack common skb allocation function in which
> >>> redirect to my specially functions. My patch is just for the second way.
> >>> Except these two ways, would you please give me some advice for some other
> >>> ways for my case? Thanks
> >> I suggest you read drivers/net/ethernet numerous examples.
> >>
> >> No need to change anything  in net/* or include/*, really.
> >>
> >> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
> >>
> >> Mentioning 'hack' in your mails simply should hint you are doing
> >> something very wrong.
> >>
> >> What makes you think your hardware is so special ?
> >>
> > In fact, I am developing a bridge driver, it can bridge between any other the
> > third party net card and my own net card. My target is to let any other the
> > third party net card can directly use my own net card specified buffer, then
> > there will be no memory copy in the whole bridge process.
> > By the way, I don’t see any similar between igb_main.c and my case. And also
> > My bridge also can’t implemented with "skb frag" in order to aim at zero
> memory
> > copy.
> 
> I think the part you are not getting is that is how buffers are
> essentially handled now.  

[Pan Jiafei] Hi, Alex, thanks for your comments. I don’t confirm that
you have catch my concerns. For example, I want to add igb net card 
into my bridge, and want to igb net driver allocate skb by using
my specified memory address, but I don’t want to modify igb net driver
directly, how to do this in my bridge drivers?

Thanks,
Jiafei.

So for example in the case if igb the only
> part we have copied out is usually the header, or the entire frame in
> the case of small packets.  This has to happen in order to allow for
> changes to the header for routing and such.  Beyond that the frags that
> are passed are the buffers that igb is still holding onto.  So
> effectively what the other device transmits in a bridging/routing
> scenario is my own net card specified buffer plus the copied/modified
> header.
> 
> For a brief period igb used build_skb but that isn't valid on most
> systems as memory mapped for a device can be overwritten if the page is
> unmapped resulting in any changes to the header for routing/bridging
> purposes being invalidated.  Thus we cannot use the buffers for both the
> skb->data header which may be changed and Rx DMA simultaneously.
> 
> Thanks,
> 
> Alex

^ permalink raw reply

* [PATCH net] net: dsa: remove phy.h and phy_fixed.h inclusions
From: Florian Fainelli @ 2014-10-17  0:55 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli

There is no need to include phy.h nor phy_fixed.h when we can use
forward declarations instead to keep the include chain smaller.

Doing this unveiled that we were implicitely getting the definitions for
struct ethtool_eee and struct ethtool_wolinfo, and that net/dsa/slave.c
was missing an include of phy_fixed.h.

Fixes: ec9436baedb6 ("net: dsa: allow drivers to do link adjustment")
Fixes: ce31b31c68e7 ("net: dsa: allow updating fixed PHY link information")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/net/dsa.h | 7 +++++--
 net/dsa/slave.c   | 1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 58ad8c6492db..0c38b083e6eb 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -16,8 +16,11 @@
 #include <linux/timer.h>
 #include <linux/workqueue.h>
 #include <linux/of.h>
-#include <linux/phy.h>
-#include <linux/phy_fixed.h>
+
+struct phy_device;
+struct fixed_phy_status;
+struct ethtool_eee;
+struct ethtool_wolinfo;
 
 enum dsa_tag_protocol {
 	DSA_TAG_PROTO_NONE = 0,
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 8030489d9cbe..a851e9f14118 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -11,6 +11,7 @@
 #include <linux/list.h>
 #include <linux/etherdevice.h>
 #include <linux/phy.h>
+#include <linux/phy_fixed.h>
 #include <linux/of_net.h>
 #include <linux/of_mdio.h>
 #include "dsa_priv.h"
-- 
1.9.1

^ permalink raw reply related

* Re: tcpdump's capture filter: "vlan" doesn't match
From: Ani Sinha @ 2014-10-16 23:39 UTC (permalink / raw)
  To: Lukas Tribus
  Cc: Daniel Borkmann, netdev@vger.kernel.org, John Fastabend,
	Michał Mirosław, Jiri Pirko, Ben Hutchings,
	Atzm Watanabe, Patrick McHardy, Jesse Gross, Michael Richardson,
	Ani Sinha, fenner
In-Reply-To: <DUB123-W25C5DA9BEDC92CD2524B68EDAB0@phx.gbl>

+fenner.

I had spent some considerable time in the past looking into this and
proposing a patch :

http://seclists.org/tcpdump/2013/q1/0

However, there was no feedback and I got sucked into a different
project and this work fell through the cracks. If someone else picks
it up, I will be glad to help/lend a hand.

Cheers,
ani


On Thu, Oct 16, 2014 at 4:25 PM, Lukas Tribus <luky-37@hotmail.com> wrote:
>>> Isn't disabling rx-vlan-offloading supposed to remedy those problems?
>>
>> There were some discussions on this in the past e.g. [1]. We have
>> SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT for the BPF filter on
>> this, but libpcap is currently not making use of any of them.
>>
>> [1] http://thread.gmane.org/gmane.linux.network/247947
>
> Thanks for the link. I see the situation is unfortunate and although those
> new BPF filters in the kernel may fix the actual filtering problem, one
> thing seems to remain impossible: disabling all this kernel magic and
> passing the frame as-is to libpcap without interception (avoiding any
> kind of artificial header reconstruction).
>
> How is the situation with netsniff-ng anyway? Does it use vlan BPF filter
> in the kernel?
>
>
>
> Regards,
>
> Lukas
>
>

^ permalink raw reply

* RE: tcpdump's capture filter: "vlan" doesn't match
From: Lukas Tribus @ 2014-10-16 23:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev@vger.kernel.org, John Fastabend, Michał Mirosław,
	Jiri Pirko, Ben Hutchings, Atzm Watanabe, Patrick McHardy,
	Jesse Gross, Michael Richardson, Ani Sinha
In-Reply-To: <543F616C.5040801@redhat.com>

>> Isn't disabling rx-vlan-offloading supposed to remedy those problems?
>
> There were some discussions on this in the past e.g. [1]. We have
> SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT for the BPF filter on
> this, but libpcap is currently not making use of any of them.
>
> [1] http://thread.gmane.org/gmane.linux.network/247947

Thanks for the link. I see the situation is unfortunate and although those
new BPF filters in the kernel may fix the actual filtering problem, one
thing seems to remain impossible: disabling all this kernel magic and
passing the frame as-is to libpcap without interception (avoiding any
kind of artificial header reconstruction).

How is the situation with netsniff-ng anyway? Does it use vlan BPF filter
in the kernel?



Regards,

Lukas

 		 	   		  

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-16 22:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413495626.28798.38.camel@edumazet-glaptop2.roam.corp.google.com>


On 10/16/2014 02:40 PM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:
>
>> My concern would be that we are off by a factor of 2 and prematurely
>> collapse the TCP too soon with this change.
> That is the opposite actually. We can consume 4K but we pretend we
> consume 2K in some worst cases.

The only case where we consume the full 4K but only list it as 2K should 
be if we have memory from the wrong node and we want to flush it from 
the descriptor queue.  For all other cases we should be using the page 
at least twice per buffer.  So the the first page that was assigned for 
an Rx descriptor might be flushed but then after that reuse should take 
hold and stay in place as long as the NAPI poll doesn't change NUMA nodes.

That should be no worse than the case where the remaining space in a 
large page is not large enough to use as a buffer. You still use the 
current size as your truesize, you don't include the overhead of the 
unused space in your calculation.

>>   For example if you are
>> looking at a socket that is holding pages for a long period of time
>> there would be a good chance of it ending up with both halves of the
>> page.  In this case is it fair to charge it for 8K or memory use when in
>> reality it is only using 4K?
> Its better to collapse too soon than too late.
>
> If you want to avoid collapses because one host has plenty of memory,
> all you need to do is increase tcp_rmem.
>
> Why are you referring to 8K ? PAGE_SIZE is 4K

The truesize would be reported as 8K vs 4K for 2 half pages with your 
change if we were to hand off both halves of a page to the same socket.

The 2K value makes sense and is consistent with how we handle this in 
other cases where we are partitioning pages for use as network buffers.  
I think increasing this to 4K is just going to cause performance issues 
as flows are going to get choked off prematurely for memory usage that 
they aren't actually getting.

Part of my hesitation is that I spent the last couple of years 
explaining to our performance testing team and customers that they need 
to adjust tcp_rmem with all of the changes that have been made to 
truesize and the base network drivers, and I think I would prefer it if 
I didn't have to go another round of it.  Then again I probably won't 
have to anyway since I am not doing drivers for Intel any more, but 
still my reaction to this kind of change is what it is.

Thanks,

Alex





^ permalink raw reply

* [PATCH net-next, v3] hyperv: Add handling of IP header with option field in netvsc_set_hash()
From: Haiyang Zhang @ 2014-10-16 21:47 UTC (permalink / raw)
  To: davem, netdev; +Cc: olaf, jasowang, driverdev-devel, linux-kernel, haiyangz

In case that the IP header has optional field at the end, this patch will
get the port numbers after that field, and compute the hash. The general
parser skb_flow_dissect() is used here.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c |   26 ++++++++++----------------
 1 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 0fcb5e7..9e17d1a 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -162,7 +162,7 @@ union sub_key {
  * data: network byte order
  * return: host byte order
  */
-static u32 comp_hash(u8 *key, int klen, u8 *data, int dlen)
+static u32 comp_hash(u8 *key, int klen, void *data, int dlen)
 {
 	union sub_key subk;
 	int k_next = 4;
@@ -176,7 +176,7 @@ static u32 comp_hash(u8 *key, int klen, u8 *data, int dlen)
 	for (i = 0; i < dlen; i++) {
 		subk.kb = key[k_next];
 		k_next = (k_next + 1) % klen;
-		dt = data[i];
+		dt = ((u8 *)data)[i];
 		for (j = 0; j < 8; j++) {
 			if (dt & 0x80)
 				ret ^= subk.ka;
@@ -190,26 +190,20 @@ static u32 comp_hash(u8 *key, int klen, u8 *data, int dlen)
 
 static bool netvsc_set_hash(u32 *hash, struct sk_buff *skb)
 {
-	struct iphdr *iphdr;
+	struct flow_keys flow;
 	int data_len;
-	bool ret = false;
 
-	if (eth_hdr(skb)->h_proto != htons(ETH_P_IP))
+	if (!skb_flow_dissect(skb, &flow) || flow.n_proto != htons(ETH_P_IP))
 		return false;
 
-	iphdr = ip_hdr(skb);
+	if (flow.ip_proto == IPPROTO_TCP)
+		data_len = 12;
+	else
+		data_len = 8;
 
-	if (iphdr->version == 4) {
-		if (iphdr->protocol == IPPROTO_TCP)
-			data_len = 12;
-		else
-			data_len = 8;
-		*hash = comp_hash(netvsc_hash_key, HASH_KEYLEN,
-				  (u8 *)&iphdr->saddr, data_len);
-		ret = true;
-	}
+	*hash = comp_hash(netvsc_hash_key, HASH_KEYLEN, &flow, data_len);
 
-	return ret;
+	return true;
 }
 
 static u16 netvsc_select_queue(struct net_device *ndev, struct sk_buff *skb,
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-16 21:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <54400C6C.7010405@redhat.com>

On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:

> My concern would be that we are off by a factor of 2 and prematurely 
> collapse the TCP too soon with this change. 

That is the opposite actually. We can consume 4K but we pretend we
consume 2K in some worst cases.

>  For example if you are 
> looking at a socket that is holding pages for a long period of time 
> there would be a good chance of it ending up with both halves of the 
> page.  In this case is it fair to charge it for 8K or memory use when in 
> reality it is only using 4K?

Its better to collapse too soon than too late.

If you want to avoid collapses because one host has plenty of memory,
all you need to do is increase tcp_rmem.

Why are you referring to 8K ? PAGE_SIZE is 4K



^ permalink raw reply

* Re: [PATCH] [trivial] treewide: Fix company name in module descriptions
From: Chris Snook @ 2014-10-16 19:09 UTC (permalink / raw)
  To: Masanari Iida
  Cc: LKML, mturquette, trivial, linus.walleij, gnurou, James Cliburn,
	viresh.linux, inki.dae, dh09.lee, kgene.kim, rdunlap,
	netdev@vger.kernel.org
In-Reply-To: <1413472164-22366-1-git-send-email-standby24x7@gmail.com>

On Thu, Oct 16, 2014 at 8:09 AM, Masanari Iida <standby24x7@gmail.com> wrote:
> This patch fix company name's spelling typo in module descriptions
> and a Kconfig.
>
> Signed-off-by: Masanari Iida <standby24x7@gmail.com>

> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index 72fb86b..c9946c6 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -48,7 +48,7 @@ MODULE_DEVICE_TABLE(pci, atl1c_pci_tbl);
>
>  MODULE_AUTHOR("Jie Yang");
>  MODULE_AUTHOR("Qualcomm Atheros Inc., <nic-devel@qualcomm.com>");
> -MODULE_DESCRIPTION("Qualcom Atheros 100/1000M Ethernet Network Driver");
> +MODULE_DESCRIPTION("Qualcomm Atheros 100/1000M Ethernet Network Driver");
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(ATL1C_DRV_VERSION);
>

Acked-by: Chris Snook <chris.snook@gmail.com>

^ permalink raw reply

* Re: [PATCH net] netlink: fix description of portid
From: David Miller @ 2014-10-16 18:53 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <1413467271-6385-1-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 16 Oct 2014 15:47:51 +0200

> Avoid confusion between pid and portid.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied, thanks Nicolas.

We've been plagued by this confusion forever, because the value used to
even be stored in struct members and variables named 'pid'.

^ permalink raw reply

* Re: [net 0/4][pull request] Intel Wired LAN Driver Updates 2014-10-16
From: David Miller @ 2014-10-16 18:43 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene
In-Reply-To: <1413452187-2950-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 16 Oct 2014 02:36:23 -0700

> This series contains updates to fm10k and ixgbe.
> 
> Matthew provides two fixes for fm10k, first sets the flag to fetch the
> host state before kicking off the service task that reads the host
> state when bringing the interface up.  The second makes sure that we
> release the mailbox lock after detecting an error and before we return
> the error code.
> 
> Andy Zhou provides a compile fix for fm10k, when the driver is compiled
> into the kernel and the VXLAN driver is compiled as a module.
> 
> Emil provides a fix for ixgbe to prevent against a panic by trying
> to dereference a NULL pointer in ixgbe_ndo_set_vf_spoofchk().

Series applied, thanks Jeff.

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-16 18:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413481529.28798.29.camel@edumazet-glaptop2.roam.corp.google.com>


On 10/16/2014 10:45 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:
>
>> So if a page is used twice we are double counting the page size for the
>> socket then, is that correct?  I just want to make sure because prior to
>> this patch both flows did the same thing and counted the portion of the
>> page used in this pass, now with this change for PAGE_SIZE of 4K we
>> count the entire page, and for all other cases we count the portion of
>> the page used.
> When a page is split in 2 parts only, probability that a segment holds
> the 4K page is quite high (There is a single half page)

Actually the likelihood of anything holding onto the 4K page for very 
long doesn't seem to occur, at least from the drivers perspective.  It 
is one of the reasons why I went for the page reuse approach rather than 
just partitioning a single large page.  It allows us to avoid having to 
call IOMMU map/unmap for the pages since the entire page is usually back 
in the driver ownership before we need to reuse the portion given to the 
stack.

> When we split say 64KB in 42 segments, the probability a single segment
> hold the full 64KB block is very low, so we can almost be safe when we
> consider 'truesize = 1536'

Yes, but the likelihood that only a few segments are holding the page is 
still very high.  So you might not have one segment holding the 64K 
page, but I find it very difficult to believe that all 42 would be 
holding it at the same time.  In that case should we be adding some 
portion of the 64K to the truesize for all frames to account for this?

> Of course there are pathological cases, but attacker has to be quite
> smart.
>
> I am just saying that counting 2048 might have a big impact on memory
> consumption if all these incoming segments are stored a long time in
> receive queues (TCP receive queues or out of order queues) : We might be
> off by a factor of 2 on the real memory usage, and delay the TCP
> collapsing too much.

My concern would be that we are off by a factor of 2 and prematurely 
collapse the TCP too soon with this change.  For example if you are 
looking at a socket that is holding pages for a long period of time 
there would be a good chance of it ending up with both halves of the 
page.  In this case is it fair to charge it for 8K or memory use when in 
reality it is only using 4K?

Thanks,

Alex


^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Eric Dumazet @ 2014-10-16 17:48 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: Lluís Batlle i Rossell, linux-kernel, netdev,
	Carles Pagès, linux-arm-kernel
In-Reply-To: <20141016173401.GA16515@f1.synalogic.ca>

On Thu, 2014-10-16 at 10:34 -0700, Benjamin Poirier wrote:
> On 2014/10/15 15:45, Eric Dumazet wrote:

> >  kmap_atomic()/kunmap_atomic() is missing, so we lack
> > __cpuc_flush_dcache_area() operations.
> > 
> 
> You lost me there.
> 1) I don't see the link
> 2) It seems kmap_atomic and so on are there:
> $ grep kmap_atomic System.map-3.16-2-kirkwood
> c0014838 T kmap_atomic
> c001491c T kmap_atomic_pfn
> c00149a4 T kmap_atomic_to_page
> 
> MACH_KIRKWOOD selects CPU_FEROCEON which has
> __cpuc_flush_dcache_area ->
> 	cpu_cache.flush_kern_dcache_area ->
> 		feroceon_flush_kern_dcache_area

I meant to put a '?' instead of a '.'

Note that tcp does a copy, using :

^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Lluís Batlle i Rossell @ 2014-10-16 17:46 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: Eric Dumazet, linux-kernel, netdev, Carles Pagès,
	linux-arm-kernel
In-Reply-To: <20141016173401.GA16515@f1.synalogic.ca>

On Thu, Oct 16, 2014 at 10:34:01AM -0700, Benjamin Poirier wrote:
> On 2014/10/15 15:45, Eric Dumazet wrote:
> > On Wed, 2014-10-15 at 14:57 -0700, Benjamin Poirier wrote:
> > > On 2014/10/13 12:52, Lluís Batlle i Rossell wrote:
> > > > Hello,
> > > > 
> > > > on the 7th of January 2014 ths patch was applied:
> > > > https://lkml.org/lkml/2014/1/7/307
> > > > 
> > > > [PATCH v2] net: Do not enable tx-nocache-copy by default
> > > >         
> > > > In the Sheevaplug (ARM Feroceon 88FR131 from Marvell) this made packets to be
> > > > sent corrupted. I think this machine has something special about the cache.
> > > > 
> > > > Enabling back this tx-nocache-copy (as it used to be before the patch) the
> > > > transfers work fine again. I think that most people, encountering this problem,
> > > > completely disable the tx offload instead of enabling back this setting.
> > > > 
> > > > Is this an ARM kernel problem regarding this platform?
> > > 
> > > This is odd, only x86 defines ARCH_HAS_NOCACHE_UACCESS. On arm,
> > > skb_do_copy_data_nocache() should end up using __copy_from_user()
> > > regardless of tx-nocache-copy.
> > 
> >  kmap_atomic()/kunmap_atomic() is missing, so we lack
> > __cpuc_flush_dcache_area() operations.
> > 
> 
> You lost me there.
> 1) I don't see the link
> 2) It seems kmap_atomic and so on are there:
> $ grep kmap_atomic System.map-3.16-2-kirkwood
> c0014838 T kmap_atomic
> c001491c T kmap_atomic_pfn
> c00149a4 T kmap_atomic_to_page
> 
> MACH_KIRKWOOD selects CPU_FEROCEON which has
> __cpuc_flush_dcache_area ->
> 	cpu_cache.flush_kern_dcache_area ->
> 		feroceon_flush_kern_dcache_area

Hello all,

it seems I was a bit wrong - although enabling back tx-nocache-copy makes the
tx-errors happen much less often (ssh complaining about HMAC), they still
happen. It seems that something was introduced in some recent kernels that broke
the tx offload.

I have no idea what it can be, but since 2.6 until at least 3.10 the network
driver worked fine with tx offload in this sheevaplug board.

Regards,
Lluís.

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-16 17:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <543FFC03.1060207@redhat.com>

On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:

> So if a page is used twice we are double counting the page size for the 
> socket then, is that correct?  I just want to make sure because prior to 
> this patch both flows did the same thing and counted the portion of the 
> page used in this pass, now with this change for PAGE_SIZE of 4K we 
> count the entire page, and for all other cases we count the portion of 
> the page used.

When a page is split in 2 parts only, probability that a segment holds
the 4K page is quite high (There is a single half page)

When we split say 64KB in 42 segments, the probability a single segment
hold the full 64KB block is very low, so we can almost be safe when we
consider 'truesize = 1536'

Of course there are pathological cases, but attacker has to be quite
smart.

I am just saying that counting 2048 might have a big impact on memory
consumption if all these incoming segments are stored a long time in
receive queues (TCP receive queues or out of order queues) : We might be
off by a factor of 2 on the real memory usage, and delay the TCP
collapsing too much.



^ permalink raw reply

* Re: Regarding tx-nocache-copy in the Sheevaplug
From: Benjamin Poirier @ 2014-10-16 17:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Lluís Batlle i Rossell, linux-kernel, netdev,
	Carles Pagès, linux-arm-kernel
In-Reply-To: <1413413127.17186.5.camel@edumazet-glaptop2.roam.corp.google.com>

On 2014/10/15 15:45, Eric Dumazet wrote:
> On Wed, 2014-10-15 at 14:57 -0700, Benjamin Poirier wrote:
> > On 2014/10/13 12:52, Lluís Batlle i Rossell wrote:
> > > Hello,
> > > 
> > > on the 7th of January 2014 ths patch was applied:
> > > https://lkml.org/lkml/2014/1/7/307
> > > 
> > > [PATCH v2] net: Do not enable tx-nocache-copy by default
> > >         
> > > In the Sheevaplug (ARM Feroceon 88FR131 from Marvell) this made packets to be
> > > sent corrupted. I think this machine has something special about the cache.
> > > 
> > > Enabling back this tx-nocache-copy (as it used to be before the patch) the
> > > transfers work fine again. I think that most people, encountering this problem,
> > > completely disable the tx offload instead of enabling back this setting.
> > > 
> > > Is this an ARM kernel problem regarding this platform?
> > 
> > This is odd, only x86 defines ARCH_HAS_NOCACHE_UACCESS. On arm,
> > skb_do_copy_data_nocache() should end up using __copy_from_user()
> > regardless of tx-nocache-copy.
> 
>  kmap_atomic()/kunmap_atomic() is missing, so we lack
> __cpuc_flush_dcache_area() operations.
> 

You lost me there.
1) I don't see the link
2) It seems kmap_atomic and so on are there:
$ grep kmap_atomic System.map-3.16-2-kirkwood
c0014838 T kmap_atomic
c001491c T kmap_atomic_pfn
c00149a4 T kmap_atomic_to_page

MACH_KIRKWOOD selects CPU_FEROCEON which has
__cpuc_flush_dcache_area ->
	cpu_cache.flush_kern_dcache_area ->
		feroceon_flush_kern_dcache_area

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-16 17:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413478657.28798.22.camel@edumazet-glaptop2.roam.corp.google.com>


On 10/16/2014 09:57 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 08:28 -0700, Alexander Duyck wrote:
>
>> I think the part you are not getting is that is how buffers are
>> essentially handled now.  So for example in the case if igb the only
>> part we have copied out is usually the header, or the entire frame in
>> the case of small packets.  This has to happen in order to allow for
>> changes to the header for routing and such.  Beyond that the frags that
>> are passed are the buffers that igb is still holding onto.  So
>> effectively what the other device transmits in a bridging/routing
>> scenario is my own net card specified buffer plus the copied/modified
>> header.
>>
>> For a brief period igb used build_skb but that isn't valid on most
>> systems as memory mapped for a device can be overwritten if the page is
>> unmapped resulting in any changes to the header for routing/bridging
>> purposes being invalidated.  Thus we cannot use the buffers for both the
>> skb->data header which may be changed and Rx DMA simultaneously.
> This reminds me that igb still has skb->truesize underestimation by 100%
>
> If a fragment is held in some socket receive buffer, a full page is
> consumed, not 2048 bytes.
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index a21b14495ebd..56ca6c78985e 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -6586,9 +6586,11 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
>   	struct page *page = rx_buffer->page;
>   	unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
>   #if (PAGE_SIZE < 8192)
> -	unsigned int truesize = IGB_RX_BUFSZ;
> +	unsigned int segsize = IGB_RX_BUFSZ;
> +	unsigned int truesize = PAGE_SIZE;
>   #else
> -	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
> +	unsigned int segsize = ALIGN(size, L1_CACHE_BYTES);
> +	unsigned int truesize = segsize;
>   #endif

So if a page is used twice we are double counting the page size for the 
socket then, is that correct?  I just want to make sure because prior to 
this patch both flows did the same thing and counted the portion of the 
page used in this pass, now with this change for PAGE_SIZE of 4K we 
count the entire page, and for all other cases we count the portion of 
the page used.

Thanks,

Alex



^ permalink raw reply

* Re: [PATCH] [trivial] treewide: Fix company name in module descriptions
From: Randy Dunlap @ 2014-10-16 17:01 UTC (permalink / raw)
  To: Masanari Iida, linux-kernel, mturquette, trivial, linus.walleij,
	gnurou, jcliburn, chris.snook, viresh.linux, inki.dae, dh09.lee,
	kgene.kim
  Cc: netdev
In-Reply-To: <1413472164-22366-1-git-send-email-standby24x7@gmail.com>

On 10/16/14 08:09, Masanari Iida wrote:
> This patch fix company name's spelling typo in module descriptions
> and a Kconfig.
> 
> Signed-off-by: Masanari Iida <standby24x7@gmail.com>

Acked-by: Randy Dunlap <rdunlap@infradead.org>

Thanks.

> ---
>  drivers/clk/Kconfig                                 | 2 +-
>  drivers/gpio/gpio-spear-spics.c                     | 2 +-
>  drivers/net/ethernet/atheros/atl1c/atl1c_main.c     | 2 +-
>  drivers/pinctrl/spear/pinctrl-plgpio.c              | 2 +-
>  drivers/video/fbdev/exynos/exynos_mipi_dsi.c        | 2 +-
>  drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c | 2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/clk/Kconfig b/drivers/clk/Kconfig
> index 455fd17..3f44f29 100644
> --- a/drivers/clk/Kconfig
> +++ b/drivers/clk/Kconfig
> @@ -28,7 +28,7 @@ config COMMON_CLK_WM831X
>  	depends on MFD_WM831X
>  	---help---
>            Supports the clocking subsystem of the WM831x/2x series of
> -	  PMICs from Wolfson Microlectronics.
> +	  PMICs from Wolfson Microelectronics.
>  
>  source "drivers/clk/versatile/Kconfig"
>  
> diff --git a/drivers/gpio/gpio-spear-spics.c b/drivers/gpio/gpio-spear-spics.c
> index 353263c..506a2ea 100644
> --- a/drivers/gpio/gpio-spear-spics.c
> +++ b/drivers/gpio/gpio-spear-spics.c
> @@ -204,5 +204,5 @@ static int __init spics_gpio_init(void)
>  subsys_initcall(spics_gpio_init);
>  
>  MODULE_AUTHOR("Shiraz Hashim <shiraz.linux.kernel@gmail.com>");
> -MODULE_DESCRIPTION("ST Microlectronics SPEAr SPI Chip Select Abstraction");
> +MODULE_DESCRIPTION("STMicroelectronics SPEAr SPI Chip Select Abstraction");
>  MODULE_LICENSE("GPL");
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index 72fb86b..c9946c6 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -48,7 +48,7 @@ MODULE_DEVICE_TABLE(pci, atl1c_pci_tbl);
>  
>  MODULE_AUTHOR("Jie Yang");
>  MODULE_AUTHOR("Qualcomm Atheros Inc., <nic-devel@qualcomm.com>");
> -MODULE_DESCRIPTION("Qualcom Atheros 100/1000M Ethernet Network Driver");
> +MODULE_DESCRIPTION("Qualcomm Atheros 100/1000M Ethernet Network Driver");
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(ATL1C_DRV_VERSION);
>  
> diff --git a/drivers/pinctrl/spear/pinctrl-plgpio.c b/drivers/pinctrl/spear/pinctrl-plgpio.c
> index bddb791..ce5f22c 100644
> --- a/drivers/pinctrl/spear/pinctrl-plgpio.c
> +++ b/drivers/pinctrl/spear/pinctrl-plgpio.c
> @@ -724,5 +724,5 @@ static int __init plgpio_init(void)
>  subsys_initcall(plgpio_init);
>  
>  MODULE_AUTHOR("Viresh Kumar <viresh.kumar@linaro.org>");
> -MODULE_DESCRIPTION("ST Microlectronics SPEAr PLGPIO driver");
> +MODULE_DESCRIPTION("STMicroelectronics SPEAr PLGPIO driver");
>  MODULE_LICENSE("GPL");
> diff --git a/drivers/video/fbdev/exynos/exynos_mipi_dsi.c b/drivers/video/fbdev/exynos/exynos_mipi_dsi.c
> index cee9602..716bfad 100644
> --- a/drivers/video/fbdev/exynos/exynos_mipi_dsi.c
> +++ b/drivers/video/fbdev/exynos/exynos_mipi_dsi.c
> @@ -570,5 +570,5 @@ static struct platform_driver exynos_mipi_dsi_driver = {
>  module_platform_driver(exynos_mipi_dsi_driver);
>  
>  MODULE_AUTHOR("InKi Dae <inki.dae@samsung.com>");
> -MODULE_DESCRIPTION("Samusung SoC MIPI-DSI driver");
> +MODULE_DESCRIPTION("Samsung SoC MIPI-DSI driver");
>  MODULE_LICENSE("GPL");
> diff --git a/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c b/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c
> index 85edabf..2358a2f 100644
> --- a/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c
> +++ b/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c
> @@ -876,5 +876,5 @@ int exynos_mipi_dsi_fifo_clear(struct mipi_dsim_device *dsim,
>  }
>  
>  MODULE_AUTHOR("InKi Dae <inki.dae@samsung.com>");
> -MODULE_DESCRIPTION("Samusung SoC MIPI-DSI common driver");
> +MODULE_DESCRIPTION("Samsung SoC MIPI-DSI common driver");
>  MODULE_LICENSE("GPL");
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-16 16:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
	netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <543FE413.6030406@redhat.com>

On Thu, 2014-10-16 at 08:28 -0700, Alexander Duyck wrote:

> I think the part you are not getting is that is how buffers are 
> essentially handled now.  So for example in the case if igb the only 
> part we have copied out is usually the header, or the entire frame in 
> the case of small packets.  This has to happen in order to allow for 
> changes to the header for routing and such.  Beyond that the frags that 
> are passed are the buffers that igb is still holding onto.  So 
> effectively what the other device transmits in a bridging/routing 
> scenario is my own net card specified buffer plus the copied/modified 
> header.
> 
> For a brief period igb used build_skb but that isn't valid on most 
> systems as memory mapped for a device can be overwritten if the page is 
> unmapped resulting in any changes to the header for routing/bridging 
> purposes being invalidated.  Thus we cannot use the buffers for both the 
> skb->data header which may be changed and Rx DMA simultaneously.

This reminds me that igb still has skb->truesize underestimation by 100%

If a fragment is held in some socket receive buffer, a full page is
consumed, not 2048 bytes.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a21b14495ebd..56ca6c78985e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6586,9 +6586,11 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 	struct page *page = rx_buffer->page;
 	unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
-	unsigned int truesize = IGB_RX_BUFSZ;
+	unsigned int segsize = IGB_RX_BUFSZ;
+	unsigned int truesize = PAGE_SIZE;
 #else
-	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
+	unsigned int segsize = ALIGN(size, L1_CACHE_BYTES);
+	unsigned int truesize = segsize;
 #endif
 
 	if ((size <= IGB_RX_HDR_LEN) && !skb_is_nonlinear(skb)) {
@@ -6614,7 +6616,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
 			rx_buffer->page_offset, size, truesize);
 
-	return igb_can_reuse_rx_page(rx_buffer, page, truesize);
+	return igb_can_reuse_rx_page(rx_buffer, page, segsize);
 }
 
 static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
 


^ permalink raw reply related

* [PATCH iproute2] ss: Identify a lot of netlink protocol names
From: Vadim Kochan @ 2014-10-16 16:46 UTC (permalink / raw)
  To: netdev; +Cc: Vadim Kochan

There were only few Netlink protocol names:

    rtnl, fw, tcpdiag

which were printed on output.
So added the other ones.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
 misc/ss.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 55 insertions(+), 6 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 2420b51..576db4e 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2960,12 +2960,63 @@ static int packet_show(struct filter *f)
 	return 0;
 }
 
+static char *netlink_proto_name(int proto)
+{
+	switch (proto)
+	{
+	case NETLINK_ROUTE:
+		return "rtnl";
+	case NETLINK_USERSOCK:
+		return "user";
+	case NETLINK_FIREWALL:
+		return "fw";
+	case NETLINK_SOCK_DIAG:
+		return "tcpdiag";
+	case NETLINK_NFLOG:
+		return "nflog";
+	case NETLINK_XFRM:
+		return "xfrm";
+	case NETLINK_SELINUX:
+		return "selinux";
+	case NETLINK_ISCSI:
+		return "iscsi";
+	case NETLINK_AUDIT:
+		return "audit";
+	case NETLINK_FIB_LOOKUP:
+		return "fiblookup";
+	case NETLINK_CONNECTOR:
+		return "conn";
+	case NETLINK_NETFILTER:
+		return "nft";
+	case NETLINK_IP6_FW:
+		return "ip6fw";
+	case NETLINK_DNRTMSG:
+		return "decrt";
+	case NETLINK_KOBJECT_UEVENT:
+		return "uevent";
+	case NETLINK_GENERIC:
+		return "genl";
+	case NETLINK_SCSITRANSPORT:
+		return "scsitrans";
+	case NETLINK_ECRYPTFS:
+		return "ecryptfs";
+	case NETLINK_RDMA:
+		return "rdma";
+	case NETLINK_CRYPTO:
+		return "crypto";
+	}
+
+	return NULL;
+};
+
 static void netlink_show_one(struct filter *f,
 				int prot, int pid, unsigned groups,
 				int state, int dst_pid, unsigned dst_group,
 				int rq, int wq,
 				unsigned long long sk, unsigned long long cb)
 {
+	char *prot_name;
+
 	if (f->f) {
 		struct tcpstat tst;
 		tst.local.family = AF_NETLINK;
@@ -2983,14 +3034,12 @@ static void netlink_show_one(struct filter *f,
 	if (state_width)
 		printf("%-*s ", state_width, "UNCONN");
 	printf("%-6d %-6d ", rq, wq);
-	if (resolve_services && prot == 0)
-		printf("%*s:", addr_width, "rtnl");
-	else if (resolve_services && prot == 3)
-		printf("%*s:", addr_width, "fw");
-	else if (resolve_services && prot == 4)
-		printf("%*s:", addr_width, "tcpdiag");
+
+	if (resolve_services && (prot_name = netlink_proto_name(prot)))
+		printf("%*s:", addr_width, prot_name);
 	else
 		printf("%*d:", addr_width, prot);
+
 	if (pid == -1) {
 		printf("%-*s ", serv_width, "*");
 	} else if (resolve_services) {
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH RFC net-next] sfc: add support for skb->xmit_more
From: Robert Stonehouse @ 2014-10-16 16:36 UTC (permalink / raw)
  To: Edward Cree, Daniel Borkmann
  Cc: davem, nikolay, netdev, Shradha Shah, Jon Cooper,
	linux-net-drivers
In-Reply-To: <alpine.LFD.2.03.1410141933500.26972@solarflare.com>

On 14/10/14 19:41, Edward Cree wrote:
> Don't ring the doorbell, and don't do PIO.  This will also prevent
>   TX Push, because there will be more than one buffer waiting when
>   the doorbell is rung.
>
> Signed-off-by: Edward Cree <ecree@solarflare.com>
> ---
>
> @@ -351,8 +343,6 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
>   	unsigned short dma_flags;
>   	int i = 0;
>   
> -	EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
> -
>   	if (skb_shinfo(skb)->gso_size)
>   		return efx_enqueue_skb_tso(tx_queue, skb);

Would it be possible to keep a weaker version of this check i.e.
EFX_BUG_ON_PARANOID(tx_queue->write_count > tx_queue->insert_count);

> @@ -1258,14 +1249,13 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
>   			       struct sk_buff *skb)
>   {
>   	struct efx_nic *efx = tx_queue->efx;
> +	unsigned int old_insert_count = tx_queue->insert_count;
>   	int frag_i, rc;
>   	struct tso_state state;
>   
>   	/* Find the packet protocol and sanity-check it */
>   	state.protocol = efx_tso_check_protocol(skb);
>   
> -	EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
> -
>   	rc = tso_start(&state, efx, skb);
>   	if (rc)
>   		goto mem_err;

The same would apply here.

Thanks
Rob

^ permalink raw reply

* Re: [PATCH RFC net-next] sfc: add support for skb->xmit_more
From: Jonathan Cooper @ 2014-10-16 15:42 UTC (permalink / raw)
  To: David Miller; +Cc: ecree, dborkman, nikolay, netdev, sshah, linux-net-drivers
In-Reply-To: <20141015.122034.126232757997381022.davem@davemloft.net>

On 15/10/14 17:20, David Miller wrote:
> From: Edward Cree <ecree@solarflare.com>
> Date: Wed, 15 Oct 2014 12:05:35 +0100
>
>> On 14/10/14 22:15, David Miller wrote:
>>> From: Edward Cree <ecree@solarflare.com>
>>> Date: Tue, 14 Oct 2014 19:41:37 +0100
>>>
>>>> Don't ring the doorbell, and don't do PIO.  This will also prevent
>>>>   TX Push, because there will be more than one buffer waiting when
>>>>   the doorbell is rung.
>>>>
>>>> Signed-off-by: Edward Cree <ecree@solarflare.com>
>>> This looks good to me, mind if I apply this now?
>> I'd rather wait until Jon Cooper's had a chance to look at it; he
>> understands our TX path better than I.
> Ok.
Looks good to me.

Acked-by: Jon Cooper <jcooper@solarflare.com>

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-16 15:28 UTC (permalink / raw)
  To: Jiafei.Pan@freescale.com, Eric Dumazet
  Cc: David Miller, jkosina@suse.cz, netdev@vger.kernel.org,
	LeoLi@freescale.com, linux-doc@vger.kernel.org
In-Reply-To: <aeef795129504782ae1e9f91467d243e@BLUPR03MB517.namprd03.prod.outlook.com>


On 10/15/2014 10:15 PM, Jiafei.Pan@freescale.com wrote:
>> -----Original Message-----
>> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
>> Sent: Thursday, October 16, 2014 12:15 PM
>> To: Pan Jiafei-B37022
>> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
>> linux-doc@vger.kernel.org
>> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
>>
>> On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:
>>
>>> Thanks for your comments and suggestion. In my case, I want to build skb
>>> from hardware block specified memory, I only can see two ways, one is modified
>>> net card driver replace common skb allocation function with my specially
>>> functions, another way is to hack common skb allocation function in which
>>> redirect to my specially functions. My patch is just for the second way.
>>> Except these two ways, would you please give me some advice for some other
>>> ways for my case? Thanks
>> I suggest you read drivers/net/ethernet numerous examples.
>>
>> No need to change anything  in net/* or include/*, really.
>>
>> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
>>
>> Mentioning 'hack' in your mails simply should hint you are doing
>> something very wrong.
>>
>> What makes you think your hardware is so special ?
>>
> In fact, I am developing a bridge driver, it can bridge between any other the
> third party net card and my own net card. My target is to let any other the
> third party net card can directly use my own net card specified buffer, then
> there will be no memory copy in the whole bridge process.
> By the way, I don’t see any similar between igb_main.c and my case. And also
> My bridge also can’t implemented with "skb frag" in order to aim at zero memory
> copy.

I think the part you are not getting is that is how buffers are 
essentially handled now.  So for example in the case if igb the only 
part we have copied out is usually the header, or the entire frame in 
the case of small packets.  This has to happen in order to allow for 
changes to the header for routing and such.  Beyond that the frags that 
are passed are the buffers that igb is still holding onto.  So 
effectively what the other device transmits in a bridging/routing 
scenario is my own net card specified buffer plus the copied/modified 
header.

For a brief period igb used build_skb but that isn't valid on most 
systems as memory mapped for a device can be overwritten if the page is 
unmapped resulting in any changes to the header for routing/bridging 
purposes being invalidated.  Thus we cannot use the buffers for both the 
skb->data header which may be changed and Rx DMA simultaneously.

Thanks,

Alex

^ permalink raw reply

* [PATCH] [trivial] treewide: Fix company name in module descriptions
From: Masanari Iida @ 2014-10-16 15:09 UTC (permalink / raw)
  To: linux-kernel, mturquette, trivial, linus.walleij, gnurou,
	jcliburn, chris.snook, viresh.linux, inki.dae, dh09.lee,
	kgene.kim, rdunlap
  Cc: netdev, Masanari Iida

This patch fix company name's spelling typo in module descriptions
and a Kconfig.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
---
 drivers/clk/Kconfig                                 | 2 +-
 drivers/gpio/gpio-spear-spics.c                     | 2 +-
 drivers/net/ethernet/atheros/atl1c/atl1c_main.c     | 2 +-
 drivers/pinctrl/spear/pinctrl-plgpio.c              | 2 +-
 drivers/video/fbdev/exynos/exynos_mipi_dsi.c        | 2 +-
 drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/clk/Kconfig b/drivers/clk/Kconfig
index 455fd17..3f44f29 100644
--- a/drivers/clk/Kconfig
+++ b/drivers/clk/Kconfig
@@ -28,7 +28,7 @@ config COMMON_CLK_WM831X
 	depends on MFD_WM831X
 	---help---
           Supports the clocking subsystem of the WM831x/2x series of
-	  PMICs from Wolfson Microlectronics.
+	  PMICs from Wolfson Microelectronics.
 
 source "drivers/clk/versatile/Kconfig"
 
diff --git a/drivers/gpio/gpio-spear-spics.c b/drivers/gpio/gpio-spear-spics.c
index 353263c..506a2ea 100644
--- a/drivers/gpio/gpio-spear-spics.c
+++ b/drivers/gpio/gpio-spear-spics.c
@@ -204,5 +204,5 @@ static int __init spics_gpio_init(void)
 subsys_initcall(spics_gpio_init);
 
 MODULE_AUTHOR("Shiraz Hashim <shiraz.linux.kernel@gmail.com>");
-MODULE_DESCRIPTION("ST Microlectronics SPEAr SPI Chip Select Abstraction");
+MODULE_DESCRIPTION("STMicroelectronics SPEAr SPI Chip Select Abstraction");
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
index 72fb86b..c9946c6 100644
--- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
+++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
@@ -48,7 +48,7 @@ MODULE_DEVICE_TABLE(pci, atl1c_pci_tbl);
 
 MODULE_AUTHOR("Jie Yang");
 MODULE_AUTHOR("Qualcomm Atheros Inc., <nic-devel@qualcomm.com>");
-MODULE_DESCRIPTION("Qualcom Atheros 100/1000M Ethernet Network Driver");
+MODULE_DESCRIPTION("Qualcomm Atheros 100/1000M Ethernet Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(ATL1C_DRV_VERSION);
 
diff --git a/drivers/pinctrl/spear/pinctrl-plgpio.c b/drivers/pinctrl/spear/pinctrl-plgpio.c
index bddb791..ce5f22c 100644
--- a/drivers/pinctrl/spear/pinctrl-plgpio.c
+++ b/drivers/pinctrl/spear/pinctrl-plgpio.c
@@ -724,5 +724,5 @@ static int __init plgpio_init(void)
 subsys_initcall(plgpio_init);
 
 MODULE_AUTHOR("Viresh Kumar <viresh.kumar@linaro.org>");
-MODULE_DESCRIPTION("ST Microlectronics SPEAr PLGPIO driver");
+MODULE_DESCRIPTION("STMicroelectronics SPEAr PLGPIO driver");
 MODULE_LICENSE("GPL");
diff --git a/drivers/video/fbdev/exynos/exynos_mipi_dsi.c b/drivers/video/fbdev/exynos/exynos_mipi_dsi.c
index cee9602..716bfad 100644
--- a/drivers/video/fbdev/exynos/exynos_mipi_dsi.c
+++ b/drivers/video/fbdev/exynos/exynos_mipi_dsi.c
@@ -570,5 +570,5 @@ static struct platform_driver exynos_mipi_dsi_driver = {
 module_platform_driver(exynos_mipi_dsi_driver);
 
 MODULE_AUTHOR("InKi Dae <inki.dae@samsung.com>");
-MODULE_DESCRIPTION("Samusung SoC MIPI-DSI driver");
+MODULE_DESCRIPTION("Samsung SoC MIPI-DSI driver");
 MODULE_LICENSE("GPL");
diff --git a/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c b/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c
index 85edabf..2358a2f 100644
--- a/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c
+++ b/drivers/video/fbdev/exynos/exynos_mipi_dsi_common.c
@@ -876,5 +876,5 @@ int exynos_mipi_dsi_fifo_clear(struct mipi_dsim_device *dsim,
 }
 
 MODULE_AUTHOR("InKi Dae <inki.dae@samsung.com>");
-MODULE_DESCRIPTION("Samusung SoC MIPI-DSI common driver");
+MODULE_DESCRIPTION("Samsung SoC MIPI-DSI common driver");
 MODULE_LICENSE("GPL");
-- 
2.1.2.443.g670a3c1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox