Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: -27% netperf TCP_STREAM regression by "tcp_memcontrol: Kill struct tcp_memcontrol"
From: Christoph Paasch @ 2013-10-23 12:25 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, fengguang.wu, netdev, linux-kernel
In-Reply-To: <87k3h461ql.fsf@tw-ebiederman.twitter.com>

Hello,

On 22/10/13 - 21:38:10, Eric W. Biederman wrote:
> David Miller <davem@davemloft.net> writes:
> > From: fengguang.wu@intel.com
> > Date: Tue, 22 Oct 2013 22:41:29 +0100
> >
> >> We noticed big netperf throughput regressions
> >> 
> >>     a4fe34bf902b8f709c63      2e685cad57906e19add7  
> >> ------------------------  ------------------------  
> >>                   707.40       -40.7%       419.60  lkp-nex04/micro/netperf/120s-200%-TCP_STREAM
> >>                  2775.60       -23.7%      2116.40  lkp-sb03/micro/netperf/120s-200%-TCP_STREAM
> >>                  3483.00       -27.2%      2536.00  TOTAL netperf.Throughput_Mbps
> >> 
> >> and bisected it to
> >> 
> >> commit 2e685cad57906e19add7189b5ff49dfb6aaa21d3
> >> Author: Eric W. Biederman <ebiederm@xmission.com>
> >> Date:   Sat Oct 19 16:26:19 2013 -0700
> >> 
> >>     tcp_memcontrol: Kill struct tcp_memcontrol
> >
> > Eric please look into this, I'd rather have a fix to apply than revert your
> > work.
> 
> Will do I expect some ordering changed, and that changed the cache line
> behavior.

may it be the below?


Cheers,
Christoph

----
From: Christoph Paasch <christoph.paasch@uclouvain.be>
Subject: [PATCH] Fix: Dereference pointer-value of sk_prot->memory_pressure

2e685cad57 (tcp_memcontrol: Kill struct tcp_memcontrol) falsly modified
the access to memory_pressure of sk->sk_prot->memory_pressure. The patch
did modify the memory_pressure-field of struct cg_proto, but not the one
of struct proto.

So, the access to sk_prot->memory_pressure should not be changed.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
---
 include/net/sock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index c93542f..e3a18ff 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1137,7 +1137,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
        if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
                return !!sk->sk_cgrp->memory_pressure;

-       return !!sk->sk_prot->memory_pressure;
+       return !!*sk->sk_prot->memory_pressure;
 }

 static inline void sk_leave_memory_pressure(struct sock *sk)
-- 
1.8.3.2

^ permalink raw reply related

* Re: BUG: scheduling while atomic dev_set_promiscuity->__dev_notify_flags
From: Nicolas Dichtel @ 2013-10-23 12:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Cong Wang; +Cc: netdev
In-Reply-To: <52677F30.1070800@6wind.com>

Le 23/10/2013 09:48, Nicolas Dichtel a écrit :
> Le 23/10/2013 07:34, Alexei Starovoitov a écrit :
>> On Tue, Oct 22, 2013 at 8:53 PM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>> On Tue, 22 Oct 2013 at 01:04 GMT, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>>>
[snip]
>> Nicolas, I've sent you my .config.
> I got some pb with my testbed, hence I still didn't reproduce the bug.
> I will make more test today, but I think this is the right patch.
Still not reproduced with your config.
But as I said, I think your patch is the right thing to do.

^ permalink raw reply

* RE: [PATCH 3/3] netfilter: x_tables: fix ordering of jumpstack allocation and table update
From: Eric Dumazet @ 2013-10-23 12:13 UTC (permalink / raw)
  To: David Laight; +Cc: Pablo Neira Ayuso, netfilter-devel, davem, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73A2@saturn3.aculab.com>

On Wed, 2013-10-23 at 10:45 +0100, David Laight wrote:

> Those writes were in the wrong order on all systems.
> Also gcc needs to be told not to reorder the writes even on non-smp
> systems (if the code might be pre-empted).
> So an asm volatile (:::"memory") is needed there even if no specific
> synchronisation instruction is needed.

smp_wmb() contains this compiler barrier already.




^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: Felix Fietkau @ 2013-10-23 12:04 UTC (permalink / raw)
  To: Jamal Hadi Salim, Florian Fainelli, Neil Horman
  Cc: John Fastabend, netdev, David Miller, Sascha Hauer, John Crispin,
	Jonas Gorski, Gary Thomas, Vlad Yasevich, Stephen Hemminger
In-Reply-To: <5267B764.305@mojatatu.com>

On 2013-10-23 1:47 PM, Jamal Hadi Salim wrote:
> Florian,
> 
> I think it would be fantastic if you adopt the FDB API. The comment
> to use rtnetlink configure is valid. You can configure hardware
> switches as John has shown. I realize you guys have invested
> tons of time and this stuff has been tested by tons of people and this
> is a painful exercise to go through, but:
> having more than one approach for configuring/controlling kernel
> switch interfaces is not ideal. If you use the rtnetlink API then one
> can configure both the Linux bridge, embedded intel switches, etc with
> iproute2. i.e the switch becomes a bridge. I see a lot of commonolity
> between your model based on what you described and the current bridge.
> Pull the latest iproute2 code and look at "bridge" command.
> 
> Essentially, the current bridged could be described as an entity
> that does L2 switching:
> a) it has bridge ports which are any netdevs on Linux
> b) it has an FDB which constitutes a MAC address as the lookup and 
> optionally a VLAN. You can control learning and flooding.
> c) it has vlan filtering capabilities which you can turn on/off. The
> vlan capability to sellect PVIDs is also built in.
> d) It has multicast snooping
> 
> I think your model needs #a and #b, you can ignore the rest.
> I am not quiet sure how vlan port membership will apply; an fdb for
> each entry will have a vlan. You could also create one bridge per vlan
> (not the best  approach) - ccing Vlad and Stephen.
I still don't understand how this is supposed to work with the kind of
switches that we're supporting with swconfig.

A typical switch has something like 5-8 ports (+ one port that goes to
the CPU), and handles the entire forwarding path on its own. It usually
allows creating VLANs and assigning ports to them (tagged, untagged),
but many (probably most) switches do not support controlling the
forwarding path via a MAC address based FDB.

Many also do not have support for a packet header to indicate the
incoming/outgoing switch port, so creating one netdev per port will work
only for link status, not for the data path.

- Felix

^ permalink raw reply

* Re: Big performance loss from 3.4.63 to 3.10.13 when routing ipv4
From: Steffen Klassert @ 2013-10-23 12:04 UTC (permalink / raw)
  To: Wolfgang Walter; +Cc: David Miller, hannes, netdev, klassert
In-Reply-To: <1748378.n7GIUzQXTX@h2o.as.studentenwerk.mhn.de>

On Wed, Oct 23, 2013 at 01:33:14PM +0200, Wolfgang Walter wrote:
> Am Mittwoch, 23. Oktober 2013, 10:12:55 schrieb Steffen Klassert:
> > On Tue, Oct 22, 2013 at 03:46:38PM -0400, David Miller wrote:
> > > 
> > > I think we should resolve this soon, even bumping it to 2048 or 4096
> > > and leaving it at that would be I think acceptable.
> > 
> > Yes, of course. Let's use 4096 as the default for ipv4 and ipv6.
> > I'll take care of it next week.
> > 
> 
> I don't know what this value actually means. But on 3.4.x it is much higher. 
> On a machine with 512MB ram it is 32768, on a machine with 1GB ram it is 
> 262144 and with 16GB ram it is 4194304.
> 

Before  we removed the routing cache, the gc threshold was scaled along
with the maximum routing cache size (ip_rt_max_size). With the routing
cache removal, we lost the possibility to scale with ip_rt_max_size
and we had to choose a static default. Maybe we can try to tweak the
gc threshold again with the available memory somehow later. But to fix
it now, we need to find a reasonable default value. Would a default of
4096 meet your requirements?

^ permalink raw reply

* Re: Big performance loss from 3.4.63 to 3.10.13 when routing ipv4
From: Eric Dumazet @ 2013-10-23 12:00 UTC (permalink / raw)
  To: Wolfgang Walter; +Cc: Steffen Klassert, David Miller, hannes, netdev, klassert
In-Reply-To: <1748378.n7GIUzQXTX@h2o.as.studentenwerk.mhn.de>

On Wed, 2013-10-23 at 13:33 +0200, Wolfgang Walter wrote:

> I don't know what this value actually means. But on 3.4.x it is much higher. 
> On a machine with 512MB ram it is 32768, on a machine with 1GB ram it is 
> 262144 and with 16GB ram it is 4194304.
> 

Such huge values should not be needed. We should have at most one dst
per packet in flight.

On a loaded router, a NIC not using BQL could queue around 16,000
packets.

Of course, Qdisc layers could also store a lot of packets, but using the
default pfifo_fast is only adding 1000 packets per interface.

I guess using 65536 as the default value should be safe and reasonable

Have you tried using 32768 or 65536 ?

^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: Jamal Hadi Salim @ 2013-10-23 11:47 UTC (permalink / raw)
  To: Florian Fainelli, Neil Horman
  Cc: John Fastabend, netdev, David Miller, Sascha Hauer, Felix Fietkau,
	John Crispin, Jonas Gorski, Gary Thomas, Vlad Yasevich,
	Stephen Hemminger
In-Reply-To: <CAGVrzcaXFboRDn40+VciTes09w-jqTASNZz2GZNpTxbaV6D0Lw@mail.gmail.com>

On 10/22/13 18:09, Florian Fainelli wrote:
> 2013/10/22 Neil Horman <nhorman@tuxdriver.com>:
>> On Tue, Oct 22, 2013 at 12:59:12PM -0700, Florian Fainelli wrote:
>>> 2013/10/22 John Fastabend <john.r.fastabend@intel.com>:
>>>> On 10/22/2013 11:23 AM, Florian Fainelli wrote:
>>>>>
>
>>>
>>> Ok, I know nothing about the FDB API, but will take a look and see if
>>> that sounds suitable for the embedded use cases.
>>>
>> Further to Johns comments, why are you creating a new netlink protocol for this?
>> It seems that 90% of what you want to accomplish above is handled by rtnetlink.
>> As long as you write your driver properly, most of that should "just work".
>
> This is not a new netlink protocol, but a generic netlink family. Why
> would I extend rtnetlink to cover the remaining 10% which are not
> going to be used on desktop and servers when a new generic netlink
> family is cheap and can be selectively disabled in the kernel?
>

Florian,

I think it would be fantastic if you adopt the FDB API. The comment
to use rtnetlink configure is valid. You can configure hardware
switches as John has shown. I realize you guys have invested
tons of time and this stuff has been tested by tons of people and this
is a painful exercise to go through, but:
having more than one approach for configuring/controlling kernel
switch interfaces is not ideal. If you use the rtnetlink API then one
can configure both the Linux bridge, embedded intel switches, etc with
iproute2. i.e the switch becomes a bridge. I see a lot of commonolity
between your model based on what you described and the current bridge.
Pull the latest iproute2 code and look at "bridge" command.

Essentially, the current bridged could be described as an entity
that does L2 switching:
a) it has bridge ports which are any netdevs on Linux
b) it has an FDB which constitutes a MAC address as the lookup and 
optionally a VLAN. You can control learning and flooding.
c) it has vlan filtering capabilities which you can turn on/off. The
vlan capability to sellect PVIDs is also built in.
d) It has multicast snooping

I think your model needs #a and #b, you can ignore the rest.
I am not quiet sure how vlan port membership will apply; an fdb for
each entry will have a vlan. You could also create one bridge per vlan
(not the best  approach) - ccing Vlad and Stephen.

cheers,
jamal

^ permalink raw reply

* Re: -27% netperf TCP_STREAM regression by "tcp_memcontrol: Kill struct tcp_memcontrol"
From: Fengguang Wu @ 2013-10-23 11:46 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <87a9i0l3v1.fsf@xmission.com>

On Wed, Oct 23, 2013 at 02:43:14AM -0700, Eric W. Biederman wrote:
> Fengguang Wu <fengguang.wu@intel.com> writes:
> 
> > On Tue, Oct 22, 2013 at 09:38:10PM -0700, Eric W. Biederman wrote:
> >> David Miller <davem@davemloft.net> writes:
> >> 
> >> > From: fengguang.wu@intel.com
> >> > Date: Tue, 22 Oct 2013 22:41:29 +0100
> >> >
> >> >> We noticed big netperf throughput regressions
> >> >> 
> >> >>     a4fe34bf902b8f709c63      2e685cad57906e19add7  
> >> >> ------------------------  ------------------------  
> >> >>                   707.40       -40.7%       419.60  lkp-nex04/micro/netperf/120s-200%-TCP_STREAM
> >> >>                  2775.60       -23.7%      2116.40  lkp-sb03/micro/netperf/120s-200%-TCP_STREAM
> >> >>                  3483.00       -27.2%      2536.00  TOTAL netperf.Throughput_Mbps
> >> >> 
> >> >> and bisected it to
> >> >> 
> >> >> commit 2e685cad57906e19add7189b5ff49dfb6aaa21d3
> >> >> Author: Eric W. Biederman <ebiederm@xmission.com>
> >> >> Date:   Sat Oct 19 16:26:19 2013 -0700
> >> >> 
> >> >>     tcp_memcontrol: Kill struct tcp_memcontrol
> >> >
> >> > Eric please look into this, I'd rather have a fix to apply than revert your
> >> > work.
> >> 
> >> Will do I expect some ordering changed, and that changed the cache line
> >> behavior.
> >> 
> >> If I can't find anything we can revert this one particular patch without
> >> affecting anything else, but it would be nice to keep the data structure
> >> smaller.
> >> 
> >> Fengguag what would I need to do to reproduce this?
> >
> > Eric, attached is the kernel config.
> >
> > We used these commands in the test:
> >
> >         netserver
> >         netperf -t TCP_STREAM -c -C -l 120      # repeat 64 times and get average

Sorry it's not about repeating, but running 64 netperf in parallel.
The number 64 is 2 times the number of logical CPUs.

> > btw, we've got more complete change set (attached) and also noticed
> > performance increase in the TCP_SENDFILE case:
> >
> >     a4fe34bf902b8f709c63      2e685cad57906e19add7
> > ------------------------  ------------------------
> >                   707.40       -40.7%       419.60  lkp-nex04/micro/netperf/120s-200%-TCP_STREAM
> >                  2572.20       -17.7%      2116.20  lkp-sb03/micro/netperf/120s-200%-TCP_MAERTS
> >                  2775.60       -23.7%      2116.40  lkp-sb03/micro/netperf/120s-200%-TCP_STREAM
> >                  1006.60       -54.4%       459.40  lkp-sbx04/micro/netperf/120s-200%-TCP_STREAM
> >                  3278.60       -25.2%      2453.80  lkp-t410/micro/netperf/120s-200%-TCP_MAERTS
> >                  1902.80       +21.7%      2315.00  lkp-t410/micro/netperf/120s-200%-TCP_SENDFILE
> >                  3345.40       -26.7%      2451.00  lkp-t410/micro/netperf/120s-200%-TCP_STREAM
> >                 15588.60       -20.9%     12331.40  TOTAL netperf.Throughput_Mbps
> 
> I have a second question.  Do you mount the cgroup filesystem?  Do you
> set memory.kmem.tcp.limit_in_bytes?

No I didn't mount cgroup at all.

> If you aren't setting any memory cgroup limits or creating any groups
> this change should not have had any effect whatsoever.  And you haven't
> mentioned it so I don't expect you are enabling the memory cgroup limits
> explicitly.
> 
> If you have enabled the memory cgroups can you please describe your
> configuration as that may play a significant role.
> 
> Eric

^ permalink raw reply

* Re: [PATCH net] net: sctp: fix ASCONF to allow non SCTP_ADDR_SRC addresses in ipv6
From: Neil Horman @ 2013-10-23 11:36 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, netdev, linux-sctp, Michio Honda
In-Reply-To: <1382459696-1732-1-git-send-email-dborkman@redhat.com>

On Tue, Oct 22, 2013 at 06:34:56PM +0200, Daniel Borkmann wrote:
> Commit 8a07eb0a50 ("sctp: Add ASCONF operation on the single-homed host")
> implemented possible use of IPv4 addresses with non SCTP_ADDR_SRC state
> as source address when sending ASCONF (ADD) packets, but IPv6 part for
> that was not implemented in 8a07eb0a50. Therefore, as this is not restricted
> to IPv4-only, fix this up to allow the same for IPv6 addresses in SCTP.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> Cc: Michio Honda <micchie@sfc.wide.ad.jp>
> ---
>  net/sctp/ipv6.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index e7b2d4f..96a5591 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -279,7 +279,9 @@ static void sctp_v6_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
>  		sctp_v6_to_addr(&dst_saddr, &fl6->saddr, htons(bp->port));
>  		rcu_read_lock();
>  		list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> -			if (!laddr->valid || (laddr->state != SCTP_ADDR_SRC))
> +			if (!laddr->valid || laddr->state == SCTP_ADDR_DEL ||
> +			    (laddr->state != SCTP_ADDR_SRC &&
> +			     !asoc->src_out_of_asoc_ok))
>  				continue;
>  
>  			/* Do not compare against v4 addrs */
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: [PATCH 1/4 net-next] net: phy: add Generic Netlink Ethernet switch configuration API
From: Neil Horman @ 2013-10-23 11:34 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: John Fastabend, netdev, David Miller, Sascha Hauer, Felix Fietkau,
	John Crispin, Jonas Gorski, Gary Thomas, Jamal Hadi Salim
In-Reply-To: <CAGVrzcaXFboRDn40+VciTes09w-jqTASNZz2GZNpTxbaV6D0Lw@mail.gmail.com>

On Tue, Oct 22, 2013 at 03:09:32PM -0700, Florian Fainelli wrote:
> 2013/10/22 Neil Horman <nhorman@tuxdriver.com>:
> > On Tue, Oct 22, 2013 at 12:59:12PM -0700, Florian Fainelli wrote:
> >> 2013/10/22 John Fastabend <john.r.fastabend@intel.com>:
> >> > On 10/22/2013 11:23 AM, Florian Fainelli wrote:
> >> >>
> >> >> This patch adds an Ethernet Switch generic netlink configuration API
> >> >> which allows for doing the required configuration of managed Ethernet
> >> >> switches commonly found in Wireless/Cable/DSL routers in the market.
> >> >>
> >> >> Since this API is based on the Generic Netlink infrastructure it is very
> >> >> easy to extend a particular switch driver to support additional features
> >> >> and to adapt it to specific switches.
> >> >>
> >> >
> >> >> So far the API includes support for:
> >> >>
> >> >> - getting/setting a port VLAN id
> >> >> - getting/setting VLAN port membership
> >> >> - getting a port link status
> >> >> - getting a port statistics counters
> >> >> - resetting a switch device
> >> >> - applying a configuration to a switch device
> >> >>
> >> >
> >> > Did you consider exposing each physical switch port as a netdevice on
> >> > the host? I would assume your switch driver could do this.
> >> >
> >> > Then you can drop the port specific attributes (link status, stats, etc)
> >> > and use existing interfaces. The win being my tools work equally well on
> >> > your real switch as they do on my software switch. Also by exposing net
> >> > devices you provide a mechanism to send packets over the port and trap
> >> > control packets.
> >>
> >> Well this is exactly what DSA does and which I do not like because it
> >> is completely overkill for most switches out there which are using
> >> 802.1q tags and do not prepend/append proprietary tags for internal
> >> traffic classification.
> >>
> >> >
> >> > Next instead of creating a switch specific netlink API could you use
> >> > the existing FDB API? Again what I would like is for my existing
> >> > applications to run on the switch without having to rewrite them. For
> >> > example it would be great to have 'bridge fdb show dev myswitch' report
> >> > the correct tables for both the Sw bridge, a real switch bridge, and
> >> > for the embedded SR-IOV bridge case.
> >>
> >> Ok, I know nothing about the FDB API, but will take a look and see if
> >> that sounds suitable for the embedded use cases.
> >>
> > Further to Johns comments, why are you creating a new netlink protocol for this?
> > It seems that 90% of what you want to accomplish above is handled by rtnetlink.
> > As long as you write your driver properly, most of that should "just work".
> 
> This is not a new netlink protocol, but a generic netlink family. Why
Thats hair splitting.  The point I'm making here is that you're creating a new
communication path from user space to the kernel to do something that we already
have a communication path to do.

> would I extend rtnetlink to cover the remaining 10% which are not
> going to be used on desktop and servers when a new generic netlink
> family is cheap and can be selectively disabled in the kernel?
90% of it is already done on servers and desktops using rtnetlink (thats my
point), and you can reasonably add the other 10% (I think), if you just expose
the switch ports as their own ethernet interfaces.  You say DSA is overkill, but
if you just add the other switch ports as their own ethernet interfaces, you
would get most of the above work for free, which seems to me like less overkill
than a new netlink family and userspace tools. 

Regards
Neil

^ permalink raw reply

* Re: Big performance loss from 3.4.63 to 3.10.13 when routing ipv4
From: Wolfgang Walter @ 2013-10-23 11:33 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, hannes, netdev, klassert
In-Reply-To: <20131023081255.GB10148@secunet.com>

Am Mittwoch, 23. Oktober 2013, 10:12:55 schrieb Steffen Klassert:
> On Tue, Oct 22, 2013 at 03:46:38PM -0400, David Miller wrote:
> > From: Wolfgang Walter <linux@stwm.de>
> > Date: Tue, 22 Oct 2013 21:07:41 +0200
> > 
> > > Am Mittwoch, 2. Oktober 2013, 00:20:02 schrieb Hannes Frederic Sowa:
> > >> On Tue, Oct 01, 2013 at 06:39:32PM +0200, Wolfgang Walter wrote:
> > >> > All network traffic over the router become slow and sluggish. If one
> > >> > pings
> > >> > the router there is a packet loss. After about 2 minutes the traffic
> > >> > completely stalls for about 1 minute. Then it works again as in the
> > >> > beginning to then stall again. And so on.
> > >> 
> > >> Maybe dropwatch can give a first hint?
> > > 
> > > I finally found the problem:
> > > 
> > > In 3.10.x and 3.11.x the value of /proc/sys/net/ipv4/xfrm4_gc_thresh is
> > > 1024.
> > > 
> > > It is much higher in 3.4.x. If I increase this value in 3.10.x to the
> > > one I
> > > see on 3.4.x all works fine with 3.10.x
> > 
> > Steffen, here is yet another report about this issue.
> > 
> > I think we should resolve this soon, even bumping it to 2048 or 4096
> > and leaving it at that would be I think acceptable.
> 
> Yes, of course. Let's use 4096 as the default for ipv4 and ipv6.
> I'll take care of it next week.
> 

I don't know what this value actually means. But on 3.4.x it is much higher. 
On a machine with 512MB ram it is 32768, on a machine with 1GB ram it is 
262144 and with 16GB ram it is 4194304.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* [PATCH net v3] be2net: Warn users of possible broken functionality on BE2 cards with very old FW versions with latest driver
From: Somnath Kotur @ 2013-10-23 11:29 UTC (permalink / raw)
  To: netdev; +Cc: davem, Somnath Kotur

On very old FW versions < 4.0, the mailbox command to set interrupts
on the card succeeds even though it is not supported and should have
failed, leading to a scenario where interrupts do not work.
Hence warn users to upgrade to a suitable FW version to avoid seeing
broken functionality.

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
---
v3: Incorporated comments from Ben Hutchings, Joe Perches and Ivan Vercera

 drivers/net/ethernet/emulex/benet/be.h      |    9 +++++++++
 drivers/net/ethernet/emulex/benet/be_main.c |    6 ++++++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h
index db02023..da9c04b 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -696,6 +696,15 @@ static inline int qnq_async_evt_rcvd(struct be_adapter *adapter)
 	return adapter->flags & BE_FLAGS_QNQ_ASYNC_EVT_RCVD;
 }
 
+static inline u32 fw_major_num(const char *fw_ver)
+{
+	int fw_major = 0;
+
+	sscanf(fw_ver, "%d.", &fw_major);
+
+	return fw_major;
+}
+
 extern void be_cq_notify(struct be_adapter *adapter, u16 qid, bool arm,
 		u16 num_popped);
 extern void be_link_status_update(struct be_adapter *adapter, u8 link_status);
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 2c38cc4..53ed58b 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -3247,6 +3247,12 @@ static int be_setup(struct be_adapter *adapter)
 
 	be_cmd_get_fw_ver(adapter, adapter->fw_ver, adapter->fw_on_flash);
 
+	if (BE2_chip(adapter) && fw_major_num(adapter->fw_ver) < 4) {
+		dev_err(dev, "Firmware on card is old(%s), IRQs may not work",
+			adapter->fw_ver);
+		dev_err(dev, "Please upgrade firmware to version >= 4.0\n");
+	}
+
 	if (adapter->vlans_added)
 		be_vid_config(adapter);
 
-- 
1.6.0.2

^ permalink raw reply related

* RE: [PATCH net v3] be2net: Warn users of possible broken functionality on BE2 cards with very old FW versions with latest driver
From: Somnath Kotur @ 2013-10-23 11:29 UTC (permalink / raw)
  To: Somnath Kotur, netdev@vger.kernel.org; +Cc: davem@davemloft.net
In-Reply-To: <c04a1288-7c23-499a-ae32-d4334fd39065@CMEXHTCAS1.ad.emulex.com>

Pls ignore this, sent this version out by mistake , will resend the correct one soon after this, extremely sorry for the inconvenience.

Thanks
Somnath

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Somnath Kotur
> Sent: Wednesday, October 23, 2013 1:49 PM
> To: netdev@vger.kernel.org
> Cc: davem@davemloft.net; Somnath Kotur
> Subject: [PATCH net v3] be2net: Warn users of possible broken functionality
> on BE2 cards with very old FW versions with latest driver
> 
> On very old FW versions < 4.0, the mailbox command to set interrupts on the
> card succeeds even though it is not supported and should have failed,
> leading to a scenario where interrupts do not work.
> Hence warn users to upgrade to a suitable FW version to avoid seeing broken
> functionality.
> 
> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
> ---
> v3: Incorporated comments from Ben Hutchings and Joe Perches
> 
>  drivers/net/ethernet/emulex/benet/be.h      |   14 ++++++++++++++
>  drivers/net/ethernet/emulex/benet/be_main.c |    6 ++++++
>  2 files changed, 20 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/ethernet/emulex/benet/be.h
> b/drivers/net/ethernet/emulex/benet/be.h
> index db02023..6a57051 100644
> --- a/drivers/net/ethernet/emulex/benet/be.h
> +++ b/drivers/net/ethernet/emulex/benet/be.h
> @@ -696,6 +696,20 @@ static inline int qnq_async_evt_rcvd(struct
> be_adapter *adapter)
>  	return adapter->flags & BE_FLAGS_QNQ_ASYNC_EVT_RCVD;  }
> 
> +static inline u32 fw_major_num(char *fw_ver_str) {
> +	u32 fw_major;
> +	char *next, *cp;
> +	char tmp_fw_ver[FW_VER_LEN];
> +
> +	strncpy(tmp_fw_ver, fw_ver_str, strlen(fw_ver_str));
> +	next = tmp_fw_ver;
> +	cp = strsep(&next, ".");
> +	sscanf(cp, "%i", &fw_major);
> +
> +	return fw_major;
> +}
> +
>  extern void be_cq_notify(struct be_adapter *adapter, u16 qid, bool arm,
>  		u16 num_popped);
>  extern void be_link_status_update(struct be_adapter *adapter, u8
> link_status); diff --git a/drivers/net/ethernet/emulex/benet/be_main.c
> b/drivers/net/ethernet/emulex/benet/be_main.c
> index 2c38cc4..d8da961 100644
> --- a/drivers/net/ethernet/emulex/benet/be_main.c
> +++ b/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -3247,6 +3247,12 @@ static int be_setup(struct be_adapter *adapter)
> 
>  	be_cmd_get_fw_ver(adapter, adapter->fw_ver, adapter-
> >fw_on_flash);
> 
> +	if (BE2_chip(adapter) && fw_major_num(adapter->fw_ver) < 4) {
> +		dev_err(dev, "Firmware on card is old(%s), IRQs may not
> work.",
> +			adapter->fw_ver);
> +		dev_err(dev, "Please upgrade firmware to version >=
> 4.0\n");
> +	}
> +
>  	if (adapter->vlans_added)
>  		be_vid_config(adapter);
> 
> --
> 1.6.0.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH net] netpoll: fix rx_hook() interface by passing the skb
From: David Laight @ 2013-10-23 11:18 UTC (permalink / raw)
  To: Antonio Quartulli; +Cc: David S. Miller, netdev
In-Reply-To: <20131023102848.GB1535@neomailbox.net>

> My idea is to use the following API:
> 
> rx_skb_hook(struct netpoll *np, int source, struct sk_buff *skb, int len);
> 
> Any suggestion or objection?

Don't you need to pass the offset of the udp data?

	David

^ permalink raw reply

* [PATCH net-next 2/2] net: initialize hashrnd in flow_dissector with net_get_random_once
From: Hannes Frederic Sowa @ 2013-10-23 11:12 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet

We also can defer the initialization of hashrnd in flow_dissector
to its first use. Since net_get_random_once is irqsave now we don't
have to audit the call paths if one of this functions get called by an
interrupt handler.

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/core/flow_dissector.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index f8e25ac..5cac36e 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -184,6 +184,22 @@ ipv6:
 EXPORT_SYMBOL(skb_flow_dissect);
 
 static u32 hashrnd __read_mostly;
+static __always_inline void __flow_hash_secret_init(void)
+{
+	net_get_random_once(&hashrnd, sizeof(hashrnd));
+}
+
+static __always_inline u32 __flow_hash_3words(u32 a, u32 b, u32 c)
+{
+	__flow_hash_secret_init();
+	return jhash_3words(a, b, c, hashrnd);
+}
+
+static __always_inline u32 __flow_hash_1word(u32 a)
+{
+	__flow_hash_secret_init();
+	return jhash_1word(a, hashrnd);
+}
 
 /*
  * __skb_get_rxhash: calculate a flow hash based on src/dst addresses
@@ -210,9 +226,9 @@ void __skb_get_rxhash(struct sk_buff *skb)
 		swap(keys.port16[0], keys.port16[1]);
 	}
 
-	hash = jhash_3words((__force u32)keys.dst,
-			    (__force u32)keys.src,
-			    (__force u32)keys.ports, hashrnd);
+	hash = __flow_hash_3words((__force u32)keys.dst,
+				  (__force u32)keys.src,
+				  (__force u32)keys.ports);
 	if (!hash)
 		hash = 1;
 
@@ -248,7 +264,7 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		hash = skb->sk->sk_hash;
 	else
 		hash = (__force u16) skb->protocol;
-	hash = jhash_1word(hash, hashrnd);
+	hash = __flow_hash_1word(hash);
 
 	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
 }
@@ -340,7 +356,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 				else
 					hash = (__force u16) skb->protocol ^
 					    skb->rxhash;
-				hash = jhash_1word(hash, hashrnd);
+				hash = __flow_hash_1word(hash);
 				queue_index = map->queues[
 				    ((u64)hash * map->len) >> 32];
 			}
@@ -395,11 +411,3 @@ struct netdev_queue *netdev_pick_tx(struct net_device *dev,
 	skb_set_queue_mapping(skb, queue_index);
 	return netdev_get_tx_queue(dev, queue_index);
 }
-
-static int __init initialize_hashrnd(void)
-{
-	get_random_bytes(&hashrnd, sizeof(hashrnd));
-	return 0;
-}
-
-late_initcall_sync(initialize_hashrnd);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/2] net: make net_get_random_once irqsave
From: Hannes Frederic Sowa @ 2013-10-23 11:12 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet

I initial build a non-irqsave version of net_get_random_once because I
would liked to have the freedom to defer even the extraction process of
get_random_bytes until the nonblocking pool is fully seeded.

I don't think this is a good idea anymore and thus this patch makes
net_get_random_once irqsave. Now someone using net_get_random_once does
not need to care from where it is called.

Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/net.h | 1 -
 net/core/utils.c    | 7 ++++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index aca446b..b292a04 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -250,7 +250,6 @@ bool __net_get_random_once(void *buf, int nbytes, bool *done,
 #define ___NET_RANDOM_STATIC_KEY_INIT STATIC_KEY_INIT_FALSE
 #endif /* HAVE_JUMP_LABEL */
 
-/* BE CAREFUL: this function is not interrupt safe */
 #define net_get_random_once(buf, nbytes)				\
 	({								\
 		bool ___ret = false;					\
diff --git a/net/core/utils.c b/net/core/utils.c
index bf09371..2f737bf 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -370,16 +370,17 @@ bool __net_get_random_once(void *buf, int nbytes, bool *done,
 			   struct static_key *done_key)
 {
 	static DEFINE_SPINLOCK(lock);
+	unsigned long flags;
 
-	spin_lock_bh(&lock);
+	spin_lock_irqsave(&lock, flags);
 	if (*done) {
-		spin_unlock_bh(&lock);
+		spin_unlock_irqrestore(&lock, flags);
 		return false;
 	}
 
 	get_random_bytes(buf, nbytes);
 	*done = true;
-	spin_unlock_bh(&lock);
+	spin_unlock_irqrestore(&lock, flags);
 
 	__net_random_once_disable_jump(done_key);
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH 1/1] mac80211:Resetting connection monitor timers in transmit path
From: Krishna Chaitanya @ 2013-10-23 10:54 UTC (permalink / raw)
  To: Dhahira Thesneem
  Cc: Johannes Berg, John W. Linville, David S. Miller, linux-wireless,
	netdev, linux-kernel
In-Reply-To: <1382523145-2302-1-git-send-email-dhahira.thesneem@mistralsolutions.com>

On Wed, Oct 23, 2013 at 3:42 PM, Dhahira Thesneem
<dhahira.thesneem@mistralsolutions.com> wrote:
>
> Reset connection monitor timers when we are able to successfully transmit data to an AP.
>
> Signed-off-by: Dhahira Thesneem <dhahira.thesneem@mistralsolutions.com>
> ---
>  net/mac80211/tx.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
> index 3456c04..e7725cf 100644
> --- a/net/mac80211/tx.c
> +++ b/net/mac80211/tx.c
> @@ -1700,7 +1700,8 @@ netdev_tx_t ieee80211_monitor_start_xmit(struct sk_buff *skb,
>
>         ieee80211_xmit(sdata, skb, chan->band);
>         rcu_read_unlock();
> -
> +       /*To reset connection monitor timers, after successful transmission*/
> +       ieee80211_sta_reset_conn_monitor(sdata);
>         return NETDEV_TX_OK;
>
>  fail_rcu:
> @@ -2139,7 +2140,8 @@ netdev_tx_t ieee80211_subif_start_xmit(struct sk_buff *skb,
>
>         ieee80211_xmit(sdata, skb, band);
>         rcu_read_unlock();
> -
> +       /*To reset connection monitor timers, after successful transmission*/
> +       ieee80211_sta_reset_conn_monitor(sdata);
>         return NETDEV_TX_OK;
>
>   fail_rcu:
> --
Successful data transmission should be checked in the tx_status not
after we transmit.
In fact its already taken care in status.c: through ieee80211_sta_tx_notify.

NACK.

^ permalink raw reply

* [PATCH] netfilter: ipset: remove duplicate define
From: Michael Opdenacker @ 2013-10-23 10:36 UTC (permalink / raw)
  To: pablo, kaber, kadlec, davem
  Cc: netfilter-devel, netfilter, coreteam, netdev, linux-kernel,
	Michael Opdenacker

This patch removes a duplicate define from
net/netfilter/ipset/ip_set_hash_gen.h

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 707bc52..b37a65f 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -243,7 +243,6 @@ hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
 #define mtype_uadt		TOKEN(MTYPE, _uadt)
 #define mtype			MTYPE
 
-#define mtype_elem		TOKEN(MTYPE, _elem)
 #define mtype_add		TOKEN(MTYPE, _add)
 #define mtype_del		TOKEN(MTYPE, _del)
 #define mtype_test_cidrs	TOKEN(MTYPE, _test_cidrs)
-- 
1.8.1.2

^ permalink raw reply related

* Re: [PATCH net] netpoll: fix rx_hook() interface by passing the skb
From: Antonio Quartulli @ 2013-10-23 10:28 UTC (permalink / raw)
  To: David Laight; +Cc: David S. Miller, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B73A1@saturn3.aculab.com>

[-- Attachment #1: Type: text/plain, Size: 2175 bytes --]

On Wed, Oct 23, 2013 at 09:33:49AM +0100, David Laight wrote:
> ...
> > > I can't remember which value you passed as 'offset' (and my mailer makes
> > > it hard to find), but to ease the code changes the offset of the udp data
> > > would make sense.
> > > In that case you still need to pass the source port.
> > 
> > I decided not to pass the source port because if the user is really interested
> > in it, it is still possible to get the udp_hdr from the skb and read its value.
> 
> It just seemed that there was no need to require that the hook re-parse
> the ip header just to find the source port.
> (ok it could assume that the udp header is just before the data)

Also David (M.) pointed out the same. I will keep the port as argument for
rx_hook.

>  
> > > If you do rx_hook(np, source_port, skb, offset) then if anyone manages to
> > > load an old module (or code that casts the assignement to rx_poll)
> > > at least it won't go 'bang'.
> > > Renaming the structure member will guarantee to generate compile errors.
> > 
> > so you suggest to rename rx_hook to something else to warn people about the
> > change?
> 
> Yes.
> 

mh..what about rx_skb_hook ? this way we also make it easy to notice the
difference (both in arguments and behaviour).

> > If we go for the "no udp port" approach they will get an error any way because
> > of the mismatching arguments.
> 
> No - you only get a warning when you assign a function pointer of the wrong type.
> And that is true even if you just change the type of the pointer.
> However code might already have a cast on the function pointer (eg because the
> hook has 'unsigned char *') - so you won't even get a warning.
> You then get an OOPS when the hook tries to read the buffer.
> 
> It is a really bad interface...
> There isn't even a flags/options (etc) word that can be used
> to detect enhancements.
> 


agreed. But I am not sure about what I could do to fix that.

My idea is to use the following API:

rx_skb_hook(struct netpoll *np, int source, struct sk_buff *skb, int len);


Any suggestion or objection?


Regards,


-- 
Antonio Quartulli

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH 1/1] mac80211:Resetting connection monitor timers in transmit path
From: Dhahira Thesneem @ 2013-10-23 10:12 UTC (permalink / raw)
  To: Johannes Berg, John W. Linville, David S. Miller,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Dhahira Thesneem

Reset connection monitor timers when we are able to successfully transmit data to an AP.

Signed-off-by: Dhahira Thesneem <dhahira.thesneem-EvXpCiN+lbve9wHmmfpqLFaTQe2KTcn/@public.gmane.org>
---
 net/mac80211/tx.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 3456c04..e7725cf 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -1700,7 +1700,8 @@ netdev_tx_t ieee80211_monitor_start_xmit(struct sk_buff *skb,
 
 	ieee80211_xmit(sdata, skb, chan->band);
 	rcu_read_unlock();
-
+	/*To reset connection monitor timers, after successful transmission*/
+	ieee80211_sta_reset_conn_monitor(sdata);
 	return NETDEV_TX_OK;
 
 fail_rcu:
@@ -2139,7 +2140,8 @@ netdev_tx_t ieee80211_subif_start_xmit(struct sk_buff *skb,
 
 	ieee80211_xmit(sdata, skb, band);
 	rcu_read_unlock();

^ permalink raw reply related

* RE: [PATCH 3/3] netfilter: x_tables: fix ordering of jumpstack allocation and table update
From: David Laight @ 2013-10-23  9:45 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1382519724-3953-4-git-send-email-pablo@netfilter.org>

> Subject: [PATCH 3/3] netfilter: x_tables: fix ordering of jumpstack allocation and table update
...
> Meanwhile, CPU0 is handling the network receive path and ends up in
> ipt_do_table, resulting in:
> 
> 	private = table->private;
> 
> 	[...]
> 
> 	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
> 
> On weakly ordered memory architectures, the writes to table->private
> and newinfo->jumpstack from CPU1 can be observed out of order by CPU0.
> Furthermore, on architectures which don't respect ordering of address
> dependencies (i.e. Alpha), the reads from CPU0 can also be re-ordered.

Which reads might be out of order?
AFAICT they are strongly sequenced because they second depends on the
value read by the first.
So I don't see why the read barrier is needed.

I presume the above code is tied to a single cpu.

...
> 
> -	table->private = newinfo;
>  	newinfo->initial_entries = private->initial_entries;
> +	/*
> +	 * Ensure contents of newinfo are visible before assigning to
> +	 * private.
> +	 */
> +	smp_wmb();
> +	table->private = newinfo;

Those writes were in the wrong order on all systems.
Also gcc needs to be told not to reorder the writes even on non-smp
systems (if the code might be pre-empted).
So an asm volatile (:::"memory") is needed there even if no specific
synchronisation instruction is needed.

	David

^ permalink raw reply

* Re: -27% netperf TCP_STREAM regression by "tcp_memcontrol: Kill struct tcp_memcontrol"
From: Eric W. Biederman @ 2013-10-23  9:43 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: David Miller, netdev, linux-kernel
In-Reply-To: <20131023061019.GA15698@localhost>

Fengguang Wu <fengguang.wu@intel.com> writes:

> On Tue, Oct 22, 2013 at 09:38:10PM -0700, Eric W. Biederman wrote:
>> David Miller <davem@davemloft.net> writes:
>> 
>> > From: fengguang.wu@intel.com
>> > Date: Tue, 22 Oct 2013 22:41:29 +0100
>> >
>> >> We noticed big netperf throughput regressions
>> >> 
>> >>     a4fe34bf902b8f709c63      2e685cad57906e19add7  
>> >> ------------------------  ------------------------  
>> >>                   707.40       -40.7%       419.60  lkp-nex04/micro/netperf/120s-200%-TCP_STREAM
>> >>                  2775.60       -23.7%      2116.40  lkp-sb03/micro/netperf/120s-200%-TCP_STREAM
>> >>                  3483.00       -27.2%      2536.00  TOTAL netperf.Throughput_Mbps
>> >> 
>> >> and bisected it to
>> >> 
>> >> commit 2e685cad57906e19add7189b5ff49dfb6aaa21d3
>> >> Author: Eric W. Biederman <ebiederm@xmission.com>
>> >> Date:   Sat Oct 19 16:26:19 2013 -0700
>> >> 
>> >>     tcp_memcontrol: Kill struct tcp_memcontrol
>> >
>> > Eric please look into this, I'd rather have a fix to apply than revert your
>> > work.
>> 
>> Will do I expect some ordering changed, and that changed the cache line
>> behavior.
>> 
>> If I can't find anything we can revert this one particular patch without
>> affecting anything else, but it would be nice to keep the data structure
>> smaller.
>> 
>> Fengguag what would I need to do to reproduce this?
>
> Eric, attached is the kernel config.
>
> We used these commands in the test:
>
>         netserver
>         netperf -t TCP_STREAM -c -C -l 120      # repeat 64 times and get average
>
> btw, we've got more complete change set (attached) and also noticed
> performance increase in the TCP_SENDFILE case:
>
>     a4fe34bf902b8f709c63      2e685cad57906e19add7
> ------------------------  ------------------------
>                   707.40       -40.7%       419.60  lkp-nex04/micro/netperf/120s-200%-TCP_STREAM
>                  2572.20       -17.7%      2116.20  lkp-sb03/micro/netperf/120s-200%-TCP_MAERTS
>                  2775.60       -23.7%      2116.40  lkp-sb03/micro/netperf/120s-200%-TCP_STREAM
>                  1006.60       -54.4%       459.40  lkp-sbx04/micro/netperf/120s-200%-TCP_STREAM
>                  3278.60       -25.2%      2453.80  lkp-t410/micro/netperf/120s-200%-TCP_MAERTS
>                  1902.80       +21.7%      2315.00  lkp-t410/micro/netperf/120s-200%-TCP_SENDFILE
>                  3345.40       -26.7%      2451.00  lkp-t410/micro/netperf/120s-200%-TCP_STREAM
>                 15588.60       -20.9%     12331.40  TOTAL netperf.Throughput_Mbps

I have a second question.  Do you mount the cgroup filesystem?  Do you
set memory.kmem.tcp.limit_in_bytes?

If you aren't setting any memory cgroup limits or creating any groups
this change should not have had any effect whatsoever.  And you haven't
mentioned it so I don't expect you are enabling the memory cgroup limits
explicitly.

If you have enabled the memory cgroups can you please describe your
configuration as that may play a significant role.

Eric

^ permalink raw reply

* Charity Donation
From: Gillian and Adrian Bayford @ 2013-10-23  5:20 UTC (permalink / raw)
  To: Recipients

My wife and i won £148.6 Million Pounds last year, and we have done lot of charity donation, so we decide to give 1.5 Million Pounds each to 5 lucky people, lucky for you, your email, was given to us by Google management as one of our lucky precipitants.

For verification process see below Please read the article - http://www.bbc.co.uk/news/uk-england-19254228

Send Name, Country, Age, Occupation and Phone Number for details

Congratulations & Happy Celebrations in Advance,

Gillian and Adrian Bayford
Email: gillian.adrianbayford01@rogers.com

^ permalink raw reply

* [PATCH 3/3] netfilter: x_tables: fix ordering of jumpstack allocation and table update
From: Pablo Neira Ayuso @ 2013-10-23  9:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1382519724-3953-1-git-send-email-pablo@netfilter.org>

From: Will Deacon <will.deacon@arm.com>

During kernel stability testing on an SMP ARMv7 system, Yalin Wang
reported the following panic from the netfilter code:

  1fe0: 0000001c 5e2d3b10 4007e779 4009e110 60000010 00000032 ff565656 ff545454
  [<c06c48dc>] (ipt_do_table+0x448/0x584) from [<c0655ef0>] (nf_iterate+0x48/0x7c)
  [<c0655ef0>] (nf_iterate+0x48/0x7c) from [<c0655f7c>] (nf_hook_slow+0x58/0x104)
  [<c0655f7c>] (nf_hook_slow+0x58/0x104) from [<c0683bbc>] (ip_local_deliver+0x88/0xa8)
  [<c0683bbc>] (ip_local_deliver+0x88/0xa8) from [<c0683718>] (ip_rcv_finish+0x418/0x43c)
  [<c0683718>] (ip_rcv_finish+0x418/0x43c) from [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598)
  [<c062b1c4>] (__netif_receive_skb+0x4cc/0x598) from [<c062b314>] (process_backlog+0x84/0x158)
  [<c062b314>] (process_backlog+0x84/0x158) from [<c062de84>] (net_rx_action+0x70/0x1dc)
  [<c062de84>] (net_rx_action+0x70/0x1dc) from [<c0088230>] (__do_softirq+0x11c/0x27c)
  [<c0088230>] (__do_softirq+0x11c/0x27c) from [<c008857c>] (do_softirq+0x44/0x50)
  [<c008857c>] (do_softirq+0x44/0x50) from [<c0088614>] (local_bh_enable_ip+0x8c/0xd0)
  [<c0088614>] (local_bh_enable_ip+0x8c/0xd0) from [<c06b0330>] (inet_stream_connect+0x164/0x298)
  [<c06b0330>] (inet_stream_connect+0x164/0x298) from [<c061d68c>] (sys_connect+0x88/0xc8)
  [<c061d68c>] (sys_connect+0x88/0xc8) from [<c000e340>] (ret_fast_syscall+0x0/0x30)
  Code: 2a000021 e59d2028 e59de01c e59f011c (e7824103)
  ---[ end trace da227214a82491bd ]---
  Kernel panic - not syncing: Fatal exception in interrupt

This comes about because CPU1 is executing xt_replace_table in response
to a setsockopt syscall, resulting in:

	ret = xt_jumpstack_alloc(newinfo);
		--> newinfo->jumpstack = kzalloc(size, GFP_KERNEL);

	[...]

	table->private = newinfo;
	newinfo->initial_entries = private->initial_entries;

Meanwhile, CPU0 is handling the network receive path and ends up in
ipt_do_table, resulting in:

	private = table->private;

	[...]

	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];

On weakly ordered memory architectures, the writes to table->private
and newinfo->jumpstack from CPU1 can be observed out of order by CPU0.
Furthermore, on architectures which don't respect ordering of address
dependencies (i.e. Alpha), the reads from CPU0 can also be re-ordered.

This patch adds an smp_wmb() before the assignment to table->private
(which is essentially publishing newinfo) to ensure that all writes to
newinfo will be observed before plugging it into the table structure.
A dependent-read barrier is also added on the consumer sides, to ensure
the same ordering requirements are also respected there.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reported-by: Wang, Yalin <Yalin.Wang@sonymobile.com>
Tested-by: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/arp_tables.c |    5 +++++
 net/ipv4/netfilter/ip_tables.c  |    5 +++++
 net/ipv6/netfilter/ip6_tables.c |    5 +++++
 net/netfilter/x_tables.c        |    7 ++++++-
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 85a4f21..59da7cd 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -271,6 +271,11 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
 	private = table->private;
+	/*
+	 * Ensure we load private-> members after we've fetched the base
+	 * pointer.
+	 */
+	smp_read_barrier_depends();
 	table_base = private->entries[smp_processor_id()];
 
 	e = get_entry(table_base, private->hook_entry[hook]);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index d23118d..718dfbd 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -327,6 +327,11 @@ ipt_do_table(struct sk_buff *skb,
 	addend = xt_write_recseq_begin();
 	private = table->private;
 	cpu        = smp_processor_id();
+	/*
+	 * Ensure we load private-> members after we've fetched the base
+	 * pointer.
+	 */
+	smp_read_barrier_depends();
 	table_base = private->entries[cpu];
 	jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
 	stackptr   = per_cpu_ptr(private->stackptr, cpu);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 44400c2..710238f 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -349,6 +349,11 @@ ip6t_do_table(struct sk_buff *skb,
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
 	private = table->private;
+	/*
+	 * Ensure we load private-> members after we've fetched the base
+	 * pointer.
+	 */
+	smp_read_barrier_depends();
 	cpu        = smp_processor_id();
 	table_base = private->entries[cpu];
 	jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 8b03028..227aa11 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -845,8 +845,13 @@ xt_replace_table(struct xt_table *table,
 		return NULL;
 	}
 
-	table->private = newinfo;
 	newinfo->initial_entries = private->initial_entries;
+	/*
+	 * Ensure contents of newinfo are visible before assigning to
+	 * private.
+	 */
+	smp_wmb();
+	table->private = newinfo;
 
 	/*
 	 * Even though table entries have now been swapped, other CPU's
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/3] netfilter: ipt_ULOG: fix info leaks
From: Pablo Neira Ayuso @ 2013-10-23  9:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1382519724-3953-1-git-send-email-pablo@netfilter.org>

From: Mathias Krause <minipli@googlemail.com>

The ulog messages leak heap bytes by the means of padding bytes and
incompletely filled string arrays. Fix those by memset(0)'ing the
whole struct before filling it.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/ipt_ULOG.c |    7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_ULOG.c b/net/ipv4/netfilter/ipt_ULOG.c
index cbc2215..9cb993c 100644
--- a/net/ipv4/netfilter/ipt_ULOG.c
+++ b/net/ipv4/netfilter/ipt_ULOG.c
@@ -220,6 +220,7 @@ static void ipt_ulog_packet(struct net *net,
 	ub->qlen++;
 
 	pm = nlmsg_data(nlh);
+	memset(pm, 0, sizeof(*pm));
 
 	/* We might not have a timestamp, get one */
 	if (skb->tstamp.tv64 == 0)
@@ -238,8 +239,6 @@ static void ipt_ulog_packet(struct net *net,
 	}
 	else if (loginfo->prefix[0] != '\0')
 		strncpy(pm->prefix, loginfo->prefix, sizeof(pm->prefix));
-	else
-		*(pm->prefix) = '\0';
 
 	if (in && in->hard_header_len > 0 &&
 	    skb->mac_header != skb->network_header &&
@@ -251,13 +250,9 @@ static void ipt_ulog_packet(struct net *net,
 
 	if (in)
 		strncpy(pm->indev_name, in->name, sizeof(pm->indev_name));
-	else
-		pm->indev_name[0] = '\0';
 
 	if (out)
 		strncpy(pm->outdev_name, out->name, sizeof(pm->outdev_name));
-	else
-		pm->outdev_name[0] = '\0';
 
 	/* copy_len <= skb->len, so can't fail. */
 	if (skb_copy_bits(skb, 0, pm->payload, copy_len) < 0)
-- 
1.7.10.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox