Netdev List
 help / color / mirror / Atom feed
* Re: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Eric Dumazet @ 2011-10-28 11:06 UTC (permalink / raw)
  To: David Miller
  Cc: jesse, john.r.fastabend, hans.schillstrom, jpirko, mbizon, netdev,
	fubar
In-Reply-To: <1319796053.23112.92.camel@edumazet-laptop>

Le vendredi 28 octobre 2011 à 12:00 +0200, Eric Dumazet a écrit :

> Oh well, this broke my setup, a very basic one.
> 
> eth1 and eth2 on a bonding device, bond0, active-backup
> 
> some vlans on top of bond0, say vlan.103
> 
> $ ip link show dev vlan.103
> 8: vlan.103@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> pfifo_fast state UP qlen 100
>     link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
> 
> 
> arp_rcv() now gets packets with skb->type PACKET_OTHERHOST and drops
> such packets.
> 
>      [000] 52870.115435: skb_gro_reset_offset <-napi_gro_receive
>      [000] 52870.115435: dev_gro_receive <-napi_gro_receive
>      [000] 52870.115435: napi_skb_finish <-napi_gro_receive
>      [000] 52870.115435: netif_receive_skb <-napi_skb_finish
>      [000] 52870.115435: get_rps_cpu <-netif_receive_skb
>      [000] 52870.115435: __netif_receive_skb <-netif_receive_skb
>      [000] 52870.115436: vlan_do_receive <-__netif_receive_skb
>      [000] 52870.115436: bond_handle_frame <-__netif_receive_skb
>      [000] 52870.115436: vlan_do_receive <-__netif_receive_skb
>      [000] 52870.115436: arp_rcv <-__netif_receive_skb
>      [000] 52870.115436: kfree_skb <-arp_rcv
>      [000] 52870.115437: __kfree_skb <-kfree_skb
>      [000] 52870.115437: skb_release_head_state <-__kfree_skb
>      [000] 52870.115437: skb_release_data <-__kfree_skb
>      [000] 52870.115437: kfree <-skb_release_data
>      [000] 52870.115437: kmem_cache_free <-__kfree_skb
> 
> 
> By the way, we have no SNMP counter here so I spent some time to track
> this. I'll send a patch for this.
> 
> If this host initiates the trafic, all is well.
> 
> Please guys, can we get back ARP or revert this patch ?

Following patch cures the problem, I am not sure its the right fix.

Problem is we dont know how many times vlan_do_receive() can be called
for a packet.

Only last call should set/mess pkt_type to PACKET_OTHERHOST.

So the caller should be responsible for this, not vlan_do_receive()


Alternative would be to check skb->dev->rx_handler being NULL,
but its not clean.

Following patch is a hack because it handles multicast/broadcast trafic
only. Unicast is already handled in lines 26-33, this is why we didnt
catch the problem.

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index f1f2f7b..6861899 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -13,7 +13,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
 
 	vlan_dev = vlan_find_dev(skb->dev, vlan_id);
 	if (!vlan_dev) {
-		if (vlan_id)
+		if (vlan_id && skb->pkt_type == PACKET_HOST)
 			skb->pkt_type = PACKET_OTHERHOST;
 		return false;
 	}

^ permalink raw reply related

* [PATCH v7 1/7] SUNRPC: introduce helpers for reference counted rpcbind clients
From: Stanislav Kinsbursky @ 2011-10-28 11:52 UTC (permalink / raw)
  To: Trond.Myklebust
  Cc: linux-nfs, xemul, neilb, netdev, linux-kernel, bfields, davem,
	devel
In-Reply-To: <20111028104530.24628.23631.stgit@localhost6.localdomain6>

v6:
1) added write memory barrier to rpcb_set_local to make sure, that rpcbind
clients become valid before rpcb_users assignment
2) explicitly set rpcb_users to 1 instead of incrementing it (looks clearer from
my pow).

v5: fixed races with rpcb_users in rpcb_get_local()

This helpers will be used for dynamical creation and destruction of rpcbind
clients.
Variable rpcb_users is actually a counter of lauched RPC services. If rpcbind
clients has been created already, then we just increase rpcb_users.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>

---
 net/sunrpc/rpcb_clnt.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index e45d2fb..9830d87 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -114,6 +114,9 @@ static struct rpc_program	rpcb_program;
 static struct rpc_clnt *	rpcb_local_clnt;
 static struct rpc_clnt *	rpcb_local_clnt4;
 
+DEFINE_SPINLOCK(rpcb_clnt_lock);
+unsigned int			rpcb_users;
+
 struct rpcbind_args {
 	struct rpc_xprt *	r_xprt;
 
@@ -161,6 +164,56 @@ static void rpcb_map_release(void *data)
 	kfree(map);
 }
 
+static int rpcb_get_local(void)
+{
+	int cnt;
+
+	spin_lock(&rpcb_clnt_lock);
+	if (rpcb_users)
+		rpcb_users++;
+	cnt = rpcb_users;
+	spin_unlock(&rpcb_clnt_lock);
+
+	return cnt;
+}
+
+void rpcb_put_local(void)
+{
+	struct rpc_clnt *clnt = rpcb_local_clnt;
+	struct rpc_clnt *clnt4 = rpcb_local_clnt4;
+	int shutdown;
+
+	spin_lock(&rpcb_clnt_lock);
+	if (--rpcb_users == 0) {
+		rpcb_local_clnt = NULL;
+		rpcb_local_clnt4 = NULL;
+	}
+	shutdown = !rpcb_users;
+	spin_unlock(&rpcb_clnt_lock);
+
+	if (shutdown) {
+		/*
+		 * cleanup_rpcb_clnt - remove xprtsock's sysctls, unregister
+		 */
+		if (clnt4)
+			rpc_shutdown_client(clnt4);
+		if (clnt)
+			rpc_shutdown_client(clnt);
+	}
+}
+
+static void rpcb_set_local(struct rpc_clnt *clnt, struct rpc_clnt *clnt4)
+{
+	/* Protected by rpcb_create_local_mutex */
+	rpcb_local_clnt = clnt;
+	rpcb_local_clnt4 = clnt4;
+	smp_wmb(); 
+	rpcb_users = 1;
+	dprintk("RPC:       created new rpcb local clients (rpcb_local_clnt: "
+			"%p, rpcb_local_clnt4: %p)\n", rpcb_local_clnt,
+			rpcb_local_clnt4);
+}
+
 /*
  * Returns zero on success, otherwise a negative errno value
  * is returned.

^ permalink raw reply related

* Re: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Eric Dumazet @ 2011-10-28 10:00 UTC (permalink / raw)
  To: David Miller
  Cc: jesse, john.r.fastabend, hans.schillstrom, jpirko, mbizon, netdev,
	fubar
In-Reply-To: <20111018.234723.1235424375699917420.davem@davemloft.net>

Le mardi 18 octobre 2011 à 23:47 -0400, David Miller a écrit :
> From: Jesse Gross <jesse@nicira.com>
> Date: Thu, 13 Oct 2011 17:22:02 -0700
> 
> > Actually, for most of 2.6.x the behavior was somewhat
> > non-deterministic since it depended on kernel version and the NIC.  As
> > a result, I think we can safely say that this wasn't a particularly
> > firm interface that we have to be wedded to.  Based on overwhelming
> > feedback, I think the interface in this patch is the preferred one and
> > what we should stabilize on.
> 
> Agreed, and I've applied this patch to net-next, thanks everyone!

Hmm.

Oh well, this broke my setup, a very basic one.

eth1 and eth2 on a bonding device, bond0, active-backup

some vlans on top of bond0, say vlan.103

$ ip link show dev vlan.103
8: vlan.103@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast state UP qlen 100
    link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff


arp_rcv() now gets packets with skb->type PACKET_OTHERHOST and drops
such packets.

     [000] 52870.115435: skb_gro_reset_offset <-napi_gro_receive
     [000] 52870.115435: dev_gro_receive <-napi_gro_receive
     [000] 52870.115435: napi_skb_finish <-napi_gro_receive
     [000] 52870.115435: netif_receive_skb <-napi_skb_finish
     [000] 52870.115435: get_rps_cpu <-netif_receive_skb
     [000] 52870.115435: __netif_receive_skb <-netif_receive_skb
     [000] 52870.115436: vlan_do_receive <-__netif_receive_skb
     [000] 52870.115436: bond_handle_frame <-__netif_receive_skb
     [000] 52870.115436: vlan_do_receive <-__netif_receive_skb
     [000] 52870.115436: arp_rcv <-__netif_receive_skb
     [000] 52870.115436: kfree_skb <-arp_rcv
     [000] 52870.115437: __kfree_skb <-kfree_skb
     [000] 52870.115437: skb_release_head_state <-__kfree_skb
     [000] 52870.115437: skb_release_data <-__kfree_skb
     [000] 52870.115437: kfree <-skb_release_data
     [000] 52870.115437: kmem_cache_free <-__kfree_skb


By the way, we have no SNMP counter here so I spent some time to track
this. I'll send a patch for this.

If this host initiates the trafic, all is well.

Please guys, can we get back ARP or revert this patch ?

Thanks

^ permalink raw reply

* Re: [PATCH v6 5/8] SUNRPC: cleanup service destruction
From: Stanislav Kinsbursky @ 2011-10-28  9:49 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <20111027213038.GD31669@fieldses.org>

28.10.2011 01:30, J. Bruce Fields пишет:
> On Tue, Oct 25, 2011 at 02:17:18PM +0300, Stanislav Kinsbursky wrote:
>> svc_unregister() call have to be removed from svc_destroy() since it will be
>> called in sv_shutdown callback.
>
> It would be clearer that you're *moving* this if this were merged with
> the following patch.  And without doing that the series isn't quite
> bisectable, unless I'm missing something.
>

Yes, you are right. Will resend new version soon.

> --b.
>
>>
>> Signed-off-by: Stanislav Kinsbursky<skinsbursky@parallels.com>
>>
>> ---
>>   net/sunrpc/svc.c |    1 -
>>   1 files changed, 0 insertions(+), 1 deletions(-)
>>
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index 918edc3..407462f 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -530,7 +530,6 @@ svc_destroy(struct svc_serv *serv)
>>   	if (svc_serv_is_pooled(serv))
>>   		svc_pool_map_put();
>>
>> -	svc_unregister(serv);
>>   	kfree(serv->sv_pools);
>>   	kfree(serv);
>>   }
>>


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

* Re: [PATCH v3 0/3] SUNRPC: rcbind clients virtualization
From: Stanislav Kinsbursky @ 2011-10-28  9:41 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <20111028093036.GA2604@fieldses.org>

28.10.2011 13:30, J. Bruce Fields пишет:
> On Fri, Oct 28, 2011 at 01:24:45PM +0400, Stanislav Kinsbursky wrote:
>> This patch-set was created before you've sent your NFSd plan and we
>> disacussed Lockd per netns.
>> So, this sentence: "NFSd service will be per netns too from my pow"
>> is obsolete. And Lockd will be one for all.
>
> I believe lockd should be pert-netns--at least that's what the server
> needs.
>
> (The single lockd thread may handle requests from all netns, but it
> should behave like a different service depending on netns, so its data
> structures, etc. will need to be per-ns.
>

Sure. Looks like we have misunderstanding here. When I said, that Lockd should 
be one for all, I meaned, that we will have only one kthread for all ns (not one 
per ns). Private data will be per net ns, of course.

BTW, Bruce, please, have a brief look at my e-mail to linux-nfs@vger.kernel.org 
named "SUNRPC: non-exclusive pipe creation".
I've done a lot in "RPC pipefs per net ns" task, and going to send first patches 
soon. But right now I'm really confused will this non-exclusive pipes creation 
and almost ready so remove this functionality. But I'm afraid, that I've missed 
something. Would be greatly appreciate for your opinion about my question.

> --b.
>
>> Or you are asking about something else?
>>
>>> --b.
>>>
>>>> And also we have NFSd file system, which
>>>> is not virtualized yet.
>>>>
>>>> The following series consists of:
>>>>
>>>> ---
>>>>
>>>> Stanislav Kinsbursky (3):
>>>>        SUNRPC: move rpcbind internals to sunrpc part of network namespace context
>>>>        SUNRPC: optimize net_ns dereferencing in rpcbind creation calls
>>>>        SUNRPC: optimize net_ns dereferencing in rpcbind registering calls
>>>>
>>>>
>>>>   net/sunrpc/netns.h     |    5 ++
>>>>   net/sunrpc/rpcb_clnt.c |  103 ++++++++++++++++++++++++++----------------------
>>>>   2 files changed, 61 insertions(+), 47 deletions(-)
>>>>
>>>> --
>>>> Signature
>>
>>
>> --
>> Best regards,
>> Stanislav Kinsbursky


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

* Re: [PATCH v3 0/3] SUNRPC: rcbind clients virtualization
From: J. Bruce Fields @ 2011-10-28  9:30 UTC (permalink / raw)
  To: Stanislav Kinsbursky
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <4EAA74DD.3070708@parallels.com>

On Fri, Oct 28, 2011 at 01:24:45PM +0400, Stanislav Kinsbursky wrote:
> This patch-set was created before you've sent your NFSd plan and we
> disacussed Lockd per netns.
> So, this sentence: "NFSd service will be per netns too from my pow"
> is obsolete. And Lockd will be one for all.

I believe lockd should be pert-netns--at least that's what the server
needs.

(The single lockd thread may handle requests from all netns, but it
should behave like a different service depending on netns, so its data
structures, etc. will need to be per-ns.

--b.

> Or you are asking about something else?
> 
> >--b.
> >
> >>And also we have NFSd file system, which
> >>is not virtualized yet.
> >>
> >>The following series consists of:
> >>
> >>---
> >>
> >>Stanislav Kinsbursky (3):
> >>       SUNRPC: move rpcbind internals to sunrpc part of network namespace context
> >>       SUNRPC: optimize net_ns dereferencing in rpcbind creation calls
> >>       SUNRPC: optimize net_ns dereferencing in rpcbind registering calls
> >>
> >>
> >>  net/sunrpc/netns.h     |    5 ++
> >>  net/sunrpc/rpcb_clnt.c |  103 ++++++++++++++++++++++++++----------------------
> >>  2 files changed, 61 insertions(+), 47 deletions(-)
> >>
> >>--
> >>Signature
> 
> 
> -- 
> Best regards,
> Stanislav Kinsbursky

^ permalink raw reply

* Re: [PATCH v6 4/8] SUNRPC: setup rpcbind clients if service requires it
From: Stanislav Kinsbursky @ 2011-10-28  9:27 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <20111027212732.GC31669@fieldses.org>

28.10.2011 01:27, J. Bruce Fields пишет:
> On Tue, Oct 25, 2011 at 02:17:08PM +0300, Stanislav Kinsbursky wrote:
>> New function ("svc_uses_rpcbind") will be used to detect, that new service will
>> send portmapper register calls. For such services we will create rpcbind
>> clients and remove all stale portmap registrations.
>> Also, svc_rpcb_cleanup() will be set as sv_shutdown callback for such services
>> in case of this field wasn't initialized earlier. This will allow to destroy
>> rpcbind clients when no other users of them left.
>>
>> Note: Currently, any creating service will be detected as portmap user.
>> Probably, this is wrong. But now it depends on program versions "vs_hidden"
>> flag.
>>
>> Signed-off-by: Stanislav Kinsbursky<skinsbursky@parallels.com>
>>
>> ---
>>   net/sunrpc/svc.c |   11 +++++++++--
>>   1 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index d2d61bf..918edc3 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -454,8 +454,15 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>>   		spin_lock_init(&pool->sp_lock);
>>   	}
>>
>> -	/* Remove any stale portmap registrations */
>> -	svc_unregister(serv);
>> +	if (svc_uses_rpcbind(serv)) {
>> +	       	if (svc_rpcb_setup(serv)<  0) {
>> +			kfree(serv->sv_pools);
>> +			kfree(serv);
>> +			return NULL;
>
> Nit: could we convert this (and the previous failure to allocate
> sv_pools) to the usual pattern of collecting the cleanup at the end and
> jumping to it with a goto?
>

Sure, we can. I will implement this "goto pattern", is you insist.

> Looks fine otherwise.
>
> --b.
>
>> +		}
>> +		if (!serv->sv_shutdown)
>> +			serv->sv_shutdown = svc_rpcb_cleanup;
>> +	}
>>
>>   	return serv;
>>   }
>>


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

* Re: [PATCH v3 0/3] SUNRPC: rcbind clients virtualization
From: Stanislav Kinsbursky @ 2011-10-28  9:24 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Trond.Myklebust@netapp.com, linux-nfs@vger.kernel.org,
	Pavel Emelianov, neilb@suse.de, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, davem@davemloft.net,
	devel@openvz.org
In-Reply-To: <20111027202514.GA31669@fieldses.org>

28.10.2011 00:25, J. Bruce Fields пишет:
> On Thu, Oct 27, 2011 at 10:10:43PM +0300, Stanislav Kinsbursky wrote:
>> v3:
>> 1) First two patches from previous version were squashed.
>>
>> This patch-set was created in context of clone of git branch:
>> git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
>> and rebased on tag "v3.1".
>>
>> This patch-set virtualizes rpcbind clients per network namespace context. IOW,
>> each network namespace will have its own pair of rpcbind clients (if they would
>> be created by request).
>>
>> Note:
>> 1) this patch-set depends on "SUNRPC: make rpcbind clients allocated and
>> destroyed dynamically" patch-set which has been send earlier.
>> 2) init_net pointer is still used instead of current->nsproxy->net_ns,
>> because I'm not sure yet about how to virtualize services. I.e. NFS callback
>> services will be per netns. NFSd service will be per netns too from my pow. But
>> Lockd can be per netns or one for all.
>
> I'm not sure what you mean by that; could you explain?
>

This patch-set was created before you've sent your NFSd plan and we disacussed 
Lockd per netns.
So, this sentence: "NFSd service will be per netns too from my pow" is obsolete. 
And Lockd will be one for all.
Or you are asking about something else?

> --b.
>
>> And also we have NFSd file system, which
>> is not virtualized yet.
>>
>> The following series consists of:
>>
>> ---
>>
>> Stanislav Kinsbursky (3):
>>        SUNRPC: move rpcbind internals to sunrpc part of network namespace context
>>        SUNRPC: optimize net_ns dereferencing in rpcbind creation calls
>>        SUNRPC: optimize net_ns dereferencing in rpcbind registering calls
>>
>>
>>   net/sunrpc/netns.h     |    5 ++
>>   net/sunrpc/rpcb_clnt.c |  103 ++++++++++++++++++++++++++----------------------
>>   2 files changed, 61 insertions(+), 47 deletions(-)
>>
>> --
>> Signature


-- 
Best regards,
Stanislav Kinsbursky

^ permalink raw reply

* Re: [RFC] tcp: Export TCP Delayed ACK parameters to user
From: Eric Dumazet @ 2011-10-28  8:44 UTC (permalink / raw)
  To: Daniel Baluta; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <CAEnQRZAj4H9neNsGB8hkrQO5trvP8QimhKhpnwchjTVrAo2LGQ@mail.gmail.com>

Le vendredi 28 octobre 2011 à 11:01 +0300, Daniel Baluta a écrit :
> On Fri, Oct 28, 2011 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le vendredi 28 octobre 2011 à 02:07 +0300, Daniel Baluta a écrit :
> >> RFC2581 ($4.2) specifies when an ACK should be generated as follows:
> >>
> >> " .. an ACK SHOULD be generated for at least every second
> >>   full-sized segment, and MUST be generated within 500 ms
> >>   of the arrival of the first unacknowledged packet.
> >> "
> >>
> >> We export the number of segments and the timeout limits
> >> specified above, so that a user can tune them according
> >> to its needs.
> >>
> >
> > Well, this requires user has a machine exclusive use :)
> 
> So, this means that setting parameters system wide
> isn't an option?
> 

It is a first step, but we can notice a global setting might please one
application but negatively impact other applications.

I guess some users will want a per socket option, but this can come
later. An other idea to save space on socket structures would be to
select two set of values depending on TOS/TCLASS.

I can imagine ssh (lowdelay) and scp (throughput) wanting different
behavior here.

> On Windows there is a global setting TcpAckFrequency [1],
> which is similar with our tcp_delack_{min,max}.
> 
> On Solaris there is a global option tcp_deferred_acks_max [2],
> which is similar with our tcp_delack_segs.
> 

and also has tcp_deferred_ack_interval

> Thanks for your comments, I will post an updated patch asap.
> 
> Daniel.
> 
> [1] http://support.microsoft.com/kb/328890
> [2] http://www.sean.de/Solaris/soltune.html

Dont forget to CC Andy Lutomirski <luto@amacapital.net>, he might be
interested being part of the process.

Thanks

^ permalink raw reply

* Re: [RFC] tcp: Export TCP Delayed ACK parameters to user
From: Daniel Baluta @ 2011-10-28  8:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, kuznet, jmorris, yoshfuji, kaber, netdev
In-Reply-To: <1319760097.19125.55.camel@edumazet-laptop>

On Fri, Oct 28, 2011 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 28 octobre 2011 à 02:07 +0300, Daniel Baluta a écrit :
>> RFC2581 ($4.2) specifies when an ACK should be generated as follows:
>>
>> " .. an ACK SHOULD be generated for at least every second
>>   full-sized segment, and MUST be generated within 500 ms
>>   of the arrival of the first unacknowledged packet.
>> "
>>
>> We export the number of segments and the timeout limits
>> specified above, so that a user can tune them according
>> to its needs.
>>
>
> Well, this requires user has a machine exclusive use :)

So, this means that setting parameters system wide
isn't an option?

On Windows there is a global setting TcpAckFrequency [1],
which is similar with our tcp_delack_{min,max}.

On Solaris there is a global option tcp_deferred_acks_max [2],
which is similar with our tcp_delack_segs.

Thanks for your comments, I will post an updated patch asap.

Daniel.

[1] http://support.microsoft.com/kb/328890
[2] http://www.sean.de/Solaris/soltune.html

^ permalink raw reply

* >Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Eric Dumazet @ 2011-10-28  4:43 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, netdev, Andrew Morton
In-Reply-To: <1319772566.6759.27.camel@deadeye>

Le vendredi 28 octobre 2011 à 04:29 +0100, Ben Hutchings a écrit :
> On Fri, 2011-10-28 at 04:52 +0200, Eric Dumazet wrote:
> > Le vendredi 28 octobre 2011 à 02:44 +0100, Ben Hutchings a écrit :
> > 
> > > Whether or not it needs to provide any ordering guarantee, atomic_read()
> > > must never read more than once, and I think that requires the volatile
> > > qualification.  It might be clearer to use the ACCESS_ONCE macro,
> > > however.
> > > 
> > 
> > Where this requirement comes from ?
> 
> That is the conventional behaviour of 'atomic' operations, and callers
> may depend on it.
> 

Thats your opinion, not a fact documented in
Documentation/atomic_ops.txt

<QUOTE>
---------------------------------------------------------
Next, we have:

        #define atomic_read(v)  ((v)->counter)

which simply reads the counter value currently visible to the calling thread.
The read is atomic in that the return value is guaranteed to be one of the
values initialized or modified with the interface operations if a proper
implicit or explicit memory barrier is used after possible runtime
initialization by any other thread and the value is modified only with the
interface operations.  atomic_read does not guarantee that the runtime
initialization by any other thread is visible yet, so the user of the
interface must take care of that with a proper implicit or explicit memory
barrier.

*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

Some architectures may choose to use the volatile keyword, barriers, or inline
assembly to guarantee some degree of immediacy for atomic_read() and
atomic_set().  This is not uniformly guaranteed, and may change in the future,
so all users of atomic_t should treat atomic_read() and atomic_set() as simple
C statements that may be reordered or optimized away entirely by the compiler
or processor, and explicitly invoke the appropriate compiler and/or memory
barrier for each use case.  Failure to do so will result in code that may
suddenly break when used with different architectures or compiler
optimizations, or even changes in unrelated code which changes how the
compiler optimizes the section accessing atomic_t variables.

*** YOU HAVE BEEN WARNED! ***
---------------------------------------------------------------
</QUOTE>

The only requirement of atomic_read() is that it must return value
before or after an atomic_write(), not a garbled value.

So the ACCESS_ONCE semantic is on the write side (the atomic itself
cannot be used as a temporary variable. It must contain only two value
of atomic_{write|add|sub}() [ prior to the operation, after the
operation ]

In fact, if a compiler is stupid enough to issue two reads on following
code :

static inline atomic_read(const atomic_t *p)
{
	return p->value;
}

Then its fine, because in the end the returned value will be the old or
new one.

^ permalink raw reply

* Re: [PATCH] ipv6: fix error propagation in ip6_ufo_append_data()
From: David Miller @ 2011-10-28  4:26 UTC (permalink / raw)
  To: zheng.z.yan; +Cc: netdev
In-Reply-To: <4EAA2E74.2000703@intel.com>

From: "Yan, Zheng" <zheng.z.yan@intel.com>
Date: Fri, 28 Oct 2011 12:24:20 +0800

> We should return errcode from sock_alloc_send_skb()
> 
> Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>

Applied, thanks!

^ permalink raw reply

* [PATCH] ipv6: fix error propagation in ip6_ufo_append_data()
From: Yan, Zheng @ 2011-10-28  4:24 UTC (permalink / raw)
  To: netdev@vger.kernel.org, davem@davemloft.net

We should return errcode from sock_alloc_send_skb()

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
---
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index ff30047..84d0bd5 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1123,7 +1123,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 			hh_len + fragheaderlen + transhdrlen + 20,
 			(flags & MSG_DONTWAIT), &err);
 		if (skb == NULL)
-			return -ENOMEM;
+			return err;
 
 		/* reserve space for Hardware header */
 		skb_reserve(skb, hh_len);

^ permalink raw reply related

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Ben Hutchings @ 2011-10-28  3:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, netdev, Andrew Morton
In-Reply-To: <1319770376.23112.58.camel@edumazet-laptop>

On Fri, 2011-10-28 at 04:52 +0200, Eric Dumazet wrote:
> Le vendredi 28 octobre 2011 à 02:44 +0100, Ben Hutchings a écrit :
> 
> > Whether or not it needs to provide any ordering guarantee, atomic_read()
> > must never read more than once, and I think that requires the volatile
> > qualification.  It might be clearer to use the ACCESS_ONCE macro,
> > however.
> > 
> 
> Where this requirement comes from ?

That is the conventional behaviour of 'atomic' operations, and callers
may depend on it.

> Maybe then introduce atomic_read_once() for users really needing it :)
>
> ACCESS_ONCE will force the read/move instruction I try to avoid :(
[...]

I'm sure you can find some other way to avoid it.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/3] stmmac: fix a bug while checking the HW cap reg (v2)
From: David Miller @ 2011-10-28  3:17 UTC (permalink / raw)
  To: peppe.cavallaro; +Cc: netdev
In-Reply-To: <1319694189-25223-1-git-send-email-peppe.cavallaro@st.com>


All 3 patches applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6 0/2] be2net: fixes
From: David Miller @ 2011-10-28  3:17 UTC (permalink / raw)
  To: somnath.kotur; +Cc: netdev
In-Reply-To: <900362ce-e840-44cb-bfcf-72eba609edf3@exht2.ad.emulex.com>

From: Somnath Kotur <somnath.kotur@Emulex.Com>
Date: Thu, 27 Oct 2011 22:41:47 +0530

> Somnath Kotur (2):
>   be2net: Refactored be_cmds.c file.
>   be2net: Changing MAC Address of a VF (in SR-IOV case) was broken.

Applied.

^ permalink raw reply

* Re: [PATCH net -v2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
From: David Miller @ 2011-10-28  3:15 UTC (permalink / raw)
  To: mitsuo.hayasaka.hu
  Cc: fubar, netdev, xiyou.wangcong, shemminger, andy, linux-kernel,
	yrl.pp-manager.tt
In-Reply-To: <4EAA0ADF.5020504@hitachi.com>

From: HAYASAKA Mitsuo <mitsuo.hayasaka.hu@hitachi.com>
Date: Fri, 28 Oct 2011 10:52:31 +0900

> I checked your patch, and any problems have not been observed for
> almost one day although I could reproduce the BUG in the latest net-next
> without it. 
> 
> I think this patch works well.
> 
> Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>

Jay, please formally submit this patch with proper signoffs etc.

Thanks!

^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Remove quotes of maintaners's names
From: David Miller @ 2011-10-28  3:01 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: marcos.mage, akpm, netdev, linux-kernel
In-Reply-To: <610B86C6-622E-4EFE-89DF-BF09BDB4D765@intel.com>

From: "Kirsher, Jeffrey T" <jeffrey.t.kirsher@intel.com>
Date: Thu, 27 Oct 2011 14:35:54 -0700

> The reason these names are in quotes is because they need to be
> wrapped in quotes when used in a git patch (I.e. CC: "David
> S. Miller" <davem@davemloft.net>) because of the . in the friendly
> name.

That's correct, if a character such as "." appears in an email
header field it must be surrounded by double quotes, as per
SMTP rules.

VGER rejects any email where this rule is not followed properly.

^ permalink raw reply

* Re: [patch net-next]alx: Atheros AR8131/AR8151/AR8152/AR8161 Ethernet driver
From: Joe Perches @ 2011-10-28  2:56 UTC (permalink / raw)
  To: Francois Romieu; +Cc: cloud.ren, davem, Luis.Rodriguez, netdev, linux-kernel
In-Reply-To: <20111019222140.GA9937@electric-eye.fr.zoreil.com>

On Thu, 2011-10-20 at 00:21 +0200, Francois Romieu wrote:
> cloud.ren@atheros.com <cloud.ren@atheros.com> :
> > diff --git a/drivers/net/ethernet/atheros/alx/alf_hw.c b/drivers/net/ethernet/atheros/alx/alf_hw.c
[]
> > +	if (en) { /* enable */
> > +		MEM_W32(ctx, L1F_RXQ0, rxq | L1F_RXQ0_EN);
> > +		MEM_W32(ctx, L1F_TXQ0, txq | L1F_TXQ0_EN);
> > +		if (0 != (en_ctrl & LX_MACSPEED_1000)) {
> > +			FIELD_SETL(mac, L1F_MAC_CTRL_SPEED,
> > +			    L1F_MAC_CTRL_SPEED_1000);
> > +		} else {
> > +			FIELD_SETL(mac, L1F_MAC_CTRL_SPEED,
> > +			    L1F_MAC_CTRL_SPEED_10_100);
> > +		}
> > +		if (0 != (en_ctrl & LX_MACDUPLEX_FULL)) {
> > +			mac |= L1F_MAC_CTRL_FULLD;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_FULLD;
> > +		}
> > +		/* rx filter */
> > +		if (0 != (en_ctrl & LX_FLT_PROMISC)) {
> > +			mac |= L1F_MAC_CTRL_PROMISC_EN;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_PROMISC_EN;
> > +		}
> > +		if (0 != (en_ctrl & LX_FLT_MULTI_ALL)) {
> > +			mac |= L1F_MAC_CTRL_MULTIALL_EN;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_MULTIALL_EN;
> > +		}
> > +		if (0 != (en_ctrl & LX_FLT_BROADCAST)) {
> > +			mac |= L1F_MAC_CTRL_BRD_EN;
> > +		} else {
> > +			mac &= ~L1F_MAC_CTRL_BRD_EN;
> > +		}
> Code duplication. Who cares ?

Maybe add a macro like:

#define mac_ctrl(mac, ctrl, flag, bit)		\
do {						\
	if ((ctrl) & (flag))			\
		mac |= (bit);			\
	else					\
		mac &= ~(bit);			\
} while (0)

so these become

	mac_ctrl(mac, en_ctrl, LX_MACDUPLEX_FULL, L1F_MAC_CTRL_FULLD);
	mac_ctrl(mac, en_ctrl, LX_FLT_PROMISC, L1F_MAC_CTRL_PROMISC_EN);
	mac_ctrl(mac, en_ctrl, LX_FLT_MULTI_ALL, L1F_MAC_CTRL_MULTIALL_EN);
	mac_ctrl(mac, en_ctrl, LX_FLT_BROADCAST, L1F_MAC_CTRL_BRD_EN);
	mac_ctrl(mac, en_ctrl, LX_FLT_DIRECT, L1F_MAC_CTRL_RX_EN);
	mac_ctrl(mac, en_ctrl, LX_FC_TXEN, L1F_MAC_CTRL_TXFC_EN);
	mac_ctrl(mac, en_ctrl, LX_FC_RXEN, L1F_MAC_CTRL_RXFC_EN);
	etc.

Or maybe add mac, en_ctrl and L1F_MAC_CTRL_ to the macro if you
really want to shorten it up.

	mac_ctrl(MACDUPLEX_FULL, FULLD);
	mac_ctrl(FLT_PROMISC, PROMISC_EN);
	mac_ctrl(FLT_MULTI_ALL, MULTIALL_EN);
	etc.

> It may make some sense to move these ~60 loc in a xyz_enable_something
> method...

2 macros?

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Eric Dumazet @ 2011-10-28  2:52 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, netdev, Andrew Morton
In-Reply-To: <1319766293.6759.17.camel@deadeye>

Le vendredi 28 octobre 2011 à 02:44 +0100, Ben Hutchings a écrit :

> Whether or not it needs to provide any ordering guarantee, atomic_read()
> must never read more than once, and I think that requires the volatile
> qualification.  It might be clearer to use the ACCESS_ONCE macro,
> however.
> 

Where this requirement comes from ?

Maybe then introduce atomic_read_once() for users really needing it :)

ACCESS_ONCE will force the read/move instruction I try to avoid :(

By the way, removing volatile on my x86_64 defconfig saves a bit of
space :

# size vmlinux vmlinux.old
   text	   data	    bss	    dec	    hex	filename
10907585	2890992	1540096	15338673	 ea0cb1	vmlinux
10912342	2891056	1540096	15343494	 ea1f86	vmlinux.old

booted and everything seems fine, but obviously lightly tested...

Note : There is still one useless access to page-flags,
'fixing' atomic_read() is not enough.


    0,04 :        ffffffff815e4675:       je     ffffffff815e4870 <tcp_sendmsg+0xc80>
    0,08 :        ffffffff815e467b:       mov    (%r9),%rax   // useless
    2,08 :        ffffffff815e467e:       lock incl 0x1c(%r9)
    3,58 :        ffffffff815e4683:       mov    (%r9),%rax
    0,04 :        ffffffff815e4686:       test   $0x80,%ah
    0,00 :        ffffffff815e4689:       jne    ffffffff815e48ee <tcp_sendmsg+0xcfe>

^ permalink raw reply

* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter
From: Eric Dumazet @ 2011-10-28  2:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, David S. Miller
In-Reply-To: <20111028013729.GA6524@neilslaptop.think-freely.org>

Le jeudi 27 octobre 2011 à 21:37 -0400, Neil Horman a écrit :
> > 
> Yes, I'd experimented with some of this, and had mixed results.  Specifically I
> came up with a way to use the additional space as an array of list heads, with
> backpointers to the socket_queue each list_head was owned by and the skb that
> owned the entry.  It was nice because it let me enqueue the same skb to hundreds
> of sockets at the same time, which was really nice.  It was bad however, because
> it completely broke socket accounting (any likely anything else that required on
> any part of the socket state).  Any thoughts on ways I might improve on that.
> If I could make some sort of reduced sk_buff so that I didn't have to allocate
> all 256 bytes of a full sk_buff, that would be great!

It seems your objectives are contradictory (memory saving and cpu
saving). Most of the time we have to fight false sharing first.

I dont want to try to use the available space from skb, since its cache
unfriendly, in case you have one CPU 100% in softirq handling,
dispatching messages to XX other application cpus.

You dont want each application cpu slowing down other cpus by
manipulating a list anchor, contained in a shared cache line, highly
contended. We already have one high contention point (skb refcount or
data refcount).

Another point of interest is that modern NIC tend to use paged frag
skbs, with very litle space available in skb head. frag part is
readonly, and sometime with no available space neither.

So your idea only works on some (old gen) nics and some narrow
conditions.

I believe it adds too much complex code for this.

I understand you are disappointed, but maybe you should have shared your
ideas before coding them !

^ permalink raw reply

* Re: [PATCH net -v2] [BUGFIX] bonding: use flush_delayed_work_sync in bond_close
From: HAYASAKA Mitsuo @ 2011-10-28  1:52 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: netdev, Américo Wang, Stephen Hemminger, Andy Gospodarek,
	linux-kernel, yrl.pp-manager.tt
In-Reply-To: <2216.1319650260@death>

Hi Joy,

I checked your patch, and any problems have not been observed for
almost one day although I could reproduce the BUG in the latest net-next
without it. 

I think this patch works well.

Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>


(2011/10/27 2:31), Jay Vosburgh wrote:
> HAYASAKA Mitsuo <mitsuo.hayasaka.hu@hitachi.com> wrote:
> [...]
>> The interval of mii_mon was set to 1 to reproduce this bug easily and 
>> the 802.3ad mode was used. Then, I executed the following command.
>>
>> # while true; do ifconfig bond0 down; done &
>> # while true; do ifconfig bond0 up; done &
>>
>> This bug rarely occurs since it is the severe timing problem.
>> I found that it is more easily to reproduce this bug when using guest OS.
>>
>> For example, it took one to three days for me to reproduce it on host OS, 
>> but some hours on guest OS.
> 
> 	Could you test this patch and see if it resolves the problem?
> 
> 	This patch does a few things:
> 
> 	All of the monitor functions that run on work queues are
> modified to never unconditionally acquire RTNL; all will reschedule the
> work and return if rtnl_trylock fails.  This covers bond_mii_monitor,
> bond_activebackup_arp_mon, and bond_alb_monitor.
> 
> 	The "clear out the work queues" calls in bond_close and
> bond_uninit now call cancel_delayed_work_sync, which should either
> delete a pending work item, or wait for an executing item to complete.
> I chose cancel_ over the original patch's flush_ because we just want
> the work queue stopped.  We don't need to have any pending items execute
> if they're not already running.
> 
> 	Also in reference to the previous, I'm not sure if we still need
> to check for delayed_work_pending, but I've left those checks in place.
> 
> 	Remove the "kill_timers" field and all references to it.  If
> cancel_delayed_work_sync is safe to use, we do not need an extra
> sentinel.
> 
> 	Lastly, for testing purposes only, the bond_alb_monitor in this
> patch includes an unconditional call to rtnl_trylock(); this is an
> artifical way to make the race in that function easier to test for,
> because the real race is very difficult to hit.
> 
> 	This patch is against net-next as of yesterday.
> 
> 	Comments?
> 
> 	-J
> 
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index b33c099..0ae0d7c 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -2110,9 +2110,6 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	//check if there are any slaves
>  	if (bond->slave_cnt == 0)
>  		goto re_arm;
> @@ -2161,9 +2158,8 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
>  	}
>  
>  re_arm:
> -	if (!bond->kill_timers)
> -		queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks);
> -out:
> +	queue_delayed_work(bond->wq, &bond->ad_work, ad_delta_in_ticks);
> +
>  	read_unlock(&bond->lock);
>  }
>  
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index d4fbd2e..13d1bf9 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -1343,10 +1343,6 @@ void bond_alb_monitor(struct work_struct *work)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers) {
> -		goto out;
> -	}
> -
>  	if (bond->slave_cnt == 0) {
>  		bond_info->tx_rebalance_counter = 0;
>  		bond_info->lp_counter = 0;
> @@ -1394,6 +1390,14 @@ void bond_alb_monitor(struct work_struct *work)
>  		bond_info->tx_rebalance_counter = 0;
>  	}
>  
> +	/* XXX - unconditional attempt at RTNL for testing purposes, as
> +	 * normal case, below, is difficult to induce.
> +	 */
> +	read_unlock(&bond->lock);
> +	if (rtnl_trylock())
> +		rtnl_unlock();
> +	read_lock(&bond->lock);
> +
>  	/* handle rlb stuff */
>  	if (bond_info->rlb_enabled) {
>  		if (bond_info->primary_is_promisc &&
> @@ -1404,7 +1408,10 @@ void bond_alb_monitor(struct work_struct *work)
>  			 * nothing else.
>  			 */
>  			read_unlock(&bond->lock);
> -			rtnl_lock();
> +			if (!rtnl_trylock()) {
> +				read_lock(&bond->lock);
> +				goto re_arm;
> +			}
>  
>  			bond_info->rlb_promisc_timeout_counter = 0;
>  
> @@ -1440,9 +1447,8 @@ void bond_alb_monitor(struct work_struct *work)
>  	}
>  
>  re_arm:
> -	if (!bond->kill_timers)
> -		queue_delayed_work(bond->wq, &bond->alb_work, alb_delta_in_ticks);
> -out:
> +	queue_delayed_work(bond->wq, &bond->alb_work, alb_delta_in_ticks);
> +
>  	read_unlock(&bond->lock);
>  }
>  
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 71efff3..e6fefff 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -774,9 +774,6 @@ static void bond_resend_igmp_join_requests(struct bonding *bond)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	/* rejoin all groups on bond device */
>  	__bond_resend_igmp_join_requests(bond->dev);
>  
> @@ -790,9 +787,9 @@ static void bond_resend_igmp_join_requests(struct bonding *bond)
>  			__bond_resend_igmp_join_requests(vlan_dev);
>  	}
>  
> -	if ((--bond->igmp_retrans > 0) && !bond->kill_timers)
> +	if (--bond->igmp_retrans > 0)
>  		queue_delayed_work(bond->wq, &bond->mcast_work, HZ/5);
> -out:
> +
>  	read_unlock(&bond->lock);
>  }
>  
> @@ -2518,19 +2515,26 @@ void bond_mii_monitor(struct work_struct *work)
>  	struct bonding *bond = container_of(work, struct bonding,
>  					    mii_work.work);
>  	bool should_notify_peers = false;
> +	unsigned long delay;
>  
>  	read_lock(&bond->lock);
> -	if (bond->kill_timers)
> -		goto out;
> +
> +	delay = msecs_to_jiffies(bond->params.miimon);
>  
>  	if (bond->slave_cnt == 0)
>  		goto re_arm;
>  
> -	should_notify_peers = bond_should_notify_peers(bond);
> -
>  	if (bond_miimon_inspect(bond)) {
>  		read_unlock(&bond->lock);
> -		rtnl_lock();
> +
> +		/* Race avoidance with bond_close flush of workqueue */
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			delay = 1;
> +			should_notify_peers = false;
> +			goto re_arm;
> +		}
> +
>  		read_lock(&bond->lock);
>  
>  		bond_miimon_commit(bond);
> @@ -2540,15 +2544,21 @@ void bond_mii_monitor(struct work_struct *work)
>  		read_lock(&bond->lock);
>  	}
>  
> +	should_notify_peers = bond_should_notify_peers(bond);
> +
>  re_arm:
> -	if (bond->params.miimon && !bond->kill_timers)
> -		queue_delayed_work(bond->wq, &bond->mii_work,
> -				   msecs_to_jiffies(bond->params.miimon));
> -out:
> +	if (bond->params.miimon)
> +		queue_delayed_work(bond->wq, &bond->mii_work, delay);
> +
>  	read_unlock(&bond->lock);
>  
>  	if (should_notify_peers) {
> -		rtnl_lock();
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			bond->send_peer_notif++;
> +			read_unlock(&bond->lock);
> +			return;
> +		}
>  		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
>  		rtnl_unlock();
>  	}
> @@ -2790,9 +2800,6 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
>  
>  	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	if (bond->slave_cnt == 0)
>  		goto re_arm;
>  
> @@ -2889,9 +2896,9 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
>  	}
>  
>  re_arm:
> -	if (bond->params.arp_interval && !bond->kill_timers)
> +	if (bond->params.arp_interval)
>  		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
> -out:
> +
>  	read_unlock(&bond->lock);
>  }
>  
> @@ -3132,9 +3139,6 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>  
>  	read_lock(&bond->lock);
>  
> -	if (bond->kill_timers)
> -		goto out;
> -
>  	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
>  
>  	if (bond->slave_cnt == 0)
> @@ -3144,7 +3148,15 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>  
>  	if (bond_ab_arp_inspect(bond, delta_in_ticks)) {
>  		read_unlock(&bond->lock);
> -		rtnl_lock();
> +
> +		/* Race avoidance with bond_close flush of workqueue */
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			delta_in_ticks = 1;
> +			should_notify_peers = false;
> +			goto re_arm;
> +		}
> +
>  		read_lock(&bond->lock);
>  
>  		bond_ab_arp_commit(bond, delta_in_ticks);
> @@ -3157,13 +3169,18 @@ void bond_activebackup_arp_mon(struct work_struct *work)
>  	bond_ab_arp_probe(bond);
>  
>  re_arm:
> -	if (bond->params.arp_interval && !bond->kill_timers)
> +	if (bond->params.arp_interval)
>  		queue_delayed_work(bond->wq, &bond->arp_work, delta_in_ticks);
> -out:
> +
>  	read_unlock(&bond->lock);
>  
>  	if (should_notify_peers) {
> -		rtnl_lock();
> +		if (!rtnl_trylock()) {
> +			read_lock(&bond->lock);
> +			bond->send_peer_notif++;
> +			read_unlock(&bond->lock);
> +			return;
> +		}
>  		netdev_bonding_change(bond->dev, NETDEV_NOTIFY_PEERS);
>  		rtnl_unlock();
>  	}
> @@ -3425,8 +3442,6 @@ static int bond_open(struct net_device *bond_dev)
>  	struct slave *slave;
>  	int i;
>  
> -	bond->kill_timers = 0;
> -
>  	/* reset slave->backup and slave->inactive */
>  	read_lock(&bond->lock);
>  	if (bond->slave_cnt > 0) {
> @@ -3495,33 +3510,30 @@ static int bond_close(struct net_device *bond_dev)
>  
>  	bond->send_peer_notif = 0;
>  
> -	/* signal timers not to re-arm */
> -	bond->kill_timers = 1;
> -
>  	write_unlock_bh(&bond->lock);
>  
>  	if (bond->params.miimon) {  /* link check interval, in milliseconds. */
> -		cancel_delayed_work(&bond->mii_work);
> +		cancel_delayed_work_sync(&bond->mii_work);
>  	}
>  
>  	if (bond->params.arp_interval) {  /* arp interval, in milliseconds. */
> -		cancel_delayed_work(&bond->arp_work);
> +		cancel_delayed_work_sync(&bond->arp_work);
>  	}
>  
>  	switch (bond->params.mode) {
>  	case BOND_MODE_8023AD:
> -		cancel_delayed_work(&bond->ad_work);
> +		cancel_delayed_work_sync(&bond->ad_work);
>  		break;
>  	case BOND_MODE_TLB:
>  	case BOND_MODE_ALB:
> -		cancel_delayed_work(&bond->alb_work);
> +		cancel_delayed_work_sync(&bond->alb_work);
>  		break;
>  	default:
>  		break;
>  	}
>  
>  	if (delayed_work_pending(&bond->mcast_work))
> -		cancel_delayed_work(&bond->mcast_work);
> +		cancel_delayed_work_sync(&bond->mcast_work);
>  
>  	if (bond_is_lb(bond)) {
>  		/* Must be called only after all
> @@ -4368,26 +4380,22 @@ static void bond_setup(struct net_device *bond_dev)
>  
>  static void bond_work_cancel_all(struct bonding *bond)
>  {
> -	write_lock_bh(&bond->lock);
> -	bond->kill_timers = 1;
> -	write_unlock_bh(&bond->lock);
> -
>  	if (bond->params.miimon && delayed_work_pending(&bond->mii_work))
> -		cancel_delayed_work(&bond->mii_work);
> +		cancel_delayed_work_sync(&bond->mii_work);
>  
>  	if (bond->params.arp_interval && delayed_work_pending(&bond->arp_work))
> -		cancel_delayed_work(&bond->arp_work);
> +		cancel_delayed_work_sync(&bond->arp_work);
>  
>  	if (bond->params.mode == BOND_MODE_ALB &&
>  	    delayed_work_pending(&bond->alb_work))
> -		cancel_delayed_work(&bond->alb_work);
> +		cancel_delayed_work_sync(&bond->alb_work);
>  
>  	if (bond->params.mode == BOND_MODE_8023AD &&
>  	    delayed_work_pending(&bond->ad_work))
> -		cancel_delayed_work(&bond->ad_work);
> +		cancel_delayed_work_sync(&bond->ad_work);
>  
>  	if (delayed_work_pending(&bond->mcast_work))
> -		cancel_delayed_work(&bond->mcast_work);
> +		cancel_delayed_work_sync(&bond->mcast_work);
>  }
>  
>  /*
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 82fec5f..1aecc37 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -222,7 +222,6 @@ struct bonding {
>  			       struct slave *);
>  	rwlock_t lock;
>  	rwlock_t curr_slave_lock;
> -	s8       kill_timers;
>  	u8	 send_peer_notif;
>  	s8	 setup_by_slave;
>  	s8       igmp_retrans;
> 
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Ben Hutchings @ 2011-10-28  1:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Eric Dumazet, linux-kernel, netdev, Andrew Morton
In-Reply-To: <CA+55aFy9oJSK99K=brRzSfSnLocO=PuS7yuKkAcw3t5hATJrfw@mail.gmail.com>

On Thu, 2011-10-27 at 18:34 -0700, Linus Torvalds wrote:
> On Thu, Oct 27, 2011 at 6:25 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > Seems reasonable too. In fact we usually should have memory barriers
> > for this anyways which obsolete the volatile.
> 
> No we shouldn't. Memory barriers are insanely expensive, and pointless
> for atomics - that aren't ordered anyway.
> 
> You may mean compiler barriers.
> 
> That said, removing the volatile entirely might be a good idea, and
> never mind any barriers at all. The ordering for atomics really isn't
> well enough specified that we should care. So I wouldn't object to a
> patch that just removes the volatile entirely, but it would have to be
> accompanied with quite a bit of testing, in case some odd case ends up
> depending on it. But nothing *should* be looping on those things
> anyway.

Whether or not it needs to provide any ordering guarantee, atomic_read()
must never read more than once, and I think that requires the volatile
qualification.  It might be clearer to use the ACCESS_ONCE macro,
however.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter
From: Neil Horman @ 2011-10-28  1:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller
In-Reply-To: <1319756146.19125.42.camel@edumazet-laptop>

On Fri, Oct 28, 2011 at 12:55:46AM +0200, Eric Dumazet wrote:
> Le jeudi 27 octobre 2011 à 15:53 -0400, Neil Horman a écrit :
> > I had this idea awhile ago while I was looking at the receive path for multicast
> > frames.   The top of the mcast recieve path (in __udp4_lib_mcast_deliver, has a
> > loop in which we traverse a hash list linearly, looking for sockets that are
> > listening to a given multicast group.  For each matching socket we clone the skb
> > to enqueue it to the corresponding socket.  This creates two problems:
> > 
> > 1) Application driven jitter in the receive path
> >    As you add processes that listen to the same multcast group, you increase the
> > number of iterations you have to preform in this loop, which can lead to
> > increases in the amount of time you spend processing each frame in softirq
> > context, expecially if you are memory constrained, and the skb_clone operation
> > has to call all the way back into the buddy allocator for more ram.  This can
> > lead to needlessly dropped frames as rx latency increases in the stack.
> > 
> 
> Hmm... time to perform this loop not depends on memory constraints,
> since GFP_ATOMIC allocations are done. It succeed or not, immediately.
> 
Thats not entirely correct.  The time it takes to allocate a new object from the
slab can be very much affected by memory contraints.  Systems under memory
pressue may need to shrink slab caches, allocate from remote nodes, etc, all of
which contributes to additional allocation time.  This time is multiplied by the
number of sockets listening to a particular group.  I'm not saying its a huge
change, but its a change that I saw the ability to address.   This change (I
hope) tries, when able, to eliminate the jitter from that path.

> Time is consumed on the copy of the skb head, and refcnt
> increases/decreases on datarefcnt. Your patch doesnt avoid this.
> 
No it doesn't, you're correct.  I was really looking at lower hanging fruit with
this change.

> When application calls recvmsg() we then perform the two atomics on skb
> refcnt and data refcnt and free them, with cache line false sharing...
> 
> > 2) Increased memory usage
> >    As you increase the number of listeners to a multicast group, you directly
> > increase the number of times you clone and skb, putting increased memory
> > pressure on the system.
> > 
> 
> One skb_head is about 256 bytes (232 bytes on 64bit arches)
> 
Yes, thats right.  Depending on the size of the received frame I can allocate
anywhere from 1 to 4 additional skbs using otherwise unused memory at the tail
of the data segment with this patch.

> > while neither of these problems is a huge concern, I thought it would be nice if
> > we could mitigate the effects of increased application instances on performance
> > in this area.  As such I came up with this patch set.  I created a new skb
> > fclone type called FCLONE_SCRATCH.  When available, it commandeers the
> > internally fragmented space of an skb data buffer and uses that to allocate
> > additional skbs during the clone operation. Since the skb->data area is
> > allocated with a kmalloc operation (and is therefore nominally a power of 2 in
> > size), and nominally network interfaces tend to have an mtu of around 1500
> > bytes, we typically can reclaim several hundred bytes of space at the end of an
> > skb (more if the incomming packet is not a full MTU in size).  This space, being
> > exclusively accessible to the softirq doing the reclaim, can be quickly accesed
> > without the need for additional locking, potntially providing lower jitter in
> > napi context per frame during a receive operation, as well as some memory
> > savings.
> > 
> > I'm still collecting stats on its performance, but I thought I would post now to
> > get some early reviews and feedback on it.
> > 
> 
> I really doubt you'll find a significative performance increase.
> 
With the perf script I included in this set, I'm currently reducing the napi rx
path runtime by about 2 microsecond per iteration when multiple processes are
listening to the same group, and we can allocate skbs from the data area of the
received skb.  I'm not sure how accurate (or valuable) that is, but thats what
I'm getting.  While I've not measured, theres also the fact that kfree_skb doesn't
have any work to do with FCLONE_SCRATCH skbs, since they all get returned when
the data area gets kfreed.  Lastly, theres the savings of 256 bytes per
listening app that uses an FCLONE_SCRATCH skb, since that memory is already
reserved, and effectively free.  I can't really put a number on how significant
that is, but I think its valuable.


> I do believe its a too complex : skb code is already a nightmare if you
> ask me.
> 
I'm not sure I agree with that.  Its certainly complex in the sense that it adds
additional state to be checked in the skb_clone and kfree_skb paths, but beyond
that, it doesn't require any semantic changes to any of the other parts of the
skb api(s).  Is there something specific about this change that you find
unacceptibly complex, or something I can do to make it simpler?

> And your hack/idea wont work quite well if you have 8 receivers for each
> frame.
> 
It will work fine, to whatever degree its able, and this it will fall back on
the standard skb_clone behavior.  The performance gains will be reduced, of
course, which I think is what you are referring to, but the memory advantages
still exit.  If you have 8 receivers, but a given skb has only enough free space
at the end of the data segment to allocate 4 additional skbs, we'll allocate
those 4, and do a real kmem_cache_alloc of 4 more in skb_clone.  The last 4 have
to use the slab_allocation path as we otherwise would, so we're no worse off
than before, but we still get the memory savings (which would be about 1k per
received frame for this mcast group, which I think is significant).

> What about finding another way to queue one skb to N receive queue(s),
> so that several multicast sockets can share same skb head ?
> 
Yes, I'd experimented with some of this, and had mixed results.  Specifically I
came up with a way to use the additional space as an array of list heads, with
backpointers to the socket_queue each list_head was owned by and the skb that
owned the entry.  It was nice because it let me enqueue the same skb to hundreds
of sockets at the same time, which was really nice.  It was bad however, because
it completely broke socket accounting (any likely anything else that required on
any part of the socket state).  Any thoughts on ways I might improve on that.
If I could make some sort of reduced sk_buff so that I didn't have to allocate
all 256 bytes of a full sk_buff, that would be great!

> I always found sk_receive queue being very inefficient, since a queue or
> dequeue must dirty a lot of cache lines.
> 
> This forces us to use a spinlock to protect queue/enqueue operations,
> but also the socket lock (because of the MSG_PEEK stuff and
> sk_rmem_alloc / sk_forward_alloc)
> 
> sk_receive_queue.lock is the real jitter source.
> 
> Ideally, we could have a fast path using a small circular array per
> socket, of say 8 or 16 pointers to skbs, or allow application or
> sysadmin to size this array.
> 
> A circular buffer can be handled without any lock, using atomic
> operations (cmpxchg()) on load/unload indexes. The array of pointers is
> written only by the softirq handler cpu, read by consumers.
> 
> Since this array is small [and finite size], and skb shared, we dont
> call skb_set_owner_r() anymore, avoiding expensive atomic ops on
> sk->sk_rmem_alloc.
> 
> UDP receive path could become lockless, allowing the softirq handler to
> run without being slowed down by concurrent recvmsg()
> 
> At recvmsg() time, N-1 threads would only perform the skb->refcnt
> decrement, and the last one would free the skb and data as well.
> 
This seems like a reasonable idea, and I'm happy to experiment with it, but I
still like the idea of using that often available space at the end of an skb,
just because I think the memory savings is attractive.  Shall I roll them into
the same patch series, or do them separately?
Neil

> 
> 
> 

^ permalink raw reply

* Re: [RFC] should VM_BUG_ON(cond) really evaluate cond
From: Linus Torvalds @ 2011-10-28  1:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Eric Dumazet, linux-kernel, netdev, Andrew Morton
In-Reply-To: <20111028012521.GF25795@one.firstfloor.org>

On Thu, Oct 27, 2011 at 6:25 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Seems reasonable too. In fact we usually should have memory barriers
> for this anyways which obsolete the volatile.

No we shouldn't. Memory barriers are insanely expensive, and pointless
for atomics - that aren't ordered anyway.

You may mean compiler barriers.

That said, removing the volatile entirely might be a good idea, and
never mind any barriers at all. The ordering for atomics really isn't
well enough specified that we should care. So I wouldn't object to a
patch that just removes the volatile entirely, but it would have to be
accompanied with quite a bit of testing, in case some odd case ends up
depending on it. But nothing *should* be looping on those things
anyway.

                       Linus

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox