Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 04/12] phylib: add a way to make PHY time stamps possible.
From: Richard Cochran @ 2010-06-16  5:40 UTC (permalink / raw)
  To: Grant Likely
  Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <AANLkTimhAvu0_WuMCwnCrw9Rbcy6dKxvTlzepJm7ZDdR@mail.gmail.com>

On Tue, Jun 15, 2010 at 10:33:51AM -0600, Grant Likely wrote:
> > +config NETWORK_PHY_TIMESTAMPING
> Some overhead?  At a brief glance of the series it looks like it could
> add a lot of overhead, but I'm not fully clear on what the full
> process is.  Can you describe how the hardware timestamping works?  I
> could use an overview of what the kernel has to do.

First of all, I want to emphasize that this network stack option is
purely voluntary. Only those people who know that they have a PTP
capable PHY and really want the timestamps will (or should) enable
this option. When it is not enabled, it has no effect at all.

Hardware timestamping is described in

   Documentation/networking/timestamping.txt
   Documentation/networking/timestamping/timestamping.c

The PTP subsystem is described in

   Documentation/ptp/ptp.txt

There really is more to say about the issue than appears in those
documents, but they are a good starting place for discussion.

BTW I am submitting a conference paper on the design on the PTP
subsystem. If you would like to have it, just ask me off-list.

Richard

^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Paul E. McKenney @ 2010-06-16  5:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, herbert, shemminger, mst, frzhang, netdev, amwang,
	mpm
In-Reply-To: <1276657139.19249.50.camel@edumazet-laptop>

On Wed, Jun 16, 2010 at 04:58:59AM +0200, Eric Dumazet wrote:
> Le mardi 15 juin 2010 à 11:39 -0700, David Miller a écrit :
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Fri, 11 Jun 2010 12:11:42 +1000
> > 
> > > On Fri, Jun 11, 2010 at 08:48:39AM +1000, Herbert Xu wrote:
> > >> On Thu, Jun 10, 2010 at 02:59:15PM -0700, Stephen Hemminger wrote:
> > >> >
> > >> > Okay, then add a comment where in_irq is used?
> > >> 
> > >> Actually let me put it into a wrapper.  I'll respin the patches.
> > > 
> > > OK here is a repost.  And this time it really is 8 patches :)
> > > I've tested it lightly.
> > 
> > All applied to net-next-2.6, thanks Herbert.
> 
> Well...
> 
> [   52.914014] ===================================================
> [   52.914018] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   52.914020] ---------------------------------------------------
> [   52.914024] include/linux/netpoll.h:67 invoked rcu_dereference_check() without protection!
> [   52.914027] 
> [   52.914027] other info that might help us debug this:
> [   52.914029] 
> [   52.914031] 
> [   52.914032] rcu_scheduler_active = 1, debug_locks = 1
> [   52.914035] 4 locks held by swapper/0:
> [   52.914037]  #0:  (&n->timer){+.-...}, at: [<c103fd95>] run_timer_softirq+0x1b8/0x419
> [   52.914052]  #1:  (slock-AF_INET){+.....}, at: [<c12f2b3d>] icmp_send+0x149/0x58b
> [   52.914063]  #2:  (rcu_read_lock_bh){.+....}, at: [<c129978d>] dev_queue_xmit+0xf7/0x5df
> [   52.914073]  #3:  (rcu_read_lock_bh){.+....}, at: [<c12977ae>] netif_rx+0x0/0x195
> [   52.914081] 
> [   52.914081] stack backtrace:
> [   52.914086] Pid: 0, comm: swapper Not tainted 2.6.35-rc1-00508-gdbe3a24-dirty #78
> [   52.914089] Call Trace:
> [   52.914095]  [<c132cf0c>] ? printk+0xf/0x13
> [   52.914103]  [<c1059ac6>] lockdep_rcu_dereference+0x74/0x7d
> [   52.914107]  [<c1297819>] netif_rx+0x6b/0x195
> [   52.914111]  [<c129978d>] ? dev_queue_xmit+0xf7/0x5df
> [   52.914117]  [<c1240775>] loopback_xmit+0x4a/0x70
> [   52.914122]  [<c12995cf>] dev_hard_start_xmit+0x25b/0x322
> [   52.914126]  [<c1299b5b>] dev_queue_xmit+0x4c5/0x5df
> [   52.914131]  [<c105ccf7>] ? trace_hardirqs_on+0xb/0xd
> [   52.914135]  [<c129f611>] neigh_resolve_output+0x2e8/0x33f
> [   52.914142]  [<c12a8b2a>] ? eth_header+0x0/0x8e
> [   52.914147]  [<c12d3dbb>] ip_finish_output+0x323/0x3b1
> [   52.914152]  [<c103955f>] ? local_bh_enable_ip+0x97/0xad
> [   52.914156]  [<c12d485d>] ip_output+0xe2/0xfe
> [   52.914160]  [<c12d3ff5>] ip_local_out+0x41/0x55
> [   52.914164]  [<c12d5755>] ip_push_pending_frames+0x284/0x2fa
> [   52.914169]  [<c12f218d>] icmp_push_reply+0xe8/0xf3
> [   52.914174]  [<c12f2f36>] icmp_send+0x542/0x58b
> [   52.914181]  [<c102b6af>] ? find_busiest_group+0x1c9/0x631
> [   52.914188]  [<c12cb280>] ipv4_link_failure+0x17/0x7b
> [   52.914193]  [<c12f0841>] arp_error_report+0x46/0x61
> [   52.914197]  [<c129f8e0>] neigh_invalidate+0x68/0x80
> [   52.914201]  [<c12a0bef>] neigh_timer_handler+0x124/0x1d2
> [   52.914206]  [<c103fe7b>] run_timer_softirq+0x29e/0x419
> [   52.914210]  [<c12a0acb>] ? neigh_timer_handler+0x0/0x1d2
> [   52.914215]  [<c1039a21>] __do_softirq+0x126/0x277
> [   52.914219]  [<c10398fb>] ? __do_softirq+0x0/0x277
> [   52.914222]  <IRQ>  [<c1039c0d>] ? irq_exit+0x38/0x74
> [   52.914230]  [<c1003d1f>] ? do_IRQ+0x87/0x9b
> [   52.914235]  [<c1002d2e>] ? common_interrupt+0x2e/0x34
> [   52.914241]  [<c105007b>] ? sched_clock_local+0x3f/0x11f
> [   52.914249]  [<c11ba45b>] ? acpi_idle_enter_bm+0x271/0x2a0
> [   52.914256]  [<c12797bd>] ? cpuidle_idle_call+0x76/0x151
> [   52.914261]  [<c1001565>] ? cpu_idle+0x49/0x76
> [   52.914266]  [<c1319ece>] ? rest_init+0xd6/0xdb
> [   52.914274]  [<c156579f>] ? start_kernel+0x31b/0x320
> [   52.914278]  [<c15650c9>] ? i386_start_kernel+0xc9/0xd0
> 
> 
> Paul, could you please explain if current lockdep rules are correct, or could be relaxed ?
> 
> I thought :
> 
> rcu_read_lock_bh();
> 
> was a shorthand to
> 
> local_disable_bh();
> rcu_read_lock();

In CONFIG_TREE_RCU and CONFIG_TINY_RCU, rcu_read_lock_bh() is actually
shorthand for only local_disable_bh().  Therefore, rcu_dereference()
will scream if only rcu_read_lock_bh() is held.

However, in CONFIG_PREEMPT_TREE_RCU, rcu_read_lock_bh() is its own
mechanism that does local_disable_bh() but has its own set of grace
periods, independent of those of rcu_read_lock().

> Why lockdep is not able to make a correct diagnostic ?

Here is the situation I am concerned about:

o	Task 0 does rcu_read_lock(), then p=rcu_dereference_bh().
	If we make the change you are asking for, rcu_dereference_bh()
	is OK with this.

o	Task 0 now is preempted before finishing its RCU read-side
	critical section.

o	Task 1 removes the data element referenced by pointer p,
	then invokes synchronize_rcu_bh().

o	Task 0 does not block synchronize_rcu_bh(), so the grace
	period completes.

o	Task 1 frees up the data element referenced by pointer p,
	which might be reallocated as some other type, unmapped,
	or whatever else.

o	Task 0 resumes, and is sadly disappointed when the data
	element referenced by pointer p has been swept out from
	under it.

Or am I missing something here?

							Thanx, Paul

> Thanks
> 
> [PATCH net-next-2.6] netpoll: Fix one rcu_dereference() lockdep splat
> 
> lockdep doesnt allow yet following  construct :
> 
> rcu_read_lock_bh();
> npinfo = rcu_dereference(skb->dev->npinfo);
> 
> Fix lockdep splat using rcu_dereference_bh()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/linux/netpoll.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
> index 4c77fe7..472365e 100644
> --- a/include/linux/netpoll.h
> +++ b/include/linux/netpoll.h
> @@ -64,7 +64,7 @@ static inline bool netpoll_rx(struct sk_buff *skb)
>  	bool ret = false;
> 
>  	rcu_read_lock_bh();
> -	npinfo = rcu_dereference(skb->dev->npinfo);
> +	npinfo = rcu_dereference_bh(skb->dev->npinfo);
> 
>  	if (!npinfo || (list_empty(&npinfo->rx_np) && !npinfo->rx_flags))
>  		goto out;
> 
> 

^ permalink raw reply

* Re: linux-next: build warnings after merge of the net tree
From: David Miller @ 2010-06-16  4:52 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, bhutchings
In-Reply-To: <20100616133854.1bbbbb37.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 16 Jun 2010 13:38:54 +1000

> After merging the net tree, today's linux-next build (x86_64 allmodconfig)
> produced these warnings:
> 
> In file included from drivers/usb/gadget/ether.c:123:
> drivers/usb/gadget/rndis.c: In function 'gen_ndis_query_resp':
> drivers/usb/gadget/rndis.c:197: warning: assignment from incompatible pointer type
> In file included from drivers/usb/gadget/multi.c:67:
> drivers/usb/gadget/rndis.c: In function 'gen_ndis_query_resp':
> drivers/usb/gadget/rndis.c:197: warning: assignment from incompatible pointer type
> In file included from drivers/usb/gadget/g_ffs.c:30:
> drivers/usb/gadget/rndis.c: In function 'gen_ndis_query_resp':
> drivers/usb/gadget/rndis.c:197: warning: assignment from incompatible pointer type
> 
> Introduced by commit be1f3c2c027cc5ad735df6a45a542ed1db7ec48b ("net:
> Enable 64-bit net device statistics on 32-bit architectures").  This is a
> call to dev_get_stats() and the return value is being assigned to a
> "struct net_device_stats *".

I've commited the patch below to deal with this, thanks for the report.

There's some pre-existing warnings someone will need to deal with at
some point:

drivers/usb/gadget/rndis.c: whole file: warning: coding style is bolixed

:-)

>From fdb93f8ac39aa5902f3d264edd50dffcabfdd13b Mon Sep 17 00:00:00 2001
From: David S. Miller <davem@davemloft.net>
Date: Tue, 15 Jun 2010 21:50:14 -0700
Subject: [PATCH] gadget/rndis: dev_get_stats() now returns rtnl_link_stats64.

Based upon a report by Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/usb/gadget/rndis.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/gadget/rndis.c b/drivers/usb/gadget/rndis.c
index 5c0d06c..fb69b01 100644
--- a/drivers/usb/gadget/rndis.c
+++ b/drivers/usb/gadget/rndis.c
@@ -171,7 +171,7 @@ gen_ndis_query_resp (int configNr, u32 OID, u8 *buf, unsigned buf_len,
 	int			i, count;
 	rndis_query_cmplt_type	*resp;
 	struct net_device	*net;
-	const struct net_device_stats	*stats;
+	const struct rtnl_link_stats64 *stats;
 
 	if (!r) return -ENOMEM;
 	resp = (rndis_query_cmplt_type *) r->buf;
-- 
1.7.0.4

^ permalink raw reply related

* Re: [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
From: David Miller @ 2010-06-16  4:47 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, paulmck
In-Reply-To: <1276656324.19249.39.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 16 Jun 2010 04:45:24 +0200

> [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
> 
> Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)
> 
> Unused inet_peer entries have a null refcnt.
> 
> Using atomic_inc_not_zero() in rcu lookups is not going to work for
> them, and slow path is taken.
> 
> Fix this using -1 marker instead of 0 for deleted entries.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH 6/8] scm: Capture the full credentials of the scm sender.
From: Serge E. Hallyn @ 2010-06-16  4:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, Linux Containers, Serge Hallyn, Pavel Emelyanov,
	netdev
In-Reply-To: <m1sk4oueym.fsf@fess.ebiederm.org>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> > I think this hunk needs to be documented.  I.e. given that scm_send()
> > will call scm_set_cred() before calling __scm_send, I don't see how
> > these conditions could happen?  If the condition can legitimately
> > happen, then given all of the pid_t vs struct pid and 'cred' vs. 'creds'
> > in these two hunks, I think a comment over each would be nice.
> 
> I think if you have the full context of __scm_send it becomes pretty obvious.
> 
> 		case SCM_CREDENTIALS:
> 			if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct ucred)))
> 				goto error;
> 			memcpy(&p->creds, CMSG_DATA(cmsg), sizeof(struct ucred));
> 			err = scm_check_creds(&p->creds);
> 			if (err)
> 				goto error;
> 
> At this point we have just copied ucred from userspace.  We have done
> scm_check_creds to ensure we allow the user to send the pid, uid, and
> gid they have passed in.
> 
> These tests catch the case where the user is legitimately sending
> something other than their own credentials.

Of course.  Sorry.  And I even had the context in the window next to the
email...  So finally,

Acked-by: Serge E. Hallyn <serge@hallyn.com>

to the set, and I'm looking forward to this being in.  And it should solve
the nuisance of containers without private netns rebooting their hosts
when both use upstart.

thanks,
-serge

^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: David Miller @ 2010-06-16  4:47 UTC (permalink / raw)
  To: herbert
  Cc: eric.dumazet, paulmck, shemminger, mst, frzhang, netdev, amwang,
	mpm
In-Reply-To: <20100616033336.GA17440@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 16 Jun 2010 13:33:36 +1000

> On Wed, Jun 16, 2010 at 05:03:20AM +0200, Eric Dumazet wrote:
>>
>> I wonder how these patches were tested, Herbert ?
> 
> You know, not everyone enables RCU debugging...

Even though I'm as guilty as you, I have to agree with Eric that
especially us core folks should be running with the various lock
debugging options on all the time.

Maybe someone should add the RCU debugging config option to
Documentation/SubmitChecklist :-)

> Anyway, this patch should fix the problems you've spotted.

> netpoll: Use correct primitives for RCU dereferencing
> 
> Now that RCU debugging checks for matching rcu_dereference calls
> and rcu_read_lock, we need to use the correct primitives or face
> nasty warnings.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks guys!


^ permalink raw reply

* Re: linux-next: build warning after merge of the final tree (net tree related)
From: David Miller @ 2010-06-16  4:44 UTC (permalink / raw)
  To: herbert; +Cc: sfr, netdev, linux-next, linux-kernel
In-Reply-To: <20100616035630.GB17440@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 16 Jun 2010 13:56:30 +1000

> bridge: Add const to dummy br_netpoll_send_skb
> 
> The version of br_netpoll_send_skb used when netpoll is off is
> missing a const thus causing a warning.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied.

^ permalink raw reply

* Re: bridge: Fix OOM crash in deliver_clone
From: David Miller @ 2010-06-16  4:43 UTC (permalink / raw)
  To: herbert; +Cc: netdev, mwagner
In-Reply-To: <20100616040219.GA18000@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 16 Jun 2010 14:02:19 +1000

> Hi Dave:
> 
> bridge: Fix OOM crash in deliver_clone
> 
> The bridge multicast patches introduced an OOM crash in the forward
> path, when deliver_clone fails to clone the skb.
> 
> This patch fixes it.
> 
> Reported-by: Mark Wagner <mwagner@redhat.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied, thanks.

^ permalink raw reply

* bridge: Fix OOM crash in deliver_clone
From: Herbert Xu @ 2010-06-16  4:02 UTC (permalink / raw)
  To: David S. Miller, netdev; +Cc: Mark Wagner

Hi Dave:

bridge: Fix OOM crash in deliver_clone

The bridge multicast patches introduced an OOM crash in the forward
path, when deliver_clone fails to clone the skb.

This patch fixes it.

Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 6e97711..cbfe87f 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -130,10 +130,10 @@ static int deliver_clone(const struct net_bridge_port *prev,
 			 void (*__packet_hook)(const struct net_bridge_port *p,
 					       struct sk_buff *skb))
 {
+	struct net_device *dev = BR_INPUT_SKB_CB(skb)->brdev;
+
 	skb = skb_clone(skb, GFP_ATOMIC);
 	if (!skb) {
-		struct net_device *dev = BR_INPUT_SKB_CB(skb)->brdev;
-
 		dev->stats.tx_dropped++;
 		return -ENOMEM;
 	}

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: linux-next: build warning after merge of the final tree (net tree related)
From: Herbert Xu @ 2010-06-16  3:56 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: David Miller, netdev, linux-next, linux-kernel
In-Reply-To: <20100616134848.3edd4a25.sfr@canb.auug.org.au>

On Wed, Jun 16, 2010 at 01:48:48PM +1000, Stephen Rothwell wrote:
> Hi Dave,
> 
> After merging the final tree, today's linux-next build (powerpc
> ppc44x_defconfig) produced this warning:
> 
> net/bridge/br_forward.c: In function '__br_deliver':
> net/bridge/br_forward.c:76: warning: passing argument 1 of 'br_netpoll_send_skb' discards qualifiers from pointer target type
> net/bridge/br_private.h:308: note: expected 'struct net_bridge_port *' but argument is of type 'const struct net_bridge_port *'

Indeed, this patch should fix the warning.

bridge: Add const to dummy br_netpoll_send_skb

The version of br_netpoll_send_skb used when netpoll is off is
missing a const thus causing a warning.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 29f6d66..4ea7986 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -300,7 +300,7 @@ static inline struct netpoll_info *br_netpoll_info(struct net_bridge *br)
 	return NULL;
 }
 
-static inline void br_netpoll_send_skb(struct net_bridge_port *p,
+static inline void br_netpoll_send_skb(const struct net_bridge_port *p,
 				       struct sk_buff *skb)
 {
 }

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* linux-next: build warning after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2010-06-16  3:48 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Herbert Xu

[-- Attachment #1: Type: text/plain, Size: 749 bytes --]

Hi Dave,

After merging the final tree, today's linux-next build (powerpc
ppc44x_defconfig) produced this warning:

net/bridge/br_forward.c: In function '__br_deliver':
net/bridge/br_forward.c:76: warning: passing argument 1 of 'br_netpoll_send_skb' discards qualifiers from pointer target type
net/bridge/br_private.h:308: note: expected 'struct net_bridge_port *' but argument is of type 'const struct net_bridge_port *'

Introduced by commit 91d2c34a4eed32876ca333b0ca44f3bc56645805 ("bridge:
Fix netpoll support").

By the way, that function (br_netpoll_send_skb) is only used in that one
file, so could be static in there.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* linux-next: build warnings after merge of the net tree
From: Stephen Rothwell @ 2010-06-16  3:38 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Ben Hutchings

[-- Attachment #1: Type: text/plain, Size: 1078 bytes --]

Hi Dave,

After merging the net tree, today's linux-next build (x86_64 allmodconfig)
produced these warnings:

In file included from drivers/usb/gadget/ether.c:123:
drivers/usb/gadget/rndis.c: In function 'gen_ndis_query_resp':
drivers/usb/gadget/rndis.c:197: warning: assignment from incompatible pointer type
In file included from drivers/usb/gadget/multi.c:67:
drivers/usb/gadget/rndis.c: In function 'gen_ndis_query_resp':
drivers/usb/gadget/rndis.c:197: warning: assignment from incompatible pointer type
In file included from drivers/usb/gadget/g_ffs.c:30:
drivers/usb/gadget/rndis.c: In function 'gen_ndis_query_resp':
drivers/usb/gadget/rndis.c:197: warning: assignment from incompatible pointer type

Introduced by commit be1f3c2c027cc5ad735df6a45a542ed1db7ec48b ("net:
Enable 64-bit net device statistics on 32-bit architectures").  This is a
call to dev_get_stats() and the return value is being assigned to a
"struct net_device_stats *".

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Herbert Xu @ 2010-06-16  3:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Paul E. McKenney, shemminger, mst, frzhang, netdev,
	amwang, mpm
In-Reply-To: <1276657400.19249.53.camel@edumazet-laptop>

On Wed, Jun 16, 2010 at 05:03:20AM +0200, Eric Dumazet wrote:
>
> I wonder how these patches were tested, Herbert ?

You know, not everyone enables RCU debugging...

Anyway, this patch should fix the problems you've spotted.

netpoll: Use correct primitives for RCU dereferencing

Now that RCU debugging checks for matching rcu_dereference calls
and rcu_read_lock, we need to use the correct primitives or face
nasty warnings.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index 4c77fe7..413742c 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -64,7 +64,7 @@ static inline bool netpoll_rx(struct sk_buff *skb)
 	bool ret = false;
 
 	rcu_read_lock_bh();
-	npinfo = rcu_dereference(skb->dev->npinfo);
+	npinfo = rcu_dereference_bh(skb->dev->npinfo);
 
 	if (!npinfo || (list_empty(&npinfo->rx_np) && !npinfo->rx_flags))
 		goto out;
@@ -82,7 +82,7 @@ out:
 
 static inline int netpoll_rx_on(struct sk_buff *skb)
 {
-	struct netpoll_info *npinfo = rcu_dereference(skb->dev->npinfo);
+	struct netpoll_info *npinfo = rcu_dereference_bh(skb->dev->npinfo);
 
 	return npinfo && (!list_empty(&npinfo->rx_np) || npinfo->rx_flags);
 }

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: Proposed linux kernel changes : scaling  tcp/ip stack : 2nd part
From: Mitchell Erblich @ 2010-06-16  3:30 UTC (permalink / raw)
  To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
In-Reply-To: <97746864-ED54-4A12-AFE7-752AA6E41CDD@earthlink.net>


On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:

> 
> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> 
>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
>>> To whom it may concern,
>>> 
>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>> developers to see if there is any consensus that the below is a logical 
>>> approach. Please also pass this email if there is a "owner(s)" of this stack
>>> to identify if a case exists for the below possible changes.
>>> 
>>> I am not currently on the linux kernel mail group.
>>> 			
>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>> merged the changes into the company's local tree and left the possible 
>>> global integration to others.
>>> 
>>> I have been approached by a number of companies about scaling the
>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>> time on my hands and am considering looking into this area on my own.
>>> 
>>> The first assumption is that if extra cores are available, that a single
>>> received homogeneous flow of a large number of packets/segments per
>>> second (pps) can be split into non-equal flows. This split can in effect
>>> allow a larger recv'd pps rate at the same core load while splitting off
>>> other workloads, such as xmit'ing pure ACKs.
>>> 
>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>>> between cores), and creating logical separations where in a many core 
>>> system, different cores could have new kernel threads  that operate in 
>>> parallel within the tcp/ip stack. The initial separation points would be at 
>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some 
>>> form of output.
>>> 
>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
>>> with some form of queuing & scheduling, would be needed. In addition,
>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
>>> to separate the I/O.
>>> 
>>> A possible validation test is to identify the max recv'd pps rate within the
>>> tcp/ip modules within normal flow TCP established state with normal order 
>>> of say 64byte non fragmented segments, before and after each 
>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>> 
>>> I am willing to have a private git Linux.org tree that concentrates proposed
>>> changes into this tree and if there is willingness, a seen want/need then identify
>>> how to implement the merge.
>> 
>> Hi Mitchell
>> 
>> We work everyday to improve network stack, and standard linux tree is
>> pretty scalable, you dont need to setup a separate git tree for that.
>> 
>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>> net-next-2.6 where we put all our changes.
>> 
>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>> 
>> I suggest you read the last patches (say .. about 10.000 of them), to
>> have an idea of things we did during last years.
>> 
>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>> placement...
>> 
>> Its nice to see another man joining the team !
>> 
>> Thanks
>> 
> 
> 
> Lets start with a two part Linux kernel change and a tcp input/output change:
> 
> 2 Parts: 2nd part TBD
> 
> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> in the generic kernel. TBD.
> 
> This change should have no effect with normal available kernel mem allocs.
> 
> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> our last pages for input skbufs and not for xmit allocs.
> 
> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
> decrease the number of in-flight ACKs and the peer should do burst avoidance
> if our later ack increases the window in a larger chunk..
> 
> And use the last pages to decrease the chance of dropping a input pkt or
> running out of recv descriptors, because of mem back pressure.
> 
> The change could check for some form of mem pressure before the alloc,
> but the alloc in itself should suffice. We could also do a ECN type check before
> the alloc.
> 
> Now the kicker.  I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
> and change just the 1 arg.
> 
> code : tcp_output.c : tcp_send_ack()
>   line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */
> 
> Suggestions, feedback??
> 
> Mitchell Erblich
> 
> 
> 
> 

Sorry :),

		2nd part:

		use GFP_NOWAIT as 2nd arg to alloc_skb()

Mitchell Erblich
> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply

* Re: Proposed linux kernel changes : scaling  tcp/ip stack
From: Mitchell Erblich @ 2010-06-16  3:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1275556440.2456.19.camel@edumazet-laptop>


On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:

> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
>> To whom it may concern,
>> 
>> First, my assumption is to keep this discussion local to just a few tcp/ip
>> developers to see if there is any consensus that the below is a logical 
>> approach. Please also pass this email if there is a "owner(s)" of this stack
>> to identify if a case exists for the below possible changes.
>> 
>> I am not currently on the linux kernel mail group.
>> 			
>> I have experience with modifications of the Linux tcp/ip stack, and have
>> merged the changes into the company's local tree and left the possible 
>> global integration to others.
>> 
>> I have been approached by a number of companies about scaling the
>> stack with the assumption of a number of cpu cores. At present, I find extra
>> time on my hands and am considering looking into this area on my own.
>> 
>> The first assumption is that if extra cores are available, that a single
>> received homogeneous flow of a large number of packets/segments per
>> second (pps) can be split into non-equal flows. This split can in effect
>> allow a larger recv'd pps rate at the same core load while splitting off
>> other workloads, such as xmit'ing pure ACKs.
>> 
>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>> between cores), and creating logical separations where in a many core 
>> system, different cores could have new kernel threads  that operate in 
>> parallel within the tcp/ip stack. The initial separation points would be at 
>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some 
>> form of output.
>> 
>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
>> with some form of queuing & scheduling, would be needed. In addition,
>> the queuing/schedullng of other kernel threads would occur within ip & tcp
>> to separate the I/O.
>> 
>> A possible validation test is to identify the max recv'd pps rate within the
>> tcp/ip modules within normal flow TCP established state with normal order 
>> of say 64byte non fragmented segments, before and after each 
>> incremental change. Or the same rate with fewer core/cpu cycles.
>> 
>> I am willing to have a private git Linux.org tree that concentrates proposed
>> changes into this tree and if there is willingness, a seen want/need then identify
>> how to implement the merge.
> 
> Hi Mitchell
> 
> We work everyday to improve network stack, and standard linux tree is
> pretty scalable, you dont need to setup a separate git tree for that.
> 
> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
> net-next-2.6 where we put all our changes.
> 
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
> 
> I suggest you read the last patches (say .. about 10.000 of them), to
> have an idea of things we did during last years.
> 
> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
> placement...
> 
> Its nice to see another man joining the team !
> 
> Thanks
> 


Lets start with a two part Linux kernel change and a tcp input/output change:

2 Parts: 2nd part TBD

Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
in the generic kernel. TBD.

This change should have no effect with normal available kernel mem allocs.

Assuming memory pressure ( WAITING for clean memory) we should be allocating
our last pages for input skbufs and not for xmit allocs.

By delaying skbuf allocations when we have low kmem, we secondarily slow down the
tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
decrease the number of in-flight ACKs and the peer should do burst avoidance
if our later ack increases the window in a larger chunk..

And use the last pages to decrease the chance of dropping a input pkt or
running out of recv descriptors, because of mem back pressure.

The change could check for some form of mem pressure before the alloc,
but the alloc in itself should suffice. We could also do a ECN type check before
the alloc.

Now the kicker.  I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
and change just the 1 arg.

code : tcp_output.c : tcp_send_ack()
   line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */

Suggestions, feedback??

Mitchell Erblich





> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Eric Dumazet @ 2010-06-16  3:03 UTC (permalink / raw)
  To: David Miller
  Cc: Paul E. McKenney, herbert, shemminger, mst, frzhang, netdev,
	amwang, mpm
In-Reply-To: <1276657139.19249.50.camel@edumazet-laptop>

Le mercredi 16 juin 2010 à 04:59 +0200, Eric Dumazet a écrit :
> Le mardi 15 juin 2010 à 11:39 -0700, David Miller a écrit :
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Fri, 11 Jun 2010 12:11:42 +1000
> > 
> > > On Fri, Jun 11, 2010 at 08:48:39AM +1000, Herbert Xu wrote:
> > >> On Thu, Jun 10, 2010 at 02:59:15PM -0700, Stephen Hemminger wrote:
> > >> >
> > >> > Okay, then add a comment where in_irq is used?
> > >> 
> > >> Actually let me put it into a wrapper.  I'll respin the patches.
> > > 
> > > OK here is a repost.  And this time it really is 8 patches :)
> > > I've tested it lightly.
> > 
> > All applied to net-next-2.6, thanks Herbert.
> 

For this second splat, I dont know yet how to fix it, its 5 in the
morning here, I need a sleep ;)

At this point, no rcu_lock is held.

I wonder how these patches were tested, Herbert ?

[   74.431712] 
[   74.431713] ===================================================
[   74.431717] [ INFO: suspicious rcu_dereference_check() usage. ]
[   74.431719] ---------------------------------------------------
[   74.431722] include/linux/netpoll.h:85 invoked rcu_dereference_check() without protection!
[   74.431725] 
[   74.431726] other info that might help us debug this:
[   74.431727] 
[   74.431730] 
[   74.431730] rcu_scheduler_active = 1, debug_locks = 1
[   74.431733] no locks held by swapper/0.
[   74.431735] 
[   74.431736] stack backtrace:
[   74.431739] Pid: 0, comm: swapper Not tainted 2.6.35-rc1-00508-gdbe3a24-dirty #78
[   74.431742] Call Trace:
[   74.431748]  [<c132cf0c>] ? printk+0xf/0x13
[   74.431754]  [<c1059ac6>] lockdep_rcu_dereference+0x74/0x7d
[   74.431759]  [<c1297628>] __napi_gro_receive+0x4d/0xf6
[   74.431764]  [<c12977a3>] napi_gro_receive+0x19/0x24
[   74.431775]  [<f805d83f>] bnx2x_rx_int+0x101b/0x124e [bnx2x]
[   74.431781]  [<c1050ffc>] ? async_thread+0x198/0x1de
[   74.431787]  [<c129580f>] ? net_tx_action+0x9a/0x12a
[   74.431797]  [<f805f267>] bnx2x_poll+0x5d/0x18b [bnx2x]
[   74.431801]  [<c1297360>] ? net_rx_action+0x1e4/0x21a
[   74.431805]  [<c105ccb2>] ? trace_hardirqs_on_caller+0xe2/0x11c
[   74.431810]  [<c1297218>] net_rx_action+0x9c/0x21a
[   74.431814]  [<c1039a21>] __do_softirq+0x126/0x277
[   74.431819]  [<c10398fb>] ? __do_softirq+0x0/0x277
[   74.431821]  <IRQ>  [<c1039c0d>] ? irq_exit+0x38/0x74
[   74.431828]  [<c1003d1f>] ? do_IRQ+0x87/0x9b
[   74.431833]  [<c1002d2e>] ? common_interrupt+0x2e/0x34
[   74.431838]  [<c105007b>] ? sched_clock_local+0x3f/0x11f
[   74.431843]  [<c11ba45b>] ? acpi_idle_enter_bm+0x271/0x2a0
[   74.431848]  [<c12797bd>] ? cpuidle_idle_call+0x76/0x151
[   74.431852]  [<c1001565>] ? cpu_idle+0x49/0x76
[   74.431857]  [<c1319ece>] ? rest_init+0xd6/0xdb
[   74.431861]  [<c156579f>] ? start_kernel+0x31b/0x320
[   74.431865]  [<c15650c9>] ? i386_start_kernel+0xc9/0xd0




^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Eric Dumazet @ 2010-06-16  2:58 UTC (permalink / raw)
  To: David Miller, Paul E. McKenney
  Cc: herbert, shemminger, mst, frzhang, netdev, amwang, mpm
In-Reply-To: <20100615.113940.245399246.davem@davemloft.net>

Le mardi 15 juin 2010 à 11:39 -0700, David Miller a écrit :
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 11 Jun 2010 12:11:42 +1000
> 
> > On Fri, Jun 11, 2010 at 08:48:39AM +1000, Herbert Xu wrote:
> >> On Thu, Jun 10, 2010 at 02:59:15PM -0700, Stephen Hemminger wrote:
> >> >
> >> > Okay, then add a comment where in_irq is used?
> >> 
> >> Actually let me put it into a wrapper.  I'll respin the patches.
> > 
> > OK here is a repost.  And this time it really is 8 patches :)
> > I've tested it lightly.
> 
> All applied to net-next-2.6, thanks Herbert.

Well...

[   52.914014] ===================================================
[   52.914018] [ INFO: suspicious rcu_dereference_check() usage. ]
[   52.914020] ---------------------------------------------------
[   52.914024] include/linux/netpoll.h:67 invoked rcu_dereference_check() without protection!
[   52.914027] 
[   52.914027] other info that might help us debug this:
[   52.914029] 
[   52.914031] 
[   52.914032] rcu_scheduler_active = 1, debug_locks = 1
[   52.914035] 4 locks held by swapper/0:
[   52.914037]  #0:  (&n->timer){+.-...}, at: [<c103fd95>] run_timer_softirq+0x1b8/0x419
[   52.914052]  #1:  (slock-AF_INET){+.....}, at: [<c12f2b3d>] icmp_send+0x149/0x58b
[   52.914063]  #2:  (rcu_read_lock_bh){.+....}, at: [<c129978d>] dev_queue_xmit+0xf7/0x5df
[   52.914073]  #3:  (rcu_read_lock_bh){.+....}, at: [<c12977ae>] netif_rx+0x0/0x195
[   52.914081] 
[   52.914081] stack backtrace:
[   52.914086] Pid: 0, comm: swapper Not tainted 2.6.35-rc1-00508-gdbe3a24-dirty #78
[   52.914089] Call Trace:
[   52.914095]  [<c132cf0c>] ? printk+0xf/0x13
[   52.914103]  [<c1059ac6>] lockdep_rcu_dereference+0x74/0x7d
[   52.914107]  [<c1297819>] netif_rx+0x6b/0x195
[   52.914111]  [<c129978d>] ? dev_queue_xmit+0xf7/0x5df
[   52.914117]  [<c1240775>] loopback_xmit+0x4a/0x70
[   52.914122]  [<c12995cf>] dev_hard_start_xmit+0x25b/0x322
[   52.914126]  [<c1299b5b>] dev_queue_xmit+0x4c5/0x5df
[   52.914131]  [<c105ccf7>] ? trace_hardirqs_on+0xb/0xd
[   52.914135]  [<c129f611>] neigh_resolve_output+0x2e8/0x33f
[   52.914142]  [<c12a8b2a>] ? eth_header+0x0/0x8e
[   52.914147]  [<c12d3dbb>] ip_finish_output+0x323/0x3b1
[   52.914152]  [<c103955f>] ? local_bh_enable_ip+0x97/0xad
[   52.914156]  [<c12d485d>] ip_output+0xe2/0xfe
[   52.914160]  [<c12d3ff5>] ip_local_out+0x41/0x55
[   52.914164]  [<c12d5755>] ip_push_pending_frames+0x284/0x2fa
[   52.914169]  [<c12f218d>] icmp_push_reply+0xe8/0xf3
[   52.914174]  [<c12f2f36>] icmp_send+0x542/0x58b
[   52.914181]  [<c102b6af>] ? find_busiest_group+0x1c9/0x631
[   52.914188]  [<c12cb280>] ipv4_link_failure+0x17/0x7b
[   52.914193]  [<c12f0841>] arp_error_report+0x46/0x61
[   52.914197]  [<c129f8e0>] neigh_invalidate+0x68/0x80
[   52.914201]  [<c12a0bef>] neigh_timer_handler+0x124/0x1d2
[   52.914206]  [<c103fe7b>] run_timer_softirq+0x29e/0x419
[   52.914210]  [<c12a0acb>] ? neigh_timer_handler+0x0/0x1d2
[   52.914215]  [<c1039a21>] __do_softirq+0x126/0x277
[   52.914219]  [<c10398fb>] ? __do_softirq+0x0/0x277
[   52.914222]  <IRQ>  [<c1039c0d>] ? irq_exit+0x38/0x74
[   52.914230]  [<c1003d1f>] ? do_IRQ+0x87/0x9b
[   52.914235]  [<c1002d2e>] ? common_interrupt+0x2e/0x34
[   52.914241]  [<c105007b>] ? sched_clock_local+0x3f/0x11f
[   52.914249]  [<c11ba45b>] ? acpi_idle_enter_bm+0x271/0x2a0
[   52.914256]  [<c12797bd>] ? cpuidle_idle_call+0x76/0x151
[   52.914261]  [<c1001565>] ? cpu_idle+0x49/0x76
[   52.914266]  [<c1319ece>] ? rest_init+0xd6/0xdb
[   52.914274]  [<c156579f>] ? start_kernel+0x31b/0x320
[   52.914278]  [<c15650c9>] ? i386_start_kernel+0xc9/0xd0


Paul, could you please explain if current lockdep rules are correct, or could be relaxed ?

I thought :

rcu_read_lock_bh();

was a shorthand to

local_disable_bh();
rcu_read_lock();

Why lockdep is not able to make a correct diagnostic ?

Thanks

[PATCH net-next-2.6] netpoll: Fix one rcu_dereference() lockdep splat

lockdep doesnt allow yet following  construct :

rcu_read_lock_bh();
npinfo = rcu_dereference(skb->dev->npinfo);

Fix lockdep splat using rcu_dereference_bh()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netpoll.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index 4c77fe7..472365e 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -64,7 +64,7 @@ static inline bool netpoll_rx(struct sk_buff *skb)
 	bool ret = false;
 
 	rcu_read_lock_bh();
-	npinfo = rcu_dereference(skb->dev->npinfo);
+	npinfo = rcu_dereference_bh(skb->dev->npinfo);
 
 	if (!npinfo || (list_empty(&npinfo->rx_np) && !npinfo->rx_flags))
 		goto out;



^ permalink raw reply related

* [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
From: Eric Dumazet @ 2010-06-16  2:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulmck
In-Reply-To: <20100615.142506.02275206.davem@davemloft.net>

Le mardi 15 juin 2010 à 14:25 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 15 Jun 2010 20:23:14 +0200
> 
> > inetpeer currently uses an AVL tree protected by an rwlock.
> > 
> > It's possible to make most lookups use RCU
>  ...
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Applied, nice work Eric.

Thanks David !

Re-reading patch I realize refcnt is expected to be 0 for unused entries
(obviously), so we should use a different marker for 'about to be freed'
ones.

Thanks

[PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries

Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)

Unused inet_peer entries have a null refcnt.

Using atomic_inc_not_zero() in rcu lookups is not going to work for
them, and slow path is taken.

Fix this using -1 marker instead of 0 for deleted entries.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/inetpeer.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 58fbc7e..39a14ba 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -187,7 +187,12 @@ static struct inet_peer *lookup_rcu_bh(__be32 daddr)
 
 	while (u != peer_avl_empty) {
 		if (daddr == u->v4daddr) {
-			if (unlikely(!atomic_inc_not_zero(&u->refcnt)))
+			/* Before taking a reference, check if this entry was
+			 * deleted, unlink_from_pool() sets refcnt=-1 to make
+			 * distinction between an unused entry (refcnt=0) and
+			 * a freed one.
+			 */
+			if (unlikely(!atomic_add_unless(&u->refcnt, 1, -1)))
 				u = NULL;
 			return u;
 		}
@@ -322,8 +327,9 @@ static void unlink_from_pool(struct inet_peer *p)
 	 * in cleanup() function to prevent sudden disappearing.  If we can
 	 * atomically (because of lockless readers) take this last reference,
 	 * it's safe to remove the node and free it later.
+	 * We use refcnt=-1 to alert lockless readers this entry is deleted.
 	 */
-	if (atomic_cmpxchg(&p->refcnt, 1, 0) == 1) {
+	if (atomic_cmpxchg(&p->refcnt, 1, -1) == 1) {
 		struct inet_peer **stack[PEER_MAXDEPTH];
 		struct inet_peer ***stackptr, ***delp;
 		if (lookup(p->v4daddr, stack) != p)



^ permalink raw reply related

* Re: [PATCH] virtio_net: implements ethtool_ops.get_drvinfo
From: Rusty Russell @ 2010-06-16  1:54 UTC (permalink / raw)
  To: Taku Izumi; +Cc: David S. Miller, netdev@vger.kernel.org, Michael S. Tsirkin
In-Reply-To: <4C170D9E.5090407@jp.fujitsu.com>

On Tue, 15 Jun 2010 02:50:30 pm Taku Izumi wrote:
> Hi Rusty,
> 
> (2010/06/15 13:28), Rusty Russell wrote:
> > On Fri, 11 Jun 2010 10:59:02 am Taku Izumi wrote:
> >> This patch implements ethtool_ops.get_drvinfo interface of virtio_net driver.
> >>
> >> Signed-off-by: Taku Izumi<izumi.taku@jp.fujitsu.com>
> > 
> > Hi Taku!
> > 
> >     Does this have any useful effect?
> 
> I often use "ethtool -i" command to check what driver controls the ehternet device.
> But because current virtio_net driver doesn't support "ethtool -i", it becomes the
> following:
> 
> 	# ethtool -i eth3
> 	Cannot get driver information: Operation not supported
> 
> My patch simply adds the "ethtool -i" support. The following is the result when
> using the virtio_net driver with my patch applied to.
> 
> 	# ethtool -i eth3
> 	driver: virtio_net
> 	version: N/A
> 	firmware-version: N/A
> 	bus-info: virtio0
> 
> Personally, "-i" is one of the most frequently-used option, and
> most network drivers support "ethtool -i", so I think virtio_net also should do.

Thanks, Taku.

I put this explanation in the commit message, and changed 32 to ARRAY_SIZE().
It's queued for sending to DaveM for the next merge window.

Result below.

Thanks!
Rusty.

Subject: virtio_net: implements ethtool_ops.get_drvinfo
Date: Fri, 11 Jun 2010 10:29:02 +0900
From: Taku Izumi <izumi.taku@jp.fujitsu.com>

I often use "ethtool -i" command to check what driver controls the
ehternet device.  But because current virtio_net driver doesn't
support "ethtool -i", it becomes the following:

        # ethtool -i eth3
        Cannot get driver information: Operation not supported

This patch simply adds the "ethtool -i" support. The following is the
result when using the virtio_net driver with my patch applied to.

        # ethtool -i eth3
        driver: virtio_net
        version: N/A
        firmware-version: N/A
        bus-info: virtio0

Personally, "-i" is one of the most frequently-used option, and most
network drivers support "ethtool -i", so I think virtio_net also
should do.

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use ARRAY_SIZE)
---
 0 files changed

Index: net-next.35/drivers/net/virtio_net.c
===================================================================
--- net-next.35.orig/drivers/net/virtio_net.c
+++ net-next.35/drivers/net/virtio_net.c
@@ -701,6 +701,19 @@ static int virtnet_close(struct net_devi
 	return 0;
 }
 
+static void virtnet_get_drvinfo(struct net_device *dev,
+				struct ethtool_drvinfo *drvinfo)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct virtio_device *vdev = vi->vdev;
+
+	strncpy(drvinfo->driver, KBUILD_MODNAME, ARRAY_SIZE(drvinfo->driver));
+	strncpy(drvinfo->version, "N/A", ARRAY_SIZE(drvinfo->version));
+	strncpy(drvinfo->fw_version, "N/A", ARRAY_SIZE(drvinfo->fw_version));
+	strncpy(drvinfo->bus_info, dev_name(&vdev->dev),
+		ARRAY_SIZE(drvinfo->bus_info));
+}
+
 static int virtnet_set_tx_csum(struct net_device *dev, u32 data)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
@@ -813,6 +825,7 @@ static void virtnet_vlan_rx_kill_vid(str
 }
 
 static const struct ethtool_ops virtnet_ethtool_ops = {
+	.get_drvinfo = virtnet_get_drvinfo,
 	.set_tx_csum = virtnet_set_tx_csum,
 	.set_sg = ethtool_op_set_sg,
 	.set_tso = ethtool_op_set_tso,

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: NET_SKB_PAD should depend on L1_CACHE_BYTES
From: David Miller @ 2010-06-16  1:16 UTC (permalink / raw)
  To: eric.dumazet
  Cc: alexander.h.duyck, jeffrey.t.kirsher, mingo, tglx, hpa, x86,
	linux-kernel, netdev, gospo
In-Reply-To: <1276520234.2478.82.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Jun 2010 14:57:14 +0200

> [PATCH net-next-2.6] net: NET_SKB_PAD should depend on L1_CACHE_BYTES
> 
> In old kernels, NET_SKB_PAD was defined to 16.
> 
> Then commit d6301d3dd1c2 (net: Increase default NET_SKB_PAD to 32), and
> commit 18e8c134f4e9 (net: Increase NET_SKB_PAD to 64 bytes) increased it
> to 64.
> 
> While first patch was governed by network stack needs, second was more
> driven by performance issues on current hardware. Real intent was to
> align data on a cache line boundary.
> 
> So use max(32, L1_CACHE_BYTES) instead of 64, to be more generic.
> 
> Remove microblaze and powerpc own NET_SKB_PAD definitions.
> 
> Thanks to Alexander Duyck and David Miller for their comments.
> 
> Suggested-by: David Miller <davem@davemloft.net>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH 0/3]netxen: bug fixes
From: David Miller @ 2010-06-16  1:15 UTC (permalink / raw)
  To: amit.salecha; +Cc: netdev, ameen.rahman
In-Reply-To: <1276508345-17070-1-git-send-email-amit.salecha@qlogic.com>

From: Amit Kumar Salecha <amit.salecha@qlogic.com>
Date: Mon, 14 Jun 2010 02:39:02 -0700

>    Sending series of 3 bug fixes. Please apply them on net-2.6.

All applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6] ipfrag : frag_kfree_skb() cleanup
From: David Miller @ 2010-06-16  1:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1276507363.2478.43.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Jun 2010 11:22:43 +0200

> Third param (work) is unused, remove it.
> 
> Remove __inline__ and inline qualifiers.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] ip_frag: Remove some atomic ops
From: David Miller @ 2010-06-16  1:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1276506144.2478.40.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Jun 2010 11:02:24 +0200

> Instead of doing one atomic operation per frag, we can factorize them.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] ipv6: syncookies: do not skip ->iif initialization
From: David Miller @ 2010-06-16  1:10 UTC (permalink / raw)
  To: fw; +Cc: netdev, ggriffin.kernel
In-Reply-To: <1276464579-4399-1-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Sun, 13 Jun 2010 23:29:39 +0200

> When syncookies are in effect, req->iif is left uninitialized.
> In case of e.g. link-local addresses the route lookup then fails
> and no syn-ack is sent.
> 
> Rearrange things so ->iif is also initialized in the syncookie case.
> 
> want_cookie can only be true when the isn was zero, thus move the want_cookie
> check into the "!isn" branch.
> 
> Cc: Glenn Griffin <ggriffin.kernel@gmail.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next-2.6] syncookies: check decoded options against sysctl settings
From: David Miller @ 2010-06-16  1:09 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <1276464875-4460-1-git-send-email-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Sun, 13 Jun 2010 23:34:35 +0200

> -	if (tcp_opt->sack_ok)
> -		tcp_sack_reset(tcp_opt);
> +	if (tcp_opt->sack_ok && !sysctl_tcp_sack)
> +		return false;
>  

If you remove the tcp_sack_reset() call here, who is going to
do it?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox