Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-2.6] netns: assign NULL after pernet memory is freed
From: Eric W. Biederman @ 2010-04-26 18:05 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem
In-Reply-To: <20100426133020.GE2941@psychotron.lab.eng.brq.redhat.com>

Jiri Pirko <jpirko@redhat.com> writes:

> Mon, Apr 26, 2010 at 02:07:17PM CEST, jpirko@redhat.com wrote:
>>Mon, Apr 26, 2010 at 01:18:07PM CEST, jpirko@redhat.com wrote:
>>>This is needed to let know appropriate code (driver) know that pernet memory
>>>chunk  was freed already. For example when driver has also registered netdev
>>>notifier, it can be called after memory is freed which could eventually lead
>>>to panic. Also a null check should be present in these notifiers.
>>>
>>>For example, previously, when drivers were responsible for pernet memory
>>>allocation/freeing, in pppoe this assign was done in pppoe_exit_net() right
>>>after pernet memory was freed. The check in notifier stayed (in pppoe_flush_dev
>>>called from pppoe_device_event) but makes no sense now without this patch.
>>
>>Hmm, thinking about this, since the life of pernet memories is directly connected
>>with the life of net, as described by Eric, this should not be needed. So please
>>scrach this one too. Sorry.
>>
>>Anyway, I'm still wondering how to prevent netdev_notifiers from using this when
>>net is gone. Going to do more research, I'm probably missing something.
>
> Right, it becomes part of init_ns then. So this looks good then. One thing I can
> still imagine what may cause problem, since netdev_notifier is registered before
> register_pernet_device, the notifier can be called in between. So I think this
> should be reordered, am I right?

No.  We guarantee that all network devices are either destroyed or moved
to init_net with the appropriate notifications sent.  Before we tear down
the network namespace.  This guarantees we won't have packets in flight when
destroying the network subsystems.

The moving and destroying of network devices should simply look like normal
events, that don't need special handling.

Eric

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-26 18:04 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <q2o65634d661004261038t8f0c32f5ye8169ef0c38172b7@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Mon, 26 Apr 2010 10:38:27 -0700

> It would appear that way :-(.  I was going to ping Broadcom folks to
> see if there's support for UDP.

I'm pretty sure there isn't at this point.

We'll need to elide setting ->rxhash for non-TCP packets.  I bet that
the ETH_FAST_PATH_RX_CQE_RSS_HASH_TYPE field might be usable to making
this decision, but if not in the worst case we'll need to parse the
VLAN/ETH and IP4/IP6 headers to figure out the protocol.

Damn, I'm so pissed off about this.  This ruins everything!

How damn hard is it to add two 16-bit ports to the hash regardless of
protocol?

^ permalink raw reply

* Re: [PATCH net-2.6] netns: assign NULL after pernet memory is freed
From: Eric W. Biederman @ 2010-04-26 18:01 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem
In-Reply-To: <20100426120716.GD2941@psychotron.lab.eng.brq.redhat.com>

Jiri Pirko <jpirko@redhat.com> writes:

> Mon, Apr 26, 2010 at 01:18:07PM CEST, jpirko@redhat.com wrote:
>>This is needed to let know appropriate code (driver) know that pernet memory
>>chunk  was freed already. For example when driver has also registered netdev
>>notifier, it can be called after memory is freed which could eventually lead
>>to panic. Also a null check should be present in these notifiers.
>>
>>For example, previously, when drivers were responsible for pernet memory
>>allocation/freeing, in pppoe this assign was done in pppoe_exit_net() right
>>after pernet memory was freed. The check in notifier stayed (in pppoe_flush_dev
>>called from pppoe_device_event) but makes no sense now without this patch.
>
> Hmm, thinking about this, since the life of pernet memories is directly connected
> with the life of net, as described by Eric, this should not be needed. So please
> scrach this one too. Sorry.
>
> Anyway, I'm still wondering how to prevent netdev_notifiers from using this when
> net is gone. Going to do more research, I'm probably missing something.

Also semenatically net_assign_generic is only guaranteed to work once, so assigning
NULL is at least semantically wrong.

Eric


>>I already know about several notifiers where the check should be present,
>>patches will follow up.
>>
>>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>
>>diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>>index bd8c471..29f622c 100644
>>--- a/net/core/net_namespace.c
>>+++ b/net/core/net_namespace.c
>>@@ -51,6 +51,7 @@ static void ops_free(const struct pernet_operations *ops, struct net *net)
>> 	if (ops->id && ops->size) {
>> 		int id = *ops->id;
>> 		kfree(net_generic(net, id));
>>+		net_assign_generic(net, id, NULL);
>> 	}
>> }
>> 

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: David Miller @ 2010-04-26 17:56 UTC (permalink / raw)
  To: bhutchings; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <1272304072.2376.10.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Mon, 26 Apr 2010 18:47:52 +0100

> Unfortunately the widely-implemented Toeplitz hash functions are defined
> only for TCP/IPv4, TCP/IPv6, IPv4 and IPv6.

What a complete and utter waste all of this is then.

Defining the hash as TCP only is about as stupid as making hw checksum
offload be TCP only.

Really, I'm completely stunned.

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Ben Hutchings @ 2010-04-26 17:47 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, davem, netdev
In-Reply-To: <q2o65634d661004261038t8f0c32f5ye8169ef0c38172b7@mail.gmail.com>

On Mon, 2010-04-26 at 10:38 -0700, Tom Herbert wrote:
> > Hi Tom
> >
> > I tested this rxhash feature on my bnx2x card, using latest net-next-2.6
> > and appropriate ethtool
> >
> > ethtool -k eth1 rxhash on
> >
> > Then I used my pktgen script, to flood machine with flows with udp dst
> > port varying between 4000 and 4015.
> >
> > Software generated rxhash is OK (16 different values).
> >
> > But with bnx2x provided hash, all skb were hashed to same rxhash
> > value :(
> >
> > What are the specs of this hardware hash ?
> > It seems to not care of udp source/destination ports.
> >
> 
> It would appear that way :-(.  I was going to ping Broadcom folks to
> see if there's support for UDP.
[...]

Unfortunately the widely-implemented Toeplitz hash functions are defined
only for TCP/IPv4, TCP/IPv6, IPv4 and IPv6.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Tom Herbert @ 2010-04-26 17:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev
In-Reply-To: <1272302434.19143.76.camel@edumazet-laptop>

> Hi Tom
>
> I tested this rxhash feature on my bnx2x card, using latest net-next-2.6
> and appropriate ethtool
>
> ethtool -k eth1 rxhash on
>
> Then I used my pktgen script, to flood machine with flows with udp dst
> port varying between 4000 and 4015.
>
> Software generated rxhash is OK (16 different values).
>
> But with bnx2x provided hash, all skb were hashed to same rxhash
> value :(
>
> What are the specs of this hardware hash ?
> It seems to not care of udp source/destination ports.
>

It would appear that way :-(.  I was going to ping Broadcom folks to
see if there's support for UDP.

> Also, should'nt we feed same values for the seeds on different nics ?
>
> for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4)
>        REG_WR(bp, i, random32());
>

Yes, that is reasonable.

> Thanks
>
>
>

^ permalink raw reply

* [PATCH v2] tg3: Fix INTx fallback when MSI fails
From: Andre Detsch @ 2010-04-26 17:27 UTC (permalink / raw)
  To: netdev, mcarlson, Michael Chan
In-Reply-To: <20100421.161751.94080129.davem@davemloft.net>

tg3: Fix INTx fallback when MSI fails

MSI setup changes the value of irq_vec in struct tg3 *tp.
This attribute must be taken into account and restored before
we try to do a new request_irq for INTx fallback.

In powerpc, the original code was leading to an EINVAL return within
request_irq, because the driver was trying to use the disabled MSI
virtual irq number instead of tp->pdev->irq.

Signed-off-by: Andre Detsch <adetsch@br.ibm.com>

---
Tested on powerpc, but should be safe for other architectures as well.
v2: removed unnecessary changes

Index: linux-2.6.34-rc5/drivers/net/tg3.c
===================================================================
--- linux-2.6.34-rc5.orig/drivers/net/tg3.c	2010-04-19 20:29:56.000000000 -0300
+++ linux-2.6.34-rc5/drivers/net/tg3.c	2010-04-26 14:10:42.000000000 -0300
@@ -8633,6 +8633,7 @@ static int tg3_test_msi(struct tg3 *tp)
 	pci_disable_msi(tp->pdev);
 
 	tp->tg3_flags2 &= ~TG3_FLG2_USING_MSI;
+	tp->napi[0].irq_vec = tp->pdev->irq;
 
 	err = tg3_request_irq(tp, 0);
 	if (err)

^ permalink raw reply

* Re: [PATCH] bnx2x: add support for receive hashing
From: Eric Dumazet @ 2010-04-26 17:20 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1004222249400.27016@pokey.mtv.corp.google.com>

Le jeudi 22 avril 2010 à 22:54 -0700, Tom Herbert a écrit :
> Add support to bnx2x to extract Toeplitz hash out of the receive descriptor
> for use in skb->rxhash.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 0819530..8bd2368 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -1330,7 +1330,7 @@ static inline u32 reg_poll(struct bnx2x *bp, u32 reg, u32 expected, int ms,
>  		AEU_INPUTS_ATTN_BITS_MCP_LATCHED_UMP_TX_PARITY | \
>  		AEU_INPUTS_ATTN_BITS_MCP_LATCHED_SCPAD_PARITY)
>  
> -#define MULTI_FLAGS(bp) \
> +#define RSS_FLAGS(bp) \
>  		(TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_CAPABILITY | \
>  		 TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV4_TCP_CAPABILITY | \
>  		 TSTORM_ETH_FUNCTION_COMMON_CONFIG_RSS_IPV6_CAPABILITY | \
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 0c6dba2..613f727 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -1582,7 +1582,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>  		struct sw_rx_bd *rx_buf = NULL;
>  		struct sk_buff *skb;
>  		union eth_rx_cqe *cqe;
> -		u8 cqe_fp_flags;
> +		u8 cqe_fp_flags, cqe_fp_status_flags;
>  		u16 len, pad;
>  
>  		comp_ring_cons = RCQ_BD(sw_comp_cons);
> @@ -1598,6 +1598,7 @@ static int bnx2x_rx_int(struct bnx2x_fastpath *fp, int budget)
>  
>  		cqe = &fp->rx_comp_ring[comp_ring_cons];
>  		cqe_fp_flags = cqe->fast_path_cqe.type_error_flags;
> +		cqe_fp_status_flags = cqe->fast_path_cqe.status_flags;
>  
>  		DP(NETIF_MSG_RX_STATUS, "CQE type %x  err %x  status %x"
>  		   "  queue %x  vlan %x  len %u\n", CQE_TYPE(cqe_fp_flags),
> @@ -1727,6 +1728,12 @@ reuse_rx:
>  
>  			skb->protocol = eth_type_trans(skb, bp->dev);
>  
> +			if ((bp->dev->features & ETH_FLAG_RXHASH) &&
> +			    (cqe_fp_status_flags &
> +			     ETH_FAST_PATH_RX_CQE_RSS_HASH_FLG))
> +				skb->rxhash = le32_to_cpu(
> +				    cqe->fast_path_cqe.rss_hash_result);
> +
>  			skb->ip_summed = CHECKSUM_NONE;
>  			if (bp->rx_csum) {
>  				if (likely(BNX2X_RX_CSUM_OK(cqe)))
> @@ -5750,10 +5757,10 @@ static void bnx2x_init_internal_func(struct bnx2x *bp)
>  	u32 offset;
>  	u16 max_agg_size;
>  
> -	if (is_multi(bp)) {
> -		tstorm_config.config_flags = MULTI_FLAGS(bp);
> +	tstorm_config.config_flags = RSS_FLAGS(bp);
> +
> +	if (is_multi(bp))
>  		tstorm_config.rss_result_mask = MULTI_MASK;
> -	}
>  
>  	/* Enable TPA if needed */
>  	if (bp->flags & TPA_ENABLE_FLAG)
> @@ -6629,10 +6636,8 @@ static int bnx2x_init_common(struct bnx2x *bp)
>  	bnx2x_init_block(bp, PBF_BLOCK, COMMON_STAGE);
>  
>  	REG_WR(bp, SRC_REG_SOFT_RST, 1);
> -	for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4) {
> -		REG_WR(bp, i, 0xc0cac01a);
> -		/* TODO: replace with something meaningful */
> -	}
> +	for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4)
> +		REG_WR(bp, i, random32());
>  	bnx2x_init_block(bp, SRCH_BLOCK, COMMON_STAGE);
>  #ifdef BCM_CNIC
>  	REG_WR(bp, SRC_REG_KEYSEARCH_0, 0x63285672);
> @@ -11001,6 +11006,11 @@ static int bnx2x_set_flags(struct net_device *dev, u32 data)
>  		changed = 1;
>  	}
>  
> +	if (data & ETH_FLAG_RXHASH)
> +		dev->features |= NETIF_F_RXHASH;
> +	else
> +		dev->features &= ~NETIF_F_RXHASH;
> +
>  	if (changed && netif_running(dev)) {
>  		bnx2x_nic_unload(bp, UNLOAD_NORMAL);
>  		rc = bnx2x_nic_load(bp, LOAD_NORMAL);
> --

Hi Tom

I tested this rxhash feature on my bnx2x card, using latest net-next-2.6
and appropriate ethtool

ethtool -k eth1 rxhash on

Then I used my pktgen script, to flood machine with flows with udp dst
port varying between 4000 and 4015.

Software generated rxhash is OK (16 different values).

But with bnx2x provided hash, all skb were hashed to same rxhash
value :(

What are the specs of this hardware hash ?
It seems to not care of udp source/destination ports.

Also, should'nt we feed same values for the seeds on different nics ?

for (i = SRC_REG_KEYRSS0_0; i <= SRC_REG_KEYRSS1_9; i += 4)
	REG_WR(bp, i, random32());

Thanks



^ permalink raw reply

* Re: 2.6.34-rc5-git6 (plus all patches) -- new INFO: suspicious rcu_dereference_check() usage.
From: Paul E. McKenney @ 2010-04-26 16:10 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <s2ka44ae5cd1004251419hb8f4c137md0e399c464d9290a@mail.gmail.com>

On Sun, Apr 25, 2010 at 05:19:38PM -0400, Miles Lane wrote:
> [  139.730133] [ INFO: suspicious rcu_dereference_check() usage. ]
> [  139.730136] ---------------------------------------------------
> [  139.730139] include/net/inet_timewait_sock.h:227 invoked
> rcu_dereference_check() without protection!
> [  139.730142]
> [  139.730143] other info that might help us debug this:
> [  139.730144]
> [  139.730147]
> [  139.730148] rcu_scheduler_active = 1, debug_locks = 1
> [  139.730151] 1 lock held by swapper/0:
> [  139.730158]  #0:  (net/ipv4/tcp_minisocks.c:41){+.-...}, at:
> [<ffffffff81046ad2>] run_timer_softirq+0x19f/0x370
> [  139.730169]
> [  139.730170] stack backtrace:
> [  139.730173] Pid: 0, comm: swapper Not tainted 2.6.34-rc5-git6 #28
> [  139.730176] Call Trace:
> [  139.730178]  <IRQ>  [<ffffffff81065f9a>] lockdep_rcu_dereference+0x9d/0xa5
> [  139.730189]  [<ffffffff813d1abf>] twsk_net+0x4f/0x57
> [  139.730193]  [<ffffffff813d2122>] __inet_twsk_kill+0x79/0xd0
> [  139.730198]  [<ffffffff813d24f9>] inet_twdr_do_twkill_work+0x59/0xba
> [  139.730203]  [<ffffffff813d263d>] inet_twdr_hangman+0x30/0x97
> [  139.730207]  [<ffffffff81046b9e>] run_timer_softirq+0x26b/0x370
> [  139.730211]  [<ffffffff81046ad2>] ? run_timer_softirq+0x19f/0x370
> [  139.730216]  [<ffffffff813d260d>] ? inet_twdr_hangman+0x0/0x97
> [  139.730221]  [<ffffffff81040230>] ? __do_softirq+0x7f/0x252
> [  139.730225]  [<ffffffff810402f5>] __do_softirq+0x144/0x252
> [  139.730230]  [<ffffffff8100320c>] call_softirq+0x1c/0x34
> [  139.730235]  [<ffffffff81004864>] do_softirq+0x38/0x80
> [  139.730239]  [<ffffffff8103fdc2>] irq_exit+0x45/0x94
> [  139.730243]  [<ffffffff81003fa9>] do_IRQ+0xad/0xc4
> [  139.730247]  [<ffffffff81432953>] ret_from_intr+0x0/0xf
> [  139.730250]  <EOI>  [<ffffffff81248383>] ? acpi_idle_enter_bm+0x262/0x28d
> [  139.730260]  [<ffffffff81248379>] ? acpi_idle_enter_bm+0x258/0x28d
> [  139.730266]  [<ffffffff8137b9cf>] cpuidle_idle_call+0x9c/0x13d
> [  139.730273]  [<ffffffff810012c2>] cpu_idle+0xa5/0xfc
> [  139.730277]  [<ffffffff8142abdd>] start_secondary+0x1f3/0x1fc

This appears to be another symptom of the bug you located in your earlier
testing, so either Eric Biederman sends a patch or I improvise.

And thank you again for all your testing, Miles!!!

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-26 16:09 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg, Eric W. Biederman
In-Reply-To: <p2ga44ae5cd1004251320j55d23ba7ua0fa14497f623d76@mail.gmail.com>

On Sun, Apr 25, 2010 at 04:20:13PM -0400, Miles Lane wrote:
> > I am down to seeing three suspicious rcu_dereference_check traces when
> > I apply this patch and all the previous patches to 2.6.34-rc5-git6.
> >
> > 1. The "__sched_setscheduler+0x19d/0x300" trace.
> > 2. The two "is_swiotlb_buffer+0x2e/0x3b" traces (waiting to see
> > Johannes' patch show up in a Linux snapshot)
> >
> > Did I miss a patch for the setscheduler issue?
> 
> Hmm.  I am still seeing these two messages as well.
> 
> [   83.363146] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   83.363148] ---------------------------------------------------
> [   83.363151] include/net/inet_timewait_sock.h:227 invoked
> rcu_dereference_check() without protection!
> [   83.363154]
> [   83.363155] other info that might help us debug this:
> [   83.363156]
> [   83.363158]
> [   83.363159] rcu_scheduler_active = 1, debug_locks = 1
> [   83.363162] 2 locks held by gwibber-service/5076:
> [   83.363164]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff8110534a>]
> seq_read+0x37/0x381
> [   83.363176]  #1:  (&(&hashinfo->ehash_locks[i])->rlock){+.-...},
> at: [<ffffffff813ddcd5>] established_get_next+0xc4/0x132
> [   83.363186]
> [   83.363187] stack backtrace:
> [   83.363191] Pid: 5076, comm: gwibber-service Not tainted 2.6.34-rc5-git6 #27
> [   83.363194] Call Trace:
> [   83.363202]  [<ffffffff81068086>] lockdep_rcu_dereference+0x9d/0xa5
> [   83.363207]  [<ffffffff813dc998>] twsk_net+0x4f/0x57
> [   83.363212]  [<ffffffff813ddc65>] established_get_next+0x54/0x132
> [   83.363216]  [<ffffffff813dde47>] tcp_seq_next+0x5d/0x6a
> [   83.363221]  [<ffffffff81105599>] seq_read+0x286/0x381
> [   83.363226]  [<ffffffff81105313>] ? seq_read+0x0/0x381
> [   83.363231]  [<ffffffff8113503c>] proc_reg_read+0x8d/0xac
> [   83.363236]  [<ffffffff810ebf14>] vfs_read+0xa6/0x103
> [   83.363241]  [<ffffffff810ec027>] sys_read+0x45/0x69
> [   83.363246]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b
> 
> [   84.660302] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   84.660304] ---------------------------------------------------
> [   84.660308] include/net/inet_timewait_sock.h:227 invoked
> rcu_dereference_check() without protection!
> [   84.660311]
> [   84.660312] other info that might help us debug this:
> [   84.660313]
> [   84.660315]
> [   84.660316] rcu_scheduler_active = 1, debug_locks = 1
> [   84.660319] no locks held by gwibber-service/5081.
> [   84.660321]
> [   84.660322] stack backtrace:
> [   84.660325] Pid: 5081, comm: gwibber-service Not tainted 2.6.34-rc5-git6 #27
> [   84.660328] Call Trace:
> [   84.660339]  [<ffffffff81068086>] lockdep_rcu_dereference+0x9d/0xa5
> [   84.660345]  [<ffffffff813cad6f>] twsk_net+0x4f/0x57
> [   84.660350]  [<ffffffff813cb18f>] __inet_twsk_hashdance+0x50/0x158
> [   84.660355]  [<ffffffff813e0bb9>] tcp_time_wait+0x1c1/0x24b
> [   84.660360]  [<ffffffff813d3d97>] tcp_fin+0x83/0x162
> [   84.660364]  [<ffffffff813d4727>] tcp_data_queue+0x1ff/0xa1e
> [   84.660370]  [<ffffffff810496aa>] ? mod_timer+0x1e/0x20
> [   84.660375]  [<ffffffff813d8363>] tcp_rcv_state_process+0x89d/0x8f2
> [   84.660381]  [<ffffffff813943bb>] ? release_sock+0x30/0x10b
> [   84.660386]  [<ffffffff813de772>] tcp_v4_do_rcv+0x2de/0x33f
> [   84.660391]  [<ffffffff8139440d>] release_sock+0x82/0x10b
> [   84.660395]  [<ffffffff813ce875>] tcp_close+0x1b5/0x37e
> [   84.660401]  [<ffffffff813ecdb7>] inet_release+0x50/0x57
> [   84.660405]  [<ffffffff81391ae4>] sock_release+0x1a/0x66
> [   84.660410]  [<ffffffff81391b52>] sock_close+0x22/0x26
> [   84.660415]  [<ffffffff810ece07>] __fput+0x120/0x1cd
> [   84.660420]  [<ffffffff810ecec9>] fput+0x15/0x17
> [   84.660424]  [<ffffffff810e9d41>] filp_close+0x63/0x6d
> [   84.660428]  [<ffffffff810e9e22>] sys_close+0xd7/0x111
> [   84.660434]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

Eric Dumazet traced these down to a commit from Eric Biederman.

If I don't hear from Eric Biederman in a few days, I will attempt a
patch, but it would be more likely to be correct coming from someone
with a better understanding of the code.  ;-)

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH 1/4] e1000e: save skb counts in TX to avoid cache misses
From: Alexander Duyck @ 2010-04-26 15:55 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <alpine.DEB.1.00.1004231719050.16806@pokey.mtv.corp.google.com>

Tom Herbert wrote:
> In e1000_tx_map, precompute number of segements and bytecounts which
> are derived from fields in skb; these are stored in buffer_info.  When
> cleaning tx in e1000_clean_tx_irq use the values in the associated
> buffer_info for statistics counting, this eliminates cache misses
> on skb fields.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
> diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
> index 12648a1..d6da75b 100644
> --- a/drivers/net/e1000e/e1000.h
> +++ b/drivers/net/e1000e/e1000.h
> @@ -188,6 +188,8 @@ struct e1000_buffer {
>  			unsigned long time_stamp;
>  			u16 length;
>  			u16 next_to_watch;
> +			unsigned int segs;
> +			unsigned int bytecount;
>  			u16 mapped_as_page;
>  		};
>  		/* Rx */

Does segs need to be a full int here?  I think you can probably get away 
with either an unsigned short or just define it as a u16, then combine 
it with mapped_as_page to cut down on the overall size.

Thanks,

Alex

^ permalink raw reply

* [PATCH] cxgb3: Wait longer for control packets on initialization
From: Andre Detsch @ 2010-04-26 15:38 UTC (permalink / raw)
  To: netdev, divy


In some Power7 platforms, when using VIOS (Virtual I/O Server), we
need to wait longer for control packets to finish transfer during
initialization.
Without this change, initialization may fail prematurely.

Signed-off-by: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Andre Detsch <adetsch@br.ibm.com>

Index: linux-2.6.34-rc5/drivers/net/cxgb3/cxgb3_main.c
===================================================================
--- linux-2.6.34-rc5.orig/drivers/net/cxgb3/cxgb3_main.c	2010-04-23 18:59:43.000000000 -0300
+++ linux-2.6.34-rc5/drivers/net/cxgb3/cxgb3_main.c	2010-04-23 18:59:55.000000000 -0300
@@ -439,7 +439,7 @@ static void free_irq_resources(struct ad
 static int await_mgmt_replies(struct adapter *adap, unsigned long init_cnt,
 			      unsigned long n)
 {
-	int attempts = 5;
+	int attempts = 10;

 	while (adap->sge.qs[0].rspq.offload_pkts < init_cnt + n) {
 		if (!--attempts)

^ permalink raw reply

* Re: [patch v2] bluetooth: handle l2cap_create_connless_pdu() errors
From: Gustavo F. Padovan @ 2010-04-26 15:09 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Marcel Holtmann, David S. Miller, Andrei Emeltchenko,
	linux-bluetooth, netdev, kernel-janitors
In-Reply-To: <20100426113627.GS29093@bicker>

Hi Dan,

* Dan Carpenter <error27@gmail.com> [2010-04-26 13:36:27 +0200]:

> l2cap_create_connless_pdu() can sometimes return ERR_PTR(-ENOMEM) or
> ERR_PTR(-EFAULT).
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>
> ---
> In v2 I wrote the patch on top of Gustavo Padovon's devel tree
> 
> diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c
> index a668ef4..b32682c 100644
> --- a/net/bluetooth/l2cap.c
> +++ b/net/bluetooth/l2cap.c
> @@ -1712,8 +1712,12 @@ static int l2cap_sock_sendmsg(struct kiocb *iocb, struct socket *sock, struct ms
>  	/* Connectionless channel */
>  	if (sk->sk_type == SOCK_DGRAM) {
>  		skb = l2cap_create_connless_pdu(sk, msg, len);
> -		l2cap_do_send(sk, skb);
> -		err = len;
> +		if (IS_ERR(skb)) {
> +			err = PTR_ERR(skb);

Remove the braces on the 'if', then we are fine to pick your patch ;)

> +		} else {
> +			l2cap_do_send(sk, skb);
> +			err = len;
> +		}
>  		goto done;
>  	}
>  


Regards,

-- 
Gustavo F. Padovan
http://padovan.org

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-26 14:55 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David S. Miller, Tom Herbert, Stephen Hemminger,
	netdev, Andi Kleen
In-Reply-To: <1272290584.19143.43.camel@edumazet-laptop>

Le lundi 26 avril 2010 à 16:03 +0200, Eric Dumazet a écrit :
> Le samedi 24 avril 2010 à 10:10 -0400, jamal a écrit :
> > On Fri, 2010-04-23 at 18:02 -0400, jamal wrote:
> > 
> > > Ive done a setup with the last patch from Changli + net-next - I will
> > > post test results tomorrow AM.
> > 
> > ok, annotated results attached. 
> > 
> > cheers,
> > jamal
> 
> Jamal, I have a Nehalem setup now, and I can see
> _raw_spin_lock_irqsave() abuse is not coming from network tree, but from
> clockevents_notify()
> 

Another interesting finding:

- if all packets are received on a single queue, max speed seems to be
1.200.000 packets per second on my machine :-(

And on profile of receiving cpu (RPS enabled, pakets sent to 15 other
cpus), we can see default_send_IPI_mask_sequence_phys() is the slow
thing...

Andi, what do you think of this one ?
Dont we have a function to send an IPI to an individual cpu instead ?

void default_send_IPI_mask_sequence_phys(const struct cpumask *mask, int
vector)
{
        unsigned long query_cpu;
        unsigned long flags;

        /*
         * Hack. The clustered APIC addressing mode doesn't allow us to
send
         * to an arbitrary mask, so I do a unicast to each CPU instead.
         * - mbligh
         */
        local_irq_save(flags);
        for_each_cpu(query_cpu, mask) {
                __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid,
                                query_cpu), vector, APIC_DEST_PHYSICAL);
        }
        local_irq_restore(flags);
}


-----------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1000 irqs/sec  kernel:100.0% [1000Hz cycles],  (all, cpu:
7)
-----------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ _______

              668.00 17.7% default_send_IPI_mask_sequence_phys vmlinux
              363.00  9.6% bnx2x_rx_int                        vmlinux
              354.00  9.4% eth_type_trans                      vmlinux
              332.00  8.8% kmem_cache_alloc_node               vmlinux
              285.00  7.6% __kmalloc_node_track_caller         vmlinux
              278.00  7.4% _raw_spin_lock                      vmlinux
              166.00  4.4% __slab_alloc                        vmlinux
              147.00  3.9% __memset                            vmlinux
              136.00  3.6% list_del                            vmlinux
              132.00  3.5% get_partial_node                    vmlinux
              131.00  3.5% get_rps_cpu                         vmlinux
              102.00  2.7% enqueue_to_backlog                  vmlinux
               95.00  2.5% unmap_single                        vmlinux
               94.00  2.5% __alloc_skb                         vmlinux
               74.00  2.0% vlan_gro_common                     vmlinux
               52.00  1.4% __phys_addr                         vmlinux
               48.00  1.3% dev_gro_receive                     vmlinux
               39.00  1.0% swiotlb_dma_mapping_error           vmlinux
               36.00  1.0% swiotlb_map_page                    vmlinux
               34.00  0.9% skb_put                             vmlinux
               27.00  0.7% is_swiotlb_buffer                   vmlinux
               23.00  0.6% deactivate_slab                     vmlinux
               20.00  0.5% vlan_gro_receive                    vmlinux
               17.00  0.5% __skb_bond_should_drop              vmlinux
               14.00  0.4% netif_receive_skb                   vmlinux
               14.00  0.4% __netdev_alloc_skb                  vmlinux
               12.00  0.3% skb_gro_reset_offset                vmlinux
               12.00  0.3% get_slab                            vmlinux
               11.00  0.3% napi_skb_finish                     vmlinux



^ permalink raw reply

* [PATCH] ieee802154: Fix oops during ieee802154_sock_ioctl
From: Dmitry Eremin-Solenikov @ 2010-04-26 14:46 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, linux-kernel, Stefan Schmidt

From: Stefan Schmidt <stefan@datenfreihafen.org>

Trying to run izlisten (from lowpan-tools tests) on a device that does not
exists I got the oops below. The problem is that we are using get_dev_by_name
without checking if we really get a device back. We don't in this case and
writing to dev->type generates this oops.

[Oops code removed by Dmitry Eremin-Solenikov]

If possible this patch should be applied to the current -rc fixes branch.

Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
---
 net/ieee802154/af_ieee802154.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ieee802154/af_ieee802154.c b/net/ieee802154/af_ieee802154.c
index bad1c49..72340dd 100644
--- a/net/ieee802154/af_ieee802154.c
+++ b/net/ieee802154/af_ieee802154.c
@@ -147,6 +147,9 @@ static int ieee802154_dev_ioctl(struct sock *sk, struct ifreq __user *arg,
 	dev_load(sock_net(sk), ifr.ifr_name);
 	dev = dev_get_by_name(sock_net(sk), ifr.ifr_name);
 
+	if (!dev)
+		return -ENODEV;
+
 	if (dev->type == ARPHRD_IEEE802154 && dev->netdev_ops->ndo_do_ioctl)
 		ret = dev->netdev_ops->ndo_do_ioctl(dev, &ifr, cmd);
 
-- 
1.7.0

^ permalink raw reply related

* Re: [PATCH] e100: expose broadcast_disabled as a module option
From: Erwan Velu @ 2010-04-26 14:45 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Jeff Kirsher, netdev, David Miller, linux-kernel,
	jesse.brandeburg, bruce.w.allan, alexander.h.duyck,
	peter.p.waskiewicz.jr, john.ronciak
In-Reply-To: <y2lb43bf5491004260149x23beb95dqc9e177d9a6ecc679@mail.gmail.com>

2010/4/26 Erwan Velu <erwanaliasr1@gmail.com>:

After some research, here come my status :

> Does any one you could help me understanding what should be the good way to
> 1- enabled/disabled IFF_BROADCAST for a given interface

Looks like the current code isn't taking care of a possible changing
of IFF_BROADCAST. I can find some code in net/core/dev.c to handle
MULTICAST or PROMISC but nothing for BROADCAST.
do I have to implement one ?


> 2- populate this changes to the driver

Looks like I need then to add a .ndo_change_rx_flags function to
manage this changes.

Does someone confirm this is the way to go ?

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Jesper Dangaard Brouer @ 2010-04-26 14:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, paulmck, Patrick McHardy, Changli Gao,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1272139861.20714.525.camel@edumazet-laptop>

On Sat, 2010-04-24 at 22:11 +0200, Eric Dumazet wrote:
>  
> > Monday or Tuesdag I'll do a test setup with some old HP380 G4 machines to 
> > see if I can reproduce the DDoS attack senario.  And see if I can get 
> > it into to lookup loop.
> 
> Theorically a loop is very unlikely, given a single retry is very
> unlikly too.
> 
> Unless a cpu gets in its cache a corrupted value of a 'next' pointer.
> 
...
>
> With same hash bucket size (300.032) and max conntracks (800.000), and
> after more than 10 hours of test, not a single lookup was restarted
> because of a nulls with wrong value.

So fare, I have to agree with you.  I have now tested on the same type
of hardware (although running a 64-bit kernel, and off net-next-2.6),
and the result is, the same as yours, I don't see a any restarts of the
loop.  The test systems differs a bit, as it has two physical CPU and 2M
cache (and annoyingly the system insists on using HPET as clocksource).

Guess, the only explanation would be bad/sub-optimal hash distribution.
With 40 kpps and 700.000 'searches' per second, the hash bucket list
length "only" need to be 17.5 elements on average, where optimum is 3.
With my pktgen DoS test, where I tried to reproduce the DoS attack, only
see a screw of 6 elements on average.


> I can setup a test on a 16 cpu machine, multiqueue card too.

Don't think that is necessary.  My theory was it was possible on slower
single queue NIC, where one CPU is 100% busy in the conntrack search,
and the other CPUs delete the entries (due to early drop and
call_rcu()).  But guess that note the case, and RCU works perfectly ;-)

> Hmm, I forgot to say I am using net-next-2.6, not your kernel version...

I also did this test using net-next-2.6, perhaps I should try the
version I use in production...


-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-26 14:03 UTC (permalink / raw)
  To: hadi; +Cc: Changli Gao, David S. Miller, Tom Herbert, Stephen Hemminger,
	netdev
In-Reply-To: <1272118252.8918.13.camel@bigi>

Le samedi 24 avril 2010 à 10:10 -0400, jamal a écrit :
> On Fri, 2010-04-23 at 18:02 -0400, jamal wrote:
> 
> > Ive done a setup with the last patch from Changli + net-next - I will
> > post test results tomorrow AM.
> 
> ok, annotated results attached. 
> 
> cheers,
> jamal

Jamal, I have a Nehalem setup now, and I can see
_raw_spin_lock_irqsave() abuse is not coming from network tree, but from
clockevents_notify()

My pktgen sends 1040989pps :

# Samples: 389707198131
#
# Overhead         Command                 Shared Object  Symbol
# ........  ..............  ............................  ......
#
    23.52%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
                      |
                      --- _raw_spin_lock_irqsave
                         |          
                         |--94.74%-- clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--4.10%-- tick_broadcast_oneshot_control
                         |          tick_notify
                         |          notifier_call_chain
                         |          __raw_notifier_call_chain
                         |          raw_notifier_call_chain
                         |          clockevents_do_notify
                         |          clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--0.58%-- lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                          --0.58%-- [...]

     8.94%            init  [kernel.kallsyms]             [k] acpi_os_read_port
                      |
                      --- acpi_os_read_port
                         |          
                         |--99.55%-- acpi_hw_read_port
                         |          acpi_hw_read
                         |          acpi_hw_read_multiple
                         |          acpi_hw_register_read
                         |          acpi_read_bit_register



# Samples: 389233082962
#
# Overhead         Command                 Shared Object  Symbol
# ........  ..............  ............................  ......
#
    23.25%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
     8.90%            init  [kernel.kallsyms]             [k] acpi_os_read_port
     2.93%            init  [kernel.kallsyms]             [k] mwait_idle_with_hints
     1.99%            init  [kernel.kallsyms]             [k] schedule
     1.94%         udpsink  [kernel.kallsyms]             [k] schedule
     1.73%         swapper  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
     1.48%            init  [kernel.kallsyms]             [k] bnx2x_rx_int
     1.47%            init  [kernel.kallsyms]             [k] _raw_spin_unlock_irqrestore
     1.44%            init  [kernel.kallsyms]             [k] _raw_spin_lock
     1.36%         udpsink  [kernel.kallsyms]             [k] udp_recvmsg
     1.05%         udpsink  [kernel.kallsyms]             [k] __skb_recv_datagram
     1.05%            init  [kernel.kallsyms]             [k] __udp4_lib_lookup
     1.04%         udpsink  [kernel.kallsyms]             [k] copy_user_generic_string
     1.04%         udpsink  [kernel.kallsyms]             [k] __slab_free
     0.99%            init  [kernel.kallsyms]             [k] select_task_rq_fair
     0.99%            init  [kernel.kallsyms]             [k] try_to_wake_up
     0.98%            init  [kernel.kallsyms]             [k] task_rq_lock
     0.93%            init  [kernel.kallsyms]             [k] tick_broadcast_oneshot_control
     0.89%            init  [kernel.kallsyms]             [k] sock_queue_rcv_skb
     0.89%         udpsink  [kernel.kallsyms]             [k] sock_recv_ts_and_drops
     0.88%         udpsink  [kernel.kallsyms]             [k] kfree
     0.79%         swapper  [kernel.kallsyms]             [k] acpi_os_read_port
     0.76%         udpsink  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
     0.73%         udpsink  [kernel.kallsyms]             [k] inet_recvmsg
     0.71%         udpsink  [vdso]                        [.] 0x000000ffffe431
     0.65%         udpsink  [kernel.kallsyms]             [k] sock_recvmsg
     0.62%            init  [kernel.kallsyms]             [k] gs_change
     0.61%            init  [kernel.kallsyms]             [k] enqueue_task_fair
     0.61%            init  [kernel.kallsyms]             [k] eth_type_trans
     0.61%            init  [kernel.kallsyms]             [k] sock_def_readable
     0.60%         udpsink  [kernel.kallsyms]             [k] _raw_spin_lock_bh
     0.59%            init  [kernel.kallsyms]             [k] ip_route_input
     0.59%         udpsink  libpthread-2.3.4.so           [.] __pthread_disable_asynccancel
     0.56%            init  [kernel.kallsyms]             [k] bnx2x_poll
     0.56%         udpsink  [kernel.kallsyms]             [k] __get_user_4



^ permalink raw reply

* [PATCH net-next-2.6] pppoe: use phonet_pernet instead of directly net_generic
From: Jiri Pirko @ 2010-04-26 13:41 UTC (permalink / raw)
  To: netdev; +Cc: davem

As in for example pppoe introduce phonet_pernet and use it instead of calling
net_generic directly.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

diff --git a/net/phonet/pn_dev.c b/net/phonet/pn_dev.c
index 9b4ced6..c33da65 100644
--- a/net/phonet/pn_dev.c
+++ b/net/phonet/pn_dev.c
@@ -46,9 +46,16 @@ struct phonet_net {
 
 int phonet_net_id __read_mostly;
 
+static struct phonet_net *phonet_pernet(struct net *net)
+{
+	BUG_ON(!net);
+
+	return net_generic(net, phonet_net_id);
+}
+
 struct phonet_device_list *phonet_device_list(struct net *net)
 {
-	struct phonet_net *pnn = net_generic(net, phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(net);
 	return &pnn->pndevs;
 }
 
@@ -261,7 +268,7 @@ static int phonet_device_autoconf(struct net_device *dev)
 
 static void phonet_route_autodel(struct net_device *dev)
 {
-	struct phonet_net *pnn = net_generic(dev_net(dev), phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(dev_net(dev));
 	unsigned i;
 	DECLARE_BITMAP(deleted, 64);
 
@@ -313,7 +320,7 @@ static struct notifier_block phonet_device_notifier = {
 /* Per-namespace Phonet devices handling */
 static int __net_init phonet_init_net(struct net *net)
 {
-	struct phonet_net *pnn = net_generic(net, phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(net);
 
 	if (!proc_net_fops_create(net, "phonet", 0, &pn_sock_seq_fops))
 		return -ENOMEM;
@@ -326,7 +333,7 @@ static int __net_init phonet_init_net(struct net *net)
 
 static void __net_exit phonet_exit_net(struct net *net)
 {
-	struct phonet_net *pnn = net_generic(net, phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(net);
 	struct net_device *dev;
 	unsigned i;
 
@@ -376,7 +383,7 @@ void phonet_device_exit(void)
 
 int phonet_route_add(struct net_device *dev, u8 daddr)
 {
-	struct phonet_net *pnn = net_generic(dev_net(dev), phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(dev_net(dev));
 	struct phonet_routes *routes = &pnn->routes;
 	int err = -EEXIST;
 
@@ -393,7 +400,7 @@ int phonet_route_add(struct net_device *dev, u8 daddr)
 
 int phonet_route_del(struct net_device *dev, u8 daddr)
 {
-	struct phonet_net *pnn = net_generic(dev_net(dev), phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(dev_net(dev));
 	struct phonet_routes *routes = &pnn->routes;
 
 	daddr = daddr >> 2;
@@ -413,7 +420,7 @@ int phonet_route_del(struct net_device *dev, u8 daddr)
 
 struct net_device *phonet_route_get(struct net *net, u8 daddr)
 {
-	struct phonet_net *pnn = net_generic(net, phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(net);
 	struct phonet_routes *routes = &pnn->routes;
 	struct net_device *dev;
 
@@ -428,7 +435,7 @@ struct net_device *phonet_route_get(struct net *net, u8 daddr)
 
 struct net_device *phonet_route_output(struct net *net, u8 daddr)
 {
-	struct phonet_net *pnn = net_generic(net, phonet_net_id);
+	struct phonet_net *pnn = phonet_pernet(net);
 	struct phonet_routes *routes = &pnn->routes;
 	struct net_device *dev;
 

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: Changli Gao @ 2010-04-26 13:35 UTC (permalink / raw)
  To: hadi; +Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <1272281709.8918.35.camel@bigi>

On Mon, Apr 26, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
> On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:
>
>> I read the code again, and find that we don't use spin_lock_irqsave(),
>> and we use local_irq_save() and spin_lock() instead, so
>> _raw_spin_lock_irqsave() and _raw_spin_lock_irqrestore() should not be
>> related to backlog. the lock maybe sk_receive_queue.lock.
>
> Possible.
> I am wondering if there's a way we can precisely nail where that is
> happening? is lockstat any use?
> Fixing _raw_spin_lock_irqsave and friend is the lowest hanging fruit.
>

Maybe lockstat can help in this case.

> So looking at your patch now i see it is likely there was an improvement
> made for non-rps case (moving out of loop some irq_enable etc).
> i.e my results may not be crazy after adding your patch and seeing an
> improvement for non-rps case.
> However, whatever your patch did - it did not help the rps case case:
> call_function_single_interrupt() comes out higher in the profile,
> and # of IPIs seems to have gone up (although i did not measure this, I
> can see the interrupts/second went up by almost 50-60%)

Did you apply the patch from Eric? It would reduce the number of
local_irq_disable() calls but increase the number of IPIs.

>
>> Jamal, did you use a single socket to serve all the clients?
>
> Socket per detected cpu.

Ignore it. I made a mistake here.

>
>> BTW:  completion_queue and output_queue in softnet_data both are LIFO
>> queues. For completion_queue, FIFO is better, as the last used skb is
>> more likely in cache, and should be used first. Since slab has always
>> cache the last used memory at the head, we'd better free the skb in
>> FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.
>
> I think it will depend on how many of those skbs are sitting in the
> completion queue, cache warmth etc. LIFO is always safest, you have
> higher probability of finding a cached skb infront.
>

we call kfree_skb() to release skbs to slab allocator, then slab
allocator stores them in a LIFO queue. If completion queue is also a
LIFO queue, the latest unused skb will be in the front of the queue,
and will be released to slab allocator at first. At the next time, we
call alloc_skb(), the memory used by the skb in the end of the
completion queue will be returned instead of the hot one.

However, as Eric said, new drivers don't rely on completion queue, it
isn't a real problem, especially in your test case.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-2.6] netns: assign NULL after pernet memory is freed
From: Jiri Pirko @ 2010-04-26 13:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm
In-Reply-To: <20100426120716.GD2941@psychotron.lab.eng.brq.redhat.com>

Mon, Apr 26, 2010 at 02:07:17PM CEST, jpirko@redhat.com wrote:
>Mon, Apr 26, 2010 at 01:18:07PM CEST, jpirko@redhat.com wrote:
>>This is needed to let know appropriate code (driver) know that pernet memory
>>chunk  was freed already. For example when driver has also registered netdev
>>notifier, it can be called after memory is freed which could eventually lead
>>to panic. Also a null check should be present in these notifiers.
>>
>>For example, previously, when drivers were responsible for pernet memory
>>allocation/freeing, in pppoe this assign was done in pppoe_exit_net() right
>>after pernet memory was freed. The check in notifier stayed (in pppoe_flush_dev
>>called from pppoe_device_event) but makes no sense now without this patch.
>
>Hmm, thinking about this, since the life of pernet memories is directly connected
>with the life of net, as described by Eric, this should not be needed. So please
>scrach this one too. Sorry.
>
>Anyway, I'm still wondering how to prevent netdev_notifiers from using this when
>net is gone. Going to do more research, I'm probably missing something.

Right, it becomes part of init_ns then. So this looks good then. One thing I can
still imagine what may cause problem, since netdev_notifier is registered before
register_pernet_device, the notifier can be called in between. So I think this
should be reordered, am I right?

Thanks

Jirka

>
>
>>
>>I already know about several notifiers where the check should be present,
>>patches will follow up.
>>
>>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>>
>>diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>>index bd8c471..29f622c 100644
>>--- a/net/core/net_namespace.c
>>+++ b/net/core/net_namespace.c
>>@@ -51,6 +51,7 @@ static void ops_free(const struct pernet_operations *ops, struct net *net)
>> 	if (ops->id && ops->size) {
>> 		int id = *ops->id;
>> 		kfree(net_generic(net, id));
>>+		net_assign_generic(net, id, NULL);
>> 	}
>> }
>> 

^ permalink raw reply

* Re: [PATCH net-2.6] netns: assign NULL after pernet memory is freed
From: Jiri Pirko @ 2010-04-26 12:07 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm
In-Reply-To: <20100426111806.GB2941@psychotron.lab.eng.brq.redhat.com>

Mon, Apr 26, 2010 at 01:18:07PM CEST, jpirko@redhat.com wrote:
>This is needed to let know appropriate code (driver) know that pernet memory
>chunk  was freed already. For example when driver has also registered netdev
>notifier, it can be called after memory is freed which could eventually lead
>to panic. Also a null check should be present in these notifiers.
>
>For example, previously, when drivers were responsible for pernet memory
>allocation/freeing, in pppoe this assign was done in pppoe_exit_net() right
>after pernet memory was freed. The check in notifier stayed (in pppoe_flush_dev
>called from pppoe_device_event) but makes no sense now without this patch.

Hmm, thinking about this, since the life of pernet memories is directly connected
with the life of net, as described by Eric, this should not be needed. So please
scrach this one too. Sorry.

Anyway, I'm still wondering how to prevent netdev_notifiers from using this when
net is gone. Going to do more research, I'm probably missing something.


>
>I already know about several notifiers where the check should be present,
>patches will follow up.
>
>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
>diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>index bd8c471..29f622c 100644
>--- a/net/core/net_namespace.c
>+++ b/net/core/net_namespace.c
>@@ -51,6 +51,7 @@ static void ops_free(const struct pernet_operations *ops, struct net *net)
> 	if (ops->id && ops->size) {
> 		int id = *ops->id;
> 		kfree(net_generic(net, id));
>+		net_assign_generic(net, id, NULL);
> 	}
> }
> 

^ permalink raw reply

* [PATCH net-next-2.6] pppoe: use pppoe_pernet instead of directly net_generic
From: Jiri Pirko @ 2010-04-26 11:46 UTC (permalink / raw)
  To: netdev; +Cc: davem, mostrows


Signed-off-by: Jiri Pirko <jpirko@redhat.com>

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index cdd11ba..c059c8d 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -258,7 +258,7 @@ static inline struct pppox_sock *get_item_by_addr(struct net *net,
 	dev = dev_get_by_name_rcu(net, sp->sa_addr.pppoe.dev);
 	if (dev) {
 		ifindex = dev->ifindex;
-		pn = net_generic(net, pppoe_net_id);
+		pn = pppoe_pernet(net);
 		pppox_sock = get_item(pn, sp->sa_addr.pppoe.sid,
 				sp->sa_addr.pppoe.remote, ifindex);
 	}

^ permalink raw reply related

* [patch v2] bluetooth: handle l2cap_create_connless_pdu() errors
From: Dan Carpenter @ 2010-04-26 11:36 UTC (permalink / raw)
  To: Gustavo F. Padovan
  Cc: Marcel Holtmann, David S. Miller, Andrei Emeltchenko,
	linux-bluetooth, netdev, kernel-janitors
In-Reply-To: <20100422123347.GA5265@vigoh>

l2cap_create_connless_pdu() can sometimes return ERR_PTR(-ENOMEM) or
ERR_PTR(-EFAULT).

Signed-off-by: Dan Carpenter <error27@gmail.com>
---
In v2 I wrote the patch on top of Gustavo Padovon's devel tree

diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c
index a668ef4..b32682c 100644
--- a/net/bluetooth/l2cap.c
+++ b/net/bluetooth/l2cap.c
@@ -1712,8 +1712,12 @@ static int l2cap_sock_sendmsg(struct kiocb *iocb, struct socket *sock, struct ms
 	/* Connectionless channel */
 	if (sk->sk_type == SOCK_DGRAM) {
 		skb = l2cap_create_connless_pdu(sk, msg, len);
-		l2cap_do_send(sk, skb);
-		err = len;
+		if (IS_ERR(skb)) {
+			err = PTR_ERR(skb);
+		} else {
+			l2cap_do_send(sk, skb);
+			err = len;
+		}
 		goto done;
 	}
 

^ permalink raw reply related

* Re: rps perfomance WAS(Re: rps: question
From: jamal @ 2010-04-26 11:35 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Rick Jones, David Miller, therbert, netdev, robert,
	andi
In-Reply-To: <t2r412e6f7f1004241931ie9e70e3aka77b49557a7872e3@mail.gmail.com>

On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:

> I read the code again, and find that we don't use spin_lock_irqsave(),
> and we use local_irq_save() and spin_lock() instead, so
> _raw_spin_lock_irqsave() and _raw_spin_lock_irqrestore() should not be
> related to backlog. the lock maybe sk_receive_queue.lock.

Possible.
I am wondering if there's a way we can precisely nail where that is
happening? is lockstat any use? 
Fixing _raw_spin_lock_irqsave and friend is the lowest hanging fruit.

So looking at your patch now i see it is likely there was an improvement
made for non-rps case (moving out of loop some irq_enable etc).
i.e my results may not be crazy after adding your patch and seeing an
improvement for non-rps case.
However, whatever your patch did - it did not help the rps case case:
call_function_single_interrupt() comes out higher in the profile,
and # of IPIs seems to have gone up (although i did not measure this, I
can see the interrupts/second went up by almost 50-60%)

> Jamal, did you use a single socket to serve all the clients?

Socket per detected cpu.

> BTW:  completion_queue and output_queue in softnet_data both are LIFO
> queues. For completion_queue, FIFO is better, as the last used skb is
> more likely in cache, and should be used first. Since slab has always
> cache the last used memory at the head, we'd better free the skb in
> FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.

I think it will depend on how many of those skbs are sitting in the
completion queue, cache warmth etc. LIFO is always safest, you have
higher probability of finding a cached skb infront.

cheers,
jamal


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox