Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH 4/9] ipvs network name space aware
From: Daniel Lezcano @ 2010-10-12 16:02 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: lvs-devel, netdev, netfilter-devel, horms, ja, wensong
In-Reply-To: <201010081317.01120.hans.schillstrom@ericsson.com>

On 10/08/2010 01:16 PM, Hans Schillstrom wrote:
> This patch just contains ip_vs_core.c
>
> Signed-off-by:Hans Schillstrom<hans.schillstrom@ericsson.com>
>
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 0c043b6..4fdc5cb 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -52,7 +52,6 @@
>
>   #include<net/ip_vs.h>
>
> -
>   EXPORT_SYMBOL(register_ip_vs_scheduler);
>   EXPORT_SYMBOL(unregister_ip_vs_scheduler);
>   EXPORT_SYMBOL(ip_vs_proto_name);
> @@ -67,6 +66,8 @@ EXPORT_SYMBOL(ip_vs_conn_put);
>   EXPORT_SYMBOL(ip_vs_get_debug_level);
>   #endif
>
> +/* netns cnt used for uniqueness */
> +static atomic_t ipvs_netns_cnt = ATOMIC_INIT(0);
>    

Why is this counter needed ?

[ cut ]

> + *	Initialize IP Virtual Server netns mem.
> + */
> +static int __net_init __ip_vs_init(struct net *net)
> +{
> +	struct netns_ipvs *ipvs = 0;
>
> +	ipvs = kzalloc(sizeof(struct netns_ipvs), GFP_ATOMIC);
> +	if( ipvs == NULL ) {
> +		pr_err("%s(): no memory.\n", __func__);
> +		return -ENOMEM;
> +	}
> +	ipvs->inc = atomic_read(&ipvs_netns_cnt);
>    

AFAICS, this counter is never used. Is it really needed ?

> +	atomic_inc(&ipvs_netns_cnt);
> +	IP_VS_DBG(10, "Creating new netns *net=%p *ipvs=%p size=%lu\n",
> +		     net, ipvs, sizeof(struct netns_ipvs));
> +	net->ipvs = ipvs;
> +
> +	return 0;
> +}
>    


^ permalink raw reply

* [BUG net-next] bnx2x: all traffic comes to RX queue 0
From: Eric Dumazet @ 2010-10-12 16:07 UTC (permalink / raw)
  To: David Miller, Dmitry Kravkov, Vladislav Zolotarov, Yaniv Rosner
  Cc: netdev, Michael Chan, Eilon Greenstein
In-Reply-To: <1286838210.30423.128.camel@edumazet-laptop>

Hmm, while doing tests for the netdev_alloc_skb() problem, I found
current net-next tree is not really multi queue enabled...

ethtool -S eth1|grep _ucast
     [0]: rx_ucast_packets: 3507786
     [0]: tx_ucast_packets: 416925
     [1]: rx_ucast_packets: 0
     [1]: tx_ucast_packets: 4
     [2]: rx_ucast_packets: 0
     [2]: tx_ucast_packets: 397467
     [3]: rx_ucast_packets: 0
     [3]: tx_ucast_packets: 75832
     [4]: rx_ucast_packets: 0
     [4]: tx_ucast_packets: 171025
     [5]: rx_ucast_packets: 0
     [5]: tx_ucast_packets: 233025
     [6]: rx_ucast_packets: 0
     [6]: tx_ucast_packets: 250358
     [7]: rx_ucast_packets: 0
     [7]: tx_ucast_packets: 240792
     [8]: rx_ucast_packets: 0
     [8]: tx_ucast_packets: 216366
     [9]: rx_ucast_packets: 0
     [9]: tx_ucast_packets: 1
     [10]: rx_ucast_packets: 0
     [10]: tx_ucast_packets: 350324
     [11]: rx_ucast_packets: 0
     [11]: tx_ucast_packets: 92403
     [12]: rx_ucast_packets: 0
     [12]: tx_ucast_packets: 307678
     [13]: rx_ucast_packets: 0
     [13]: tx_ucast_packets: 314315
     [14]: rx_ucast_packets: 0
     [14]: tx_ucast_packets: 256767
     [15]: rx_ucast_packets: 0
     [15]: tx_ucast_packets: 185105
     rx_ucast_packets: 3507786
     tx_ucast_packets: 3508387

# ethtool -i eth1
driver: bnx2x
version: 1.60.00-1
firmware-version: bc 4.8.0 phy baa0.105
bus-info: 0000:02:00.1

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57711E
10Gigabit PCIe
	Subsystem: Hewlett-Packard Company NC532i Dual Port 10GbE Multifunction
BL-C Adapter
	Flags: bus master, fast devsel, latency 0, IRQ 47
	Memory at fa000000 (64-bit, non-prefetchable) [size=8M]
	Memory at f9800000 (64-bit, non-prefetchable) [size=8M]
	[virtual] Expansion ROM at e7010000 [disabled] [size=64K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Device Serial Number f4-ce-46-ff-fe-bb-32-d4
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel <?>
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x


Any idea before a biscection ?

Thanks !



^ permalink raw reply

* RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
From: Dmitry Kravkov @ 2010-10-12 16:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Michael Chan, Eilon Greenstein, David Miller,
	Vladislav Zolotarov, Yaniv Rosner
In-Reply-To: <1286899657.2732.93.camel@edumazet-laptop>

Eric,

I will take a look

Thanks

-----Original Message-----
From: Eric Dumazet [mailto:eric.dumazet@gmail.com] 
Sent: Tuesday, October 12, 2010 6:08 PM
To: David Miller; Dmitry Kravkov; Vladislav Zolotarov; Yaniv Rosner
Cc: netdev; Michael Chan; Eilon Greenstein
Subject: [BUG net-next] bnx2x: all traffic comes to RX queue 0

Hmm, while doing tests for the netdev_alloc_skb() problem, I found
current net-next tree is not really multi queue enabled...

ethtool -S eth1|grep _ucast
     [0]: rx_ucast_packets: 3507786
     [0]: tx_ucast_packets: 416925
     [1]: rx_ucast_packets: 0
     [1]: tx_ucast_packets: 4
     [2]: rx_ucast_packets: 0
     [2]: tx_ucast_packets: 397467
     [3]: rx_ucast_packets: 0
     [3]: tx_ucast_packets: 75832
     [4]: rx_ucast_packets: 0
     [4]: tx_ucast_packets: 171025
     [5]: rx_ucast_packets: 0
     [5]: tx_ucast_packets: 233025
     [6]: rx_ucast_packets: 0
     [6]: tx_ucast_packets: 250358
     [7]: rx_ucast_packets: 0
     [7]: tx_ucast_packets: 240792
     [8]: rx_ucast_packets: 0
     [8]: tx_ucast_packets: 216366
     [9]: rx_ucast_packets: 0
     [9]: tx_ucast_packets: 1
     [10]: rx_ucast_packets: 0
     [10]: tx_ucast_packets: 350324
     [11]: rx_ucast_packets: 0
     [11]: tx_ucast_packets: 92403
     [12]: rx_ucast_packets: 0
     [12]: tx_ucast_packets: 307678
     [13]: rx_ucast_packets: 0
     [13]: tx_ucast_packets: 314315
     [14]: rx_ucast_packets: 0
     [14]: tx_ucast_packets: 256767
     [15]: rx_ucast_packets: 0
     [15]: tx_ucast_packets: 185105
     rx_ucast_packets: 3507786
     tx_ucast_packets: 3508387

# ethtool -i eth1
driver: bnx2x
version: 1.60.00-1
firmware-version: bc 4.8.0 phy baa0.105
bus-info: 0000:02:00.1

02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM57711E
10Gigabit PCIe
	Subsystem: Hewlett-Packard Company NC532i Dual Port 10GbE Multifunction
BL-C Adapter
	Flags: bus master, fast devsel, latency 0, IRQ 47
	Memory at fa000000 (64-bit, non-prefetchable) [size=8M]
	Memory at f9800000 (64-bit, non-prefetchable) [size=8M]
	[virtual] Expansion ROM at e7010000 [disabled] [size=64K]
	Capabilities: [48] Power Management version 3
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable- Count=1/8 Maskable- 64bit+
	Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
	Capabilities: [ac] Express Endpoint, MSI 00
	Capabilities: [100] Device Serial Number f4-ce-46-ff-fe-bb-32-d4
	Capabilities: [110] Advanced Error Reporting
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Virtual Channel <?>
	Kernel driver in use: bnx2x
	Kernel modules: bnx2x


Any idea before a biscection ?

Thanks !




^ permalink raw reply

* Re: [PATCH] Phonet: 'connect' socket implementation for Pipe controller
From: Rémi Denis-Courmont @ 2010-10-12 16:30 UTC (permalink / raw)
  To: Kumar A Sanghvi
  Cc: remi.denis-courmont, davem, netdev, linus.walleij,
	gulshan.karmani, sudeep.divakaran
In-Reply-To: <1286897211-3198-1-git-send-email-kumar.sanghvi@stericsson.com>

	Hello,

Just a few comments...

Le mardi 12 octobre 2010 18:26:51 Kumar A Sanghvi, vous avez écrit :
> diff --git a/include/net/phonet/pep.h b/include/net/phonet/pep.h
> index def6cfa..b60b28c 100644
> --- a/include/net/phonet/pep.h
> +++ b/include/net/phonet/pep.h
> @@ -802,6 +682,42 @@ static void pipe_destruct(struct sock *sk)
>  	skb_queue_purge(&pn->ctrlreq_queue);
>  }
> 
> +#ifdef CONFIG_PHONET_PIPECTRLR
> +static int pep_connresp_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> +	struct pep_sock *pn = pep_sk(sk);
> +	static u8 host_pref_rx_fc[3] = {3, 2, 1}, host_req_tx_fc[3] = {3, 2, 1};

Why is this 'static' ? Doesn't that break concurrent uses?

> +	u8 remote_pref_rx_fc[3], remote_req_tx_fc[3];
> +	u8 negotiated_rx_fc, negotiated_tx_fc;
> +	int ret;
> +
> +	pipe_get_flow_info(sk, skb, remote_pref_rx_fc,
> +			remote_req_tx_fc);
> +	negotiated_tx_fc = pipe_negotiate_fc(remote_req_tx_fc,
> +			host_pref_rx_fc,
> +			sizeof(host_pref_rx_fc));
> +	negotiated_rx_fc = pipe_negotiate_fc(host_req_tx_fc,
> +			remote_pref_rx_fc,
> +			sizeof(host_pref_rx_fc));
> +
> +	pn->pipe_state = PIPE_DISABLED;
> +	sk->sk_state = TCP_SYN_RECV;
> +	sk->sk_backlog_rcv = pipe_do_rcv;
> +	sk->sk_destruct = pipe_destruct;
> +	pn->rx_credits = 0;
> +	pn->rx_fc = negotiated_rx_fc;
> +	pn->tx_fc = negotiated_tx_fc;
> +	sk->sk_state_change(sk);
> +
> +	ret = pipe_handler_send_created_ind(sk,
> +			PNS_PIPE_CREATED_IND_UTID,
> +			PNS_PIPE_CREATED_IND
> +			);
> +
> +	return ret;
> +}
> +#endif
> +
>  static int pep_connreq_rcv(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct sock *newsk;

> diff --git a/net/phonet/socket.c b/net/phonet/socket.c
> index aca8fba..123a374 100644
> --- a/net/phonet/socket.c
> +++ b/net/phonet/socket.c
> @@ -225,6 +225,102 @@ static int pn_socket_autobind(struct socket *sock)
>  	return 0; /* socket was already bound */
>  }
> 
> +static int pn_socket_connect(struct socket *sock, struct sockaddr *addr,
> +		int len, int flags)
> +{
> +	struct sock *sk = sock->sk;
> +	struct sockaddr_pn *spn = (struct sockaddr_pn *)addr;
> +	long timeo;
> +	int err;
> +
> +	lock_sock(sk);
> +
> +	if (len < sizeof(struct sockaddr_pn))
> +		return -EINVAL;
> +	if (spn->spn_family != AF_PHONET)
> +		return -EAFNOSUPPORT;

You should move lock_sock(sk); here, I think.

> +
> +	switch (sock->state) {
> +	case SS_UNCONNECTED:
> +		sk->sk_state = TCP_CLOSE;
> +		break;
> +	case SS_CONNECTING:
> +		switch (sk->sk_state) {
> +		case TCP_SYN_RECV:
> +			sock->state = SS_CONNECTED;
> +			err = -EISCONN;
> +			goto out;
> +		case TCP_CLOSE:
> +			err = -EALREADY;
> +			if (flags & O_NONBLOCK)
> +				goto out;
> +			goto wait_connect;
> +			break;

I think the kernel policy is against redumdant break statements.

> +		}
> +		break;
> +	case SS_CONNECTED:
> +		switch (sk->sk_state) {
> +		case TCP_SYN_RECV:
> +			err = -EISCONN;
> +			goto out;
> +		case TCP_CLOSE:
> +			sock->state = SS_UNCONNECTED;
> +			break;
> +		}
> +		break;
> +	case SS_DISCONNECTING:
> +	case SS_FREE:
> +		break;
> +	}
> +	sk->sk_state = TCP_CLOSE;
> +	sock->state = SS_UNCONNECTED;

This is dead code...

> +	sk_stream_kill_queues(sk);
> +
> +
> +	sock->state = SS_CONNECTING;

...because of this ^ .

> +	err = sk->sk_prot->connect(sk, addr, len);
> +	if (err < 0) {
> +		sock->state = SS_UNCONNECTED;
> +		sk->sk_state = TCP_CLOSE;
> +		goto out;
> +	}
> +
> +	err = -EINPROGRESS;
> +wait_connect:
> +	if (sk->sk_state != TCP_SYN_RECV && (flags & O_NONBLOCK))
> +		goto out;
> +
> +	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
> +	release_sock(sk);
> +
> +	err = -ERESTARTSYS;
> +	timeo = wait_event_interruptible_timeout(*sk_sleep(sk),
> +			sk->sk_state != TCP_CLOSE,
> +			timeo);
> +
> +	lock_sock(sk);
> +	if (timeo < 0)
> +		goto out; /* -ERESTARTSYS */
> +
> +	err = -ETIMEDOUT;
> +	if (timeo == 0 && sk->sk_state != TCP_SYN_RECV)
> +		goto out;
> +
> +	if (sk->sk_state != TCP_SYN_RECV) {
> +		sock->state = SS_UNCONNECTED;
> +		err = sock_error(sk);
> +		if (!err)
> +			err = -ECONNREFUSED;
> +		goto out;
> +	}
> +	sock->state = SS_CONNECTED;
> +	err = 0;
> +
> +out:
> +	release_sock(sk);
> +	return err;
> +}
> +
>  static int pn_socket_accept(struct socket *sock, struct socket *newsock,
>  				int flags)
>  {


-- 
Rémi Denis-Courmont
http://www.remlab.net/
http://fi.linkedin.com/in/remidenis

^ permalink raw reply

* Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-12 17:09 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: anthony, arnd, avi, davem, kvm, netdev, rusty
In-Reply-To: <OF6C412DBC.EA03FC24-ON652577B9.00228DCA-652577B9.0028232D@in.ibm.com>

On Mon, Oct 11, 2010 at 12:51:27PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/06/2010 07:04:31 PM:
> 
> > On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > > for degradation for 1 stream case:
> >
> > I thought about possible RX/TX contention reasons, and I realized that
> > we get/put the mm counter all the time.  So I write the following: I
> > haven't seen any performance gain from this in a single queue case, but
> > maybe this will help multiqueue?
> 
> Sorry for the delay, I was sick last couple of days. The results
> with your patch are (%'s over original code):
> 
> Code               BW%       CPU%       RemoteCPU
> MQ     (#txq=16)   31.4%     38.42%     6.41%
> MQ+MST (#txq=16)   28.3%     18.9%      -10.77%
> 
> The patch helps CPU utilization but didn't help single stream
> drop.
> 
> Thanks,

What other shared TX/RX locks are there?  In your setup, is the same
macvtap socket structure used for RX and TX?  If yes this will create
cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
there might also be contention on the lock in sk_sleep waitqueue.
Anything else?

-- 
MST

^ permalink raw reply

* kernel panic in fib_rules_lookup [2.6.27.7 vendor-patched]
From: Joe Buehler @ 2010-10-12 17:14 UTC (permalink / raw)
  To: netdev

I am seeing a kernel panic (NULL pointer) in fib_rules_lookup.  There
were some other reports for 2.6.32 back in March and May.  It looks to
me as though "rules_list" is not in a good state when fib_rules_lookup
traverses it.

My application is bringing TAP interfaces up and down and making
changes to associated routing tables at a fairly good clip (say, a few
times a second).  That use case seems to be similar to a previously
reported crash case.

This is a MIPS kernel (Cavium Octeon) running two CPUs SMP.  I am
using 2.6.27.7 patched by Cavium for hardware support reasons.  I
cannot upgrade because the vendor patches are non-trivial to
forward-port.

Here is one stack trace:

[<ffffffff814671ec>] fib_rules_lookup+0x11c/0x1a8
[<ffffffff814bd144>] fib_lookup+0x2c/0x48
[<ffffffff814788d8>] __ip_route_output_key+0x918/0xf38
[<ffffffff81478f30>] ip_route_output_flow+0x38/0x2e8
[<ffffffff8149fd1c>] tcp_v4_connect+0x134/0x498
[<ffffffff814aef80>] inet_stream_connect+0xf8/0x2f0
[<ffffffff81442680>] sys_connect+0xe0/0xf8
[<ffffffff8114528c>] handle_sys+0x12c/0x148

Here is another:

[<ffffffff814671ec>] fib_rules_lookup+0x11c/0x1a8
[<ffffffff814bd144>] fib_lookup+0x2c/0x48
[<ffffffff814b6550>] fib_validate_source+0xb0/0x4c0
[<ffffffff8147a524>] ip_route_input+0x11a4/0x13c0
[<ffffffff8147c304>] ip_rcv_finish+0x2f4/0x4c0
[<ffffffff81454220>] process_backlog+0xa8/0x160
[<ffffffff81451ea8>] net_rx_action+0x190/0x2e0
[<ffffffff81166978>] __do_softirq+0xf0/0x218
[<ffffffff81166b18>] do_softirq+0x78/0x80
[<ffffffff81100e30>] plat_irq_dispatch+0x130/0x1e0
[<ffffffff81130948>] ret_from_irq+0x0/0x4
[<ffffffff8151167c>] _cond_resched+0x34/0x50
[<ffffffff81148b60>] fpu_emulator_cop1Handler+0x90/0x1c80
[<ffffffff81136f4c>] do_cpu+0x24c/0x360
[<ffffffff81130940>] ret_from_exception+0x0/0x8

*IF* my reading of the disassembled code at point of panic is correct,
 the "pos" pointer in list_for_each_entry_rcu appears to be NULL.

Looking at the code in net/core/fib_rules.c I see some uses of the
"rules_list" using rcu and some apparently not.  Has something simple
been overlooked?

I need this fixed so will try adding a spinlock to protect rules_list
if necessary.

Joe Buehler

^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: Michael S. Tsirkin @ 2010-10-12 17:19 UTC (permalink / raw)
  To: David Miller, eric.dumazet, netdev, johann.baudy
In-Reply-To: <20101011172932.GB12342@orbit.nwl.cc>

On Mon, Oct 11, 2010 at 07:29:32PM +0200, Phil Sutter wrote:
> On Mon, Oct 11, 2010 at 09:01:53AM -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Mon, 11 Oct 2010 16:03:02 +0200
> > 
> > > If we dont test ETH_P_8021Q protocol here, we allow sending 1504 bytes
> > > frames for MTU=1500
> > > 
> > > Should we really care ?
> > > 
> > > If not, just do
> > > 
> > > reserve = dev->hard_header_len + VLAN_HLEN;
> > 
> > Thats a good point, I think we need to validate the SKB protocol
> > field.
> 
> Which is set to the value of the passed struct sockaddr_ll field
> sll_protocol. At least in the two userspace code samples I have here,
> the later field is set to htons(ETH_P_ALL). So unless one changes the
> API, the only way to find out the packet type is to actually parse the
> given ethernet header.

Yes, like eth_type_trans does I guess.  I think we had a similar
discussion already:

http://lists.openwall.net/netdev/2010/01/06/38

Summary: if we want to make the protocol field have the correct
value for this case we need to make it work for other
transports not just for ethernet.

> Since tpacket_rcv() just interprets the vlan_tci skb field, such
> detailed packet inspection is otherwise not done in af_packet.c.
> 
> Greetings, Phil
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: kvm networking todo wiki
From: Michael S. Tsirkin @ 2010-10-12 17:38 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Sridhar Samudrala, David Stevens, sri, Anthony Liguori,
	Rusty Russell, Krishna Kumar2, Shirley Ma, Xin, Xiaohui, jdike,
	herbert, lmr, akong, kvm, qemu-devel, netdev, mjt
In-Reply-To: <AANLkTikzU6g2pYKoyXOWfGPE64VZusPaYdn+TQ14tE7_@mail.gmail.com>

On Sun, Oct 10, 2010 at 01:37:45PM +0200, Dragos Tatulea wrote:
> Hi,
> 
> > More importantly: anyone's going to work on this?
> 
> I'd like to work on this. Might need some assistance though.
> 
> Thanks,
> Dragos

BTW, as in some situations hardware might not be able satisfy
requirements, a subset of this item would be to expose whatever macvtap
is capable of, to the guest.  E.g. if mac can not be changed we could at
least query the mac, something that would be convenient as noted by mjt
in an irc chat.

To enable migration we'd then need a set of flags to limit this to a least
common denominator on a given network.

-- 
MST

^ permalink raw reply

* Re: [PATCH] af_packet: account for VLAN when checking packet size
From: David Miller @ 2010-10-12 17:40 UTC (permalink / raw)
  To: mst; +Cc: eric.dumazet, netdev, johann.baudy
In-Reply-To: <20101012171940.GB30613@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 12 Oct 2010 19:19:41 +0200

> Yes, like eth_type_trans does I guess.  I think we had a similar
> discussion already:
> 
> http://lists.openwall.net/netdev/2010/01/06/38
> 
> Summary: if we want to make the protocol field have the correct
> value for this case we need to make it work for other
> transports not just for ethernet.

Right, so for now we should just allow 4-byte larger
than MTU TX packets, as long as the device is ethernet
and can handle VLANs properly.

^ permalink raw reply

* Re: kernel panic in fib_rules_lookup [2.6.27.7 vendor-patched]
From: Eric Dumazet @ 2010-10-12 17:40 UTC (permalink / raw)
  To: Joe Buehler; +Cc: netdev
In-Reply-To: <loom.20101012T185411-307@post.gmane.org>

Le mardi 12 octobre 2010 à 17:14 +0000, Joe Buehler a écrit :
> I am seeing a kernel panic (NULL pointer) in fib_rules_lookup.  There
> were some other reports for 2.6.32 back in March and May.  It looks to
> me as though "rules_list" is not in a good state when fib_rules_lookup
> traverses it.
> 
> My application is bringing TAP interfaces up and down and making
> changes to associated routing tables at a fairly good clip (say, a few
> times a second).  That use case seems to be similar to a previously
> reported crash case.
> 
> This is a MIPS kernel (Cavium Octeon) running two CPUs SMP.  I am
> using 2.6.27.7 patched by Cavium for hardware support reasons.  I
> cannot upgrade because the vendor patches are non-trivial to
> forward-port.
> 
> Here is one stack trace:
> 
> [<ffffffff814671ec>] fib_rules_lookup+0x11c/0x1a8
> [<ffffffff814bd144>] fib_lookup+0x2c/0x48
> [<ffffffff814788d8>] __ip_route_output_key+0x918/0xf38
> [<ffffffff81478f30>] ip_route_output_flow+0x38/0x2e8
> [<ffffffff8149fd1c>] tcp_v4_connect+0x134/0x498
> [<ffffffff814aef80>] inet_stream_connect+0xf8/0x2f0
> [<ffffffff81442680>] sys_connect+0xe0/0xf8
> [<ffffffff8114528c>] handle_sys+0x12c/0x148
> 
> Here is another:
> 
> [<ffffffff814671ec>] fib_rules_lookup+0x11c/0x1a8
> [<ffffffff814bd144>] fib_lookup+0x2c/0x48
> [<ffffffff814b6550>] fib_validate_source+0xb0/0x4c0
> [<ffffffff8147a524>] ip_route_input+0x11a4/0x13c0
> [<ffffffff8147c304>] ip_rcv_finish+0x2f4/0x4c0
> [<ffffffff81454220>] process_backlog+0xa8/0x160
> [<ffffffff81451ea8>] net_rx_action+0x190/0x2e0
> [<ffffffff81166978>] __do_softirq+0xf0/0x218
> [<ffffffff81166b18>] do_softirq+0x78/0x80
> [<ffffffff81100e30>] plat_irq_dispatch+0x130/0x1e0
> [<ffffffff81130948>] ret_from_irq+0x0/0x4
> [<ffffffff8151167c>] _cond_resched+0x34/0x50
> [<ffffffff81148b60>] fpu_emulator_cop1Handler+0x90/0x1c80
> [<ffffffff81136f4c>] do_cpu+0x24c/0x360
> [<ffffffff81130940>] ret_from_exception+0x0/0x8
> 
> *IF* my reading of the disassembled code at point of panic is correct,
>  the "pos" pointer in list_for_each_entry_rcu appears to be NULL.
> 
> Looking at the code in net/core/fib_rules.c I see some uses of the
> "rules_list" using rcu and some apparently not.  Has something simple
> been overlooked?
> 
> I need this fixed so will try adding a spinlock to protect rules_list
> if necessary.

2.6.27 is a bit old, you might try :

commit 7fa7cb7109d07c29ab28bb877bc7049a0150dbe5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Mon Sep 27 04:18:27 2010 +0000

    fib: use atomic_inc_not_zero() in fib_rules_lookup
    
    It seems we dont use appropriate refcount increment in an
    rcu_read_lock() protected section.
    
    fib_rule_get() might increment a null refcount and bad things could
    happen.
    
    While fib_nl_delrule() respects an rcu grace period before calling
    fib_rule_put(), fib_rules_cleanup_ops() calls fib_rule_put() without a
    grace period.
    
    Note : after this patch, we might avoid the synchronize_rcu() call done
    in fib_nl_delrule()
    
    Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 42e84e0..d078728 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -225,9 +225,11 @@ jumped:
 			err = ops->action(rule, fl, flags, arg);
 
 		if (err != -EAGAIN) {
-			fib_rule_get(rule);
-			arg->rule = rule;
-			goto out;
+			if (likely(atomic_inc_not_zero(&rule->refcnt))) {
+				arg->rule = rule;
+				goto out;
+			}
+			break;
 		}
 	}
 



^ permalink raw reply related

* RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
From: Eric Dumazet @ 2010-10-12 18:11 UTC (permalink / raw)
  To: Dmitry Kravkov
  Cc: netdev, Michael Chan, Eilon Greenstein, David Miller,
	Vladislav Zolotarov, Yaniv Rosner
In-Reply-To: <2DFD360E328B3941911E6D28B085D990129FC2028B@SJEXCHCCR01.corp.ad.broadcom.com>

Le mardi 12 octobre 2010 à 09:20 -0700, Dmitry Kravkov a écrit :
> Eric,
> 
> I will take a look
> 

Thanks ! Here is the bisection result

# git bisect bad
523224a3b3cd407ce4e6731a087194e13a90db18 is the first bad commit
commit 523224a3b3cd407ce4e6731a087194e13a90db18
Author: Dmitry Kravkov <dmitry@broadcom.com>
Date:   Wed Oct 6 03:23:26 2010 +0000

    bnx2x, cnic, bnx2i: use new FW/HSI
    
    This is the new FW HSI blob and the relevant definitions without logic changes.
    It also included code adaptation for new HSI. New features are not enabled.
    
    New FW/HSI includes:
    - Support for 57712 HW
    - Future support for VF (not used)
    - Improvements in FW interrupts scheme
    - FW FCoE hooks (stubs for future usage)
    
    Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
    Signed-off-by: Michael Chan <mchan@broadcom.com>
    Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 8931a748ae89487466d99e6121a3965131e3a201 fed2be49565ca291203a865554a0e071a2a9c3fb M	drivers
:040000 040000 2ff6a5c290ca7e9dc339b7213f49c341f4fc5822 82c0a7117b58f538a8b49891c5524e3537c5ae66 M	firmware


# git bisect log
git bisect start '--' 'drivers/net/bnx2x'
# bad: [b9e9b15966e35f3b4a8d3820cac460505552f72c] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
git bisect bad b9e9b15966e35f3b4a8d3820cac460505552f72c
# good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
git bisect good 9fe6206f400646a2322096b56c59891d530e8d51
# good: [b7737c9be9d3e894d1a4375c52f5f47789475f26] bnx2x: Split PHY functions
git bisect good b7737c9be9d3e894d1a4375c52f5f47789475f26
# good: [8681dc3abd54e845a2effab441921b4c4457c241] bnx2x: Moved enabling of MSI to the bnx2x_set_num_queues()
git bisect good 8681dc3abd54e845a2effab441921b4c4457c241
# bad: [c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2] bnx2x: remove unused fields in main driver structure
git bisect bad c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2
# bad: [fb3bff178e722fe88b5ab02319c9636da0980e25] bnx2x: rename MF related fields
git bisect bad fb3bff178e722fe88b5ab02319c9636da0980e25
# good: [560131f313ea9b9439742138289b6f68d61531ec] bnx2x: create folder for bnx2x firmware files
git bisect good 560131f313ea9b9439742138289b6f68d61531ec
# bad: [523224a3b3cd407ce4e6731a087194e13a90db18] bnx2x, cnic, bnx2i: use new FW/HSI
git bisect bad 523224a3b3cd407ce4e6731a087194e13a90db18



^ permalink raw reply

* RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
From: Vladislav Zolotarov @ 2010-10-12 18:18 UTC (permalink / raw)
  To: Eric Dumazet, Dmitry Kravkov
  Cc: netdev, Michael Chan, Eilon Greenstein, David Miller,
	Yaniv Rosner
In-Reply-To: <1286907103.2703.6.camel@edumazet-laptop>

Thanks, Eric. Dima has already found a problem and we are running 
a regression to verify nothing else is broken after the RSS is 
brought back to life.

If nothing goes wrong Dima will send a patch in an hour or so...

Thanks again for the catch.
vlad

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Eric Dumazet
> Sent: Tuesday, October 12, 2010 8:12 PM
> To: Dmitry Kravkov
> Cc: netdev; Michael Chan; Eilon Greenstein; David Miller; Vladislav
> Zolotarov; Yaniv Rosner
> Subject: RE: [BUG net-next] bnx2x: all traffic comes to RX queue 0
> 
> Le mardi 12 octobre 2010 à 09:20 -0700, Dmitry Kravkov a écrit :
> > Eric,
> >
> > I will take a look
> >
> 
> Thanks ! Here is the bisection result
> 
> # git bisect bad
> 523224a3b3cd407ce4e6731a087194e13a90db18 is the first bad commit
> commit 523224a3b3cd407ce4e6731a087194e13a90db18
> Author: Dmitry Kravkov <dmitry@broadcom.com>
> Date:   Wed Oct 6 03:23:26 2010 +0000
> 
>     bnx2x, cnic, bnx2i: use new FW/HSI
> 
>     This is the new FW HSI blob and the relevant definitions without
> logic changes.
>     It also included code adaptation for new HSI. New features are not
> enabled.
> 
>     New FW/HSI includes:
>     - Support for 57712 HW
>     - Future support for VF (not used)
>     - Improvements in FW interrupts scheme
>     - FW FCoE hooks (stubs for future usage)
> 
>     Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
>     Signed-off-by: Michael Chan <mchan@broadcom.com>
>     Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> :040000 040000 8931a748ae89487466d99e6121a3965131e3a201
> fed2be49565ca291203a865554a0e071a2a9c3fb M	drivers
> :040000 040000 2ff6a5c290ca7e9dc339b7213f49c341f4fc5822
> 82c0a7117b58f538a8b49891c5524e3537c5ae66 M	firmware
> 
> 
> # git bisect log
> git bisect start '--' 'drivers/net/bnx2x'
> # bad: [b9e9b15966e35f3b4a8d3820cac460505552f72c] Merge branch 'master'
> of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6
> git bisect bad b9e9b15966e35f3b4a8d3820cac460505552f72c
> # good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
> git bisect good 9fe6206f400646a2322096b56c59891d530e8d51
> # good: [b7737c9be9d3e894d1a4375c52f5f47789475f26] bnx2x: Split PHY
> functions
> git bisect good b7737c9be9d3e894d1a4375c52f5f47789475f26
> # good: [8681dc3abd54e845a2effab441921b4c4457c241] bnx2x: Moved
> enabling of MSI to the bnx2x_set_num_queues()
> git bisect good 8681dc3abd54e845a2effab441921b4c4457c241
> # bad: [c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2] bnx2x: remove unused
> fields in main driver structure
> git bisect bad c518cd1c7aca2e2a65edcf4fa5ce4160d59878c2
> # bad: [fb3bff178e722fe88b5ab02319c9636da0980e25] bnx2x: rename MF
> related fields
> git bisect bad fb3bff178e722fe88b5ab02319c9636da0980e25
> # good: [560131f313ea9b9439742138289b6f68d61531ec] bnx2x: create folder
> for bnx2x firmware files
> git bisect good 560131f313ea9b9439742138289b6f68d61531ec
> # bad: [523224a3b3cd407ce4e6731a087194e13a90db18] bnx2x, cnic, bnx2i:
> use new FW/HSI
> git bisect bad 523224a3b3cd407ce4e6731a087194e13a90db18
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: net-next-2.6 [PATCH 0/6] dccp: miscellaneous fixes and helper functions
From: David Miller @ 2010-10-12 18:45 UTC (permalink / raw)
  To: gerrit; +Cc: dccp, netdev
In-Reply-To: <1286860551-6809-1-git-send-email-gerrit@erg.abdn.ac.uk>

From: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Date: Tue, 12 Oct 2010 07:15:45 +0200

>  Patch #1: fixes two problems in the DCCP  (AWL/SWL) window-size adjustment.
>  Patch #2: refactors the connect_init() function by combining related code.
>  Patch #3: removes an never-user argument from the CCID tx function.
>  Patch #4: provides a function to generalize the data-loss condition 
>            (patch originally came from the ccid-4 subtree).
>  Patch #5: schedules an Ack as carrier for a pending timestamp echo.
>  Patch #6: tidies up the format of DCCP per-connection warnings.
> 
> The set has also been placed into a fresh (today's) copy of net-next-2.6, on
> 
>     git://eden-feed.erg.abdn.ac.uk/net-next-2.6        [subtree 'dccp']

Looks good, pulled, thanks!

^ permalink raw reply

* Re: [PATCH -next] sundance: Add initial ethtool stats support
From: David Miller @ 2010-10-12 18:51 UTC (permalink / raw)
  To: dkirjanov; +Cc: netdev, eric.dumazet, bhutchings
In-Reply-To: <4CB0C56C.2000106@kernel.org>

From: Denis Kirjanov <dkirjanov@kernel.org>
Date: Sat, 09 Oct 2010 23:41:32 +0400

> +	/* ethtool extra stats */
> +	struct {
> +		unsigned long tx_multiple_collisions;
> +		unsigned long tx_single_collisions;
> +		unsigned long tx_late_collisions;
> +		unsigned long tx_deffered;
> +		unsigned long tx_deffered_excessive;
> +		unsigned long tx_aborted;
> +		unsigned long tx_bcasts;
> +		unsigned long rx_bcasts;
> +		unsigned long tx_mcasts;
> +		unsigned long rx_mcasts;
> +	} xstats;

I think these should be "u64".

^ permalink raw reply

* [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS enablement
From: Dmitry Kravkov @ 2010-10-12 19:02 UTC (permalink / raw)
  To: davem, netdev, eric.dumazet; +Cc: vladz, eilong

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
---
 drivers/net/bnx2x/bnx2x_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
index 7a9556b..ead524b 100644
--- a/drivers/net/bnx2x/bnx2x_main.c
+++ b/drivers/net/bnx2x/bnx2x_main.c
@@ -2486,6 +2486,7 @@ void bnx2x_pf_init(struct bnx2x *bp)
 	 * if (is_eth_multi(bp))
 	 *	flags |= FUNC_FLG_RSS;
 	 */
+	flags |= FUNC_FLG_RSS;
 
 	/* function setup */
 	if (flags & FUNC_FLG_RSS) {
-- 
1.7.1





^ permalink raw reply related

* Re: [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS enablement
From: Eric Dumazet @ 2010-10-12 19:18 UTC (permalink / raw)
  To: Dmitry Kravkov; +Cc: davem, netdev, vladz, eilong
In-Reply-To: <1286910141.16797.23.camel@lb-tlvb-dmitry>

Le mardi 12 octobre 2010 à 21:02 +0200, Dmitry Kravkov a écrit :
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> ---
>  drivers/net/bnx2x/bnx2x_main.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
> index 7a9556b..ead524b 100644
> --- a/drivers/net/bnx2x/bnx2x_main.c
> +++ b/drivers/net/bnx2x/bnx2x_main.c
> @@ -2486,6 +2486,7 @@ void bnx2x_pf_init(struct bnx2x *bp)
>  	 * if (is_eth_multi(bp))
>  	 *	flags |= FUNC_FLG_RSS;
>  	 */
> +	flags |= FUNC_FLG_RSS;
>  
>  	/* function setup */
>  	if (flags & FUNC_FLG_RSS) {

Thanks, this solved the problem.

Tested-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply

* Re: [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS enablement
From: Joe Perches @ 2010-10-12 19:26 UTC (permalink / raw)
  To: Dmitry Kravkov; +Cc: davem, netdev, eric.dumazet, vladz, eilong
In-Reply-To: <1286910141.16797.23.camel@lb-tlvb-dmitry>

On Tue, 2010-10-12 at 21:02 +0200, Dmitry Kravkov wrote:
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
> ---
>  drivers/net/bnx2x/bnx2x_main.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/bnx2x/bnx2x_main.c b/drivers/net/bnx2x/bnx2x_main.c
> index 7a9556b..ead524b 100644
> --- a/drivers/net/bnx2x/bnx2x_main.c
> +++ b/drivers/net/bnx2x/bnx2x_main.c
> @@ -2486,6 +2486,7 @@ void bnx2x_pf_init(struct bnx2x *bp)
>  	 * if (is_eth_multi(bp))
>  	 *	flags |= FUNC_FLG_RSS;
>  	 */
> +	flags |= FUNC_FLG_RSS;
>  
>  	/* function setup */
>  	if (flags & FUNC_FLG_RSS) {

Then the "if (flags & FUNC_FLG_RSS)" test should be removed.




^ permalink raw reply

* Re: [PATCH] IPv4: Remove check for ipv4_is_lbcast() that will always return false
From: David Miller @ 2010-10-12 19:28 UTC (permalink / raw)
  To: awalls; +Cc: netdev, linux-kernel, kuznet, jmorris, kaber
In-Reply-To: <1286727021.2435.17.camel@localhost>

From: Andy Walls <awalls@md.metrocast.net>
Date: Sun, 10 Oct 2010 12:10:21 -0400

> In making an IPv4 routing decision, packets with an all 1's broadcast
> destination are accepted as input packets, before being checked for being a
> martian.  Remove the martian check for the all 1's broadcast destination
> address.  Make the initial check for the all 1's broadcast destination
> address easier to read.
> 
> Signed-off-by: Andy Walls <awalls@md.metrocast.net>

Your email client corrupted this patch, by turning tab characters
into spaces, amongst other things.

Please give Documentation/email-clients.txt a read and resubmit this
patch after you have these issues sorted out.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS enablement
From: David Miller @ 2010-10-12 19:31 UTC (permalink / raw)
  To: eric.dumazet; +Cc: dmitry, netdev, vladz, eilong
In-Reply-To: <1286911129.2703.7.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Oct 2010 21:18:49 +0200

> Le mardi 12 octobre 2010 à 21:02 +0200, Dmitry Kravkov a écrit :
>> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
>> 
>> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
>> Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
 ...
> Thanks, this solved the problem.
> 
> Tested-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks guys.

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS enablement
From: David Miller @ 2010-10-12 19:31 UTC (permalink / raw)
  To: joe; +Cc: dmitry, netdev, eric.dumazet, vladz, eilong
In-Reply-To: <1286911579.1117.76.camel@Joe-Laptop>

From: Joe Perches <joe@perches.com>
Date: Tue, 12 Oct 2010 12:26:19 -0700

> On Tue, 2010-10-12 at 21:02 +0200, Dmitry Kravkov wrote:
>> @@ -2486,6 +2486,7 @@ void bnx2x_pf_init(struct bnx2x *bp)
>>  	 * if (is_eth_multi(bp))
>>  	 *	flags |= FUNC_FLG_RSS;
>>  	 */
>> +	flags |= FUNC_FLG_RSS;
>>  
>>  	/* function setup */
>>  	if (flags & FUNC_FLG_RSS) {
> 
> Then the "if (flags & FUNC_FLG_RSS)" test should be removed.

Yeah it probably should.  If necessary it could be added back
later.

^ permalink raw reply

* Re: tbf/htb qdisc limitations
From: Steven Brudenell @ 2010-10-12 19:31 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev
In-Reply-To: <20101012101022.GA8578@ff.dom.local>

> Yes, it's not allowed according to Documentation/HOWTO. Btw, as you
> can see e.g. in sch_hfsc comments, 64-bit division is avoided too.

i see sch_hfsc avoids do_div in critical areas for performance
reasons, but uses it other places. it should still be alright to
do_div in tbf_change and htb_change_class, right? it would be nice to
compute the rtabs in those functions instead of having userspace do
it.

> I can only say there is no versioning, but backward compatibility
> is crucial, so you need to do some tricks or data duplication.
> You could probably try to get opinions about it with an RFC on
> moving tbf and htb schedulers to 64 bits if you're interested
> (decoupling it from your specific burst problem).

my burst problem is the only semi-legitimate motivation i can think
of. the only other possible motivations i can imagine are setting
"limit" to buffer more than 4GB of packets and setting "rate" to
something more than 32 gigabit; both of these seem kind of dubious. is
there something else you had in mind?

looking more at the netlink tc interface: why is it that the interface
for so many qdiscs consists of passing a big options struct as a
single netlink attr, instead of a bunch of individual attrs? this kind
of seems contrary to the extensibility / flexibility spirit of
netlink, and seems to be getting in the way of changing the interface.
maybe i should RFC about this instead ;)

^ permalink raw reply

* Re: [PATCH net-next V3] net: percpu net_device refcount
From: David Miller @ 2010-10-12 19:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1286828532.30423.16.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 11 Oct 2010 22:22:12 +0200

> This was a bit long (allyesconfig), but eventually succeeded ...
 ...
> [PATCH net-next V3] net: percpu net_device refcount

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next] net:  allocate skbs on local node
From: David Rientjes @ 2010-10-12 19:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Andrew Morton, Eric Dumazet, David Miller, netdev,
	Michael Chan, Eilon Greenstein, Christoph Hellwig, LKML,
	Nick Piggin
In-Reply-To: <alpine.DEB.2.00.1010120745360.31832@router.home>

On Tue, 12 Oct 2010, Christoph Lameter wrote:

> Hmmm. Given these effects I think we should be more cautious regarding the
> unification work. May be the "unified allocator" should replace SLAB
> instead and SLUB can stay unchanged?

Linus has said that he refuses to merge another allocator until one is 
removed or replaced, so that would force the unificiation patches to go 
into slab instead if you want to leave slub untouched.

> The unification patches go back to
> the one lock per node SLAB thing because the queue maintenance overhead is
> otherwise causing large regressions in hackbench because of lots of atomic
> ops. The per node lock seem to be causing problems here in the network
> stack,.

The TCP_RR regression on slub is because of what I described a couple 
years ago as "slab thrashing" where cpu slabs would be filled with 
allocations, then frees would occur to move those slabs from the full to 
partial list with only a few free objects, those partial slabs would 
quickly become full, etc.  Performance gets better if you change the 
per-node lock to a trylock when iterating the partial list and preallocate 
and have a substantially longer partial list than normal (and it still 
didn't rival slab's performance), so I don't think it's only a per-node 
lock that's the issue , it's all the slowpath overhead of swapping the cpu 
slab out for another slab.  The TCP_RR load would show slub stats that 
indicate certain caches, kmalloc-256 and kmalloc-2048, would have ~98% of 
allocations coming from the slowpath.

This gets better if you allocate higher order slabs (and kmalloc-2048 is 
already order-3 by default) but then allocating new slabs gets really slow 
if not impossible on smaller machines.  The overhead of even compaction 
will kill us.

> Take the unified as a SLAB cleanup instead? Then at least we have
> a large common code base and just differentiate through the locking
> mechanism?
> 

Will you be adding the extensive slub debugging to slab then?  It would be 
a shame to lose it because one allocator is chosen over another for 
performance reasons and then we need to recompile to debug issues as they 
arise.

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: David Miller @ 2010-10-12 20:05 UTC (permalink / raw)
  To: ebiederm; +Cc: hans.schillstrom, daniel.lezcano, netdev
In-Reply-To: <m11v801tfr.fsf@fess.ebiederm.org>

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 08 Oct 2010 10:32:40 -0700

> It is just dealing with not flushing the entire routing cache, just the
> routes that have expired.  Which prevents one network namespace from
> flushing it's routes and DOS'ing another.

That's a very indirect and obfuscated way of handling it.

And I still don't know why we let the first contiguous set of expired
entries in the chain get freed outside of the lock, and the rest
inside the lock.  That really isn't explained by anything I've read.

How about we just do exactly what's intended, and with no ifdefs?

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/include/net/route.h b/include/net/route.h
index 7e5e73b..8d24761 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -106,7 +106,7 @@ extern int		ip_rt_init(void);
 extern void		ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
 				       __be32 src, struct net_device *dev);
 extern void		rt_cache_flush(struct net *net, int how);
-extern void		rt_cache_flush_batch(void);
+extern void		rt_cache_flush_batch(struct net *net);
 extern int		__ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
 extern int		ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
 extern int		ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 919f2ad..4039f56 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -999,7 +999,7 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 		rt_cache_flush(dev_net(dev), 0);
 		break;
 	case NETDEV_UNREGISTER_BATCH:
-		rt_cache_flush_batch();
+		rt_cache_flush_batch(dev_net(dev));
 		break;
 	}
 	return NOTIFY_DONE;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0755aa4..6ad730c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -712,13 +712,14 @@ static inline int rt_is_expired(struct rtable *rth)
  * Can be called by a softirq or a process.
  * In the later case, we want to be reschedule if necessary
  */
-static void rt_do_flush(int process_context)
+static void rt_do_flush(struct net *net, int process_context)
 {
 	unsigned int i;
 	struct rtable *rth, *next;
-	struct rtable * tail;
 
 	for (i = 0; i <= rt_hash_mask; i++) {
+		struct rtable *list, **pprev;
+
 		if (process_context && need_resched())
 			cond_resched();
 		rth = rt_hash_table[i].chain;
@@ -726,41 +727,27 @@ static void rt_do_flush(int process_context)
 			continue;
 
 		spin_lock_bh(rt_hash_lock_addr(i));
-#ifdef CONFIG_NET_NS
-		{
-		struct rtable ** prev, * p;
 
-		rth = rt_hash_table[i].chain;
+		pprev = &rt_hash_table[i].chain;
+		rth = *pprev;
+		while (rth) {
+			next = rth->dst.rt_next;
+			if (dev_net(rth->dst.dev) == net) {
+				*pprev = next;
 
-		/* defer releasing the head of the list after spin_unlock */
-		for (tail = rth; tail; tail = tail->dst.rt_next)
-			if (!rt_is_expired(tail))
-				break;
-		if (rth != tail)
-			rt_hash_table[i].chain = tail;
-
-		/* call rt_free on entries after the tail requiring flush */
-		prev = &rt_hash_table[i].chain;
-		for (p = *prev; p; p = next) {
-			next = p->dst.rt_next;
-			if (!rt_is_expired(p)) {
-				prev = &p->dst.rt_next;
-			} else {
-				*prev = next;
-				rt_free(p);
-			}
-		}
+				rth->dst.rt_next = list;
+				list = rth;
+			} else
+				pprev = &rth->dst.rt_next;
+
+			rth = next;
 		}
-#else
-		rth = rt_hash_table[i].chain;
-		rt_hash_table[i].chain = NULL;
-		tail = NULL;
-#endif
+
 		spin_unlock_bh(rt_hash_lock_addr(i));
 
-		for (; rth != tail; rth = next) {
-			next = rth->dst.rt_next;
-			rt_free(rth);
+		for (; list; list = next) {
+			next = list->dst.rt_next;
+			rt_free(list);
 		}
 	}
 }
@@ -906,13 +893,13 @@ void rt_cache_flush(struct net *net, int delay)
 {
 	rt_cache_invalidate(net);
 	if (delay >= 0)
-		rt_do_flush(!in_softirq());
+		rt_do_flush(net, !in_softirq());
 }
 
 /* Flush previous cache invalidated entries from the cache */
-void rt_cache_flush_batch(void)
+void rt_cache_flush_batch(struct net *net)
 {
-	rt_do_flush(!in_softirq());
+	rt_do_flush(net, !in_softirq());
 }
 
 static void rt_emergency_hash_rebuild(struct net *net)

^ permalink raw reply related

* RE: [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS enablement
From: Vladislav Zolotarov @ 2010-10-12 20:08 UTC (permalink / raw)
  To: David Miller, joe@perches.com
  Cc: Dmitry Kravkov, netdev@vger.kernel.org, eric.dumazet@gmail.com,
	Eilon Greenstein
In-Reply-To: <20101012.123142.59664847.davem@davemloft.net>




> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, October 12, 2010 9:32 PM
> To: joe@perches.com
> Cc: Dmitry Kravkov; netdev@vger.kernel.org; eric.dumazet@gmail.com;
> Vladislav Zolotarov; Eilon Greenstein
> Subject: Re: [PATCH net-next] bnx2x: Fixing a typo: added a missing RSS
> enablement
> 
> From: Joe Perches <joe@perches.com>
> Date: Tue, 12 Oct 2010 12:26:19 -0700
> 
> > On Tue, 2010-10-12 at 21:02 +0200, Dmitry Kravkov wrote:
> >> @@ -2486,6 +2486,7 @@ void bnx2x_pf_init(struct bnx2x *bp)
> >>  	 * if (is_eth_multi(bp))
> >>  	 *	flags |= FUNC_FLG_RSS;
> >>  	 */
> >> +	flags |= FUNC_FLG_RSS;
> >>
> >>  	/* function setup */
> >>  	if (flags & FUNC_FLG_RSS) {
> >
> > Then the "if (flags & FUNC_FLG_RSS)" test should be removed.
> 
> Yeah it probably should.  If necessary it could be added back
> later.

Thanks, Joe. We will consider removing this "if" and will post
an appropriate patch. Most likely in the close patch series we
have promised to respin... ;)

Thanks to all, guys.
vlad




^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox