Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 09/18] virtio: use avail_event index
From: Michael S. Tsirkin @ 2011-05-19  7:27 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87tycsn9lt.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Wed, May 18, 2011 at 09:49:42AM +0930, Rusty Russell wrote:
> On Tue, 17 May 2011 09:10:31 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Well one can imagine a driver doing:
> > 
> > 	while (virtqueue_get_buf()) {
> > 		virtqueue_add_buf()
> > 	}
> > 	virtqueue_kick()
> > 
> > which looks sensible (batch kicks) but might
> > process any number of bufs between kicks.
> 
> No, we currently only expose the buffers in the kick, so it can only
> fill the ring doing that.
> 
> We could change that (and maybe that's worth looking at)...

That's actually what one of the early patches in the series did.
I guess I can try and reorder the patches, I do believe
it makes sense to publish immediately as this way
host can work in parallel with the guest.

> > If we look at drivers closely enough, I think none
> > of them do the equivalent of the above, but not 100% sure.
> 
> I'm pretty sure we don't have this kind of 'echo' driver yet.  Drivers
> tend to take OS requests and queue them.  The only one which does
> anything even partially sophisticated is the net driver...
> 
> Thanks,
> Rusty.

I guess I'll just need to do the legwork and check then.

-- 
MST

^ permalink raw reply

* Re: 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface,  PERCPU:  allocation failed
From: Denys Fedoryshchenko @ 2011-05-19  7:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1305788113.3019.19.camel@edumazet-laptop>

 On Thu, 19 May 2011 08:55:13 +0200, Eric Dumazet wrote:
> Le jeudi 19 mai 2011 à 08:39 +0200, Eric Dumazet a écrit :
>
>> Its a known problem : When ipv6 is enabled, we allocate percpu 
>> memory to
>> hold per device snmp counters.
>>
>> make sure kernel idea of max possible cpus matches real number of 
>> cpus.
>>
>> And yes, switching to 64bit kernel helps a lot.
>>
>>
>
> Looking at snmp6_alloc_dev(), we allocate three mib per device :
>
> ipstats_mib  (30 * sizeof(u64) * number_of_possible_cpus)
> icmpv6_mib    (4 * sizeof(long) * number_of_possible_cpus)
> icmpv6msg_mib  (26 * sizeof(long))
 1920 +
 256 +
 208 = 2386 * 3000ppp's = 7152000, i think it is not that much at any 
 case, if i am not wrong.

 But at any case i will try 64bit.

>
> For sure icmp ones dont need percpu counter. Plain atomic_long_t
> (shared) would be enough, since ICMP messages are rare enough.



^ permalink raw reply

* Re: [PATCH RFC] virtio_net: fix patch: virtio_net: limit xmit polling
From: Rusty Russell @ 2011-05-19  7:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, habanero, Shirley Ma, Krishna Kumar2, kvm,
	steved, Tom
  Cc: Michael S. Tsirkin, virtualization, netdev, linux-kernel
In-Reply-To: <20110518220125.GA26835@redhat.com>

On Thu, 19 May 2011 01:01:25 +0300, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> The patch  virtio_net: limit xmit polling
> got the logic reversed: it polled while we had
> capacity not while ring was empty.
> 
> Fix it up and clean up a bit by using a for loop.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> OK, turns out that patch was borken. Here's
> a fix that survived stress test on my box.
> Pushed on my branch, I'll send a rebased series
> with Rusty's comments addressed ASAP.

Normally you would have missed the merge window by now, but I'd really
like this stuff in, so I'm holding it open for this.  I want these patches
in linux-next for at least a few days before I push them.

If you think we're not close enough, please tell me and I'll push
the rest of the virtio patches to Linus now.  

Thanks,
Rusty.

^ permalink raw reply

* Re: 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface, PERCPU:  allocation failed
From: Eric Dumazet @ 2011-05-19  7:44 UTC (permalink / raw)
  To: Denys Fedoryshchenko; +Cc: netdev
In-Reply-To: <10f61af229a48d77d06d154b4647cdde@visp.net.lb>

Le jeudi 19 mai 2011 à 10:28 +0300, Denys Fedoryshchenko a écrit :
> On Thu, 19 May 2011 08:55:13 +0200, Eric Dumazet wrote:
> > Le jeudi 19 mai 2011 à 08:39 +0200, Eric Dumazet a écrit :
> >
> >> Its a known problem : When ipv6 is enabled, we allocate percpu 
> >> memory to
> >> hold per device snmp counters.
> >>
> >> make sure kernel idea of max possible cpus matches real number of 
> >> cpus.
> >>
> >> And yes, switching to 64bit kernel helps a lot.
> >>
> >>
> >
> > Looking at snmp6_alloc_dev(), we allocate three mib per device :
> >
> > ipstats_mib  (30 * sizeof(u64) * number_of_possible_cpus)
> > icmpv6_mib    (4 * sizeof(long) * number_of_possible_cpus)
> > icmpv6msg_mib  (26 * sizeof(long))
>  1920 +
>  256 +
>  208 = 2386 * 3000ppp's = 7152000, i think it is not that much at any 
>  case, if i am not wrong.
> 
>  But at any case i will try 64bit.

If you really want to stay 32bit, you might try to enlarge vmalloc aread
(128 Mbytes default) to get room for pcpu data :

grep pcpu /proc/vmallocinfo 


boot param : vmalloc=256M




^ permalink raw reply

* Re: 2.6.39-rc7-git11, x86/32, failed on ppp2897'th interface, PERCPU: allocation failed
From: David Miller @ 2011-05-19  7:51 UTC (permalink / raw)
  To: denys; +Cc: netdev
In-Reply-To: <f9797bb034f650a24e927629d1ab77d8@visp.net.lb>

From: Denys Fedoryshchenko <denys@visp.net.lb>
Date: Thu, 19 May 2011 09:35:29 +0300

> I am not sure it is a bug, but it looks i had free memory(the box had
> 8GB free), and lowmem too, also i will try to enable there 64bit
> kernel at evening.

It's not free memory, you ran out of per-cpu chunks which are allocated
in fixed virtual region(s).

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Hans Schillstrom @ 2011-05-19  7:52 UTC (permalink / raw)
  To: Simon Horman, Dave Jones, Julian Anastasov
  Cc: netdev@vger.kernel.org, Wensong Zhang
In-Reply-To: <20110519032611.GG16688@verge.net.au>

Hello
On Thursday 19 May 2011 05:26:14 Simon Horman wrote:
> On Thu, May 19, 2011 at 10:10:46AM +0900, Simon Horman wrote:
> > On Wed, May 18, 2011 at 04:19:15PM -0400, Dave Jones wrote:
> > > I get this oops from ip_vs_ftp..
> > > 
> > >  general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > >  last sysfs file: /sys/module/nf_nat/refcnt
> > >  CPU 3 
> > >  Modules linked in: ip_vs(+) libcrc32c nf_nat nfsd lockd nfs_acl auth_rpcgss sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_hda_codec_realtek ppdev snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm microcode edac_core snd_timer k10temp snd pcspkr usb_debug edac_mce_amd soundcore snd_page_alloc sp5100_tco i2c_piix4 parport_pc parport wmi r8169 mii lm63 ipv6 pata_acpi firewire_ohci ata_generic firewire_core crc_itu_t pata_atiixp floppy radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: nf_nat]
> > >  
> > >  Pid: 1366, comm: modprobe Not tainted 2.6.39-rc7+ #15 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > >  RIP: 0010:[<ffffffff8107bddb>]  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > >  RSP: 0018:ffff880114139e68  EFLAGS: 00010206
> > >  RAX: 2f736e74656e2f74 RBX: ffffffffa04265d0 RCX: 0000000000000003
> > >  RDX: 00000000656e6567 RSI: ffffffffa04265d0 RDI: ffffffffa04235d8
> > >  RBP: ffff880114139e68 R08: ffff880114139df8 R09: 0000000000000001
> > >  R10: 0000000000000001 R11: 00000000000001cc R12: ffffffffa0432106
> > >  R13: 0000000000000000 R14: 0000000000007f0d R15: 0000000000410e40
> > >  FS:  00007f2aaf242720(0000) GS:ffff88012a800000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > >  CR2: 00007f2aaea0100f CR3: 000000011424f000 CR4: 00000000000006e0
> > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > >  Process modprobe (pid: 1366, threadinfo ffff880114138000, task ffff8801146cc7a0)
> > >  Stack:
> > >   ffff880114139e78 ffffffff8107be36 ffff880114139ec8 ffffffff81403058
> > >   0000000000000000 0000000000000000 ffff880114139ea8 0000000000000000
> > >   ffffffffa0432106 0000000000000000 0000000000007f0d 0000000000410e40
> > >  Call Trace:
> > >   [<ffffffff8107be36>] raw_notifier_chain_register+0xe/0x10
> > >   [<ffffffff81403058>] register_netdevice_notifier+0x2d/0x1b6
> > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > >   [<ffffffffa04322c7>] ip_vs_control_init+0xa5/0xce [ip_vs]
> > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > >   [<ffffffffa0432116>] ip_vs_init+0x10/0x11c [ip_vs]
> > >   [<ffffffff81002099>] do_one_initcall+0x7f/0x13a
> > >   [<ffffffff81096524>] sys_init_module+0x132/0x281
> > >   [<ffffffff814cc702>] system_call_fastpath+0x16/0x1b
> > >  Code: 07 ff c8 89 43 48 eb 08 48 89 df e8 dc 95 44 00 4c 89 e6 48 89 df e8 a7 a5 44 00 5b 41 5c 5d c3 55 48 89 e5 66 66 66 66 90 eb 0c <8b> 50 10 39 56 10 7f 0c 48 8d 78 08 48 8b 07 48 85 c0 75 ec 48 
> > >  RIP  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > >   RSP <ffff880114139e68>
> > >  ---[ end trace e90d7053ad1a7a5b ]---
> > > 
> > > 
> > > This script replicates the bug.
> > > (it usually oopses after just a few loops)
> > > 
> > > #!/bin/sh
> > > while [ 1 ];
> > > do
> > > 	modprobe ip_vs_ftp
> > > 	modprobe -r ip_vs_ftp
> > > done
> > > 
> > > Looks like something isn't getting cleaned up on module exit
> > > that we fall over when we encounter it next time it gets loaded ?

It's a bug in ip_vs_ftp related to netns.

> > 
> > Thanks Dave, I will look into this.
> 
> Hi Dave,
> 
> I'm not having much luck reproducing this in KVM.
> I will try this evening on real hardware.
> 
> Just to make sure we are testing the same thing, are you using Linus's tree?
> --
I can reproduce the source of the problem,
use multiple netns and then unload the ftp module...
i.e. same list head used in multiple netns

This brings up a question:
- How should ftp be handled in a netns ? 
You might want to have it in one netns but not in another,
this requires changes to ipvsadm

A way of doing it could be a disable switch like --noftp [port,port]
i.e. do not break old apps.

Any other ideas ?

This patch solves the root problem, I'm not sure if this is the way to go
or if we should split the ip_vs_app struct ?

If it's the way to go I can send it as a proper formated patch ...
(after some testing)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 4fff432..481f856 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -797,7 +797,8 @@ struct netns_ipvs {
 	struct list_head	rs_table[IP_VS_RTAB_SIZE];
 	/* ip_vs_app */
 	struct list_head	app_list;
-
+	/* ip_vs_ftp */
+	struct ip_vs_app	*ftp_app;
 	/* ip_vs_proto */
 	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
 	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
index 6b5dd6d..17afb09 100644
--- a/net/netfilter/ipvs/ip_vs_ftp.c
+++ b/net/netfilter/ipvs/ip_vs_ftp.c
@@ -411,25 +411,36 @@ static struct ip_vs_app ip_vs_ftp = {
 static int __net_init __ip_vs_ftp_init(struct net *net)
 {
 	int i, ret;
-	struct ip_vs_app *app = &ip_vs_ftp;
+	struct ip_vs_app *app;
+	struct netns_ipvs *ipvs = net_ipvs(net);
+
+	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
+	if (!app)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&app->a_list);
+	INIT_LIST_HEAD(&app->incs_list);
+	INIT_LIST_HEAD(&app->p_list);
+	ipvs->ftp_app = app;
 
 	ret = register_ip_vs_app(net, app);
 	if (ret)
-		return ret;
+		goto err_exit;
 
 	for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
 		if (!ports[i])
 			continue;
 		ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
 		if (ret)
-			break;
+			goto err_unreg;
 		pr_info("%s: loaded support on port[%d] = %d\n",
 			app->name, i, ports[i]);
 	}
+	return 0;
 
-	if (ret)
-		unregister_ip_vs_app(net, app);
-
+err_unreg:
+	unregister_ip_vs_app(net, app);
+err_exit:
+	kfree(ipvs->ftp_app);
 	return ret;
 }
 /*
@@ -437,9 +448,7 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
  */
 static void __ip_vs_ftp_exit(struct net *net)
 {
-	struct ip_vs_app *app = &ip_vs_ftp;
-
-	unregister_ip_vs_app(net, app);
+	unregister_ip_vs_app(net, net_ipvs(net)->ftp_app);
 }
 
 static struct pernet_operations ip_vs_ftp_ops = {


-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply related

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Simon Horman @ 2011-05-19  7:55 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: Dave Jones, netdev, Wensong Zhang, Hans Schillstrom
In-Reply-To: <alpine.LFD.2.00.1105190930480.1993@ja.ssi.bg>

On Thu, May 19, 2011 at 09:33:55AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 May 2011, Simon Horman wrote:
> 
> > > >  Call Trace:
> > > >   [<ffffffff8107be36>] raw_notifier_chain_register+0xe/0x10
> > > >   [<ffffffff81403058>] register_netdevice_notifier+0x2d/0x1b6
> > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > >   [<ffffffffa04322c7>] ip_vs_control_init+0xa5/0xce [ip_vs]
> > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > >   [<ffffffffa0432116>] ip_vs_init+0x10/0x11c [ip_vs]
> > > >   [<ffffffff81002099>] do_one_initcall+0x7f/0x13a
> > > >   [<ffffffff81096524>] sys_init_module+0x132/0x281
> > > >   [<ffffffff814cc702>] system_call_fastpath+0x16/0x1b
> > > >  Code: 07 ff c8 89 43 48 eb 08 48 89 df e8 dc 95 44 00 4c 89 e6 48 89 df e8 a7 a5 44 00 5b 41 5c 5d c3 55 48 89 e5 66 66 66 66 90 eb 0c <8b> 50 10 39 56 10 7f 0c 48 8d 78 08 48 8b 07 48 85 c0 75 ec 48 
> > > >  RIP  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > > >   RSP <ffff880114139e68>
> > > >  ---[ end trace e90d7053ad1a7a5b ]---
> > > > 
> > > > 
> > > > This script replicates the bug.
> > > > (it usually oopses after just a few loops)
> > > > 
> > > > #!/bin/sh
> > > > while [ 1 ];
> > > > do
> > > > 	modprobe ip_vs_ftp
> > > > 	modprobe -r ip_vs_ftp
> > > > done
> > > > 
> > > > Looks like something isn't getting cleaned up on module exit
> > > > that we fall over when we encounter it next time it gets loaded ?
> > > 
> > > Thanks Dave, I will look into this.
> > 
> > Hi Dave,
> > 
> > I'm not having much luck reproducing this in KVM.
> > I will try this evening on real hardware.
> > 
> > Just to make sure we are testing the same thing, are you using Linus's tree?
> 
> 	One unregister_netdevice_notifier(&ip_vs_dst_notifier);
> is missing in ip_vs_control_cleanup for sure.

Like this?

>From 840edfcc48e5b98d928ee9d66def761a808945b3 Mon Sep 17 00:00:00 2001
From: Simon Horman <horms@verge.net.au>
Date: Thu, 19 May 2011 16:54:26 +0900
Subject: [PATCH] IPVS: Free resources on module removal

Cc: Julian Anastasov <ja@ssi.bg>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_ctl.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 37890f2..9b9039b 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3774,6 +3774,7 @@ err_sock:
 void ip_vs_control_cleanup(void)
 {
 	EnterFunction(2);
+	unregister_netdevice_notifier(&ip_vs_dst_notifier);
 	ip_vs_genl_unregister();
 	nf_unregister_sockopt(&ip_vs_sockopts);
 	LeaveFunction(2);
-- 
1.7.4.4


^ permalink raw reply related

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Hans Schillstrom @ 2011-05-19  7:58 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Simon Horman, Dave Jones, netdev@vger.kernel.org, Wensong Zhang,
	Hans Schillstrom
In-Reply-To: <alpine.LFD.2.00.1105190930480.1993@ja.ssi.bg>

On Thursday 19 May 2011 08:33:55 Julian Anastasov wrote:
> 
> 	Hello,
> 
[snip]
> 
> 	One unregister_netdevice_notifier(&ip_vs_dst_notifier);
> is missing in ip_vs_control_cleanup for sure.
> 
Oops, 
Should I prepare a patch for that one ?

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply

* Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC  2988bis-02.
From: Hagen Paul Pfeifer @ 2011-05-19  8:02 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: tsuna, David Miller, kuznet, pekkas, jmorris, yoshfuji, kaber,
	eric.dumazet, netdev, linux-kernel
In-Reply-To: <8C5DF277-320D-4DEB-A133-EEC301DE58DC@comsys.rwth-aachen.de>

On Thu, 19 May 2011 08:52:10 +0200, Alexander Zimmermann wrote:

>> #define TCP_RTO_MAX     ((unsigned)(120*HZ))

>> #define TCP_RTO_MIN     ((unsigned)(HZ/5))

>> 

>> So we're talking about a [200ms ; 120s] range no matter what.

> 

> Why is 200ms a valid lower bound for initRTO? I'm aware of

> measurements that 1s is save for Internet, but I don't know of any

> studies that 200ms is save... 

TCP_RTO_MAX and TCP_RTO_MIN is the lower/upper bound for the RTO in

general, not for the initial RTO. RFC 2988 specify a lower bound of 1

second but all operating system choose a lower one because at the time

where RFC 2988 was written the clock granularity was not that accurate. The

minimum RTO for FreeBSD is even 30ms! Furthermore, analysis had

demonstrated that a minimum RTO of 1 second badly breaks throughput in

environments faster then 33kB with minor packet loss rate (e.g. 1%).

So yes, it CAN be wise to choose other lower/upper bounds. But keep in

mind that we should NOT artificial limit ourself. I can image data center

scenarios where a initial RTO of <1 match perfectly.

Hagen

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Simon Horman @ 2011-05-19  8:07 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Dave Jones, Julian Anastasov, netdev@vger.kernel.org,
	Wensong Zhang
In-Reply-To: <201105190952.49006.hans.schillstrom@ericsson.com>

On Thu, May 19, 2011 at 09:52:47AM +0200, Hans Schillstrom wrote:
> Hello
> On Thursday 19 May 2011 05:26:14 Simon Horman wrote:
> > On Thu, May 19, 2011 at 10:10:46AM +0900, Simon Horman wrote:
> > > On Wed, May 18, 2011 at 04:19:15PM -0400, Dave Jones wrote:
> > > > I get this oops from ip_vs_ftp..
> > > > 
> > > >  general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> > > >  last sysfs file: /sys/module/nf_nat/refcnt
> > > >  CPU 3 
> > > >  Modules linked in: ip_vs(+) libcrc32c nf_nat nfsd lockd nfs_acl auth_rpcgss sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables snd_hda_codec_realtek ppdev snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm microcode edac_core snd_timer k10temp snd pcspkr usb_debug edac_mce_amd soundcore snd_page_alloc sp5100_tco i2c_piix4 parport_pc parport wmi r8169 mii lm63 ipv6 pata_acpi firewire_ohci ata_generic firewire_core crc_itu_t pata_atiixp floppy radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: nf_nat]
> > > >  
> > > >  Pid: 1366, comm: modprobe Not tainted 2.6.39-rc7+ #15 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> > > >  RIP: 0010:[<ffffffff8107bddb>]  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > > >  RSP: 0018:ffff880114139e68  EFLAGS: 00010206
> > > >  RAX: 2f736e74656e2f74 RBX: ffffffffa04265d0 RCX: 0000000000000003
> > > >  RDX: 00000000656e6567 RSI: ffffffffa04265d0 RDI: ffffffffa04235d8
> > > >  RBP: ffff880114139e68 R08: ffff880114139df8 R09: 0000000000000001
> > > >  R10: 0000000000000001 R11: 00000000000001cc R12: ffffffffa0432106
> > > >  R13: 0000000000000000 R14: 0000000000007f0d R15: 0000000000410e40
> > > >  FS:  00007f2aaf242720(0000) GS:ffff88012a800000(0000) knlGS:0000000000000000
> > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > >  CR2: 00007f2aaea0100f CR3: 000000011424f000 CR4: 00000000000006e0
> > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > >  Process modprobe (pid: 1366, threadinfo ffff880114138000, task ffff8801146cc7a0)
> > > >  Stack:
> > > >   ffff880114139e78 ffffffff8107be36 ffff880114139ec8 ffffffff81403058
> > > >   0000000000000000 0000000000000000 ffff880114139ea8 0000000000000000
> > > >   ffffffffa0432106 0000000000000000 0000000000007f0d 0000000000410e40
> > > >  Call Trace:
> > > >   [<ffffffff8107be36>] raw_notifier_chain_register+0xe/0x10
> > > >   [<ffffffff81403058>] register_netdevice_notifier+0x2d/0x1b6
> > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > >   [<ffffffffa04322c7>] ip_vs_control_init+0xa5/0xce [ip_vs]
> > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > >   [<ffffffffa0432116>] ip_vs_init+0x10/0x11c [ip_vs]
> > > >   [<ffffffff81002099>] do_one_initcall+0x7f/0x13a
> > > >   [<ffffffff81096524>] sys_init_module+0x132/0x281
> > > >   [<ffffffff814cc702>] system_call_fastpath+0x16/0x1b
> > > >  Code: 07 ff c8 89 43 48 eb 08 48 89 df e8 dc 95 44 00 4c 89 e6 48 89 df e8 a7 a5 44 00 5b 41 5c 5d c3 55 48 89 e5 66 66 66 66 90 eb 0c <8b> 50 10 39 56 10 7f 0c 48 8d 78 08 48 8b 07 48 85 c0 75 ec 48 
> > > >  RIP  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > > >   RSP <ffff880114139e68>
> > > >  ---[ end trace e90d7053ad1a7a5b ]---
> > > > 
> > > > 
> > > > This script replicates the bug.
> > > > (it usually oopses after just a few loops)
> > > > 
> > > > #!/bin/sh
> > > > while [ 1 ];
> > > > do
> > > > 	modprobe ip_vs_ftp
> > > > 	modprobe -r ip_vs_ftp
> > > > done
> > > > 
> > > > Looks like something isn't getting cleaned up on module exit
> > > > that we fall over when we encounter it next time it gets loaded ?
> 
> It's a bug in ip_vs_ftp related to netns.
> 
> > > 
> > > Thanks Dave, I will look into this.
> > 
> > Hi Dave,
> > 
> > I'm not having much luck reproducing this in KVM.
> > I will try this evening on real hardware.
> > 
> > Just to make sure we are testing the same thing, are you using Linus's tree?
> > --
> I can reproduce the source of the problem,
> use multiple netns and then unload the ftp module...
> i.e. same list head used in multiple netns
> 
> This brings up a question:
> - How should ftp be handled in a netns ? 
> You might want to have it in one netns but not in another,
> this requires changes to ipvsadm
> 
> A way of doing it could be a disable switch like --noftp [port,port]
> i.e. do not break old apps.
> 
> Any other ideas ?
> 
> This patch solves the root problem, I'm not sure if this is the way to go
> or if we should split the ip_vs_app struct ?
> 
> If it's the way to go I can send it as a proper formated patch ...
> (after some testing)

I'm also unsure what a good solution in the longer term is.
But in the immediate future I think that the best idea is as
simple a fix as possible that can go 2.6.39 or -stable.

> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 4fff432..481f856 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -797,7 +797,8 @@ struct netns_ipvs {
>  	struct list_head	rs_table[IP_VS_RTAB_SIZE];
>  	/* ip_vs_app */
>  	struct list_head	app_list;
> -
> +	/* ip_vs_ftp */
> +	struct ip_vs_app	*ftp_app;
>  	/* ip_vs_proto */
>  	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
>  	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index 6b5dd6d..17afb09 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -411,25 +411,36 @@ static struct ip_vs_app ip_vs_ftp = {
>  static int __net_init __ip_vs_ftp_init(struct net *net)
>  {
>  	int i, ret;
> -	struct ip_vs_app *app = &ip_vs_ftp;
> +	struct ip_vs_app *app;
> +	struct netns_ipvs *ipvs = net_ipvs(net);
> +
> +	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
> +	if (!app)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&app->a_list);
> +	INIT_LIST_HEAD(&app->incs_list);
> +	INIT_LIST_HEAD(&app->p_list);
> +	ipvs->ftp_app = app;
>  
>  	ret = register_ip_vs_app(net, app);
>  	if (ret)
> -		return ret;
> +		goto err_exit;
>  
>  	for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
>  		if (!ports[i])
>  			continue;
>  		ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
>  		if (ret)
> -			break;
> +			goto err_unreg;
>  		pr_info("%s: loaded support on port[%d] = %d\n",
>  			app->name, i, ports[i]);
>  	}
> +	return 0;
>  
> -	if (ret)
> -		unregister_ip_vs_app(net, app);
> -
> +err_unreg:
> +	unregister_ip_vs_app(net, app);
> +err_exit:
> +	kfree(ipvs->ftp_app);
>  	return ret;
>  }
>  /*
> @@ -437,9 +448,7 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
>   */
>  static void __ip_vs_ftp_exit(struct net *net)
>  {
> -	struct ip_vs_app *app = &ip_vs_ftp;
> -
> -	unregister_ip_vs_app(net, app);
> +	unregister_ip_vs_app(net, net_ipvs(net)->ftp_app);
>  }
>  
>  static struct pernet_operations ip_vs_ftp_ops = {
> 
> 
> -- 
> Regards
> Hans Schillstrom <hans.schillstrom@ericsson.com>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Julian Anastasov @ 2011-05-19  8:15 UTC (permalink / raw)
  To: Simon Horman; +Cc: Dave Jones, netdev, Wensong Zhang, Hans Schillstrom
In-Reply-To: <20110519075557.GB3922@verge.net.au>


	Hello,

On Thu, 19 May 2011, Simon Horman wrote:

> On Thu, May 19, 2011 at 09:33:55AM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Thu, 19 May 2011, Simon Horman wrote:
> > 
> > > > >  Call Trace:
> > > > >   [<ffffffff8107be36>] raw_notifier_chain_register+0xe/0x10
> > > > >   [<ffffffff81403058>] register_netdevice_notifier+0x2d/0x1b6
> > > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > > >   [<ffffffffa04322c7>] ip_vs_control_init+0xa5/0xce [ip_vs]
> > > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > > >   [<ffffffffa0432116>] ip_vs_init+0x10/0x11c [ip_vs]
> > > > >   [<ffffffff81002099>] do_one_initcall+0x7f/0x13a
> > > > >   [<ffffffff81096524>] sys_init_module+0x132/0x281
> > > > >   [<ffffffff814cc702>] system_call_fastpath+0x16/0x1b
> > > > >  Code: 07 ff c8 89 43 48 eb 08 48 89 df e8 dc 95 44 00 4c 89 e6 48 89 df e8 a7 a5 44 00 5b 41 5c 5d c3 55 48 89 e5 66 66 66 66 90 eb 0c <8b> 50 10 39 56 10 7f 0c 48 8d 78 08 48 8b 07 48 85 c0 75 ec 48 
> > > > >  RIP  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > > > >   RSP <ffff880114139e68>
> > > > >  ---[ end trace e90d7053ad1a7a5b ]---
> > > > > 
> > > > > 
> > > > > This script replicates the bug.
> > > > > (it usually oopses after just a few loops)
> > > > > 
> > > > > #!/bin/sh
> > > > > while [ 1 ];
> > > > > do
> > > > > 	modprobe ip_vs_ftp
> > > > > 	modprobe -r ip_vs_ftp
> > > > > done
> > > > > 
> > > > > Looks like something isn't getting cleaned up on module exit
> > > > > that we fall over when we encounter it next time it gets loaded ?
> > > > 
> > > > Thanks Dave, I will look into this.
> > > 
> > > Hi Dave,
> > > 
> > > I'm not having much luck reproducing this in KVM.
> > > I will try this evening on real hardware.
> > > 
> > > Just to make sure we are testing the same thing, are you using Linus's tree?
> > 
> > 	One unregister_netdevice_notifier(&ip_vs_dst_notifier);
> > is missing in ip_vs_control_cleanup for sure.
> 
> Like this?

	Yes, I think oops is for 2nd or next module load after
first unload forgets entry in notifier list.

> >From 840edfcc48e5b98d928ee9d66def761a808945b3 Mon Sep 17 00:00:00 2001
> From: Simon Horman <horms@verge.net.au>
> Date: Thu, 19 May 2011 16:54:26 +0900
> Subject: [PATCH] IPVS: Free resources on module removal
> 
> Cc: Julian Anastasov <ja@ssi.bg>
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Simon Horman <horms@verge.net.au>

Acked-by: Julian Anastasov <ja@ssi.bg>

> ---
>  net/netfilter/ipvs/ip_vs_ctl.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 37890f2..9b9039b 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -3774,6 +3774,7 @@ err_sock:
>  void ip_vs_control_cleanup(void)
>  {
>  	EnterFunction(2);
> +	unregister_netdevice_notifier(&ip_vs_dst_notifier);
>  	ip_vs_genl_unregister();
>  	nf_unregister_sockopt(&ip_vs_sockopts);
>  	LeaveFunction(2);
> -- 
> 1.7.4.4

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Hans Schillstrom @ 2011-05-19  8:14 UTC (permalink / raw)
  To: Simon Horman
  Cc: Julian Anastasov, Dave Jones, netdev@vger.kernel.org,
	Wensong Zhang, Hans Schillstrom
In-Reply-To: <20110519075557.GB3922@verge.net.au>

Hello, Simon
On Thursday 19 May 2011 09:55:57 Simon Horman wrote:
> On Thu, May 19, 2011 at 09:33:55AM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Thu, 19 May 2011, Simon Horman wrote:
> > 
> > > > >  Call Trace:
> > > > >   [<ffffffff8107be36>] raw_notifier_chain_register+0xe/0x10
> > > > >   [<ffffffff81403058>] register_netdevice_notifier+0x2d/0x1b6
> > > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > > >   [<ffffffffa04322c7>] ip_vs_control_init+0xa5/0xce [ip_vs]
> > > > >   [<ffffffffa0432106>] ? ip_vs_conn_init+0x106/0x106 [ip_vs]
> > > > >   [<ffffffffa0432116>] ip_vs_init+0x10/0x11c [ip_vs]
> > > > >   [<ffffffff81002099>] do_one_initcall+0x7f/0x13a
> > > > >   [<ffffffff81096524>] sys_init_module+0x132/0x281
> > > > >   [<ffffffff814cc702>] system_call_fastpath+0x16/0x1b
> > > > >  Code: 07 ff c8 89 43 48 eb 08 48 89 df e8 dc 95 44 00 4c 89 e6 48 89 df e8 a7 a5 44 00 5b 41 5c 5d c3 55 48 89 e5 66 66 66 66 90 eb 0c <8b> 50 10 39 56 10 7f 0c 48 8d 78 08 48 8b 07 48 85 c0 75 ec 48 
> > > > >  RIP  [<ffffffff8107bddb>] notifier_chain_register+0xb/0x2a
> > > > >   RSP <ffff880114139e68>
> > > > >  ---[ end trace e90d7053ad1a7a5b ]---
> > > > > 
> > > > > 
> > > > > This script replicates the bug.
> > > > > (it usually oopses after just a few loops)
> > > > > 
> > > > > #!/bin/sh
> > > > > while [ 1 ];
> > > > > do
> > > > > 	modprobe ip_vs_ftp
> > > > > 	modprobe -r ip_vs_ftp
> > > > > done
> > > > > 
> > > > > Looks like something isn't getting cleaned up on module exit
> > > > > that we fall over when we encounter it next time it gets loaded ?
> > > > 
> > > > Thanks Dave, I will look into this.
> > > 
> > > Hi Dave,
> > > 
> > > I'm not having much luck reproducing this in KVM.
> > > I will try this evening on real hardware.
> > > 
> > > Just to make sure we are testing the same thing, are you using Linus's tree?
> > 
> > 	One unregister_netdevice_notifier(&ip_vs_dst_notifier);
> > is missing in ip_vs_control_cleanup for sure.
> 
> Like this?

Yes,
we need this patch and the ip_vs_ftp patch in some format.

> 
> From 840edfcc48e5b98d928ee9d66def761a808945b3 Mon Sep 17 00:00:00 2001
> From: Simon Horman <horms@verge.net.au>
> Date: Thu, 19 May 2011 16:54:26 +0900
> Subject: [PATCH] IPVS: Free resources on module removal
> 
> Cc: Julian Anastasov <ja@ssi.bg>
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 37890f2..9b9039b 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -3774,6 +3774,7 @@ err_sock:
>  void ip_vs_control_cleanup(void)
>  {
>  	EnterFunction(2);
> +	unregister_netdevice_notifier(&ip_vs_dst_notifier);
>  	ip_vs_genl_unregister();
>  	nf_unregister_sockopt(&ip_vs_sockopts);
>  	LeaveFunction(2);

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Simon Horman @ 2011-05-19  8:17 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Julian Anastasov, Dave Jones, netdev@vger.kernel.org,
	Wensong Zhang, Hans Schillstrom
In-Reply-To: <201105190958.56185.hans.schillstrom@ericsson.com>

On Thu, May 19, 2011 at 09:58:55AM +0200, Hans Schillstrom wrote:
> On Thursday 19 May 2011 08:33:55 Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> [snip]
> > 
> > 	One unregister_netdevice_notifier(&ip_vs_dst_notifier);
> > is missing in ip_vs_control_cleanup for sure.
> > 
> Oops, 
> Should I prepare a patch for that one ?

Could you test the one I posted?
(Or send another one if I got it wrong? :-)

^ permalink raw reply

* [V2 Patch net-next-2.6] netpoll: disable netpoll when enslave a device
From: Amerigo Wang @ 2011-05-19  8:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, WANG Cong, Neil Horman, Jay Vosburgh, David S. Miller,
	Ian Campbell, Paul E. McKenney, Josh Triplett, netdev
In-Reply-To: <20110518105558.GA3203@hmsreliant.think-freely.org>

Currently we do nothing when we enslave a net device which is running netconsole.
Neil pointed out that we may get weird results in such case, so let's disable
netpoll on the device being enslaved. I think it is too harsh to prevent
the device being ensalved if it is running netconsole.

By the way, this patch also removes the NETDEV_GOING_DOWN from netconsole
netdev notifier, because netpoll will check if the device is running or not
and we don't handle NETDEV_PRE_UP neither.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>

---
 drivers/net/bonding/bond_main.c |    2 ++
 drivers/net/netconsole.c        |   26 +++++++++++++++++---------
 include/linux/notifier.h        |    1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 088fd84..b9c70c5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1640,6 +1640,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		}
 	}
 
+	netdev_bonding_change(slave_dev, NETDEV_ENSLAVE);
+
 	/* If this is the first slave, then we need to set the master's hardware
 	 * address to be the same as the slave's. */
 	if (is_zero_ether_addr(bond->dev->dev_addr))
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index a83e101..c2a8230 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -621,11 +621,10 @@ static int netconsole_netdev_event(struct notifier_block *this,
 	bool stopped = false;
 
 	if (!(event == NETDEV_CHANGENAME || event == NETDEV_UNREGISTER ||
-	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN))
+	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_ENSLAVE))
 		goto done;
 
 	spin_lock_irqsave(&target_list_lock, flags);
-restart:
 	list_for_each_entry(nt, &target_list, list) {
 		netconsole_target_get(nt);
 		if (nt->np.dev == dev) {
@@ -633,6 +632,8 @@ restart:
 			case NETDEV_CHANGENAME:
 				strlcpy(nt->np.dev_name, dev->name, IFNAMSIZ);
 				break;
+			case NETDEV_BONDING_DESLAVE:
+			case NETDEV_ENSLAVE:
 			case NETDEV_UNREGISTER:
 				/*
 				 * rtnl_lock already held
@@ -647,11 +648,7 @@ restart:
 					dev_put(nt->np.dev);
 					nt->np.dev = NULL;
 					netconsole_target_put(nt);
-					goto restart;
 				}
-				/* Fall through */
-			case NETDEV_GOING_DOWN:
-			case NETDEV_BONDING_DESLAVE:
 				nt->enabled = 0;
 				stopped = true;
 				break;
@@ -660,10 +657,21 @@ restart:
 		netconsole_target_put(nt);
 	}
 	spin_unlock_irqrestore(&target_list_lock, flags);
-	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
+	if (stopped) {
 		printk(KERN_INFO "netconsole: network logging stopped on "
-			"interface %s as it %s\n",  dev->name,
-			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
+		       "interface %s as it ", dev->name);
+		switch (event) {
+		case NETDEV_UNREGISTER:
+			printk(KERN_CONT "unregistered\n");
+			break;
+		case NETDEV_BONDING_DESLAVE:
+			printk(KERN_CONT "released slaves\n");
+			break;
+		case NETDEV_ENSLAVE:
+			printk(KERN_CONT "is enslaved\n");
+			break;
+		}
+	}
 
 done:
 	return NOTIFY_DONE;
diff --git a/include/linux/notifier.h b/include/linux/notifier.h
index 621dfa1..3d82867 100644
--- a/include/linux/notifier.h
+++ b/include/linux/notifier.h
@@ -211,6 +211,7 @@ static inline int notifier_to_errno(int ret)
 #define NETDEV_UNREGISTER_BATCH 0x0011
 #define NETDEV_BONDING_DESLAVE  0x0012
 #define NETDEV_NOTIFY_PEERS	0x0013
+#define NETDEV_ENSLAVE		0x0014
 
 #define SYS_DOWN	0x0001	/* Notify of system down */
 #define SYS_RESTART	SYS_DOWN

^ permalink raw reply related

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Hans Schillstrom @ 2011-05-19  8:52 UTC (permalink / raw)
  To: Simon Horman
  Cc: Julian Anastasov, Dave Jones, netdev@vger.kernel.org,
	Wensong Zhang, Hans Schillstrom
In-Reply-To: <20110519081706.GE3922@verge.net.au>

On Thursday 19 May 2011 10:17:07 Simon Horman wrote:
> On Thu, May 19, 2011 at 09:58:55AM +0200, Hans Schillstrom wrote:
> > On Thursday 19 May 2011 08:33:55 Julian Anastasov wrote:
> > > 
> > > 	Hello,
> > > 
> > [snip]
> > > 
> > > 	One unregister_netdevice_notifier(&ip_vs_dst_notifier);
> > > is missing in ip_vs_control_cleanup for sure.
> > > 
> > Oops, 
> > Should I prepare a patch for that one ?
> 
> Could you test the one I posted?
> (Or send another one if I got it wrong? :-)
> 
Tested,
it works fine :-)
-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Julian Anastasov @ 2011-05-19  8:55 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Simon Horman, Dave Jones, netdev@vger.kernel.org, Wensong Zhang
In-Reply-To: <201105190952.49006.hans.schillstrom@ericsson.com>


	Hello,

On Thu, 19 May 2011, Hans Schillstrom wrote:

> I can reproduce the source of the problem,
> use multiple netns and then unload the ftp module...
> i.e. same list head used in multiple netns
> 
> This brings up a question:
> - How should ftp be handled in a netns ? 
> You might want to have it in one netns but not in another,
> this requires changes to ipvsadm
> 
> A way of doing it could be a disable switch like --noftp [port,port]
> i.e. do not break old apps.
> 
> Any other ideas ?
> 
> This patch solves the root problem, I'm not sure if this is the way to go
> or if we should split the ip_vs_app struct ?

	This patch is a fast fix but may be it is too late for it,
after 2.6.39 is out. It seems we overlooked the apps when
migrating to netns. I think, the apps do not need to be pernet.
If one day application needs pernet context we can add such
fields in the ipvs structure.

	While the protocols have controls that manipulate pernet
timeouts, the apps do not have such controls about app->timeouts.
May be we can remove app->timeouts to avoid confusion because
it was never implemented in user space. May be instead of this
fix we should restore the global ip_vs_app_list and all things in
ip_vs_app.c and ip_vs_ftp.c as before the netns changes?

> If it's the way to go I can send it as a proper formated patch ...
> (after some testing)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 4fff432..481f856 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -797,7 +797,8 @@ struct netns_ipvs {
>  	struct list_head	rs_table[IP_VS_RTAB_SIZE];
>  	/* ip_vs_app */
>  	struct list_head	app_list;
> -
> +	/* ip_vs_ftp */
> +	struct ip_vs_app	*ftp_app;
>  	/* ip_vs_proto */
>  	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
>  	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index 6b5dd6d..17afb09 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -411,25 +411,36 @@ static struct ip_vs_app ip_vs_ftp = {
>  static int __net_init __ip_vs_ftp_init(struct net *net)
>  {
>  	int i, ret;
> -	struct ip_vs_app *app = &ip_vs_ftp;
> +	struct ip_vs_app *app;
> +	struct netns_ipvs *ipvs = net_ipvs(net);
> +
> +	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
> +	if (!app)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&app->a_list);
> +	INIT_LIST_HEAD(&app->incs_list);
> +	INIT_LIST_HEAD(&app->p_list);
> +	ipvs->ftp_app = app;
>  
>  	ret = register_ip_vs_app(net, app);
>  	if (ret)
> -		return ret;
> +		goto err_exit;
>  
>  	for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
>  		if (!ports[i])
>  			continue;
>  		ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
>  		if (ret)
> -			break;
> +			goto err_unreg;
>  		pr_info("%s: loaded support on port[%d] = %d\n",
>  			app->name, i, ports[i]);
>  	}
> +	return 0;
>  
> -	if (ret)
> -		unregister_ip_vs_app(net, app);
> -
> +err_unreg:
> +	unregister_ip_vs_app(net, app);
> +err_exit:
> +	kfree(ipvs->ftp_app);
>  	return ret;
>  }
>  /*
> @@ -437,9 +448,7 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
>   */
>  static void __ip_vs_ftp_exit(struct net *net)
>  {
> -	struct ip_vs_app *app = &ip_vs_ftp;
> -
> -	unregister_ip_vs_app(net, app);
> +	unregister_ip_vs_app(net, net_ipvs(net)->ftp_app);
>  }
>  
>  static struct pernet_operations ip_vs_ftp_ops = {

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: net: add seq_before/seq_after functions
From: Sven Eckelmann @ 2011-05-19  8:54 UTC (permalink / raw)
  To: Antonio Quartulli; +Cc: linux-kernel, netdev, davem, Paul Mackerras, linux-ppp
In-Reply-To: <1305722319-8315-1-git-send-email-ordex@autistici.org>

[-- Attachment #1: Type: Text/Plain, Size: 4595 bytes --]

On Wednesday 18 May 2011 14:38:39 Antonio Quartulli wrote:
> Introduce two operations to handle comparison between packet sequence
> numbers taking into account overflow/wraparound. Batman-adv uses
> these functions already to check for successor packet even in case of
> overflow.

Thanks for your efforts to bring that to the kernel. But when you prepare a 
patch then you have to add a signoff. And also David S. Miller is the 
maintainer for this header - it would be interesting to ask him first when we 
want to change that file.

> ---
> I added this two functions in net.h because I didn't really know where
> best placement is. I saw several modules that redefine their own functions
> for the same purpose.
> 
>  include/linux/net.h |   17 +++++++++++++++++
>  1 files changed, 17 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/net.h b/include/linux/net.h
> index 94de83c..c7bc9bf 100644
> --- a/include/linux/net.h
> +++ b/include/linux/net.h
> @@ -295,4 +295,21 @@ extern struct ratelimit_state net_ratelimit_state;
>  #endif
> 
>  #endif /* __KERNEL__ */
> +
> +/* Returns the smallest signed integer in two's complement with the sizeof
> x */ +#define smallest_signed_int(x) (1u << (7u + 8u * (sizeof(x) - 1u)))
> +
> +/* Checks if a sequence number x is a predecessor/successor of y.
> + * they handle overflows/underflows and can correctly check for a
> + * predecessor/successor unless the variable sequence number has grown by
> + * more then 2**(bitwidth(x)-1)-1.
> + * This means that for a uint8_t with the maximum value 255, it would
> think: + *  - when adding nothing - it is neither a predecessor nor a
> successor + *  - before adding more than 127 to the starting value - it is
> a predecessor, + *  - when adding 128 - it is neither a predecessor nor a
> successor, + *  - after adding more than 127 to the starting value - it is
> a successor */ +#define seq_before(x, y) ({typeof(x) _dummy = (x - y); \
> +			_dummy > smallest_signed_int(_dummy); })
> +#define seq_after(x, y) seq_before(y, x)
> +
>  #endif	/* _LINUX_NET_H */

I suggested yesterday (probably too late) that it would be good to check the
type of both parameters (similar to the min and max functions in
include/linux/kernel.h

#define seq_before(x, y) ({typeof(x) _d1 = (x); \
			  typeof(y) _d2 = (y); \
			  (void) (&_d1 == &_d2); \
			  typeof(x) _dummy = (_d1 - _d2); \
			  _dummy > smallest_signed_int(_dummy); })


And your seq_before/after conflicts with the one defined in ppp_generic.c

drivers/net/ppp_generic.c:232:0: warning: "seq_before" redefined [enabled by 
default]
include/linux/net.h:312:0: note: this is the location of the previous 
definition
drivers/net/ppp_generic.c:233:0: warning: "seq_after" redefined [enabled by 
default]
include/linux/net.h:314:0: note: this is the location of the previous 
definition

The definition there is only for u32 - thus you would have to remove it and 
check that it always gives the same result:
#define seq_before(a, b)        ((s32)((a) - (b)) < 0)
#define seq_after(a, b)         ((s32)((a) - (b)) > 0)

But I would say that they have a different definition of seq_before. Changing 
that behaviour for batman-adv would not be that problematic, but maybe for 
ppp.

A defintion which should fulfil the requirements for ppp could be:

#define seq_after(x, y) ({typeof(x) _d1 = (x); \
			  typeof(y) _d2 = (y); \
			  (void) (&_d1 == &_d2); \
			  typeof(x) _dummy = (_d2 - _d1); \
			  _dummy > smallest_signed_int(_dummy); })
#define seq_before(x, y) ({typeof(x) _d1 = (x); \
			  typeof(y) _d2 = (y); \
			  (void) (&_d1 == &_d2); \
			  typeof(x) _dummy = (_d1 - _d2); \
			  _dummy >= smallest_signed_int(_dummy); })

Of course the comment above the seq_before/seq_after would be wrong.

/* Checks if a sequence number x is a predecessor/successor of y.
 * they handle overflows/underflows and can correctly check for a
 * predecessor/successor unless the variable sequence number has grown by
 * more then 2**(bitwidth(x)-1).
 * This means that for a uint8_t with the maximum value 255, it would think:
 *  - when adding nothing - it is neither a predecessor nor a successor
 *  - before adding more than 128 to the starting value - it is a predecessor,
 *  - after adding more than 127 to the starting value - it is a successor */

I think there could be more candidates which would like to use this abstract 
functionality. Maybe some one else on linux-kernel or netdev has a suggestion.

Kind regards,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: net: add seq_before/seq_after functions
From: David Miller @ 2011-05-19  9:08 UTC (permalink / raw)
  To: sven; +Cc: ordex, linux-kernel, netdev, paulus, linux-ppp
In-Reply-To: <201105191054.34912.sven@narfation.org>

From: Sven Eckelmann <sven@narfation.org>
Date: Thu, 19 May 2011 10:54:32 +0200

> On Wednesday 18 May 2011 14:38:39 Antonio Quartulli wrote:
>> Introduce two operations to handle comparison between packet sequence
>> numbers taking into account overflow/wraparound. Batman-adv uses
>> these functions already to check for successor packet even in case of
>> overflow.
> 
> Thanks for your efforts to bring that to the kernel. But when you prepare a 
> patch then you have to add a signoff. And also David S. Miller is the 
> maintainer for this header - it would be interesting to ask him first when we 
> want to change that file.

Well it makes no sense to add these interfaces until we see an
upstream submission of code which will actually use it.

Also I'm skeptical that such generic sounding interfaces make
sense when it appears to me that these are protocol specific
sequence number tests so probably belong in whatever protocol
is upcoming which will use these interfaces.

Again, this is why we want to see the code that's going to use
these new routines before we can seriously consider adding them
at all.

^ permalink raw reply

* Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Michał Mirosław @ 2011-05-19  9:18 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, David Miller
In-Reply-To: <1305745379.2929.9.camel@bwh-desktop>

On Wed, May 18, 2011 at 08:02:59PM +0100, Ben Hutchings wrote:
> On Mon, 2011-05-16 at 23:09 +0100, Ben Hutchings wrote:
> > On Mon, 2011-05-16 at 23:50 +0200, Michał Mirosław wrote:
> > > On Mon, May 16, 2011 at 10:08:59PM +0100, Ben Hutchings wrote:
> > > > I've explained before that I do not want to add new options to do
> > > > (mostly) the same thing.  Users should have not have to use a different
> > > > command depending on the kernel version.
> > > We can avoid new option by checking feature-strings for unrecognised
> > > arguments to -K. This way, we will have the old options which work
> > > regardless of kernel version ('tx', 'rx', 'sg', etc.) and new options
> > > which need recent kernel anyway (separated 'tx-checksum-*', 'loopback',
> > > others coming in for 2.6.40).
> > This is just too subtle a distinction.  It will mostly confuse users.
> Sorry, I think I misunderstood you here.  I agree that new feature names
> that do not correspond exactly to existing keywords should be supported
> as keywords after the -K option.  I think those that do (e.g.
> "tx-udp-fragmentation" vs "ufo") should not be, as adding a
> kernel-version-dependent *alias* would be confusing.

The alias can be marked as such in the documentation. Shouldn't it be
that hard for a user to read the manpage to know what the new options
are for when he sees them. I don't like the idea of translating strings,
either, because if e.g. ufo becomes split in the feature to ufo4+ufo6
or new checksum offloads are implemented, it will break.

> I also want users to benefit from your improvements (as I explained
> above) even when they use the old names, if they are using a new kernel
> version.  That is why I want ethtool to try using ETHTOOL_SFEATURES
> first, and why the fallback in the kernel is problematic.

Which benefits do you want to have? If checking what other features
changed with selected one, it's easily done by rereading the state -
possibly with GFEATURES.

I'll cook another PoC patch over those I sent to show the idea.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: net: add seq_before/seq_after functions
From: Sven Eckelmann @ 2011-05-19  9:21 UTC (permalink / raw)
  To: David Miller; +Cc: ordex, linux-kernel, netdev, paulus, linux-ppp
In-Reply-To: <20110519.050824.838971689287732822.davem@davemloft.net>

[-- Attachment #1: Type: Text/Plain, Size: 1736 bytes --]

On Thursday 19 May 2011 11:08:24 David Miller wrote:
> From: Sven Eckelmann <sven@narfation.org>
> Date: Thu, 19 May 2011 10:54:32 +0200
> 
> > On Wednesday 18 May 2011 14:38:39 Antonio Quartulli wrote:
> >> Introduce two operations to handle comparison between packet sequence
> >> numbers taking into account overflow/wraparound. Batman-adv uses
> >> these functions already to check for successor packet even in case of
> >> overflow.
> > 
> > Thanks for your efforts to bring that to the kernel. But when you prepare
> > a patch then you have to add a signoff. And also David S. Miller is the
> > maintainer for this header - it would be interesting to ask him first
> > when we want to change that file.
> 
> Well it makes no sense to add these interfaces until we see an
> upstream submission of code which will actually use it.
> 
> Also I'm skeptical that such generic sounding interfaces make
> sense when it appears to me that these are protocol specific
> sequence number tests so probably belong in whatever protocol
> is upcoming which will use these interfaces.
> 
> Again, this is why we want to see the code that's going to use
> these new routines before we can seriously consider adding them
> at all.

This is currently used by vis.c in net/batman-adv and could also be used by 
ppp-generic.c (with my changes of course). And it is planned to be used by 
transtable.c in net/batman-adv. The idea was to propose this to linux-
kernel/netdev before we move it to a place were only batman-adv can use it 
(the current situation is that vis.c in batman-adv can only use it).

It is ok that you say that it should be batman-adv specific - we only wanted 
to ask first.

Thanks,
	Sven

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH] ethtool: ETHTOOL_SFEATURES: remove NETIF_F_COMPAT return
From: Michał Mirosław @ 2011-05-19 10:03 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, netdev
In-Reply-To: <20110516.140958.625993829749556424.davem@davemloft.net>

On Mon, May 16, 2011 at 02:09:58PM -0400, David Miller wrote:
> You guys really need to sort this out properly.
> Please resubmit whatever final solution is agreed upon.

I noticed that v2.6.39 was tagged today. We should definitely remove
NETIF_F_COMPAT so it won't bite us in the future. The other patch that
fixes ethtool_ops->set_flags compatibility is a bugfix, so it should go
in - if we decide that the SFEATURES compatibility should be removed
it won't matter.

Ben, do you agree?

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: ip_vs_ftp causing ip_vs oops on module load.
From: Hans Schillstrom @ 2011-05-19 10:19 UTC (permalink / raw)
  To: Julian Anastasov, lvs-devel@vger.kernel.org
  Cc: Simon Horman, Dave Jones, netdev@vger.kernel.org, Wensong Zhang
In-Reply-To: <alpine.LFD.2.00.1105191139360.2171@ja.ssi.bg>

On Thursday 19 May 2011 10:55:07 Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 May 2011, Hans Schillstrom wrote:
> 
> > I can reproduce the source of the problem,
> > use multiple netns and then unload the ftp module...
> > i.e. same list head used in multiple netns
> > 
> > This brings up a question:
> > - How should ftp be handled in a netns ? 
> > You might want to have it in one netns but not in another,
> > this requires changes to ipvsadm
> > 
> > A way of doing it could be a disable switch like --noftp [port,port]
> > i.e. do not break old apps.
> > 
> > Any other ideas ?
> > 
> > This patch solves the root problem, I'm not sure if this is the way to go
> > or if we should split the ip_vs_app struct ?
> 
> 	This patch is a fast fix but may be it is too late for it,
> after 2.6.39 is out. It seems we overlooked the apps when
> migrating to netns. I think, the apps do not need to be pernet.
> If one day application needs pernet context we can add such
> fields in the ipvs structure.

If we talk about the long term solution,
applications that affects other netns should have their own data.
ex. like the ftp, ports should be per netns
    ipvsadm --appftp [port list]

> 
> 	While the protocols have controls that manipulate pernet
> timeouts, the apps do not have such controls about app->timeouts.
> May be we can remove app->timeouts to avoid confusion because
> it was never implemented in user space. May be instead of this
> fix we should restore the global ip_vs_app_list and all things in
> ip_vs_app.c and ip_vs_ftp.c as before the netns changes?

Doesn't matter, just keep the patch as simple as possible
while we discuss the long term solution.

> 
> > If it's the way to go I can send it as a proper formated patch ...
> > (after some testing)
> > 
> > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> > index 4fff432..481f856 100644
> > --- a/include/net/ip_vs.h
> > +++ b/include/net/ip_vs.h
> > @@ -797,7 +797,8 @@ struct netns_ipvs {
> >  	struct list_head	rs_table[IP_VS_RTAB_SIZE];
> >  	/* ip_vs_app */
> >  	struct list_head	app_list;
> > -
> > +	/* ip_vs_ftp */
> > +	struct ip_vs_app	*ftp_app;
> >  	/* ip_vs_proto */
> >  	#define IP_VS_PROTO_TAB_SIZE	32	/* must be power of 2 */
> >  	struct ip_vs_proto_data *proto_data_table[IP_VS_PROTO_TAB_SIZE];
> > diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> > index 6b5dd6d..17afb09 100644
> > --- a/net/netfilter/ipvs/ip_vs_ftp.c
> > +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> > @@ -411,25 +411,36 @@ static struct ip_vs_app ip_vs_ftp = {
> >  static int __net_init __ip_vs_ftp_init(struct net *net)
> >  {
> >  	int i, ret;
> > -	struct ip_vs_app *app = &ip_vs_ftp;
> > +	struct ip_vs_app *app;
> > +	struct netns_ipvs *ipvs = net_ipvs(net);
> > +
> > +	app = kmemdup(&ip_vs_ftp, sizeof(struct ip_vs_app), GFP_KERNEL);
> > +	if (!app)
> > +		return -ENOMEM;
> > +	INIT_LIST_HEAD(&app->a_list);
> > +	INIT_LIST_HEAD(&app->incs_list);
> > +	INIT_LIST_HEAD(&app->p_list);
> > +	ipvs->ftp_app = app;
> >  
> >  	ret = register_ip_vs_app(net, app);
> >  	if (ret)
> > -		return ret;
> > +		goto err_exit;
> >  
> >  	for (i=0; i<IP_VS_APP_MAX_PORTS; i++) {
> >  		if (!ports[i])
> >  			continue;
> >  		ret = register_ip_vs_app_inc(net, app, app->protocol, ports[i]);
> >  		if (ret)
> > -			break;
> > +			goto err_unreg;
> >  		pr_info("%s: loaded support on port[%d] = %d\n",
> >  			app->name, i, ports[i]);
> >  	}
> > +	return 0;
> >  
> > -	if (ret)
> > -		unregister_ip_vs_app(net, app);
> > -
> > +err_unreg:
> > +	unregister_ip_vs_app(net, app);
> > +err_exit:
> > +	kfree(ipvs->ftp_app);
> >  	return ret;
> >  }
> >  /*
> > @@ -437,9 +448,7 @@ static int __net_init __ip_vs_ftp_init(struct net *net)
> >   */
> >  static void __ip_vs_ftp_exit(struct net *net)
> >  {
> > -	struct ip_vs_app *app = &ip_vs_ftp;
> > -
> > -	unregister_ip_vs_app(net, app);
> > +	unregister_ip_vs_app(net, net_ipvs(net)->ftp_app);
> >  }
> >  
> >  static struct pernet_operations ip_vs_ftp_ops = {
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>
> 

-- 
Regards
Hans Schillstrom <hans.schillstrom@ericsson.com>

^ permalink raw reply

* [Patch] bridge: call NETDEV_ENSLAVE notifiers when adding a slave
From: Amerigo Wang @ 2011-05-19 10:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, WANG Cong, Neil Horman, Jay Vosburgh, Stephen Hemminger,
	David S. Miller, netdev, bridge
In-Reply-To: <1305794393-20775-1-git-send-email-amwang@redhat.com>

In the previous patch I added NETDEV_ENSLAVE, now
we can notify netconsole when adding a device to a bridge too.

By the way, s/netdev_bonding_change/call_netdevice_notifiers/ in
bond_main.c, since this is not bonding specific.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@redhat.com>

---
 drivers/net/bonding/bond_main.c |    2 +-
 net/bridge/br_if.c              |    2 ++
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b9c70c5..765fdcf 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1640,7 +1640,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		}
 	}
 
-	netdev_bonding_change(slave_dev, NETDEV_ENSLAVE);
+	call_netdevice_notifiers(NETDEV_ENSLAVE, slave_dev);
 
 	/* If this is the first slave, then we need to set the master's hardware
 	 * address to be the same as the slave's. */
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 5dbdfdf..b44fae5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -338,6 +338,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev)
 	if (IS_ERR(p))
 		return PTR_ERR(p);
 
+	call_netdevice_notifiers(NETDEV_ENSLAVE, dev);
+
 	err = dev_set_promiscuity(dev, 1);
 	if (err)
 		goto put_back;

^ permalink raw reply related

* Re: [Patch net-next-2.6] netpoll: disable netpoll when enslave a device
From: Neil Horman @ 2011-05-19 11:03 UTC (permalink / raw)
  To: Cong Wang
  Cc: Neil Horman, linux-kernel, Jay Vosburgh, Andy Gospodarek,
	David S. Miller, Alexey Dobriyan, Ferenc Wagner, Andrew Morton,
	Paul E. McKenney, Josh Triplett, Ian Campbell, netdev
In-Reply-To: <4DD4A6F9.4010002@redhat.com>

On Thu, May 19, 2011 at 01:13:29PM +0800, Cong Wang wrote:
> 于 2011年05月18日 18:56, Neil Horman 写道:
> >On Wed, May 18, 2011 at 06:00:35PM +0800, Amerigo Wang wrote:
> ...
> >>-			case NETDEV_GOING_DOWN:
> >>  			case NETDEV_BONDING_DESLAVE:
> >>+			case NETDEV_ENSLAVE:
> >>  				nt->enabled = 0;
> >>  				stopped = true;
> >>  				break;
> >This wasn't introduced by this patch, but looking at it made me realize that
> >nt->enabled, if it passes through this code path, doesn't properly track weather
> >or not netpoll_setup has been called on this interface.  If you look at
> >drop_netconsole_target, you'll see we only call netpoll_cleanup_target if
> >nt->enabled is set.  We should probably change the nt->enabled check there, and
> >in store_enabled to be if (nt->np.dev), like we do in the NETDEV_UNREGISTER case
> >in netconsole_netdev_event.
> 
> Yeah, also note that we can change ->enabled via configfs too.
> I guess we probably need to fix this in another patch...
> 
Yeah, or you can roll it into this one, I think this is the only location that
needs fixing.

> 
> >>+#define NETDEV_ENSLAVE		0x0014
> >>
> >Nit:
> >Shouldn't this be NETDEV_BONDING_ENSLAVE, to keep it in line with
> >NETDEV_BONDING_DESLAVE above?
> 
> Actually that is my first thought, but I plan to use this in bridge
> case too, because using netconsole on a device underlying a bridge
> makes little sense too. Thus, I prefer NETDEV_ENSLAVE to
> NETDEV_BONDING_ENSLAVE.
> 
That seems reasonable, but if its going to be more generic, could you change
NETDEV_BONDING_DESLAVE to NETDEV_DESLAVE?

> >
> >>  #define SYS_DOWN	0x0001	/* Notify of system down */
> >>  #define SYS_RESTART	SYS_DOWN
> >>
> >
> >
> >Other than those two points, this looks good to me
> 
> Thanks for review.
Thank you!

^ permalink raw reply

* [PATCH net-next-2.6] ipv6: reduce per device ICMP mib sizes
From: Eric Dumazet @ 2011-05-19 11:14 UTC (permalink / raw)
  To: Denys Fedoryshchenko, David Miller; +Cc: netdev
In-Reply-To: <1305788113.3019.19.camel@edumazet-laptop>

Le jeudi 19 mai 2011 à 08:55 +0200, Eric Dumazet a écrit :

> Looking at snmp6_alloc_dev(), we allocate three mib per device :
> 
> ipstats_mib  (30 * sizeof(u64) * number_of_possible_cpus)
> icmpv6_mib    (4 * sizeof(long) * number_of_possible_cpus)
> icmpv6msg_mib  (26 * sizeof(long))
> 

Oops, I forgot that mibs were doubled (one set for USER, one set for BH)

And :
#define __ICMP6MSG_MIB_MAX 512

So icmpv6msg_mib is really 512*sizeof(long)*number_of_possible_cpus*2 

32 kbytes per device on a 8cpu machine, 32bit kernel.

Plus all other mibs... yes thats way too big for a seldom used stuff.

Here is patch I cooked and tested on my machine :

[PATCH net-next-2.6] ipv6: reduce per device ICMP mib sizes.

ipv6 has per device ICMP SNMP counters, taking too much space because
they use percpu storage.

needed size per device is : 
(512+4)*sizeof(long)*number_of_possible_cpus*2 

On a 32bit kernel, 16 possible cpus, this wastes more than 64kbytes of
memory per ipv6 enabled network device, taken in vmalloc pool.

Since ICMP messages are rare, just use shared counters (atomic_long_t)

Per network space ICMP counters are still using percpu memory, we might
also convert them to shared counters in a future patch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Denys Fedoryshchenko <denys@visp.net.lb>
---
 include/net/if_inet6.h |    4 +--
 include/net/ipv6.h     |   19 +++++++++++++-----
 include/net/snmp.h     |   14 +++++++++++++
 net/ipv6/addrconf.c    |   24 +++++++++++------------
 net/ipv6/proc.c        |   40 +++++++++++++++++++++++++--------------
 5 files changed, 68 insertions(+), 33 deletions(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 0c603fe..11cf373 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -154,8 +154,8 @@ struct ifacaddr6 {
 struct ipv6_devstat {
 	struct proc_dir_entry	*proc_dir_entry;
 	DEFINE_SNMP_STAT(struct ipstats_mib, ipv6);
-	DEFINE_SNMP_STAT(struct icmpv6_mib, icmpv6);
-	DEFINE_SNMP_STAT(struct icmpv6msg_mib, icmpv6msg);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmpv6_mib_device, icmpv6dev);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmpv6msg_mib_device, icmpv6msgdev);
 };
 
 struct inet6_dev {
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index e1c60b4..c033ed0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -123,6 +123,15 @@ extern struct ctl_path net_ipv6_ctl_path[];
 	SNMP_INC_STATS##modifier((net)->mib.statname##_statistics, (field));\
 })
 
+/* per device counters are atomic_long_t */
+#define _DEVINCATOMIC(net, statname, modifier, idev, field)		\
+({									\
+	struct inet6_dev *_idev = (idev);				\
+	if (likely(_idev != NULL))					\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field)); \
+	SNMP_INC_STATS##modifier((net)->mib.statname##_statistics, (field));\
+})
+
 #define _DEVADD(net, statname, modifier, idev, field, val)		\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
@@ -154,16 +163,16 @@ extern struct ctl_path net_ipv6_ctl_path[];
 #define IP6_UPD_PO_STATS_BH(net, idev,field,val)   \
 		_DEVUPD(net, ipv6, 64_BH, idev, field, val)
 #define ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINC(net, icmpv6, , idev, field)
+		_DEVINCATOMIC(net, icmpv6, , idev, field)
 #define ICMP6_INC_STATS_BH(net, idev, field)	\
-		_DEVINC(net, icmpv6, _BH, idev, field)
+		_DEVINCATOMIC(net, icmpv6, _BH, idev, field)
 
 #define ICMP6MSGOUT_INC_STATS(net, idev, field)		\
-	_DEVINC(net, icmpv6msg, , idev, field +256)
+	_DEVINCATOMIC(net, icmpv6msg, , idev, field +256)
 #define ICMP6MSGOUT_INC_STATS_BH(net, idev, field)	\
-	_DEVINC(net, icmpv6msg, _BH, idev, field +256)
+	_DEVINCATOMIC(net, icmpv6msg, _BH, idev, field +256)
 #define ICMP6MSGIN_INC_STATS_BH(net, idev, field)	\
-	_DEVINC(net, icmpv6msg, _BH, idev, field)
+	_DEVINCATOMIC(net, icmpv6msg, _BH, idev, field)
 
 struct ip6_ra_chain {
 	struct ip6_ra_chain	*next;
diff --git a/include/net/snmp.h b/include/net/snmp.h
index 27461d6..479083a 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -72,14 +72,24 @@ struct icmpmsg_mib {
 
 /* ICMP6 (IPv6-ICMP) */
 #define ICMP6_MIB_MAX	__ICMP6_MIB_MAX
+/* per network ns counters */
 struct icmpv6_mib {
 	unsigned long	mibs[ICMP6_MIB_MAX];
 };
+/* per device counters, (shared on all cpus) */
+struct icmpv6_mib_device {
+	atomic_long_t	mibs[ICMP6_MIB_MAX];
+};
 
 #define ICMP6MSG_MIB_MAX  __ICMP6MSG_MIB_MAX
+/* per network ns counters */
 struct icmpv6msg_mib {
 	unsigned long	mibs[ICMP6MSG_MIB_MAX];
 };
+/* per device counters, (shared on all cpus) */
+struct icmpv6msg_mib_device {
+	atomic_long_t	mibs[ICMP6MSG_MIB_MAX];
+};
 
 
 /* TCP */
@@ -114,6 +124,8 @@ struct linux_xfrm_mib {
  */ 
 #define DEFINE_SNMP_STAT(type, name)	\
 	__typeof__(type) __percpu *name[2]
+#define DEFINE_SNMP_STAT_ATOMIC(type, name)	\
+	__typeof__(type) *name
 #define DECLARE_SNMP_STAT(type, name)	\
 	extern __typeof__(type) __percpu *name[2]
 
@@ -124,6 +136,8 @@ struct linux_xfrm_mib {
 			__this_cpu_inc(mib[0]->mibs[field])
 #define SNMP_INC_STATS_USER(mib, field)	\
 			this_cpu_inc(mib[1]->mibs[field])
+#define SNMP_INC_STATS_ATOMIC_LONG(mib, field)	\
+			atomic_long_inc(&mib->mibs[field])
 #define SNMP_INC_STATS(mib, field)	\
 			this_cpu_inc(mib[!in_softirq()]->mibs[field])
 #define SNMP_DEC_STATS(mib, field)	\
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index f2f9b2e..3cfbbf3 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -289,19 +289,19 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
 			  sizeof(struct ipstats_mib),
 			  __alignof__(struct ipstats_mib)) < 0)
 		goto err_ip;
-	if (snmp_mib_init((void __percpu **)idev->stats.icmpv6,
-			  sizeof(struct icmpv6_mib),
-			  __alignof__(struct icmpv6_mib)) < 0)
+	idev->stats.icmpv6dev = kzalloc(sizeof(struct icmpv6_mib_device),
+					GFP_KERNEL);
+	if (!idev->stats.icmpv6dev)
 		goto err_icmp;
-	if (snmp_mib_init((void __percpu **)idev->stats.icmpv6msg,
-			  sizeof(struct icmpv6msg_mib),
-			  __alignof__(struct icmpv6msg_mib)) < 0)
+	idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
+					   GFP_KERNEL);
+	if (!idev->stats.icmpv6msgdev)
 		goto err_icmpmsg;
 
 	return 0;
 
 err_icmpmsg:
-	snmp_mib_free((void __percpu **)idev->stats.icmpv6);
+	kfree(idev->stats.icmpv6dev);
 err_icmp:
 	snmp_mib_free((void __percpu **)idev->stats.ipv6);
 err_ip:
@@ -310,8 +310,8 @@ err_ip:
 
 static void snmp6_free_dev(struct inet6_dev *idev)
 {
-	snmp_mib_free((void __percpu **)idev->stats.icmpv6msg);
-	snmp_mib_free((void __percpu **)idev->stats.icmpv6);
+	kfree(idev->stats.icmpv6msgdev);
+	kfree(idev->stats.icmpv6dev);
 	snmp_mib_free((void __percpu **)idev->stats.ipv6);
 }
 
@@ -3838,7 +3838,7 @@ static inline size_t inet6_if_nlmsg_size(void)
 	       + nla_total_size(inet6_ifla6_size()); /* IFLA_PROTINFO */
 }
 
-static inline void __snmp6_fill_stats(u64 *stats, void __percpu **mib,
+static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib,
 				      int items, int bytes)
 {
 	int i;
@@ -3848,7 +3848,7 @@ static inline void __snmp6_fill_stats(u64 *stats, void __percpu **mib,
 	/* Use put_unaligned() because stats may not be aligned for u64. */
 	put_unaligned(items, &stats[0]);
 	for (i = 1; i < items; i++)
-		put_unaligned(snmp_fold_field(mib, i), &stats[i]);
+		put_unaligned(atomic_long_read(&mib[i]), &stats[i]);
 
 	memset(&stats[items], 0, pad);
 }
@@ -3877,7 +3877,7 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype,
 				     IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp));
 		break;
 	case IFLA_INET6_ICMP6STATS:
-		__snmp6_fill_stats(stats, (void __percpu **)idev->stats.icmpv6, ICMP6_MIB_MAX, bytes);
+		__snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes);
 		break;
 	}
 }
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 24b3558..18ff5df 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -141,7 +141,11 @@ static const struct snmp_mib snmp6_udplite6_list[] = {
 	SNMP_MIB_SENTINEL
 };
 
-static void snmp6_seq_show_icmpv6msg(struct seq_file *seq, void __percpu **mib)
+/* can be called either with percpu mib (pcpumib != NULL),
+ * or shared one (smib != NULL)
+ */
+static void snmp6_seq_show_icmpv6msg(struct seq_file *seq, void __percpu **pcpumib,
+				     atomic_long_t *smib)
 {
 	char name[32];
 	int i;
@@ -158,14 +162,14 @@ static void snmp6_seq_show_icmpv6msg(struct seq_file *seq, void __percpu **mib)
 		snprintf(name, sizeof(name), "Icmp6%s%s",
 			i & 0x100 ? "Out" : "In", p);
 		seq_printf(seq, "%-32s\t%lu\n", name,
-			snmp_fold_field(mib, i));
+			pcpumib ? snmp_fold_field(pcpumib, i) : atomic_long_read(smib + i));
 	}
 
 	/* print by number (nonzero only) - ICMPMsgStat format */
 	for (i = 0; i < ICMP6MSG_MIB_MAX; i++) {
 		unsigned long val;
 
-		val = snmp_fold_field(mib, i);
+		val = pcpumib ? snmp_fold_field(pcpumib, i) : atomic_long_read(smib + i);
 		if (!val)
 			continue;
 		snprintf(name, sizeof(name), "Icmp6%sType%u",
@@ -174,14 +178,22 @@ static void snmp6_seq_show_icmpv6msg(struct seq_file *seq, void __percpu **mib)
 	}
 }
 
-static void snmp6_seq_show_item(struct seq_file *seq, void __percpu **mib,
+/* can be called either with percpu mib (pcpumib != NULL),
+ * or shared one (smib != NULL)
+ */
+static void snmp6_seq_show_item(struct seq_file *seq, void __percpu **pcpumib,
+				atomic_long_t *smib,
 				const struct snmp_mib *itemlist)
 {
 	int i;
+	unsigned long val;
 
-	for (i = 0; itemlist[i].name; i++)
-		seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name,
-			   snmp_fold_field(mib, itemlist[i].entry));
+	for (i = 0; itemlist[i].name; i++) {
+		val = pcpumib ?
+			snmp_fold_field(pcpumib, itemlist[i].entry) :
+			atomic_long_read(smib + itemlist[i].entry);
+		seq_printf(seq, "%-32s\t%lu\n", itemlist[i].name, val);
+	}
 }
 
 static void snmp6_seq_show_item64(struct seq_file *seq, void __percpu **mib,
@@ -201,13 +213,13 @@ static int snmp6_seq_show(struct seq_file *seq, void *v)
 	snmp6_seq_show_item64(seq, (void __percpu **)net->mib.ipv6_statistics,
 			    snmp6_ipstats_list, offsetof(struct ipstats_mib, syncp));
 	snmp6_seq_show_item(seq, (void __percpu **)net->mib.icmpv6_statistics,
-			    snmp6_icmp6_list);
+			    NULL, snmp6_icmp6_list);
 	snmp6_seq_show_icmpv6msg(seq,
-			    (void __percpu **)net->mib.icmpv6msg_statistics);
+			    (void __percpu **)net->mib.icmpv6msg_statistics, NULL);
 	snmp6_seq_show_item(seq, (void __percpu **)net->mib.udp_stats_in6,
-			    snmp6_udp6_list);
+			    NULL, snmp6_udp6_list);
 	snmp6_seq_show_item(seq, (void __percpu **)net->mib.udplite_stats_in6,
-			    snmp6_udplite6_list);
+			    NULL, snmp6_udplite6_list);
 	return 0;
 }
 
@@ -229,11 +241,11 @@ static int snmp6_dev_seq_show(struct seq_file *seq, void *v)
 	struct inet6_dev *idev = (struct inet6_dev *)seq->private;
 
 	seq_printf(seq, "%-32s\t%u\n", "ifIndex", idev->dev->ifindex);
-	snmp6_seq_show_item(seq, (void __percpu **)idev->stats.ipv6,
+	snmp6_seq_show_item(seq, (void __percpu **)idev->stats.ipv6, NULL,
 			    snmp6_ipstats_list);
-	snmp6_seq_show_item(seq, (void __percpu **)idev->stats.icmpv6,
+	snmp6_seq_show_item(seq, NULL, idev->stats.icmpv6dev->mibs,
 			    snmp6_icmp6_list);
-	snmp6_seq_show_icmpv6msg(seq, (void __percpu **)idev->stats.icmpv6msg);
+	snmp6_seq_show_icmpv6msg(seq, NULL, idev->stats.icmpv6msgdev->mibs);
 	return 0;
 }
 



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox