Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 14:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <20100502071324.1a95fdad@infradead.org>

Le dimanche 02 mai 2010 à 07:13 -0700, Arjan van de Ven a écrit :
> > 
> > Cn                Avg residency       P-states (frequencies)
> > C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
> > polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
> > C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
> > C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
> >                                         1.60 Ghz    38.2%
> 
> I bet your system advertizes C2 with the same latency as C1,
> but with lower power... which means Linux will pretty much never
> pick C1.... no matter how much you take Andi's patch.
> 
> this is a bios thing... and until we put in the patch to override the
> bios values (I can dust it off but it might need a bit of tweaking
> since it was against .31) Andi's patch alone won't cut it... you also
> need a non-lying bios ;)
> 
> 
> 
# pwd
/sys/devices/system/cpu/cpu15/cpuidle
# grep . */*
state0/desc:CPUIDLE CORE POLL IDLE
state0/latency:0
state0/name:C0
state0/power:4294967295
state0/time:0
state0/usage:0
state1/desc:ACPI FFH INTEL MWAIT 0x0
state1/latency:1
state1/name:C1
state1/power:1000
state1/time:433855186
state1/usage:126869
state2/desc:ACPI FFH INTEL MWAIT 0x10
state2/latency:64
state2/name:C2
state2/power:500
state2/time:198095020416
state2/usage:76287744

C2 latency seems to be 64  (us ?), while C1 seems to be 1

BIOS Information
	Vendor: HP
	Version: I24
	Release Date: 10/01/2009

# powertop
PowerTOP 1.11   (C) 2007, 2008 Intel Corporation 

Collecting data for 5 seconds 


Your CPU supports the following C-states : C1 C2 C3 
Your BIOS reports the following C-states : C1 C2 

C3 seems to be disabled in BIOS



^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 15:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <1272810448.2173.31.camel@edumazet-laptop>

Le dimanche 02 mai 2010 à 16:27 +0200, Eric Dumazet a écrit :
> Le dimanche 02 mai 2010 à 07:13 -0700, Arjan van de Ven a écrit :
> > > 
> > > Cn                Avg residency       P-states (frequencies)
> > > C0 (cpu running)        (68.9%)         2.93 Ghz    46.5%
> > > polling           0.0ms ( 0.0%)         2.80 Ghz     5.1%
> > > C1 mwait          0.0ms ( 0.0%)         2.53 Ghz     3.0%
> > > C2 mwait          0.0ms (31.1%)         2.13 Ghz     2.8%
> > >                                         1.60 Ghz    38.2%
> > 
> > I bet your system advertizes C2 with the same latency as C1,
> > but with lower power... which means Linux will pretty much never
> > pick C1.... no matter how much you take Andi's patch.
> > 
> > this is a bios thing... and until we put in the patch to override the
> > bios values (I can dust it off but it might need a bit of tweaking
> > since it was against .31) Andi's patch alone won't cut it... you also
> > need a non-lying bios ;)
> > 
> > 
> > 
> # pwd
> /sys/devices/system/cpu/cpu15/cpuidle
> # grep . */*
> state0/desc:CPUIDLE CORE POLL IDLE
> state0/latency:0
> state0/name:C0
> state0/power:4294967295
> state0/time:0
> state0/usage:0
> state1/desc:ACPI FFH INTEL MWAIT 0x0
> state1/latency:1
> state1/name:C1
> state1/power:1000
> state1/time:433855186
> state1/usage:126869
> state2/desc:ACPI FFH INTEL MWAIT 0x10
> state2/latency:64
> state2/name:C2
> state2/power:500
> state2/time:198095020416
> state2/usage:76287744
> 
> C2 latency seems to be 64  (us ?), while C1 seems to be 1
> 
> BIOS Information
> 	Vendor: HP
> 	Version: I24
> 	Release Date: 10/01/2009
> 
> # powertop
> PowerTOP 1.11   (C) 2007, 2008 Intel Corporation 
> 
> Collecting data for 5 seconds 
> 
> 
> Your CPU supports the following C-states : C1 C2 C3 
> Your BIOS reports the following C-states : C1 C2 
> 
> C3 seems to be disabled in BIOS
> 

I took a look at BIOS settings and enabled the minimum sleep state to be
C6 (instead of C3, the default). Now we see C3 being available...

No changes, only more IPI delivered during the test, and more overhead
in clockevents_notify()

# grep . */*
state0/desc:CPUIDLE CORE POLL IDLE
state0/latency:0
state0/name:C0
state0/power:4294967295
state0/time:0
state0/usage:0
state1/desc:ACPI FFH INTEL MWAIT 0x0
state1/latency:1
state1/name:C1
state1/power:1000
state1/time:39432
state1/usage:119
state2/desc:ACPI FFH INTEL MWAIT 0x10
state2/latency:64
state2/name:C2
state2/power:500
state2/time:3170745
state2/usage:11177
state3/desc:ACPI FFH INTEL MWAIT 0x20
state3/latency:96
state3/name:C3
state3/power:350
state3/time:1030987453
state3/usage:14047019

---------------------------------------------------------------------------------------------------------------------------
   PerfTop:   15984 irqs/sec  kernel:98.5% [1000Hz cycles],  (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                       DSO
             _______ _____ ______________________________ _______

            23822.00 40.2% _raw_spin_lock_irqsave         vmlinux
             4413.00  7.4% acpi_os_read_port              vmlinux
             1426.00  2.4% _raw_spin_lock                 vmlinux
             1284.00  2.2% _raw_spin_unlock_irqrestore    vmlinux
             1247.00  2.1% schedule                       vmlinux
             1137.00  1.9% bnx2x_rx_int                   vmlinux
              643.00  1.1% tick_broadcast_oneshot_control vmlinux
              597.00  1.0% copy_user_generic_string       vmlinux
              595.00  1.0% __napi_complete                vmlinux
              550.00  0.9% call_function_single_interrupt vmlinux
              548.00  0.9% bnx2x_msix_fp_int              vmlinux
              486.00  0.8% __netif_receive_skb            vmlinux
              461.00  0.8% bnx2x_poll                     vmlinux
              433.00  0.7% eth_type_trans                 vmlinux
              428.00  0.7% acpi_idle_enter_bm             vmlinux
              422.00  0.7% sock_recv_ts_and_drops         vmlinux
              382.00  0.6% __udp4_lib_lookup              vmlinux
              369.00  0.6% __slab_free                    vmlinux
              357.00  0.6% ip_route_input                 vmlinux
              341.00  0.6% kfree                          vmlinux
              335.00  0.6% ipt_do_table                   vmlinux
              334.00  0.6% ip_rcv                         vmlinux
              332.00  0.6% udp_recvmsg                    vmlinux
              317.00  0.5% __kmalloc_node_track_caller    vmlinux

    37.46%            init  [kernel.kallsyms]             [k] _raw_spin_lock_irqsave
                      |
                      --- _raw_spin_lock_irqsave
                         |          
                         |--95.58%-- clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm
                         |          cpuidle_idle_call
                         |          cpu_idle
                         |          start_secondary
                         |          
                         |--3.27%-- tick_broadcast_oneshot_control
                         |          tick_notify
                         |          notifier_call_chain
                         |          __raw_notifier_call_chain
                         |          raw_notifier_call_chain
                         |          clockevents_do_notify
                         |          clockevents_notify
                         |          lapic_timer_state_broadcast
                         |          acpi_idle_enter_bm



^ permalink raw reply

* [PATCH v3] net: fix softnet_stat
From: Changli Gao @ 2010-05-02 15:42 UTC (permalink / raw)
  To: David S. Miller; +Cc: Eric Dumazet, netdev, Changli Gao

fix softnet_stat

Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
without any protection. And enqueue_to_backlog should update the netdev_rx_stat
of the target CPU.

This patch renames softnet_data.total to softnet_data.processed: the number of
packets processed in uppper levels(IP stacks).

softnet_stat data is moved into softnet_data.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
 include/linux/netdevice.h |   17 +++++++----------
 net/core/dev.c            |   26 ++++++++++++--------------
 net/sched/sch_generic.c   |    2 +-
 3 files changed, 20 insertions(+), 25 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 40d4c20..c39938f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -218,16 +218,6 @@ struct neighbour;
 struct neigh_parms;
 struct sk_buff;
 
-struct netif_rx_stats {
-	unsigned total;
-	unsigned dropped;
-	unsigned time_squeeze;
-	unsigned cpu_collision;
-	unsigned received_rps;
-};
-
-DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
-
 struct netdev_hw_addr {
 	struct list_head	list;
 	unsigned char		addr[MAX_ADDR_LEN];
@@ -1390,6 +1380,12 @@ struct softnet_data {
 	struct sk_buff		*completion_queue;
 	struct sk_buff_head	process_queue;
 
+	/* stats */
+	unsigned		processed;
+	unsigned		time_squeeze;
+	unsigned		cpu_collision;
+	unsigned		received_rps;
+
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
 
@@ -1399,6 +1395,7 @@ struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_head;
 #endif
+	unsigned		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 100dcbd..36d53be 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2205,8 +2205,6 @@ int netdev_max_backlog __read_mostly = 1000;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;            /* old backlog weight */
 
-DEFINE_PER_CPU(struct netif_rx_stats, netdev_rx_stat) = { 0, };
-
 #ifdef CONFIG_RPS
 
 /* One global table that all flow-based protocols share. */
@@ -2366,7 +2364,7 @@ static void rps_trigger_softirq(void *data)
 	struct softnet_data *sd = data;
 
 	__napi_schedule(&sd->backlog);
-	__get_cpu_var(netdev_rx_stat).received_rps++;
+	sd->received_rps++;
 }
 
 #endif /* CONFIG_RPS */
@@ -2405,7 +2403,6 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	sd = &per_cpu(softnet_data, cpu);
 
 	local_irq_save(flags);
-	__get_cpu_var(netdev_rx_stat).total++;
 
 	rps_lock(sd);
 	if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
@@ -2429,9 +2426,9 @@ enqueue:
 		goto enqueue;
 	}
 
+	sd->dropped++;
 	rps_unlock(sd);
 
-	__get_cpu_var(netdev_rx_stat).dropped++;
 	local_irq_restore(flags);
 
 	kfree_skb(skb);
@@ -2806,7 +2803,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 			skb->dev = master;
 	}
 
-	__get_cpu_var(netdev_rx_stat).total++;
+	__get_cpu_var(softnet_data).processed++;
 
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
@@ -3490,7 +3487,7 @@ out:
 	return;
 
 softnet_break:
-	__get_cpu_var(netdev_rx_stat).time_squeeze++;
+	sd->time_squeeze++;
 	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
 	goto out;
 }
@@ -3691,17 +3688,17 @@ static int dev_seq_show(struct seq_file *seq, void *v)
 	return 0;
 }
 
-static struct netif_rx_stats *softnet_get_online(loff_t *pos)
+static struct softnet_data *softnet_get_online(loff_t *pos)
 {
-	struct netif_rx_stats *rc = NULL;
+	struct softnet_data *sd = NULL;
 
 	while (*pos < nr_cpu_ids)
 		if (cpu_online(*pos)) {
-			rc = &per_cpu(netdev_rx_stat, *pos);
+			sd = &per_cpu(softnet_data, *pos);
 			break;
 		} else
 			++*pos;
-	return rc;
+	return sd;
 }
 
 static void *softnet_seq_start(struct seq_file *seq, loff_t *pos)
@@ -3721,12 +3718,12 @@ static void softnet_seq_stop(struct seq_file *seq, void *v)
 
 static int softnet_seq_show(struct seq_file *seq, void *v)
 {
-	struct netif_rx_stats *s = v;
+	struct softnet_data *sd = v;
 
 	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
-		   s->total, s->dropped, s->time_squeeze, 0,
+		   sd->processed, sd->dropped, sd->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   sd->cpu_collision, sd->received_rps);
 	return 0;
 }
 
@@ -5869,6 +5866,7 @@ static int __init net_dev_init(void)
 	for_each_possible_cpu(i) {
 		struct softnet_data *sd = &per_cpu(softnet_data, i);
 
+		memset(sd, 0, sizeof(*sd));
 		skb_queue_head_init(&sd->input_pkt_queue);
 		skb_queue_head_init(&sd->process_queue);
 		sd->completion_queue = NULL;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index aeddabf..a969b11 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -94,7 +94,7 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 		 * Another cpu is holding lock, requeue & delay xmits for
 		 * some time.
 		 */
-		__get_cpu_var(netdev_rx_stat).cpu_collision++;
+		__get_cpu_var(softnet_data).cpu_collision++;
 		ret = dev_requeue_skb(skb, q);
 	}
 

^ permalink raw reply related

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02 15:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <1272797690.2173.26.camel@edumazet-laptop>

> In this test, all 16 queues of one BCM57711E NIC (1Gb link) delivers
>  packets at about 1.300.000 pps to 16 cpus (one cpu per queue) and these
> packets are then redistributed by RPS to same 16 cpus, generating about
> 650.000 IPI per second.

BTW if rps was SMT aware it could avoid a lot of the IPIs in the first place.

-Andi


^ permalink raw reply

* Re: Re: Kernel panic in fib_rules_lookup (kernel 2.6.32)
From: Eric Dumazet @ 2010-05-02 16:31 UTC (permalink / raw)
  To: "Oleg A. Arkhangelsky"; +Cc: netdev
In-Reply-To: <382091272797186@web84.yandex.ru>

Le dimanche 02 mai 2010 à 14:46 +0400, "Oleg A. Arkhangelsky" a écrit :
> Hello,
> 
> 09.03.10, 17:09, "Eric Dumazet" <eric.dumazet@gmail.com>:
> 
> > Le mardi 09 mars 2010 à 10:44 +0300, "Oleg A. Arkhangelsky" a écrit :
> >  > Hello,
> >  > 
> >  > Got this kernel panic tomorrow. This PC is rather heavy loaded router with BGP full view (> 300K routes).
> >  > We're using FIB_TRIE. Last time we got similar panic about 1 month ago. Please, let me know if you
> >  > need additional information to debug (e.g. objdump). Thanks!
> >  > 
> >  > Mar  9 10:08:55 bras-1 kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
> >  > Mar  9 10:08:55 bras-1 kernel: IP: [] fib_rules_lookup+0xa2/0xd0
> >  > Mar  9 10:08:55 bras-1 kernel: *pde = 00000000
> >  > Mar  9 10:08:55 bras-1 kernel: Thread overran stack, or stack corrupted
> >  
> >  Hmm...
> 
> Got the same panic, at the same place (fib_rules_lookup+0xa2/0xd0). Looks like the problem with NULL dereference is somewhere in list_for_each_entry_rcu macro. But I don't understand how this can be.
> 
> Any thoughts? :(
> 

Do you have any modify rules activity ?

I dont understand why we need synchronize_rcu() in fib_nl_delrule() but
this certainly not a bug.




^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 16:35 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100502154649.GA18063@gargoyle.fritz.box>

Le dimanche 02 mai 2010 à 17:46 +0200, Andi Kleen a écrit :
> > In this test, all 16 queues of one BCM57711E NIC (1Gb link) delivers
> >  packets at about 1.300.000 pps to 16 cpus (one cpu per queue) and these
> > packets are then redistributed by RPS to same 16 cpus, generating about
> > 650.000 IPI per second.
> 
> BTW if rps was SMT aware it could avoid a lot of the IPIs in the first place.

RPS do what you want, just stick a good cpumask, not a unware one :)

In my test, I specifically do something 'stupid' like :

echo fffe >/sys/class/net/bond0.2240/queues/rx-0/rps_cpus
echo fffd >/sys/class/net/bond0.2240/queues/rx-1/rps_cpus
echo fffb >/sys/class/net/bond0.2240/queues/rx-2/rps_cpus
echo fff7 >/sys/class/net/bond0.2240/queues/rx-3/rps_cpus

echo ffef >/sys/class/net/bond0.2240/queues/rx-4/rps_cpus
echo ffdf >/sys/class/net/bond0.2240/queues/rx-5/rps_cpus
echo ffbf >/sys/class/net/bond0.2240/queues/rx-6/rps_cpus
echo ff7f >/sys/class/net/bond0.2240/queues/rx-7/rps_cpus

echo feff >/sys/class/net/bond0.2240/queues/rx-8/rps_cpus
echo fdff >/sys/class/net/bond0.2240/queues/rx-9/rps_cpus
echo fbff >/sys/class/net/bond0.2240/queues/rx-10/rps_cpus
echo f7ff >/sys/class/net/bond0.2240/queues/rx-11/rps_cpus

echo efff >/sys/class/net/bond0.2240/queues/rx-12/rps_cpus
echo dfff >/sys/class/net/bond0.2240/queues/rx-13/rps_cpus
echo bfff >/sys/class/net/bond0.2240/queues/rx-14/rps_cpus
echo 7fff >/sys/class/net/bond0.2240/queues/rx-15/rps_cpus

echo 0001 >/proc/irq/*/eth1-fp-0/../smp_affinity
echo 0002 >/proc/irq/*/eth1-fp-1/../smp_affinity
echo 0004 >/proc/irq/*/eth1-fp-2/../smp_affinity
echo 0008 >/proc/irq/*/eth1-fp-3/../smp_affinity
echo 0010 >/proc/irq/*/eth1-fp-4/../smp_affinity
echo 0020 >/proc/irq/*/eth1-fp-5/../smp_affinity
echo 0040 >/proc/irq/*/eth1-fp-6/../smp_affinity
echo 0080 >/proc/irq/*/eth1-fp-7/../smp_affinity
echo 0100 >/proc/irq/*/eth1-fp-8/../smp_affinity
echo 0200 >/proc/irq/*/eth1-fp-9/../smp_affinity
echo 0400 >/proc/irq/*/eth1-fp-10/../smp_affinity
echo 0800 >/proc/irq/*/eth1-fp-11/../smp_affinity
echo 1000 >/proc/irq/*/eth1-fp-12/../smp_affinity
echo 2000 >/proc/irq/*/eth1-fp-13/../smp_affinity
echo 4000 >/proc/irq/*/eth1-fp-14/../smp_affinity
echo 8000 >/proc/irq/*/eth1-fp-15/../smp_affinity


You mean we can wakeup a thread with something else than an IPI ?




^ permalink raw reply

* Re: [PATCH v3] net: fix softnet_stat
From: Eric Dumazet @ 2010-05-02 16:43 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1272814936-4902-1-git-send-email-xiaosuo@gmail.com>

Le dimanche 02 mai 2010 à 23:42 +0800, Changli Gao a écrit :
> fix softnet_stat
> 
> Per cpu variable softnet_data.total was shared between IRQ and SoftIRQ context
> without any protection. And enqueue_to_backlog should update the netdev_rx_stat
> of the target CPU.
> 
> This patch renames softnet_data.total to softnet_data.processed: the number of
> packets processed in uppper levels(IP stacks).
> 
> softnet_stat data is moved into softnet_data.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Thats a fine patch, thanks Changli.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>


>  
> @@ -5869,6 +5866,7 @@ static int __init net_dev_init(void)
>  	for_each_possible_cpu(i) {
>  		struct softnet_data *sd = &per_cpu(softnet_data, i);
>  
> +		memset(sd, 0, sizeof(*sd));
>  		skb_queue_head_init(&sd->input_pkt_queue);
>  		skb_queue_head_init(&sd->process_queue);
>  		sd->completion_queue = NULL;

Minor note : 
- You could have removed some NULL initializations after the memset()
- Also, per_cpu data is already 0 initialized, so memset() is not
strictly needed ;)

This is very minor, since we dont have a bss section in per_cpu data, so
it doesnt matter.




^ permalink raw reply

* Re: Kernel panic in fib_rules_lookup (kernel 2.6.32)
From: "Oleg A. Arkhangelsky" @ 2010-05-02 17:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1272817864.2173.123.camel@edumazet-laptop>

Hello,

02.05.10, 18:31, "Eric Dumazet" <eric.dumazet@gmail.com>:

> Le dimanche 02 mai 2010 à 14:46 +0400, "Oleg A. Arkhangelsky" a écrit :

>  > Got the same panic, at the same place (fib_rules_lookup+0xa2/0xd0).
>  > Looks like the problem with NULL dereference is somewhere in
>  > list_for_each_entry_rcu macro. But I don't understand how this can be.
>  > 
>  > Any thoughts? :(
>  > 
>  
>  Do you have any modify rules activity ?

It can be a nice clue, but unfortunately no. Moreover, we don't use rules at
all. Only default template:

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Also, there are no "floating" interfaces (such as ppp) on this PC. Only routes
are changing periodically (Quagga with full-view BGP sessions).

-- 
wbr, Oleg.

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Arjan van de Ven @ 2010-05-02 17:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <1272818131.2173.127.camel@edumazet-laptop>

On Sun, 02 May 2010 18:35:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote
> 
> 
> You mean we can wakeup a thread with something else than an IPI ?
> 

actually we can.

mwait is not only "go idle", it is "go idle until someone writes to
<THIS> cacheline". where <THIS> is set up with a "monitor" instruction.
We don't need to send an ipi per se.. all we need is to write to the
right cacheline that we're monitoring.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply

* mmotm 2010-04-28 - RCU whinges
From: Valdis.Kletnieks @ 2010-05-02 17:46 UTC (permalink / raw)
  To: Andrew Morton, Peter Zijlstra, Patrick McHardy, David S. Miller
  Cc: linux-kernel, netfilter-devel, netdev
In-Reply-To: <201004290021.o3T0L04Y028017@imap1.linux-foundation.org>

[-- Attachment #1: Type: text/plain, Size: 1844 bytes --]

On Wed, 28 Apr 2010 16:53:32 PDT, akpm@linux-foundation.org said:
> The mm-of-the-moment snapshot 2010-04-28-16-53 has been uploaded to
> 
>    http://userweb.kernel.org/~akpm/mmotm/

I thought we swatted all these, hit another one...

[    9.131490] ctnetlink v0.93: registering with nfnetlink.
[    9.131535]
[    9.131535] ===================================================
[    9.131704] [ INFO: suspicious rcu_dereference_check() usage. ]
[    9.131794] ---------------------------------------------------
[    9.131883] net/netfilter/nf_conntrack_ecache.c:88 invoked rcu_dereference_check() without protection!
[    9.131977]
[    9.131977] other info that might help us debug this:
[    9.131978]
[    9.132218]
[    9.132219] rcu_scheduler_active = 1, debug_locks = 0
[    9.132434] 1 lock held by swapper/1:
[    9.132519]  #0:  (nf_ct_ecache_mutex){+.+...}, at: [<ffffffff8148922d>] nf_conntrack_register_notifier+0x1a/0x75
[    9.132938]
[    9.132939] stack backtrace:
[    9.133129] Pid: 1, comm: swapper Tainted: G        W   2.6.34-rc5-mmotm0428 #1
[    9.133220] Call Trace:
[    9.133319]  [<ffffffff81064832>] lockdep_rcu_dereference+0xaa/0xb2
[    9.133410]  [<ffffffff81489250>] nf_conntrack_register_notifier+0x3d/0x75
[    9.133521]  [<ffffffff81b5a157>] ctnetlink_init+0x71/0xd5
[    9.133627]  [<ffffffff81b5a0e6>] ? ctnetlink_init+0x0/0xd5
[    9.133735]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
[    9.133843]  [<ffffffff81b2e68a>] kernel_init+0x144/0x1ce
[    9.133949]  [<ffffffff81003414>] kernel_thread_helper+0x4/0x10
[    9.134060]  [<ffffffff81598a40>] ? restore_args+0x0/0x30
[    9.134196]  [<ffffffff81b2e546>] ? kernel_init+0x0/0x1ce
[    9.134328]  [<ffffffff81003410>] ? kernel_thread_helper+0x0/0x10
[    9.134530] ip_tables: (C) 2000-2006 Netfilter Core Team
[    9.134655] TCP bic registered


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 17:47 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <20100502104304.6c6b2764@infradead.org>

Le dimanche 02 mai 2010 à 10:43 -0700, Arjan van de Ven a écrit :
> On Sun, 02 May 2010 18:35:31 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote
> > 
> > 
> > You mean we can wakeup a thread with something else than an IPI ?
> > 
> 
> actually we can.
> 
> mwait is not only "go idle", it is "go idle until someone writes to
> <THIS> cacheline". where <THIS> is set up with a "monitor" instruction.
> We don't need to send an ipi per se.. all we need is to write to the
> right cacheline that we're monitoring.
> 
> 

Thats a bit x86 specific, isnt it ?

But we want to eventually send a 'signal' to a cpu, even if not blocked
in idle, so that it can do following action :

/* Called from hardirq (IPI) context */
static void rps_trigger_softirq(void *data)
{
        struct softnet_data *sd = data;

        __napi_schedule(&sd->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

And it also should be portable ;)

If something else than an IPI is available, please let us know !

Thanks



^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Arjan van de Ven @ 2010-05-02 17:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <1272810448.2173.31.camel@edumazet-laptop>

On Sun, 02 May 2010 16:27:28 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> C2 latency seems to be 64  (us ?), while C1 seems to be 1

the processor_idle module has a "latency_factor" module parameter.
The default is 2, but sometimes people think 6 is a better value...
.. any chance you can try that value ?

Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
to be replaced with a net_schedule() kind of thing. The cpuidle code
currently has a weight factor for IO (based on measuring/experiments),
and maybe networking really needs another factor... so just having a
parallel concept with a different weight could be the right answer for
that.



> 
> Your CPU supports the following C-states : C1 C2 C3 
> Your BIOS reports the following C-states : C1 C2 
> 
> C3 seems to be disabled in BIOS

btw this C2 == marketing name C3, and C3 == marketing name C6

(too many translations ;-)

we'll fix powertop to report the marketing name soon.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply

* [PATCH] net: fix compile error due to double return type in SOCK_DEBUG
From: Jan Engelhardt @ 2010-05-02 18:00 UTC (permalink / raw)
  To: davem; +Cc: netdev

Fix this one:
include/net/sock.h: error: two or more data types in declaration specifiers

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
---
 include/net/sock.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 56df440..0f0b6d6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -74,7 +74,7 @@
 					printk(KERN_DEBUG msg); } while (0)
 #else
 /* Validate arguments and do nothing */
-static void inline int __attribute__ ((format (printf, 2, 3)))
+static inline void __attribute__ ((format (printf, 2, 3)))
 SOCK_DEBUG(struct sock *sk, const char *msg, ...)
 {
 }
-- 
1.7.0.5


^ permalink raw reply related

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 19:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <20100502105418.7abf83a7@infradead.org>

Le dimanche 02 mai 2010 à 10:54 -0700, Arjan van de Ven a écrit :
> On Sun, 02 May 2010 16:27:28 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > C2 latency seems to be 64  (us ?), while C1 seems to be 1
> 
> the processor_idle module has a "latency_factor" module parameter.
> The default is 2, but sometimes people think 6 is a better value...
> .. any chance you can try that value ?
> 

I tried 6 and 20, nothing changed ;(

> Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
> to be replaced with a net_schedule() kind of thing. The cpuidle code
> currently has a weight factor for IO (based on measuring/experiments),
> and maybe networking really needs another factor... so just having a
> parallel concept with a different weight could be the right answer for
> that.
> 

But a task blocked on disk IO is probably blocked for a small amount of
time, while on network, it can be for a long time. I am not sure its the
right metric.

I was expecting something based on recent history.
Say if we have 20.000 wakeups per second, most likely we should not
enter C2/C3 states...

> 
> we'll fix powertop to report the marketing name soon.
> 
> 

Ah, I see, thanks :)



^ permalink raw reply

* Re: [PATCH] net: fix compile error due to double return type in SOCK_DEBUG
From: David Miller @ 2010-05-02 20:43 UTC (permalink / raw)
  To: jengelh; +Cc: netdev
In-Reply-To: <1272823202-25957-1-git-send-email-jengelh@medozas.de>

From: Jan Engelhardt <jengelh@medozas.de>
Date: Sun,  2 May 2010 20:00:01 +0200

> Fix this one:
> include/net/sock.h: error: two or more data types in declaration specifiers
> 
> Signed-off-by: Jan Engelhardt <jengelh@medozas.de>

Applied, thanks Jan.

Shows how many people actually try to mess with that ifdef.
I think we should just kill it off.

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02 21:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <1272818131.2173.127.camel@edumazet-laptop>

> You mean we can wakeup a thread with something else than an IPI ?

It's pointless to send an IPI to your thread sibling for this. 
Everything it could do you can do yourself too with the same performance.

-Andi

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02 21:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Eric Dumazet, David Miller, hadi, xiaosuo, therbert, shemminger,
	netdev, lenb
In-Reply-To: <20100502105418.7abf83a7@infradead.org>

On Sun, May 02, 2010 at 10:54:18AM -0700, Arjan van de Ven wrote:
> On Sun, 02 May 2010 16:27:28 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > C2 latency seems to be 64  (us ?), while C1 seems to be 1
> 
> the processor_idle module has a "latency_factor" module parameter.
> The default is 2, but sometimes people think 6 is a better value...
> .. any chance you can try that value ?
> 
> Also, I'm starting to wonder if Andi's patch to use io_schedule() needs
> to be replaced with a net_schedule() kind of thing. The cpuidle code
> currently has a weight factor for IO (based on measuring/experiments),
> and maybe networking really needs another factor... so just having a
> parallel concept with a different weight could be the right answer for
> that.

We definitely need a net_schedule() for other reasons too: to avoid the blkio 
wait code and then also because networking needs a short "fast idle" timeout 
because the delays are not bounded.  

Otherwise a sender that suddenly stops sending could break all your power 
saving.

I think the reference count used in io_schedule is not the right model for 
this, probably needs a per cpu timeout ("be fast until this time"). Possibly 
a dynamic one feed by the measured input rate.

-Andi

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 21:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100502212550.GA2673@gargoyle.fritz.box>

Le dimanche 02 mai 2010 à 23:25 +0200, Andi Kleen a écrit :

> It's pointless to send an IPI to your thread sibling for this. 
> Everything it could do you can do yourself too with the same performance.
> 
> -Andi

Amen

Tests just prove the reverse.

I have some collegues that disable HyperThreading for exact same
reasons. I wonder why Intel designed HT. Should be marketing I guess.




^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02 21:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <1272836755.2173.153.camel@edumazet-laptop>

On Sun, May 02, 2010 at 11:45:55PM +0200, Eric Dumazet wrote:
> Le dimanche 02 mai 2010 à 23:25 +0200, Andi Kleen a écrit :
> 
> > It's pointless to send an IPI to your thread sibling for this. 
> > Everything it could do you can do yourself too with the same performance.
> > 
> > -Andi
> 
> Amen

That is in terms of cache locality.

> 
> Tests just prove the reverse.

What do you mean? 

> 
> I have some collegues that disable HyperThreading for exact same
> reasons. I wonder why Intel designed HT. Should be marketing I guess.

HT (especially Nehalem HT) is useful for a wide range of workloads.
Just handling network interrupts for its thread sibling is not one of them.

-Andi


^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Andi Kleen @ 2010-05-02 22:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Arjan van de Ven, David Miller, hadi, xiaosuo, therbert,
	shemminger, netdev, lenb
In-Reply-To: <1272828143.2173.150.camel@edumazet-laptop>

> But a task blocked on disk IO is probably blocked for a small amount of
> time, while on network, it can be for a long time. I am not sure its the
> right metric.

I think it needs a dynamic timeout.

I agree the reference count as is will not work well for networking.

> 
> I was expecting something based on recent history.
> Say if we have 20.000 wakeups per second, most likely we should not
> enter C2/C3 states...

That's what the menu governour already does, it just doesn't work
in some cases :/

-Andi


^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-05-02 22:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, hadi, xiaosuo, therbert, shemminger, netdev, lenb,
	arjan
In-Reply-To: <20100502215450.GC2673@gargoyle.fritz.box>

Le dimanche 02 mai 2010 à 23:54 +0200, Andi Kleen a écrit :
> On Sun, May 02, 2010 at 11:45:55PM +0200, Eric Dumazet wrote:

> > Tests just prove the reverse.
> 
> What do you mean? 
> 

Test I did this week with Jamal.

We first set a "ee" rps mask, because all NIC interrupts were handled by
CPU0, and Jamal thought like you, that not using cpu4 would give better
performance.

But using "fe" mask gave me a bonus, from ~700.000 pps to ~800.000 pps

CPU : E5450  @3.00GHz
Two quad-core cpus in the machine, tg3 NIC.

With RPS, CPU0 does not a lot of things, just talk with the NIC, bring a
few cache lines per packet and dispatch it to a slave cpu.



> HT (especially Nehalem HT) is useful for a wide range of workloads.
> Just handling network interrupts for its thread sibling is not one of them.
> 

Thats the theory, now in practice I see different results.

Of course, this might be related to hash distribution being different
and more uniform.

I should redo the test with many more flows.




^ permalink raw reply

* [PATCH] ixgb and e1000: Use new function for copybreak tests
From: Joe Perches @ 2010-05-03  0:46 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Jesse Brandeburg, Bruce Allan, Alex Duyck, PJ Waskiewicz,
	John Ronciak, e1000-devel, netdev, LKML

There appears to be an off-by-1 defect in the maximum packet size
copied when copybreak is speified in these modules.

The copybreak module params are specified as:
"Maximum size of packet that is copied to a new buffer on receive"

The tests are changed from "< copybreak" to "<= copybreak"
and moved into new static functions for readability.

Signed-off-by: Joe Perches <joe@perches.com>
---
 drivers/net/e1000/e1000_main.c |   47 ++++++++++++++++++++---------------
 drivers/net/ixgb/ixgb_main.c   |   52 +++++++++++++++++++++++----------------
 2 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index e6ebc22..2adc83c 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3773,6 +3773,31 @@ next_desc:
 	return cleaned;
 }
 
+/*
+ * this should improve performance for small packets with large amounts
+ * of reassembly being done in the stack
+ */
+static void e1000_check_copybreak(struct net_device *netdev,
+				 struct e1000_buffer *buffer_info,
+				 u32 length, struct sk_buff **skb)
+{
+	struct sk_buff *new_skb;
+
+	if (length > copybreak)
+		return;
+
+	new_skb = netdev_alloc_skb_ip_align(netdev, length);
+	if (!new_skb)
+		return;
+
+	skb_copy_to_linear_data_offset(new_skb, -NET_IP_ALIGN,
+				       (*skb)->data - NET_IP_ALIGN,
+				       length + NET_IP_ALIGN);
+	/* save the skb in buffer_info as good */
+	buffer_info->skb = *skb;
+	*skb = new_skb;
+}
+
 /**
  * e1000_clean_rx_irq - Send received data up the network stack; legacy
  * @adapter: board private structure
@@ -3871,26 +3896,8 @@ static bool e1000_clean_rx_irq(struct e1000_adapter *adapter,
 		total_rx_bytes += length;
 		total_rx_packets++;
 
-		/* code added for copybreak, this should improve
-		 * performance for small packets with large amounts
-		 * of reassembly being done in the stack */
-		if (length < copybreak) {
-			struct sk_buff *new_skb =
-			    netdev_alloc_skb_ip_align(netdev, length);
-			if (new_skb) {
-				skb_copy_to_linear_data_offset(new_skb,
-							       -NET_IP_ALIGN,
-							       (skb->data -
-							        NET_IP_ALIGN),
-							       (length +
-							        NET_IP_ALIGN));
-				/* save the skb in buffer_info as good */
-				buffer_info->skb = skb;
-				skb = new_skb;
-			}
-			/* else just continue with the old one */
-		}
-		/* end copybreak code */
+		e1000_check_copybreak(netdev, buffer_info, length, &skb);
+
 		skb_put(skb, length);
 
 		/* Receive Checksum Offload */
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index d58ca6b..c6b75c8 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1921,6 +1921,31 @@ ixgb_rx_checksum(struct ixgb_adapter *adapter,
 	}
 }
 
+/*
+ * this should improve performance for small packets with large amounts
+ * of reassembly being done in the stack
+ */
+static void ixgb_check_copybreak(struct net_device *netdev,
+				 struct ixgb_buffer *buffer_info,
+				 u32 length, struct sk_buff **skb)
+{
+	struct sk_buff *new_skb;
+
+	if (length > copybreak)
+		return;
+
+	new_skb = netdev_alloc_skb_ip_align(netdev, length);
+	if (!new_skb)
+		return;
+
+	skb_copy_to_linear_data_offset(new_skb, -NET_IP_ALIGN,
+				       (*skb)->data - NET_IP_ALIGN,
+				       length + NET_IP_ALIGN);
+	/* save the skb in buffer_info as good */
+	buffer_info->skb = *skb;
+	*skb = new_skb;
+}
+
 /**
  * ixgb_clean_rx_irq - Send received data up the network stack,
  * @adapter: board private structure
@@ -1957,11 +1982,14 @@ ixgb_clean_rx_irq(struct ixgb_adapter *adapter, int *work_done, int work_to_do)
 
 		prefetch(skb->data - NET_IP_ALIGN);
 
-		if (++i == rx_ring->count) i = 0;
+		if (++i == rx_ring->count)
+			i = 0;
 		next_rxd = IXGB_RX_DESC(*rx_ring, i);
 		prefetch(next_rxd);
 
-		if ((j = i + 1) == rx_ring->count) j = 0;
+		j = i + 1;
+		if (j == rx_ring->count)
+			j = 0;
 		next2_buffer = &rx_ring->buffer_info[j];
 		prefetch(next2_buffer);
 
@@ -1997,25 +2025,7 @@ ixgb_clean_rx_irq(struct ixgb_adapter *adapter, int *work_done, int work_to_do)
 			goto rxdesc_done;
 		}
 
-		/* code added for copybreak, this should improve
-		 * performance for small packets with large amounts
-		 * of reassembly being done in the stack */
-		if (length < copybreak) {
-			struct sk_buff *new_skb =
-			    netdev_alloc_skb_ip_align(netdev, length);
-			if (new_skb) {
-				skb_copy_to_linear_data_offset(new_skb,
-							       -NET_IP_ALIGN,
-							       (skb->data -
-							        NET_IP_ALIGN),
-							       (length +
-							        NET_IP_ALIGN));
-				/* save the skb in buffer_info as good */
-				buffer_info->skb = skb;
-				skb = new_skb;
-			}
-		}
-		/* end copybreak code */
+		ixgb_check_copybreak(netdev, buffer_info, length, &skb);
 
 		/* Good Receive */
 		skb_put(skb, length);



^ permalink raw reply related

* linux-next: build failure after merge of the net tree
From: Stephen Rothwell @ 2010-05-03  2:08 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: linux-next, linux-kernel, Eric Dumazet, Wei Yongjun

[-- Attachment #1: Type: text/plain, Size: 774 bytes --]

Hi Dave,

After merging the net tree, today's linux-next build (x86_64
allmodconfig) failed like this:

net/sctp/socket.c: In function 'sctp_data_ready':
net/sctp/socket.c:6191: error: implicit declaration of function 'sk_has_sleeper'
net/sctp/socket.c:6192: error: 'struct sock' has no member named 'sk_sleep'

Caused by commit 43815482370c510c569fd18edb57afcb0fa8cab6 ("net:
sock_def_readable() and friends RCU conversion") interacting with commit
561b1733a465cf9677356b40c27653dd45f1ac56 ("sctp: avoid irq lock inversion
while call sk->sk_data_ready()") which entered Linus' tree on April 29.

I have used the net tree from next-20100430 for today.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: linux-next: build failure after merge of the net tree
From: David Miller @ 2010-05-03  2:37 UTC (permalink / raw)
  To: sfr; +Cc: netdev, linux-next, linux-kernel, eric.dumazet, yjwei
In-Reply-To: <20100503120847.e9328e75.sfr@canb.auug.org.au>

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 3 May 2010 12:08:47 +1000

> After merging the net tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> net/sctp/socket.c: In function 'sctp_data_ready':
> net/sctp/socket.c:6191: error: implicit declaration of function 'sk_has_sleeper'
> net/sctp/socket.c:6192: error: 'struct sock' has no member named 'sk_sleep'
> 
> Caused by commit 43815482370c510c569fd18edb57afcb0fa8cab6 ("net:
> sock_def_readable() and friends RCU conversion") interacting with commit
> 561b1733a465cf9677356b40c27653dd45f1ac56 ("sctp: avoid irq lock inversion
> while call sk->sk_data_ready()") which entered Linus' tree on April 29.
> 
> I have used the net tree from next-20100430 for today.

Thanks Stephen, I'll do a merge of net-2.6 into net-next-2.6 and fix
the build for you.

^ permalink raw reply

* Re: Receive issues with bonding and vlans
From: John Fastabend @ 2010-05-03  3:04 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Leech, Christopher, netdev@vger.kernel.org, Andy Gospodarek,
	Patrick McHardy, bonding-devel@lists.sourceforge.net
In-Reply-To: <2695.1271117318@death.nxdomain.ibm.com>

Jay Vosburgh wrote:
> Chris Leech <christopher.leech@intel.com> wrote:
> 
>> On Mon, Apr 12, 2010 at 04:10:51PM -0700, Jay Vosburgh wrote:
>>> 	Is the FCoE supposed to run over the inactive bonding slave?  Or
>>> am I misunderstanding what you're saying?  I had thought the LLDP, et
>>> al, special case in bonding was to permit, essentially, path discovery,
>>> not necessarily active use of the inactive slave.
>> That's what I'm trying to do, yes.  Mostly because it's a setup that
>> would work if you removed the FCoE traffic from the network data path,
>> and only converged at the driver level and below.  It's possible that
>> the answer is "don't do that".
> 
> 	So, basically, you want the bond to act like usual for "regular"
> ethernet traffic, but act like the slaves are independent from the bond
> for the magic FCoE traffic, right?
> 
> 	I'm not really sure if that's a "don't do that" or not.
> 
>>> 	Not that this is necessarily bad; the "drop stuff on inactive
>>> slaves" is really there for duplicate suppression, but it also can
>>> uncover network topology issues, e.g., network layouts that won't work
>>> if the devices fail, but appear to work during testing because the
>>> "inactive" slave still receives traffic (it hasn't really failed).
>>>
>>>> The problem is that it doesn't work for hardware accelerated VLAN
>>>> devices, because the VLAN receive paths have their own
>>>> skb_bond_should_drop calls that were not updated.
>>>>
>>> >From what I can tell, VLAN receives always end up going through
>>>> netif_receive_skb anyway, so skb_bond_should_drop gets called twice if
>>>> the frame isn't dropped the first time.  I think the bonding checks in
>>>> __vlan_hwaccel_rx and vlan_gro_common should just be removed.
>>> 	I'm not so sure.  The checks in __vlan_hwaccel_rx are done with
>>> the original receiving device in skb->dev; by the time the packet gets
>>> to netif_receive_skb, the original slave the packet was received on has
>>> been lost (and replaced with the VLAN device).  Various things are
>>> interested in that, in particular the "arp_validate" and the "inactive
>>> slave drop" logic for bonding depend on knowing the real device the
>>> packet arrived on.
>>>
>>> 	I note that the vlan accel logic doesn't change skb_iif to the
>>> VLAN device; it remains as the original device.  I suppose one
>>> alternative would be to convert the bonding drop, et al, logic to use
>>> skb_iif instead of skb->dev; if that works, then I think the VLAN core
>>> would not need to call skb_bond_should_drop, which in turn would be a
>>> bit more complicated as it would have to look up the dev from the
>>> skb_iif.  There's already some code in bonding that takes advantage of
>>> this property of the VLANs, so maybe this is the way to go.
>> Thanks, I'll take another look and see if I can come up with something
>> better.
> 
> 	I looked at the skb_bond_should_drop stuff a bit more after I
> wrote that; it's not as easy as I had suspected.  The big sticking point
> is that currently the test in netif_receive_skb (now __netif_receive_skb
> in net-next-2.6) is on skb->dev->master to identify packets arriving on
> slaves of bonding.  The VLAN skb->dev has ->master set to NULL.  Doing
> that test against skb->skb_iif would be much more expensive, as it would
> require a device lookup for every packet.
> 
> 	So, I suspect that something has to happen in the VLAN
> acceleration path, although I don't know exactly what.  I don't know if
> it would be possible to flag the packets in some special way to indicate
> that they're "bonding slave" packets, or if it's better to keep the
> current structure and just fix the calls somehow.
> 
> 	-J
> 
> 

Jay,

It should be OK to allow packets to be received on the VLAN if it is not 
explicitly in the bond?

Or if we want to be more paranoid deliver packets only to handlers with 
exact matches for the device. For non vlan devices we deliver skb's to 
packet handlers that match exactly even on inactive slaves so doing this 
on vlan devices as well makes sense and shouldn't cause any unexpected 
problems.

Also on a somewhat unrelated note I suspect null_or_orig and 
null_or_bond are not working as expected in __netif_receive_skb().  At 
least the comment 'deliver only exact match' could be inaccurate.

Here's a patch to illustrate what I'm thinking compile tested only.  If 
this sounds reasonable I'll work up an official patch.


[PATCH] net: allow vlans on bonded real net_devices

For converged I/O it is reasonable to use dm_multipathing to provice
failover and load balancing for storage traffic and then use bonding
for the LAN failover and load balancing.

Currently this works if the multipathed devices are using the real
net_device and those devices are enslaved to a bonded device what
does not work is creating a vlan on the real device and trying to
use it outside the bond for multipathing.

This patch adds logic so that if the skb is destined for a vlan
that is not in the bond the skb will not be dropped.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

  net/8021q/vlan_core.c |   31 +++++++++++++++++++++----------
  net/core/dev.c        |   11 ++++++++---
  2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index c584a0a..3bce0c3 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -8,18 +8,24 @@
  int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
  		      u16 vlan_tci, int polling)
  {
+	struct net_device *vlan_dev;
+
  	if (netpoll_rx(skb))
  		return NET_RX_DROP;

-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
+	vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
+
+	if (!vlan_dev)
+		goto drop;
+
+	if ((vlan_dev->priv_flags & IFF_BONDING ||
+	    vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
+	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
  		goto drop;

  	skb->skb_iif = skb->dev->ifindex;
  	__vlan_hwaccel_put_tag(skb, vlan_tci);
-	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
-
-	if (!skb->dev)
-		goto drop;
+	skb->dev = vlan_dev;

  	return (polling ? netif_receive_skb(skb) : netif_rx(skb));

@@ -82,16 +88,21 @@ vlan_gro_common(struct napi_struct *napi, struct 
vlan_group *grp,
  		unsigned int vlan_tci, struct sk_buff *skb)
  {
  	struct sk_buff *p;
+	struct net_device *vlan_dev;

-	if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
+	vlan_dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
+
+	if (!vlan_dev)
+		goto drop;
+
+	if ((vlan_dev->priv_flags & IFF_BONDING ||
+	    vlan_dev_real_dev(vlan_dev)->flags & IFF_MASTER) &&
+	    skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
  		goto drop;

  	skb->skb_iif = skb->dev->ifindex;
  	__vlan_hwaccel_put_tag(skb, vlan_tci);
-	skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
-
-	if (!skb->dev)
-		goto drop;
+	skb->dev = vlan_dev;

  	for (p = napi->gro_list; p; p = p->next) {
  		NAPI_GRO_CB(p)->same_flow =
diff --git a/net/core/dev.c b/net/core/dev.c
index 100dcbd..9ea4550 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2780,6 +2780,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
  	struct net_device *master;
  	struct net_device *null_or_orig;
  	struct net_device *null_or_bond;
+	struct net_device *real_dev;
  	int ret = NET_RX_DROP;
  	__be16 type;

@@ -2853,9 +2854,13 @@ ncls:
  	 * handler may have to adjust skb->dev and orig_dev.
  	 */
  	null_or_bond = NULL;
-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
-		null_or_bond = vlan_dev_real_dev(skb->dev);
+	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN)) {
+		real_dev = vlan_dev_real_dev(skb->dev);
+		if (real_dev->priv_flags & IFF_BONDING)
+			null_or_bond = vlan_dev_real_dev(skb->dev);
+		if (!(skb->dev->priv_flags & IFF_BONDING) &&
+		    real_dev->priv_flags & IFF_SLAVE_INACTIVE)
+			null_or_orig = skb->dev;
  	}

  	type = skb->protocol;

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox