Netdev List
 help / color / mirror / Atom feed
* Re: BUG: skge ethernet breakage (PCI: Unable to reserve mem region)
From: Stephen Hemminger @ 2007-09-11 12:37 UTC (permalink / raw)
  To: Jan Gukelberger; +Cc: netdev
In-Reply-To: <1189508304.4069.5.camel@localhost>

On Tue, 11 Sep 2007 12:58:24 +0200
Jan Gukelberger <g.u.g.i@gmx.de> wrote:

> On Tue, 2007-09-11 at 10:21 +0200, Stephen Hemminger wrote:
> > On Fri, 07 Sep 2007 18:42:35 +0200
> > Jan Gukelberger <g.u.g.i@gmx.de> wrote:
> [...]
> > > The key problem seem to be the following lines in dmesg:
> > > ------------------------------------------------------------------------
> > > ACPI: PCI Interrupt 0000:04:04.0[A] -> GSI 19 (level, low) -> IRQ 19
> > > PCI: Unable to reserve mem region #1:4000@ff9f8000 for device 0000:04:04.0
> > > skge 0000:04:04.0: cannot obtain PCI resources
> > > ACPI: PCI interrupt for device 0000:04:04.0 disabled
> > > skge: probe of 0000:04:04.0 failed with error -16
> > > ------------------------------------------------------------------------
> > > 
> > 
> > There is some kind of device conflict, please provide lspci -vvvxx output.
> 
> I'm attaching the output of 'lspci -vvvxx' on the working 2.6.20 kernel
> as well as the output of 'lspci -vvxxx' on 2.6.23-rc5 which I recorded
> earlier.
> I you specifically need 'lspci -vvvxx' on 2.6.23-rc5 please drop me a
> note and I'll reboot quickly.
> 
> Thanks,
> Jan

All looks in order, on the PCI tables. There is a firewire control just
above the skge device, perhaps you enabled one of the firewire stacks
in the configuration?  Perhaps the console (dmesg) output will show some clue.

^ permalink raw reply

* Re: 2.6.23-rc5: possible irq lock inversion dependency detected
From: Herbert Xu @ 2007-09-11 12:43 UTC (permalink / raw)
  To: jamal, David S. Miller; +Cc: Christian Kujau, linux-kernel, netdev
In-Reply-To: <1189512106.4231.6.camel@localhost>

On Tue, Sep 11, 2007 at 08:01:46AM -0400, jamal wrote:
>
> [NET_SCHED] protect action config/dump from irqs

Looks good! Thanks Jamal.
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH] NET : convert IP route cache garbage colleciton from softirq processing to a workqueue
From: Eric Dumazet @ 2007-09-11 12:56 UTC (permalink / raw)
  To: Herbert Xu, David Miller; +Cc: netdev@vger.kernel.org

Hi Herbert and all

When the periodic IP route cache flush is done (every 600 seconds on 
default configuration), some hosts suffer a lot and eventually trigger
the "soft lockup" message.

dst_run_gc() is doing a scan of a possibly huge list of dst_entries,
eventually freeing some (less than 1%) of them, while holding the 
dst_lock spinlock for the whole scan.

Then it rearms a timer to redo the full thing 1/10 s later...
The slowdown can last one minute or so, depending on how active are
the tcp sessions.

This second version of the patch converts the processing from a softirq
based one to a workqueue.

Even if the list of entries in garbage_list is huge, host is still
responsive to softirqs and can make progress.

Instead of reseting gc timer to 0.1 second if one entry was freed in a
gc run, we do this if more than 10% of entries were freed.


Before patch :

Aug 16 06:21:37 SRV1 kernel: BUG: soft lockup detected on CPU#0!
Aug 16 06:21:37 SRV1 kernel: 
Aug 16 06:21:37 SRV1 kernel: Call Trace:
Aug 16 06:21:37 SRV1 kernel:  <IRQ>  [<ffffffff802286f0>] wake_up_process+0x10/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80251e09>] softlockup_tick+0xe9/0x110
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd380>] dst_run_gc+0x0/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802376f3>] run_local_timers+0x13/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802379c7>] update_process_times+0x57/0x90
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80216034>] smp_local_timer_interrupt+0x34/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802165cc>] smp_apic_timer_interrupt+0x5c/0x80
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8020a816>] apic_timer_interrupt+0x66/0x70
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd3d3>] dst_run_gc+0x53/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd3c6>] dst_run_gc+0x46/0x140
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80237148>] run_timer_softirq+0x148/0x1c0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8023340c>] __do_softirq+0x6c/0xe0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8020ad6c>] call_softirq+0x1c/0x30
Aug 16 06:21:37 SRV1 kernel:  <EOI>  [<ffffffff8020cb34>] do_softirq+0x34/0x90
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff802331cf>] local_bh_enable_ip+0x3f/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422913>] _spin_unlock_bh+0x13/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803dfde8>] rt_garbage_collect+0x1d8/0x320
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803cd4dd>] dst_alloc+0x1d/0xa0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803e1433>] __ip_route_output_key+0x573/0x800
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c02e2>] sock_common_recvmsg+0x32/0x50
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803e16dc>] ip_route_output_flow+0x1c/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80400160>] tcp_v4_connect+0x150/0x610
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803ebf07>] inet_bind_bucket_create+0x17/0x60
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8040cd16>] inet_stream_connect+0xa6/0x2c0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422981>] _spin_lock_bh+0x11/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c0bdf>] lock_sock_nested+0xcf/0xe0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80422981>] _spin_lock_bh+0x11/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803be551>] sys_connect+0x71/0xa0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803eee3f>] tcp_setsockopt+0x1f/0x30
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803c030f>] sock_common_setsockopt+0xf/0x20
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff803be4bd>] sys_setsockopt+0x9d/0xc0
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff8028881e>] sys_ioctl+0x5e/0x80
Aug 16 06:21:37 SRV1 kernel:  [<ffffffff80209c4e>] system_call+0x7e/0x83

After patch : (RT_CACHE_DEBUG set to 2 to get following traces)

dst_total: 75469 delayed: 74109 work_perf: 141 expires: 150 elapsed: 8092 us
dst_total: 78725 delayed: 73366 work_perf: 743 expires: 400 elapsed: 8542 us
dst_total: 86126 delayed: 71844 work_perf: 1522 expires: 775 elapsed: 8849 us
dst_total: 100173 delayed: 68791 work_perf: 3053 expires: 1256 elapsed: 9748 us
dst_total: 121798 delayed: 64711 work_perf: 4080 expires: 1997 elapsed: 10146 us
dst_total: 154522 delayed: 58316 work_perf: 6395 expires: 25 elapsed: 11402 us
dst_total: 154957 delayed: 58252 work_perf: 64 expires: 150 elapsed: 6148 us
dst_total: 157377 delayed: 57843 work_perf: 409 expires: 400 elapsed: 6350 us
dst_total: 163745 delayed: 56679 work_perf: 1164 expires: 775 elapsed: 7051 us
dst_total: 176577 delayed: 53965 work_perf: 2714 expires: 1389 elapsed: 8120 us
dst_total: 198993 delayed: 49627 work_perf: 4338 expires: 1997 elapsed: 8909 us
dst_total: 226638 delayed: 46865 work_perf: 2762 expires: 2748 elapsed: 7351 us

I successfully reduced the IP route cache of many hosts by a four factor 
thanks to this patch. Previously, I had to disable "ip route flush cache"
to avoid crashes.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/core/dst.c b/net/core/dst.c
index c6a0587..527b4c2 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -9,6 +9,7 @@
 #include <linux/errno.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/workqueue.h>
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/netdevice.h>
@@ -18,50 +19,67 @@
 
 #include <net/dst.h>
 
-/* Locking strategy:
- * 1) Garbage collection state of dead destination cache
- *    entries is protected by dst_lock.
- * 2) GC is run only from BH context, and is the only remover
- *    of entries.
- * 3) Entries are added to the garbage list from both BH
- *    and non-BH context, so local BH disabling is needed.
- * 4) All operations modify state, so a spinlock is used.
+/*
+ * Theory of operations:
+ * 1) We use a list, protected by a spinlock, to add
+ *    new entries from both BH and non-BH context.
+ * 2) In order to keep spinlock held for a small delay,
+ *    we use a second list where are stored long lived
+ *    entries, that are handled by the garbage collect thread
+ *    fired by a workqueue.
+ * 3) This list is guarded by a mutex,
+ *    so that the gc_task and dst_dev_event() can be synchronized.
  */
-static struct dst_entry 	*dst_garbage_list;
 #if RT_CACHE_DEBUG >= 2
 static atomic_t			 dst_total = ATOMIC_INIT(0);
 #endif
-static DEFINE_SPINLOCK(dst_lock);
 
-static unsigned long dst_gc_timer_expires;
-static unsigned long dst_gc_timer_inc = DST_GC_MAX;
-static void dst_run_gc(unsigned long);
+static struct {
+	spinlock_t		lock;
+	struct dst_entry 	*list;
+	unsigned long		timer_inc;
+	unsigned long		timer_expires;
+} dst_garbage = {
+	.lock = __SPIN_LOCK_UNLOCKED(dst_garbage.lock),
+	.timer_inc = DST_GC_MAX,
+};
+static void dst_gc_task(struct work_struct *work);
 static void ___dst_free(struct dst_entry * dst);
 
-static DEFINE_TIMER(dst_gc_timer, dst_run_gc, DST_GC_MIN, 0);
+DECLARE_DELAYED_WORK(dst_gc_work, dst_gc_task);
+
+static DEFINE_MUTEX(dst_gc_mutex);
+/*
+ * long lived entries are maintained in this list, guarded by dst_gc_mutex
+ */
+static struct dst_entry         *dst_busy_list;
 
-static void dst_run_gc(unsigned long dummy)
+static void dst_gc_task(struct work_struct *work)
 {
 	int    delayed = 0;
-	int    work_performed;
-	struct dst_entry * dst, **dstp;
+	int    work_performed = 0;
+	unsigned long expires = ~0L;
+	struct dst_entry *dst, *next, head;
+	struct dst_entry *last = &head;
+#if RT_CACHE_DEBUG >= 2
+	ktime_t time_start = ktime_get();
+	struct timespec elapsed;
+#endif
 
-	if (!spin_trylock(&dst_lock)) {
-		mod_timer(&dst_gc_timer, jiffies + HZ/10);
-		return;
-	}
+	mutex_lock(&dst_gc_mutex);
+	next = dst_busy_list;
 
-	del_timer(&dst_gc_timer);
-	dstp = &dst_garbage_list;
-	work_performed = 0;
-	while ((dst = *dstp) != NULL) {
-		if (atomic_read(&dst->__refcnt)) {
-			dstp = &dst->next;
+loop:
+	while ((dst = next) != NULL) {
+		next = dst->next;
+		__builtin_prefetch(&next->next, 1, 0);
+		if (likely(atomic_read(&dst->__refcnt))) {
+			last->next = dst;
+			last = dst;
 			delayed++;
 			continue;
 		}
-		*dstp = dst->next;
-		work_performed = 1;
+		work_performed++;
 
 		dst = dst_destroy(dst);
 		if (dst) {
@@ -77,38 +95,56 @@ static void dst_run_gc(unsigned long dummy)
 				continue;
 
 			___dst_free(dst);
-			dst->next = *dstp;
-			*dstp = dst;
-			dstp = &dst->next;
+			dst->next = next;
+			next = dst;
 		}
 	}
-	if (!dst_garbage_list) {
-		dst_gc_timer_inc = DST_GC_MAX;
-		goto out;
+
+	spin_lock_bh(&dst_garbage.lock);
+	next = dst_garbage.list;
+	if (next) {
+		dst_garbage.list = NULL;
+		spin_unlock_bh(&dst_garbage.lock);
+		goto loop;
 	}
-	if (!work_performed) {
-		if ((dst_gc_timer_expires += dst_gc_timer_inc) > DST_GC_MAX)
-			dst_gc_timer_expires = DST_GC_MAX;
-		dst_gc_timer_inc += DST_GC_INC;
-	} else {
-		dst_gc_timer_inc = DST_GC_INC;
-		dst_gc_timer_expires = DST_GC_MIN;
+	last->next = NULL;
+	dst_busy_list = head.next;
+	if (!dst_busy_list)
+		dst_garbage.timer_inc = DST_GC_MAX;
+	else {
+		/*
+		 * if we freed less than 1/10 of delayed entries,
+		 * we can sleep longer.
+		 */
+		if (work_performed <= delayed/10) {
+			dst_garbage.timer_expires += dst_garbage.timer_inc;
+			if (dst_garbage.timer_expires > DST_GC_MAX)
+				dst_garbage.timer_expires = DST_GC_MAX;
+			dst_garbage.timer_inc += DST_GC_INC;
+		} else {
+			dst_garbage.timer_inc = DST_GC_INC;
+			dst_garbage.timer_expires = DST_GC_MIN;
+		}
+		expires = dst_garbage.timer_expires;
+		/*
+		 * if the next desired timer is more than 4 seconds in the future
+		 * then round the timer to whole seconds
+		 */
+		if (expires > 4*HZ)
+			expires = round_jiffies_relative(expires);
+		schedule_delayed_work(&dst_gc_work, expires);
 	}
+
+	spin_unlock_bh(&dst_garbage.lock);
+	mutex_unlock(&dst_gc_mutex);
 #if RT_CACHE_DEBUG >= 2
-	printk("dst_total: %d/%d %ld\n",
-	       atomic_read(&dst_total), delayed,  dst_gc_timer_expires);
+	elapsed = ktime_to_timespec(ktime_sub(ktime_get(), time_start));
+	printk(KERN_DEBUG "dst_total: %d delayed: %d work_perf: %d"
+		" expires: %lu elapsed: %lu us\n",
+		atomic_read(&dst_total), delayed, work_performed,
+		expires,
+		elapsed.tv_sec * USEC_PER_SEC + elapsed.tv_nsec / NSEC_PER_USEC);
 #endif
-	/* if the next desired timer is more than 4 seconds in the future
-	 * then round the timer to whole seconds
-	 */
-	if (dst_gc_timer_expires > 4*HZ)
-		mod_timer(&dst_gc_timer,
-			round_jiffies(jiffies + dst_gc_timer_expires));
-	else
-		mod_timer(&dst_gc_timer, jiffies + dst_gc_timer_expires);
-
-out:
-	spin_unlock(&dst_lock);
 }
 
 static int dst_discard(struct sk_buff *skb)
@@ -153,16 +189,16 @@ static void ___dst_free(struct dst_entry * dst)
 
 void __dst_free(struct dst_entry * dst)
 {
-	spin_lock_bh(&dst_lock);
+	spin_lock_bh(&dst_garbage.lock);
 	___dst_free(dst);
-	dst->next = dst_garbage_list;
-	dst_garbage_list = dst;
-	if (dst_gc_timer_inc > DST_GC_INC) {
-		dst_gc_timer_inc = DST_GC_INC;
-		dst_gc_timer_expires = DST_GC_MIN;
-		mod_timer(&dst_gc_timer, jiffies + dst_gc_timer_expires);
+	dst->next = dst_garbage.list;
+	dst_garbage.list = dst;
+	if (dst_garbage.timer_inc > DST_GC_INC) {
+		dst_garbage.timer_inc = DST_GC_INC;
+		dst_garbage.timer_expires = DST_GC_MIN;
+		schedule_delayed_work(&dst_gc_work, dst_garbage.timer_expires);
 	}
-	spin_unlock_bh(&dst_lock);
+	spin_unlock_bh(&dst_garbage.lock);
 }
 
 struct dst_entry *dst_destroy(struct dst_entry * dst)
@@ -250,16 +286,30 @@ static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 static int dst_dev_event(struct notifier_block *this, unsigned long event, void *ptr)
 {
 	struct net_device *dev = ptr;
-	struct dst_entry *dst;
+	struct dst_entry *dst, *last = NULL;
 
 	switch (event) {
 	case NETDEV_UNREGISTER:
 	case NETDEV_DOWN:
-		spin_lock_bh(&dst_lock);
-		for (dst = dst_garbage_list; dst; dst = dst->next) {
+		mutex_lock(&dst_gc_mutex);
+		for (dst = dst_busy_list; dst; dst = dst->next) {
+			last = dst;
+			dst_ifdown(dst, dev, event != NETDEV_DOWN);
+		}
+
+		spin_lock_bh(&dst_garbage.lock);
+		dst = dst_garbage.list;
+		dst_garbage.list = NULL;
+		spin_unlock_bh(&dst_garbage.lock);
+
+		if (last)
+			last->next = dst;
+		else
+			dst_busy_list = dst;
+		for (; dst; dst = dst->next) {
 			dst_ifdown(dst, dev, event != NETDEV_DOWN);
 		}
-		spin_unlock_bh(&dst_lock);
+		mutex_unlock(&dst_gc_mutex);
 		break;
 	}
 	return NOTIFY_DONE;

^ permalink raw reply related

* Re: BUG: skge ethernet breakage (PCI: Unable to reserve mem region)
From: Jan Gukelberger @ 2007-09-11 13:39 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20070911143701.628c8d18@oldman>

[-- Attachment #1: Type: text/plain, Size: 2548 bytes --]

On Tue, 2007-09-11 at 14:37 +0200, Stephen Hemminger wrote:
> On Tue, 11 Sep 2007 12:58:24 +0200
> Jan Gukelberger <g.u.g.i@gmx.de> wrote:
> 
> > On Tue, 2007-09-11 at 10:21 +0200, Stephen Hemminger wrote:
> > > On Fri, 07 Sep 2007 18:42:35 +0200
> > > Jan Gukelberger <g.u.g.i@gmx.de> wrote:
> > [...]
> > > > The key problem seem to be the following lines in dmesg:
> > > > ------------------------------------------------------------------------
> > > > ACPI: PCI Interrupt 0000:04:04.0[A] -> GSI 19 (level, low) -> IRQ 19
> > > > PCI: Unable to reserve mem region #1:4000@ff9f8000 for device 0000:04:04.0
> > > > skge 0000:04:04.0: cannot obtain PCI resources
> > > > ACPI: PCI interrupt for device 0000:04:04.0 disabled
> > > > skge: probe of 0000:04:04.0 failed with error -16
> > > > ------------------------------------------------------------------------
> > > > 
> > > 
> > > There is some kind of device conflict, please provide lspci -vvvxx output.
> > 
> > I'm attaching the output of 'lspci -vvvxx' on the working 2.6.20 kernel
> > as well as the output of 'lspci -vvxxx' on 2.6.23-rc5 which I recorded
> > earlier.
> > I you specifically need 'lspci -vvvxx' on 2.6.23-rc5 please drop me a
> > note and I'll reboot quickly.
> > 
> > Thanks,
> > Jan
> 
> All looks in order, on the PCI tables. There is a firewire control just
> above the skge device, perhaps you enabled one of the firewire stacks
> in the configuration? 

I did a quick diff of the respective kernel .config's (this is the
configuration you mean, right?) and haven't found any notable
differences in the firewire options.

>  Perhaps the console (dmesg) output will show some clue.

I'm attaching a diff between dmesg of a working and a non-working boot.
You can find the full dmesg records in my first mail and in the Debian
BTS respectively.

The only thing I can see there is the old  kernel having some problems
with the SATA controller - even though I did never notice any unusual
behaviour apart from these messages:
------------------------------------------------------------------------
PCI: Device 0000:02:00.0 not available because of resource collisions
ahci: probe of 0000:02:00.0 failed with error -22
JMB363: IDE controller at PCI slot 0000:02:00.0
PCI: Device 0000:02:00.0 not available because of resource collisions
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
JMB363: BIOS configuration fixed.
------------------------------------------------------------------------

Don't know whether this could be related?

Thanks,
Jan

[-- Attachment #2: dmesg.diff --]
[-- Type: text/x-patch, Size: 25846 bytes --]

--- /home/jan/dmesg-2.6.20	2007-09-07 17:17:09.000000000 +0200
+++ dmesg-2.6.23-rc5	2007-09-07 18:13:09.000000000 +0200
@@ -1,4 +1,4 @@
-Linux version 2.6.20-1-amd64 (Debian 2.6.20-3) (waldi@debian.org) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP Tue Apr 24 21:10:58 UTC 2007
+Linux version 2.6.23-rc5-amd64 (Debian 2.6.23~rc5-1~experimental.1~snapshot.9462) (waldi@debian.org) (gcc version 4.1.3 20070718 (prerelease) (Debian 4.1.2-14+1)) #1 SMP Thu Sep 6 06:18:52 UTC 2007
 Command line: root=/dev/sda6 ro quiet vga=791 
 BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009ec00 (usable)
@@ -14,18 +14,19 @@
 Entering add_active_range(0, 256, 520080) 1 entries of 3200 used
 end_pfn_map = 1048576
 DMI 2.4 present.
-ACPI: RSDP (v002 ACPIAM                                ) @ 0x00000000000fad70
-ACPI: XSDT (v001 NEC             0x05000730 MSFT 0x00000097) @ 0x000000007ef90100
-ACPI: FADT (v003 MSTEST OEMFACP  0x05000730 MSFT 0x00000097) @ 0x000000007ef90290
-ACPI: MADT (v001 MSTEST OEMAPIC  0x05000730 MSFT 0x00000097) @ 0x000000007ef90390
-ACPI: MCFG (v001 MSTEST OEMMCFG  0x05000730 MSFT 0x00000097) @ 0x000000007ef90400
-ACPI: SLIC (v001 NEC             0x05000730 MSFT 0x00000097) @ 0x000000007ef90440
-ACPI: OEMB (v001 MSTEST AMI_OEM  0x05000730 MSFT 0x00000097) @ 0x000000007ef9e040
-ACPI: HPET (v001 MSTEST OEMHPET  0x05000730 MSFT 0x00000097) @ 0x000000007ef9a250
-ACPI: GSCI (v001 MSTEST GMCHSCI  0x05000730 MSFT 0x00000097) @ 0x000000007ef9e0c0
-ACPI: SSDT (v001    AMI   CPU1PM 0x00000001 INTL 0x20060113) @ 0x000000007efa00f0
-ACPI: SSDT (v001    AMI   CPU2PM 0x00000001 INTL 0x20060113) @ 0x000000007efa02b0
-ACPI: DSDT (v001  A0579 A0579000 0x00000000 INTL 0x20060113) @ 0x0000000000000000
+ACPI: RSDP 000FAD70, 0024 (r2 ACPIAM)
+ACPI: XSDT 7EF90100, 006C (r1 NEC              5000730 MSFT       97)
+ACPI: FACP 7EF90290, 00F4 (r3 MSTEST OEMFACP   5000730 MSFT       97)
+ACPI: DSDT 7EF905C0, 9C85 (r1  A0579 A0579000        0 INTL 20060113)
+ACPI: FACS 7EF9E000, 0040
+ACPI: APIC 7EF90390, 006C (r1 MSTEST OEMAPIC   5000730 MSFT       97)
+ACPI: MCFG 7EF90400, 003C (r1 MSTEST OEMMCFG   5000730 MSFT       97)
+ACPI: SLIC 7EF90440, 0176 (r1 NEC              5000730 MSFT       97)
+ACPI: OEMB 7EF9E040, 007B (r1 MSTEST AMI_OEM   5000730 MSFT       97)
+ACPI: HPET 7EF9A250, 0038 (r1 MSTEST OEMHPET   5000730 MSFT       97)
+ACPI: GSCI 7EF9E0C0, 2024 (r1 MSTEST GMCHSCI   5000730 MSFT       97)
+ACPI: SSDT 7EFA00F0, 01BF (r1    AMI   CPU1PM        1 INTL 20060113)
+ACPI: SSDT 7EFA02B0, 0133 (r1    AMI   CPU2PM        1 INTL 20060113)
 No NUMA configuration found
 Faking a node at 0000000000000000-000000007ef90000
 Entering add_active_range(0, 0, 158) 0 entries of 3200 used
@@ -35,16 +36,18 @@
   DMA             0 ->     4096
   DMA32        4096 ->  1048576
   Normal    1048576 ->  1048576
+Movable zone start PFN for each node
 early_node_map[2] active PFN ranges
     0:        0 ->      158
     0:      256 ->   520080
 On node 0 totalpages: 519982
   DMA zone: 56 pages used for memmap
-  DMA zone: 972 pages reserved
-  DMA zone: 2970 pages, LIFO batch:0
+  DMA zone: 1044 pages reserved
+  DMA zone: 2898 pages, LIFO batch:0
   DMA32 zone: 7054 pages used for memmap
   DMA32 zone: 508930 pages, LIFO batch:31
   Normal zone: 0 pages used for memmap
+  Movable zone: 0 pages used for memmap
 ACPI: PM-Timer IO Port: 0x808
 ACPI: Local APIC address 0xfee00000
 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
@@ -60,31 +63,35 @@
 ACPI: IRQ0 used by override.
 ACPI: IRQ2 used by override.
 ACPI: IRQ9 used by override.
-Setting APIC routing to physical flat
+Setting APIC routing to flat
 ACPI: HPET id: 0x8086a202 base: 0xfed00000
 Using ACPI (MADT) for SMP configuration information
-Nosave address range: 000000000009e000 - 000000000009f000
-Nosave address range: 000000000009f000 - 00000000000a0000
-Nosave address range: 00000000000a0000 - 00000000000e4000
-Nosave address range: 00000000000e4000 - 0000000000100000
+swsusp: Registered nosave memory region: 000000000009e000 - 000000000009f000
+swsusp: Registered nosave memory region: 000000000009f000 - 00000000000a0000
+swsusp: Registered nosave memory region: 00000000000a0000 - 00000000000e4000
+swsusp: Registered nosave memory region: 00000000000e4000 - 0000000000100000
 Allocating PCI resources starting at 80000000 (gap: 7f000000:7fe00000)
 SMP: Allowing 4 CPUs, 2 hotplug CPUs
-PERCPU: Allocating 36992 bytes of per cpu data
-Built 1 zonelists.  Total pages: 511900
+PERCPU: Allocating 35176 bytes of per cpu data
+Built 1 zonelists in Node order.  Total pages: 511828
+Policy zone: DMA32
 Kernel command line: root=/dev/sda6 ro quiet vga=791 
 Initializing CPU#0
 PID hash table entries: 4096 (order: 12, 32768 bytes)
+Extended CMOS year: 2000
+time.c: Detected 2399.997 MHz processor.
 Console: colour dummy device 80x25
-Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
-Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
+console [tty0] enabled
 Checking aperture...
 Calgary: detecting Calgary via BIOS EBDA area
 Calgary: Unable to locate Rio Grande table in EBDA - bailing!
-Memory: 2038324k/2080320k available (1970k kernel code, 41604k reserved, 920k data, 284k init)
-Calibrating delay using timer specific routine.. 4804.04 BogoMIPS (lpj=9608092)
+Memory: 2041096k/2080320k available (2066k kernel code, 38832k reserved, 977k data, 304k init)
+Calibrating delay using timer specific routine.. 4803.34 BogoMIPS (lpj=9606690)
 Security Framework v1.0.0 initialized
 SELinux:  Disabled at boot.
 Capability LSM initialized
+Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
+Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
 Mount-cache hash table entries: 256
 CPU: L1 I cache: 32K, L1 D cache: 32K
 CPU: L2 cache: 4096K
@@ -94,14 +101,14 @@
 CPU: Processor Core ID: 0
 CPU0: Thermal monitoring enabled (TM2)
 SMP alternatives: switching to UP code
-ACPI: Core revision 20060707
+ACPI: Core revision 20070126
 Using local APIC timer interrupts.
-result 16666675
+result 16666629
 Detected 16.666 MHz APIC timer.
 SMP alternatives: switching to SMP code
 Booting processor 1/2 APIC 0x1
 Initializing CPU#1
-Calibrating delay using timer specific routine.. 4800.01 BogoMIPS (lpj=9600034)
+Calibrating delay using timer specific routine.. 4800.03 BogoMIPS (lpj=9600060)
 CPU: L1 I cache: 32K, L1 D cache: 32K
 CPU: L2 cache: 4096K
 CPU 1/1 -> Node 0
@@ -109,25 +116,20 @@
 CPU: Processor Core ID: 1
 CPU1: Thermal monitoring enabled (TM2)
 Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz stepping 06
+checking TSC synchronization [CPU#0 -> CPU#1]: passed.
 Brought up 2 CPUs
-testing NMI watchdog ... OK.
-time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.
-time.c: Detected 2400.003 MHz processor.
-migration_cost=20
 NET: Registered protocol family 16
 ACPI: bus type pci registered
 PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
 PCI: Not using MMCONFIG.
 PCI: Using configuration type 1
+ACPI: EC: Look up EC in DSDT
 ACPI: Interpreter enabled
+ACPI: (supports S0 S1 S3)
 ACPI: Using IOAPIC for interrupt routing
 ACPI: PCI Root Bridge [PCI0] (0000:00)
-PCI: Probing PCI hardware (bus 00)
 PCI quirk: region 0800-087f claimed by ICH6 ACPI/GPIO/TCO
 PCI quirk: region 0480-04bf claimed by ICH6 GPIO
-Boot video device is 0000:01:00.0
-0000:02:00.0: cannot adjust BAR2 (not I/O)
-0000:02:00.0: cannot adjust BAR3 (not I/O)
 PCI: Transparent bridge - 0000:00:1e.0
 ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
 ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P2._PRT]
@@ -142,8 +144,10 @@
 ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 *14 15)
 ACPI: PCI Interrupt Link [LNKG] (IRQs *3 4 5 6 7 10 11 12 14 15)
 ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 *7 10 11 12 14 15)
+ACPI Warning (tbutils-0217): Incorrect checksum in table [OEMB] -  CD, should be C4 [20070126]
 Linux Plug and Play Support v0.97 (c) Adam Belay
 pnp: PnP ACPI init
+ACPI: bus type pnp registered
 pnp: ACPI device : hid PNP0A08
 pnp: ACPI device : hid PNP0C01
 pnp: ACPI device : hid PNP0200
@@ -161,26 +165,39 @@
 pnp: ACPI device : hid PNP0C02
 pnp: ACPI device : hid PNP0C01
 pnp: PnP ACPI: found 16 devices
+ACPI: ACPI bus type pnp unregistered
 usbcore: registered new interface driver usbfs
 usbcore: registered new interface driver hub
 usbcore: registered new device driver usb
 PCI: Using ACPI for IRQ routing
 PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
-PCI: Cannot allocate resource region 2 of device 0000:02:00.0
-PCI: Cannot allocate resource region 3 of device 0000:02:00.0
 NET: Registered protocol family 8
 NET: Registered protocol family 20
+PCI-GART: No AMD northbridge found.
 hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
 hpet0: 3 64-bit timers, 14318180 Hz
-PCI-GART: No AMD northbridge found.
+ACPI: RTC can wake from S4
+Time: tsc clocksource has been installed.
 pnp: the driver 'system' has been registered
 pnp: match found with the PnP device '00:01' and the driver 'system'
+pnp: 00:01: iomem range 0xfed14000-0xfed19fff has been reserved
 pnp: match found with the PnP device '00:08' and the driver 'system'
 pnp: 00:08: ioport range 0x290-0x297 has been reserved
 pnp: match found with the PnP device '00:09' and the driver 'system'
+pnp: 00:09: iomem range 0xfed1c000-0xfed1ffff has been reserved
+pnp: 00:09: iomem range 0xfed20000-0xfed8ffff has been reserved
+pnp: 00:09: iomem range 0xff9fa000-0xff9fafff has been reserved
+pnp: 00:09: iomem range 0xfff00000-0xfffffffe could not be reserved
 pnp: match found with the PnP device '00:0b' and the driver 'system'
+pnp: 00:0b: iomem range 0xfec00000-0xfec00fff has been reserved
+pnp: 00:0b: iomem range 0xfee00000-0xfee00fff could not be reserved
 pnp: match found with the PnP device '00:0e' and the driver 'system'
+pnp: 00:0e: iomem range 0xe0000000-0xefffffff has been reserved
 pnp: match found with the PnP device '00:0f' and the driver 'system'
+pnp: 00:0f: iomem range 0x0-0x9ffff could not be reserved
+pnp: 00:0f: iomem range 0xc0000-0xcffff has been reserved
+pnp: 00:0f: iomem range 0xe0000-0xfffff could not be reserved
+pnp: 00:0f: iomem range 0x100000-0x7effffff could not be reserved
 PCI: Bridge: 0000:00:01.0
   IO window: 8000-afff
   MEM window: ff700000-ff7fffff
@@ -206,20 +223,21 @@
 PCI: Setting latency timer of device 0000:00:1e.0 to 64
 NET: Registered protocol family 2
 IP route cache hash table entries: 65536 (order: 7, 524288 bytes)
-TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
+TCP established hash table entries: 262144 (order: 10, 6291456 bytes)
 TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
 TCP: Hash tables configured (established 262144 bind 65536)
 TCP reno registered
 checking if image is initramfs... it is
-Freeing initrd memory: 5937k freed
+Freeing initrd memory: 5950k freed
 audit: initializing netlink socket (disabled)
-audit(1181990747.912:1): initialized
+audit(1189181505.780:1): initialized
 VFS: Disk quotas dquot_6.5.1
 Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
 io scheduler noop registered
 io scheduler anticipatory registered
 io scheduler deadline registered
 io scheduler cfq registered (default)
+Boot video device is 0000:01:00.0
 PCI: Setting latency timer of device 0000:00:01.0 to 64
 assign_interrupt_mode Found MSI capability
 Allocate Port Service[0000:00:01.0:pcie00]
@@ -231,7 +249,7 @@
 assign_interrupt_mode Found MSI capability
 Allocate Port Service[0000:00:1c.4:pcie00]
 Allocate Port Service[0000:00:1c.4:pcie02]
-vesafb: framebuffer at 0xc0000000, mapped to 0xffffc20000080000, using 3072k, total 16384k
+vesafb: framebuffer at 0xc0000000, mapped to 0xffffc20000b00000, using 3072k, total 16384k
 vesafb: mode is 1024x768x16, linelength=2048, pages=9
 vesafb: scrolling: redraw
 vesafb: Truecolor: size=0:5:6:5, shift=0:11:5:0
@@ -239,7 +257,7 @@
 fb0: VESA VGA frame buffer device
 Real Time Clock Driver v1.12ac
 hpet_resources: 0xfed00000 is busy
-Linux agpgart interface v0.101 (c) Dave Jones
+Linux agpgart interface v0.102
 Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
 serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
 pnp: the driver 'serial' has been registered
@@ -257,11 +275,10 @@
 TCP bic registered
 NET: Registered protocol family 1
 NET: Registered protocol family 17
-ACPI: (supports S0 S1 S3 S4 S5)
-Freeing unused kernel memory: 284k freed
+Freeing unused kernel memory: 304k freed
 input: AT Translated Set 2 keyboard as /class/input/input0
-ACPI Exception (acpi_processor-0677): AE_NOT_FOUND, Processor Device is not present [20060707]
-ACPI Exception (acpi_processor-0677): AE_NOT_FOUND, Processor Device is not present [20060707]
+ACPI Exception (processor_core-0797): AE_NOT_FOUND, Processor Device is not present [20070126]
+ACPI Exception (processor_core-0797): AE_NOT_FOUND, Processor Device is not present [20070126]
 USB Universal Host Controller Interface driver v3.0
 ACPI: PCI Interrupt 0000:00:1a.0[A] -> GSI 16 (level, low) -> IRQ 16
 PCI: Setting latency timer of device 0000:00:1a.0 to 64
@@ -274,10 +291,9 @@
 Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
 pnp: the driver 'ide' has been registered
-SCSI subsystem initialized
-ieee1394: Initialized config rom entry `ip1394'
 Floppy drive(s): fd0 is 1.44M
-libata version 2.00 loaded.
+SCSI subsystem initialized
+libata version 2.21 loaded.
 FDC 0 is a post-1991 82077
 ACPI: PCI Interrupt 0000:00:1a.1[B] -> GSI 17 (level, low) -> IRQ 17
 PCI: Setting latency timer of device 0000:00:1a.1 to 64
@@ -312,8 +328,42 @@
 hub 5-0:1.0: USB hub found
 hub 5-0:1.0: 2 ports detected
 ACPI: PCI Interrupt 0000:04:04.0[A] -> GSI 19 (level, low) -> IRQ 19
-skge 1.9 addr 0xff9f8000 irq 19 chip Yukon-Lite rev 9
-skge eth0: addr 00:18:f3:90:8e:7b
+PCI: Unable to reserve mem region #1:4000@ff9f8000 for device 0000:04:04.0
+skge 0000:04:04.0: cannot obtain PCI resources
+ACPI: PCI interrupt for device 0000:04:04.0 disabled
+skge: probe of 0000:04:04.0 failed with error -16
+ACPI: PCI Interrupt 0000:04:03.0[A] -> GSI 21 (level, low) -> IRQ 21
+firewire_ohci: Added fw-ohci device 0000:04:03.0, OHCI version 1.10
+ahci 0000:00:1f.2: version 2.3
+ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
+firewire_core: created new fw device fw0 (0 config rom retries, S400)
+ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0x33 impl SATA mode
+ahci 0000:00:1f.2: flags: 64bit ncq sntf ilck stag pm led clo pmp pio slum part 
+PCI: Setting latency timer of device 0000:00:1f.2 to 64
+scsi0 : ahci
+scsi1 : ahci
+scsi2 : ahci
+scsi3 : ahci
+scsi4 : ahci
+scsi5 : ahci
+ata1: SATA max UDMA/133 cmd 0xffffc20000ac2900 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 1276
+ata2: SATA max UDMA/133 cmd 0xffffc20000ac2980 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 1276
+ata3: DUMMY
+ata4: DUMMY
+ata5: SATA max UDMA/133 cmd 0xffffc20000ac2b00 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 1276
+ata6: SATA max UDMA/133 cmd 0xffffc20000ac2b80 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 1276
+ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
+ata1.00: ATA-7: SAMSUNG HD401LJ, ZZ100-15, max UDMA7
+ata1.00: 781422768 sectors, multi 0: LBA48 NCQ (not used)
+ata1.00: configured for UDMA/133
+ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
+ata2.00: ATAPI: HL-DT-STDVD-RAM GSA-H30N, 1.01, max UDMA/100
+ata2.00: configured for UDMA/100
+ata5: SATA link down (SStatus 0 SControl 300)
+ata6: SATA link down (SStatus 0 SControl 300)
+scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG HD401LJ  ZZ10 PQ: 0 ANSI: 5
+scsi 1:0:0:0: CD-ROM            HL-DT-ST DVD-RAM GSA-H30N 1.01 PQ: 0 ANSI: 5
+ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
 ACPI: PCI Interrupt 0000:00:1a.7[C] -> GSI 18 (level, low) -> IRQ 18
 PCI: Setting latency timer of device 0000:00:1a.7 to 64
 ehci_hcd 0000:00:1a.7: EHCI Host Controller
@@ -325,6 +375,15 @@
 usb usb6: configuration #1 chosen from 1 choice
 hub 6-0:1.0: USB hub found
 hub 6-0:1.0: 4 ports detected
+ahci 0000:02:00.0: AHCI 0001.0000 32 slots 2 ports 3 Gbps 0x3 impl SATA mode
+ahci 0000:02:00.0: flags: 64bit ncq pm led clo pmp pio slum part 
+PCI: Setting latency timer of device 0000:02:00.0 to 64
+scsi6 : ahci
+scsi7 : ahci
+ata7: SATA max UDMA/133 cmd 0xffffc20000ac4100 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 16
+ata8: SATA max UDMA/133 cmd 0xffffc20000ac4180 ctl 0x0000000000000000 bmdma 0x0000000000000000 irq 16
+ata7: SATA link down (SStatus 0 SControl 300)
+ata8: SATA link down (SStatus 0 SControl 300)
 ACPI: PCI Interrupt 0000:00:1d.7[A] -> GSI 23 (level, low) -> IRQ 23
 PCI: Setting latency timer of device 0000:00:1d.7 to 64
 ehci_hcd 0000:00:1d.7: EHCI Host Controller
@@ -336,114 +395,69 @@
 usb usb7: configuration #1 chosen from 1 choice
 hub 7-0:1.0: USB hub found
 hub 7-0:1.0: 6 ports detected
-ACPI: PCI Interrupt 0000:04:03.0[A] -> GSI 21 (level, low) -> IRQ 21
-ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[21]  MMIO=[ff9ff000-ff9ff7ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
-ahci 0000:00:1f.2: version 2.0
-ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
-PCI: Setting latency timer of device 0000:00:1f.2 to 64
-ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0x33 impl SATA mode
-ahci 0000:00:1f.2: flags: 64bit ncq ilck stag pm led clo pmp pio slum part 
-ata1: SATA max UDMA/133 cmd 0xFFFFC20000056900 ctl 0x0 bmdma 0x0 irq 1276
-ata2: SATA max UDMA/133 cmd 0xFFFFC20000056980 ctl 0x0 bmdma 0x0 irq 1276
-ata3: DUMMY
-ata4: DUMMY
-ata5: SATA max UDMA/133 cmd 0xFFFFC20000056B00 ctl 0x0 bmdma 0x0 irq 1276
-ata6: SATA max UDMA/133 cmd 0xFFFFC20000056B80 ctl 0x0 bmdma 0x0 irq 1276
-scsi0 : ahci
-ieee1394: Host added: ID:BUS[0-00:1023]  GUID[0011d80000ec51d2]
-ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
-ata1.00: ATA-7, max UDMA7, 781422768 sectors: LBA48 NCQ (depth 31/32)
-ata1.00: configured for UDMA/133
-scsi1 : ahci
-ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
-ata2.00: ATAPI, max UDMA/100
-ata2.00: configured for UDMA/100
-scsi2 : ahci
-scsi3 : ahci
-scsi4 : ahci
-ata5: SATA link down (SStatus 0 SControl 300)
-scsi5 : ahci
-ata6: SATA link down (SStatus 0 SControl 300)
-scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG HD401LJ  ZZ10 PQ: 0 ANSI: 5
-scsi 1:0:0:0: CD-ROM            HL-DT-ST DVD-RAM GSA-H30N 1.01 PQ: 0 ANSI: 5
-PCI: Device 0000:02:00.0 not available because of resource collisions
-ahci: probe of 0000:02:00.0 failed with error -22
-JMB363: IDE controller at PCI slot 0000:02:00.0
-PCI: Device 0000:02:00.0 not available because of resource collisions
-ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
-JMB363: BIOS configuration fixed.
+sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
+sd 0:0:0:0: [sda] Write Protect is off
+sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
+sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
+sd 0:0:0:0: [sda] 781422768 512-byte hardware sectors (400088 MB)
+sd 0:0:0:0: [sda] Write Protect is off
+sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
+sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
+ sda: sda1 sda2 sda3 < sda5 sda6 sda7 >
+sd 0:0:0:0: [sda] Attached SCSI disk
+JMB363: IDE controller at PCI slot 0000:02:00.1
+PCI: Enabling device 0000:02:00.1 (0000 -> 0001)
+ACPI: PCI Interrupt 0000:02:00.1[B] -> GSI 17 (level, low) -> IRQ 17
 JMB363: chipset revision 2
-JMB363: 100% native mode on irq 16
-JMB363: dma_base is invalid
-ide0: JMB363 Bus-Master DMA disabled (BIOS)
-JMB363: dma_base is invalid
-ide1: JMB363 Bus-Master DMA disabled (BIOS)
+JMB363: 100% native mode on irq 17
+PCI: Setting latency timer of device 0000:02:00.1 to 64
+    ide0: BM-DMA at 0xb400-0xb407, BIOS settings: hda:DMA, hdb:DMA
+    ide1: BM-DMA at 0xb408-0xb40f, BIOS settings: hdc:pio, hdd:pio
 Probing IDE interface ide0...
-SCSI device sda: 781422768 512-byte hdwr sectors (400088 MB)
-sda: Write Protect is off
-sda: Mode Sense: 00 3a 00 00
-SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-SCSI device sda: 781422768 512-byte hdwr sectors (400088 MB)
-sda: Write Protect is off
-sda: Mode Sense: 00 3a 00 00
-SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
- sda: sda1 sda2 sda3 < sda5 sda6 >
-sd 0:0:0:0: Attached scsi disk sda
 sr0: scsi3-mmc drive: 48x/48x writer dvd-ram cd/rw xa/form2 cdda tray
 Uniform CD-ROM driver Revision: 3.20
 sr 1:0:0:0: Attached scsi CD-ROM sr0
 sd 0:0:0:0: Attached scsi generic sg0 type 0
 sr 1:0:0:0: Attached scsi generic sg1 type 5
+hdb: WDC WD400BB-00AUA1, ATA DISK drive
+hdb: selected mode 0x45
+ide0 at 0xbc00-0xbc07,0xb882 on irq 17
 Probing IDE interface ide1...
-JMB363: IDE controller at PCI slot 0000:02:00.1
-PCI: Enabling device 0000:02:00.1 (0000 -> 0001)
-ACPI: PCI Interrupt 0000:02:00.1[B] -> GSI 17 (level, low) -> IRQ 17
-JMB363: chipset revision 2
-JMB363: 100% native mode on irq 17
-PCI: Setting latency timer of device 0000:02:00.1 to 64
-    ide2: BM-DMA at 0xb400-0xb407, BIOS settings: hde:DMA, hdf:DMA
-    ide3: BM-DMA at 0xb408-0xb40f, BIOS settings: hdg:pio, hdh:pio
-Probing IDE interface ide2...
-hdf: WDC WD400BB-00AUA1, ATA DISK drive
-hdf: hw_config=6b00
-hdf: hw_config=6b00
-ide2 at 0xbc00-0xbc07,0xb882 on irq 17
-Probing IDE interface ide3...
-hdf: max request size: 128KiB
-hdf: 78165360 sectors (40020 MB) w/2048KiB Cache, CHS=65535/16/63<6>hdf: hw_config=6b00
-, UDMA(100)
-hdf: cache flushes not supported
- hdf: hdf1 hdf2 < hdf5 hdf6 hdf7 hdf8 hdf9 hdf10 hdf11 >
+hdb: max request size: 128KiB
+hdb: 78165360 sectors (40020 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
+hdb: cache flushes not supported
+ hdb: hdb1 hdb2 < hdb5 hdb6 hdb7 hdb8 hdb9 hdb10 hdb11 >
 kjournald starting.  Commit interval 5 seconds
 EXT3-fs: mounted filesystem with ordered data mode.
-eth1394: eth1: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0)
-PCI: Enabling device 0000:00:1f.3 (0001 -> 0003)
-ACPI: PCI Interrupt 0000:00:1f.3[C] -> GSI 18 (level, low) -> IRQ 18
-agpgart: Detected an Intel 965G Chipset.
-agpgart: AGP aperture is 256M @ 0x0
-iTCO_wdt: Intel TCO WatchDog Timer Driver v1.01 (11-Nov-2006)
+input: PC Speaker as /class/input/input1
+iTCO_wdt: Intel TCO WatchDog Timer Driver v1.02 (26-Jul-2007)
 iTCO_wdt: Found a ICH8 or ICH8R TCO device (Version=2, TCOBASE=0x0860)
 iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
-input: PC Speaker as /class/input/input1
+input: Power Button (FF) as /class/input/input2
+ACPI: Power Button (FF) [PWRF]
+input: Power Button (CM) as /class/input/input3
+ACPI: Power Button (CM) [PWRB]
+ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 22 (level, low) -> IRQ 22
+PCI: Setting latency timer of device 0000:00:1b.0 to 64
+hda_codec: Unknown model for AD1988, trying auto-probe from BIOS...
 Linux video capture interface: v2.00
+PCI: Enabling device 0000:00:1f.3 (0001 -> 0003)
+ACPI: PCI Interrupt 0000:00:1f.3[C] -> GSI 18 (level, low) -> IRQ 18
 logips2pp: Detected unknown logitech mouse model 95
 saa7146: register extension 'dvb'.
 ACPI: PCI Interrupt 0000:04:01.0[A] -> GSI 22 (level, low) -> IRQ 22
-saa7146: found saa7146 @ mem ffffc2000039c800 (revision 1, irq 22) (0x13c2,0x0003).
-DVB: registering new adapter (Technotrend/Hauppauge WinTV Nexus-S rev2.X).
-ACPI: PCI Interrupt 0000:00:1b.0[A] -> GSI 22 (level, low) -> IRQ 22
-PCI: Setting latency timer of device 0000:00:1b.0 to 64
+saa7146: found saa7146 @ mem ffffc20000ace800 (revision 1, irq 22) (0x13c2,0x0003).
+DVB: registering new adapter (Technotrend/Hauppauge WinTV Nexus-S rev2.X)
 adapter has MAC addr = 00:d0:5c:22:68:cc
 dvb-ttpci: gpioirq unknown type=0 len=0
 dvb-ttpci: info @ card 0: firm f0240009, rtsl b0250018, vid 71010068, app 80002622
 dvb-ttpci: firmware @ card 0 supports CI link layer interface
-hda_codec: Unknown model for AD1988, trying auto-probe from BIOS...
-input: ImExPS/2 Logitech Explorer Mouse as /class/input/input2
+input: ImExPS/2 Logitech Explorer Mouse as /class/input/input4
 dvb-ttpci: adac type set to 0 @ card 0
 saa7146_vv: saa7146 (0): registered device video0 [v4l2]
 saa7146_vv: saa7146 (0): registered device vbi0 [v4l2]
 DVB: registering frontend 0 (ST STV0299 DVB-S)...
-input: DVB on-card IR receiver as /class/input/input3
+input: DVB on-card IR receiver as /class/input/input5
 dvb-ttpci: found av7110-0.
 EXT3 FS on sda6, internal journal
 device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
@@ -452,23 +466,16 @@
 EXT3-fs: mounted filesystem with ordered data mode.
 kjournald starting.  Commit interval 5 seconds
 EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
-EXT3 FS on hdf10, internal journal
+EXT3 FS on sda7, internal journal
 EXT3-fs: mounted filesystem with ordered data mode.
-skge eth0: enabling interface
-skge eth0: Link is up at 1000 Mbps, full duplex, flow control both
 NET: Registered protocol family 10
 lo: Disabled Privacy Extensions
-input: Power Button (FF) as /class/input/input4
-ACPI: Power Button (FF) [PWRF]
-input: Power Button (CM) as /class/input/input5
-ACPI: Power Button (CM) [PWRB]
 pnp: the driver 'parport_pc' has been registered
 lp: driver loaded but no devices found
 ppdev: user-space parallel port driver
-eth0: no IPv6 routers present
 [drm] Initialized drm 1.1.0 20060810
 ACPI: PCI Interrupt 0000:01:00.0[A] -> GSI 16 (level, low) -> IRQ 16
-[drm] Initialized radeon 1.25.0 20060524 on minor 0
+[drm] Initialized radeon 1.28.0 20060524 on minor 0
 [drm] Setting GART location based on new memory map
 [drm] Loading R300 Microcode
 [drm] writeback test succeeded in 1 usecs

^ permalink raw reply

* Re: [PATCH] Add IP1000A Driver
From: Stefan Lippers-Hollmann @ 2007-09-11 13:57 UTC (permalink / raw)
  To: Jesse Huang; +Cc: jeff, akpm, netdev, Francois Romieu
In-Reply-To: <1189524638.3261.1.camel@localhost.localdomain>

[-- Attachment #1: Type: text/plain, Size: 3603 bytes --]

Hi

Just some very basic comments to actually get it compiling, adding Francois 
Romieu to CC because he has been involved with this driver in the past.

On Dienstag, 11. September 2007, Jesse Huang wrote:
> From: Jesse Huang <jesse@icplus.com.tw>
>
> Change Logs: Add IP1000A Driver to kernel tree.
>
> Signed-off-by: Jesse Huang <jesse@icplus.com.tw>
> ---
>
> drivers/net/ipg.c | 2331 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/net/ipg.h |  856 +++++++++++++++++++
>  2 files changed, 3187 insertions(+), 0 deletions(-)
>  create mode 100755 drivers/net/ipg.c
>  create mode 100755 drivers/net/ipg.h

Kconfig/ Makefile adaptions missing (borrowed from 
http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.19-rc2/ip1000/0001-ipg-new-gigabit-ethernet-device-driver.txt):

diff -Nrup a/drivers/net/Kconfig b/drivers/net/Kconfig
--- a/drivers/net/Kconfig       2007-09-11 12:56:50.000000000 +0200
+++ b/drivers/net/Kconfig       2007-09-11 13:00:52.000000000 +0200
@@ -159,6 +159,15 @@ config NET_SB1000

          If you don't have this card, of course say N.

+config IP1000
+       tristate "IP1000 Gigabit Ethernet support"
+       depends on PCI && EXPERIMENTAL
+       ---help---
+         This driver supports IP1000 gigabit Ethernet cards.
+
+         To compile this driver as a module, choose M here: the module
+         will be called ipg.  This is recommended.
+
 source "drivers/net/arcnet/Kconfig"

 source "drivers/net/phy/Kconfig"
diff -Nrup a/drivers/net/Makefile b/drivers/net/Makefile
--- a/drivers/net/Makefile      2007-09-11 13:17:23.000000000 +0200
+++ b/drivers/net/Makefile      2007-09-11 13:28:00.000000000 +0200
@@ -4,6 +4,7 @@

 obj-$(CONFIG_E1000) += e1000/
 obj-$(CONFIG_IBM_EMAC) += ibm_emac/
+obj-$(CONFIG_IP1000) += ipg.o
 obj-$(CONFIG_IXGB) += ixgb/
 obj-$(CONFIG_CHELSIO_T1) += chelsio/
 obj-$(CONFIG_CHELSIO_T3) += cxgb3/


> e804d1c265bf1d843f845457f925a1728bbfdff7
> diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
> new file mode 100755
> index 0000000..bdc2b8d
> --- /dev/null
> +++ b/drivers/net/ipg.c
[...]
> +static struct pci_device_id ipg_pci_tbl[] __devinitdata = {
> +	{ PCI_DEVICE(PCI_VENDOR_ID_SUNDANCE,	0x1023), 0, 0, 0 },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_SUNDANCE,	0x2021), 0, 0, 1 },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_SUNDANCE,	0x1021), 0, 0, 2 },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_DLINK,	0x9021), 0, 0, 3 },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_DLINK,	0x4000), 0, 0, 4 },
> +	{ PCI_DEVICE(PCI_VENDOR_ID_DLINK,	0x4020), 0, 0, 5 },
> +	{ 0, }
> +};

PCI_VENDOR_ID_SUNDANCE is undefined in kernel 2.6.23-rc6:

diff -Nrup a/include/linux/pci_ids.h b/include/linux/pci_ids.h
--- a/include/linux/pci_ids.h   2007-09-11 13:17:25.000000000 +0200
+++ b/include/linux/pci_ids.h   2007-09-11 13:15:34.000000000 +0200
@@ -1841,6 +1841,8 @@
 #define PCI_VENDOR_ID_ABOCOM           0x13D1
 #define PCI_DEVICE_ID_ABOCOM_2BD1       0x2BD1

+#define PCI_VENDOR_ID_SUNDANCE         0x13F0
+
 #define PCI_VENDOR_ID_CMEDIA           0x13f6
 #define PCI_DEVICE_ID_CMEDIA_CM8338A   0x0100
 #define PCI_DEVICE_ID_CMEDIA_CM8338B   0x0101

After these changes it seems to work in a 100 MBit/s network for me.
00:0a.0 Ethernet controller [0200]: Sundance Technology Inc / IC Plus Corp IC Plus IP1000 Family Gigabit Ethernet [13f0:1023] (rev 41)

> --- /dev/null
> +++ b/drivers/net/ipg.h
[...] 
> +
> +/* Miscellaneous Constants. */
> +#define   TRUE  1
> +#define   FALSE 0

Using the generic boolean definitions might be preferred.

Regards
	Stefan Lippers-Hollmann

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 1/2] SCTP: Add RCU synchronization around sctp_localaddr_list
From: Vlad Yasevich @ 2007-09-11 14:05 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: paulmck, netdev, lksctp-developers
In-Reply-To: <46E642AA.7090703@us.ibm.com>

Sridhar, Paul

Thanks for review.  Some answers and questions below...

Sridhar Samudrala wrote:
> Paul E. McKenney wrote:
>> On Mon, Sep 10, 2007 at 03:46:29PM -0400, Vlad Yasevich wrote:
>>> sctp_localaddr_list is modified dynamically via NETDEV_UP
>>> and NETDEV_DOWN events, but there is not synchronization
>>> between writer (even handler) and readers.  As a result,
>>> the readers can access an entry that has been freed and
>>> crash the sytem.
>>
>> Good start, but few questions interspersed below...
>>
>>                             Thanx, Paul
>>

>>> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>>> index f8aa23d..fc2e4e2 100644
>>> --- a/net/sctp/ipv6.c
>>> +++ b/net/sctp/ipv6.c
>>> @@ -77,13 +77,18 @@
>>>
>>>  #include <asm/uaccess.h>
>>>
>>> -/* Event handler for inet6 address addition/deletion events.  */
>>> +/* Event handler for inet6 address addition/deletion events.
>>> + * This even is part of the atomic notifier call chain
>>> + * and thus happens atomically and can NOT sleep.  As a result
>>> + * we can't and really don't need to add any locks to guard the
>>> + * RCU.
>>> + */
>>>  static int sctp_inet6addr_event(struct notifier_block *this,
>>> unsigned long ev,
>>>                  void *ptr)
>>>  {
>>>      struct inet6_ifaddr *ifa = (struct inet6_ifaddr *)ptr;
>>> -    struct sctp_sockaddr_entry *addr;
>>> -    struct list_head *pos, *temp;
>>> +    struct sctp_sockaddr_entry *addr = NULL;
>>> +    struct sctp_sockaddr_entry *temp;
>>>
>>>      switch (ev) {
>>>      case NETDEV_UP:
>>> @@ -94,19 +99,26 @@ static int sctp_inet6addr_event(struct
>>> notifier_block *this, unsigned long ev,
>>>              memcpy(&addr->a.v6.sin6_addr, &ifa->addr,
>>>                   sizeof(struct in6_addr));
>>>              addr->a.v6.sin6_scope_id = ifa->idev->dev->ifindex;
>>> -            list_add_tail(&addr->list, &sctp_local_addr_list);
>>> +            addr->valid = 1;
>>> +            rcu_read_lock();
>>> +            list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
>>> +            rcu_read_unlock();
>>
>> If we are under the protection of the update-side mutex, the
>> rcu_read_lock()
>> and rcu_read_unlock() are (harmlessly) redundant.  If we are not under
>> the protection of some mutex, what prevents concurrent
>> list_add_tail_rcu()
>> calls from messing up the sctp_sockaddr_entry list?
> 
> This is an atomic notifier call chain event and as such can be called
> from a
> softirq. So i think we need a spin_lock_bh here.

But the question is, can two notifiers be called at the same time?
If yes, then I think there is need for spin lock protection.  (and I think
bonding might need that too).

> 
>>
>>>          }
>>>          break;
>>>      case NETDEV_DOWN:
>>> -        list_for_each_safe(pos, temp, &sctp_local_addr_list) {
>>> -            addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>>> -            if (ipv6_addr_equal(&addr->a.v6.sin6_addr, &ifa->addr)) {
>>> -                list_del(pos);
>>> -                kfree(addr);
>>> +        rcu_read_lock();
>>> +        list_for_each_entry_safe(addr, temp,
>>> +                    &sctp_local_addr_list, list) {
>>> +            if (ipv6_addr_equal(&addr->a.v6.sin6_addr,
>>> +                         &ifa->addr)) {
>>> +                addr->valid = 0;
>>> +                list_del_rcu(&addr->list);
>>>                  break;
>>>              }
>>>          }
>>> -
>>> +        rcu_read_unlock();
>>> +        if (addr && !addr->valid)
>>> +            call_rcu(&addr->rcu, sctp_local_addr_free);
>>
>> Are we under the protection of the update-side lock here?  If not,
>> what prevents two different tasks from executing this in parallel,
>> potentially tangling both the list that the sctp_sockaddr_entry list and
>> the internal RCU lists?  (It is forbidden to call_rcu() a given element
>> twice concurrently.)
>>
>> If we are in fact under the protection of the update-side lock, the
>> rcu_read_lock() and rcu_read_unlock() pairs are redundant (though this
>> is harmless, aside from the (small) potential for confusion).
> 
> There is no update-side lock protection here. We need a spin_lock_bh().

Recurring theme.  Same questions about notifiers apply.

> 
>>
>>>          break;
>>>      }
>>>
>>> @@ -368,6 +380,7 @@ static void sctp_v6_copy_addrlist(struct
>>> list_head *addrlist,
>>>              addr->a.v6.sin6_addr = ifp->addr;
>>>              addr->a.v6.sin6_scope_id = dev->ifindex;
>>>              INIT_LIST_HEAD(&addr->list);
>>> +            INIT_RCU_HEAD(&addr->rcu);
>>>              list_add_tail(&addr->list, addrlist);
>>>          }
>>>      }
>>> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
>>> index e98579b..ac52f9e 100644
>>> --- a/net/sctp/protocol.c
>>> +++ b/net/sctp/protocol.c
>>> @@ -153,6 +153,7 @@ static void sctp_v4_copy_addrlist(struct
>>> list_head *addrlist,
>>>              addr->a.v4.sin_family = AF_INET;
>>>              addr->a.v4.sin_port = 0;
>>>              addr->a.v4.sin_addr.s_addr = ifa->ifa_local;
>>> +            INIT_RCU_HEAD(&addr->rcu);
>>>              list_add_tail(&addr->list, addrlist);
>>>          }
>>>      }
>>> @@ -192,16 +193,24 @@ static void sctp_free_local_addr_list(void)
>>>      }
>>>  }
>>>
>>> +void sctp_local_addr_free(struct rcu_head *head)
>>> +{
>>> +    struct sctp_sockaddr_entry *e = container_of(head,
>>> +                struct sctp_sockaddr_entry, rcu);
>>> +    kfree(e);
>>> +}
>>> +
>>>  /* Copy the local addresses which are valid for 'scope' into 'bp'.  */
>>>  int sctp_copy_local_addr_list(struct sctp_bind_addr *bp,
>>> sctp_scope_t scope,
>>>                    gfp_t gfp, int copy_flags)
>>>  {
>>>      struct sctp_sockaddr_entry *addr;
>>>      int error = 0;
>>> -    struct list_head *pos, *temp;
>>>
>>> -    list_for_each_safe(pos, temp, &sctp_local_addr_list) {
>>> -        addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>>> +    rcu_read_lock();
>>> +    list_for_each_entry_rcu(addr, &sctp_local_addr_list, list) {
>>> +        if (!addr->valid)
>>> +            continue;
>>
>> What happens if the update-side code removes the element from the list
>> and marks it !->valid right here?
>>
>> If this turns out to be harmless, why not just dispense with the ->valid
>> flag entirely?
> 
> It should be OK if an address gets removed from the list. So i agree that
> ->valid flag is not really useful.

I added the valid flag because we don't really want to give back addresses that
are going away at the rcu quiescent state.  I agree, there is a race between
notifier marking the address invalid, and these checks on the reader side.   This
was modeled after other code that uses similar semantics.

> 
>>
>>>          if (sctp_in_scope(&addr->a, scope)) {
>>>              /* Now that the address is in scope, check to see if
>>>               * the address type is really supported by the local
>>> @@ -221,6 +230,7 @@ int sctp_copy_local_addr_list(struct
>>> sctp_bind_addr *bp, sctp_scope_t scope,
>>>      }
>>>
>>>  end_copy:
>>> +    rcu_read_unlock();
>>>      return error;
>>>  }
>>>
>>> @@ -600,13 +610,17 @@ static void sctp_v4_seq_dump_addr(struct
>>> seq_file *seq, union sctp_addr *addr)
>>>      seq_printf(seq, "%d.%d.%d.%d ", NIPQUAD(addr->v4.sin_addr));
>>>  }
>>>
>>> -/* Event handler for inet address addition/deletion events.  */
>>> +/* Event handler for inet address addition/deletion events.
>>> + * This is part of the blocking notifier call chain that is
>>> + * guarted by a mutex.  As a result we don't need to add
>>> + * any additional guards for the RCU
>>> + */
>>>  static int sctp_inetaddr_event(struct notifier_block *this, unsigned
>>> long ev,
>>>                     void *ptr)
>>>  {
>>>      struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
>>> -    struct sctp_sockaddr_entry *addr;
>>> -    struct list_head *pos, *temp;
>>> +    struct sctp_sockaddr_entry *addr = NULL;
>>> +    struct sctp_sockaddr_entry *temp;
>>>
>>>      switch (ev) {
>>>      case NETDEV_UP:
>>> @@ -615,19 +629,25 @@ static int sctp_inetaddr_event(struct
>>> notifier_block *this, unsigned long ev,
>>>              addr->a.v4.sin_family = AF_INET;
>>>              addr->a.v4.sin_port = 0;
>>>              addr->a.v4.sin_addr.s_addr = ifa->ifa_local;
>>> -            list_add_tail(&addr->list, &sctp_local_addr_list);
>>> +            addr->valid = 1;
>>> +            rcu_read_lock();
>>> +            list_add_tail_rcu(&addr->list, &sctp_local_addr_list);
>>> +            rcu_read_unlock();
>>
>> Based on the additions to the header comment, I am assuming that we
>> hold an update-side mutex.  This means that the rcu_read_lock() and
>> rcu_read_unlock() are (harmlessly) redundant.
> 
> This is called via a blocking notifier call chain and hence we could
> protect using an update-side mutex. But considering that
> sctp_inet6addr_event
> requires a spin_lock_bh(), may be we should use it here also to make it
> simple.

This again boils down the question of whether notifiers can be running concurrently...
If yes, then spin_lock_bh is the right choice.


>>
>>>          }
>>>          break;
>>>      case NETDEV_DOWN:
>>> -        list_for_each_safe(pos, temp, &sctp_local_addr_list) {
>>> -            addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>>> +        rcu_read_lock();
>>> +        list_for_each_entry_safe(addr, temp,
>>> +                    &sctp_local_addr_list, list) {
>>>              if (addr->a.v4.sin_addr.s_addr == ifa->ifa_local) {
>>> -                list_del(pos);
>>> -                kfree(addr);
>>> +                addr->valid = 0;
>>> +                list_del_rcu(&addr->list);
>>>                  break;
>>>              }
>>>          }
>>> -
>>> +        rcu_read_unlock();
>>
>> Ditto.
>>
>>> +        if (addr && !addr->valid)
>>> +            call_rcu(&addr->rcu, sctp_local_addr_free);
>>
>> This one is OK, since we hold the update-side mutex.

This would need to be protected as well, if the answer to the notifier
question is yes.

>>
>>>          break;
>>>      }
>>>
>>> @@ -1227,6 +1247,9 @@ SCTP_STATIC __exit void sctp_exit(void)
>>>      sctp_v6_del_protocol();
>>>      inet_del_protocol(&sctp_protocol, IPPROTO_SCTP);
>>>
>>> +    /* Unregister notifier for inet address additions/deletions. */
>>> +    unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
>>> +
>>>      /* Free the local address list.  */
>>>      sctp_free_local_addr_list();
>>>
>>> @@ -1240,9 +1263,6 @@ SCTP_STATIC __exit void sctp_exit(void)
>>>      inet_unregister_protosw(&sctp_stream_protosw);
>>>      inet_unregister_protosw(&sctp_seqpacket_protosw);
>>>
>>> -    /* Unregister notifier for inet address additions/deletions. */
>>> -    unregister_inetaddr_notifier(&sctp_inetaddr_notifier);
>>> -
>>>      sctp_sysctl_unregister();
>>>      list_del(&sctp_ipv4_specific.list);
>>>
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 3335460..a3acf78 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -4057,9 +4057,9 @@ static int
>>> sctp_getsockopt_local_addrs_num_old(struct sock *sk, int len,
>>>                             int __user *optlen)
>>>  {
>>>      sctp_assoc_t id;
>>> +    struct list_head *pos;
>>>      struct sctp_bind_addr *bp;
>>>      struct sctp_association *asoc;
>>> -    struct list_head *pos, *temp;
>>>      struct sctp_sockaddr_entry *addr;
>>>      rwlock_t *addr_lock;
>>>      int cnt = 0;
>>> @@ -4096,15 +4096,19 @@ static int
>>> sctp_getsockopt_local_addrs_num_old(struct sock *sk, int len,
>>>          addr = list_entry(bp->address_list.next,
>>>                    struct sctp_sockaddr_entry, list);
>>>          if (sctp_is_any(&addr->a)) {
>>> -            list_for_each_safe(pos, temp, &sctp_local_addr_list) {
>>> -                addr = list_entry(pos,
>>> -                          struct sctp_sockaddr_entry,
>>> -                          list);
>>> +            rcu_read_lock();
>>> +            list_for_each_entry_rcu(addr,
>>> +                        &sctp_local_addr_list, list) {
>>> +                if (!addr->valid)
>>> +                    continue;
>>> +
>>
>> Again, what happens if the element is deleted just at this point?
>> If harmless, might be good to get rid of ->valid.
>>
>>>                  if ((PF_INET == sk->sk_family) &&
>>>                      (AF_INET6 == addr->a.sa.sa_family))
>>>                      continue;
>>> +
>>>                  cnt++;
>>>              }
>>> +            rcu_read_unlock();
>>
>> We are just counting these things, right?  If on the other hand we are
>> keeping a reference outside of rcu_read_lock() protection, then there
>> needs to be some explicit mechanism preventing the corresponding entry
>> from being freed.

In this particular case, we are just counting.  There are other cases,
we make a copy of the address in the list.  The goal was to reduce the
probability that an address that is about to be deleted at the rcu
quiescent state will not be copied/counted.

My other thought was to use atomics, but with all the yelling about atomic_read(),
that didn't seem any better then a simple __u8 flag.

Thanks
-vlad

^ permalink raw reply

* Re: sk98lin for 2.6.23-rc1
From: Bill Davidsen @ 2007-09-11 14:29 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Stephen Hemminger, Kyle Rose, James Corey, Rob Sims, linux-kernel,
	Jeff Garzik, netdev
In-Reply-To: <20070911115456.GM3563@stusta.de>

Adrian Bunk wrote:
> On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>   
>> There are several different problems in this thread:
>> 1. The removal of old sk98lin driver caused some users to be forced to use
>>     skge. These users have uncovered issues with the dual port fiber based versions
>>     of the board.  
>>     Short term: The sk98lin driver should be restored to previous state, 
>>        and the PCI table should be used to limit the usage to only fiber systems.
>>        If Adrian doesn't do it, I'll do it when I return from Germany.
>> ...
>>     
>
> No problem with this, but since it was Jeff's patch it should better be 
> him who reverts it (and he's anyway one step nearer to Linus).
>
> But the underlying general problem still remains:
>
> How can we get people to test and report bugs with the new drivers 
> before removing the old driver?
>
>   
Sorry for a long answer, I'm trying to provide insight on two recent cases.

Thinking back to several drivers, when e100 was new I tried it because I 
had problems with eepro100 in the area of multiple cards, multiple 
cables on a single card, and jumbo packets. For a while I used both, 
until e100 worked where I need it. So I initially tried it because it 
had features I needed, and then dropped to older driver just to avoid 
having to decide.

With sk98lin, the driver worked flawlessly with all (3-4) systems, so I 
had no reason to try any other. When removing sk98lin was first 
proposed, I tried skge, first measurements showed it was 5-8% slower, 
NOT what I want, so I went back. For me there was no reliability issue, 
but I never tried it in a system with more than on NIC on the driver. 
Would "it's a little slower" be a valid bug report? Or would I have 
gotten "works fine for me" from people not beating it over Gbit? I 
didn't try sky2 until you suggested it, and I have reported my results 
previously, just stops working. Could it be my hardware? I tried it on 
one system, so yes, but sk98lin works for months.
> That's a question especially for the people who now had problems after 
> sk98lin was removed.

So if you want people to try a new driver, I think it really has to have 
some benefits to the users, in terms of performance, reliability, or 
features. "Cleaner design" doesn't motivate, and it does raise the 
question of why the old driver wasn't just cleaned up. I've been doing 
software for decades, I appreciate why, but users in general just want 
to use their system. Which raises the question of why to delete drivers 
which work for many or even most users? Testing a new kernel is no 
longer a drop in a boot operation if modprobe.conf must be edited to get 
the network up, and the typical user isn't going to write that shell 
script to try one or the other driver.

Honestly, new drivers which offer little benefit to most users are the 
exception rather than the rule, so this may a corner case I would like 
to see sk98lin back in the kernel, for a while I can build my own 
kernels and patch it in, but until other drivers are drop-in, I probably 
won't change.

Separate but related: why keep skge and sky2? Are we going through this 
again in a year? Is the benefit worth the effort?

Hope some of this is helpful.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


^ permalink raw reply

* Re: [PATCH] Add IP1000A Driver
From: Stephen Hemminger @ 2007-09-11 14:41 UTC (permalink / raw)
  To: Jesse Huang; +Cc: jeff, akpm, netdev, jesse
In-Reply-To: <1189524638.3261.1.camel@localhost.localdomain>

On Tue, 11 Sep 2007 11:30:38 -0400
Jesse Huang <jesse@icplus.com.tw> wrote:

> From: Jesse Huang <jesse@icplus.com.tw>
> 
> Change Logs: Add IP1000A Driver to kernel tree.
> 
> Signed-off-by: Jesse Huang <jesse@icplus.com.tw>

Who will be listed as maintainer of this device?
A good way to show that is to add an entry to MAINTAINERS file.


>  drivers/net/ipg.c | 2331 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/net/ipg.h |  856 +++++++++++++++++++
>  2 files changed, 3187 insertions(+), 0 deletions(-)
>  create mode 100755 drivers/net/ipg.c
>  create mode 100755 drivers/net/ipg.h
> 
> e804d1c265bf1d843f845457f925a1728bbfdff7
> diff --git a/drivers/net/ipg.c b/drivers/net/ipg.c
> new file mode 100755
> index 0000000..bdc2b8d
> --- /dev/null
> +++ b/drivers/net/ipg.c
> @@ -0,0 +1,2331 @@
> +/*
> + * ipg.c: Device Driver for the IP1000 Gigabit Ethernet Adapter
> + *
> + * Copyright (C) 2003, 2006  IC Plus Corp.
> + *
> + * Original Author:
> + *
> + *   Craig Rich
> + *   Sundance Technology, Inc.
> + *   1485 Saratoga Avenue
> + *   Suite 200
> + *   San Jose, CA 95129
> + *   408 873 4117
> + *   www.sundanceti.com
> + *   craig_rich@sundanceti.com
> + *
> + * Current Maintainer:
> + *
> + *   Sorbica Shieh.
> + *   10F, No.47, Lane 2, Kwang-Fu RD.
> + *   Sec. 2, Hsin-Chu, Taiwan, R.O.C.
> + *   http://www.icplus.com.tw
> + *   sorbica@icplus.com.tw
> + */

Names only, no physical addresses please.

> +/*
> + * Read a register from the Physical Layer device located
> + * on the IPG NIC, using the IPG PHYCTRL register.
> + */
> +static int mdio_read(struct net_device * dev, int phy_id, int phy_reg)
> +{
> +	void __iomem *ioaddr = ipg_ioaddr(dev);
> +	/*
> +	 * The GMII mangement frame structure for a read is as follows:
> +	 *
> +	 * |Preamble|st|op|phyad|regad|ta|      data      |idle|
> +	 * |< 32 1s>|01|10|AAAAA|RRRRR|z0|DDDDDDDDDDDDDDDD|z   |
> +	 *
> +	 * <32 1s> = 32 consecutive logic 1 values
> +	 * A = bit of Physical Layer device address (MSB first)
> +	 * R = bit of register address (MSB first)
> +	 * z = High impedance state
> +	 * D = bit of read data (MSB first)
> +	 *
> +	 * Transmission order is 'Preamble' field first, bits transmitted
> +	 * left to right (first to last).
> +	 */
> +	struct {
> +		u32 field;
> +		unsigned int len;
> +	} p[] = {
> +		{ GMII_PREAMBLE,	32 },	/* Preamble */
> +		{ GMII_ST,		2  },	/* ST */
> +		{ GMII_READ,		2  },	/* OP */
> +		{ phy_id,		5  },	/* PHYAD */
> +		{ phy_reg,		5  },	/* REGAD */
> +		{ 0x0000,		2  },	/* TA */
> +		{ 0x0000,		16 },	/* DATA */
> +		{ 0x0000,		1  }	/* IDLE */
> +	};

This could be declared static const, since it doesn't change.

> +	unsigned int i, j;
> +	u8 polarity, data;
> +
> +	polarity  = ipg_r8(PHY_CTRL);
> +	polarity &= (IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY);
> +
> +	/* Create the Preamble, ST, OP, PHYAD, and REGAD field. */
> +	for (j = 0; j < 5; j++) {
> +		for (i = 0; i < p[j].len; i++) {
> +			/* For each variable length field, the MSB must be
> +			 * transmitted first. Rotate through the field bits,
> +			 * starting with the MSB, and move each bit into the
> +			 * the 1st (2^1) bit position (this is the bit position
> +			 * corresponding to the MgmtData bit of the PhyCtrl
> +			 * register for the IPG).
> +			 *
> +			 * Example: ST = 01;
> +			 *
> +			 *          First write a '0' to bit 1 of the PhyCtrl
> +			 *          register, then write a '1' to bit 1 of the
> +			 *          PhyCtrl register.
> +			 *
> +			 * To do this, right shift the MSB of ST by the value:
> +			 * [field length - 1 - #ST bits already written]
> +			 * then left shift this result by 1.
> +			 */
> +			data  = (p[j].field >> (p[j].len - 1 - i)) << 1;
> +			data &= IPG_PC_MGMTDATA;
> +			data |= polarity | IPG_PC_MGMTDIR;
> +
> +			ipg_drive_phy_ctl_low_high(ioaddr, data);
> +		}
> +	}
> +
> +	send_three_state(ioaddr, polarity);
> +
> +	read_phy_bit(ioaddr, polarity);
> +
> +	/*
> +	 * For a read cycle, the bits for the next two fields (TA and
> +	 * DATA) are driven by the PHY (the IPG reads these bits).
> +	 */
> +	for (i = 0; i < p[6].len; i++) {
> +		p[6].field |=
> +		    (read_phy_bit(ioaddr, polarity) << (p[6].len - 1 - i));
> +	}
> +
> +	send_three_state(ioaddr, polarity);
> +	send_three_state(ioaddr, polarity);
> +	send_three_state(ioaddr, polarity);
> +	send_end(ioaddr, polarity);
> +
> +	/* Return the value of the DATA field. */
> +	return p[6].field;
> +}
> +
> +/*
> + * Write to a register from the Physical Layer device located
> + * on the IPG NIC, using the IPG PHYCTRL register.
> + */
> +static void mdio_write(struct net_device *dev, int phy_id, int phy_reg, int val)
> +{
> +	void __iomem *ioaddr = ipg_ioaddr(dev);
> +	/*
> +	 * The GMII mangement frame structure for a read is as follows:
> +	 *
> +	 * |Preamble|st|op|phyad|regad|ta|      data      |idle|
> +	 * |< 32 1s>|01|10|AAAAA|RRRRR|z0|DDDDDDDDDDDDDDDD|z   |
> +	 *
> +	 * <32 1s> = 32 consecutive logic 1 values
> +	 * A = bit of Physical Layer device address (MSB first)
> +	 * R = bit of register address (MSB first)
> +	 * z = High impedance state
> +	 * D = bit of write data (MSB first)
> +	 *
> +	 * Transmission order is 'Preamble' field first, bits transmitted
> +	 * left to right (first to last).
> +	 */
> +	struct {
> +		u32 field;
> +		unsigned int len;
> +	} p[] = {
> +		{ GMII_PREAMBLE,	32 },	/* Preamble */
> +		{ GMII_ST,		2  },	/* ST */
> +		{ GMII_WRITE,		2  },	/* OP */
> +		{ phy_id,		5  },	/* PHYAD */
> +		{ phy_reg,		5  },	/* REGAD */
> +		{ 0x0002,		2  },	/* TA */
> +		{ val & 0xffff,		16 },	/* DATA */
> +		{ 0x0000,		1  }	/* IDLE */
> +	};
> +	unsigned int i, j;
> +	u8 polarity, data;
> +
> +	polarity  = ipg_r8(PHY_CTRL);
> +	polarity &= (IPG_PC_DUPLEX_POLARITY | IPG_PC_LINK_POLARITY);
> +
> +	/* Create the Preamble, ST, OP, PHYAD, and REGAD field. */
> +	for (j = 0; j < 7; j++) {
> +		for (i = 0; i < p[j].len; i++) {
> +			/* For each variable length field, the MSB must be
> +			 * transmitted first. Rotate through the field bits,
> +			 * starting with the MSB, and move each bit into the
> +			 * the 1st (2^1) bit position (this is the bit position
> +			 * corresponding to the MgmtData bit of the PhyCtrl
> +			 * register for the IPG).
> +			 *
> +			 * Example: ST = 01;
> +			 *
> +			 *          First write a '0' to bit 1 of the PhyCtrl
> +			 *          register, then write a '1' to bit 1 of the
> +			 *          PhyCtrl register.
> +			 *
> +			 * To do this, right shift the MSB of ST by the value:
> +			 * [field length - 1 - #ST bits already written]
> +			 * then left shift this result by 1.
> +			 */
> +			data  = (p[j].field >> (p[j].len - 1 - i)) << 1;
> +			data &= IPG_PC_MGMTDATA;
> +			data |= polarity | IPG_PC_MGMTDIR;
> +
> +			ipg_drive_phy_ctl_low_high(ioaddr, data);
> +		}
> +	}
> +
> +	/* The last cycle is a tri-state, so read from the PHY. */
> +	for (j = 7; j < 8; j++) {
> +		for (i = 0; i < p[j].len; i++) {
> +			ipg_write_phy_ctl(ioaddr, IPG_PC_MGMTCLK_LO | polarity);
> +
> +			p[j].field |= ((ipg_r8(PHY_CTRL) &
> +				IPG_PC_MGMTDATA) >> 1) << (p[j].len - 1 - i);
> +
> +			ipg_write_phy_ctl(ioaddr, IPG_PC_MGMTCLK_HI | polarity);
> +		}
> +	}
> +}
> +
> +/* Set LED_Mode JES20040127EEPROM */
> +static void ipg_set_led_mode(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	u32 mode;
> +
> +	mode = ipg_r32(ASIC_CTRL);
> +	mode &= ~(IPG_AC_LED_MODE_BIT_1 | IPG_AC_LED_MODE | IPG_AC_LED_SPEED);
> +
> +	if ((sp->LED_Mode & 0x03) > 1)
> +		mode |= IPG_AC_LED_MODE_BIT_1;	/* Write Asic Control Bit 29 */
> +
> +	if ((sp->LED_Mode & 0x01) == 1)
> +		mode |= IPG_AC_LED_MODE;	/* Write Asic Control Bit 14 */
> +
> +	if ((sp->LED_Mode & 0x08) == 8)
> +		mode |= IPG_AC_LED_SPEED;	/* Write Asic Control Bit 27 */
> +
> +	ipg_w32(mode, ASIC_CTRL);
> +}
> +
> +/* Set PHYSet JES20040127EEPROM */
> +static void ipg_set_phy_set(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	int physet;
> +
> +	physet = ipg_r8(PHY_SET);
> +	physet &= ~(IPG_PS_MEM_LENB9B | IPG_PS_MEM_LEN9 | IPG_PS_NON_COMPDET);
> +	physet |= ((sp->LED_Mode & 0x70) >> 4);
> +	ipg_w8(physet, PHY_SET);
> +}
> +
> +static int ipg_reset(struct net_device *dev, u32 resetflags)
> +{
> +	/* Assert functional resets via the IPG AsicCtrl
> +	 * register as specified by the 'resetflags' input
> +	 * parameter.
> +	 */
> +	void __iomem *ioaddr = ipg_ioaddr(dev);	//JES20040127EEPROM:
> +	unsigned int timeout_count = 0;
> +
> +	IPG_DEBUG_MSG("_reset\n");
> +
> +	ipg_w32(ipg_r32(ASIC_CTRL) | resetflags, ASIC_CTRL);
> +
> +	/* Delay added to account for problem with 10Mbps reset. */
> +	mdelay(IPG_AC_RESETWAIT);
> +
> +	while (IPG_AC_RESET_BUSY & ipg_r32(ASIC_CTRL)) {
> +		mdelay(IPG_AC_RESETWAIT);
> +		if (++timeout_count > IPG_AC_RESET_TIMEOUT)
> +			return -ETIME;
> +	}
> +	/* Set LED Mode in Asic Control JES20040127EEPROM */
> +	ipg_set_led_mode(dev);
> +
> +	/* Set PHYSet Register Value JES20040127EEPROM */
> +	ipg_set_phy_set(dev);
> +	return 0;
> +}
> +
> +/* Find the GMII PHY address. */
> +static int ipg_find_phyaddr(struct net_device *dev)
> +{
> +	unsigned int phyaddr, i;
> +
> +	for (i = 0; i < 32; i++) {
> +		u32 status;
> +
> +		/* Search for the correct PHY address among 32 possible. */
> +		phyaddr = (IPG_NIC_PHY_ADDRESS + i) % 32;
> +
> +		/* 10/22/03 Grace change verify from GMII_PHY_STATUS to
> +		   GMII_PHY_ID1
> +		 */
> +
> +		status = mdio_read(dev, phyaddr, MII_BMSR);
> +
> +		if ((status != 0xFFFF) && (status != 0))
> +			return phyaddr;
> +	}
> +
> +	return 0x1f;
> +}
> +
> +/*
> + * Configure IPG based on result of IEEE 802.3 PHY
> + * auto-negotiation.
> + */
> +static int ipg_config_autoneg(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	unsigned int txflowcontrol;
> +	unsigned int rxflowcontrol;
> +	unsigned int fullduplex;
> +	unsigned int gig;
> +	u32 mac_ctrl_val;
> +	u32 asicctrl;
> +	u8 phyctrl;
> +
> +	IPG_DEBUG_MSG("_config_autoneg\n");
> +
> +	asicctrl = ipg_r32(ASIC_CTRL);
> +	phyctrl = ipg_r8(PHY_CTRL);
> +	mac_ctrl_val = ipg_r32(MAC_CTRL);
> +
> +	/* Set flags for use in resolving auto-negotation, assuming
> +	 * non-1000Mbps, half duplex, no flow control.
> +	 */
> +	fullduplex = 0;
> +	txflowcontrol = 0;
> +	rxflowcontrol = 0;
> +	gig = 0;
> +
> +	/* To accomodate a problem in 10Mbps operation,
> +	 * set a global flag if PHY running in 10Mbps mode.
> +	 */
> +	sp->tenmbpsmode = 0;
> +
> +	printk(KERN_INFO "%s: Link speed = ", dev->name);
> +
> +	/* Determine actual speed of operation. */
> +	switch (phyctrl & IPG_PC_LINK_SPEED) {
> +	case IPG_PC_LINK_SPEED_10MBPS:
> +		printk("10Mbps.\n");
> +		printk(KERN_INFO "%s: 10Mbps operational mode enabled.\n",
> +		       dev->name);
> +		sp->tenmbpsmode = 1;
> +		break;
> +	case IPG_PC_LINK_SPEED_100MBPS:
> +		printk("100Mbps.\n");
> +		break;
> +	case IPG_PC_LINK_SPEED_1000MBPS:
> +		printk("1000Mbps.\n");
> +		gig = 1;
> +		break;
> +	default:
> +		printk("undefined!\n");
> +		return 0;
> +	}
> +
> +	if (phyctrl & IPG_PC_DUPLEX_STATUS) {
> +		fullduplex = 1;
> +		txflowcontrol = 1;
> +		rxflowcontrol = 1;
> +	}
> +
> +	/* Configure full duplex, and flow control. */
> +	if (fullduplex == 1) {
> +		/* Configure IPG for full duplex operation. */
> +		printk(KERN_INFO "%s: setting full duplex, ", dev->name);
> +
> +		mac_ctrl_val |= IPG_MC_DUPLEX_SELECT_FD;
> +
> +		if (txflowcontrol == 1) {
> +			printk("TX flow control");
> +			mac_ctrl_val |= IPG_MC_TX_FLOW_CONTROL_ENABLE;
> +		} else {
> +			printk("no TX flow control");
> +			mac_ctrl_val &= ~IPG_MC_TX_FLOW_CONTROL_ENABLE;
> +		}
> +
> +		if (rxflowcontrol == 1) {
> +			printk(", RX flow control.");
> +			mac_ctrl_val |= IPG_MC_RX_FLOW_CONTROL_ENABLE;
> +		} else {
> +			printk(", no RX flow control.");
> +			mac_ctrl_val &= ~IPG_MC_RX_FLOW_CONTROL_ENABLE;
> +		}
> +
> +		printk("\n");
> +	} else {
> +		/* Configure IPG for half duplex operation. */
> +	        printk(KERN_INFO "%s: setting half duplex, "
> +		       "no TX flow control, no RX flow control.\n", dev->name);
> +
> +		mac_ctrl_val &= ~IPG_MC_DUPLEX_SELECT_FD &
> +			~IPG_MC_TX_FLOW_CONTROL_ENABLE &
> +			~IPG_MC_RX_FLOW_CONTROL_ENABLE;
> +	}
> +	ipg_w32(mac_ctrl_val, MAC_CTRL);
> +	return 0;
> +}
> +
> +/* Determine and configure multicast operation and set
> + * receive mode for IPG.
> + */
> +static void ipg_nic_set_multicast_list(struct net_device *dev)
> +{
> +	void __iomem *ioaddr = ipg_ioaddr(dev);
> +	struct dev_mc_list *mc_list_ptr;
> +	unsigned int hashindex;
> +	u32 hashtable[2];
> +	u8 receivemode;
> +
> +	IPG_DEBUG_MSG("_nic_set_multicast_list\n");
> +
> +	receivemode = IPG_RM_RECEIVEUNICAST | IPG_RM_RECEIVEBROADCAST;
> +
> +	if (dev->flags & IFF_PROMISC) {
> +		/* NIC to be configured in promiscuous mode. */
> +		receivemode = IPG_RM_RECEIVEALLFRAMES;
> +	} else if ((dev->flags & IFF_ALLMULTI) ||
> +		   (dev->flags & IFF_MULTICAST &
> +		    (dev->mc_count > IPG_MULTICAST_HASHTABLE_SIZE))) {
> +		/* NIC to be configured to receive all multicast
> +		 * frames. */
> +		receivemode |= IPG_RM_RECEIVEMULTICAST;
> +	} else if (dev->flags & IFF_MULTICAST & (dev->mc_count > 0)) {
> +		/* NIC to be configured to receive selected
> +		 * multicast addresses. */
> +		receivemode |= IPG_RM_RECEIVEMULTICASTHASH;
> +	}
> +
> +	/* Calculate the bits to set for the 64 bit, IPG HASHTABLE.
> +	 * The IPG applies a cyclic-redundancy-check (the same CRC
> +	 * used to calculate the frame data FCS) to the destination
> +	 * address all incoming multicast frames whose destination
> +	 * address has the multicast bit set. The least significant
> +	 * 6 bits of the CRC result are used as an addressing index
> +	 * into the hash table. If the value of the bit addressed by
> +	 * this index is a 1, the frame is passed to the host system.
> +	 */
> +
> +	/* Clear hashtable. */
> +	hashtable[0] = 0x00000000;
> +	hashtable[1] = 0x00000000;
> +
> +	/* Cycle through all multicast addresses to filter. */
> +	for (mc_list_ptr = dev->mc_list;
> +	     mc_list_ptr != NULL; mc_list_ptr = mc_list_ptr->next) {
> +		/* Calculate CRC result for each multicast address. */
> +		hashindex = crc32_le(0xffffffff, mc_list_ptr->dmi_addr,
> +				     ETH_ALEN);
> +
> +		/* Use only the least significant 6 bits. */
> +		hashindex = hashindex & 0x3F;
> +
> +		/* Within "hashtable", set bit number "hashindex"
> +		 * to a logic 1.
> +		 */
> +		set_bit(hashindex, (void *)hashtable);
> +	}
> +
> +	/* Write the value of the hashtable, to the 4, 16 bit
> +	 * HASHTABLE IPG registers.
> +	 */
> +	ipg_w32(hashtable[0], HASHTABLE_0);
> +	ipg_w32(hashtable[1], HASHTABLE_1);
> +
> +	ipg_w8(IPG_RM_RSVD_MASK & receivemode, RECEIVE_MODE);
> +
> +	IPG_DEBUG_MSG("ReceiveMode = %x\n", ipg_r8(RECEIVE_MODE));
> +}
> +
> +static int ipg_io_config(struct net_device *dev)
> +{
> +	void __iomem *ioaddr = ipg_ioaddr(dev);
> +	u32 origmacctrl;
> +	u32 restoremacctrl;
> +
> +	IPG_DEBUG_MSG("_io_config\n");
> +
> +	origmacctrl = ipg_r32(MAC_CTRL);
> +
> +	restoremacctrl = origmacctrl | IPG_MC_STATISTICS_ENABLE;
> +
> +	/* Based on compilation option, determine if FCS is to be
> +	 * stripped on receive frames by IPG.
> +	 */
> +	if (!IPG_STRIP_FCS_ON_RX)
> +		restoremacctrl |= IPG_MC_RCV_FCS;
> +
> +	/* Determine if transmitter and/or receiver are
> +	 * enabled so we may restore MACCTRL correctly.
> +	 */
> +	if (origmacctrl & IPG_MC_TX_ENABLED)
> +		restoremacctrl |= IPG_MC_TX_ENABLE;
> +
> +	if (origmacctrl & IPG_MC_RX_ENABLED)
> +		restoremacctrl |= IPG_MC_RX_ENABLE;
> +
> +	/* Transmitter and receiver must be disabled before setting
> +	 * IFSSelect.
> +	 */
> +	ipg_w32((origmacctrl & (IPG_MC_RX_DISABLE | IPG_MC_TX_DISABLE)) &
> +		IPG_MC_RSVD_MASK, MAC_CTRL);
> +
> +	/* Now that transmitter and receiver are disabled, write
> +	 * to IFSSelect.
> +	 */
> +	ipg_w32((origmacctrl & IPG_MC_IFS_96BIT) & IPG_MC_RSVD_MASK, MAC_CTRL);
> +
> +	/* Set RECEIVEMODE register. */
> +	ipg_nic_set_multicast_list(dev);
> +
> +	ipg_w16(IPG_MAX_RXFRAME_SIZE, MAX_FRAME_SIZE);
> +
> +	ipg_w8(IPG_RXDMAPOLLPERIOD_VALUE,   RX_DMA_POLL_PERIOD);
> +	ipg_w8(IPG_RXDMAURGENTTHRESH_VALUE, RX_DMA_URGENT_THRESH);
> +	ipg_w8(IPG_RXDMABURSTTHRESH_VALUE,  RX_DMA_BURST_THRESH);
> +	ipg_w8(IPG_TXDMAPOLLPERIOD_VALUE,   TX_DMA_POLL_PERIOD);
> +	ipg_w8(IPG_TXDMAURGENTTHRESH_VALUE, TX_DMA_URGENT_THRESH);
> +	ipg_w8(IPG_TXDMABURSTTHRESH_VALUE,  TX_DMA_BURST_THRESH);
> +	ipg_w16((IPG_IE_HOST_ERROR | IPG_IE_TX_DMA_COMPLETE |
> +		 IPG_IE_TX_COMPLETE | IPG_IE_INT_REQUESTED |
> +		 IPG_IE_UPDATE_STATS | IPG_IE_LINK_EVENT |
> +		 IPG_IE_RX_DMA_COMPLETE | IPG_IE_RX_DMA_PRIORITY), INT_ENABLE);
> +	ipg_w16(IPG_FLOWONTHRESH_VALUE,  FLOW_ON_THRESH);
> +	ipg_w16(IPG_FLOWOFFTHRESH_VALUE, FLOW_OFF_THRESH);
> +
> +	/* IPG multi-frag frame bug workaround.
> +	 * Per silicon revision B3 eratta.
> +	 */
> +	ipg_w16(ipg_r16(DEBUG_CTRL) | 0x0200, DEBUG_CTRL);
> +
> +	/* IPG TX poll now bug workaround.
> +	 * Per silicon revision B3 eratta.
> +	 */
> +	ipg_w16(ipg_r16(DEBUG_CTRL) | 0x0010, DEBUG_CTRL);
> +
> +	/* IPG RX poll now bug workaround.
> +	 * Per silicon revision B3 eratta.
> +	 */
> +	ipg_w16(ipg_r16(DEBUG_CTRL) | 0x0020, DEBUG_CTRL);
> +
> +	/* Now restore MACCTRL to original setting. */
> +	ipg_w32(IPG_MC_RSVD_MASK & restoremacctrl, MAC_CTRL);
> +
> +	/* Disable unused RMON statistics. */
> +	ipg_w32(IPG_RZ_ALL, RMON_STATISTICS_MASK);
> +
> +	/* Disable unused MIB statistics. */
> +	ipg_w32(IPG_SM_MACCONTROLFRAMESXMTD | IPG_SM_MACCONTROLFRAMESRCVD |
> +		IPG_SM_BCSTOCTETXMTOK_BCSTFRAMESXMTDOK | IPG_SM_TXJUMBOFRAMES |
> +		IPG_SM_MCSTOCTETXMTOK_MCSTFRAMESXMTDOK | IPG_SM_RXJUMBOFRAMES |
> +		IPG_SM_BCSTOCTETRCVDOK_BCSTFRAMESRCVDOK |
> +		IPG_SM_UDPCHECKSUMERRORS | IPG_SM_TCPCHECKSUMERRORS |
> +		IPG_SM_IPCHECKSUMERRORS, STATISTICS_MASK);
> +
> +	return 0;
> +}
> +
> +/*
> + * Create a receive buffer within system memory and update
> + * NIC private structure appropriately.
> + */
> +static int ipg_get_rxbuff(struct net_device *dev, int entry)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	struct ipg_rx *rxfd = sp->rxd + entry;
> +	struct sk_buff *skb;
> +	u64 rxfragsize;
> +
> +	IPG_DEBUG_MSG("_get_rxbuff\n");
> +
> +	skb = netdev_alloc_skb(dev, IPG_RXSUPPORT_SIZE + NET_IP_ALIGN);
> +	if (!skb) {
> +		sp->RxBuff[entry] = NULL;
> +		return -ENOMEM;
> +	}
> +
> +	/* Adjust the data start location within the buffer to
> +	 * align IP address field to a 16 byte boundary.
> +	 */
> +	skb_reserve(skb, NET_IP_ALIGN);
> +
> +	/* Associate the receive buffer with the IPG NIC. */
> +	skb->dev = dev;
> +
> +	/* Save the address of the sk_buff structure. */
> +	sp->RxBuff[entry] = skb;
> +
> +	rxfd->frag_info = cpu_to_le64(pci_map_single(sp->pdev, skb->data,
> +		sp->rx_buf_sz, PCI_DMA_FROMDEVICE));
> +
> +	/* Set the RFD fragment length. */
> +	rxfragsize = IPG_RXFRAG_SIZE;
> +	rxfd->frag_info |= cpu_to_le64((rxfragsize << 48) & IPG_RFI_FRAGLEN);
> +
> +	return 0;
> +}
> +
> +static int init_rfdlist(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	unsigned int i;
> +
> +	IPG_DEBUG_MSG("_init_rfdlist\n");
> +
> +	for (i = 0; i < IPG_RFDLIST_LENGTH; i++) {
> +		struct ipg_rx *rxfd = sp->rxd + i;
> +
> +		if (sp->RxBuff[i]) {
> +			pci_unmap_single(sp->pdev,
> +				le64_to_cpu(rxfd->frag_info & ~IPG_RFI_FRAGLEN),
> +				sp->rx_buf_sz, PCI_DMA_FROMDEVICE);
> +			IPG_DEV_KFREE_SKB(sp->RxBuff[i]);
> +			sp->RxBuff[i] = NULL;
> +		}
> +
> +		/* Clear out the RFS field. */
> +		rxfd->rfs = 0x0000000000000000;
> +
> +		if (ipg_get_rxbuff(dev, i) < 0) {
> +			/*
> +			 * A receive buffer was not ready, break the
> +			 * RFD list here.
> +			 */
> +			IPG_DEBUG_MSG("Cannot allocate Rx buffer.\n");
> +
> +			/* Just in case we cannot allocate a single RFD.
> +			 * Should not occur.
> +			 */
> +			if (i == 0) {
> +				printk(KERN_ERR "%s: No memory available"
> +					" for RFD list.\n", dev->name);
> +				return -ENOMEM;
> +			}
> +		}
> +
> +		rxfd->next_desc = cpu_to_le64(sp->rxd_map +
> +			sizeof(struct ipg_rx)*(i + 1));
> +	}
> +	sp->rxd[i - 1].next_desc = cpu_to_le64(sp->rxd_map);
> +
> +	sp->rx_current = 0;
> +	sp->rx_dirty = 0;
> +
> +	/* Write the location of the RFDList to the IPG. */
> +	ipg_w32((u32) sp->rxd_map, RFD_LIST_PTR_0);
> +	ipg_w32(0x00000000, RFD_LIST_PTR_1);
> +
> +	return 0;
> +}
> +
> +static void init_tfdlist(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	unsigned int i;
> +
> +	IPG_DEBUG_MSG("_init_tfdlist\n");
> +
> +	for (i = 0; i < IPG_TFDLIST_LENGTH; i++) {
> +		struct ipg_tx *txfd = sp->txd + i;
> +
> +		txfd->tfc = cpu_to_le64(IPG_TFC_TFDDONE);
> +
> +		if (sp->TxBuff[i]) {
> +			IPG_DEV_KFREE_SKB(sp->TxBuff[i]);
> +			sp->TxBuff[i] = NULL;
> +		}
> +
> +		txfd->next_desc = cpu_to_le64(sp->txd_map +
> +			sizeof(struct ipg_tx)*(i + 1));
> +	}
> +	sp->txd[i - 1].next_desc = cpu_to_le64(sp->txd_map);
> +
> +	sp->tx_current = 0;
> +	sp->tx_dirty = 0;
> +
> +	/* Write the location of the TFDList to the IPG. */
> +	IPG_DDEBUG_MSG("Starting TFDListPtr = %8.8x\n",
> +		       (u32) sp->txd_map);
> +	ipg_w32((u32) sp->txd_map, TFD_LIST_PTR_0);
> +	ipg_w32(0x00000000, TFD_LIST_PTR_1);
> +
> +	sp->ResetCurrentTFD = 1;
> +}
> +
> +/*
> + * Free all transmit buffers which have already been transfered
> + * via DMA to the IPG.
> + */
> +static void ipg_nic_txfree(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	const unsigned int curr = ipg_r32(TFD_LIST_PTR_0) -
> +		(sp->txd_map / sizeof(struct ipg_tx)) - 1;
> +	unsigned int released, pending;
> +
> +	IPG_DEBUG_MSG("_nic_txfree\n");
> +
> +	pending = sp->tx_current - sp->tx_dirty;
> +
> +	for (released = 0; released < pending; released++) {
> +		unsigned int dirty = sp->tx_dirty % IPG_TFDLIST_LENGTH;
> +		struct sk_buff *skb = sp->TxBuff[dirty];
> +		struct ipg_tx *txfd = sp->txd + dirty;
> +
> +		IPG_DEBUG_MSG("TFC = %16.16lx\n", (unsigned long) txfd->tfc);
> +
> +		/* Look at each TFD's TFC field beginning
> +		 * at the last freed TFD up to the current TFD.
> +		 * If the TFDDone bit is set, free the associated
> +		 * buffer.
> +		 */
> +		if (dirty == curr)
> +			break;
> +
> +		/* Setup TFDDONE for compatible issue. */
> +		txfd->tfc |= cpu_to_le64(IPG_TFC_TFDDONE);
> +
> +		/* Free the transmit buffer. */
> +		if (skb) {
> +			pci_unmap_single(sp->pdev,
> +				le64_to_cpu(txfd->frag_info & ~IPG_TFI_FRAGLEN),
> +				skb->len, PCI_DMA_TODEVICE);
> +
> +			IPG_DEV_KFREE_SKB(skb);
> +
> +			sp->TxBuff[dirty] = NULL;
> +		}
> +	}
> +
> +	sp->tx_dirty += released;
> +
> +	if (netif_queue_stopped(dev) &&
> +	    (sp->tx_current != (sp->tx_dirty + IPG_TFDLIST_LENGTH))) {
> +		netif_wake_queue(dev);
> +	}
> +}
> +
> +static void ipg_tx_timeout(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +
> +	ipg_reset(dev, IPG_AC_TX_RESET | IPG_AC_DMA | IPG_AC_NETWORK |
> +		  IPG_AC_FIFO);
> +
> +	spin_lock_irq(&sp->lock);
> +
> +	/* Re-configure after DMA reset. */
> +	if (ipg_io_config(dev) < 0) {
> +		printk(KERN_INFO "%s: Error during re-configuration.\n",
> +		       dev->name);
> +	}
> +
> +	init_tfdlist(dev);
> +
> +	spin_unlock_irq(&sp->lock);
> +
> +	ipg_w32((ipg_r32(MAC_CTRL) | IPG_MC_TX_ENABLE) & IPG_MC_RSVD_MASK,
> +		MAC_CTRL);
> +}
> +
> +/*
> + * For TxComplete interrupts, free all transmit
> + * buffers which have already been transfered via DMA
> + * to the IPG.
> + */
> +static void ipg_nic_txcleanup(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	unsigned int i;
> +
> +	IPG_DEBUG_MSG("_nic_txcleanup\n");
> +
> +	for (i = 0; i < IPG_TFDLIST_LENGTH; i++) {
> +		/* Reading the TXSTATUS register clears the
> +		 * TX_COMPLETE interrupt.
> +		 */
> +		u32 txstatusdword = ipg_r32(TX_STATUS);
> +
> +		IPG_DEBUG_MSG("TxStatus = %8.8x\n", txstatusdword);
> +
> +		/* Check for Transmit errors. Error bits only valid if
> +		 * TX_COMPLETE bit in the TXSTATUS register is a 1.
> +		 */
> +		if (!(txstatusdword & IPG_TS_TX_COMPLETE))
> +			break;
> +
> +		/* If in 10Mbps mode, indicate transmit is ready. */
> +		if (sp->tenmbpsmode) {
> +			netif_wake_queue(dev);
> +		}
> +
> +		/* Transmit error, increment stat counters. */
> +		if (txstatusdword & IPG_TS_TX_ERROR) {
> +			IPG_DEBUG_MSG("Transmit error.\n");
> +			sp->stats.tx_errors++;
> +		}
> +
> +		/* Late collision, re-enable transmitter. */
> +		if (txstatusdword & IPG_TS_LATE_COLLISION) {
> +			IPG_DEBUG_MSG("Late collision on transmit.\n");
> +			ipg_w32((ipg_r32(MAC_CTRL) | IPG_MC_TX_ENABLE) &
> +				IPG_MC_RSVD_MASK, MAC_CTRL);
> +		}
> +
> +		/* Maximum collisions, re-enable transmitter. */
> +		if (txstatusdword & IPG_TS_TX_MAX_COLL) {
> +			IPG_DEBUG_MSG("Maximum collisions on transmit.\n");
> +			ipg_w32((ipg_r32(MAC_CTRL) | IPG_MC_TX_ENABLE) &
> +				IPG_MC_RSVD_MASK, MAC_CTRL);
> +		}
> +
> +		/* Transmit underrun, reset and re-enable
> +		 * transmitter.
> +		 */
> +		if (txstatusdword & IPG_TS_TX_UNDERRUN) {
> +			IPG_DEBUG_MSG("Transmitter underrun.\n");
> +			sp->stats.tx_fifo_errors++;
> +			ipg_reset(dev, IPG_AC_TX_RESET | IPG_AC_DMA |
> +				  IPG_AC_NETWORK | IPG_AC_FIFO);
> +
> +			/* Re-configure after DMA reset. */
> +			if (ipg_io_config(dev) < 0) {
> +				printk(KERN_INFO
> +				       "%s: Error during re-configuration.\n",
> +				       dev->name);
> +			}
> +			init_tfdlist(dev);
> +
> +			ipg_w32((ipg_r32(MAC_CTRL) | IPG_MC_TX_ENABLE) &
> +				IPG_MC_RSVD_MASK, MAC_CTRL);
> +		}
> +	}
> +
> +	ipg_nic_txfree(dev);
> +}
> +
> +/* Provides statistical information about the IPG NIC. */
> +struct net_device_stats *ipg_nic_get_stats(struct net_device *dev)
> +{
> +	struct ipg_nic_private *sp = netdev_priv(dev);
> +	void __iomem *ioaddr = sp->ioaddr;
> +	u16 temp1;
> +	u16 temp2;
> +
> +	IPG_DEBUG_MSG("_nic_get_stats\n");
> +
> +	/* Check to see if the NIC has been initialized via nic_open,
> +	 * before trying to read statistic registers.
> +	 */
> +	if (!test_bit(__LINK_STATE_START, &dev->state))
> +		return &sp->stats;

The latest kernel has a statistics struct inside the netdevice that
can be used instead of having your own.


...

> +			/* If the frame contains an IP/TCP/UDP frame,
> +			 * determine if upper layer must check IP/TCP/UDP
> +			 * checksums.
> +			 *
> +			 * NOTE: DO NOT RELY ON THE TCP/UDP CHECKSUM
> +			 *       VERIFICATION FOR SILICON REVISIONS B3
> +			 *       AND EARLIER!
> +			 *
> +			 if ((le64_to_cpu(rxfd->rfs &
> +			 (IPG_RFS_TCPDETECTED | IPG_RFS_UDPDETECTED |
> +			 IPG_RFS_IPDETECTED))) &&
> +			 !(le64_to_cpu(rxfd->rfs &
> +			 (IPG_RFS_TCPERROR | IPG_RFS_UDPERROR |
> +			 IPG_RFS_IPERROR))))
> +			 {
> +			 * Indicate IP checksums were performed
> +			 * by the IPG.
> +			 *
> +			 skb->ip_summed = CHECKSUM_UNNECESSARY;
> +			 }

Sudden loss of proper indentation style

> +			 else
> +			 */
> +			if (1 == 1) {
> +				/* The IPG encountered an error with (or
> +				 * there were no) IP/TCP/UDP checksums.
> +				 * This may or may not indicate an invalid
> +				 * IP/TCP/UDP frame was received. Let the
> +				 * upper layer decide.
> +				 */
> +				skb->ip_summed = CHECKSUM_NONE;
> +			}
> +
> +			/* Hand off frame for higher layer processing.
> +			 * The function netif_rx() releases the sk_buff
> +			 * when processing completes.
> +			 */
> +			netif_rx(skb);
> +
> +			/* Record frame receive time (jiffies = Linux
> +			 * kernel current time stamp).
> +			 */
> +			dev->last_rx = jiffies;
> +		}
> +
> +		/* Assure RX buffer is not reused by IPG. */
> +		sp->RxBuff[entry] = NULL;
> +	}
> +
> +	/*
> +	 * If there are more RFDs to proces and the allocated amount of RFD
> +	 * processing time has expired, assert Interrupt Requested to make
> +	 * sure we come back to process the remaining RFDs.
> +	 */
> +	if (i == IPG_MAXRFDPROCESS_COUNT)
> +		ipg_w32(ipg_r32(ASIC_CTRL) | IPG_AC_INT_REQUEST, ASIC_CTRL);
> +
> +#ifdef IPG_DEBUG
> +	/* Check if the RFD list contained no receive frame data. */
> +	if (!i)
> +		sp->EmptyRFDListCount++;
> +#endif
> +	while ((le64_to_cpu(rxfd->rfs & IPG_RFS_RFDDONE)) &&
> +	       !((le64_to_cpu(rxfd->rfs & IPG_RFS_FRAMESTART)) &&
> +		 (le64_to_cpu(rxfd->rfs & IPG_RFS_FRAMEEND)))) {
> +		unsigned int entry = curr++ % IPG_RFDLIST_LENGTH;
> +
> +		rxfd = sp->rxd + entry;
> +
> +		IPG_DEBUG_MSG("Frame requires multiple RFDs.\n");
> +
> +		/* An unexpected event, additional code needed to handle
> +		 * properly. So for the time being, just disregard the
> +		 * frame.
> +		 */
> +
> +		/* Free the memory associated with the RX
> +		 * buffer since it is erroneous and we will
> +		 * not pass it to higher layer processes.
> +		 */
> +		if (sp->RxBuff[entry]) {
> +			pci_unmap_single(sp->pdev,
> +				le64_to_cpu(rxfd->frag_info & ~IPG_RFI_FRAGLEN),
> +				sp->rx_buf_sz, PCI_DMA_FROMDEVICE);
> +			IPG_DEV_KFREE_SKB(sp->RxBuff[entry]);
> +		}
> +
> +		/* Assure RX buffer is not reused by IPG. */
> +		sp->RxBuff[entry] = NULL;
> +	}
> +
> +	sp->rx_current = curr;
> +
> +	/* Check to see if there are a minimum number of used
> +	 * RFDs before restoring any (should improve performance.)
> +	 */
> +	if ((curr - sp->rx_dirty) >= IPG_MINUSEDRFDSTOFREE)
> +		ipg_nic_rxrestore(dev);
> +
> +	return 0;
> +}
> +#endif
>

^ permalink raw reply

* [PATCH] Configurable tap interface MTU
From: Ed Swierk @ 2007-09-11 14:42 UTC (permalink / raw)
  To: netdev, maxk; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 422 bytes --]

This patch makes it possible to change the MTU on a tap interface.
Increasing the MTU beyond the 1500-byte default is useful for
applications that interoperate with Ethernet devices supporting jumbo
frames.

The patch caps the MTU somewhat arbitrarily at 16000 bytes. This is
slightly lower than the value used by the e1000 driver, so it seems
like a safe upper limit.

Signed-off-by: Ed Swierk <eswierk@arastra.com>

---

[-- Attachment #2: tap-change-mtu.patch --]
[-- Type: application/octet-stream, Size: 931 bytes --]

Index: linux-2.6.22.6/drivers/net/tun.c
===================================================================
--- linux-2.6.22.6.orig/drivers/net/tun.c
+++ linux-2.6.22.6/drivers/net/tun.c
@@ -171,6 +171,18 @@ tun_net_mclist(struct net_device *dev)
 	}
 }
 
+#define MIN_MTU 68
+#define MAX_MTU 16000
+
+static int
+tun_net_change_mtu(struct net_device *dev, int new_mtu)
+{
+	if (new_mtu < MIN_MTU || new_mtu > MAX_MTU)
+		return -EINVAL;
+	dev->mtu = new_mtu;
+	return 0;
+}
+
 static struct net_device_stats *tun_net_stats(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -200,6 +212,7 @@ static void tun_net_init(struct net_devi
 		dev->set_multicast_list = tun_net_mclist;
 
 		ether_setup(dev);
+		dev->change_mtu = tun_net_change_mtu;
 
 		/* random address already created for us by tun_set_iff, use it */
 		memcpy(dev->dev_addr, tun->dev_addr, min(sizeof(tun->dev_addr), sizeof(dev->dev_addr)) );

^ permalink raw reply

* Re: [RFC PATCH 2/2] SCTP: Convert bind_addr_list locking to RCU
From: Vlad Yasevich @ 2007-09-11 14:56 UTC (permalink / raw)
  To: paulmck; +Cc: netdev, lksctp-developers
In-Reply-To: <20070910220841.GK11801@linux.vnet.ibm.com>

Hi Paul

Thanks for review.  I'll leave out the comments about
the ->valid usage since there are the same as the first patch
in the series.

Other questions/responses below...

Paul E. McKenney wrote:
>>
>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>> index 7fc369f..9c7db1f 100644
>> --- a/net/sctp/bind_addr.c
>> +++ b/net/sctp/bind_addr.c
>> @@ -167,7 +167,10 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *new,
>>
>>  	INIT_LIST_HEAD(&addr->list);
>>  	INIT_RCU_HEAD(&addr->rcu);
>> -	list_add_tail(&addr->list, &bp->address_list);
>> +
>> +	rcu_read_lock();
>> +	list_add_tail_rcu(&addr->list, &bp->address_list);
>> +	rcu_read_unlock();
> 
> Given the original code, we presumably hold the update-side lock.  If so,
> the rcu_read_lock() and rcu_read_unlock() are (harmlessly) redundant.
> 

Yes, it this case, the writer would already hold the socket lock.  However, I was told during
private review that even writers need to be in rcu critical section.  Looking at the different
users of RCU, it seems there is no consistency, i.e. sometimes writers take the rcu_read_lock()
and sometimes the don't.

Is there a rule of when writer needs to be in rcu critical section?

>>  	SCTP_DBG_OBJCNT_INC(addr);
>>
>>  	return 0;
>> @@ -178,20 +181,23 @@ int sctp_add_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *new,
>>   */
>>  int sctp_del_bind_addr(struct sctp_bind_addr *bp, union sctp_addr *del_addr)
>>  {
>> -	struct list_head *pos, *temp;
>> -	struct sctp_sockaddr_entry *addr;
>> +	struct sctp_sockaddr_entry *addr, *temp;
>>
>> -	list_for_each_safe(pos, temp, &bp->address_list) {
>> -		addr = list_entry(pos, struct sctp_sockaddr_entry, list);
>> +	rcu_read_lock_bh();
>> +	list_for_each_entry_safe(addr, temp, &bp->address_list, list) {
>>  		if (sctp_cmp_addr_exact(&addr->a, del_addr)) {
>>  			/* Found the exact match. */
>> -			list_del(pos);
>> -			kfree(addr);
>> -			SCTP_DBG_OBJCNT_DEC(addr);
>> -
>> -			return 0;
>> +			addr->valid = 0;
>> +			list_del_rcu(&addr->list);
>> +			break;
>>  		}
>>  	}
>> +	rcu_read_unlock_bh();
> 
> Ditto.
> 
>> +
>> +	if (addr && !addr->valid) {
>> +		call_rcu_bh(&addr->rcu, sctp_local_addr_free);
>> +		SCTP_DBG_OBJCNT_DEC(addr);
>> +	}
>>
>>  	return -EINVAL;
>>  }

>> @@ -325,27 +336,31 @@ union sctp_addr *sctp_find_unmatch_addr(struct sctp_bind_addr	*bp,
>>  	union sctp_addr			*addr;
>>  	void 				*addr_buf;
>>  	struct sctp_af			*af;
>> -	struct list_head		*pos;
>>  	int				i;
>>
>> -	list_for_each(pos, &bp->address_list) {
>> -		laddr = list_entry(pos, struct sctp_sockaddr_entry, list);
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(laddr, &bp->address_list, list) {
>> +		if (!laddr->valid)
>> +			continue;
> 
> Ditto...
> 
>>  		addr_buf = (union sctp_addr *)addrs;
>>  		for (i = 0; i < addrcnt; i++) {
>>  			addr = (union sctp_addr *)addr_buf;
>>  			af = sctp_get_af_specific(addr->v4.sin_family);
>>  			if (!af)
>> -				return NULL;
>> +				break;
>>
>>  			if (opt->pf->cmp_addr(&laddr->a, addr, opt))
>>  				break;
>>
>>  			addr_buf += af->sockaddr_len;
>>  		}
>> -		if (i == addrcnt)
>> +		if (i == addrcnt) {
>> +			rcu_read_unlock();
> 
> Since rcu_read_unlock() just happened, some other CPU is free to
> free up this data structure.  In a CONFIG_PREEMPT kernel (as well as a
> CONFIG_PREEMPT_RT kernel, for that matter), this task might be preempted
> at this point, and a full grace period might elapse.
> 
> In which case, the following statement returns a pointer to the freelist,
> which is not good.

Hm...  my saving grace here is that this happens under the protection of the
socket lock so the rcu_read_lock is again potentially redundant.  If it wasn't
for that socket lock, the original code would also have the same race. 

> 
>>  			return &laddr->a;
>> +		}
>>  	}
>> +	rcu_read_unlock();
>>
>>  	return NULL;
>>  }

>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>> index 79856c9..caaa29f 100644
>> --- a/net/sctp/sm_make_chunk.c
>> +++ b/net/sctp/sm_make_chunk.c
>> @@ -1531,7 +1531,7 @@ no_hmac:
>>  	/* Also, add the destination address. */
>>  	if (list_empty(&retval->base.bind_addr.address_list)) {
>>  		sctp_add_bind_addr(&retval->base.bind_addr, &chunk->dest, 1,
>> -				   GFP_ATOMIC);
>> +				GFP_ATOMIC);
>>  	}
>>
>>  	retval->next_tsn = retval->c.initial_tsn;
>> @@ -2613,22 +2613,17 @@ static int sctp_asconf_param_success(struct sctp_association *asoc,
>>
>>  	switch (asconf_param->param_hdr.type) {
>>  	case SCTP_PARAM_ADD_IP:
>> -		sctp_local_bh_disable();
>> -		sctp_write_lock(&asoc->base.addr_lock);
>> -		list_for_each(pos, &bp->address_list) {
>> -			saddr = list_entry(pos, struct sctp_sockaddr_entry, list);
>> +		rcu_read_lock_bh();
>> +		list_for_each_entry_rcu(saddr, &bp->address_list, list) {
>> +			if (!saddr->valid)
>> +				continue;
>>  			if (sctp_cmp_addr_exact(&saddr->a, &addr))
>>  				saddr->use_as_src = 1;
>>  		}
>> -		sctp_write_unlock(&asoc->base.addr_lock);
>> -		sctp_local_bh_enable();
>> +		rcu_read_unlock_bh();
> 
> If you use rcu_read_lock_bh() and rcu_read_unlock_bh() in one read path
> for a given data structure, you need to use them in all the other read
> paths for that data structure.  In addition, you must use call_rcu_bh()
> when deleting the corresponding data elements.
> 
> The normal and the _bh RCU grace periods are unrelated, so mixing them
> for a given RCU-protected data structure is a bad idea.  (Or are these
> somehow two independent data structures?)

It's the same data structure, but update some of its contents in the softirq
context. Thankfully we don't change the list in this case.  As you can see,
the code was calling local_bh_disable() before, so I was trying to preserve
that condition.

This condition also happens under the synchronization of the socket lock, so
rcu may be potentially redundant here.  It might be sufficient to simply
call local_bh_disable() for this particular case.

> 
>>  		break;
>>  	case SCTP_PARAM_DEL_IP:
>> -		sctp_local_bh_disable();
>> -		sctp_write_lock(&asoc->base.addr_lock);
>>  		retval = sctp_del_bind_addr(bp, &addr);
>> -		sctp_write_unlock(&asoc->base.addr_lock);
>> -		sctp_local_bh_enable();

This one is actually the one I worry more about.  If you look close to 
the beginning of this patch at sctp_del_bind_addr(), you'll see that
it also uses BH conventions for it's RCU calls and uses call_rcu_bh()
to free up the entry.

However, given what you said about the grace periods being unrelated, I
am questioning that condition.

I think I'll need 2 version of the del_bind_addr() call, depending on
which list I am cleaning up.

There is a list on the 'endpoint' structure, that we add to and remove from
only in user context.
There is also a list on the 'association' structure, the we add to in user
context, but remove from in BH context only.  I think this one will
need be cleaned up using rcu_read_[un]lock_bh() and call_rcu_bh() calls, while
the first one can use regular rcu calls.  Does that sound generally right?

>>  		list_for_each(pos, &asoc->peer.transport_addr_list) {
>>  			transport = list_entry(pos, struct sctp_transport,
>>  						 transports);
>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> index a3acf78..35cc30c 100644
>> --- a/net/sctp/socket.c
>> +++ b/net/sctp/socket.c
>> @@ -367,14 +367,10 @@ SCTP_STATIC int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
>>  	if (!bp->port)
>>  		bp->port = inet_sk(sk)->num;
>>
>> -	/* Add the address to the bind address list.  */
>> -	sctp_local_bh_disable();
>> -	sctp_write_lock(&ep->base.addr_lock);
>> -
>> -	/* Use GFP_ATOMIC since BHs are disabled.  */
>> +	/* Add the address to the bind address list.
>> +	 * Use GFP_ATOMIC since BHs will be disabled.
>> +	 */
>>  	ret = sctp_add_bind_addr(bp, addr, 1, GFP_ATOMIC);
>> -	sctp_write_unlock(&ep->base.addr_lock);
>> -	sctp_local_bh_enable();
>>
>>  	/* Copy back into socket for getsockname() use. */
>>  	if (!ret) {
>> @@ -497,7 +493,6 @@ static int sctp_send_asconf_add_ip(struct sock		*sk,
>>  	void				*addr_buf;
>>  	struct sctp_af			*af;
>>  	struct list_head		*pos;
>> -	struct list_head		*p;
>>  	int 				i;
>>  	int 				retval = 0;
>>
>> @@ -544,14 +539,15 @@ static int sctp_send_asconf_add_ip(struct sock		*sk,
>>  		if (i < addrcnt)
>>  			continue;
>>
>> -		/* Use the first address in bind addr list of association as
>> -		 * Address Parameter of ASCONF CHUNK.
>> +		/* Use the first valid address in bind addr list of
>> +		 * association as Address Parameter of ASCONF CHUNK.
>>  		 */
>> -		sctp_read_lock(&asoc->base.addr_lock);
>>  		bp = &asoc->base.bind_addr;
>> -		p = bp->address_list.next;
>> -		laddr = list_entry(p, struct sctp_sockaddr_entry, list);
>> -		sctp_read_unlock(&asoc->base.addr_lock);
>> +		rcu_read_lock();
>> +		list_for_each_entry_rcu(laddr, &bp->address_list, list)
>> +			if (laddr->valid)
>> +				break;
>> +		rcu_read_unlock();
> 
> Here you are carrying an RCU-protected data item (*laddr) outside of an
> rcu_read_lock()/rcu_read_unlock() pair.  This is not good -- you need
> to move the rcu_read_unlock() farther down to cover the full extend to
> uses of the laddr pointer.
> 
> Again, RCU is within its rights allowing a grace period to elapse, so
> that past this point, laddr might well point into the freelist.
> 

This is another area, where we may not even need rcu locking, since we
are already holding a socket lock that guarantees that the list will not change.
This was a blind conversion on my part, and I can probably restore the simple
list_entry() dereference there.

<snip the rest, there were no comments>

Thanks a lot
-vlad

^ permalink raw reply

* Re: sk98lin for 2.6.23-rc1
From: Adrian Bunk @ 2007-09-11 15:03 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Stephen Hemminger, Kyle Rose, James Corey, Rob Sims, linux-kernel,
	Jeff Garzik, netdev
In-Reply-To: <46E6A65B.2050209@tmr.com>

On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
> Adrian Bunk wrote:
>> On Tue, Sep 11, 2007 at 10:05:26AM +0200, Stephen Hemminger wrote:
>>   
>>> There are several different problems in this thread:
>>> 1. The removal of old sk98lin driver caused some users to be forced to 
>>> use
>>>     skge. These users have uncovered issues with the dual port fiber 
>>> based versions
>>>     of the board.      Short term: The sk98lin driver should be restored 
>>> to previous state,        and the PCI table should be used to limit the 
>>> usage to only fiber systems.
>>>        If Adrian doesn't do it, I'll do it when I return from Germany.
>>> ...
>>>     
>>
>> No problem with this, but since it was Jeff's patch it should better be 
>> him who reverts it (and he's anyway one step nearer to Linus).
>>
>> But the underlying general problem still remains:
>>
>> How can we get people to test and report bugs with the new drivers before 
>> removing the old driver?
>>
>>   
> Sorry for a long answer, I'm trying to provide insight on two recent cases.
>
> Thinking back to several drivers, when e100 was new I tried it because I 
> had problems with eepro100 in the area of multiple cards, multiple cables 
> on a single card, and jumbo packets. For a while I used both, until e100 
> worked where I need it. So I initially tried it because it had features I 
> needed, and then dropped to older driver just to avoid having to decide.
>
> With sk98lin, the driver worked flawlessly with all (3-4) systems, so I had 
> no reason to try any other. When removing sk98lin was first proposed, I 
> tried skge, first measurements showed it was 5-8% slower, NOT what I want, 
> so I went back. For me there was no reliability issue, but I never tried it 
> in a system with more than on NIC on the driver. Would "it's a little 
> slower" be a valid bug report? Or would I have gotten "works fine for me" 
> from people not beating it over Gbit?
>...

If you get less throughput that is a regression, and it should be 
reported and fixed.

I doubt anybody would have told you otherwise.

Is this bug still present as of 2.6.23-rc6?

>> That's a question especially for the people who now had problems after 
>> sk98lin was removed.
>
> So if you want people to try a new driver, I think it really has to have 
> some benefits to the users, in terms of performance, reliability, or 
> features. "Cleaner design" doesn't motivate, and it does raise the question 
> of why the old driver wasn't just cleaned up. I've been doing software for 
> decades, I appreciate why, but users in general just want to use their 
> system. Which raises the question of why to delete drivers which work for 
> many or even most users?

As I already explained, there is a long term advantage for all users if 
there is only one driver in the kernel. Therefore all users should 
switch away from obsolete drivers to the replacement drivers, and the 
obsolete driver will be removed at some point in time. The only question 
is how to do it.

> Testing a new kernel is no longer a drop in a boot 
> operation if modprobe.conf must be edited to get the network up, and the 
> typical user isn't going to write that shell script to try one or the other 
> driver.

The typical user will let his distribution handle this.

And MODULE_ALIAS can also handle this.

> Honestly, new drivers which offer little benefit to most users are the 
> exception rather than the rule, so this may a corner case I would like to 
> see sk98lin back in the kernel, for a while I can build my own kernels and 
> patch it in, but until other drivers are drop-in, I probably won't change.

That a new driver offers benefits that cause most users to switch isn't 
realistic.

You mention e100 as an example - well, I'm using this driver in my 
computer, but I doubt anything would be worse for me if I'd use the 
obsolete eepro100 driver instead since I'm not using any of the fancy 
e100 features you mentioned as advantages.

There is a long term advantage for all users if there is only one driver 
in the kernel. Therefore all users should switch away from obsolete 
drivers to the replacement drivers, and the obsolete driver will be 
removed at some point in time. The only question is how to do it.

> Separate but related: why keep skge and sky2? Are we going through this 
> again in a year? Is the benefit worth the effort?
>...

skge and sky2 support distinct hardware.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply

* [PATCH] Move the definition of pr_err() into kernel.h
From: Emil Medve @ 2007-09-11 14:56 UTC (permalink / raw)
  To: linux-kernel, netdev, i2c, linux-omap-open-source; +Cc: Emil Medve

Other pr_*() macros are already defined in kernel.h, but pr_err() was defined
multiple times in several other places

Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
---
 drivers/i2c/chips/menelaus.c     |   10 ++++------
 drivers/net/spider_net.h         |    3 ---
 drivers/video/omap/lcd_h3.c      |    6 ++----
 drivers/video/omap/lcd_inn1610.c |    6 ++----
 include/linux/kernel.h           |    2 ++
 5 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/drivers/i2c/chips/menelaus.c b/drivers/i2c/chips/menelaus.c
index d9c92c5..fc5eed4 100644
--- a/drivers/i2c/chips/menelaus.c
+++ b/drivers/i2c/chips/menelaus.c
@@ -49,8 +49,6 @@
 
 #define DRIVER_NAME			"menelaus"
 
-#define pr_err(fmt, arg...)	printk(KERN_ERR DRIVER_NAME ": ", ## arg);
-
 #define MENELAUS_I2C_ADDRESS		0x72
 
 #define MENELAUS_REV			0x01
@@ -155,7 +153,7 @@ static int menelaus_write_reg(int reg, u8 value)
 	int val = i2c_smbus_write_byte_data(the_menelaus->client, reg, value);
 
 	if (val < 0) {
-		pr_err("write error");
+		pr_err(DRIVER_NAME ":write error");
 		return val;
 	}
 
@@ -167,7 +165,7 @@ static int menelaus_read_reg(int reg)
 	int val = i2c_smbus_read_byte_data(the_menelaus->client, reg);
 
 	if (val < 0)
-		pr_err("read error");
+		pr_err(DRIVER_NAME ":read error");
 
 	return val;
 }
@@ -1177,7 +1175,7 @@ static int menelaus_probe(struct i2c_client *client)
 	/* If a true probe check the device */
 	rev = menelaus_read_reg(MENELAUS_REV);
 	if (rev < 0) {
-		pr_err("device not found");
+		pr_err(DRIVER_NAME ":device not found");
 		err = -ENODEV;
 		goto fail1;
 	}
@@ -1258,7 +1256,7 @@ static int __init menelaus_init(void)
 
 	res = i2c_add_driver(&menelaus_i2c_driver);
 	if (res < 0) {
-		pr_err("driver registration failed\n");
+		pr_err(DRIVER_NAME ":driver registration failed\n");
 		return res;
 	}
 
diff --git a/drivers/net/spider_net.h b/drivers/net/spider_net.h
index dbbdb8c..c67b11d 100644
--- a/drivers/net/spider_net.h
+++ b/drivers/net/spider_net.h
@@ -493,7 +493,4 @@ struct spider_net_card {
 	struct spider_net_descr darray[0];
 };
 
-#define pr_err(fmt,arg...) \
-	printk(KERN_ERR fmt ,##arg)
-
 #endif
diff --git a/drivers/video/omap/lcd_h3.c b/drivers/video/omap/lcd_h3.c
index 51807b4..c81271d 100644
--- a/drivers/video/omap/lcd_h3.c
+++ b/drivers/video/omap/lcd_h3.c
@@ -28,8 +28,6 @@
 
 #define MODULE_NAME	"omapfb-lcd_h3"
 
-#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME ": " fmt, ## args)
-
 static int h3_panel_init(struct lcd_panel *panel, struct omapfb_device *fbdev)
 {
 	return 0;
@@ -48,7 +46,7 @@ static int h3_panel_enable(struct lcd_panel *panel)
 	if (!r)
 		r = tps65010_set_gpio_out_value(GPIO2, HIGH);
 	if (r)
-		pr_err("Unable to turn on LCD panel\n");
+		pr_err(MODULE_NAME ":Unable to turn on LCD panel\n");
 
 	return r;
 }
@@ -62,7 +60,7 @@ static void h3_panel_disable(struct lcd_panel *panel)
 	if (!r)
 		tps65010_set_gpio_out_value(GPIO2, LOW);
 	if (r)
-		pr_err("Unable to turn off LCD panel\n");
+		pr_err(MODULE_NAME ":Unable to turn off LCD panel\n");
 }
 
 static unsigned long h3_panel_get_caps(struct lcd_panel *panel)
diff --git a/drivers/video/omap/lcd_inn1610.c b/drivers/video/omap/lcd_inn1610.c
index 95604ca..ea85cb9 100644
--- a/drivers/video/omap/lcd_inn1610.c
+++ b/drivers/video/omap/lcd_inn1610.c
@@ -27,20 +27,18 @@
 
 #define MODULE_NAME	"omapfb-lcd_h3"
 
-#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME ": " fmt, ## args)
-
 static int innovator1610_panel_init(struct lcd_panel *panel,
 				    struct omapfb_device *fbdev)
 {
 	int r = 0;
 
 	if (omap_request_gpio(14)) {
-		pr_err("can't request GPIO 14\n");
+		pr_err(MODULE_NAME ":can't request GPIO 14\n");
 		r = -1;
 		goto exit;
 	}
 	if (omap_request_gpio(15)) {
-		pr_err("can't request GPIO 15\n");
+		pr_err(MODULE_NAME ":can't request GPIO 15\n");
 		omap_free_gpio(14);
 		r = -1;
 		goto exit;
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f592df7..c51936a 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -242,6 +242,8 @@ static inline int __attribute__ ((format (printf, 1, 2))) pr_debug(const char *
 
 #define pr_info(fmt,arg...) \
 	printk(KERN_INFO fmt,##arg)
+#define pr_err(fmt, arg...) \
+	printk(KERN_ERR fmt, ##arg)
 
 /*
  *      Display an IP address in readable format.
-- 
1.5.3.GIT


^ permalink raw reply related

* Re: [PATCH][2/2] Add ICMPMsgStats MIB (RFC 4293)
From: David Stevens @ 2007-09-11 15:21 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: davem, netdev, netdev-owner, yoshfuji
In-Reply-To: <20070911.175053.104206872.yoshfuji@linux-ipv6.org>

Yoshifuji Hideaki <yoshfuji@linux-ipv6.org> wrote on 09/11/2007 01:50:53 
AM:

> Dave, we've been supporting per-interface stats for IPv6, and
> you seem to remove them.  Please keep them.  Thank you.

        The reason I didn't for ICMPMsgStats is the size. The RFC requires
in & out counters for all types, whether or not they are known by the OS,
so that's 512 stats.

Because the MIB variables in the existing infrastructure are not 
dynamically
allocated, that's:

512 * numInterfaces * numCPUs * 2protos

That seems like a lot of mostly-zero memory. Changing them to be run-time
allocated (and re-allocated when new types are seen) is another 
alternative,
but more significant changes.

"Memory is Cheap."
        -- Doug Comer

So maybe it's not so bad -- I'll roll another per-interface version
to see.

                                                        +-DLS


^ permalink raw reply

* Re: [PATCH] Move the definition of pr_err() into kernel.h
From: Randy Dunlap @ 2007-09-11 15:41 UTC (permalink / raw)
  To: Emil Medve; +Cc: linux-kernel, netdev, i2c, linux-omap-open-source
In-Reply-To: <11895225651521-git-send-email-Emilian.Medve@Freescale.com>

On Tue, 11 Sep 2007 09:56:05 -0500 Emil Medve wrote:

> Other pr_*() macros are already defined in kernel.h, but pr_err() was defined
> multiple times in several other places
> 
> Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com>
> ---
>  drivers/i2c/chips/menelaus.c     |   10 ++++------
>  drivers/net/spider_net.h         |    3 ---
>  drivers/video/omap/lcd_h3.c      |    6 ++----
>  drivers/video/omap/lcd_inn1610.c |    6 ++----
>  include/linux/kernel.h           |    2 ++
>  5 files changed, 10 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/i2c/chips/menelaus.c b/drivers/i2c/chips/menelaus.c
> index d9c92c5..fc5eed4 100644
> --- a/drivers/i2c/chips/menelaus.c
> +++ b/drivers/i2c/chips/menelaus.c
> @@ -49,8 +49,6 @@
>  
>  #define DRIVER_NAME			"menelaus"
>  
> -#define pr_err(fmt, arg...)	printk(KERN_ERR DRIVER_NAME ": ", ## arg);
> -
>  #define MENELAUS_I2C_ADDRESS		0x72
>  
>  #define MENELAUS_REV			0x01
> @@ -155,7 +153,7 @@ static int menelaus_write_reg(int reg, u8 value)
>  	int val = i2c_smbus_write_byte_data(the_menelaus->client, reg, value);
>  
>  	if (val < 0) {
> -		pr_err("write error");
> +		pr_err(DRIVER_NAME ":write error");

Hi,
Please keep the space after the ":" in all cases...

>  		return val;
>  	}
>  
> @@ -167,7 +165,7 @@ static int menelaus_read_reg(int reg)
>  	int val = i2c_smbus_read_byte_data(the_menelaus->client, reg);
>  
>  	if (val < 0)
> -		pr_err("read error");
> +		pr_err(DRIVER_NAME ":read error");
>  
>  	return val;
>  }
> @@ -1177,7 +1175,7 @@ static int menelaus_probe(struct i2c_client *client)
>  	/* If a true probe check the device */
>  	rev = menelaus_read_reg(MENELAUS_REV);
>  	if (rev < 0) {
> -		pr_err("device not found");
> +		pr_err(DRIVER_NAME ":device not found");
>  		err = -ENODEV;
>  		goto fail1;
>  	}
> @@ -1258,7 +1256,7 @@ static int __init menelaus_init(void)
>  
>  	res = i2c_add_driver(&menelaus_i2c_driver);
>  	if (res < 0) {
> -		pr_err("driver registration failed\n");
> +		pr_err(DRIVER_NAME ":driver registration failed\n");
>  		return res;
>  	}
>  
> diff --git a/drivers/net/spider_net.h b/drivers/net/spider_net.h
> index dbbdb8c..c67b11d 100644
> --- a/drivers/net/spider_net.h
> +++ b/drivers/net/spider_net.h
> @@ -493,7 +493,4 @@ struct spider_net_card {
>  	struct spider_net_descr darray[0];
>  };
>  
> -#define pr_err(fmt,arg...) \
> -	printk(KERN_ERR fmt ,##arg)
> -
>  #endif
> diff --git a/drivers/video/omap/lcd_h3.c b/drivers/video/omap/lcd_h3.c
> index 51807b4..c81271d 100644
> --- a/drivers/video/omap/lcd_h3.c
> +++ b/drivers/video/omap/lcd_h3.c
> @@ -28,8 +28,6 @@
>  
>  #define MODULE_NAME	"omapfb-lcd_h3"
>  
> -#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME ": " fmt, ## args)
> -
>  static int h3_panel_init(struct lcd_panel *panel, struct omapfb_device *fbdev)
>  {
>  	return 0;
> @@ -48,7 +46,7 @@ static int h3_panel_enable(struct lcd_panel *panel)
>  	if (!r)
>  		r = tps65010_set_gpio_out_value(GPIO2, HIGH);
>  	if (r)
> -		pr_err("Unable to turn on LCD panel\n");
> +		pr_err(MODULE_NAME ":Unable to turn on LCD panel\n");
>  
>  	return r;
>  }
> @@ -62,7 +60,7 @@ static void h3_panel_disable(struct lcd_panel *panel)
>  	if (!r)
>  		tps65010_set_gpio_out_value(GPIO2, LOW);
>  	if (r)
> -		pr_err("Unable to turn off LCD panel\n");
> +		pr_err(MODULE_NAME ":Unable to turn off LCD panel\n");
>  }
>  
>  static unsigned long h3_panel_get_caps(struct lcd_panel *panel)
> diff --git a/drivers/video/omap/lcd_inn1610.c b/drivers/video/omap/lcd_inn1610.c
> index 95604ca..ea85cb9 100644
> --- a/drivers/video/omap/lcd_inn1610.c
> +++ b/drivers/video/omap/lcd_inn1610.c
> @@ -27,20 +27,18 @@
>  
>  #define MODULE_NAME	"omapfb-lcd_h3"
>  
> -#define pr_err(fmt, args...) printk(KERN_ERR MODULE_NAME ": " fmt, ## args)
> -
>  static int innovator1610_panel_init(struct lcd_panel *panel,
>  				    struct omapfb_device *fbdev)
>  {
>  	int r = 0;
>  
>  	if (omap_request_gpio(14)) {
> -		pr_err("can't request GPIO 14\n");
> +		pr_err(MODULE_NAME ":can't request GPIO 14\n");
>  		r = -1;
>  		goto exit;
>  	}
>  	if (omap_request_gpio(15)) {
> -		pr_err("can't request GPIO 15\n");
> +		pr_err(MODULE_NAME ":can't request GPIO 15\n");
>  		omap_free_gpio(14);
>  		r = -1;
>  		goto exit;
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index f592df7..c51936a 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -242,6 +242,8 @@ static inline int __attribute__ ((format (printf, 1, 2))) pr_debug(const char *
>  
>  #define pr_info(fmt,arg...) \
>  	printk(KERN_INFO fmt,##arg)
> +#define pr_err(fmt, arg...) \
> +	printk(KERN_ERR fmt, ##arg)
>  
>  /*
>   *      Display an IP address in readable format.
> -- 

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: [PATCH 3/3] rfkill: Add rfkill documentation
From: Ivo van Doorn @ 2007-09-11 16:51 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: davem, Dmitry Torokhov, netdev, Inaky Perez-Gonzalez
In-Reply-To: <20070910112625.376efc50.randy.dunlap@oracle.com>

On Monday 10 September 2007, Randy Dunlap wrote:
> On Mon, 10 Sep 2007 19:56:03 +0200 Ivo van Doorn wrote:
> 
> > Add a documentation file which contains
> > a short description about rfkill with some
> > notes about drivers and the userspace interface.
> 
> Thanks.  I have noted a few typo/editorial changes below.

Thanks!
I'll resend this patch within a few minutes. :)

Ivo

> > Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
> > Acked-by: Dmitry Torokhov <dtor@mail.ru>
> > ---
> >  Documentation/rfkill.txt |   88 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 88 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/rfkill.txt
> > 
> > diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt
> > new file mode 100644
> > index 0000000..93c76fc
> > --- /dev/null
> > +++ b/Documentation/rfkill.txt
> > @@ -0,0 +1,88 @@
> > +rfkill - RF switch subsystem support
> > +====================================
> > +
> > +1 Implementation details
> > +2 Driver support
> > +3 Userspace support
> > +
> > +===============================================================================
> > +1: Implementation details
> > +
> > +The rfkill switch subsystem offers support for keys often found on laptops
> > +to enable wireless devices like WiFi and Bluetooth.
> > +
> > +This is done by providing the user 3 possibilities:
> > + - The rfkill system handles all events, userspace is not aware of events.
> > + - The rfkill system handles all events, userspace is informed about the event.
> > + - The rfkill system does not handle events, userspace handles all events.
> 
> I would s/,/;/ in the 3 lines above.
> 
> > +The buttons to enable and disable the wireless radios are important in
> > +situations where the user is for example using his laptop on a location where
> > +wireless radios _must_ be disabled (e.g airplanes).
> > +Because of this requirement, userspace support for the keys should not be
> > +made mandatory. Because userspace might want to perform some additional smarter
> > +tasks when the key is pressed, rfkill still provides userspace the possibility
> > +to take over the task to handle the key events.
> > +
> > +The system inside the kernel has been split into 2 seperate sections:
> 
>                                                       separate
> 
> > +	1 - RFKILL
> > +	2 - RFKILL_INPUT
> > +
> > +The first option enables rfkill support and will make sure userspace will
> > +be notified of any events through the input device. It also creates several
> > +sysfs entries which can be used by userspace. See section "Userspace support".
> > +
> > +The second option provides a rfkill input handler. This handler will
> 
>                               an
> 
> > +listen to all rfkill key events and will toggle the radio accordingly,
> 
> end above with ; or .  If '.', s/with/With/ on next line.
> 
> > +with this option enabled userspace could either do nothing or simply
> > +perform monitoring tasks.
> > +
> > +====================================
> > +2: Driver support
> > +
> > +Drivers who wish to build in rfkill subsystem support should
> 
>    Drivers that
> 
> But, drivers can't/don't wish, so it would be better to say something
> like:
> 
> To build a driver with rfkill subsystem support, the driver should
> depend on the Kconfig symbol RFKILL; it should _not_ depend on
> RKFILL_INPUT.
> 
> 
> > +make sure their driver depends of the Kconfig option RFKILL, it should
> > +_not_ depend on RFKILL_INPUT.
> > +
> > +Unless key events trigger a interrupt to which the driver listens, polling
> 
>                              an interrupt
> 
> > +will be required to determine the key state changes. For this the input
> > +layer providers the input-polldev handler.
> > +
> > +A driver should implement a few steps to correctly make use of the
> > +rfkill subsystem. First for non-polling drivers:
> > +
> > +	- rfkill_allocate()
> > +	- input_allocate_device()
> > +	- rfkill_register()
> > +	- input_register_device()
> > +
> > +For polling drivers:
> > +
> > +	- rfkill_allocate()
> > +	- input_allocate_polled_device()
> > +	- rfkill_register()
> > +	- input_register_polled_device()
> > +
> > +When a key event has been detected, the correct event should be
> > +send over the input device which has been registered by the driver.
> 
>    sent
> 
> > +
> > +====================================
> > +3: Userspace support
> > +
> > +For each key a input device will be created which will send out the correct
> 
>                 an
> 
> > +key event when the rfkill key has been pressed.
> > +
> > +The following sysfs entries will be created:
> > +
> > +	name: Name assigned by driver to this key (interface or driver name).
> > +	type: Name of the key type ("wlan", "bluetooth", etc).
> > +	state: Current state of the key. 1: On, 0: Off.
> > +	claim: 1: Userspace handles events, 0: Kernel handles events
> > +
> > +Both the "state" and "claim" entries are also writable. For the "state" entry
> > +this means that when 1 or 0 is written all radios will be toggled accordingly.
> 
> will be written even if they are already in that state?
> 
> > +For the "claim" entry writing 1 to it will mean that the kernel will no longer
> 
> s/will mean/means/
> s/will no longer handle/no longer handles/
> 
> > +handle key events even though RFKILL_INPUT input was enabled. When "claim" has
> > +been set to 0, userspace should make sure it will listen for the input events
> 
> s/it will listen/that it listens/
> 
> > +or check the sysfs "state" entry regularly to correctly perform the required
> > +tasks when the rkfill key is pressed.
> > -- 
> 
> ---
> ~Randy
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 



^ permalink raw reply

* [PATCH] tehuti: driver for Tehuti 10GbE network adapters
From: Andy Gospodarek @ 2007-09-11 16:40 UTC (permalink / raw)
  To: jeff; +Cc: davem, akpm, netdev, baum


This introduces support for a line of 10GbE adapters made by Tehuti
Networks.  An attempt to get this included was made a few months ago and
since the driver has been re-factored based on Jeff's suggestions.

You can download a patch against Jeff's netdev-2.6#upstream tree here:

http://git.greyhouse.net/gospo/tehuti-netdev-26-upstream.patch

and I setup a git tree last night based off Linus' tree.  You can sync
with that here if you like:

http://git.greyhouse.net/gospo/tehuti-2.6.git

This driver is in good enough shape to go into at least Jeff's tree and
we can make fixes as they are needed.  I'm not convinced all endian
issues are completely flushed out and there are few more ethtool ops that
could be implemented, but they don't seem to prevent basic card
functionality on my systems.


^ permalink raw reply

* [PATCH 3/3 v2] rfkill: Add rfkill documentation
From: Ivo van Doorn @ 2007-09-11 17:01 UTC (permalink / raw)
  To: davem; +Cc: Dmitry Torokhov, netdev, Inaky Perez-Gonzalez, Randy Dunlap
In-Reply-To: <200709101956.03861.IvDoorn@gmail.com>

Add a documentation file which contains
a short description about rfkill with some
notes about drivers and the userspace interface.

Changes since v1:
 - Spellchecking

Signed-off-by: Ivo van Doorn <IvDoorn@gmail.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
---

Only patch 3 was updated, patches 1 and 2 remain the same.

 Documentation/rfkill.txt |   89 ++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 89 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/rfkill.txt

diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt
new file mode 100644
index 0000000..0f35134
--- /dev/null
+++ b/Documentation/rfkill.txt
@@ -0,0 +1,89 @@
+rfkill - RF switch subsystem support
+====================================
+
+1 Implementation details
+2 Driver support
+3 Userspace support
+
+===============================================================================
+1: Implementation details
+
+The rfkill switch subsystem offers support for keys often found on laptops
+to enable wireless devices like WiFi and Bluetooth.
+
+This is done by providing the user 3 possibilities:
+ 1 - The rfkill system handles all events, userspace is not aware of events;
+ 2 - The rfkill system handles all events, userspace is informed about the event;
+ 3 - The rfkill system does not handle events, userspace handles all events;
+
+The buttons to enable and disable the wireless radios are important in
+situations where the user is for example using his laptop on a location where
+wireless radios _must_ be disabled (e.g. airplanes).
+Because of this requirement, userspace support for the keys should not be
+made mandatory. Because userspace might want to perform some additional smarter
+tasks when the key is pressed, rfkill still provides userspace the possibility
+to take over the task to handle the key events.
+
+The system inside the kernel has been split into 2 separate sections:
+	1 - RFKILL
+	2 - RFKILL_INPUT
+
+The first option enables rfkill support and will make sure userspace will
+be notified of any events through the input device. It also creates several
+sysfs entries which can be used by userspace. See section "Userspace support".
+
+The second option provides an rfkill input handler. This handler will
+listen to all rfkill key events and will toggle the radio accordingly.
+With this option enabled userspace could either do nothing or simply
+perform monitoring tasks.
+
+====================================
+2: Driver support
+
+To build a driver with rfkill subsystem support, the driver should
+depend on the Kconfig symbol RFKILL; it should _not_ depend on
+RKFILL_INPUT.
+
+Unless key events trigger an interrupt to which the driver listens, polling
+will be required to determine the key state changes. For this the input
+layer providers the input-polldev handler.
+
+A driver should implement a few steps to correctly make use of the
+rfkill subsystem. First for non-polling drivers:
+
+	- rfkill_allocate()
+	- input_allocate_device()
+	- rfkill_register()
+	- input_register_device()
+
+For polling drivers:
+
+	- rfkill_allocate()
+	- input_allocate_polled_device()
+	- rfkill_register()
+	- input_register_polled_device()
+
+When a key event has been detected, the correct event should be
+sent over the input device which has been registered by the driver.
+
+====================================
+3: Userspace support
+
+For each key an input device will be created which will send out the correct
+key event when the rfkill key has been pressed.
+
+The following sysfs entries will be created:
+
+	name: Name assigned by driver to this key (interface or driver name).
+	type: Name of the key type ("wlan", "bluetooth", etc).
+	state: Current state of the key. 1: On, 0: Off.
+	claim: 1: Userspace handles events, 0: Kernel handles events
+
+Both the "state" and "claim" entries are also writable. For the "state" entry
+this means that when 1 or 0 is written all radios, not yet in the requested
+state, will be will be toggled accordingly.
+For the "claim" entry writing 1 to it means that the kernel no longer handles
+handle key events even though RFKILL_INPUT input was enabled. When "claim" has
+been set to 0, userspace should make sure that it listens for the input events
+or check the sysfs "state" entry regularly to correctly perform the required
+tasks when the rkfill key is pressed.
-- 
1.5.3


^ permalink raw reply related

* Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info
From: Rick Jones @ 2007-09-11 17:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, netdev
In-Reply-To: <46E63121.5090607@cosmosbay.com>

Eric Dumazet wrote:
> Sridhar Samudrala a écrit :
> 
>> On Mon, 2007-09-10 at 16:13 -0700, Rick Jones wrote:
>>
>>> Return some useful information such as the maximum listen backlog and
>>> the current listen backlog in the tcp_info structure and have that 
>>> match what one can see in /proc/net/tcp and /proc/net/tcp6.
>>
>>
>> If we are also exporting max listen backlog, another place to
>> consider adding this is to tcp_diag_get_info() called via INET_DIAG_INFO.
>> Current listen backlog is returned in inet_diag_msg->idiag_rqueue.
>> max listen backlog can be returned in inet_diag_msg->idiag_wqueue.
>>
> 
> I agree, /proc/net/tcp is deprecated nowadays...

What takes its place?

> Rick, could you add this part in your patch, and add my Sign-off-by ?

My pleasure.

I have a small test program for the tcp_info bit - where do I go to find 
how the inet diag stuff works?

BTW, what do people think about doing the same thing with the rxqueue 
and txqueue's of netstat output?

rick jones

> 
> Thank you
> Eric
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
> index 57c5f0b..f5b6275 100644
> --- a/net/ipv4/tcp_diag.c
> +++ b/net/ipv4/tcp_diag.c
> @@ -25,11 +25,13 @@ static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
>  	const struct tcp_sock *tp = tcp_sk(sk);
>  	struct tcp_info *info = _info;
>  
> -	if (sk->sk_state == TCP_LISTEN)
> +	if (sk->sk_state == TCP_LISTEN) {
>  		r->idiag_rqueue = sk->sk_ack_backlog;
> -	else
> +		r->idiag_wqueue = sk->sk_max_ack_backlog;
> +	else {
>  		r->idiag_rqueue = tp->rcv_nxt - tp->copied_seq;
> -	r->idiag_wqueue = tp->write_seq - tp->snd_una;
> +		r->idiag_wqueue = tp->write_seq - tp->snd_una;
> +	}
>  	if (info != NULL)
>  		tcp_get_info(sk, info);
>  }


^ permalink raw reply

* Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info
From: Eric Dumazet @ 2007-09-11 17:15 UTC (permalink / raw)
  To: Rick Jones; +Cc: Sridhar Samudrala, netdev
In-Reply-To: <46E6C9A5.8000008@hp.com>

On Tue, 11 Sep 2007 10:00:21 -0700
Rick Jones <rick.jones2@hp.com> wrote:

> Eric Dumazet wrote:
> > Sridhar Samudrala a écrit :
> > 
> >> On Mon, 2007-09-10 at 16:13 -0700, Rick Jones wrote:
> >>
> >>> Return some useful information such as the maximum listen backlog and
> >>> the current listen backlog in the tcp_info structure and have that 
> >>> match what one can see in /proc/net/tcp and /proc/net/tcp6.
> >>
> >>
> >> If we are also exporting max listen backlog, another place to
> >> consider adding this is to tcp_diag_get_info() called via INET_DIAG_INFO.
> >> Current listen backlog is returned in inet_diag_msg->idiag_rqueue.
> >> max listen backlog can be returned in inet_diag_msg->idiag_wqueue.
> >>
> > 
> > I agree, /proc/net/tcp is deprecated nowadays...
> 
> What takes its place?

ss command from iproute2 package ( http://linux-net.osdl.org/index.php/Iproute2 )

Problem with /proc/net/tcp is its quadratic time O(N^2) to output N lines...

On a loaded server :

# time ss|wc -l
 145695

real    0m9.383s
user    0m4.656s
sys     0m0.632s

# time netstat -an|wc -l
^C after some minutes .... no way...
real    3m23.825s
user    0m0.744s
sys     3m18.721s


> 
> > Rick, could you add this part in your patch, and add my Sign-off-by ?
> 
> My pleasure.
> 
> I have a small test program for the tcp_info bit - where do I go to find 
> how the inet diag stuff works?

ss state listen

> 
> BTW, what do people think about doing the same thing with the rxqueue 
> and txqueue's of netstat output?
>

I dont understand this question, I thought your patch already handled this 
(for the txqueue, since rxqueue is already there),  as netstat uses 
/proc/net/tcp (unfortunatly)


^ permalink raw reply

* Re: [PATCH] Configurable tap interface MTU
From: Rick Jones @ 2007-09-11 17:20 UTC (permalink / raw)
  To: Ed Swierk; +Cc: netdev, maxk, linux-kernel
In-Reply-To: <e6c711510709110742v4d2da644ha63b0a6d47c1230b@mail.gmail.com>

Ed Swierk wrote:
> This patch makes it possible to change the MTU on a tap interface.
> Increasing the MTU beyond the 1500-byte default is useful for
> applications that interoperate with Ethernet devices supporting jumbo
> frames.
> 
> The patch caps the MTU somewhat arbitrarily at 16000 bytes. This is
> slightly lower than the value used by the e1000 driver, so it seems
> like a safe upper limit.

FWIW the OFED 1.2 bits take the MTU of IPoIB up to 65520 bytes :)

rick jones

^ permalink raw reply

* Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info
From: Rick Jones @ 2007-09-11 17:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Sridhar Samudrala, netdev
In-Reply-To: <20070911191539.ff60362b.dada1@cosmosbay.com>

> ss command from iproute2 package ( http://linux-net.osdl.org/index.php/Iproute2 )
> 
> Problem with /proc/net/tcp is its quadratic time O(N^2) to output N lines...

I could see where that might be a problem.

>>>Rick, could you add this part in your patch, and add my Sign-off-by ?
>>
>>My pleasure.
>>
>>I have a small test program for the tcp_info bit - where do I go to find 
>>how the inet diag stuff works?
> 
> 
> ss state listen

hpcpc103:~# ss --version
ss utility, iproute2-ss070313
hpcpc103:~# ss state listen
ss: no socket states to show with such filter.
hpcpc103:~# ss --all
State      Recv-Q Send-Q      Local Address:Port          Peer 
Address:Port
LISTEN     0      128                     *:sunrpc                   *:* 

LISTEN     0      128                     *:auth                     *:* 

LISTEN     0      128                    :::ssh                     :::* 

LISTEN     0      20              127.0.0.1:smtp                     *:* 

LISTEN     0      128                     *:42137                    *:* 


> 
> 
>>BTW, what do people think about doing the same thing with the rxqueue 
>>and txqueue's of netstat output?
>>
> 
> 
> I dont understand this question, I thought your patch already handled this 
> (for the txqueue, since rxqueue is already there),  as netstat uses 
> /proc/net/tcp (unfortunatly)

Well, it doesn't seem to be the case.  This is from the same system as 
the ss output above:

hpcpc103:~# netstat -an | grep LISTEN
tcp        0      0 0.0.0.0:111          0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:113          0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25         0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:42137        0.0.0.0:*               LISTEN
tcp6       0      0 :::22                :::*                    LISTEN
unix  2    [ ACC ]     STREAM   LISTENING   5666  /var/run/acpid.socket

I thought I saw some other code in there when I was stumbling around.

rick

^ permalink raw reply

* Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info
From: Sridhar Samudrala @ 2007-09-11 17:46 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, netdev
In-Reply-To: <46E6D1E7.2060100@hp.com>

On Tue, 2007-09-11 at 10:35 -0700, Rick Jones wrote:
> > ss command from iproute2 package ( http://linux-net.osdl.org/index.php/Iproute2 )
> > 
> > Problem with /proc/net/tcp is its quadratic time O(N^2) to output N lines...
> 
> I could see where that might be a problem.
> 
> >>>Rick, could you add this part in your patch, and add my Sign-off-by ?
> >>
> >>My pleasure.
> >>
> >>I have a small test program for the tcp_info bit - where do I go to find 
> >>how the inet diag stuff works?
> > 
> > 
> > ss state listen
> 
> hpcpc103:~# ss --version
> ss utility, iproute2-ss070313
> hpcpc103:~# ss state listen
> ss: no socket states to show with such filter.

looks like 'ss state listening' or 'ss -l' is the right syntax.

> hpcpc103:~# ss --all
> State      Recv-Q Send-Q      Local Address:Port          Peer 
> Address:Port
> LISTEN     0      128                     *:sunrpc                   *:* 
> 
> LISTEN     0      128                     *:auth                     *:* 
> 
> LISTEN     0      128                    :::ssh                     :::* 
> 
> LISTEN     0      20              127.0.0.1:smtp                     *:* 
> 
> LISTEN     0      128                     *:42137                    *:* 
> 
> 
> > 
> > 
> >>BTW, what do people think about doing the same thing with the rxqueue 
> >>and txqueue's of netstat output?
> >>
> > 
> > 
> > I dont understand this question, I thought your patch already handled this 
> > (for the txqueue, since rxqueue is already there),  as netstat uses 
> > /proc/net/tcp (unfortunatly)
> 
> Well, it doesn't seem to be the case.  This is from the same system as 
> the ss output above:
> 
> hpcpc103:~# netstat -an | grep LISTEN
> tcp        0      0 0.0.0.0:111          0.0.0.0:*               LISTEN
> tcp        0      0 0.0.0.0:113          0.0.0.0:*               LISTEN
> tcp        0      0 127.0.0.1:25         0.0.0.0:*               LISTEN
> tcp        0      0 0.0.0.0:42137        0.0.0.0:*               LISTEN
> tcp6       0      0 :::22                :::*                    LISTEN
> unix  2    [ ACC ]     STREAM   LISTENING   5666  /var/run/acpid.socket
> 
> I thought I saw some other code in there when I was stumbling around.

Yes. netstat code seems to have a explicit check for TCP_LISTEN state
and zeroing txq and rxq.
>From tcp_do_one() in netstat.c
    if (state == TCP_LISTEN) {
        time_len = 0;
        retr = 0L;
        rxq = 0L;
        txq = 0L;
    }

We should fix this. Also i think it is a good idea to update netstat to use 
INET_DIAG_INFO instead of /proc/net/tcp.

Thanks
Sridhar


^ permalink raw reply

* Re: [PATCH] include listenq max backlog in /proc/net/tcp and include in tcp_info
From: Rick Jones @ 2007-09-11 18:10 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Eric Dumazet, netdev
In-Reply-To: <1189532811.19000.6.camel@w-sridhar2.beaverton.ibm.com>

>>>>BTW, what do people think about doing the same thing with the rxqueue 
>>>>and txqueue's of netstat output?
>>>>
>>>
>>>
>>>I dont understand this question, I thought your patch already handled this 
>>>(for the txqueue, since rxqueue is already there),  as netstat uses 
>>>/proc/net/tcp (unfortunatly)
>>
>>Well, it doesn't seem to be the case.  This is from the same system as 
>>the ss output above:
>>
>>hpcpc103:~# netstat -an | grep LISTEN
>>tcp        0      0 0.0.0.0:111          0.0.0.0:*               LISTEN
>>tcp        0      0 0.0.0.0:113          0.0.0.0:*               LISTEN
>>tcp        0      0 127.0.0.1:25         0.0.0.0:*               LISTEN
>>tcp        0      0 0.0.0.0:42137        0.0.0.0:*               LISTEN
>>tcp6       0      0 :::22                :::*                    LISTEN
>>unix  2    [ ACC ]     STREAM   LISTENING   5666  /var/run/acpid.socket
>>
>>I thought I saw some other code in there when I was stumbling around.
> 
> 
> Yes. netstat code seems to have a explicit check for TCP_LISTEN state
> and zeroing txq and rxq.
>>From tcp_do_one() in netstat.c
>     if (state == TCP_LISTEN) {
>         time_len = 0;
>         retr = 0L;
>         rxq = 0L;
>         txq = 0L;
>     }

How terribly cheeky of them.  I wonder why they were doing that?

> We should fix this. Also i think it is a good idea to update netstat to use 
> INET_DIAG_INFO instead of /proc/net/tcp.

Since that is user space I went ahead and sent the updated kernel patch 
in a fresh thread.

rick

^ permalink raw reply

* Re: [0/7] [PPP]: Fix shared/cloned/non-linear skb bugs (was: malformed captured packets)
From: Toralf Förster @ 2007-09-11 18:12 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev, Paul Mackerras, James Chapman
In-Reply-To: <20070831090625.GA28175@gondor.apana.org.au>


[-- Attachment #1.1: Type: text/plain, Size: 2148 bytes --]

Am Freitag, 31. August 2007 schrieb Herbert Xu:
> On Thu, Aug 30, 2007 at 09:51:31AM +0000, James Chapman wrote:
> >
> > The captured PPPoE stream seems to show incorrect data lengths in the
> > PPPoE header for some captured PPPoE packets. The kernel's PPPoE
> > datapath uses this length to extract the PPP frame and send it through
> > to the ppp interface. Since your ppp stream is fine, the actual PPPoE
> > header contents must be correct when it is parsed by the kernel PPPoE
> > code. It seems more likely that this is a wireshark bug to me.
> 
> If he were using the kernel pppoe driver, then this is because
> PPP filtering is writing over a cloned skb without copying it.
> 
> In fact, there seems to be quite a few bugs of this kind in
> the various ppp*.c files.
> 
> Please try the following patches to see if they make a
> difference.
> 
> I've audited ppp_generic.c and pppoe.c.  I'll do pppol2tp
> tomorrow.
> 
> Cheers,

Running a stable Gentoo kernel 2.6.22-gentoo-r5 now for a while there's only
one thing left related to this topic.

I'm wondering why some UDP packets of the MS messenger protocol (with the usual
text like "please click at www.we-destroy-your-computer.com") always have wrong
check sums regardless whether sniffed at ppp0 or eth0 interface.

But from all UDP packets of this (today) useless protocol only those have wrong
check sums which are marked as "[Long frame (2 bytes)]" within wireshark.

And - last but now least - I have defined the following rule for this protocol :

Chain INPUT (policy DROP 0 packets, 0 bytes)
num   pkts bytes target     prot opt in     out     source               destination
...
8        1   485 DROP       udp  --  any    any     anywhere             anywhere            multiport dports 1026,1027

and this kernel options :
n22 ~ # zgrep ^CONFIG_PPP /proc/config.gz
CONFIG_PPP=m
CONFIG_PPP_FILTER=y
CONFIG_PPPOE=m

and I'm wondering why it is still possible to capture such packets at eth0.

Thanks for an answer.

-- 
MfG/Sincerely

Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

[-- Attachment #1.2: messenger_ethereal_eth0.pcap --]
[-- Type: application/octet-stream, Size: 3934 bytes --]

[-- Attachment #1.3: messenger_tcpdump_ppp0.pcap --]
[-- Type: application/octet-stream, Size: 3886 bytes --]

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* [PATCH 0/2] Clean up owner field in sock_lock_t
From: John Heffner @ 2007-09-11 18:01 UTC (permalink / raw)
  To: davem; +Cc: netdev, John Heffner

I don't know why the owner field is a (struct sock_iocb *).  I'm assuming
it's historical.  Can someone check this out?  Did I miss some alternate
usage?

These patches are against net-2.6.24.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox