Netdev List

Netdev List
 help / color / mirror / Atom feed

* sctp use-uninitialized warning in net-2.6.25
From: Andrew Morton @ 2008-01-16 21:59 UTC (permalink / raw)
  To: netdev

net/sctp/sm_statefuns.c: In function 'sctp_sf_do_5_1C_ack':
net/sctp/sm_statefuns.c:484: warning: 'error' may be used uninitialized in this function

It is not obvious that this is a false positive.

^ permalink raw reply

* Please pull 'fixes-jgarzik' branch of wireless-2.6
From: John W. Linville @ 2008-01-16 21:27 UTC (permalink / raw)
  To: jeff; +Cc: netdev, linux-wireless

Jeff,

A few more fixes for 2.6.24...note that this branch is based
on 2.6.24-rc8.

Thanks,

John

---

Individual patches available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/fixes-jgarzik/

---

The following changes since commit cbd9c883696da72b2b1f03f909dbacc04bbf8b58:
  Linus Torvalds (1):
        Linux 2.6.24-rc8

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git fixes-jgarzik

Ivo van Doorn (1):
      rt2x00: Fix ieee80211 payload alignment

Marc Pignat (1):
      wireless/libertas support for 88w8385 sdio older revision

Randy Dunlap (1):
      hostap: section mismatch warning

Stefano Brivio (2):
      ipw2200: fix typo in kerneldoc
      b43: fix use-after-free rfkill bug

 drivers/net/wireless/b43/rfkill.c        |   11 ++++++-----
 drivers/net/wireless/hostap/hostap_plx.c |    6 +++---
 drivers/net/wireless/ipw2200.c           |    2 +-
 drivers/net/wireless/libertas/if_sdio.c  |    4 ++++
 drivers/net/wireless/rt2x00/rt2x00pci.c  |    2 +-
 drivers/net/wireless/rt2x00/rt2x00usb.c  |   11 +++++++++--
 6 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/net/wireless/b43/rfkill.c b/drivers/net/wireless/b43/rfkill.c
index 98cf70c..11f53cb 100644
--- a/drivers/net/wireless/b43/rfkill.c
+++ b/drivers/net/wireless/b43/rfkill.c
@@ -138,8 +138,11 @@ void b43_rfkill_init(struct b43_wldev *dev)
 	rfk->rfkill->user_claim_unsupported = 1;
 
 	rfk->poll_dev = input_allocate_polled_device();
-	if (!rfk->poll_dev)
-		goto err_free_rfk;
+	if (!rfk->poll_dev) {
+		rfkill_free(rfk->rfkill);
+		goto err_freed_rfk;
+	}
+
 	rfk->poll_dev->private = dev;
 	rfk->poll_dev->poll = b43_rfkill_poll;
 	rfk->poll_dev->poll_interval = 1000; /* msecs */
@@ -175,8 +178,7 @@ err_unreg_rfk:
 err_free_polldev:
 	input_free_polled_device(rfk->poll_dev);
 	rfk->poll_dev = NULL;
-err_free_rfk:
-	rfkill_free(rfk->rfkill);
+err_freed_rfk:
 	rfk->rfkill = NULL;
 out_error:
 	rfk->registered = 0;
@@ -195,6 +197,5 @@ void b43_rfkill_exit(struct b43_wldev *dev)
 	rfkill_unregister(rfk->rfkill);
 	input_free_polled_device(rfk->poll_dev);
 	rfk->poll_dev = NULL;
-	rfkill_free(rfk->rfkill);
 	rfk->rfkill = NULL;
 }
diff --git a/drivers/net/wireless/hostap/hostap_plx.c b/drivers/net/wireless/hostap/hostap_plx.c
index 040dc3e..cbf15d7 100644
--- a/drivers/net/wireless/hostap/hostap_plx.c
+++ b/drivers/net/wireless/hostap/hostap_plx.c
@@ -608,7 +608,7 @@ static void prism2_plx_remove(struct pci_dev *pdev)
 
 MODULE_DEVICE_TABLE(pci, prism2_plx_id_table);
 
-static struct pci_driver prism2_plx_drv_id = {
+static struct pci_driver prism2_plx_driver = {
 	.name		= "hostap_plx",
 	.id_table	= prism2_plx_id_table,
 	.probe		= prism2_plx_probe,
@@ -618,13 +618,13 @@ static struct pci_driver prism2_plx_drv_id = {
 
 static int __init init_prism2_plx(void)
 {
-	return pci_register_driver(&prism2_plx_drv_id);
+	return pci_register_driver(&prism2_plx_driver);
 }
 
 
 static void __exit exit_prism2_plx(void)
 {
-	pci_unregister_driver(&prism2_plx_drv_id);
+	pci_unregister_driver(&prism2_plx_driver);
 }
 
 
diff --git a/drivers/net/wireless/ipw2200.c b/drivers/net/wireless/ipw2200.c
index 88062c1..003f73f 100644
--- a/drivers/net/wireless/ipw2200.c
+++ b/drivers/net/wireless/ipw2200.c
@@ -4935,7 +4935,7 @@ static int ipw_queue_reset(struct ipw_priv *priv)
 /**
  * Reclaim Tx queue entries no more used by NIC.
  *
- * When FW adwances 'R' index, all entries between old and
+ * When FW advances 'R' index, all entries between old and
  * new 'R' index need to be reclaimed. As result, some free space
  * forms. If there is enough free space (> low mark), wake Tx queue.
  *
diff --git a/drivers/net/wireless/libertas/if_sdio.c b/drivers/net/wireless/libertas/if_sdio.c
index b24425f..4f1efb1 100644
--- a/drivers/net/wireless/libertas/if_sdio.c
+++ b/drivers/net/wireless/libertas/if_sdio.c
@@ -871,6 +871,10 @@ static int if_sdio_probe(struct sdio_func *func,
 		if (sscanf(func->card->info[i],
 				"ID: %x", &model) == 1)
 			break;
+               if (!strcmp(func->card->info[i], "IBIS Wireless SDIO Card")) {
+                       model = 4;
+                       break;
+               }
 	}
 
 	if (i == func->card->num_info) {
diff --git a/drivers/net/wireless/rt2x00/rt2x00pci.c b/drivers/net/wireless/rt2x00/rt2x00pci.c
index 6d5d9ab..04663eb 100644
--- a/drivers/net/wireless/rt2x00/rt2x00pci.c
+++ b/drivers/net/wireless/rt2x00/rt2x00pci.c
@@ -149,7 +149,7 @@ void rt2x00pci_rxdone(struct rt2x00_dev *rt2x00dev)
 		 * The data behind the ieee80211 header must be
 		 * aligned on a 4 byte boundary.
 		 */
-		align = NET_IP_ALIGN + (2 * (header_size % 4 == 0));
+		align = header_size % 4;
 
 		/*
 		 * Allocate the sk_buffer, initialize it and copy
diff --git a/drivers/net/wireless/rt2x00/rt2x00usb.c b/drivers/net/wireless/rt2x00/rt2x00usb.c
index ab4797e..568d738 100644
--- a/drivers/net/wireless/rt2x00/rt2x00usb.c
+++ b/drivers/net/wireless/rt2x00/rt2x00usb.c
@@ -245,13 +245,20 @@ static void rt2x00usb_interrupt_rxdone(struct urb *urb)
 	 * Allocate a new sk buffer to replace the current one.
 	 * If allocation fails, we should drop the current frame
 	 * so we can recycle the existing sk buffer for the new frame.
+	 * As alignment we use 2 and not NET_IP_ALIGN because we need
+	 * to be sure we have 2 bytes room in the head. (NET_IP_ALIGN
+	 * can be 0 on some hardware). We use these 2 bytes for frame
+	 * alignment later, we assume that the chance that
+	 * header_size % 4 == 2 is bigger then header_size % 2 == 0
+	 * and thus optimize alignment by reserving the 2 bytes in
+	 * advance.
 	 */
 	frame_size = entry->ring->data_size + entry->ring->desc_size;
-	skb = dev_alloc_skb(frame_size + NET_IP_ALIGN);
+	skb = dev_alloc_skb(frame_size + 2);
 	if (!skb)
 		goto skip_entry;
 
-	skb_reserve(skb, NET_IP_ALIGN);
+	skb_reserve(skb, 2);
 	skb_put(skb, frame_size);
 
 	/*
-- 
John W. Linville
linville@tuxdriver.com

^ permalink raw reply related

* Please pull 'fixes-davem' branch of wireless-2.6
From: John W. Linville @ 2008-01-16 21:26 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA

Dave,

One more fix for rfkill in 2.6.24...note that this branch is based
on 2.6.24-rc8.

Thanks,

John

---

Individual patches available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/fixes-davem/

---

The following changes since commit cbd9c883696da72b2b1f03f909dbacc04bbf8b58:
  Linus Torvalds (1):
        Linux 2.6.24-rc8

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git fixes-davem

Eric Paris (1):
      rfkill: call rfkill_led_trigger_unregister() on error

 net/rfkill/rfkill.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/rfkill/rfkill.c b/net/rfkill/rfkill.c
index 4469a7b..d06d338 100644
--- a/net/rfkill/rfkill.c
+++ b/net/rfkill/rfkill.c
@@ -392,11 +392,14 @@ int rfkill_register(struct rfkill *rfkill)
 	rfkill_led_trigger_register(rfkill);
 
 	error = rfkill_add_switch(rfkill);
-	if (error)
+	if (error) {
+		rfkill_led_trigger_unregister(rfkill);
 		return error;
+	}
 
 	error = device_add(dev);
 	if (error) {
+		rfkill_led_trigger_unregister(rfkill);
 		rfkill_remove_switch(rfkill);
 		return error;
 	}
-- 
John W. Linville
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org

^ permalink raw reply related

* Re: questions on NAPI processing latency and dropped network packets
From: Willy Tarreau @ 2008-01-16 20:04 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Herbert Xu, cfriesen, davem, netdev, linux-kernel
In-Reply-To: <20080116065836.GA1638@ff.dom.local>

On Wed, Jan 16, 2008 at 07:58:36AM +0100, Jarek Poplawski wrote:
> On Wed, Jan 16, 2008 at 11:17:08AM +1100, Herbert Xu wrote:
> ...
> > Well people are always going to operate on this model for commercial
> > reasons.  FWIW I used to work for a company that stuck to a specific
> > version of the Linux kernel, and I suppose I still do even now :)
> > 
> > But the important thing is that if you're going to do that, then the
> > cost that comes with it should be borne by the company and not the
> > community.
> 
> Sure. But the most sad thing is there seems to be not so much savings
> in this (unless a company isn't sure of its near future). Trying to
> upgrade and test current products with current kernels, even if not
> necessary, should be always useful and make developing of new products
> faster and better fit (and of course, BTW, make the kernel better on
> time).

you can work with latest release provided that you always have a fallback
to an earlier one. That way, you don't bet too much on something you don't
completely control. If it works, it tells you you'll be able to completely
exploit its new possibilities in next product release, and if it breaks,
it's easy to issue a fix to get back to earlier, well-tested version.

Regards,
Willy


^ permalink raw reply

* Re: SO_RCVBUF doesn't change receiver advertised window
From: John Heffner @ 2008-01-16 19:42 UTC (permalink / raw)
  To: Ritesh Kumar; +Cc: Bill Fink, netdev
In-Reply-To: <f47983b00801161127u5fa4de0bgf53bb3b8fc57a3c7@mail.gmail.com>

Ritesh Kumar wrote:
> On 1/16/08, Bill Fink <billfink@mindspring.com> wrote:
>> On Tue, 15 Jan 2008, Ritesh Kumar wrote:
>>
>>> Hi,
>>>     I am using linux 2.6.20 and am trying to limit the receiver window
>>> size for a TCP connection. However, it seems that auto tuning is not
>>> turning itself off even after I use the syscall
>>>
>>> rwin=65536
>>> setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, sizeof(rwin));
>>>
>>> and verify using
>>>
>>> getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, &rwin_size);
>>>
>>> that RCVBUF indeed is getting set (the value returned from getsockopt
>>> is double that, 131072).
>> Linux doubles what you requested, and then uses (by default) 1/4
>> of the socket space for overhead, so you effectively get 1.5 times
>> what you requested as an actual advertised receiver window, which
>> means since you specified 64 KB, you actually get 96 KB.
>>
>>> The above calls are made before connect() on the client side and
>>> before bind(), accept() on the server side. Bulk data is being sent
>>> from the client to the server. The client and the server machines also
>>> have tcp_moderate_rcvbuf set to 0 (though I don't think that's really
>>> needed; setting a value to SO_RCVBUF should automatically turnoff auto
>>> tuning.).
>>>
>>> However the tcp trace shows the SYN, SYN/ACK and the first few packets as:
>>> 14:34:18.831703 IP 192.168.1.153.45038 > 192.168.2.204.9999: S
>>> 3947298186:3947298186(0) win 5840 <mss 1460,sackOK,timestamp 2842625
>>> 0,nop,wscale 5>
>>> 14:34:18.836000 IP 192.168.2.204.9999 > 192.168.1.153.45038: S
>>> 3955381015:3955381015(0) ack 3947298187 win 5792 <mss
>>> 1460,sackOK,timestamp 2843649 2842625,nop,wscale 2>
>>> 14:34:18.837654 IP 192.168.1.153.45038 > 192.168.2.204.9999: . ack 1
>>> win 183 <nop,nop,timestamp 2842634 2843649>
>>> 14:34:18.837849 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
>>> 1:1449(1448) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
>>> 14:34:18.837851 IP 192.168.1.153.45038 > 192.168.2.204.9999: P
>>> 1449:1461(12) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
>>> 14:34:18.839001 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
>>> 1449 win 2172 <nop,nop,timestamp 2843652 2842634>
>>> 14:34:18.839011 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
>>> 1461 win 2172 <nop,nop,timestamp 2843652 2842634>
>>> 14:34:18.840875 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
>>> 1461:2909(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
>>> 14:34:18.840997 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
>>> 2909:4357(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
>>> 14:34:18.841120 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
>>> 4357:5805(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
>>> 14:34:18.841244 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
>>> 5805:7253(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
>>> 14:34:18.841388 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
>>> 2909 win 2896 <nop,nop,timestamp 2843655 2842637>
>>> 14:34:18.841399 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
>>> 4357 win 3620 <nop,nop,timestamp 2843655 2842637>
>>> 14:34:18.841413 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
>>> 5805 win 4344 <nop,nop,timestamp 2843655 2842637>
>>>
>>> As you can see, the syn and syn ack show rcv windows to be 5840 and
>>> 5792 and it automatically increases for the receiver to values 2172
>>> till 4344 and more in the later part of the trace till 24214.
>> Since the window scale was 2, the final advertised receiver window
>> you indicate of 24214 gives 2^2*24214 or right around 96 KB, which
>> is what is expected given the way Linux works.
>>
>>                                                 -Bill
> 
> Thanks for the explanation Bill. That surely clears part of my doubt.
> However, why doesn't linux advertise 24214 in the SYN packets? I was
> hoping that the moment I setup a RCVBUF, linux would pre-allocate
> buffers and drop any autotuning. Doesn't the above behavior count as
> autotuning?


Linux also starts all connections with a small advertised window.  It 
only grows the window after observing the ratio of data to overhead in 
received packets.  If it receives only small packets from the sender 
with a high overhead ratio, it will only open the window just far enough 
that it doesn't overflow the receive buffer.  This algorithm (look for 
rcv_ssthresh in the code) controls the advertised window given a receive 
buffer size.  This is separate from autotuning, which adjusts the buffer 
size.  You're correct that autotuning is disabled when SO_RCVBUF is set, 
but the "receive slow-start" is always used.

   -John

^ permalink raw reply

* Re: SO_RCVBUF doesn't change receiver advertised window
From: Ritesh Kumar @ 2008-01-16 19:27 UTC (permalink / raw)
  To: Bill Fink; +Cc: netdev
In-Reply-To: <20080116045026.eb62287a.billfink@mindspring.com>

On 1/16/08, Bill Fink <billfink@mindspring.com> wrote:
> On Tue, 15 Jan 2008, Ritesh Kumar wrote:
>
> > Hi,
> >     I am using linux 2.6.20 and am trying to limit the receiver window
> > size for a TCP connection. However, it seems that auto tuning is not
> > turning itself off even after I use the syscall
> >
> > rwin=65536
> > setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, sizeof(rwin));
> >
> > and verify using
> >
> > getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, &rwin_size);
> >
> > that RCVBUF indeed is getting set (the value returned from getsockopt
> > is double that, 131072).
>
> Linux doubles what you requested, and then uses (by default) 1/4
> of the socket space for overhead, so you effectively get 1.5 times
> what you requested as an actual advertised receiver window, which
> means since you specified 64 KB, you actually get 96 KB.
>
> > The above calls are made before connect() on the client side and
> > before bind(), accept() on the server side. Bulk data is being sent
> > from the client to the server. The client and the server machines also
> > have tcp_moderate_rcvbuf set to 0 (though I don't think that's really
> > needed; setting a value to SO_RCVBUF should automatically turnoff auto
> > tuning.).
> >
> > However the tcp trace shows the SYN, SYN/ACK and the first few packets as:
> > 14:34:18.831703 IP 192.168.1.153.45038 > 192.168.2.204.9999: S
> > 3947298186:3947298186(0) win 5840 <mss 1460,sackOK,timestamp 2842625
> > 0,nop,wscale 5>
> > 14:34:18.836000 IP 192.168.2.204.9999 > 192.168.1.153.45038: S
> > 3955381015:3955381015(0) ack 3947298187 win 5792 <mss
> > 1460,sackOK,timestamp 2843649 2842625,nop,wscale 2>
> > 14:34:18.837654 IP 192.168.1.153.45038 > 192.168.2.204.9999: . ack 1
> > win 183 <nop,nop,timestamp 2842634 2843649>
> > 14:34:18.837849 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> > 1:1449(1448) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
> > 14:34:18.837851 IP 192.168.1.153.45038 > 192.168.2.204.9999: P
> > 1449:1461(12) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
> > 14:34:18.839001 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> > 1449 win 2172 <nop,nop,timestamp 2843652 2842634>
> > 14:34:18.839011 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> > 1461 win 2172 <nop,nop,timestamp 2843652 2842634>
> > 14:34:18.840875 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> > 1461:2909(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> > 14:34:18.840997 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> > 2909:4357(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> > 14:34:18.841120 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> > 4357:5805(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> > 14:34:18.841244 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> > 5805:7253(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> > 14:34:18.841388 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> > 2909 win 2896 <nop,nop,timestamp 2843655 2842637>
> > 14:34:18.841399 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> > 4357 win 3620 <nop,nop,timestamp 2843655 2842637>
> > 14:34:18.841413 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> > 5805 win 4344 <nop,nop,timestamp 2843655 2842637>
> >
> > As you can see, the syn and syn ack show rcv windows to be 5840 and
> > 5792 and it automatically increases for the receiver to values 2172
> > till 4344 and more in the later part of the trace till 24214.
>
> Since the window scale was 2, the final advertised receiver window
> you indicate of 24214 gives 2^2*24214 or right around 96 KB, which
> is what is expected given the way Linux works.
>
>                                                 -Bill

Thanks for the explanation Bill. That surely clears part of my doubt.
However, why doesn't linux advertise 24214 in the SYN packets? I was
hoping that the moment I setup a RCVBUF, linux would pre-allocate
buffers and drop any autotuning. Doesn't the above behavior count as
autotuning?


Ritesh

^ permalink raw reply

* Re: memory leakage in bridge(kernel-2.6.23.14)
From: Stephen Hemminger @ 2008-01-16 19:04 UTC (permalink / raw)
  To: wyb; +Cc: netdev
In-Reply-To: <200801161006.AZZ53966@topsec.com.cn>

On Wed, 16 Jan 2008 18:04:53 +0800
<wyb@topsec.com.cn> wrote:

> 
> In SMP, if a bridge fdb is being created when another CPU at the same time
> delete the bridge, this newly created fdb may incur a leakage:
> 
> CPU0:
> 
> static void del_nbp(struct net_bridge_port *p)
> {
> 	/*
> 	 * CPU1 enter br_fdb_update(), bridge port is still valid.
> 	 */
> 	......
> 	spin_lock_bh(&br->lock);
> 	br_stp_disable_port(p);
> 	spin_unlock_bh(&br->lock);
> 
> 	br_ifinfo_notify(RTM_DELLINK, p);
> 
> 	br_fdb_delete_by_port(br, p, 1);
> 
> 	/*
> 	 * CPU1 call fdb_create() for the being deleted bridge,
> 	 * a fdb would be add to bridge's fdb hash table, and will never
> 	 * be freed. because when deleting a bridge, linux flush fdb for
> each
> 	 * bridge port, but this newly created fdb belong to no bridge port
> 	 */
> 	......
> }
> 
> To fix this, fdb_create() should be changed to:
> {
> 	struct net_bridge_fdb_entry *fdb;
> 
> 	/*
> 	 * if the bridge port is deleted, then return.
> 	 */
> 	if (!(source->state == BR_STATE_LEARNING ||
> 	      source->state == BR_STATE_FORWARDING))
> 		return;
> 
> 	fdb = kmem_cache_alloc(br_fdb_cache, GFP_ATOMIC);
> 	if (fdb) {
> 		memcpy(fdb->addr.addr, addr, ETH_ALEN);
> 		atomic_set(&fdb->use_count, 1);
> 		hlist_add_head_rcu(&fdb->hlist, head);
> 
> 		fdb->dst = source;
> 		fdb->is_local = is_local;
> 		fdb->is_static = is_local;
> 		fdb->ageing_timer = jiffies;
> 	}
> 	return fdb;
> }
> 

That check is not enough, since the state might change during
fdb_create.


I think fdb_delete_by_port needs to be moved after RCU barrier event.
Something like this (untested) patch.

--- a/net/bridge/br_if.c	2008-01-16 11:00:09.000000000 -0800
+++ b/net/bridge/br_if.c	2008-01-16 11:01:11.000000000 -0800
@@ -116,6 +116,9 @@ static void destroy_nbp_rcu(struct rcu_h
 {
 	struct net_bridge_port *p =
 			container_of(head, struct net_bridge_port, rcu);
+
+	br_fdb_delete_by_port(p->br, p, 1);
+
 	destroy_nbp(p);
 }
 
@@ -143,8 +146,6 @@ static void del_nbp(struct net_bridge_po
 
 	br_ifinfo_notify(RTM_DELLINK, p);
 
-	br_fdb_delete_by_port(br, p, 1);
-
 	list_del_rcu(&p->list);
 
 	rcu_assign_pointer(dev->br_port, NULL);

^ permalink raw reply

* Re: [Bugme-new] [Bug 9758] New: net_device refcnt bug when NFQUEUEing bridged packets
From: Stephen Hemminger @ 2008-01-16 18:56 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev
In-Reply-To: <478D8F29.5040703@trash.net>

On Wed, 16 Jan 2008 05:59:21 +0100
Patrick McHardy <kaber@trash.net> wrote:

> Patrick McHardy wrote:
> > Very nice catch, that explains quite a few bug reports about
> > refcnt leaks. Your patch looks correct and performs the copying
> > in the logically correct place, it would be nicer to keep this
> > crap limited to bridge netfilter however.
> > 
> > What should work is to perform the copying in br_netfilter.c
> > at the spots where phsyoutdev is assigned. As an optimization
> > we should be able to avoid the copying in most cases by
> > checking that the bridge info has a refcount above 1.
> > 
> > Could you test whether this patch also fixes the problem?
> 
> 
> That patch had a bug, we need to set the refcount of the
> new bridge info to 1 after performing the copy.
> 

This looks good, but you could use a structure assignment rather memcpy
(just a personal style preference because assignment is typed and memcpy
is not).

^ permalink raw reply

* Re: [patch] add net_device_stats support to ethtool
From: Ben Greear @ 2008-01-16 18:46 UTC (permalink / raw)
  To: Dan Nicolaescu; +Cc: netdev
In-Reply-To: <200801161834.m0GIYCIr027613@sallyv1.ics.uci.edu>

Dan Nicolaescu wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>   > Dan Nicolaescu wrote:
>   > > Hi,
>   > >
>   > > I have posted this patch in the past with absolutely no reply.
>   > > I would appreciate some sort of feedback of the form interested/not
>   > > interested.  Should I just drop it?
>   > >
>   > >   
>   > I like it, but why not offer this for all devices since they all have
>   > these stats.
>   >
>   > Could add new handlers called something like .get_strings_generic, or
>   > just add this to the higher-level ethtool handling before it looks for
>   > handlers.
>
> If I get your point, then the difference would be that drivers would add
> to the initialization of the ethtool structure something like:
>   
I meant something more like this (this will not apply..I hand-edited it 
to remove
some extraneous crap from my patch...and I'm sure it's white-space damaged).

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index c5e0593..095d1eb 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c

+/* Handle some generic ethtool commands here */
+static int ethtool_get_netdev_stats(struct net_device *dev, void 
*useraddr) {
+   
+    struct ethtool_ndstats* nds = (struct ethtool_ndstats*)(useraddr);
+     
+    struct net_device_stats *stats = dev->get_stats(dev);
+    if (stats) {
+        if (copy_to_user(nds->data, stats, sizeof(*stats))) {
+            return -EFAULT;
+        }
+    }
+    else {
+        return -EOPNOTSUPP;
+    }
+    return 0;
+}
+
+
 /* The main entry point in this file.  Called from net/core/dev.c */
 
 int dev_ethtool(struct ifreq *ifr)
@@ -796,9 +884,6 @@ int dev_ethtool(struct ifreq *ifr)
     if (!dev || !netif_device_present(dev))
         return -ENODEV;
 
-    if (!dev->ethtool_ops)
-        return -EOPNOTSUPP;
-
     if (copy_from_user(&ethcmd, useraddr, sizeof (ethcmd)))
         return -EFAULT;
 
@@ -823,12 +908,25 @@ int dev_ethtool(struct ifreq *ifr)
             return -EPERM;
     }
 
-    if (dev->ethtool_ops->begin)
+    if (dev->ethtool_ops && dev->ethtool_ops->begin)
         if ((rc = dev->ethtool_ops->begin(dev)) < 0)
             return rc;
 
     old_features = dev->features;
 
+    /* Handle some generic operations that do not require specific
+     * ethtool handlers.
+     */
+    switch (ethcmd) {
+    case ETHTOOL_GNDSTATS:
+        return ethtool_get_netdev_stats(dev, useraddr);
+    default:
+        break;
+    }
+
+    if (!dev->ethtool_ops)
+        return -EOPNOTSUPP;
+
     switch (ethcmd) {
     case ETHTOOL_GSET:
         rc = ethtool_get_settings(dev, useraddr);


-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply

* Re: [patch] add net_device_stats support to ethtool
From: Dan Nicolaescu @ 2008-01-16 18:34 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <478E46F6.4070901@candelatech.com>

Ben Greear <greearb@candelatech.com> writes:

  > Dan Nicolaescu wrote:
  > > Hi,
  > >
  > > I have posted this patch in the past with absolutely no reply.
  > > I would appreciate some sort of feedback of the form interested/not
  > > interested.  Should I just drop it?
  > >
  > >   
  > I like it, but why not offer this for all devices since they all have
  > these stats.
  >
  > Could add new handlers called something like .get_strings_generic, or
  > just add this to the higher-level ethtool handling before it looks for
  > handlers.

If I get your point, then the difference would be that drivers would add
to the initialization of the ethtool structure something like:

         .get_strings_generic = 1;

instead of what I originally proposed:

	.get_strings	   = ethtool_op_net_device_stats_get_strings,
	.get_stats_count   = ethtool_op_net_device_stats_get_stats_count,
	.get_ethtool_stats = ethtool_op_net_device_get_ethtool_stats,

Sure that could work, but it would require a few lines of changes in
ethtool.

I can submit a patch that does things that way, if that is considered
better.  But I would like to hear that this code is wanted before
putting any effort in it. I has been ignored for so long... 

Thanks

        --dan

^ permalink raw reply

* Re: [patch] add net_device_stats support to ethtool
From: Ben Greear @ 2008-01-16 18:03 UTC (permalink / raw)
  To: Dan Nicolaescu; +Cc: netdev
In-Reply-To: <200801161529.m0GFTfYR005096@sallyv1.ics.uci.edu>

Dan Nicolaescu wrote:
> Hi,
>
> I have posted this patch in the past with absolutely no reply.
> I would appreciate some sort of feedback of the form 
> interested/not interested.  Should I just drop it?
>
>   
I like it, but why not offer this for all devices since they all have 
these stats.
Could add new handlers called something like .get_strings_generic, or
just add this to the higher-level ethtool handling before it looks for 
handlers.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply

* [IPV4] FIB_HASH : Avoid unecessary loop in fn_hash_dump_zone()
From: Eric Dumazet @ 2008-01-16 17:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Robert Olsson

I noticed "ip route list" was slower than "cat /proc/net/route" on a machine with a full Internet
routing table (214392 entries : Special thanks to Robert ;) )

This is similar to problem reported in commit d8c9283089287341c85a0a69de32c2287a990e71

Fix is to avoid scanning the begining of fz_hash table, but directly seek to the right offset.

Before patch :

time ip route >/tmp/ROUTE

real    0m1.285s
user    0m0.712s
sys     0m0.436s

After patch

# time ip route >/tmp/ROUTE

real    0m0.835s
user    0m0.692s
sys     0m0.124s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

 net/ipv4/fib_hash.c |   20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 527a6e0..4156988 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -718,19 +718,18 @@ fn_hash_dump_zone(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int h, s_h;
 
+	if (fz->fz_hash == NULL)
+		return skb->len;
 	s_h = cb->args[3];
-	for (h=0; h < fz->fz_divisor; h++) {
-		if (h < s_h) continue;
-		if (h > s_h)
-			memset(&cb->args[4], 0,
-			       sizeof(cb->args) - 4*sizeof(cb->args[0]));
-		if (fz->fz_hash == NULL ||
-		    hlist_empty(&fz->fz_hash[h]))
+	for (h = s_h; h < fz->fz_divisor; h++) {
+		if (hlist_empty(&fz->fz_hash[h]))
 			continue;
-		if (fn_hash_dump_bucket(skb, cb, tb, fz, &fz->fz_hash[h])<0) {
+		if (fn_hash_dump_bucket(skb, cb, tb, fz, &fz->fz_hash[h]) < 0) {
 			cb->args[3] = h;
 			return -1;
 		}
+		memset(&cb->args[4], 0,
+		       sizeof(cb->args) - 4*sizeof(cb->args[0]));
 	}
 	cb->args[3] = h;
 	return skb->len;
@@ -746,14 +745,13 @@ static int fn_hash_dump(struct fib_table *tb, struct sk_buff *skb, struct netlin
 	read_lock(&fib_hash_lock);
 	for (fz = table->fn_zone_list, m=0; fz; fz = fz->fz_next, m++) {
 		if (m < s_m) continue;
-		if (m > s_m)
-			memset(&cb->args[3], 0,
-			       sizeof(cb->args) - 3*sizeof(cb->args[0]));
 		if (fn_hash_dump_zone(skb, cb, tb, fz) < 0) {
 			cb->args[2] = m;
 			read_unlock(&fib_hash_lock);
 			return -1;
 		}
+		memset(&cb->args[3], 0,
+		       sizeof(cb->args) - 3*sizeof(cb->args[0]));
 	}
 	read_unlock(&fib_hash_lock);
 	cb->args[2] = m;

^ permalink raw reply related

* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Robert Olsson @ 2008-01-16 17:07 UTC (permalink / raw)
  To: David Miller; +Cc: elendil, jesse.brandeburg, slavon, netdev, linux-kernel
In-Reply-To: <20080116.022953.184520911.davem@davemloft.net>


David Miller writes:
 > > On Wednesday 16 January 2008, David Miller wrote:
 > > > Ok, here is the patch I'll propose to fix this.  The goal is to make
 > > > it as simple as possible without regressing the thing we were trying
 > > > to fix.
 > > 
 > > Looks good to me. Tested with -rc8.
 > 
 > Thanks for testing.

 Yes that code looks nice. I'm using the patch but I've noticed another 
 phenomena with the current e1000 driver. There is a race when taking a 
 device down at high traffic loads. I've tracked and instrumented and it 
 seems like occasionly irq_sem can get bump up so interrupts can't be 
 enabled again.


eth0 e1000_irq_enable sem = 1    <- High netload
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1    <- ifconfig eth0 down
eth0 e1000_irq_disable sem = 2

**e1000_open                     <- ifconfig eth0 up
eth0 e1000_irq_disable sem = 3      Dead. irq's can't be enabled
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 2
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 1
ADDRCONF(NETDEV_UP): eth0: link is not ready


Cheers
					--ro

static void
e1000_irq_disable(struct e1000_adapter *adapter)
{
        atomic_inc(&adapter->irq_sem);
        E1000_WRITE_REG(&adapter->hw, IMC, ~0);
        E1000_WRITE_FLUSH(&adapter->hw);
        synchronize_irq(adapter->pdev->irq);

        if(adapter->netdev->ifindex == 3)
        printk("%s e1000_irq_disable sem = %d\n",  adapter->netdev->name,
               atomic_read(&adapter->irq_sem));
}

static void
e1000_irq_enable(struct e1000_adapter *adapter)
{
        if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
                E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
                E1000_WRITE_FLUSH(&adapter->hw);
                }
        else
                printk("e1000_irq_enable miss\n");

        if(adapter->netdev->ifindex == 3)
          printk("%s e1000_irq_enable sem = %d\n",  adapter->netdev->name,
                 atomic_read(&adapter->irq_sem));
}

^ permalink raw reply

* A reliable way to improve your life!
From: santos @ 2008-01-16 16:47 UTC (permalink / raw)
  To: netdev

Say good bye to ED dysfunction! http://rhwe.imagineoh.com


^ permalink raw reply

* Re: [PATCH] ICMP: ICMP_MIB_OUTMSGS increment duplicated
From: David Stevens @ 2008-01-16 16:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, netdev, Wang Chen
In-Reply-To: <E1JF6lR-0007sF-00@gondolin.me.apana.org.au>

Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/16/2008 03:49:01 AM:

> Actually having the icmp_out_count call in ip_push_pending_frames seems
> inconsistent.  Having it there means that we count raw socket ICMP 
packets
> too.  But we don't do that for any other protocol, e.g., raw UDP packets
> don't get counted.

Herbert,
        The patch was to support the ICMPMsgStats table. Since none of 
certain
types of output ICMP messages are generated by the kernel, but are 
required
by the RFC, counting raw sockets is intentional (and the only way those 
ICMP
types can be counted at all).
        Raw UDP packets would not be counted either before or after the 
patch,
but aren't part of the ICMPMsgStats table. Adding those might be 
worthwhile,
but it isn't quite the hole that the ICMP out stats were, since there is a
cooked interface for UDP output that counts the common use, at least.

Wang,
        I think your patch is correct; did you test the same case for 
IPv6?

 +-DLS


^ permalink raw reply

* [patch] add net_device_stats support to ethtool
From: Dan Nicolaescu @ 2008-01-16 15:29 UTC (permalink / raw)
  To: netdev


Hi,

I have posted this patch in the past with absolutely no reply.
I would appreciate some sort of feedback of the form 
interested/not interested.  Should I just drop it?

"ethtool -S" only supports devices that have custom code written to
print the stats. 

A lot of drivers use "struct net_device_stats", so adding code to
ethtool would make it very easy for such drivers to add support for
"ethtool -S". The drivers would just need to add this:

	.get_strings	   = ethtool_op_net_device_stats_get_strings,
	.get_stats_count   = ethtool_op_net_device_stats_get_stats_count,
	.get_ethtool_stats = ethtool_op_net_device_get_ethtool_stats,

to their struct ethtool_ops. I found this very useful when debugging a
new driver.

(The function names are not the best, suggestions for better names are
welcome).

The code added is very small, and it gives easy access to stats that the
drivers are computing anyway.

This should save code in case the drivers the use net_device_stats
decide they want ethtool stats support.

The patch only adds, it does not modify any existing line of code.

Thanks
        --Dan

--- ethtool.c~	2004-12-24 13:35:50.000000000 -0800
+++ ethtool.c	2006-12-01 08:55:26.000000000 -0800
@@ -809,6 +809,64 @@
 	return -EOPNOTSUPP;
 }
 
+
+#define NET_DEVICE_NUM_STATS (sizeof(struct net_device_stats) / sizeof(unsigned long))
+
+static struct {
+	const char string[ETH_GSTRING_LEN];
+} ethtool_net_device_stats_keys[NET_DEVICE_NUM_STATS] = {
+	{ "rx_packets"},	 
+	{ "tx_packets"},	 
+	{ "rx_bytes"},	 
+	{ "tx_bytes"},	 
+	{ "rx_errors"},	 
+	{ "tx_errors"},	 
+	{ "rx_dropped"},	 
+	{ "tx_dropped"},	 
+	{ "multicast"},	 
+	{ "collisions"},
+	{ "rx_length_errors"},
+	{ "rx_over_errors"},	 
+	{ "rx_crc_errors"},	 
+	{ "rx_frame_errors"}, 
+	{ "rx_fifo_errors"},	 
+	{ "rx_missed_errors"},
+	{ "tx_aborted_errors"},
+	{ "tx_carrier_errors"},
+	{ "tx_fifo_errors"},
+	{ "tx_heartbeat_errors"},
+	{ "tx_window_errors"},
+	{ "rx_compressed"},
+	{ "tx_compressed"}
+};
+
+int ethtool_op_net_device_stats_get_stats_count(struct net_device *dev)
+{
+	return NET_DEVICE_NUM_STATS;
+}
+
+void ethtool_op_net_device_stats_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+	switch (stringset) {
+	case ETH_SS_STATS:
+		memcpy(buf, &ethtool_net_device_stats_keys, sizeof(ethtool_net_device_stats_keys));
+		break;
+	default:
+		WARN_ON(1);	/* we need a WARN() */
+		break;
+	}
+}
+
+void ethtool_op_net_device_get_ethtool_stats(struct net_device *dev,
+					      struct ethtool_stats *estats, u64 *tmp_stats)
+{
+	u32 i;
+	u64 *dest = tmp_stats;
+	unsigned long *src = (unsigned long*)dev->get_stats(dev);
+	for (i = 0; i < estats->n_stats; i++)
+	  *dest++ = *src++;
+}
+
 EXPORT_SYMBOL(dev_ethtool);
 EXPORT_SYMBOL(ethtool_op_get_link);
 EXPORT_SYMBOL(ethtool_op_get_sg);
@@ -817,3 +875,6 @@
 EXPORT_SYMBOL(ethtool_op_set_sg);
 EXPORT_SYMBOL(ethtool_op_set_tso);
 EXPORT_SYMBOL(ethtool_op_set_tx_csum);
+EXPORT_SYMBOL(ethtool_op_net_device_stats_get_stats_count);
+EXPORT_SYMBOL(ethtool_op_net_device_stats_get_strings);
+EXPORT_SYMBOL(ethtool_op_net_device_get_ethtool_stats);
--- ethtool.h~	2006-09-05 12:29:45.000000000 -0700
+++ ethtool.h	2006-12-01 08:51:46.000000000 -0800
@@ -260,6 +260,12 @@
 int ethtool_op_set_sg(struct net_device *dev, u32 data);
 u32 ethtool_op_get_tso(struct net_device *dev);
 int ethtool_op_set_tso(struct net_device *dev, u32 data);
+int ethtool_op_net_device_stats_get_stats_count(struct net_device *dev);
+void ethtool_op_net_device_stats_get_strings(struct net_device *dev,
+					     u32 stringset, u8 *buf);
+void ethtool_op_net_device_get_ethtool_stats(struct net_device *dev,
+					     struct ethtool_stats *estats,
+					     u64 *tmp_stats);
 
 /**
  * &ethtool_ops - Alter and report network device settings

^ permalink raw reply

* [patch 2/2][NETNS][DST] add the network namespace pointer in dst_ops
From: Daniel Lezcano @ 2008-01-16 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, den, benjamin.thery
In-Reply-To: <20080116145416.844293640@localhost.localdomain>

[-- Attachment #1: add-net-ns-to-dst-ops.patch --]
[-- Type: text/plain, Size: 788 bytes --]

The network namespace pointer can be stored into the dst_ops structure.
This is usefull when there are multiple instances of the dst_ops for a
protocol. When there are no several instances, this field will be never
used in the protocol. So there is no impact for the protocols which do
implement the network namespaces.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/dst.h |    1 +
 1 file changed, 1 insertion(+)

Index: net-2.6.25-misc/include/net/dst.h
===================================================================
--- net-2.6.25-misc.orig/include/net/dst.h
+++ net-2.6.25-misc/include/net/dst.h
@@ -102,6 +102,7 @@ struct dst_ops
 
 	atomic_t		entries;
 	struct kmem_cache 		*kmem_cachep;
+	struct net              *dst_net;
 };
 
 #ifdef __KERNEL__

-- 

^ permalink raw reply

* [patch 1/2][NETNS][DST] dst: pass the dst_ops as parameter to the gc functions
From: Daniel Lezcano @ 2008-01-16 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, den, benjamin.thery
In-Reply-To: <20080116145416.844293640@localhost.localdomain>

[-- Attachment #1: pass-dst-gc-ops-parameter.patch --]
[-- Type: text/plain, Size: 5523 bytes --]

The garbage collection function receive the dst_ops structure as
parameter. This is useful for the next incoming patchset because
it will need the dst_ops (there will be several instances) and 
the network namespace pointer (contained in the dst_ops).

The protocols which do not take care of the namespaces will not
be impacted by this change (expect for the function signature),
they do just ignore the parameter.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/dst.h       |    2 +-
 net/core/dst.c          |    2 +-
 net/decnet/dn_route.c   |    4 ++--
 net/ipv4/route.c        |    6 +++---
 net/ipv4/xfrm4_policy.c |    2 +-
 net/ipv6/route.c        |    4 ++--
 net/ipv6/xfrm6_policy.c |    2 +-
 7 files changed, 11 insertions(+), 11 deletions(-)

Index: net-2.6.25-misc/include/net/dst.h
===================================================================
--- net-2.6.25-misc.orig/include/net/dst.h
+++ net-2.6.25-misc/include/net/dst.h
@@ -89,7 +89,7 @@ struct dst_ops
 	__be16			protocol;
 	unsigned		gc_thresh;
 
-	int			(*gc)(void);
+	int			(*gc)(struct dst_ops *ops);
 	struct dst_entry *	(*check)(struct dst_entry *, __u32 cookie);
 	void			(*destroy)(struct dst_entry *);
 	void			(*ifdown)(struct dst_entry *,
Index: net-2.6.25-misc/net/core/dst.c
===================================================================
--- net-2.6.25-misc.orig/net/core/dst.c
+++ net-2.6.25-misc/net/core/dst.c
@@ -165,7 +165,7 @@ void * dst_alloc(struct dst_ops * ops)
 	struct dst_entry * dst;
 
 	if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
-		if (ops->gc())
+		if (ops->gc(ops))
 			return NULL;
 	}
 	dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
Index: net-2.6.25-misc/net/decnet/dn_route.c
===================================================================
--- net-2.6.25-misc.orig/net/decnet/dn_route.c
+++ net-2.6.25-misc/net/decnet/dn_route.c
@@ -107,7 +107,7 @@ static const int dn_rt_mtu_expires = 10 
 
 static unsigned long dn_rt_deadline;
 
-static int dn_dst_gc(void);
+static int dn_dst_gc(struct dst_ops *ops);
 static struct dst_entry *dn_dst_check(struct dst_entry *, __u32);
 static struct dst_entry *dn_dst_negative_advice(struct dst_entry *);
 static void dn_dst_link_failure(struct sk_buff *);
@@ -185,7 +185,7 @@ static void dn_dst_check_expire(unsigned
 	mod_timer(&dn_route_timer, now + decnet_dst_gc_interval * HZ);
 }
 
-static int dn_dst_gc(void)
+static int dn_dst_gc(struct dst_ops *ops)
 {
 	struct dn_route *rt, **rtp;
 	int i;
Index: net-2.6.25-misc/net/ipv4/route.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv4/route.c
+++ net-2.6.25-misc/net/ipv4/route.c
@@ -154,7 +154,7 @@ static void		 ipv4_dst_ifdown(struct dst
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
-static int rt_garbage_collect(void);
+static int rt_garbage_collect(struct dst_ops *ops);
 
 
 static struct dst_ops ipv4_dst_ops = {
@@ -820,7 +820,7 @@ static void rt_secret_rebuild(unsigned l
    and when load increases it reduces to limit cache size.
  */
 
-static int rt_garbage_collect(void)
+static int rt_garbage_collect(struct dst_ops *ops)
 {
 	static unsigned long expire = RT_GC_TIMEOUT;
 	static unsigned long last_gc;
@@ -1035,7 +1035,7 @@ restart:
 				int saved_int = ip_rt_gc_min_interval;
 				ip_rt_gc_elasticity	= 1;
 				ip_rt_gc_min_interval	= 0;
-				rt_garbage_collect();
+				rt_garbage_collect(&ipv4_dst_ops);
 				ip_rt_gc_min_interval	= saved_int;
 				ip_rt_gc_elasticity	= saved_elasticity;
 				goto restart;
Index: net-2.6.25-misc/net/ipv4/xfrm4_policy.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv4/xfrm4_policy.c
+++ net-2.6.25-misc/net/ipv4/xfrm4_policy.c
@@ -185,7 +185,7 @@ _decode_session4(struct sk_buff *skb, st
 	fl->fl4_tos = iph->tos;
 }
 
-static inline int xfrm4_garbage_collect(void)
+static inline int xfrm4_garbage_collect(struct dst_ops *ops)
 {
 	xfrm4_policy_afinfo.garbage_collect();
 	return (atomic_read(&xfrm4_dst_ops.entries) > xfrm4_dst_ops.gc_thresh*2);
Index: net-2.6.25-misc/net/ipv6/route.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv6/route.c
+++ net-2.6.25-misc/net/ipv6/route.c
@@ -79,7 +79,7 @@ static struct dst_entry *ip6_negative_ad
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
 				       struct net_device *dev, int how);
-static int		 ip6_dst_gc(void);
+static int		 ip6_dst_gc(struct dst_ops *ops);
 
 static int		ip6_pkt_discard(struct sk_buff *skb);
 static int		ip6_pkt_discard_out(struct sk_buff *skb);
@@ -978,7 +978,7 @@ int ndisc_dst_gc(int *more)
 	return freed;
 }
 
-static int ip6_dst_gc(void)
+static int ip6_dst_gc(struct dst_ops *ops)
 {
 	static unsigned expire = 30*HZ;
 	static unsigned long last_gc;
Index: net-2.6.25-misc/net/ipv6/xfrm6_policy.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv6/xfrm6_policy.c
+++ net-2.6.25-misc/net/ipv6/xfrm6_policy.c
@@ -212,7 +212,7 @@ _decode_session6(struct sk_buff *skb, st
 	}
 }
 
-static inline int xfrm6_garbage_collect(void)
+static inline int xfrm6_garbage_collect(struct dst_ops *ops)
 {
 	xfrm6_policy_afinfo.garbage_collect();
 	return (atomic_read(&xfrm6_dst_ops.entries) > xfrm6_dst_ops.gc_thresh*2);

-- 

^ permalink raw reply

* [patch 0/2][NETNS][DST] pass dst_ops to gc functions and add a netns pointer in it
From: Daniel Lezcano @ 2008-01-16 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, den, benjamin.thery

The two following patches do trivial changes to facilitate the
implementation of the network namespace in some protocols.

The first one pass the dst_ops as parameter to the gc functions.
The second patch just adds a netns pointer field into the dst_ops
structure.

-- 

^ permalink raw reply

* Re: [PATCH] net: NEWEMAC: Remove "rgmii-interface" from rgmii matching table
From: Josh Boyer @ 2008-01-16 15:01 UTC (permalink / raw)
  To: benh; +Cc: Stefan Roese, linuxppc-dev, netdev
In-Reply-To: <1200477239.6755.21.camel@pasglop>

On Wed, 16 Jan 2008 20:53:59 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> 
> On Wed, 2008-01-16 at 10:37 +0100, Stefan Roese wrote:
> > With the removal the the "rgmii-interface" device_type property from the
> > dts files, the newemac driver needs an update to only rely on compatible
> > property.
> > 
> > Signed-off-by: Stefan Roese <sr@denx.de>
> 
> I need to test if it works on CAB, can't change the DT on those. I'll
> let you know tomorrow.

This should be fine on CAB.  The rgmii node has:

compatible = "ibm,rgmii-axon", "ibm,rgmii"

so the match should still catch on the latter.

josh

^ permalink raw reply

* Re: [RFC][PATCH] Fixing SA/SP dumps on netlink/af_key
From: Timo Teräs @ 2008-01-16 14:28 UTC (permalink / raw)
  To: hadi; +Cc: netdev
In-Reply-To: <1200491531.4457.91.camel@localhost>

jamal wrote:
> On Sun, 2008-13-01 at 14:26 +0200, Timo Teräs wrote:
>> I tried to address all these problems, and added the xfrm_policy
>> and xfrm_state structure to a new linked list that is used solely
>> for dumping. This way the dumps can be done in O(n) and have an
>> iterator point to them. I also modified to af_key to stop dumping
>> when receive queue is filling up and continue it from a callback at
>> recvfrom() when there's more room.
>>
>> But I'd like to hear about:
>> - if the approach is ok (adding the entries in one more list)?
>> - if the new/modified variables, structures and function names are ok?
>> - if I should port these against net-2.6 or net-2.6.25 git tree?
>> - if I should split this to more than one commit? (maybe xfrm core,
>> xfrm user and af_key parts?)
> 
> To answer your last 2 questions:
> There are two issues that are inter-mingled in there. The most
> important is pf_key not being robust on dump. The other being the
> accurracy of netlink dumping - this has much broader implications. I
> think you need to separate the patches into those two at least and
> prioritize on the pf_key as a patch that you push in. I would also
> try to make the pf_key dumping approach to mirror what netlink does
> today - in the worst case it would be as innacurate as netlink; which
> is not bad. I think youll find no resistance getting such a patch in.

Creating a separate af_key patch would not be a big problem. I was
just hoping avoiding it as the xfrm_state / xfrm_policy changes
modify the API and requires changing af_key also.

> To answer your first 2 questions:
> 
> My main dilema with your approach is the policy of maintaining such a
>  list in the kernel and memory consumption needed vs the innacuracy 
> of netlink dumps.With your approach of maintaining extra SA/P D, i
> would need double the RAM amount.

No. I'm not creating second copies of the SADB/SPD entries. The entries
are just added to one more list. Memory requirements go up on per entry
(figures based on 2.6.22.9 kernel):
sizeof(struct xfrm_policy) 636 -> 644 bytes
sizeof(struct xfrm_state)  452 -> 460 bytes

For 400K entries it would take about 3 megs of more memory. Where the
total database size is around 240-250 megs.

> Netlink dumping gives up accuracy[1], uses 50% of the memory you are 
> proposing but abuses more cpu cycles. User space could maintain the
> list you are proposing instead - and compensate for the innaccuracies
> by watching events[2] Of course that shifts the accuracy to events
> which could be lost on their way to user space. This issue is
> alleviated a little more with your approach of keeping the list in
> the kernel and adding new updates to the tail of the dump list
> (which, btw, your patch didnt seem to do); however, even if you solve
> that: ** you are still will be faced with challenges of not being
> atomic on updates; example an entry already on your dump list could
> be deleted before being read by user space. I cant see you solving
> that without abusing a _lot_ more cpu (trying to find the entry on
> the dump list that may have gotten updated or deleted). Theres also
> the issue of how long to keep the dump list before aging it out and
> how to deal with multiple users.

If more entries are added, you can get notifications of them.
 
Cheers,
  Timo


^ permalink raw reply

* Re: [RFC][PATCH] Fixing SA/SP dumps on netlink/af_key
From: jamal @ 2008-01-16 13:52 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <478A038B.4090900@iki.fi>

Timo,

Thanks for the effort you are putting on this.
I would just speak to the concepts instead of the code - hopefully that
would simulate a discussion (and shorten my email)

On Sun, 2008-13-01 at 14:26 +0200, Timo Teräs wrote:
> Hi all,
> 
> The problem with IPsec SP/SA dumping is that they don't work well with large SPD/SADB:s. 
> In netlink the dumping code is O(n^2) and can miss some entries/dump duplicates if the DB 
> is changed between the recv() calls to read the dump entries. This is due to the entry 
> index counting done in xfrm_user code. With af_key the dump entries are just dropped 
> when the socket receive queue is filled.

nod to the above

> I tried to address all these problems, and added the xfrm_policy and xfrm_state structure 
> to a new linked list that is used solely for dumping. This way the dumps can be done in O(n) 
> and have an iterator point to them. I also modified to af_key to stop dumping when receive 
> queue is filling up and continue it from a callback at recvfrom() when there's more room.
> 
> This patch is against 2.6.23 vanilla. I wanted to get some feedback before I make it against 
> the git tree.
> 
> But I'd like to hear about:
> - if the approach is ok (adding the entries in one more list)?
> - if the new/modified variables, structures and function names are ok?
> - if I should port these against net-2.6 or net-2.6.25 git tree?
> - if I should split this to more than one commit? (maybe xfrm core, xfrm user and af_key parts?)
> 

To answer your last 2 questions:
There are two issues that are inter-mingled in there. The most important
is pf_key not being robust on dump. The other being the accurracy of
netlink dumping - this has much broader implications. I think you need
to separate the patches into those two at least and prioritize on the
pf_key as a patch that you push in. I would also try to make the pf_key
dumping approach to mirror what netlink does today - in the worst case
it would be as innacurate as netlink; which is not bad. I think youll
find no resistance getting such a patch in.

To answer your first 2 questions:

My main dilema with your approach is the policy of maintaining such a
list in the kernel and memory consumption needed vs the innacuracy of
netlink dumps. I have run a system with >400K SPD entries and >100K SAD
entries on a running system and i needed a few Gig of RAM to just make
sure my other apps dont get hit by oom at some point or other.  With
your approach of maintaining extra SA/P D, i would need double the RAM
amount. RAM is relatively cheap these days, but latency involved is not
(and one could argue that CPU cycles are much much cheaper).

Netlink dumping gives up accuracy[1], uses 50% of the memory you are
proposing but abuses more cpu cycles. User space could maintain the list
you are proposing instead - and compensate for the innaccuracies by
watching events[2]
Of course that shifts the accuracy to events which could be lost on
their way to user space. This issue is alleviated a little more with
your approach of keeping the list in the kernel and adding new updates
to the tail of the dump list (which, btw, your patch didnt seem to do);
however, even if you solve that: 
** you are still will be faced with challenges of not being atomic on
updates; example an entry already on your dump list could be deleted
before being read by user space. I cant see you solving that without
abusing a _lot_ more cpu (trying to find the entry on the dump list that
may have gotten updated or deleted). Theres also the issue of how long
to keep the dump list before aging it out and how to deal with multiple
users.

Ok, lets say you throw in the computation in there to walk the dump list
and find the proper entry and find out it only consumes 62% cpu of what
current netlink dump approach does, you are still using twice the RAM
needs. And hence it becomes a policy choice and what would be needed is 
some knob for me to select "i dont care about abusing more memory; i
want more accuracy dammit". I can tell you based on my current
experience i would rather use less RAM - but thats a matter of choice at
the moment because i havent come across scenarios of the conflicts where
user space wasnt able to resolve. 

I apologize for the long email, it would have been longer if i commented
on the code.

I hope you find my comments useful and dont see them as rocks being
thrown at you - rather look at them as worth at least 2-cents Canadian
and will stimulate other people speak out.

cheers,
jamal

[1] In cases where a dump happens to not complete while SAD/SPD are
getting concurently updated, it is feasible that a dump list will have
innacurate details.

[2] Racoon for example doesnt do this, hence it needs to dump from the
kernel when purging SAs.  

^ permalink raw reply

* Compiling as a module using the 2.6.25 git tree
From: Mark Ryden @ 2008-01-16 13:32 UTC (permalink / raw)
  To: netdev

Hello,

A question about compiling as module using the 2.6.25 git tree:
I had git cloned the 2.6.25 DaveM tree.
I ran "make menuconfig".
In many cases I see in the help:
"To compile this code as a module, choose M here: the
 module will be called ..."
For example, in "Packet Generator" or in 802.1d Etherent Bridging
(CONFIG_BRIDGE).

However, I cannot choose 'M' here ; the toggle is only between
building into the kernel ('*')
or not.

BRIDGE depends (according to the help) only on CONFIG_NET and I have
CONFIG_NET selected.
more .config | grep  CONFIG_INET
CONFIG_INET=y

Also llc is configured: (though in fact bridge selects llc).
more .config | grep LLC
CONFIG_LLC=y
CONFIG_LLC2=y

And in the Kconfig file we have:
config BRIDGE
        tristate "802.1d Ethernet Bridging"

When I try "make menuconfig" on the kernel of a distro (like FC8) I
**can** choose
bridge as a module. (I see it as 'M').

Any ideas why? Is is something due to using the git tree?

Mark

^ permalink raw reply

* Re: [PATCH 3/4] bonding: Fix work rearming
From: Jarek Poplawski @ 2008-01-16 13:36 UTC (permalink / raw)
  To: Makito SHIOKAWA; +Cc: netdev
In-Reply-To: <478D77D7.60703@miraclelinux.com>

On Wed, Jan 16, 2008 at 12:19:51PM +0900, Makito SHIOKAWA wrote:
> This patch is supposing a case that bond_mii_monitor() is invoked in 
> bond_open(), and after that, 0 is set to miimon via sysfs (see same place 
> on other monitors).
> Though message in bonding_store_miimon() says miimon value 1-INT_MAX 
> rejected, but it looks like 0 can be accepted and monitor must be stopped 
> in that case.

But, since during this change from sysfs cancel_delayed_work_sync()
could be probably used, and it's rather efficient with killing
rearming works, it seems this check could be unnecessary yet.

Jarek P.

^ permalink raw reply

* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: jamal @ 2008-01-16 12:45 UTC (permalink / raw)
  To: mahatma; +Cc: netdev
In-Reply-To: <478BE021.1070408@bspu.unibel.by>

On Mon, 2008-14-01 at 20:20 -0200, Dzianis Kahanovich wrote:
> jamal wrote:
[..] 

> > Did that make sense?
>
> After current "#endif" - may be.

I am afraid that would be counter to expected behavior. 
Default is meant to apply when no value has been defined. Mark of 0 for
example doesnt mean "default". Let me demonstrate with the ifdefs again
with some arbitrary example:

-----------------
#ifdef CONFIG_NET_CLS_ACT
..classify ...
    .. action 1 sets mark to 0x11111
    .. action 2 checks some state and conditionally let action 3 execute
    .. action 3 sets mark to 0

if OK is returned set tc_index based on classid

#else // no actions compiled
..classify
.... jamal suggests: set default mark and tc_index for ingress here
#endif

// mahatma wants to set default for mark and tcindex here 
// so it works for both actions and none-action code
------------------------

Lets look at the case of actions compiled in:
I have defined my policies (in user space) so that the mark can be set
to either 0 or 0x1111 depending on some runtime state. 
Your default (kernel) code is now going to overide my policy - which is
bad. Even in the case of OK being returned, it is wrong to set tc_index;
unfortunately, we dont have an action that can set tc_index today; if we
did, we would need to remove that setting.

You other intent was to set the value of mark based on the value of
classid. You _can do that today already_ with no changes via a policy in
user space. You suggested to do an ifdef so you wont have to type in the
line which says how to mark, and i said that was a bad idea (we need
less ifdefs not more). 

For the case of no actions compiled in:
nothing can write into the values of either tcindex or mark after
classification (on ingress), so it is ok to override. If you did this
for egress as well, that would be wrong because it is expected that some
qdiscs may set or utilize these metadatum.

I am not sure if it made more sense this time?

> What "result" are with:
> 1) no filters?
> 2) 1 filter only, with "action continue"?

Please refer to above verbosity and see if it all makes better sense.

cheers,
jamal

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox