Netdev List

Netdev List
 help / color / mirror / Atom feed

* [patch v3] fix stack overflow in pktgen_if_write()
From: Dan Carpenter @ 2010-10-28  6:05 UTC (permalink / raw)
  To: Nelson Elhage
  Cc: Eric Dumazet, David S. Miller, Robert Olsson, Andy Shevchenko,
	netdev
In-Reply-To: <20101027230657.GT16803@ksplice.com>

Nelson Elhage says he was able to oops both amd64 and i386 test 
machines with 8k writes to the pktgen file.  Let's just allocate the
buffer on the heap instead of on the stack.

This can only be triggered by root so there are no security issues here.

Reported-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Dan Carpenter <error27@gmail.com>
---
v3:  just use kmalloc()

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 2c0df0f..c8d3620 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -887,12 +887,17 @@ static ssize_t pktgen_if_write(struct file *file,
 	i += len;
 
 	if (debug) {
-		char tb[count + 1];
+		char *tb;
+
+		tb = kmalloc(count + 1, GFP_KERNEL);
+		if (!tb)
+			return -ENOMEM;
 		if (copy_from_user(tb, user_buffer, count))
 			return -EFAULT;
 		tb[count] = 0;
 		printk(KERN_DEBUG "pktgen: %s,%lu  buffer -:%s:-\n", name,
 		       (unsigned long)count, tb);
+		kfree(tb);
 	}
 
 	if (!strcmp(name, "min_pkt_size")) {

^ permalink raw reply related

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-10-28  6:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty
In-Reply-To: <20101028055054.GG5599@redhat.com>

> "Michael S. Tsirkin" <mst@redhat.com>

> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > For external host, there is a contention among different
> > queues (vhosts) when packets are processed in tun/bridge,
> > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > my testing shows a small improvement (1 to 1.5% average)
> > in BW and a rise in SD (between 10-15%).  For remote host,
> > I think tun/macvtap needs MQ TX support?
>
> Confused. I thought this *is* with a multiqueue tun/macvtap?
> bridge does not do any queueing AFAIK ...
> I think we need to fix the contention. With migration what was guest to
> host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?

Thanks,

- KK


^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-10-28  6:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev,
	netdev-owner, rusty
In-Reply-To: <OF03CB03BB.27A87B85-ON652577CA.002069A5-652577CA.0021C2E2@in.ibm.com>

On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com>
> 
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > For external host, there is a contention among different
> > > queues (vhosts) when packets are processed in tun/bridge,
> > > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > > my testing shows a small improvement (1 to 1.5% average)
> > > in BW and a rise in SD (between 10-15%).  For remote host,
> > > I think tun/macvtap needs MQ TX support?
> >
> > Confused. I thought this *is* with a multiqueue tun/macvtap?
> > bridge does not do any queueing AFAIK ...
> > I think we need to fix the contention. With migration what was guest to
> > host a minute ago might become guest to external now ...
> 
> Macvtap RX is MQ but not TX. I don't think MQ TX support is
> required for macvtap, though. Is it enough for existing
> macvtap sendmsg to work, since it calls dev_queue_xmit
> which selects the txq for the outgoing device?
> 
> Thanks,
> 
> - KK

I think there would be an issue with using a single poll notifier and
contention on send buffer atomic variable.
Is tun different than macvtap? We need to support both long term ...

-- 
MST

^ permalink raw reply

* Re: [PATCH] ipv6: addrconf: don't remove address state on ifdown if the address is being kept
From: Lorenzo Colitti @ 2010-10-28  6:45 UTC (permalink / raw)
  To: netdev; +Cc: Maciej Żenczykowski, David Miller, Brian Haley
In-Reply-To: <AANLkTimyqFoUvjQM_qLS2YeVmJQ_6-LJc6KfLVE-mzH3@mail.gmail.com>

On Wed, Oct 27, 2010 at 9:16 PM, Lorenzo Colitti <lorenzo@google.com> wrote:
> Fix it so that none of this state is updated if the address is being kept on the interface.

Or, since the logic of the patched code is a little hard to follow,
here's an alternative patch against 2.6.36. I think it does exactly
the same as the previous patch, but it moves things around to make the
result more readable.

Tested: Added a statically configured IPv6 address to an interface,
started ping, brought link down, brought link up again. When link came
up ping kept on going and "ip -6 maddr" showed that the host was still
subscribed to there

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>

--- 2.6.36-orig/net/ipv6/addrconf.c	2010-10-20 13:30:22.000000000 -0700
+++ linux-2.6.36/net/ipv6/addrconf.c	2010-10-27 23:26:57.000000000 -0700
@@ -2737,10 +2737,6 @@ static int addrconf_ifdown(struct net_de
 			/* Flag it for later restoration when link comes up */
 			ifa->flags |= IFA_F_TENTATIVE;
 			ifa->state = INET6_IFADDR_STATE_DAD;
-
-			write_unlock_bh(&idev->lock);
-
-			in6_ifa_hold(ifa);
 		} else {
 			list_del(&ifa->if_list);

@@ -2755,19 +2751,15 @@ static int addrconf_ifdown(struct net_de
 			ifa->state = INET6_IFADDR_STATE_DEAD;
 			spin_unlock_bh(&ifa->state_lock);

-			if (state == INET6_IFADDR_STATE_DEAD)
-				goto put_ifa;
+			if (state == INET6_IFADDR_STATE_DEAD) {
+				in6_ifa_put(ifa);
+			} else {
+				__ipv6_ifa_notify(RTM_DELADDR, ifa);
+				atomic_notifier_call_chain(&inet6addr_chain,
+							   NETDEV_DOWN, ifa);
+			}
+			write_lock_bh(&idev->lock);
 		}
-
-		__ipv6_ifa_notify(RTM_DELADDR, ifa);
-		if (ifa->state == INET6_IFADDR_STATE_DEAD)
-			atomic_notifier_call_chain(&inet6addr_chain,
-						   NETDEV_DOWN, ifa);
-
-put_ifa:
-		in6_ifa_put(ifa);
-
-		write_lock_bh(&idev->lock);
 	}

 	list_splice(&keep_list, &idev->addr_list);

^ permalink raw reply

* About the Davicom PHY in drivers/net/phy in Linux kernel
From: macpaul @ 2010-10-28  6:33 UTC (permalink / raw)
  To: netdev; +Cc: afleming, jeff, f.rodo, joseph_chang

Hi all,

According to the source code of Davicom PHY in Linux,
We should do "ISOLATE" command to PHY before setting "auto negotiation" or "MII/RMII".

However, I've found that with Faraday's MAC/GMAC controller (ftmac100/ftgmac100), setting ISOLATE for multiple PHY configuration will lead MDC become stop because the flaw inside the MAC controller. Faraday's MAC/GMAC will leverage the TX_CLK as the MDC source. When FTMAC100 send "ISOLATE" to Davicom's PHY, the TX_CLK send out from PHY will be stopped, then MDC will also become stop.

However, this mail wasn't meant to discuss about the design flaw inside the IC.

We've done two test to the following codes.

1st: if we just skip the " BMCR_ISOLATE" setting command, we will get PHY sometimes become unstable, for example, could not do DHCP request successfully.

2nd: if we replace "BMCR_ISOLATE" to "BMCR_RESET", we could get rid of the problem occurred by Faraday GMAC. And the PHY works well still.

I've found that in some other PHY implementation, for example, in "marvell.c", there are seems no ISOLATE commands. There are only RESET commands.
If we could replace BMCR_ISOLATE to BMCR_RESET in current kernel source, will there be any unpredictable behavior happened?

Please give us suggestion according to your experiences.
Thanks a lot.

+static int dm9161_config_aneg(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE);
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best regards,
Macpaul Lin

^ permalink raw reply

* Re: [PATCH] cxgb3: fix crash due to manipulating queues before registration
From: Divy Le Ray @ 2010-10-28  6:35 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: eric.dumazet, sonnyrao, Dimitrios Michailidis, netdev,
	linux-kernel
In-Reply-To: <1288242390-28574-1-git-send-email-nacc@us.ibm.com>

On 10/27/2010 10:06 PM, Nishanth Aravamudan wrote:
> Hi Eric,
>
> Something like the following?:
>
> Thanks,
> Nish
>
>
> Along the same lines as "cxgb4: fix crash due to manipulating queues
> before registration" (8f6d9f40476895571df039b6f1f5230ec7faebad), before
> commit "net: allocate tx queues in register_netdevice"
> netif_tx_stop_all_queues and related functions could be used between
> device allocation and registration but now only after registration.
> cxgb4 has such a call before registration and crashes now.  Move it
> after register_netdev.
>
> Signed-off-by: Nishanth Aravamudan<nacc@us.ibm.com>

Acked-by: Divy Le Ray <divy@chelsio.com>

> Cc: eric.dumazet@gmail.com
> Cc: sonnyrao@us.ibm.com
> Cc: Divy Le Ray<divy@chelsio.com>
> Cc: Dimitris Michailidis<dm@chelsio.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>   drivers/net/cxgb3/cxgb3_main.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
> index 4e3c123..96c70a5 100644
> --- a/drivers/net/cxgb3/cxgb3_main.c
> +++ b/drivers/net/cxgb3/cxgb3_main.c
> @@ -3301,7 +3301,6 @@ static int __devinit init_one(struct pci_dev *pdev,
>   		pi->rx_offload = T3_RX_CSUM | T3_LRO;
>   		pi->port_id = i;
>   		netif_carrier_off(netdev);
> -		netif_tx_stop_all_queues(netdev);
>   		netdev->irq = pdev->irq;
>   		netdev->mem_start = mmio_start;
>   		netdev->mem_end = mmio_start + mmio_len - 1;
> @@ -3342,6 +3341,7 @@ static int __devinit init_one(struct pci_dev *pdev,
>   				adapter->name = adapter->port[i]->name;
>
>   			__set_bit(i,&adapter->registered_device_map);
> +			 netif_tx_stop_all_queues(adapter->port[i]);
>   		}
>   	}
>   	if (!adapter->registered_device_map) {



^ permalink raw reply

* Re: 2.6.35->2.6.36 regression, vanilla kernel panic, ppp or hrtimers crashing
From: Jarek Poplawski @ 2010-10-28  7:05 UTC (permalink / raw)
  To: Denys Fedoryshchenko
  Cc: Thomas Gleixner, Paul Mackerras, linux-kernel, netdev
In-Reply-To: <201010251222.37191.nuclearcat@nuclearcat.com>

On 2010-10-25 11:22, Denys Fedoryshchenko wrote:
> Hi
> 
> Here is what i got from netconsole
>  [  259.238755] BUG: unable to handle kernel 
>  paging request
>  at f8ba001c
>  [  259.238953] IP:
>  [<c0199ebe>] do_select+0x2cc/0x502
...
> It is not easy to do full git bisect(it is semi-embedded distro), but i can 
> try reversing particular commits, if someone can give idea which one, and can 
> try debug patches.

Hi,
Nothing concrete, but you might try reverting this one:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.36.y.git;a=commitdiff;h=15fd0cd9a2ad24a78fbee369dec8ca660979d57e

Jarek P.

^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-10-28  7:18 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, Michael S. Tsirkin,
	netdev, rusty
In-Reply-To: <OFC29C4491.59069AD1-ON652577CA.00170F0D-652577CA.001C76C8@LocalDomain>

> Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
>
> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
>
> I will get a test run with and without binding and post the
> results later today.

Correction: The result with binding is is much better for
SD/CPU compared to without-binding:

_____________________________________________________
     numtxqs=8,vhosts=5, Bind vs No-bind
#     BW%     CPU%     RCPU%     SD%       RSD%
_____________________________________________________
1     11.25     10.77    1.89     0        -6.06
2     18.66     7.20     7.20    -14.28    -7.40
4     4.24     -1.27     1.56    -2.70     -.98
8     14.91    -3.79     5.46    -12.19    -3.76
16    12.32    -8.67     4.63    -35.97    -26.66
24    11.68    -7.83     5.10    -40.73    -32.37
32    13.09    -10.51    6.57    -51.52    -42.28
40    11.04    -4.12     11.23   -50.69    -42.81
48    8.61     -10.30    6.04    -62.38    -55.54
64    7.55     -6.05     6.41    -61.20    -56.04
80    8.74     -11.45    6.29    -72.65    -67.17
96    9.84     -6.01     9.87    -69.89    -64.78
128   5.57     -6.23     8.99    -75.03    -70.97
_____________________________________________________
BW: 10.4%,  CPU/RCPU: -7.4%,7.7%,  SD: -70.5%,-65.7%

Notes:
    1.  All my test results earlier was binding vhost
        to cpus 0-3 for both org and new kernel.
    2.  I am not using MST's use_mq patch, only mainline
        kernel. However, I reported earlier that I got
        better results with that patch. The result for
        MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK


^ permalink raw reply

* Re: [Bugme-new] [Bug 20772] New: ip= nfsaddrs= stopped working
From: Simon Horman @ 2010-10-27 22:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, bugzilla-daemon, bugme-daemon, mpetersen
In-Reply-To: <20101019120938.93622e82.akpm@linux-foundation.org>

On Tue, Oct 19, 2010 at 12:09:38PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Tue, 19 Oct 2010 18:20:15 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=20772
> > 
> >            Summary: ip= nfsaddrs= stopped working
> >            Product: Networking
> >            Version: 2.5
> >     Kernel Version: 2.6.33, 2.6.32, 2.6.26
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: high
> >           Priority: P1
> >          Component: IPV4
> >         AssignedTo: shemminger@linux-foundation.org
> >         ReportedBy: mpetersen@peak6.com
> >         Regression: Yes
> > 
> > 
> > I can no longer set a static IP address using either ip= or nfsaddr=,
> > ie ip=192.168.0.42:192.168.0.69:255.255.255.0:testhost:eth0:none has no
> > effect, and DHCP is still attempted.
> > 
> > Even if I undefine IP_PNP_DHCP IP_PNP_BOOTP and IP_PNP_RARP in the
> > config (but leave IP_PNP,) the kernel seems to ignore the static IP
> > settings and try to get a an IP address with DHCP somehow.

Hi,

I believe that the problem is that there is a minor syntax error in the
configuration parameter. Specifically there should be two colons between
192.168.0.69 and 255.255.255.0 or between 192.168.0.42 and 192.168.0.69.

As it stands the settings are:

	my address	192.168.0.42
	server address	192.168.0.69
	gateway		255.255.255.0
	netmask		testhost
	hostname	eth0
	interface	none

I believe it is the interface=none that is killing any IP configuration
by the kernel as no such interface exists.

^ permalink raw reply

* RE: About the Davicom PHY in drivers/net/phy in Linux kernel
From: Joseph Chang @ 2010-10-28  7:59 UTC (permalink / raw)
  To: macpaul, netdev; +Cc: afleming, jeff, f.rodo
In-Reply-To: <50FD90C65C53FB45BADEEBCD84FF07F2029CB176@ATCPCS06.andestech.com>

Dear Mac Paul,

For you are using DAVICOM PHY in this case.

Our suggestion, please check below red comment text:

 * == > Would you tell us your:
	CPU = ?
	Linux Kernel version= ?
      I will like to download the same source code from LXR.
      And can check more detail for you. 

+static int dm9161_config_aneg(struct phy_device *phydev) {
+	int err;
+
+	/* Isolate the PHY */
     err = 0;
+	//err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
    err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE | 0x200 );
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev) {
+	int err;
+
+	/* Isolate the PHY */
     err = 0;
+	//err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE | 0x200 );
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best Regards,
Joseph CHANG 
System Application Engineering Division 
Davicom Semiconductor, Inc. 
No. 6 Li-Hsin Rd. VI, Science-Based Park, 
Hsin-Chu, Taiwan. 
Tel: 886-3-5798797 Ex 8534
Fax: 886-3-5646929 
Web: http://www.davicom.com.tw 

-----Original Message-----
From: macpaul@andestech.com [mailto:macpaul@andestech.com] 
Sent: Thursday, October 28, 2010 2:34 PM
To: netdev@vger.kernel.org
Cc: afleming@freescale.com; jeff@garzik.org; f.rodo@til-technologies.fr; joseph_chang@mail.davicom.com.tw
Subject: About the Davicom PHY in drivers/net/phy in Linux kernel

Hi all,

According to the source code of Davicom PHY in Linux,
We should do "ISOLATE" command to PHY before setting "auto negotiation" or "MII/RMII".

However, I've found that with Faraday's MAC/GMAC controller (ftmac100/ftgmac100), setting ISOLATE for multiple PHY configuration will lead MDC become stop because the flaw inside the MAC controller. Faraday's MAC/GMAC will leverage the TX_CLK as the MDC source. When FTMAC100 send "ISOLATE" to Davicom's PHY, the TX_CLK send out from PHY will be stopped, then MDC will also become stop.

However, this mail wasn't meant to discuss about the design flaw inside the IC.

We've done two test to the following codes.

1st: if we just skip the " BMCR_ISOLATE" setting command, we will get PHY sometimes become unstable, for example, could not do DHCP request successfully.

2nd: if we replace "BMCR_ISOLATE" to "BMCR_RESET", we could get rid of the problem occurred by Faraday GMAC. And the PHY works well still.

I've found that in some other PHY implementation, for example, in "marvell.c", there are seems no ISOLATE commands. There are only RESET commands.
If we could replace BMCR_ISOLATE to BMCR_RESET in current kernel source, will there be any unpredictable behavior happened?

Please give us suggestion according to your experiences.
Thanks a lot.

+static int dm9161_config_aneg(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Configure the new settings */
+	err = genphy_config_aneg(phydev);
+
+	if (err < 0)
+		return err;
+
+	return 0;
+}
+
+static int dm9161_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	/* Isolate the PHY */
+	err = phy_write(phydev, MII_BMCR, BMCR_ISOLATE);
+
+	if (err < 0)
+		return err;
+
+	/* Do not bypass the scrambler/descrambler */
+	err = phy_write(phydev, MII_DM9161_SCR, MII_DM9161_SCR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Clear 10BTCSR to default */
+	err = phy_write(phydev, MII_DM9161_10BTCSR, MII_DM9161_10BTCSR_INIT);
+
+	if (err < 0)
+		return err;
+
+	/* Reconnect the PHY, and enable Autonegotiation */
+	err = phy_write(phydev, MII_BMCR, BMCR_ANENABLE);
+
+	if (err < 0)
+		return err;
+
+	return 0;

Best regards,
Macpaul Lin

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

^ permalink raw reply

* Re: IPV6 raw socket denies bind(2)
From: Rémi Denis-Courmont @ 2010-10-28  8:25 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: netdev
In-Reply-To: <alpine.LNX.2.01.1010280000260.7820@obet.zrqbmnf.qr>


On Thu, 28 Oct 2010 00:01:57 +0200 (CEST), Jan Engelhardt
<jengelh@medozas.de> wrote:
> #include <sys/socket.h>
> #include <stdio.h>
> #include <string.h>
> #include <netinet/udp.h>
> #include <netinet/in.h>
> #include <netinet/ip6.h>
> #include <arpa/inet.h>
> #include <stdlib.h>
> 
> int main(void)
> {
> 	struct sockaddr_in6 src = {};
> 	int sk;
> 
> 	sk = socket(AF_INET6, SOCK_RAW, IPPROTO_UDP);
> 	memset(&src, 0, sizeof(src));
> 	inet_pton(AF_INET6, "::1", &src);

That should be:
inet_pton(AF_INET6, "::1", &src.sin6_addr);
No wonder it does not work.


-- 
Rémi Denis-Courmont
http://www.remlab.net
http://fi.linkedin.com/in/remidenis


^ permalink raw reply

* Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
From: Stefan Hajnoczi @ 2010-10-28  8:57 UTC (permalink / raw)
  To: Shirley Ma; +Cc: mst@redhat.com, David Miller, netdev, kvm, linux-kernel
In-Reply-To: <1288213557.17571.28.camel@localhost.localdomain>

On Wed, Oct 27, 2010 at 10:05 PM, Shirley Ma <mashirle@us.ibm.com> wrote:
> This patch changes vhost TX used buffer signal to guest from one by
> one to up to 3/4 of vring size. This change improves vhost TX message
> size from 256 to 8K performance for both bandwidth and CPU utilization
> without inducing any regression.

Any concerns about introducing latency or does the guest not care when
TX completions come in?

> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
>
>  drivers/vhost/net.c   |   19 ++++++++++++++++++-
>  drivers/vhost/vhost.c |   31 +++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |    3 +++
>  3 files changed, 52 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 4b4da5b..bd1ba71 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -198,7 +198,24 @@ static void handle_tx(struct vhost_net *net)
>                if (err != len)
>                        pr_debug("Truncated TX packet: "
>                                 " len %d != %zd\n", err, len);
> -               vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +               /*
> +                * if no pending buffer size allocate, signal used buffer
> +                * one by one, otherwise, signal used buffer when reaching
> +                * 3/4 ring size to reduce CPU utilization.
> +                */
> +               if (unlikely(vq->pend))
> +                       vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +               else {
> +                       vq->pend[vq->num_pend].id = head;

I don't understand the logic here: if !vq->pend then we assign to
vq->pend[vq->num_pend].

> +                       vq->pend[vq->num_pend].len = 0;
> +                       ++vq->num_pend;
> +                       if (vq->num_pend == (vq->num - (vq->num >> 2))) {
> +                               vhost_add_used_and_signal_n(&net->dev, vq,
> +                                                           vq->pend,
> +                                                           vq->num_pend);
> +                               vq->num_pend = 0;
> +                       }
> +               }
>                total_len += len;
>                if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>                        vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 94701ff..47696d2 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -170,6 +170,16 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>        vq->call_ctx = NULL;
>        vq->call = NULL;
>        vq->log_ctx = NULL;
> +       /* signal pending used buffers */
> +       if (vq->pend) {
> +               if (vq->num_pend != 0) {
> +                       vhost_add_used_and_signal_n(dev, vq, vq->pend,
> +                                                   vq->num_pend);
> +                       vq->num_pend = 0;
> +               }
> +               kfree(vq->pend);
> +       }
> +       vq->pend = NULL;
>  }
>
>  static int vhost_worker(void *data)
> @@ -273,7 +283,13 @@ long vhost_dev_init(struct vhost_dev *dev,
>                dev->vqs[i].heads = NULL;
>                dev->vqs[i].dev = dev;
>                mutex_init(&dev->vqs[i].mutex);
> +               dev->vqs[i].num_pend = 0;
> +               dev->vqs[i].pend = NULL;
>                vhost_vq_reset(dev, dev->vqs + i);
> +               /* signal 3/4 of ring size used buffers */
> +               dev->vqs[i].pend = kmalloc((dev->vqs[i].num -
> +                                          (dev->vqs[i].num >> 2)) *
> +                                          sizeof *vq->peed, GFP_KERNEL);

Has this patch been compile tested?  vq->peed?

Stefan

^ permalink raw reply

* Re: [RFC PATCH 0/1] vhost: Reduce TX used buffer signal for performance
From: Stefan Hajnoczi @ 2010-10-28  8:59 UTC (permalink / raw)
  To: Shirley Ma; +Cc: mst@redhat.com, David Miller, netdev, kvm, linux-kernel
In-Reply-To: <AANLkTinqG_G2pevoYJyiHiBda1ZZS-53Uqz9WEspbMOy@mail.gmail.com>

On Thu, Oct 28, 2010 at 9:57 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
Just read the patch 1/1 discussion and it looks like you're already on
it.  Sorry for the noise.

Stefan

^ permalink raw reply

* [PATCH] 8390: Don't oops on starting dev queue
From: Pavel Emelyanov @ 2010-10-28  9:01 UTC (permalink / raw)
  To: Paul Gortmaker; +Cc: David Miller, Linux Netdev List

The __NS8390_init tries to start the device queue before the
device is registered. This results in an oops (snipped):

[    2.865493] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[    2.866106] IP: [<ffffffffa000602a>] netif_start_queue+0xb/0x12 [8390]
[    2.881267] Call Trace:
[    2.881437]  [<ffffffffa000624d>] __NS8390_init+0x102/0x15a [8390]
[    2.881999]  [<ffffffffa00062ae>] NS8390_init+0x9/0xb [8390]
[    2.882237]  [<ffffffffa000d820>] ne2k_pci_init_one+0x297/0x354 [ne2k_pci]
[    2.882955]  [<ffffffff811c7a0e>] local_pci_probe+0x12/0x16
[    2.883308]  [<ffffffff811c85ad>] pci_device_probe+0xc3/0xef
[    2.884049]  [<ffffffff8129218d>] driver_probe_device+0xbe/0x14b
[    2.884937]  [<ffffffff81292260>] __driver_attach+0x46/0x62
[    2.885170]  [<ffffffff81291788>] bus_for_each_dev+0x49/0x78
[    2.885781]  [<ffffffff81291fbb>] driver_attach+0x1c/0x1e
[    2.886089]  [<ffffffff812912ab>] bus_add_driver+0xba/0x227
[    2.886330]  [<ffffffff8129259a>] driver_register+0x9e/0x115
[    2.886933]  [<ffffffff811c8815>] __pci_register_driver+0x50/0xac
[    2.887785]  [<ffffffffa001102c>] ne2k_pci_init+0x2c/0x2e [ne2k_pci]
[    2.888093]  [<ffffffff81000212>] do_one_initcall+0x7c/0x130
[    2.888693]  [<ffffffff8106d74f>] sys_init_module+0x99/0x1da
[    2.888946]  [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

This happens because the netif_start_queue sets respective bit on the dev->_tx
array which is not yet allocated.

As far as I understand the code removing the netif_start_queue from __NS8390_init
is OK, since queue will be started later on device open. Plz, correct me if I'm wrong.

Found in the Dave's current tree, so he's in Cc.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

diff --git a/drivers/net/lib8390.c b/drivers/net/lib8390.c
index 316bb70..e7030ce 100644
--- a/drivers/net/lib8390.c
+++ b/drivers/net/lib8390.c
@@ -1077,7 +1077,6 @@ static void __NS8390_init(struct net_device *dev, int startp)
 	ei_outb_p(ei_local->rx_start_page, e8390_base + EN1_CURPAG);
 	ei_outb_p(E8390_NODMA+E8390_PAGE0+E8390_STOP, e8390_base+E8390_CMD);
 
-	netif_start_queue(dev);
 	ei_local->tx1 = ei_local->tx2 = 0;
 	ei_local->txing = 0;
 


^ permalink raw reply related

* Re: [PATCH] conntrack: allow nf_ct_alloc_hashtable() to get highmem pages
From: Patrick McHardy @ 2010-10-28 10:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Netfilter Development Mailinglist, stable
In-Reply-To: <1288170555.2709.63.camel@edumazet-laptop>

On 27.10.2010 11:09, Eric Dumazet wrote:
> commit ea781f197d6a8 (use SLAB_DESTROY_BY_RCU and get rid of call_rcu())
> did a mistake in __vmalloc() call in nf_ct_alloc_hashtable().
> 
> I forgot to add __GFP_HIGHMEM, so pages were taken from LOWMEM only.

Applied, thanks. I'll also pass this to -stable.

^ permalink raw reply

* Does ip does support labels for IPv6 addresses?
From: Heiko Gerstung @ 2010-10-28 10:41 UTC (permalink / raw)
  To: netdev

Hi !

Not sure if this is where I should ask, please feel free to send me to
someone else.

I just tried to use "ip" (from the iproute2 project) to assign multiple
IPv6 addresses to a physical network interface. In order to be able to
distinguish between them, I wanted to use the "label" feature as
described on the ip man page, but it seems that this is not supported
for IPv6 addresses.

Does anybody know if this is a (documentation-)bug, a feature or a
misconfiguration of ~/brain ?

Example:

# ip addr add dev eth0 2001::23/64 label eth0:0

at least adds the address correctly. But when I want to check with
# ip addr show
the label is not listed anywhere and using
# ip addr show label "eth0:0" does not work either (does not come up
with any output at all).

The description of the "label" option on the man page does not mention
that this is not supported on IPv6.

Best Regards,
 Heiko

-- 

Heiko Gerstung

*MEINBERG Funkuhren* GmbH & Co. KG
Lange Wand 9
D-31812 Bad Pyrmont, Germany
Phone: +49 (0)5281 9309-25
Fax: +49 (0)5281 9309-30
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Email: heiko.gerstung@meinberg.de <mailto:heiko.gerstung@meinberg.de>
Web: www.meinberg.de <http://www.meinberg.de>

------------------------------------------------------------------------
*MEINBERG - Accurate Time Worldwide*

^ permalink raw reply

* [net-2.6 PATCH] ixgb: call pci_disable_device in ixgb_remove
From: Jeff Kirsher @ 2010-10-28 10:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Ben Hutchings, Emil Tantilov,
	Jeff Kirsher

From: Emil Tantilov <emil.s.tantilov@intel.com>

ixgb fails to work after reload on recent kernels:

rmmod ixgb (dev->current_state = PCI_UNKNOWN)
modprobe ixgb (pci_enable_device will bail leaving current_state to PCI_UNKNOWN)
ifup eth0
do_IRQ: 2.82 No irq handler for vector (irq -1)

The issue was exposed by commit fcd097f31a6ee207cc0c3da9cccd2a86d4334785
PCI: MSI: Remove unsafe and unnecessary hardware access

which avoids HW writes for power states != PCI_D0

CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgb/ixgb_main.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 666207a..caa8192 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -533,6 +533,7 @@ ixgb_remove(struct pci_dev *pdev)
 	pci_release_regions(pdev);
 
 	free_netdev(netdev);
+	pci_disable_device(pdev);
 }
 
 /**


^ permalink raw reply related

* [net-2.6 PATCH] igbvf: fix panic on load
From: Jeff Kirsher @ 2010-10-28 10:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Emil Tantilov, Greg Rose, Jeff Kirsher

From: Emil Tantilov <emil.s.tantilov@intel.com>

Introduced by commit:e6484930d7c73d324bccda7d43d131088da697b9
net: allocate tx queues in register_netdevice

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Acked-by: Greg Rose <greg.v.rose@intel.com>
Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/igbvf/netdev.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/igbvf/netdev.c b/drivers/net/igbvf/netdev.c
index ebfaa68..28af019 100644
--- a/drivers/net/igbvf/netdev.c
+++ b/drivers/net/igbvf/netdev.c
@@ -2783,15 +2783,15 @@ static int __devinit igbvf_probe(struct pci_dev *pdev,
 	/* reset the hardware with the new settings */
 	igbvf_reset(adapter);
 
-	/* tell the stack to leave us alone until igbvf_open() is called */
-	netif_carrier_off(netdev);
-	netif_stop_queue(netdev);
-
 	strcpy(netdev->name, "eth%d");
 	err = register_netdev(netdev);
 	if (err)
 		goto err_hw_init;
 
+	/* tell the stack to leave us alone until igbvf_open() is called */
+	netif_carrier_off(netdev);
+	netif_stop_queue(netdev);
+
 	igbvf_print_device_info(adapter);
 
 	igbvf_initialize_last_counter_stats(adapter);


^ permalink raw reply related

* [net-2.6 PATCH] e1000e: reset PHY after errors detected
From: Jeff Kirsher @ 2010-10-28 10:59 UTC (permalink / raw)
  To: davem
  Cc: netdev, gospo, bphilips, Jesse Brandeburg, Carolyn Wyborny,
	Jeff Kirsher

From: Carolyn Wyborny <carolyn.wyborny@intel.com>

Some errors can be induced in the PHY via environmental testing
(specifically extreme temperature changes and electro static
discharge testing), and in the case of the PHY hanging due to
this input, this detects the problem and resets to continue.
This issue only applies to 82574 silicon.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/e1000e/82571.c  |   38 ++++++++++++++++++++++++++++++++++++++
 drivers/net/e1000e/e1000.h  |    3 +++
 drivers/net/e1000e/netdev.c |   22 ++++++++++++++++++++++
 3 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/drivers/net/e1000e/82571.c b/drivers/net/e1000e/82571.c
index ca663f1..7236f1a 100644
--- a/drivers/net/e1000e/82571.c
+++ b/drivers/net/e1000e/82571.c
@@ -52,6 +52,10 @@
 			      (ID_LED_DEF1_DEF2))
 
 #define E1000_GCR_L1_ACT_WITHOUT_L0S_RX 0x08000000
+#define E1000_BASE1000T_STATUS          10
+#define E1000_IDLE_ERROR_COUNT_MASK     0xFF
+#define E1000_RECEIVE_ERROR_COUNTER     21
+#define E1000_RECEIVE_ERROR_MAX         0xFFFF
 
 #define E1000_NVM_INIT_CTRL2_MNGM 0x6000 /* Manageability Operation Mode mask */
 
@@ -1243,6 +1247,39 @@ static s32 e1000_led_on_82574(struct e1000_hw *hw)
 }
 
 /**
+ *  e1000_check_phy_82574 - check 82574 phy hung state
+ *  @hw: pointer to the HW structure
+ *
+ *  Returns whether phy is hung or not
+ **/
+bool e1000_check_phy_82574(struct e1000_hw *hw)
+{
+	u16 status_1kbt = 0;
+	u16 receive_errors = 0;
+	bool phy_hung = false;
+	s32 ret_val = 0;
+
+	/*
+	 * Read PHY Receive Error counter first, if its is max - all F's then
+	 * read the Base1000T status register If both are max then PHY is hung.
+	 */
+	ret_val = e1e_rphy(hw, E1000_RECEIVE_ERROR_COUNTER, &receive_errors);
+
+	if (ret_val)
+		goto out;
+	if (receive_errors == E1000_RECEIVE_ERROR_MAX)  {
+		ret_val = e1e_rphy(hw, E1000_BASE1000T_STATUS, &status_1kbt);
+		if (ret_val)
+			goto out;
+		if ((status_1kbt & E1000_IDLE_ERROR_COUNT_MASK) ==
+		    E1000_IDLE_ERROR_COUNT_MASK)
+			phy_hung = true;
+	}
+out:
+	return phy_hung;
+}
+
+/**
  *  e1000_setup_link_82571 - Setup flow control and link settings
  *  @hw: pointer to the HW structure
  *
@@ -1859,6 +1896,7 @@ struct e1000_info e1000_82574_info = {
 				  | FLAG_HAS_SMART_POWER_DOWN
 				  | FLAG_HAS_AMT
 				  | FLAG_HAS_CTRLEXT_ON_LOAD,
+	.flags2			  = FLAG2_CHECK_PHY_HANG,
 	.pba			= 36,
 	.max_hw_frame_size	= DEFAULT_JUMBO,
 	.get_variants		= e1000_get_variants_82571,
diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
index cee882d..fdc67fe 100644
--- a/drivers/net/e1000e/e1000.h
+++ b/drivers/net/e1000e/e1000.h
@@ -397,6 +397,7 @@ struct e1000_adapter {
 	struct work_struct print_hang_task;
 
 	bool idle_check;
+	int phy_hang_count;
 };
 
 struct e1000_info {
@@ -454,6 +455,7 @@ struct e1000_info {
 #define FLAG2_HAS_EEE                     (1 << 5)
 #define FLAG2_DMA_BURST                   (1 << 6)
 #define FLAG2_DISABLE_AIM                 (1 << 8)
+#define FLAG2_CHECK_PHY_HANG              (1 << 9)
 
 #define E1000_RX_DESC_PS(R, i)	    \
 	(&(((union e1000_rx_desc_packet_split *)((R).desc))[i]))
@@ -631,6 +633,7 @@ extern s32 e1000_get_phy_info_ife(struct e1000_hw *hw);
 extern s32 e1000_check_polarity_ife(struct e1000_hw *hw);
 extern s32 e1000_phy_force_speed_duplex_ife(struct e1000_hw *hw);
 extern s32 e1000_check_polarity_igp(struct e1000_hw *hw);
+extern bool e1000_check_phy_82574(struct e1000_hw *hw);
 
 static inline s32 e1000_phy_hw_reset(struct e1000_hw *hw)
 {
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index ec8cf3f..35a9dd4 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4098,6 +4098,25 @@ static void e1000e_enable_receives(struct e1000_adapter *adapter)
 	}
 }
 
+static void e1000e_check_82574_phy_workaround(struct e1000_adapter *adapter)
+{
+	struct e1000_hw *hw = &adapter->hw;
+
+	/*
+	 * With 82574 controllers, PHY needs to be checked periodically
+	 * for hung state and reset, if two calls return true
+	 */
+	if (e1000_check_phy_82574(hw))
+		adapter->phy_hang_count++;
+	else
+		adapter->phy_hang_count = 0;
+
+	if (adapter->phy_hang_count > 1) {
+		adapter->phy_hang_count = 0;
+		schedule_work(&adapter->reset_task);
+	}
+}
+
 /**
  * e1000_watchdog - Timer Call-back
  * @data: pointer to adapter cast into an unsigned long
@@ -4333,6 +4352,9 @@ link_up:
 	if (e1000e_get_laa_state_82571(hw))
 		e1000e_rar_set(hw, adapter->hw.mac.addr, 0);
 
+	if (adapter->flags2 & FLAG2_CHECK_PHY_HANG)
+		e1000e_check_82574_phy_workaround(adapter);
+
 	/* Reset the timer */
 	if (!test_bit(__E1000_DOWN, &adapter->state))
 		mod_timer(&adapter->watchdog_timer,


^ permalink raw reply related

* [net-2.6 PATCH] e1000e: Add check for reset flags before displaying reset message
From: Jeff Kirsher @ 2010-10-28 10:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, Carolyn Wyborny, Bruce Allan,
	Jeff Kirsher

From: Carolyn Wyborny <carolyn.wyborny@intel.com>

Some parts need to execute resets during normal operation.  This flag
check ensures that those parts reset without needlessly alarming the
user.  Other unexpected resets by other parts will dump debug info
and message the reset action to the user, as originally intended.

Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Acked-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/e1000e/netdev.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 35a9dd4..c4ca162 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4882,8 +4882,11 @@ static void e1000_reset_task(struct work_struct *work)
 	struct e1000_adapter *adapter;
 	adapter = container_of(work, struct e1000_adapter, reset_task);
 
-	e1000e_dump(adapter);
-	e_err("Reset adapter\n");
+	if (!((adapter->flags & FLAG_RX_NEEDS_RESTART) &&
+	      (adapter->flags & FLAG_RX_RESTART_NOW))) {
+		e1000e_dump(adapter);
+		e_err("Reset adapter\n");
+	}
 	e1000e_reinit_locked(adapter);
 }
 


^ permalink raw reply related

* [net-2.6 PATCH] ixgbe: DCB, fix TX hang occurring in stress condition with PFC
From: Jeff Kirsher @ 2010-10-28 10:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, bphilips, John Fastabend, Jeff Kirsher

From: John Fastabend <john.r.fastabend@intel.com>

The DCB credits refill quantum _must_ be greater than half the max
packet size. This is needed to guarantee that TX DMA operations
are not attempted during a pause state. Additionally, the min IFG
must be set correctly for DCB mode. If a DMA operation is
requested unexpectedly during the pause state the HW data
store may be corrupted leading to a DMA hang.  The DMA hang
requires a reset to correct. This fixes the HW configuration
to avoid this condition.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe_dcb.c       |   39 +++++++++++++++++++++++++++++++----
 drivers/net/ixgbe/ixgbe_dcb.h       |    5 ++--
 drivers/net/ixgbe/ixgbe_dcb_82599.c |    5 ++++
 drivers/net/ixgbe/ixgbe_dcb_82599.h |    3 +++
 drivers/net/ixgbe/ixgbe_main.c      |   12 +++++++++--
 5 files changed, 54 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_dcb.c b/drivers/net/ixgbe/ixgbe_dcb.c
index 8bb9ddb..0d44c64 100644
--- a/drivers/net/ixgbe/ixgbe_dcb.c
+++ b/drivers/net/ixgbe/ixgbe_dcb.c
@@ -43,9 +43,12 @@
  * ixgbe_dcb_check_config().
  */
 s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *dcb_config,
-                                   u8 direction)
+				   int max_frame, u8 direction)
 {
 	struct tc_bw_alloc *p;
+	int min_credit;
+	int min_multiplier;
+	int min_percent = 100;
 	s32 ret_val = 0;
 	/* Initialization values default for Tx settings */
 	u32 credit_refill       = 0;
@@ -59,6 +62,31 @@ s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *dcb_config,
 		goto out;
 	}
 
+	min_credit = ((max_frame / 2) + DCB_CREDIT_QUANTUM - 1) /
+			DCB_CREDIT_QUANTUM;
+
+	/* Find smallest link percentage */
+	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
+		p = &dcb_config->tc_config[i].path[direction];
+		bw_percent = dcb_config->bw_percentage[direction][p->bwg_id];
+		link_percentage = p->bwg_percent;
+
+		link_percentage = (link_percentage * bw_percent) / 100;
+
+		if (link_percentage && link_percentage < min_percent)
+			min_percent = link_percentage;
+	}
+
+	/*
+	 * The ratio between traffic classes will control the bandwidth
+	 * percentages seen on the wire. To calculate this ratio we use
+	 * a multiplier. It is required that the refill credits must be
+	 * larger than the max frame size so here we find the smallest
+	 * multiplier that will allow all bandwidth percentages to be
+	 * greater than the max frame size.
+	 */
+	min_multiplier = (min_credit / min_percent) + 1;
+
 	/* Find out the link percentage for each TC first */
 	for (i = 0; i < MAX_TRAFFIC_CLASS; i++) {
 		p = &dcb_config->tc_config[i].path[direction];
@@ -73,8 +101,9 @@ s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *dcb_config,
 		/* Save link_percentage for reference */
 		p->link_percent = (u8)link_percentage;
 
-		/* Calculate credit refill and save it */
-		credit_refill = link_percentage * MINIMUM_CREDIT_REFILL;
+		/* Calculate credit refill ratio using multiplier */
+		credit_refill = min(link_percentage * min_multiplier,
+				    MAX_CREDIT_REFILL);
 		p->data_credits_refill = (u16)credit_refill;
 
 		/* Calculate maximum credit for the TC */
@@ -85,8 +114,8 @@ s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *dcb_config,
 		 * of a TC is too small, the maximum credit may not be
 		 * enough to send out a jumbo frame in data plane arbitration.
 		 */
-		if (credit_max && (credit_max < MINIMUM_CREDIT_FOR_JUMBO))
-			credit_max = MINIMUM_CREDIT_FOR_JUMBO;
+		if (credit_max && (credit_max < min_credit))
+			credit_max = min_credit;
 
 		if (direction == DCB_TX_CONFIG) {
 			/*
diff --git a/drivers/net/ixgbe/ixgbe_dcb.h b/drivers/net/ixgbe/ixgbe_dcb.h
index eb1059f..0208a87 100644
--- a/drivers/net/ixgbe/ixgbe_dcb.h
+++ b/drivers/net/ixgbe/ixgbe_dcb.h
@@ -150,15 +150,14 @@ struct ixgbe_dcb_config {
 /* DCB driver APIs */
 
 /* DCB credits calculation */
-s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *, u8);
+s32 ixgbe_dcb_calculate_tc_credits(struct ixgbe_dcb_config *, int, u8);
 
 /* DCB hw initialization */
 s32 ixgbe_dcb_hw_config(struct ixgbe_hw *, struct ixgbe_dcb_config *);
 
 /* DCB definitions for credit calculation */
+#define DCB_CREDIT_QUANTUM	64   /* DCB Quantum */
 #define MAX_CREDIT_REFILL       511  /* 0x1FF * 64B = 32704B */
-#define MINIMUM_CREDIT_REFILL   5    /* 5*64B = 320B */
-#define MINIMUM_CREDIT_FOR_JUMBO 145  /* 145= UpperBound((9*1024+54)/64B) for 9KB jumbo frame */
 #define DCB_MAX_TSO_SIZE        (32*1024) /* MAX TSO packet size supported in DCB mode */
 #define MINIMUM_CREDIT_FOR_TSO  (DCB_MAX_TSO_SIZE/64 + 1) /* 513 for 32KB TSO packet */
 #define MAX_CREDIT              4095 /* Maximum credit supported: 256KB * 1204 / 64B */
diff --git a/drivers/net/ixgbe/ixgbe_dcb_82599.c b/drivers/net/ixgbe/ixgbe_dcb_82599.c
index 67c219f..05f2247 100644
--- a/drivers/net/ixgbe/ixgbe_dcb_82599.c
+++ b/drivers/net/ixgbe/ixgbe_dcb_82599.c
@@ -397,6 +397,11 @@ static s32 ixgbe_dcb_config_82599(struct ixgbe_hw *hw)
 	reg &= ~IXGBE_RTTDCS_ARBDIS;
 	IXGBE_WRITE_REG(hw, IXGBE_RTTDCS, reg);
 
+	/* Enable Security TX Buffer IFG for DCB */
+	reg = IXGBE_READ_REG(hw, IXGBE_SECTXMINIFG);
+	reg |= IXGBE_SECTX_DCB;
+	IXGBE_WRITE_REG(hw, IXGBE_SECTXMINIFG, reg);
+
 	return 0;
 }
 
diff --git a/drivers/net/ixgbe/ixgbe_dcb_82599.h b/drivers/net/ixgbe/ixgbe_dcb_82599.h
index 18d7fbf..3841649 100644
--- a/drivers/net/ixgbe/ixgbe_dcb_82599.h
+++ b/drivers/net/ixgbe/ixgbe_dcb_82599.h
@@ -95,6 +95,9 @@
 
 #define IXGBE_TXPBTHRESH_DCB    0xA        /* THRESH value for DCB mode */
 
+/* SECTXMINIFG DCB */
+#define IXGBE_SECTX_DCB		0x00001F00 /* DCB TX Buffer IFG */
+
 
 /* DCB hardware-specific driver APIs */
 
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index f856312..2bd3eb4 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -3347,6 +3347,7 @@ static void ixgbe_napi_disable_all(struct ixgbe_adapter *adapter)
 static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
+	int max_frame = adapter->netdev->mtu + ETH_HLEN + ETH_FCS_LEN;
 	u32 txdctl;
 	int i, j;
 
@@ -3359,8 +3360,15 @@ static void ixgbe_configure_dcb(struct ixgbe_adapter *adapter)
 	if (hw->mac.type == ixgbe_mac_82598EB)
 		netif_set_gso_max_size(adapter->netdev, 32768);
 
-	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, DCB_TX_CONFIG);
-	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, DCB_RX_CONFIG);
+#ifdef CONFIG_FCOE
+	if (adapter->netdev->features & NETIF_F_FCOE_MTU)
+		max_frame = max(max_frame, IXGBE_FCOE_JUMBO_FRAME_SIZE);
+#endif
+
+	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, max_frame,
+					DCB_TX_CONFIG);
+	ixgbe_dcb_calculate_tc_credits(&adapter->dcb_cfg, max_frame,
+					DCB_RX_CONFIG);
 
 	/* reconfigure the hardware */
 	ixgbe_dcb_hw_config(&adapter->hw, &adapter->dcb_cfg);


^ permalink raw reply related

* [PATCH] fib: Fix fib zone and its hash leak on namespace stop
From: Pavel Emelyanov @ 2010-10-28 12:00 UTC (permalink / raw)
  To: David Miller; +Cc: Linux Netdev List

When we stop a namespace we flush the table and free one, but the
added fn_zone-s (and their hashes if grown) are leaked. Need to free.
Tries releases all its stuff in the flushing code.

Shame on us - this bug exists since the very first make-fib-per-net
patches in 2.6.27 :(

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>

---

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index ba3666d..07bdb5e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -158,6 +158,8 @@ extern int fib_table_flush(struct fib_table *table);
 extern void fib_table_select_default(struct fib_table *table,
 				     const struct flowi *flp,
 				     struct fib_result *res);
+extern void fib_free_table(struct fib_table *tb);
+
 
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 36e27c2..eb6f69a 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1052,7 +1052,7 @@ static void ip_fib_net_exit(struct net *net)
 		hlist_for_each_entry_safe(tb, node, tmp, head, tb_hlist) {
 			hlist_del(node);
 			fib_table_flush(tb);
-			kfree(tb);
+			fib_free_table(tb);
 		}
 	}
 	kfree(net->ipv4.fib_table_hash);
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index b232375..b3acb04 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -716,6 +716,24 @@ int fib_table_flush(struct fib_table *tb)
 	return found;
 }
 
+void fib_free_table(struct fib_table *tb)
+{
+	struct fn_hash *table = (struct fn_hash *) tb->tb_data;
+	struct fn_zone *fz, *next;
+
+	next = table->fn_zone_list;
+	while (next != NULL) {
+		fz = next;
+		next = fz->fz_next;
+
+		if (fz->fz_hash != fz->fz_embedded_hash)
+			fz_hash_free(fz->fz_hash, fz->fz_divisor);
+
+		kfree(fz);
+	}
+
+	kfree(tb);
+}
 
 static inline int
 fn_hash_dump_bucket(struct sk_buff *skb, struct netlink_callback *cb,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index b144508..200eb53 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1797,6 +1797,11 @@ int fib_table_flush(struct fib_table *tb)
 	return found;
 }
 
+void fib_free_table(struct fib_table *tb)
+{
+	kfree(tb);
+}
+
 void fib_table_select_default(struct fib_table *tb,
 			      const struct flowi *flp,
 			      struct fib_result *res)

^ permalink raw reply related

* [PATCH] ixgbe: delay rx_ring freeing
From: Eric Dumazet @ 2010-10-28 12:02 UTC (permalink / raw)
  To: Tantilov, Emil S
  Cc: David Miller, Waskiewicz Jr, Peter P, Kirsher, Jeffrey T,
	Brattain, Ross B, netdev
In-Reply-To: <1288239858.2658.72.camel@edumazet-laptop>

Le jeudi 28 octobre 2010 à 06:24 +0200, Eric Dumazet a écrit :
> Le mercredi 27 octobre 2010 à 16:35 -0600, Tantilov, Emil S a écrit :

> > Eric,
> > 
> > We are seeing intermittent hangs on ia32 arch which seem to point to this patch:
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 00000040
> >  IP: [<f7f6b537>] ixgbe_get_stats64+0x47/0x120 [ixgbe]
> >  *pdpt = 000000002dc83001 *pde = 000000032d7e5067
> >  Oops: 0000 [#2] SMP
> >  last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
> >  Modules linked in: act_skbedit cls_u32 sch_multiq ixgbe mdio netconsole configfs autofs4 8021q garp stp llc sunrpc ipv6 e
> > xt3 jbd dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath power_meter hwmon sg ses enclosure dcdbas pcspkr serio_raw iTCO_wdt iTCO_vendor_support io
> > atdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas dm_mod [last unloaded: mperf
> > ]
> > 
> >  Pid: 1939, comm: irqbalance Tainted: G      D W   2.6.36-rc7-upstream-net-next-2.6-ixgbe-queue-i386-g55e1a84 #1 09CGW2/Po
> > werEdge T610
> >  EIP: 0060:[<f7f6b537>] EFLAGS: 00010206 CPU: 0
> >  EIP is at ixgbe_get_stats64+0x47/0x120 [ixgbe]
> >  EAX: 00000000 EBX: ecc45e4c ECX: ebea0400 EDX: 00000000
> >  ESI: ebea0000 EDI: 00000018 EBP: f7f846a0 ESP: ecc45d88
> >  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> >  Process irqbalance (pid: 1939, ti=ecc44000 task=efc63a50 task.ti=ecc44000)
> >  Stack:
> >  ecc45e4c 00000000 ebea0400 ebea0000 ecc45e4c ebea0000 ecc45f04 f7f846a0
> >  <0> c0750593 ebea0000 edfec340 ebea0000 000002b5 c075063a edfec340 c0993eec
> >  <0> ebe78000 0000021c 00000000 00000009 00000000 00000000 00000000 00000000
> >  Call Trace:
> >  [<c0750593>] ? dev_get_stats+0x33/0xc0
> >  [<c075063a>] ? dev_seq_printf_stats+0x1a/0x180
> >  [<c07507aa>] ? dev_seq_show+0xa/0x20
> >  [<c052398f>] ? seq_read+0x22f/0x3d0
> >  [<c0523760>] ? seq_read+0x0/0x3d0
> >  [<c054fdde>] ? proc_reg_read+0x5e/0x90
> >  [<c054fd80>] ? proc_reg_read+0x0/0x90
> >  [<c050a1dd>] ? vfs_read+0x9d/0x160
> >  [<c049d4ef>] ? audit_syscall_entry+0x20f/0x230
> >  [<c050a971>] ? sys_read+0x41/0x70
> >  [<c0409cdf>] ? sysenter_do_call+0x12/0x28
> >  Code: 60 4f 7e c8 8b 44 24 08 8b b8 20 06 00 00 85 ff 7e 63 c7 44 24 04 00 00 00 00 66 90 8b 54 24 04 8b 4c 24 08 8b 84 9
> > 1 00 05 00 00 <8b> 50 40 eb 06 8d 74 26 00 89 ca f6 c2 01 0f 85 ae 00 00 00 8b
> >  EIP: [<f7f6b537>] ixgbe_get_stats64+0x47/0x120 [ixgbe] SS:ESP 0068:ecc45d88
> >  CR2: 0000000000000040
> >  ---[ end trace 51ea89f4e57f54f1 ]---
> > 
> > Emil

I believe I now understand the problem, please try following patch.

Thanks



[PATCH] ixgbe: delay rx_ring freeing

"cat /proc/net/dev" uses RCU protection only.

Its quite possible we call a driver get_stats() method while device is
dismantling and freeing its data structures.

So get_stats() methods must be very careful not accessing driver private
data without appropriate locking.

In ixgbe case, we access rx_ring pointers. These pointers are freed in
ixgbe_clear_interrupt_scheme() and set to NULL, this can trigger NULL
dereference in ixgbe_get_stats64()

A possible fix is to use RCU locking in ixgbe_get_stats64() and defer
rx_ring freeing after a grace period in ixgbe_clear_interrupt_scheme()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Tantilov, Emil S <emil.s.tantilov@intel.com>
---
 drivers/net/ixgbe/ixgbe.h      |    1 
 drivers/net/ixgbe/ixgbe_main.c |   34 +++++++++++++++++++++----------
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index ed8703c..018e143 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -192,6 +192,7 @@ struct ixgbe_ring {
 
 	unsigned int size;		/* length in bytes */
 	dma_addr_t dma;			/* phys. address of descriptor ring */
+	struct rcu_head rcu;
 } ____cacheline_internodealigned_in_smp;
 
 enum ixgbe_ring_f_enum {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index f856312..1b16292 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -4742,6 +4742,11 @@ err_set_interrupt:
 	return err;
 }
 
+static void ring_free_rcu(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ixgbe_ring, rcu));
+}
+
 /**
  * ixgbe_clear_interrupt_scheme - Clear the current interrupt scheme settings
  * @adapter: board private structure to clear interrupt scheme on
@@ -4758,7 +4763,12 @@ void ixgbe_clear_interrupt_scheme(struct ixgbe_adapter *adapter)
 		adapter->tx_ring[i] = NULL;
 	}
 	for (i = 0; i < adapter->num_rx_queues; i++) {
-		kfree(adapter->rx_ring[i]);
+		struct ixgbe_ring *ring = adapter->rx_ring[i];
+		
+		/* ixgbe_get_stats64() might access this ring, we must wait
+		 * a grace period before freeing it.
+		 */
+		call_rcu(&ring->rcu, ring_free_rcu);
 		adapter->rx_ring[i] = NULL;
 	}
 
@@ -6551,20 +6561,23 @@ static struct rtnl_link_stats64 *ixgbe_get_stats64(struct net_device *netdev,
 
 	/* accurate rx/tx bytes/packets stats */
 	dev_txq_stats_fold(netdev, stats);
+	rcu_read_lock();
 	for (i = 0; i < adapter->num_rx_queues; i++) {
-		struct ixgbe_ring *ring = adapter->rx_ring[i];
+		struct ixgbe_ring *ring = ACCESS_ONCE(adapter->rx_ring[i]);
 		u64 bytes, packets;
 		unsigned int start;
 
-		do {
-			start = u64_stats_fetch_begin_bh(&ring->syncp);
-			packets = ring->stats.packets;
-			bytes   = ring->stats.bytes;
-		} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
-		stats->rx_packets += packets;
-		stats->rx_bytes   += bytes;
+		if (ring) {
+			do {
+				start = u64_stats_fetch_begin_bh(&ring->syncp);
+				packets = ring->stats.packets;
+				bytes   = ring->stats.bytes;
+			} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
+			stats->rx_packets += packets;
+			stats->rx_bytes   += bytes;
+		}
 	}
-
+	rcu_read_unlock();
 	/* following stats updated by ixgbe_watchdog_task() */
 	stats->multicast	= netdev->stats.multicast;
 	stats->rx_errors	= netdev->stats.rx_errors;
@@ -7270,6 +7283,7 @@ static void __exit ixgbe_exit_module(void)
 	dca_unregister_notify(&dca_notifier);
 #endif
 	pci_unregister_driver(&ixgbe_driver);
+	rcu_barrier(); /* Wait for completion of call_rcu()'s */
 }
 
 #ifdef CONFIG_IXGBE_DCA



^ permalink raw reply related

* [PATCH] Fix missing dependency of PCH_GBE driver
From: Steven Rostedt @ 2010-10-28 12:16 UTC (permalink / raw)
  To: LKML, netdev
  Cc: David S. Miller, Dan Carpenter, Masayuki Ohtake, Andrew Morton

While doing some randconfigs I stumbled across this build error:

ERROR: "mii_ethtool_gset" [drivers/net/pch_gbe/pch_gbe.ko] undefined!
ERROR: "generic_mii_ioctl" [drivers/net/pch_gbe/pch_gbe.ko] undefined!
ERROR: "mii_nway_restart" [drivers/net/pch_gbe/pch_gbe.ko] undefined!
ERROR: "mii_check_gmii_support" [drivers/net/pch_gbe/pch_gbe.ko] undefined!
ERROR: "mii_link_ok" [drivers/net/pch_gbe/pch_gbe.ko] undefined!
ERROR: "mii_ethtool_sset" [drivers/net/pch_gbe/pch_gbe.ko] undefined!

The config option PCH_GBE is missing the dependency of MII.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9334539..cf9769c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2540,7 +2540,7 @@ source "drivers/net/stmmac/Kconfig"

 config PCH_GBE
 	tristate "PCH Gigabit Ethernet"
-	depends on PCI
+	depends on PCI && MII
 	---help---
 	  This is a gigabit ethernet driver for Topcliff PCH.
 	  Topcliff PCH is the platform controller hub that is used in Intel's

^ permalink raw reply related

* Re: [PATCH] ixgbe: delay rx_ring freeing
From: Jeff Kirsher @ 2010-10-28 12:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tantilov, Emil S, David Miller, Waskiewicz Jr, Peter P,
	Brattain, Ross B, netdev
In-Reply-To: <1288267354.2649.369.camel@edumazet-laptop>

[-- Attachment #1: Type: text/plain, Size: 3956 bytes --]

On Thu, 2010-10-28 at 05:02 -0700, Eric Dumazet wrote:
> Le jeudi 28 octobre 2010 à 06:24 +0200, Eric Dumazet a écrit :
> > Le mercredi 27 octobre 2010 à 16:35 -0600, Tantilov, Emil S a écrit :
> 
> > > Eric,
> > > 
> > > We are seeing intermittent hangs on ia32 arch which seem to point to this patch:
> > > 
> > > BUG: unable to handle kernel NULL pointer dereference at 00000040
> > >  IP: [<f7f6b537>] ixgbe_get_stats64+0x47/0x120 [ixgbe]
> > >  *pdpt = 000000002dc83001 *pde = 000000032d7e5067
> > >  Oops: 0000 [#2] SMP
> > >  last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
> > >  Modules linked in: act_skbedit cls_u32 sch_multiq ixgbe mdio netconsole configfs autofs4 8021q garp stp llc sunrpc ipv6 e
> > > xt3 jbd dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath power_meter hwmon sg ses enclosure dcdbas pcspkr serio_raw iTCO_wdt iTCO_vendor_support io
> > > atdma dca i7core_edac edac_core bnx2 ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix megaraid_sas dm_mod [last unloaded: mperf
> > > ]
> > > 
> > >  Pid: 1939, comm: irqbalance Tainted: G      D W   2.6.36-rc7-upstream-net-next-2.6-ixgbe-queue-i386-g55e1a84 #1 09CGW2/Po
> > > werEdge T610
> > >  EIP: 0060:[<f7f6b537>] EFLAGS: 00010206 CPU: 0
> > >  EIP is at ixgbe_get_stats64+0x47/0x120 [ixgbe]
> > >  EAX: 00000000 EBX: ecc45e4c ECX: ebea0400 EDX: 00000000
> > >  ESI: ebea0000 EDI: 00000018 EBP: f7f846a0 ESP: ecc45d88
> > >  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> > >  Process irqbalance (pid: 1939, ti=ecc44000 task=efc63a50 task.ti=ecc44000)
> > >  Stack:
> > >  ecc45e4c 00000000 ebea0400 ebea0000 ecc45e4c ebea0000 ecc45f04 f7f846a0
> > >  <0> c0750593 ebea0000 edfec340 ebea0000 000002b5 c075063a edfec340 c0993eec
> > >  <0> ebe78000 0000021c 00000000 00000009 00000000 00000000 00000000 00000000
> > >  Call Trace:
> > >  [<c0750593>] ? dev_get_stats+0x33/0xc0
> > >  [<c075063a>] ? dev_seq_printf_stats+0x1a/0x180
> > >  [<c07507aa>] ? dev_seq_show+0xa/0x20
> > >  [<c052398f>] ? seq_read+0x22f/0x3d0
> > >  [<c0523760>] ? seq_read+0x0/0x3d0
> > >  [<c054fdde>] ? proc_reg_read+0x5e/0x90
> > >  [<c054fd80>] ? proc_reg_read+0x0/0x90
> > >  [<c050a1dd>] ? vfs_read+0x9d/0x160
> > >  [<c049d4ef>] ? audit_syscall_entry+0x20f/0x230
> > >  [<c050a971>] ? sys_read+0x41/0x70
> > >  [<c0409cdf>] ? sysenter_do_call+0x12/0x28
> > >  Code: 60 4f 7e c8 8b 44 24 08 8b b8 20 06 00 00 85 ff 7e 63 c7 44 24 04 00 00 00 00 66 90 8b 54 24 04 8b 4c 24 08 8b 84 9
> > > 1 00 05 00 00 <8b> 50 40 eb 06 8d 74 26 00 89 ca f6 c2 01 0f 85 ae 00 00 00 8b
> > >  EIP: [<f7f6b537>] ixgbe_get_stats64+0x47/0x120 [ixgbe] SS:ESP 0068:ecc45d88
> > >  CR2: 0000000000000040
> > >  ---[ end trace 51ea89f4e57f54f1 ]---
> > > 
> > > Emil
> 
> I believe I now understand the problem, please try following patch.
> 
> Thanks
> 
> 
> 
> [PATCH] ixgbe: delay rx_ring freeing
> 
> "cat /proc/net/dev" uses RCU protection only.
> 
> Its quite possible we call a driver get_stats() method while device is
> dismantling and freeing its data structures.
> 
> So get_stats() methods must be very careful not accessing driver private
> data without appropriate locking.
> 
> In ixgbe case, we access rx_ring pointers. These pointers are freed in
> ixgbe_clear_interrupt_scheme() and set to NULL, this can trigger NULL
> dereference in ixgbe_get_stats64()
> 
> A possible fix is to use RCU locking in ixgbe_get_stats64() and defer
> rx_ring freeing after a grace period in ixgbe_clear_interrupt_scheme()
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Reported-by: Tantilov, Emil S <emil.s.tantilov@intel.com>
> ---
>  drivers/net/ixgbe/ixgbe.h      |    1 
>  drivers/net/ixgbe/ixgbe_main.c |   34 +++++++++++++++++++++----------
>  2 files changed, 25 insertions(+), 10 deletions(-)
> 

Thanks Eric, I have added this patch to my queue.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox