Netdev List

* how many msi (msi-x) vectors can be setup?
From: zhou rui @ 2010-05-19 15:18 UTC (permalink / raw)
  To: netdev

hi there:
how many MSI (MSI-X) vectors can be set up?
Is the number limited by the hardware (the NIC) or by the kernel?
I found that the driver (Broadcom 57711, ver 1.5.12) tried to request
16 queues on my 2.6.27 kernel, but only 2 were available.
Will that increase if I update the driver or the kernel?
And is there a system-wide limitation? If other devices have
already occupied too many MSI vectors, there may not be enough left.

thanks
rui

^ permalink raw reply

* Re: [PATCH iproute2] document initcwnd
From: Stephen Hemminger @ 2010-05-19 15:31 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: dormando, netdev, Rick Jones, shemminger
In-Reply-To: <4BE21D64.4040600@athenacr.com>

On Wed, 05 May 2010 21:37:40 -0400
Brian Bloniarz <bmb@athenacr.com> wrote:

> Stephen Hemminger wrote:
> > On Wed, 05 May 2010 16:56:34 -0400
> > Brian Bloniarz <bmb@athenacr.com> wrote:
> > 
> >> dormando wrote:
> >>>> This sounds like TCP slow start.
> >>>>
> >>>> http://en.wikipedia.org/wiki/Slow-start
> >>>>
> >>>> As far as tunables you might want to play with the initcwnd route
> >>>> flag (see "ip route help")
> >>> Ah, yes, initcwnd was it. I'm well aware of TCP Congestion control / slow
> >>> start / etc. However I couldn't find the damn tunable for it :)
> >> Documenting the flag in ip(8) might increase its visibility
> >> a little. I don't see it documented in the iproute2 git head,
> >> though it shows up on http://linux.die.net/man/8/ip somehow.
> >>
> >> Stephen, do you know why that is?
> > 
> > No one sent me an official patch to change it?
> 
> Mention initcwnd in ip(8). Text taken from doc/ip-cref.tex.

Applied

^ permalink raw reply

* Re: sky2 poweroff screws up my network
From: Kyle McMartin @ 2010-05-19 15:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Kyle McMartin, netdev
In-Reply-To: <20100519081440.2883e71d@nehalam>

On Wed, May 19, 2010 at 08:14:40AM -0700, Stephen Hemminger wrote:
> The sky2 shutdown puts the chip in Wake On Lan state; this
> does a separate link speed negotiation (100 mbit) which may be a problem
> if speed duplex is forced.
>

Hrm, I'd disabled WoL in ethtool explicitly hoping this sort of thing
could be avoided, but it doesn't seem to help. :/

I'm not sure what you mean by "if speed duplex is forced," I've left the
nic in autoneg mode.

regards, Kyle

^ permalink raw reply

* [ANNOUNCE] iproute2 2.6.34
From: Stephen Hemminger @ 2010-05-19 15:54 UTC (permalink / raw)
  To: netdev; +Cc: linux-net, linux-kernel

This version of the iproute2 utilities is intended for use with 2.6.34 or
later kernels, but should be backward compatible with older releases.
In addition to build and man page fixes, this release includes
support for several new features:

   * SR-IOV (I/O Virtualization) support.
   * tuntap support
   * bus-error reporting and counters
   * new FIFO type head drop queue discipline

The tar ball is available at:
  http://devresources.linuxfoundation.org/dev/iproute2/download

Repository:
  git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

For more info on iproute2 see:
  http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

Changes since last release (2.6.33)

Alexandre Cassen (1):
      Detect 6rd kernel missing support / 6rd tunnel scope

Andreas Henriksson (2):
      iproute2: detect iptables modules dir in configure.
      iproute2: add option to build m_xt as a tc module (v3)

Bart Trojanowski (1):
      fix build issues with flex ver 2.5

Brian Bloniarz (1):
      ip: document initcwnd

Chris Wright (1):
      iproute2: rework SR-IOV VF support

David Woodhouse (1):
      Add 'ip tuntap' support.

Florian Westphal (1):
      iproute2: fix addrlabel interface names handling

Hagen Paul Pfeifer (1):
      tc: add new queue discipline: head drop fifo

Jamal Hadi Salim (3):
      xfrm: policy by mark
      xfrm: Introduce xfrm by mark
      xfrm: add support for SA by mark

Jan Engelhardt (1):
      ip: correctly report tunnel link type

Michele Petrazzo - Unipex (1):
      Continue after errors in -batch

Williams, Mitch A (3):
      Update man page to indicate current options
      ip: Add support for setting and showing SR-IOV virtual funtion link params
      libnetlink: Modify the parser to track first duplicated attributes

Wolfgang Grandegger (1):
      iproute2: netlink support for bus-error reporting and counters

YOSHIFUJI Hideaki / 吉藤英明 (1):
      gaiconf: /etc/gai.conf configuration helper.

jamal (1):
      skbedit: use get_u32 for parsing mark

laurent chavey (1):
      Add initcwnd to iproute2

Stephen Hemminger:
      Fix line numbering on batch commands
      Remove mirred debug message
      Workaround missing ALIGN() macro.
      Update ip.8 man page to describe route table id values
      Update kernel headers to 2.6.34 final version
      Add documentation for ip link add/delete sub-commands
      ip: add documentation for initrwnd
      v2.6.34

^ permalink raw reply

* [PATCH] drivers/net/arcnet/capmode.c: clean up code
From: Daniel Mack @ 2010-05-19 16:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Daniel Mack, Tejun Heo, Jiri Kosina, Christoph Lameter,
	Joe Perches, netdev

 - shuffle around functions to get rid of forward declarations
 - fix some CodingStyle and indentation issues
 - last but not least, get rid of the following CONFIG_MODULE=n warning:

	drivers/net/arcnet/capmode.c:52: warning: ‘capmode_proto’ defined but not used

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: netdev@vger.kernel.org
---
 drivers/net/arcnet/capmode.c |  176 +++++++++++++++++++-----------------------
 1 files changed, 79 insertions(+), 97 deletions(-)

diff --git a/drivers/net/arcnet/capmode.c b/drivers/net/arcnet/capmode.c
index 20e833a..e1810a3 100644
--- a/drivers/net/arcnet/capmode.c
+++ b/drivers/net/arcnet/capmode.c
@@ -37,67 +37,6 @@
 
 #define VERSION "arcnet: cap mode (`c') encapsulation support loaded.\n"
 
-
-static void rx(struct net_device *dev, int bufnum,
-	       struct archdr *pkthdr, int length);
-static int build_header(struct sk_buff *skb,
-			struct net_device *dev,
-			unsigned short type,
-			uint8_t daddr);
-static int prepare_tx(struct net_device *dev, struct archdr *pkt, int length,
-		      int bufnum);
-static int ack_tx(struct net_device *dev, int acked);
-
-
-static struct ArcProto capmode_proto =
-{
-	'r',
-	XMTU,
-	0,
-       	rx,
-	build_header,
-	prepare_tx,
-	NULL,
-	ack_tx
-};
-
-#ifdef MODULE
-
-static void arcnet_cap_init(void)
-{
-	int count;
-
-	for (count = 1; count <= 8; count++)
-		if (arc_proto_map[count] == arc_proto_default)
-			arc_proto_map[count] = &capmode_proto;
-
-	/* for cap mode, we only set the bcast proto if there's no better one */
-	if (arc_bcast_proto == arc_proto_default)
-		arc_bcast_proto = &capmode_proto;
-
-	arc_proto_default = &capmode_proto;
-	arc_raw_proto = &capmode_proto;
-}
-
-static int __init capmode_module_init(void)
-{
-	printk(VERSION);
-	arcnet_cap_init();
-	return 0;
-}
-
-static void __exit capmode_module_exit(void)
-{
-	arcnet_unregister_proto(&capmode_proto);
-}
-module_init(capmode_module_init);
-module_exit(capmode_module_exit);
-
-MODULE_LICENSE("GPL");
-
-#endif				/* MODULE */
-
-
 /* packet receiver */
 static void rx(struct net_device *dev, int bufnum,
 	       struct archdr *pkthdr, int length)
@@ -229,65 +168,108 @@ static int prepare_tx(struct net_device *dev, struct archdr *pkt, int length,
 	BUGMSG(D_DURING, "prepare_tx: length=%d ofs=%d\n",
 	       length,ofs);
 
-	// Copy the arcnet-header + the protocol byte down:
+	/* Copy the arcnet-header + the protocol byte down: */
 	lp->hw.copy_to_card(dev, bufnum, 0, hard, ARC_HDR_SIZE);
 	lp->hw.copy_to_card(dev, bufnum, ofs, &pkt->soft.cap.proto,
 			    sizeof(pkt->soft.cap.proto));
 
-	// Skip the extra integer we have written into it as a cookie
-	// but write the rest of the message:
+	/* Skip the extra integer we have written into it as a cookie
+	   but write the rest of the message: */
 	lp->hw.copy_to_card(dev, bufnum, ofs+1,
 			    ((unsigned char*)&pkt->soft.cap.mes),length-1);
 
 	lp->lastload_dest = hard->dest;
 
-	return 1;		/* done */
+	return 1;	/* done */
 }
 
-
 static int ack_tx(struct net_device *dev, int acked)
 {
-  struct arcnet_local *lp = netdev_priv(dev);
-  struct sk_buff *ackskb;
-  struct archdr *ackpkt;
-  int length=sizeof(struct arc_cap);
+	struct arcnet_local *lp = netdev_priv(dev);
+	struct sk_buff *ackskb;
+	struct archdr *ackpkt;
+	int length=sizeof(struct arc_cap);
+
+	BUGMSG(D_DURING, "capmode: ack_tx: protocol: %x: result: %d\n",
+		lp->outgoing.skb->protocol, acked);
 
-  BUGMSG(D_DURING, "capmode: ack_tx: protocol: %x: result: %d\n",
-	 lp->outgoing.skb->protocol, acked);
+	BUGLVL(D_SKB) arcnet_dump_skb(dev, lp->outgoing.skb, "ack_tx");
 
-  BUGLVL(D_SKB) arcnet_dump_skb(dev, lp->outgoing.skb, "ack_tx");
+	/* Now alloc a skb to send back up through the layers: */
+	ackskb = alloc_skb(length + ARC_HDR_SIZE , GFP_ATOMIC);
+	if (ackskb == NULL) {
+		BUGMSG(D_NORMAL, "Memory squeeze, can't acknowledge.\n");
+		goto free_outskb;
+	}
 
-  /* Now alloc a skb to send back up through the layers: */
-  ackskb = alloc_skb(length + ARC_HDR_SIZE , GFP_ATOMIC);
-  if (ackskb == NULL) {
-	  BUGMSG(D_NORMAL, "Memory squeeze, can't acknowledge.\n");
-	  goto free_outskb;
-  }
+	skb_put(ackskb, length + ARC_HDR_SIZE );
+	ackskb->dev = dev;
 
-  skb_put(ackskb, length + ARC_HDR_SIZE );
-  ackskb->dev = dev;
+	skb_reset_mac_header(ackskb);
+	ackpkt = (struct archdr *)skb_mac_header(ackskb);
+	/* skb_pull(ackskb, ARC_HDR_SIZE); */
 
-  skb_reset_mac_header(ackskb);
-  ackpkt = (struct archdr *)skb_mac_header(ackskb);
-  /* skb_pull(ackskb, ARC_HDR_SIZE); */
+	skb_copy_from_linear_data(lp->outgoing.skb, ackpkt,
+				  ARC_HDR_SIZE + sizeof(struct arc_cap));
+	ackpkt->soft.cap.proto = 0; /* using protocol 0 for acknowledge */
+	ackpkt->soft.cap.mes.ack=acked;
 
+	BUGMSG(D_PROTO, "Ackknowledge for cap packet %x.\n",
+			*((int*)&ackpkt->soft.cap.cookie[0]));
+
+	ackskb->protocol = cpu_to_be16(ETH_P_ARCNET);
+
+	BUGLVL(D_SKB) arcnet_dump_skb(dev, ackskb, "ack_tx_recv");
+	netif_rx(ackskb);
+
+free_outskb:
+	dev_kfree_skb_irq(lp->outgoing.skb);
+	lp->outgoing.proto = NULL; /* We are always finished when in this protocol */
+
+	return 0;
+}
 
-  skb_copy_from_linear_data(lp->outgoing.skb, ackpkt,
-		ARC_HDR_SIZE + sizeof(struct arc_cap));
-  ackpkt->soft.cap.proto=0; /* using protocol 0 for acknowledge */
-  ackpkt->soft.cap.mes.ack=acked;
+static struct ArcProto capmode_proto =
+{
+	'r',
+	XMTU,
+	0,
+	rx,
+	build_header,
+	prepare_tx,
+	NULL,
+	ack_tx
+};
+
+static void arcnet_cap_init(void)
+{
+	int count;
 
-  BUGMSG(D_PROTO, "Ackknowledge for cap packet %x.\n",
-	 *((int*)&ackpkt->soft.cap.cookie[0]));
+	for (count = 1; count <= 8; count++)
+		if (arc_proto_map[count] == arc_proto_default)
+			arc_proto_map[count] = &capmode_proto;
 
-  ackskb->protocol = cpu_to_be16(ETH_P_ARCNET);
+	/* for cap mode, we only set the bcast proto if there's no better one */
+	if (arc_bcast_proto == arc_proto_default)
+		arc_bcast_proto = &capmode_proto;
 
-  BUGLVL(D_SKB) arcnet_dump_skb(dev, ackskb, "ack_tx_recv");
-  netif_rx(ackskb);
+	arc_proto_default = &capmode_proto;
+	arc_raw_proto = &capmode_proto;
+}
 
- free_outskb:
-  dev_kfree_skb_irq(lp->outgoing.skb);
-  lp->outgoing.proto = NULL; /* We are always finished when in this protocol */
+static int __init capmode_module_init(void)
+{
+	printk(VERSION);
+	arcnet_cap_init();
+	return 0;
+}
 
-  return 0;
+static void __exit capmode_module_exit(void)
+{
+	arcnet_unregister_proto(&capmode_proto);
 }
+module_init(capmode_module_init);
+module_exit(capmode_module_exit);
+
+MODULE_LICENSE("GPL");
+
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH iproute] ip: add support for multicast rules
From: Stephen Hemminger @ 2010-05-19 16:03 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Linux Netdev List
In-Reply-To: <4BC48870.7070704@trash.net>

On Tue, 13 Apr 2010 17:06:24 +0200
Patrick McHardy <kaber@trash.net> wrote:

> This patch adds support for a "ip mrule" command, which is used
> to configure multicast routing rules.
> 
> The corresponding kernel patches have been sent to Dave and
> should (hopefully) appear in net-next soon.

The fib_rules.h file in iproute2 is kept in sync with the kernel
headers. But I do not see the definitions of FIB_RULES_IPV4 etc
in net-next kernel.  What happened to this?


^ permalink raw reply

* Re: [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
From: Avi Kivity @ 2010-05-19 16:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, Juan Quintela, Rusty Russell, Paul E. McKenney,
	Arnd Bergmann, kvm, virtualization, netdev, linux-kernel,
	alex.williamson, amit.shah
In-Reply-To: <4BF2D2A7.8030803@redhat.com>

On 05/18/2010 08:47 PM, Avi Kivity wrote:
> On 05/18/2010 05:21 AM, Michael S. Tsirkin wrote:
>> With PUBLISH_USED_IDX, guest tells us which used entries
>> it has consumed. This can be used to reduce the number
>> of interrupts: after we write a used entry, if the guest has not yet
>> consumed the previous entry, or if the guest has already consumed the
>> new entry, we do not need to interrupt.
>> This improves bandwidth by 30% under some workflows.
>
> Seems to be missing the cacheline alignment.
>
> Rusty's clarification did not satisfy me, I think it's needed.
>

Oh, and this should definitely follow the patch to the virtio spec, not 
precede it.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply

* Re: [PATCH v3] Fix SJA1000 command register writes on SMP systems
From: Oliver Hartkopp @ 2010-05-19 16:23 UTC (permalink / raw)
  To: Sam Ravnborg
  Cc: SocketCAN Core Mailing List, Linux Netdev List, David Miller,
	Wolfgang Grandegger
In-Reply-To: <20100518213109.GA29894-OoSGOWW0KRunlFQ6Q1D1Y0B+6BGkLq7r@public.gmane.org>

On 18.05.2010 23:31, Sam Ravnborg wrote:
> Hi Oliver.
> 
>> diff --git a/drivers/net/can/sja1000/sja1000.h b/drivers/net/can/sja1000/sja1000.h
>> index 97a622b..de8e778 100644
>> --- a/drivers/net/can/sja1000/sja1000.h
>> +++ b/drivers/net/can/sja1000/sja1000.h
>> @@ -167,6 +167,7 @@ struct sja1000_priv {
>>  
>>  	void __iomem *reg_base;	 /* ioremap'ed address to registers */
>>  	unsigned long irq_flags; /* for request_irq() */
>> +	spinlock_t cmdreg_lock;  /* lock for concurrent cmd register writes */
>>  
>>  	u16 flags;		/* custom mode flags */
>>  	u8 ocr;			/* output control register */
> 
> You define your spinlock inside a struct so you cannot use
> DEFINE_SPINLOCK().
> 
> But then you need to use spin_lock_init() - which I fail to see
> you are doing in your patch.

Indeed. Sorry.

Will send a patch with spin_lock_init() e.g. to enable the spinlock debugging ...
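
For readers outside the kernel, the same distinction exists in userspace
pthreads. The sketch below is only an analogy, not the SJA1000 code: a
standalone lock can use a static initializer (compare DEFINE_SPINLOCK()),
while a lock embedded in an allocated struct needs an explicit runtime init
(compare spin_lock_init()). The names priv_model, alloc_priv_model and
write_cmdreg are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Standalone lock: a static initializer is enough. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long writes;	/* protected by global_lock */

/* Lock embedded in an allocated struct: it needs a runtime init. */
struct priv_model {
	pthread_mutex_t cmdreg_lock;	/* serializes cmdreg writes */
	unsigned char cmdreg;
};

static struct priv_model *alloc_priv_model(void)
{
	struct priv_model *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	/* The step that is easy to forget for an embedded lock. */
	if (pthread_mutex_init(&p->cmdreg_lock, NULL) != 0) {
		free(p);
		return NULL;
	}
	return p;
}

static void write_cmdreg(struct priv_model *p, unsigned char val)
{
	assert(p != NULL);

	pthread_mutex_lock(&p->cmdreg_lock);
	p->cmdreg = val;
	pthread_mutex_unlock(&p->cmdreg_lock);

	pthread_mutex_lock(&global_lock);
	writes++;
	pthread_mutex_unlock(&global_lock);
}

static void free_priv_model(struct priv_model *p)
{
	pthread_mutex_destroy(&p->cmdreg_lock);
	free(p);
}
```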

Regards,
Oliver

^ permalink raw reply

* [PATCH net-next-2.6] can: SJA1000 add missing spin_lock_init()
From: Oliver Hartkopp @ 2010-05-19 16:46 UTC (permalink / raw)
  To: David Miller; +Cc: SocketCAN Core Mailing List, Linux Netdev List

As remarked by Sam Ravnborg the spin_lock variable, that has been introduced
in commit 57c8a456640fa3ca777652f11f2db4179a3e66b6 ("can: Fix SJA1000 command
register writes on SMP systems") has not been initialized properly.

This patch adds the initialization to allow spinlock debugging. 

Signed-off-by: Oliver Hartkopp <socketcan-fJ+pQTUTwRTk1uMJSBkQmQ@public.gmane.org>
CC: Sam Ravnborg <sam-uyr5N9Q2VtJg9hUCZPvPmw@public.gmane.org>

---

diff --git a/drivers/net/can/sja1000/sja1000.c b/drivers/net/can/sja1000/sja1000.c
index 85f7cbf..0a8de01 100644
--- a/drivers/net/can/sja1000/sja1000.c
+++ b/drivers/net/can/sja1000/sja1000.c
@@ -599,6 +599,8 @@ struct net_device *alloc_sja1000dev(int sizeof_priv)
 	priv->can.ctrlmode_supported = CAN_CTRLMODE_3_SAMPLES |
 		CAN_CTRLMODE_BERR_REPORTING;
 
+	spin_lock_init(&priv->cmdreg_lock);
+
 	if (sizeof_priv)
 		priv->priv = (void *)priv + sizeof(struct sja1000_priv);

^ permalink raw reply related

* Re: [PATCH] vhost-net: utilize PUBLISH_USED_IDX feature
From: Avi Kivity @ 2010-05-19 17:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, Juan Quintela, Rusty Russell, Paul E. McKenney,
	Arnd Bergmann, kvm, virtualization, netdev, linux-kernel,
	alex.williamson, amit.shah
In-Reply-To: <20100518011931.GA21918@redhat.com>

On 05/18/2010 04:19 AM, Michael S. Tsirkin wrote:
> With PUBLISH_USED_IDX, guest tells us which used entries
> it has consumed. This can be used to reduce the number
> of interrupts: after we write a used entry, if the guest has not yet
> consumed the previous entry, or if the guest has already consumed the
> new entry, we do not need to interrupt.
> This improves bandwidth by 30% under some workflows.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>
> Rusty, Dave, this patch depends on the patch
> "virtio: put last seen used index into ring itself"
> which is currently destined at Rusty's tree.
> Rusty, if you are taking that one for 2.6.35, please
> take this one as well.
> Dave, any objections?
>    

I object: I think the index should have its own cacheline, and that it 
should be documented before merging.
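
For reference, my reading of the interrupt rule quoted above can be written
as a small standalone C function. This is a sketch of the description, not
the code from the patch: the name need_interrupt, the free-running 16-bit
indices, and the generalization to a batch of entries are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * old_used:  used index before the host wrote its entries
 * new_used:  used index after the write
 * last_seen: index of the next used entry the guest will consume,
 *            as published by the guest
 *
 * Interrupt only if the guest had consumed everything up to old_used
 * (or part of the new batch) but has not yet consumed the newest entry,
 * i.e. last_seen lies in [old_used, new_used) modulo 2^16.
 */
static bool need_interrupt(uint16_t old_used, uint16_t new_used,
			   uint16_t last_seen)
{
	uint16_t written = (uint16_t)(new_used - old_used);
	uint16_t ahead = (uint16_t)(last_seen - old_used);

	assert(written > 0);	/* caller must have added at least one entry */
	return ahead < written;
}
```

The unsigned 16-bit subtraction keeps the comparison valid when the indices
wrap around.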

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply

* Re: dev_get_valid_name buggy with hash collision
From: Octavian Purdila @ 2010-05-19 17:05 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Linux Netdev List
In-Reply-To: <4BF2AA68.5090008@free.fr>

On Tuesday 18 May 2010 17:55:36 you wrote:

> >>         if (!dev_valid_name(name))
> >>                 return -EINVAL;
> >>
> >>         if (fmt&&  strchr(name, '%'))
> >> -               return __dev_alloc_name(net, name, buf);
> >> +               return dev_alloc_name(dev, name);
> >>         else if (__dev_get_by_name(net, name))
> >>                 return -EEXIST;
> >> -       else if (buf != name)
> >> -               strlcpy(buf, name, IFNAMSIZ);
> >> +       else if (strncmp(dev->name, name, IFNAMSIZ))
> >> +                strlcpy(dev->name, name, IFNAMSIZ);
> >
> > Why do the strncmp, can't we preserve the (buf != name) condition
> 
> The 'buf' parameter is no longer passed to the function. We have the
> 'dev'  and the 'newname' parameters.
> The pointer test was just to check 'dev_get_valid_name' was called from
> the 'register_netdevice' function context with 'dev_get_valid_name(net,
> dev->name, dev->name, 0)'. Comparing the strings is valid in this case.
> 
> Otherwise dev_get_valid_name is called from:
> 
>   *  "dev_change_net_namespace" with "dev%d" or "ifname" specified
> within the netlink message. Both are different pointers, the first will
> fall in the "if (fmt && strchr(name, '%'))".
> 
>   * "dev_change_name", where the pointers are different and the strings
> are different.
> 

True, but why not use "if (dev->name != name)" instead of strncmp? It should
yield the same results and it is lighter than a full strncmp.


^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Neil Horman @ 2010-05-19 18:00 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, Herbert Xu, David S. Miller, Thomas Graf, netdev
In-Reply-To: <20100519125543.GA26519@hmsreliant.think-freely.org>

On Wed, May 19, 2010 at 08:55:43AM -0400, Neil Horman wrote:
> On Wed, May 19, 2010 at 08:05:47AM -0400, Neil Horman wrote:
> > On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > > 
> > > > Another concern I have is about RPS.
> > > > 
> > > > netif_receive_skb() must be called from process_backlog() context, or
> > > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > > another cpu.
> > > 
> > > Hmm, I just checked again, and this is wrong.
> > > 
> > > In case we enqueue skb on a remote cpu backlog, we also
> > > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > > later.
> > > 
> > But if this happens, then we lose the connection between the packet being
> > received and the process doing the reception, so the network cgroup classifier
> > breaks again.
> > 
> > Performance gains are still a big advantage here of course.
> > Neil
> > 
> Scratch what I said here, Herbert corrected me on this, and we're ok, as tun has
> no rps map.
> 
> I'll test this patch out in just a bit
> Neil
> 

I'm currently testing this. Unfortunately, while it's not breaking anything, it
doesn't allow cgroups to classify frames coming from tun interfaces.  I'm still
investigating, but I think the issue is that, because we call local_bh_disable
with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
for the task.  Since the cgroup classifier has this check:

if (softirq_count() != SOFTIRQ_OFFSET)
	return -1;

We still fail to classify the frame.  The cgroup classifier is assuming that any
frame arriving with a softirq count of 1 means we came directly from the
dev_queue_xmit routine and is safe to check current().  Any less than that, and
something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
and any more implies that we have nested calls to local_bh_disable, meaning
we're really handling a softirq context.
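
To make the counting concrete, here is a small userspace model of it (not
kernel code). The bit layout (softirq field in bits 8..15, so SOFTIRQ_OFFSET
is 1 << 8) and every model_* name are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: the softirq field occupies bits 8..15 of preempt_count. */
#define SOFTIRQ_SHIFT	8
#define SOFTIRQ_BITS	8
#define SOFTIRQ_OFFSET	(1u << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK	(((1u << SOFTIRQ_BITS) - 1) << SOFTIRQ_SHIFT)

struct task_model {
	unsigned int preempt_count;
};

/* Each bh-disable section adds one SOFTIRQ_OFFSET to the count. */
static void model_bh_disable(struct task_model *t)
{
	t->preempt_count += SOFTIRQ_OFFSET;
}

static void model_bh_enable(struct task_model *t)
{
	assert(t->preempt_count & SOFTIRQ_MASK);	/* must be balanced */
	t->preempt_count -= SOFTIRQ_OFFSET;
}

static unsigned int model_softirq_count(const struct task_model *t)
{
	return t->preempt_count & SOFTIRQ_MASK;
}

/* Same shape as the classifier check: only a depth of exactly 1 passes. */
static bool model_can_classify(const struct task_model *t)
{
	return model_softirq_count(t) == SOFTIRQ_OFFSET;
}
```

In this model a single model_bh_disable() passes the check, while a second,
nested one raises the count to 2 * SOFTIRQ_OFFSET and the check fails, which
matches the behaviour described above.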

Neil

> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 

^ permalink raw reply

* Re: [GIT] Networking
From: David Miller @ 2010-05-19 19:28 UTC (permalink / raw)
  To: ben; +Cc: torvalds, akpm, netdev, linux-kernel
In-Reply-To: <1274267582.2763.24.camel@localhost>

From: Ben Hutchings <ben@decadent.org.uk>
Date: Wed, 19 May 2010 12:13:02 +0100

> On Tue, 2010-05-18 at 23:37 -0700, David Miller wrote:
> [...]
>> Some other things that stand out:
>> 
>> 1) Allow the administrator to reserve port ranges, such that the
>>    kernel bind allocation scheme won't use them.  From Amerigo Wang.
>> 
>> 2) ipv6 address et al. handling converted to use generic kernel lists.
>>    From Stephen Hemminger.
>> 
>> 3) PHY module autoloading support from David Howells
> [...]
> 
> For the record, that was actually David Woodhouse.

Indeed, already apologized to him in private email.   My bad :-/

^ permalink raw reply

* Re: dev_get_valid_name buggy with hash collision
From: Daniel Lezcano @ 2010-05-19 19:39 UTC (permalink / raw)
  To: Octavian Purdila; +Cc: Linux Netdev List
In-Reply-To: <201005192005.49459.opurdila@ixiacom.com>

On 05/19/2010 07:05 PM, Octavian Purdila wrote:
> On Tuesday 18 May 2010 17:55:36 you wrote:
>
>    
>>>>          if (!dev_valid_name(name))
>>>>                  return -EINVAL;
>>>>
>>>>          if (fmt&&   strchr(name, '%'))
>>>> -               return __dev_alloc_name(net, name, buf);
>>>> +               return dev_alloc_name(dev, name);
>>>>          else if (__dev_get_by_name(net, name))
>>>>                  return -EEXIST;
>>>> -       else if (buf != name)
>>>> -               strlcpy(buf, name, IFNAMSIZ);
>>>> +       else if (strncmp(dev->name, name, IFNAMSIZ))
>>>> +                strlcpy(dev->name, name, IFNAMSIZ);
>>>>          
>>> Why do the strncmp, can't we preserve the (buf != name) condition
>>>        
>> The 'buf' parameter is no longer passed to the function. We have the
>> 'dev'  and the 'newname' parameters.
>> The pointer test was just to check 'dev_get_valid_name' was called from
>> the 'register_netdevice' function context with 'dev_get_valid_name(net,
>> dev->name, dev->name, 0)'. Comparing the strings is valid in this case.
>>
>> Otherwise dev_get_valid_name is called from:
>>
>>    *  "dev_change_net_namespace" with "dev%d" or "ifname" specified
>> within the netlink message. Both are different pointers, the first will
>> fall in the "if (fmt&&  strchr(name, '%'))".
>>
>>    * "dev_change_name", where the pointers are different and the strings
>> are different.
>>
>>      
> True, but why not use "if (dev->name != name)" instead of strncmp? It should
> yield the same results and it is lighter than a full strncmp.
>    

Yes, I agree. In the context of the different callers, that's correct.
Will resend it with the pointer comparison.
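
As a standalone illustration of the pointer test (a userspace sketch, not the
net/core/dev.c code): the copy is skipped only when the caller passes
dev->name itself. dev_model and model_set_name are made-up names, and
snprintf() stands in for strlcpy(), which is an assumption about what is
available outside the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ 16

struct dev_model {
	char name[IFNAMSIZ];
};

/*
 * Copy 'name' into dev->name unless the caller passed dev->name itself.
 * Returns 1 if a copy was made, 0 if it was skipped.
 */
static int model_set_name(struct dev_model *dev, const char *name)
{
	if (dev->name == name)
		return 0;	/* same buffer: nothing to copy */

	/* snprintf() truncates and NUL-terminates, like strlcpy() */
	snprintf(dev->name, IFNAMSIZ, "%s", name);
	return 1;
}
```

An equal string held in a different buffer is still copied here, which is
harmless: the resulting name is the same as with the strncmp variant.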

Thanks
   -- Daniel

^ permalink raw reply

* Re: [PATCH 06/11] netdev: bfin_mac: avoid tx skb overflows in the tx DMA ring
From: David Miller @ 2010-05-19 20:12 UTC (permalink / raw)
  To: sonic.adi; +Cc: netdev
In-Reply-To: <AANLkTikKz6v09VTjtANmmcGHU1eKcVWjuDIO24KqkHWK@mail.gmail.com>

From: Sonic Zhang <sonic.adi@gmail.com>
Date: Wed, 19 May 2010 17:23:16 +0800

> No, this doesn't happen, because before ndo_start_xmit() returns, the
> old TX buffers and skbs in the ring, which finished DMA operation, are
> freed. The only difference is that the free operation of a skb is done
> in next tx transfer.

This is still illegal.

What if TX activity stops right then, and there is no "next tx
transfer"?

That SKB will never get freed, ever.

You have to fix this.
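
A minimal userspace sketch of the alternative, assuming a simple in-order
ring: the completion handler reclaims finished buffers itself, so nothing
waits for a later transmit. All names here (tx_ring, tx_reclaim,
tx_complete_irq, ...) are hypothetical and not taken from bfin_mac.

```c
#include <assert.h>
#include <stdlib.h>

#define TX_RING_SIZE 4	/* power of two, so index wrap-around stays consistent */

struct tx_slot {
	void *skb;	/* buffer owned by the driver until reclaimed */
	int dma_done;	/* set once the "hardware" finished the transfer */
};

struct tx_ring {
	struct tx_slot slot[TX_RING_SIZE];
	unsigned int head;	/* next slot to fill */
	unsigned int tail;	/* oldest slot not yet reclaimed */
};

/* Free every buffer whose DMA has completed; returns how many were freed. */
static int tx_reclaim(struct tx_ring *r)
{
	int freed = 0;

	while (r->tail != r->head &&
	       r->slot[r->tail % TX_RING_SIZE].dma_done) {
		struct tx_slot *s = &r->slot[r->tail % TX_RING_SIZE];

		free(s->skb);
		s->skb = NULL;
		s->dma_done = 0;
		r->tail++;
		freed++;
	}
	return freed;
}

/* Queue one buffer; returns 0 on success, -1 if the ring is full. */
static int tx_xmit(struct tx_ring *r, void *skb)
{
	if (r->head - r->tail == TX_RING_SIZE)
		return -1;
	r->slot[r->head % TX_RING_SIZE].skb = skb;
	r->slot[r->head % TX_RING_SIZE].dma_done = 0;
	r->head++;
	return 0;
}

/*
 * Completion "interrupt": the oldest pending transfer finished. Reclaim
 * right away, so the last buffer is freed even if tx_xmit() is never
 * called again.
 */
static int tx_complete_irq(struct tx_ring *r)
{
	assert(r->tail != r->head);	/* something must be in flight */
	r->slot[r->tail % TX_RING_SIZE].dma_done = 1;
	return tx_reclaim(r);
}
```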

^ permalink raw reply

* [PATCH] net-2.6 : V2 - fix dev_get_valid_name
From: Daniel Lezcano @ 2010-05-19 20:12 UTC (permalink / raw)
  To: davem; +Cc: opurdila, netdev

the commit:

commit d90310243fd750240755e217c5faa13e24f41536
Author: Octavian Purdila <opurdila@ixiacom.com>
Date:   Wed Nov 18 02:36:59 2009 +0000

    net: device name allocation cleanups

introduced a bug: when there is a hash collision, it becomes impossible
to rename a device with eth%d. This bug is very hard to reproduce
and appears rarely.

The problem comes from passing 'dev->name', which is modified by the
function, to __dev_alloc_name instead of a temporary buffer.

A detailed explanation is here:

http://marc.info/?l=linux-netdev&m=127417784011987&w=2

Changelog:
 V2 : replaced strings comparison by pointers comparison

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
---
 net/core/dev.c |   20 ++++++++++++--------
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..a2bfe57 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -936,18 +936,22 @@ int dev_alloc_name(struct net_device *dev, const char *name)
 }
 EXPORT_SYMBOL(dev_alloc_name);
 
-static int dev_get_valid_name(struct net *net, const char *name, char *buf,
-			      bool fmt)
+static int dev_get_valid_name(struct net_device *dev, const char *name, bool fmt)
 {
+	struct net *net;
+
+	BUG_ON(!dev_net(dev));
+	net = dev_net(dev);
+
 	if (!dev_valid_name(name))
 		return -EINVAL;
 
 	if (fmt && strchr(name, '%'))
-		return __dev_alloc_name(net, name, buf);
+		return dev_alloc_name(dev, name);
 	else if (__dev_get_by_name(net, name))
 		return -EEXIST;
-	else if (buf != name)
-		strlcpy(buf, name, IFNAMSIZ);
+	else if (dev->name != name)
+		strlcpy(dev->name, name, IFNAMSIZ);
 
 	return 0;
 }
@@ -979,7 +983,7 @@ int dev_change_name(struct net_device *dev, const char *newname)
 
 	memcpy(oldname, dev->name, IFNAMSIZ);
 
-	err = dev_get_valid_name(net, newname, dev->name, 1);
+	err = dev_get_valid_name(dev, newname, 1);
 	if (err < 0)
 		return err;
 
@@ -5083,7 +5087,7 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
-	ret = dev_get_valid_name(net, dev->name, dev->name, 0);
+	ret = dev_get_valid_name(dev, dev->name, 0);
 	if (ret)
 		goto err_uninit;
 
@@ -5661,7 +5665,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 		/* We get here if we can't use the current device name */
 		if (!pat)
 			goto out;
-		if (dev_get_valid_name(net, pat, dev->name, 1))
+		if (dev_get_valid_name(dev, pat, 1))
 			goto out;
 	}
 
-- 
1.7.0.4


^ permalink raw reply related

* Re: tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2010-05-19 20:14 UTC (permalink / raw)
  To: herbert; +Cc: eric.dumazet, tgraf, nhorman, netdev
In-Reply-To: <20100519082047.GA24331@gondor.apana.org.au>

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 May 2010 18:20:47 +1000

> On Wed, May 19, 2010 at 10:09:42AM +0200, Eric Dumazet wrote:
>> 
>> 6) netif_rx() pro is that packet processing is done while stack usage is
>> guaranteed to be low (from process_backlog, using a special softirq
>> stack, instead of current stack)
>> 
>> After your patch, tun will use more stack. Is it safe on all contexts ?
> 
> Dave also raised this but I believe nothing changes with regards
> to the stack.  We currently call do_softirq which does not switch
> stacks.

do_softirq() _does_ switch stacks, it's a per-arch function that
does the stack switch and calls __do_softirq() on the softirq
stack.

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Neil Horman @ 2010-05-19 20:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Eric Dumazet, Herbert Xu, David S. Miller, Thomas Graf, netdev
In-Reply-To: <20100519180053.GC26519@hmsreliant.think-freely.org>

On Wed, May 19, 2010 at 02:00:53PM -0400, Neil Horman wrote:
> On Wed, May 19, 2010 at 08:55:43AM -0400, Neil Horman wrote:
> > On Wed, May 19, 2010 at 08:05:47AM -0400, Neil Horman wrote:
> > > On Wed, May 19, 2010 at 10:18:09AM +0200, Eric Dumazet wrote:
> > > > Le mercredi 19 mai 2010 à 10:09 +0200, Eric Dumazet a écrit :
> > > > 
> > > > > Another concern I have is about RPS.
> > > > > 
> > > > > netif_receive_skb() must be called from process_backlog() context, or
> > > > > there is no guarantee the IPI will be sent if this skb is enqueued for
> > > > > another cpu.
> > > > 
> > > > Hmm, I just checked again, and this is wrong.
> > > > 
> > > > In case we enqueue skb on a remote cpu backlog, we also
> > > > do __raise_softirq_irqoff(NET_RX_SOFTIRQ); so the IPI will be done
> > > > later.
> > > > 
> > > But if this happens, then we lose the connection between the packet being
> > > received and the process doing the reception, so the network cgroup classifier
> > > breaks again.
> > > 
> > > Performance gains are still a big advantage here of course.
> > > Neil
> > > 
> > Scratch what I said here, Herbert corrected me on this, and we're ok, as tun has
> > no rps map.
> > 
> > I'll test this patch out in just a bit
> > Neil
> > 
> 
> I'm currently testing this. Unfortunately, while it's not breaking anything, it
> doesn't allow cgroups to classify frames coming from tun interfaces.  I'm still
> investigating, but I think the issue is that, because we call local_bh_disable
> with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
> for the task.  Since the cgroup classifier has this check:
> 
> if (softirq_count() != SOFTIRQ_OFFSET)
> 	return -1;
> 
> We still fail to classify the frame.  The cgroup classifier is assuming that any
> frame arriving with a softirq count of 1 means we came directly from the
> dev_queue_xmit routine and is safe to check current().  Any less than that, and
> something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
> and any more implies that we have nested calls to local_bh_disable, meaning
> we're really handling a softirq context.
> 
> Neil
> 


Just out of curiosity, how unsavory would it be if we were to dedicate the upper
bit in the SOFTIRQ_BITS range to be an indicator of whether we were actually
executing softirqs?  As noted above, we're tripping over the ambiguity here
between running in softirq context and actually just having softirqs disabled.
Would it be against anyone's sensibilities if we were to dedicate the upper bit
in the softirq_count range to disambiguate the two conditions (or use a separate
flag for that matter)?

Neil
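
A minimal user-space sketch of the idea above, for illustration only.  It
models the softirq field of preempt_count as an 8-bit range whose upper bit
is a hypothetical "serving a softirq" flag and whose lower bits count nested
local_bh_disable calls.  Every MODEL_* name, the struct, and the field
position are assumptions made for this sketch, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: field position and width are assumptions. */
#define MODEL_SOFTIRQ_SHIFT   8
#define MODEL_SOFTIRQ_BITS    8
#define MODEL_SOFTIRQ_OFFSET  (1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_MASK \
	(((1u << MODEL_SOFTIRQ_BITS) - 1u) << MODEL_SOFTIRQ_SHIFT)
/* Hypothetical flag: upper bit of the field = "running a softirq handler". */
#define MODEL_SOFTIRQ_SERVING \
	(1u << (MODEL_SOFTIRQ_SHIFT + MODEL_SOFTIRQ_BITS - 1))
/* The remaining lower bits count nested bh-disable calls. */
#define MODEL_SOFTIRQ_NEST    (MODEL_SOFTIRQ_MASK & ~MODEL_SOFTIRQ_SERVING)

struct task_model {
	uint32_t preempt_count;
};

static void model_bh_disable(struct task_model *t)
{
	/* The nesting counter must not overflow into the flag bit. */
	assert((t->preempt_count & MODEL_SOFTIRQ_NEST) != MODEL_SOFTIRQ_NEST);
	t->preempt_count += MODEL_SOFTIRQ_OFFSET;
}

static void model_bh_enable(struct task_model *t)
{
	assert((t->preempt_count & MODEL_SOFTIRQ_NEST) != 0);
	t->preempt_count -= MODEL_SOFTIRQ_OFFSET;
}

static void model_softirq_enter(struct task_model *t)
{
	t->preempt_count |= MODEL_SOFTIRQ_SERVING;
}

static void model_softirq_exit(struct task_model *t)
{
	t->preempt_count &= ~MODEL_SOFTIRQ_SERVING;
}

/* The style of check quoted in the thread: exactly one bh-disable level. */
static bool model_count_is_one(const struct task_model *t)
{
	return (t->preempt_count & MODEL_SOFTIRQ_MASK) == MODEL_SOFTIRQ_OFFSET;
}

/* The proposed check: the flag alone says whether a handler is running. */
static bool model_in_serving_softirq(const struct task_model *t)
{
	return (t->preempt_count & MODEL_SOFTIRQ_SERVING) != 0;
}
```

With two nested model_bh_disable calls from process context,
model_count_is_one reports false even though no softirq is running, while
model_in_serving_softirq still answers correctly because it ignores the
nesting depth.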


^ permalink raw reply

* Re: how many msi (msi-x) vectors can be setup?
From: Yinghai Lu @ 2010-05-19 20:33 UTC (permalink / raw)
  To: zhou rui; +Cc: netdev
In-Reply-To: <AANLkTikT3ThBnO8q3AvhS9t4jUu8npOwYmjY6WUMhseq@mail.gmail.com>

On Wed, May 19, 2010 at 8:18 AM, zhou rui <wirelesser@gmail.com> wrote:
> hi there:
> how many msi (msi-x) vectors can be setup?
> the number is limited by hardware resource(nic), or kernel ?
> I found that the driver (broadcom 57711 ver 1.5.12) tried to request
> 16 queues on my kernel2.6.27,but only 2  available
> will it be increased if I update the driver or kernel?
> and there is a limitiation in the system? if the other devices have
> already occupied too many MSI vectors then it is not enough.

Since kernel 2.6.19, x86_64 has per-cpu vector irq support.

It depends on your system: how many CPUs does it have, and is it 64bit or 32bit?

YH

^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Thomas Graf @ 2010-05-19 20:49 UTC (permalink / raw)
  To: Neil Horman
  Cc: Neil Horman, Eric Dumazet, Herbert Xu, David S. Miller, netdev
In-Reply-To: <20100519180053.GC26519@hmsreliant.think-freely.org>

On Wed, 2010-05-19 at 14:00 -0400, Neil Horman wrote: 
> I'm currently testing this, unfortunately, and it's not breaking anything, but it
> doesn't allow cgroups to classify frames coming from tun interfaces.  I'm still
> investigating, but I think the issue is that, because we call local_bh_disable
> with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
> for the task.  Since the cgroup classifier has this check:
> 
> if (softirq_count() != SOFTIRQ_OFFSET)
> 	return -1;
> 
> We still fail to classify the frame.  The cgroup classifier is assuming that any
> frame arriving with a softirq count of 1 means we came directly from the
> dev_queue_xmit routine and is safe to check current().  Any less than that, and
> something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
> and any more implies that we have nested calls to local_bh_disable, meaning
> we're really handling a softirq context.

It is a hack but the only method to check for softirq context I found. I
would favor using a flag if there was one.


^ permalink raw reply

* Re: tun: Use netif_receive_skb instead of netif_rx
From: Brian Bloniarz @ 2010-05-19 21:00 UTC (permalink / raw)
  To: tgraf
  Cc: Neil Horman, Neil Horman, Eric Dumazet, Herbert Xu,
	David S. Miller, netdev
In-Reply-To: <1274302191.3148.2.camel@lsx.localdomain>

On 05/19/2010 04:49 PM, Thomas Graf wrote:
> On Wed, 2010-05-19 at 14:00 -0400, Neil Horman wrote: 
>> I'm currently testing this, unfortunately, and it's not breaking anything, but it
>> doesn't allow cgroups to classify frames coming from tun interfaces.  I'm still
>> investigating, but I think the issue is that, because we call local_bh_disable
>> with this patch, we wind up raising the count at SOFTIRQ_OFFSET in preempt_count
>> for the task.  Since the cgroup classifier has this check:
>>
>> if (softirq_count() != SOFTIRQ_OFFSET)
>> 	return -1;
>>
>> We still fail to classify the frame.  The cgroup classifier is assuming that any
>> frame arriving with a softirq count of 1 means we came directly from the
>> dev_queue_xmit routine and is safe to check current().  Any less than that, and
>> something is wrong (as we at least need the local_bh_disable in dev_queue_xmit),
>> and any more implies that we have nested calls to local_bh_disable, meaning
>> we're really handling a softirq context.
> 
> It is a hack but the only method to check for softirq context I found. I
> would favor using a flag if there was one.

Eric probably has some thoughts on this -- his scheduler-batching patch RFC
from last year needed the same bit of info:
http://patchwork.ozlabs.org/patch/24536/
(see the changes to trace_softirq_context).

^ permalink raw reply

* [PATCH] net: fix problem in dequeuing from input_pkt_queue
From: Tom Herbert @ 2010-05-19 21:47 UTC (permalink / raw)
  To: davem; +Cc: eric.dumazet, xiaosuo, netdev

Fix some issues introduced in batch skb dequeuing for input_pkt_queue.
The primary issue is that the queue head must be incremented only
after a packet has been processed, that is only after
__netif_receive_skb has been called.  This is needed for the mechanism
to prevent OOO packets in RFS.  Also when flushing the input_pkt_queue
and process_queue, the process queue should be done first to prevent
OOO packets.

Because the input_pkt_queue has been effectively split into two queues,
the calculation of the tail ptr is no longer correct.  The correct value
would be head+input_pkt_queue->len+process_queue->len.  To avoid
this calculation we added an explicit input_queue_tail in softnet_data.
The tail value is simply incremented when queuing to input_pkt_queue.

In process_backlog the processing of the packet queue can be done
without IRQs being disabled.

Changed dropped in softnet_data to "unsigned int" for consistency.

Signed-off-by: Tom Herbert <therbert@google.com>
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c3487a6..bc0bc85 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1403,17 +1403,25 @@ struct softnet_data {
 	struct softnet_data	*rps_ipi_next;
 	unsigned int		cpu;
 	unsigned int		input_queue_head;
+	unsigned int		input_queue_tail;
 #endif
-	unsigned		dropped;
+	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
 };
 
-static inline void input_queue_head_add(struct softnet_data *sd,
-					unsigned int len)
+static inline void input_queue_head_incr(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	sd->input_queue_head += len;
+	sd->input_queue_head++;
+#endif
+}
+
+static inline void input_queue_tail_incr_save(struct softnet_data *sd,
+					      unsigned int *qtail)
+{
+#ifdef CONFIG_RPS
+	*qtail = ++sd->input_queue_tail;
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c82065..be7d475 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2426,10 +2426,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 		if (skb_queue_len(&sd->input_pkt_queue)) {
 enqueue:
 			__skb_queue_tail(&sd->input_pkt_queue, skb);
-#ifdef CONFIG_RPS
-			*qtail = sd->input_queue_head +
-					skb_queue_len(&sd->input_pkt_queue);
-#endif
+			input_queue_tail_incr_save(sd, qtail);
 			rps_unlock(sd);
 			local_irq_restore(flags);
 			return NET_RX_SUCCESS;
@@ -2959,22 +2956,24 @@ static void flush_backlog(void *arg)
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
 
-	rps_lock(sd);
-	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
+	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev == dev) {
-			__skb_unlink(skb, &sd->input_pkt_queue);
+			__skb_unlink(skb, &sd->process_queue);
 			kfree_skb(skb);
-			input_queue_head_add(sd, 1);
+			input_queue_head_incr(sd);
 		}
 	}
-	rps_unlock(sd);
 
-	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
+	rps_lock(sd);
+	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
 		if (skb->dev == dev) {
-			__skb_unlink(skb, &sd->process_queue);
+			__skb_unlink(skb, &sd->input_pkt_queue);
 			kfree_skb(skb);
+			input_queue_head_incr(sd);
 		}
 	}
+	rps_unlock(sd);
+
 }
 
 static int napi_gro_complete(struct sk_buff *skb)
@@ -3320,26 +3319,24 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	}
 #endif
 	napi->weight = weight_p;
-	local_irq_disable();
 	while (work < quota) {
 		struct sk_buff *skb;
 		unsigned int qlen;
 
 		while ((skb = __skb_dequeue(&sd->process_queue))) {
-			local_irq_enable();
 			__netif_receive_skb(skb);
+			input_queue_head_incr(sd);
 			if (++work >= quota)
 				return work;
-			local_irq_disable();
 		}
 
+		local_irq_disable();
 		rps_lock(sd);
 		qlen = skb_queue_len(&sd->input_pkt_queue);
-		if (qlen) {
-			input_queue_head_add(sd, qlen);
+		if (qlen)
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
 						   &sd->process_queue);
-		}
+
 		if (qlen < quota - work) {
 			/*
 			 * Inline a custom version of __napi_complete().
@@ -3354,8 +3351,8 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			quota = work + qlen;
 		}
 		rps_unlock(sd);
+		local_irq_enable();
 	}
-	local_irq_enable();
 
 	return work;
 }
@@ -5679,12 +5676,14 @@ static int dev_cpu_callback(struct notifier_block *nfb,
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
-	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
+	while ((skb = __skb_dequeue(&oldsd->process_queue))) {
 		netif_rx(skb);
-		input_queue_head_add(oldsd, 1);
+		input_queue_head_incr(oldsd);
 	}
-	while ((skb = __skb_dequeue(&oldsd->process_queue)))
+	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
 		netif_rx(skb);
+		input_queue_head_incr(oldsd);
+	}
 
 	return NOTIFY_OK;
 }
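
The head/tail bookkeeping described in the patch text can be modelled in
user space as below.  This is only a sketch under stated assumptions: the
two queues are reduced to length counters, and struct backlog_model and the
model_* helpers are hypothetical names, not part of the patch.  It shows
that once the head is advanced only after a packet is processed, the stored
tail stays equal to head + input_pkt_queue length + process_queue length.

```c
#include <assert.h>

/* Toy model: each queue is represented only by its length. */
struct backlog_model {
	unsigned int input_len;        /* packets in input_pkt_queue */
	unsigned int process_len;      /* packets in process_queue */
	unsigned int input_queue_head; /* packets fully processed so far */
	unsigned int input_queue_tail; /* packets ever enqueued */
};

/* Enqueue one packet and return the saved tail, as in
 * input_queue_tail_incr_save(). */
static unsigned int model_enqueue(struct backlog_model *b)
{
	b->input_len++;
	return ++b->input_queue_tail;
}

/* Move everything from the input queue to the process queue.
 * The head is deliberately NOT advanced here. */
static void model_splice(struct backlog_model *b)
{
	b->process_len += b->input_len;
	b->input_len = 0;
}

/* Process one packet, then advance the head. */
static void model_process_one(struct backlog_model *b)
{
	assert(b->process_len > 0);
	b->process_len--;
	b->input_queue_head++;
}

/* The value the patch text says would otherwise have to be computed. */
static unsigned int model_computed_tail(const struct backlog_model *b)
{
	return b->input_queue_head + b->input_len + b->process_len;
}
```

For example, after three enqueues, one splice, one processed packet and one
more enqueue, the stored tail and model_computed_tail() both equal 4 while
the head is 1.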

^ permalink raw reply related

* Re: [RFC] netem: correlated loss generation (v3)
From: Hagen Paul Pfeifer @ 2010-05-19 21:42 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Stefano Salsano, David Miller, Fabio Ludovici, netdev, netem
In-Reply-To: <20100517205621.036a06e0@nehalam>

* Stephen Hemminger | 2010-05-17 20:56:21 [-0700]:

>Subject: netem - revised correlated loss generator
>
>This is a patch originated with Stefano Salsano and Fabio Ludovici.
>It provides several alternative loss models for use with netem.
>There are two state machine based models and one table driven model.
>
>To simplify the original code:
>   * eliminated the debugging messages and statistics
>   * reformatted for clarity
>   * changed API to nested attribute relating to loss
>   * changed the table to always loop across bits
>   * only allocate parameters needed
>
>Still untested, for comment only...
>Should have tested version before 2.6.35 merge window closes.

Why mainline? I question the advantage for a wide audience; it looks like
an academic-only piece of software - correct me if I'm wrong.

The authors pointed to some weak points in the implementation of the current
loss/correlation logic. But this "fix" adds another complicated component
and leaves the broken components untouched ...

HGN

-- 
Hagen Paul Pfeifer <hagen@jauu.net>  ||  http://jauu.net/
Telephone: +49 174 5455209           ||  Key Id: 0x98350C22
Key Fingerprint: 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22

^ permalink raw reply

* [net-next PATCH] ixgbe:add support for a new 82599 10G Base-T device
From: Jeff Kirsher @ 2010-05-19 22:16 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Mallikarjuna R Chilakala, Jeff Kirsher

From: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>

This adds support for a new copper device for 82599, device id 0x151c.
This 82599 10GBase-T device uses the PHY's internal temperature sensor
to guard against over-temp conditions. In this scenario the PHY will be
put in a low power mode and link will no longer be able to transmit or
receive any data. When this occurs, the over-temp interrupt is latched
and driver logs this error message. A HW reset or power cycle is
required to clear this status.

Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/ixgbe/ixgbe.h       |    3 ++
 drivers/net/ixgbe/ixgbe_82598.c |    1 +
 drivers/net/ixgbe/ixgbe_82599.c |    1 +
 drivers/net/ixgbe/ixgbe_main.c  |   68 +++++++++++++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_phy.c   |   31 ++++++++++++++++++
 drivers/net/ixgbe/ixgbe_phy.h   |    3 ++
 drivers/net/ixgbe/ixgbe_type.h  |    4 ++
 7 files changed, 111 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index d0ea3d6..ffae480 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -360,6 +360,7 @@ struct ixgbe_adapter {
 	u32 flags2;
 #define IXGBE_FLAG2_RSC_CAPABLE                 (u32)(1)
 #define IXGBE_FLAG2_RSC_ENABLED                 (u32)(1 << 1)
+#define IXGBE_FLAG2_TEMP_SENSOR_CAPABLE         (u32)(1 << 2)
 /* default to trying for four seconds */
 #define IXGBE_TRY_LINK_TIMEOUT (4 * HZ)
 
@@ -407,6 +408,8 @@ struct ixgbe_adapter {
 	u16 eeprom_version;
 
 	int node;
+	struct work_struct check_overtemp_task;
+	u32 interrupt_event;
 
 	/* SR-IOV */
 	DECLARE_BITMAP(active_vfs, IXGBE_MAX_VF_FUNCTIONS);
diff --git a/drivers/net/ixgbe/ixgbe_82598.c b/drivers/net/ixgbe/ixgbe_82598.c
index f2b7ff4..9c02d60 100644
--- a/drivers/net/ixgbe/ixgbe_82598.c
+++ b/drivers/net/ixgbe/ixgbe_82598.c
@@ -1236,6 +1236,7 @@ static struct ixgbe_phy_operations phy_ops_82598 = {
 	.setup_link		= &ixgbe_setup_phy_link_generic,
 	.setup_link_speed	= &ixgbe_setup_phy_link_speed_generic,
 	.read_i2c_eeprom	= &ixgbe_read_i2c_eeprom_82598,
+	.check_overtemp   = &ixgbe_tn_check_overtemp,
 };
 
 struct ixgbe_info ixgbe_82598_info = {
diff --git a/drivers/net/ixgbe/ixgbe_82599.c b/drivers/net/ixgbe/ixgbe_82599.c
index e9706eb..a4e2901 100644
--- a/drivers/net/ixgbe/ixgbe_82599.c
+++ b/drivers/net/ixgbe/ixgbe_82599.c
@@ -2395,6 +2395,7 @@ static struct ixgbe_phy_operations phy_ops_82599 = {
 	.write_i2c_byte         = &ixgbe_write_i2c_byte_generic,
 	.read_i2c_eeprom        = &ixgbe_read_i2c_eeprom_generic,
 	.write_i2c_eeprom       = &ixgbe_write_i2c_eeprom_generic,
+	.check_overtemp         = &ixgbe_tn_check_overtemp,
 };
 
 struct ixgbe_info ixgbe_82599_info = {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 9551cbb..3ee702b 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -108,6 +108,8 @@ static DEFINE_PCI_DEVICE_TABLE(ixgbe_pci_tbl) = {
 	 board_82599 },
 	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_82599_CX4),
 	 board_82599 },
+	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_82599_T3_LOM),
+	 board_82599 },
 	{PCI_VDEVICE(INTEL, IXGBE_DEV_ID_82599_COMBO_BACKPLANE),
 	 board_82599 },
 
@@ -1618,6 +1620,47 @@ static void ixgbe_set_itr_msix(struct ixgbe_q_vector *q_vector)
 	}
 }
 
+/**
+ * ixgbe_check_overtemp_task - worker thread to check over temperature
+ * @work: pointer to work_struct containing our data
+ **/
+static void ixgbe_check_overtemp_task(struct work_struct *work)
+{
+	struct ixgbe_adapter *adapter = container_of(work,
+	                                             struct ixgbe_adapter,
+	                                             check_overtemp_task);
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 eicr = adapter->interrupt_event;
+
+	if (adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE) {
+		switch (hw->device_id) {
+		case IXGBE_DEV_ID_82599_T3_LOM: {
+			u32 autoneg;
+			bool link_up = false;
+
+			if (hw->mac.ops.check_link)
+				hw->mac.ops.check_link(hw, &autoneg, &link_up, false);
+
+			if (((eicr & IXGBE_EICR_GPI_SDP0) && (!link_up)) ||
+			    (eicr & IXGBE_EICR_LSC))
+				/* Check if this is due to overtemp */
+				if (hw->phy.ops.check_overtemp(hw) == IXGBE_ERR_OVERTEMP)
+					break;
+			}
+			return;
+		default:
+			if (!(eicr & IXGBE_EICR_GPI_SDP0))
+				return;
+			break;
+		}
+		e_crit("Network adapter has been stopped because it has "
+		        "over heated. Restart the computer. If the problem persists, "
+		        "power off the system and replace the adapter\n");
+		/* write to clear the interrupt */
+		IXGBE_WRITE_REG(hw, IXGBE_EICR, IXGBE_EICR_GPI_SDP0);
+	}
+}
+
 static void ixgbe_check_fan_failure(struct ixgbe_adapter *adapter, u32 eicr)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
@@ -1689,6 +1732,10 @@ static irqreturn_t ixgbe_msix_lsc(int irq, void *data)
 
 	if (hw->mac.type == ixgbe_mac_82599EB) {
 		ixgbe_check_sfp_event(adapter, eicr);
+		adapter->interrupt_event = eicr;
+		if ((adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE) &&
+		    ((eicr & IXGBE_EICR_GPI_SDP0) || (eicr & IXGBE_EICR_LSC)))
+			schedule_work(&adapter->check_overtemp_task);
 
 		/* Handle Flow Director Full threshold interrupt */
 		if (eicr & IXGBE_EICR_FLOW_DIR) {
@@ -2190,6 +2237,8 @@ static inline void ixgbe_irq_enable(struct ixgbe_adapter *adapter)
 	u32 mask;
 
 	mask = (IXGBE_EIMS_ENABLE_MASK & ~IXGBE_EIMS_RTX_QUEUE);
+	if (adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE)
+		mask |= IXGBE_EIMS_GPI_SDP0;
 	if (adapter->flags & IXGBE_FLAG_FAN_FAIL_CAPABLE)
 		mask |= IXGBE_EIMS_GPI_SDP1;
 	if (adapter->hw.mac.type == ixgbe_mac_82599EB) {
@@ -2250,6 +2299,9 @@ static irqreturn_t ixgbe_intr(int irq, void *data)
 		ixgbe_check_sfp_event(adapter, eicr);
 
 	ixgbe_check_fan_failure(adapter, eicr);
+	if ((adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE) &&
+	    ((eicr & IXGBE_EICR_GPI_SDP0) || (eicr & IXGBE_EICR_LSC)))
+		schedule_work(&adapter->check_overtemp_task);
 
 	if (napi_schedule_prep(&(q_vector->napi))) {
 		adapter->tx_ring[0]->total_packets = 0;
@@ -3265,6 +3317,13 @@ static int ixgbe_up_complete(struct ixgbe_adapter *adapter)
 		IXGBE_WRITE_REG(hw, IXGBE_EIAM, IXGBE_EICS_RTX_QUEUE);
 	}
 
+	/* Enable Thermal over heat sensor interrupt */
+	if (adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE) {
+		gpie = IXGBE_READ_REG(hw, IXGBE_GPIE);
+		gpie |= IXGBE_SDP0_GPIEN;
+		IXGBE_WRITE_REG(hw, IXGBE_GPIE, gpie);
+	}
+
 	/* Enable fan failure interrupt if media type is copper */
 	if (adapter->flags & IXGBE_FLAG_FAN_FAIL_CAPABLE) {
 		gpie = IXGBE_READ_REG(hw, IXGBE_GPIE);
@@ -3666,6 +3725,9 @@ void ixgbe_down(struct ixgbe_adapter *adapter)
 	    adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE)
 		cancel_work_sync(&adapter->fdir_reinit_task);
 
+	if (adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE)
+		cancel_work_sync(&adapter->check_overtemp_task);
+
 	/* disable transmits in the hardware now that interrupts are off */
 	for (i = 0; i < adapter->num_tx_queues; i++) {
 		j = adapter->tx_ring[i]->reg_idx;
@@ -4645,6 +4707,8 @@ static int __devinit ixgbe_sw_init(struct ixgbe_adapter *adapter)
 		adapter->max_msix_q_vectors = MAX_MSIX_Q_VECTORS_82599;
 		adapter->flags2 |= IXGBE_FLAG2_RSC_CAPABLE;
 		adapter->flags2 |= IXGBE_FLAG2_RSC_ENABLED;
+		if (hw->device_id == IXGBE_DEV_ID_82599_T3_LOM)
+			adapter->flags2 |= IXGBE_FLAG2_TEMP_SENSOR_CAPABLE;
 		if (dev->features & NETIF_F_NTUPLE) {
 			/* Flow Director perfect filter enabled */
 			adapter->flags |= IXGBE_FLAG_FDIR_PERFECT_CAPABLE;
@@ -6561,7 +6625,9 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 	}
 
 	/* reset_hw fills in the perm_addr as well */
+	hw->phy.reset_if_overtemp = true;
 	err = hw->mac.ops.reset_hw(hw);
+	hw->phy.reset_if_overtemp = false;
 	if (err == IXGBE_ERR_SFP_NOT_PRESENT &&
 	    hw->mac.type == ixgbe_mac_82598EB) {
 		/*
@@ -6730,6 +6796,8 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 	    adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE)
 		INIT_WORK(&adapter->fdir_reinit_task, ixgbe_fdir_reinit_task);
 
+	if (adapter->flags2 & IXGBE_FLAG2_TEMP_SENSOR_CAPABLE)
+		INIT_WORK(&adapter->check_overtemp_task, ixgbe_check_overtemp_task);
 #ifdef CONFIG_IXGBE_DCA
 	if (dca_add_requester(&pdev->dev) == 0) {
 		adapter->flags |= IXGBE_FLAG_DCA_ENABLED;
diff --git a/drivers/net/ixgbe/ixgbe_phy.c b/drivers/net/ixgbe/ixgbe_phy.c
index 22d21af..9c8fb85 100644
--- a/drivers/net/ixgbe/ixgbe_phy.c
+++ b/drivers/net/ixgbe/ixgbe_phy.c
@@ -135,6 +135,11 @@ static enum ixgbe_phy_type ixgbe_get_phy_type_from_id(u32 phy_id)
  **/
 s32 ixgbe_reset_phy_generic(struct ixgbe_hw *hw)
 {
+	/* Don't reset PHY if it's shut down due to overtemp. */
+	if (!hw->phy.reset_if_overtemp &&
+	    (IXGBE_ERR_OVERTEMP == hw->phy.ops.check_overtemp(hw)))
+		return 0;
+
 	/*
 	 * Perform soft PHY reset to the PHY_XS.
 	 * This will cause a soft reset to the PHY
@@ -1345,3 +1350,29 @@ s32 ixgbe_get_phy_firmware_version_tnx(struct ixgbe_hw *hw,
 	return status;
 }
 
+/**
+ *  ixgbe_tn_check_overtemp - Checks if an overtemp occurred.
+ *  @hw: pointer to hardware structure
+ *
+ *  Checks if the LASI temp alarm status was triggered due to overtemp
+ **/
+s32 ixgbe_tn_check_overtemp(struct ixgbe_hw *hw)
+{
+	s32 status = 0;
+	u16 phy_data = 0;
+
+	if (hw->device_id != IXGBE_DEV_ID_82599_T3_LOM)
+		goto out;
+
+	/* Check that the LASI temp alarm status was triggered */
+	hw->phy.ops.read_reg(hw, IXGBE_TN_LASI_STATUS_REG,
+	                     MDIO_MMD_PMAPMD, &phy_data);
+
+	if (!(phy_data & IXGBE_TN_LASI_STATUS_TEMP_ALARM))
+		goto out;
+
+	status = IXGBE_ERR_OVERTEMP;
+out:
+	return status;
+}
+
diff --git a/drivers/net/ixgbe/ixgbe_phy.h b/drivers/net/ixgbe/ixgbe_phy.h
index c9c5459..ef4ba83 100644
--- a/drivers/net/ixgbe/ixgbe_phy.h
+++ b/drivers/net/ixgbe/ixgbe_phy.h
@@ -80,6 +80,8 @@
 #define IXGBE_I2C_T_SU_STO  4
 #define IXGBE_I2C_T_BUF     5
 
+#define IXGBE_TN_LASI_STATUS_REG        0x9005
+#define IXGBE_TN_LASI_STATUS_TEMP_ALARM 0x0008
 
 s32 ixgbe_init_phy_ops_generic(struct ixgbe_hw *hw);
 s32 ixgbe_identify_phy_generic(struct ixgbe_hw *hw);
@@ -106,6 +108,7 @@ s32 ixgbe_identify_sfp_module_generic(struct ixgbe_hw *hw);
 s32 ixgbe_get_sfp_init_sequence_offsets(struct ixgbe_hw *hw,
                                         u16 *list_offset,
                                         u16 *data_offset);
+s32 ixgbe_tn_check_overtemp(struct ixgbe_hw *hw);
 s32 ixgbe_read_i2c_byte_generic(struct ixgbe_hw *hw, u8 byte_offset,
                                 u8 dev_addr, u8 *data);
 s32 ixgbe_write_i2c_byte_generic(struct ixgbe_hw *hw, u8 byte_offset,
diff --git a/drivers/net/ixgbe/ixgbe_type.h b/drivers/net/ixgbe/ixgbe_type.h
index 39b9be8..2eb6e15 100644
--- a/drivers/net/ixgbe/ixgbe_type.h
+++ b/drivers/net/ixgbe/ixgbe_type.h
@@ -51,6 +51,7 @@
 #define IXGBE_DEV_ID_82599_KX4           0x10F7
 #define IXGBE_DEV_ID_82599_KX4_MEZZ      0x1514
 #define IXGBE_DEV_ID_82599_KR            0x1517
+#define IXGBE_DEV_ID_82599_T3_LOM        0x151C
 #define IXGBE_DEV_ID_82599_CX4           0x10F9
 #define IXGBE_DEV_ID_82599_SFP           0x10FB
 #define IXGBE_DEV_ID_82599_SFP_EM        0x1507
@@ -2470,6 +2471,7 @@ struct ixgbe_phy_operations {
 	s32 (*write_i2c_byte)(struct ixgbe_hw *, u8, u8, u8);
 	s32 (*read_i2c_eeprom)(struct ixgbe_hw *, u8 , u8 *);
 	s32 (*write_i2c_eeprom)(struct ixgbe_hw *, u8, u8);
+	s32 (*check_overtemp)(struct ixgbe_hw *);
 };
 
 struct ixgbe_eeprom_info {
@@ -2518,6 +2520,7 @@ struct ixgbe_phy_info {
 	enum ixgbe_smart_speed          smart_speed;
 	bool                            smart_speed_active;
 	bool                            multispeed_fiber;
+	bool                            reset_if_overtemp;
 };
 
 #include "ixgbe_mbx.h"
@@ -2605,6 +2608,7 @@ struct ixgbe_info {
 #define IXGBE_ERR_FDIR_REINIT_FAILED            -23
 #define IXGBE_ERR_EEPROM_VERSION                -24
 #define IXGBE_ERR_NO_SPACE                      -25
+#define IXGBE_ERR_OVERTEMP                      -26
 #define IXGBE_NOT_IMPLEMENTED                   0x7FFFFFFF
 
 #endif /* _IXGBE_TYPE_H_ */
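
The decision made in ixgbe_check_overtemp_task above can be restated as a
small pure function.  This is a hedged sketch, not driver code: the
MODEL_EICR_* bit values are illustrative placeholders, the function name is
hypothetical, and the PHY alarm is passed in as a boolean whereas the driver
only reads the PHY register once the interrupt cause looks suspicious.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; see ixgbe_type.h for the real ones. */
#define MODEL_EICR_LSC      (1u << 20)
#define MODEL_EICR_GPI_SDP0 (1u << 24)

/*
 * Returns true when the over-temperature message would be logged.
 *
 * is_t3_lom: device is the new 10GBase-T part (IXGBE_DEV_ID_82599_T3_LOM)
 * eicr:      interrupt cause bits saved by the interrupt handler
 * link_up:   result of the link check
 * phy_alarm: whether the PHY reports the LASI temperature alarm
 */
static bool model_overtemp_should_report(bool is_t3_lom, uint32_t eicr,
					 bool link_up, bool phy_alarm)
{
	if (is_t3_lom) {
		/* SDP0 with link down, or any link status change, is suspicious. */
		bool suspicious = ((eicr & MODEL_EICR_GPI_SDP0) && !link_up) ||
				  (eicr & MODEL_EICR_LSC);

		/* Only report if the PHY confirms the temperature alarm. */
		return suspicious && phy_alarm;
	}

	/* Other temperature-sensor-capable devices: SDP0 alone is enough. */
	return (eicr & MODEL_EICR_GPI_SDP0) != 0;
}
```

On the T3_LOM part, a link status change without the PHY alarm yields false,
which corresponds to the early return in the worker.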


^ permalink raw reply related

* Re: [PATCH] vhost-net: utilize PUBLISH_USED_IDX feature
From: Michael S. Tsirkin @ 2010-05-19 22:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: davem, Juan Quintela, Rusty Russell, Paul E. McKenney,
	Arnd Bergmann, kvm, virtualization, netdev, linux-kernel,
	alex.williamson, amit.shah
In-Reply-To: <4BF41A33.8090309@redhat.com>

On Wed, May 19, 2010 at 08:04:51PM +0300, Avi Kivity wrote:
> On 05/18/2010 04:19 AM, Michael S. Tsirkin wrote:
>> With PUBLISH_USED_IDX, guest tells us which used entries
>> it has consumed. This can be used to reduce the number
>> of interrupts: after we write a used entry, if the guest has not yet
>> consumed the previous entry, or if the guest has already consumed the
>> new entry, we do not need to interrupt.
>> This improves bandwidth by 30% under some workflows.
>>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>> ---
>>
>> Rusty, Dave, this patch depends on the patch
>> "virtio: put last seen used index into ring itself"
>> which is currently destined at Rusty's tree.
>> Rusty, if you are taking that one for 2.6.35, please
>> take this one as well.
>> Dave, any objections?
>>    
>
> I object: I think the index should have its own cacheline,

The issue here is that host/guest do not know each
other's cache line size. I guess we could just put it
at offset 128 or something like that ... Rusty?

> and that it should be documented before merging.

I think you meant to object to the virtio patch, not this one.  This
patch does not introduce a new layout, it just implements host support.
The virtio spec patch will follow: it is not part of the Linux tree, so
there is no patch dependency.
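
For readers following the quoted patch description, the interrupt
suppression rule can be sketched as below.  This is an illustration under
assumptions, not the vhost code: it handles a single newly written used
entry, uses free-running 16-bit indices, and the function and parameter
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * old_idx:    host used index before writing the new entry
 * new_idx:    host used index after writing it (old_idx + 1, mod 2^16)
 * guest_seen: index published by the guest, i.e. how far it has consumed
 */
static bool model_need_interrupt(uint16_t old_idx, uint16_t new_idx,
				 uint16_t guest_seen)
{
	/* Entries before the new one that the guest has not consumed yet. */
	uint16_t lag = (uint16_t)(old_idx - guest_seen);

	/* This sketch covers exactly one new entry. */
	assert((uint16_t)(new_idx - old_idx) == 1);

	/* The guest has already consumed the new entry. */
	if (guest_seen == new_idx)
		return false;

	/* The guest has not yet consumed the previous entry. */
	if (lag != 0)
		return false;

	/* The guest was fully caught up, so it must be told about the entry. */
	return true;
}
```

The unsigned 16-bit subtraction keeps the comparison valid across index
wraparound, for example when old_idx is 0xFFFF and new_idx is 0.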

> -- 
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply
