Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] ixgbe: driver for Intel(R) 82598 PCI-Express 10GbE adapters (v4)
From: Andrew Morton @ 2007-09-06 22:39 UTC (permalink / raw)
  To: David Miller
  Cc: auke-jan.h.kok, netdev, jeff, ayyappan.veeraiyan, arjan, hch,
	billfink, shemminger, rick.jones2, inaky, mb, nhorman
In-Reply-To: <20070906.124422.71092816.davem@davemloft.net>

> On Thu, 06 Sep 2007 12:44:22 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
> From: "Kok, Auke" <auke-jan.h.kok@intel.com>
> Date: Thu, 06 Sep 2007 11:31:47 -0700
> 
> > Also available through git:// and http:// here:
> > 
> >    http://foo-projects.org/~sofar/ixgbe-20070905-submission.patch
> >    http://foo-projects.org/~sofar/ixgbe-20070905-submission.patch.bz2
> >    (git-am formatted!)
> > 
> >    git://lost.foo-projects.org/~ahkok/linux-2.6 ixgbe-20070905-submission
> 
> To be honest I have absolutely no problems with this driver and we
> should just cut the crap and merge it in now.
> 
> Any objections anyone makes at this point is frankly nit picking crap
> which we can cure with followon cleanups and corrections.

fyi, I've been dropping this tree because changes in the net-2.6 tree kind
of wreck it.

We can merge it into Jeff's tree OK but I'll probably disable it in Kconfig
and whoever merges second will have some fixing to do..

^ permalink raw reply

* Re: [PATCH] ixgbe: driver for Intel(R) 82598 PCI-Express 10GbE adapters (v4)
From: Kok, Auke @ 2007-09-06 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Miller, netdev, jeff, ayyappan.veeraiyan, arjan, hch,
	billfink, shemminger, rick.jones2, inaky, mb, nhorman
In-Reply-To: <20070906153950.466a5c77.akpm@linux-foundation.org>

Andrew Morton wrote:
>> On Thu, 06 Sep 2007 12:44:22 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>> From: "Kok, Auke" <auke-jan.h.kok@intel.com>
>> Date: Thu, 06 Sep 2007 11:31:47 -0700
>>
>>> Also available through git:// and http:// here:
>>>
>>>    http://foo-projects.org/~sofar/ixgbe-20070905-submission.patch
>>>    http://foo-projects.org/~sofar/ixgbe-20070905-submission.patch.bz2
>>>    (git-am formatted!)
>>>
>>>    git://lost.foo-projects.org/~ahkok/linux-2.6 ixgbe-20070905-submission
>> To be honest I have absolutely no problems with this driver and we
>> should just cut the crap and merge it in now.
>>
>> Any objections anyone makes at this point is frankly nit picking crap
>> which we can cure with followon cleanups and corrections.
> 
> fyi, I've been dropping this tree because changes in the net-2.6 tree kind
> of wreck it.
> 
> We can merge it into Jeff's tree OK but I'll probably disable it in Kconfig
> and whoever merges second will have some fixing to do..

yes, I am already preparing to fix this up like I did for e1000e. As soon as I 
have something I'll post it to you and netdev for -mm

Auke

^ permalink raw reply

* Re: [PATCH][MIPS][7/7] AR7: ethernet
From: Randy Dunlap @ 2007-09-06 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matteo Croce, linux-mips, ejka, jgarzik, netdev, davem, kuznet,
	pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20070906153025.7cb71cb1.akpm@linux-foundation.org>

On Thu, 6 Sep 2007 15:30:25 -0700 Andrew Morton wrote:

> > On Thu, 6 Sep 2007 17:34:10 +0200 Matteo Croce <technoboy85@gmail.com> wrote:
> > Driver for the cpmac 100M ethernet driver.
> > It works fine disabling napi support, enabling it gives a kernel panic
> > when the first IPv6 packet has to be forwarded.
> > Other than that works fine.
> > 
> 
> I'm not too sure why I got cc'ed on this (and not on patches 1-6?) but
> whatever.
> 
> This patch introduces quite a number of basic coding-style mistakes. 
> Please run it through scripts/checkpatch.pl and review the output.
> 
> The patch introduces vast number of volatile structure fields.  Please see
> Documentation/volatile-considered-harmful.txt.
> 
> The patch inroduces a modest number of unneeded (and undesirable) casts of
> void*, such as
> 
> +	struct cpmac_mdio_regs *regs = (struct cpmac_mdio_regs *)bus->priv;
> 
> please check for those and fix them up.
> 
> The driver implements a driver-private skb pool.  I don't know if this is
> something which we like net drivers doing?  If it is approved then surely
> there should be a common implementation for it somewhere?
> 
> The driver does a lot of open-coded dma_cache_inv() calls (in a way which
> assumes a 32-bit bus, too).  I assume that dma_cache_inv() is some mips
> thing.  I'd have thought that it would be better to use the dma mapping API
> thoughout the driver, and its associated dma invalidation APIs.
> 
> The driver has some LINUX_VERSION_CODE ifdefs.  We usually prefer that such
> code not be present in a merged-up driver.
> 
> 
> 
> > +			priv->regs->mac_hash_low = 0xffffffff;
> > +			priv->regs->mac_hash_high = 0xffffffff;
> > +		} else {
> > +			for (i = 0, iter = dev->mc_list; i < dev->mc_count;
> > +			    i++, iter = iter->next) {
> > +				hash = 0;
> > +				tmp = iter->dmi_addr[0];
> > +				hash  ^= (tmp >> 2) ^ (tmp << 4);
> > +				tmp = iter->dmi_addr[1];
> > +				hash  ^= (tmp >> 4) ^ (tmp << 2);
> > +				tmp = iter->dmi_addr[2];
> > +				hash  ^= (tmp >> 6) ^ tmp;
> > +				tmp = iter->dmi_addr[4];
> > +				hash  ^= (tmp >> 2) ^ (tmp << 4);
> > +				tmp = iter->dmi_addr[5];
> > +				hash  ^= (tmp >> 4) ^ (tmp << 2);
> > +				tmp = iter->dmi_addr[6];
> > +				hash  ^= (tmp >> 6) ^ tmp;
> > +				hash &= 0x3f;
> > +				if (hash < 32) {
> > +					hashlo |= 1<<hash;
> > +				} else {
> > +					hashhi |= 1<<(hash - 32);
> > +				}
> > +			}
> > +
> > +			priv->regs->mac_hash_low = hashlo;
> > +			priv->regs->mac_hash_high = hashhi;
> > +		}
> 
> Do we not have a library function anywhere which will perform this little
> multicasting hash?

Depends on the ethernet controller, but the ones that I know about
just use a CRC (crc-16 IIRC) calculation for the multicast hash.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: jamal @ 2007-09-06 23:06 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev, davem, jeff, mandeep.baines, ossthema
In-Reply-To: <200709061416.l86EG0Vb017675@quickie.katalix.com>

On Thu, 2007-06-09 at 15:16 +0100, James Chapman wrote:

> First, do we need to encourage consistency in NAPI poll drivers? A 
> survey of current NAPI drivers shows different strategies being used 
> in their poll(). Some such as r8169 do the napi_complete() if poll() 
> does less work than their allowed budget. Others such as e100 and tg3 
> do napi_complete() only if they do no work at all. And some drivers use 
> NAPI only for receive handling, perhaps setting txdone interrupts for 
> 1 in N transmitted packets, while others do all "interrupt" processing in 
> their poll(). Should we encourage more consistency? Should we encourage more 
> NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx 
> processing in the poll(), and do napi_complete() only when the poll does _no_ work?

not to stiffle the discussion, but Stephen Hemminger is planning to
write a new howto; that would be a good time to bring up the topic. The 
challenge is that there may be hardware issues that will result in small
deviations.

> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise 
> have gone idle means 
> that it might be called many times by the NAPI softirq while it has no work to do. 
> This wastes CPU cycles. It would be important therefore to implement the driver's 
> poll() to make this case as efficient as possible, perhaps testing for it early.

> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
> 
> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx 
> softirq 
> will loop until the next jiffy tick or when quota is exceeded, calling each device 
> in its polled 
> list. Since the idle device is still in the poll list, it will be polled very rapidly.

One suggestion on limiting the amount of polls is to actually have the
driver chew something off the quota even on empty polls - easier by just
changing the driver. A simple case will be say 1 packet (more may make
more sense, machine dependent) every time poll is invoked by the core.
This way the core algorithm continues to be fair and when the jiffies
are exceeded you bail out from the driver.

> 2. No other active device is in the poll list. The net_rx softirq will poll 
> the idle device twice 
> and then exit the softirq processing loop as if quota is exceeded. See the 
> net_rx_action() changes 
> in the patch which force the loop to exit if no work is being done by any 
> device in the poll list.
> 
> In both cases described above, the scheduler will continue NAPI processing 
> from ksoftirqd. This 
> might be very soon, especially if the system is otherwise idle. But if the 
> system is idle, do we 
> really care that idle network devices will be polled for 1-2 jiffies? 

Unfortunately the folks who have brought this up as an issue would
answer affirmatively. 
OTOH, if you can demonstrate that you spend less cycles polling rather
than letting NAPI do its thing, you will be able to make a compelling
case.

> If the system is otherwise 
> busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate 
> anyway.
> 
> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, 
> for example, keeps the device in polled mode, generating no interrupts. 

Must be a fast machine.

> In a test, 8510 packets are sent/received versus 6200 previously; 

The other packets are dropped? What are the rtt numbers like?

> CPU load is 100% versus 62% previously; 

not good.

> and 1 netdev interrupt occurs versus 12400 previously. 

good - maybe ;->

> Performance and CPU load under extreme 
> network load (using pktgen) is unchanged, as expected. 
> Most importantly though, it is no longer possible to find a combination 
> of CPU performance and traffic pattern that induce high interrupt rates. 
> And because hardware interrupt mitigation isn't used, packet latency is minimized.

I dont think youd find much win against NAPI in this case; 

> The increase in CPU load isn't surprising for a flood ping test since the CPU 
> is working to bounce packets as fast as it can. The increase in packet rate 
> is a good indicator of how much the interrupt and NAPI scheduling overhead is. 

Your results above showed decreased tput and increased cpu - did you
mistype that?

> The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration 
> of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread 
> gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive 
> _better_ system response (smoother X cursor movement etc) during the flood ping test, 

interesting - i think i did not notice something similar on my laptop
but i couldnt quantify it and it didnt seem to make sense.

> despite the CPU load being increased. For a system whose main job is processing network 
> traffic quickly, like an embedded router or a network server, this approach might be very 
> beneficial.

I am not sure i buy that James;-> The router types really have not much
of a challenge in this area.

>  For a desktop, I'm less sure, although as I said above, I've noticed no performance 
> issues in my setups to date.

> Is this worth pursuing further? I'm considering doing more work to measure the effects at 
> various relatively low packet rates. 

The standard litmus test applies since this is about performance.
Ignoring memory, the three standard net resources to worry about are
cpu, throughput and latency.  If you can show one or more of those
resources got better consistently without affecting the others across
different scenarios - you have a case to make.
For example in my experiments:
At high traffic rates, i didnt affect any of those axes.
At low rates, I was able to reduce cpu abuse, make throughput consistent
but make latency a lot worse. So this meant it was not fit to push
forward.

> I also want to investigate using High Res Timers rather 
> than jiffy sampling to reduce the idle poll time.

Mandeep also mentioned tickless - it would be interesting to see both.

>  Perhaps it is also worth trying HRT in the 
> net_rx softirq too. 

You may wanna also try the approach i did with hrt+/tickless by changing
only the driver and not the core.

> I thought it would be worth throwing the ideas out there 
> first to get early feedback.

You are doing the right thing by following the path on perfomance
analysis. I hope you dont get discouraged because the return on
investment may be very low in such work - the majority of the work is in
the testing and analysis (not in puking code endlessly). 

cheers,
jamal

^ permalink raw reply

* Re: [PATCH][MIPS][7/7] AR7: ethernet
From: Matteo Croce @ 2007-09-06 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mips, ejka, jgarzik, netdev, davem, kuznet, pekkas, jmorris,
	yoshfuji, kaber
In-Reply-To: <20070906153025.7cb71cb1.akpm@linux-foundation.org>

Il Friday 07 September 2007 00:30:25 Andrew Morton ha scritto:
> > On Thu, 6 Sep 2007 17:34:10 +0200 Matteo Croce <technoboy85@gmail.com> wrote:
> > Driver for the cpmac 100M ethernet driver.
> > It works fine disabling napi support, enabling it gives a kernel panic
> > when the first IPv6 packet has to be forwarded.
> > Other than that works fine.
> > 
> 
> I'm not too sure why I got cc'ed on this (and not on patches 1-6?) but
> whatever.

I mailed every maintainer in the respective section in the file MAINTAINERS
and you were in the "NETWORK DEVICE DRIVERS" section

> This patch introduces quite a number of basic coding-style mistakes. 
> Please run it through scripts/checkpatch.pl and review the output.

Already done. I'm collecting other suggestions before committing

> The patch introduces vast number of volatile structure fields.  Please see
> Documentation/volatile-considered-harmful.txt.

Removing them and the kernel hangs at module load

> The patch inroduces a modest number of unneeded (and undesirable) casts of
> void*, such as
> 
> +	struct cpmac_mdio_regs *regs = (struct cpmac_mdio_regs *)bus->priv;
> 
> please check for those and fix them up.

Done

> The driver implements a driver-private skb pool.  I don't know if this is
> something which we like net drivers doing?  If it is approved then surely
> there should be a common implementation for it somewhere?

Are you referring at cpmac_poll?

> The driver has some LINUX_VERSION_CODE ifdefs.  We usually prefer that such
> code not be present in a merged-up driver.

I will remove in the final release, now I need for testing: my running kernel
is older than current git

> 
> > +			priv->regs->mac_hash_low = 0xffffffff;
> > +			priv->regs->mac_hash_high = 0xffffffff;
> > +		} else {
> > +			for (i = 0, iter = dev->mc_list; i < dev->mc_count;
> > +			    i++, iter = iter->next) {
> > +				hash = 0;
> > +				tmp = iter->dmi_addr[0];
> > +				hash  ^= (tmp >> 2) ^ (tmp << 4);
> > +				tmp = iter->dmi_addr[1];
> > +				hash  ^= (tmp >> 4) ^ (tmp << 2);
> > +				tmp = iter->dmi_addr[2];
> > +				hash  ^= (tmp >> 6) ^ tmp;
> > +				tmp = iter->dmi_addr[4];
> > +				hash  ^= (tmp >> 2) ^ (tmp << 4);
> > +				tmp = iter->dmi_addr[5];
> > +				hash  ^= (tmp >> 4) ^ (tmp << 2);
> > +				tmp = iter->dmi_addr[6];
> > +				hash  ^= (tmp >> 6) ^ tmp;
> > +				hash &= 0x3f;
> > +				if (hash < 32) {
> > +					hashlo |= 1<<hash;
> > +				} else {
> > +					hashhi |= 1<<(hash - 32);
> > +				}
> > +			}
> > +
> > +			priv->regs->mac_hash_low = hashlo;
> > +			priv->regs->mac_hash_high = hashhi;
> > +		}
> 
> Do we not have a library function anywhere which will perform this little
> multicasting hash?

Can you tell me the function so i'll implement it?

> > +static inline struct sk_buff *cpmac_rx_one(struct net_device *dev,
> > +					   struct cpmac_priv *priv,
> > +					   struct cpmac_desc *desc)
> > +{
> > +	unsigned long flags;
> > +	char *data;
> > +	struct sk_buff *skb, *result = NULL;
> > +
> > +	priv->regs->rx_ack[0] = virt_to_phys(desc);
> > +	if (unlikely(!desc->datalen)) {
> > +		if (printk_ratelimit())
> > +			printk(KERN_WARNING "%s: rx: spurious interrupt\n",
> > +			       dev->name);
> > +		priv->stats.rx_errors++;
> > +		return NULL;
> > +	}
> > +
> > +	spin_lock_irqsave(&priv->lock, flags);
> > +	skb = cpmac_get_skb(dev);
> > +	if (likely(skb)) {
> > +		data = (char *)phys_to_virt(desc->hw_data);
> > +		dma_cache_inv((u32)data, desc->datalen);
> > +		skb_put(desc->skb, desc->datalen);
> > +		desc->skb->protocol = eth_type_trans(desc->skb, dev);
> > +		desc->skb->ip_summed = CHECKSUM_NONE;
> > +		priv->stats.rx_packets++;
> > +		priv->stats.rx_bytes += desc->datalen;
> > +		result = desc->skb;
> > +		desc->skb = skb;
> > +	} else {
> > +#ifdef CPMAC_DEBUG
> > +		if (printk_ratelimit())
> > +			printk("%s: low on skbs, dropping packet\n",
> > +			       dev->name);
> > +#endif
> > +		priv->stats.rx_dropped++;
> > +	}
> > +	spin_unlock_irqrestore(&priv->lock, flags);
> > +
> > +	desc->hw_data = virt_to_phys(desc->skb->data);
> > +	desc->buflen = CPMAC_SKB_SIZE;
> > +	desc->dataflags = CPMAC_OWN;
> > +	dma_cache_wback((u32)desc, 16);
> > +
> > +	return result;
> > +}
> 
> This function is far too large to be inlined.
> 
> > +static irqreturn_t cpmac_irq(int irq, void *dev_id)
> > +{
> > +	struct net_device *dev = (struct net_device *)dev_id;
> 
> unneeded cast

fixed

> > +static int __devinit cpmac_probe(struct platform_device *pdev)
> > +{
> > +	int i, rc, phy_id;
> > +	struct resource *res;
> > +	struct cpmac_priv *priv;
> > +	struct net_device *dev;
> > +	struct plat_cpmac_data *pdata;
> > +
> > +	if (strcmp(pdev->name, "cpmac") != 0)
> > +		return -ENODEV;
> 
> I don't think this can happen?  If it can, something is pretty screwed up?

Hehe, so screwed that you won't care about your ethernet ;)

> > +	pdata = pdev->dev.platform_data;
> > +
> > +	for (phy_id = 0; phy_id < PHY_MAX_ADDR; phy_id++) {
> > +		if (!(pdata->phy_mask & (1 << phy_id)))
> > +			continue;
> > +		if (!cpmac_mii.phy_map[phy_id])
> > +			continue;
> > +		break;
> > +	}
> > +
> > +	if (phy_id == PHY_MAX_ADDR) {
> > +		if (external_switch) {
> > +			phy_id = 0;
> > +		} else {
> > +			printk("cpmac: no PHY present\n");
> > +			return -ENODEV;
> > +		}
> > +	}
> > +
> > +	dev = alloc_etherdev(sizeof(struct cpmac_priv));
> > +
> > +	if (!dev) {
> > +		printk(KERN_ERR "cpmac: Unable to allocate net_device structure!\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	SET_MODULE_OWNER(dev);
> > +	platform_set_drvdata(pdev, dev);
> > +	priv = netdev_priv(dev);
> > +
> > +	res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
> > +	if (!res) {
> > +		rc = -ENODEV;
> > +		goto fail;
> > +	}
> > +
> > +	dev->mem_start = res->start;
> > +	dev->mem_end = res->end;
> > +	dev->irq = platform_get_irq_byname(pdev, "irq");
> > +
> > +	dev->mtu                = 1500;
> > +	dev->open               = cpmac_open;
> > +	dev->stop               = cpmac_stop;
> > +	dev->set_config         = cpmac_config;
> > +	dev->hard_start_xmit    = cpmac_start_xmit;
> > +	dev->do_ioctl           = cpmac_ioctl;
> > +	dev->get_stats          = cpmac_stats;
> > +	dev->change_mtu         = cpmac_change_mtu;
> > +	dev->set_mac_address    = cpmac_set_mac_address;
> > +	dev->set_multicast_list = cpmac_set_multicast_list;
> > +	dev->tx_timeout         = cpmac_tx_timeout;
> > +	dev->ethtool_ops        = &cpmac_ethtool_ops;
> > +	if (!disable_napi) {
> > +		dev->poll = cpmac_poll;
> > +		dev->weight = min(rx_ring_size, 64);
> > +	}
> > +
> > +	memset(priv, 0, sizeof(struct cpmac_priv));
> 
> I think alloc_etherdev() already did that?

What? zeroing the memory or other stuff?

> > +	spin_lock_init(&priv->lock);
> > +	priv->msg_enable = netif_msg_init(NETIF_MSG_WOL, 0x3fff);
> > +	priv->config = pdata;
> > +	priv->dev = dev;
> > +	memcpy(dev->dev_addr, priv->config->dev_addr, sizeof(dev->dev_addr));
> > +	if (phy_id == 31) {
> > +		snprintf(priv->phy_name, BUS_ID_SIZE, PHY_ID_FMT,
> > +			 cpmac_mii.id, phy_id);
> > +	} else {
> > +		snprintf(priv->phy_name, BUS_ID_SIZE, "fixed@%d:%d", 100, 1);
> > +	}
> > +
> > +	if ((rc = register_netdev(dev))) {
> > +		printk("cpmac: error %i registering device %s\n",
> > +		       rc, dev->name);
> > +		goto fail;
> > +	}
> > +
> > +	printk("cpmac: device %s (regs: %p, irq: %d, phy: %s, mac: ",
> > +	       dev->name, (u32 *)dev->mem_start, dev->irq,
> > +	       priv->phy_name);
> > +	for (i = 0; i < 6; i++)
> > +		printk("%02x%s", dev->dev_addr[i], i < 5 ? ":" : ")\n");
> > +
> > +	return 0;
> > +
> > +fail:
> > +	free_netdev(dev);
> > +	return rc;
> > +}
> > +

What about this?

Thanks for Your attention,
Matteo Croce

^ permalink raw reply

* Re: [PATCH][MIPS][7/7] AR7: ethernet
From: Andrew Morton @ 2007-09-07  0:41 UTC (permalink / raw)
  To: Matteo Croce
  Cc: linux-mips, ejka, jgarzik, netdev, davem, kuznet, pekkas, jmorris,
	yoshfuji, kaber
In-Reply-To: <200709070121.42629.technoboy85@gmail.com>

> On Fri, 7 Sep 2007 01:21:41 +0200 Matteo Croce <technoboy85@gmail.com> wrote:
> > The patch introduces vast number of volatile structure fields.  Please see
> > Documentation/volatile-considered-harmful.txt.
> 
> Removing them and the kernel hangs at module load

They can't just be removed.  Please see the document.  There are I/O APIs
which, if properly used, make volatile unneeded.

^ permalink raw reply

* bug in arp handling 2.6.20
From: Joseph Southwell @ 2007-09-07  0:41 UTC (permalink / raw)
  To: netdev

scenario
Machine A
eth0 is plugged  into network.
192.168.1.201
eth0:1
192.168.1.2

Machine B
eth1 is plugged into network.
192.168.1.101
eth0 is not plugged into network. so I can test locally before  
switching the wire.
192.168.1.201
eth0:1
192.168.1.2

arp request for 192.168.1.2 is responded to by eth1 on machine B.  
That seems buggy to me. worse, it continues to do it after  
unconfiguring eth0:1 and eth0.

^ permalink raw reply

* Re: bug in arp handling 2.6.20
From: Rick Jones @ 2007-09-07  1:16 UTC (permalink / raw)
  To: Joseph Southwell; +Cc: netdev
In-Reply-To: <50155211-787D-48AE-B9D8-7FD18406C8E3@ontoserve.com>

Joseph Southwell wrote:
> scenario
> Machine A
> eth0 is plugged  into network.
> 192.168.1.201
> eth0:1
> 192.168.1.2
> 
> 
> Machine B
> eth1 is plugged into network.
> 192.168.1.101
> eth0 is not plugged into network. so I can test locally before  
> switching the wire.
> 192.168.1.201
> eth0:1
> 192.168.1.2
> 
> arp request for 192.168.1.2 is responded to by eth1 on machine B.  That 
> seems buggy to me. 

Does it still do so if you set either "arp_ignore" or "arp_filter" (I 
can never keep those two straight)?  Could be the "weak-end-system" 
design of the stack at work here.  That allows any interface in the 
system to respond to an ARP request for any of the IP's assigned to any 
of the interfaces.

> worse, it continues to do it after  unconfiguring 
> eth0:1 and eth0.

As in simply marking it "down?"

rick jones

^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: Mandeep Singh Baines @ 2007-09-07  3:55 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
In-Reply-To: <200709061416.l86EG0Vb017675@quickie.katalix.com>

Hi James,

I like the idea of staying in poll longer.

My comments are similar to what Jamal and Stephen have already said.

A tunable (via sysfs) would be nice.

A timer might be preferred to jiffy polling. Jiffy polling will not increase 
latency the way a timer would. However, jiffy polling will consume a lot more
CPU than a timer would. Hence more power. For jiffy polling, you could have 
thousands of calls to poll for a single packet received. While in a timer 
approach the numbers of polls per packet is upper bound to 2. 

I think it may difficult to make poll efficient for the no packet case because,
at a minimum, you have to poll the device state via the has_work method.

If you go to a timer implementation then having a tunable will be important.
Different appications will have different requirements on delay and jitter.
Some applications may want to trade delay/jitter for less CPU/power 
consumption and some may not.

imho, the work should definately be pursued further:)

Regards,
Mandeep

James Chapman (jchapman@katalix.com) wrote:
> This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce interrupt rate for the low packet rate / fast CPU case is described. 
> 
> First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some such as r8169 do the napi_complete() if poll() does less work than their allowed budget. Others such as e100 and tg3 do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?
> 
> One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.
> 
> By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices that are being actively p
 olled in softirq processing are doing no work on each poll and to exit the softirq loop rather than wasting CPU cycles.
> 
> The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test. Since a driver that returns 0 from its poll() while leaving itself in polled mode would now used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().
> 
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would be important therefore to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.
> 
> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
> 
> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or when quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.
> 
> 2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota is exceeded. See the net_rx_action() changes in the patch which force the loop to exit if no work is being done by any device in the poll list.
> 
> In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate anyway.
> 
> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts. In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) is unchanged, as expected. Most importantly though, it is no longer possible to find a combination of CPU performance and traffic pattern that induce high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.
> 
> The increase in CPU load isn't surprising for a flood ping test since the CPU is working to bounce packets as fast as it can. The increase in packet rate is a good indicator of how much the interrupt and NAPI scheduling overhead is. The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure, 
 although as I said above, I've noticed no performance issues in my setups to date.
> 
> 
> Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.
> 
> Here's the patch.
> 
> Index: linux-2.6/drivers/net/e100.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/e100.c
> +++ linux-2.6/drivers/net/e100.c
> @@ -544,6 +544,7 @@ struct nic {
>  	struct cb *cb_to_use;
>  	struct cb *cb_to_send;
>  	struct cb *cb_to_clean;
> +	unsigned long exit_poll_time;
>  	u16 tx_command;
>  	/* End: frequently used values: keep adjacent for cache effect */
>  
> @@ -1993,12 +1994,35 @@ static int e100_poll(struct napi_struct 
>  	e100_rx_clean(nic, &work_done, budget);
>  	tx_cleaned = e100_tx_clean(nic);
>  
> -	/* If no Rx and Tx cleanup work was done, exit polling mode. */
> -	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
> +	if (!netif_running(netdev)) {
>  		netif_rx_complete(netdev, napi);
>  		e100_enable_irq(nic);
> +		return 0;
>  	}
>  
> +	/* Stay in polled mode if we do any tx cleanup */
> +	if (tx_cleaned)
> +		work_done++;
> +
> +	/* If no Rx and Tx cleanup work was done, exit polling mode if
> +	 * we've seen no work for 1-2 jiffies.
> +	 */
> +	if (work_done == 0) {
> +		if (nic->exit_poll_time) {
> +			if (time_after(jiffies, nic->exit_poll_time)) {
> +				nic->exit_poll_time = 0;
> +				netif_rx_complete(netdev, napi);
> +				e100_enable_irq(nic);
> +			}
> +		} else {
> +			nic->exit_poll_time = jiffies + 2;
> +		}
> +		return 0;
> +	}
> +
> +	/* Otherwise, reset poll exit time and stay in poll list */
> +	nic->exit_poll_time = 0;
> +
>  	return work_done;
>  }
>  
> Index: linux-2.6/drivers/net/tg3.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/tg3.c
> +++ linux-2.6/drivers/net/tg3.c
> @@ -3473,6 +3473,24 @@ static int tg3_poll(struct napi_struct *
>  	struct tg3_hw_status *sblk = tp->hw_status;
>  	int work_done = 0;
>  
> +	/* fastpath having no work while we're holding ourself in
> +	 * polled mode
> +	 */
> +	if ((tp->exit_poll_time) && (!tg3_has_work(tp))) {
> +		if (time_after(jiffies, tp->exit_poll_time)) {
> +			tp->exit_poll_time = 0;
> +			/* tell net stack and NIC we're done */
> +			netif_rx_complete(netdev, napi);
> +			tg3_restart_ints(tp);
> +		}
> +		return 0;
> +	}
> +
> +	/* if we get here, there might be work to do so disable the
> +	 * poll hold fastpath above
> +	 */
> +	tp->exit_poll_time = 0;
> +
>  	/* handle link change and other phy events */
>  	if (!(tp->tg3_flags &
>  	      (TG3_FLAG_USE_LINKCHG_REG |
> @@ -3511,11 +3529,11 @@ static int tg3_poll(struct napi_struct *
>  	} else
>  		sblk->status &= ~SD_STATUS_UPDATED;
>  
> -	/* if no more work, tell net stack and NIC we're done */
> -	if (!tg3_has_work(tp)) {
> -		netif_rx_complete(netdev, napi);
> -		tg3_restart_ints(tp);
> -	}
> +	/* if no more work, set the time in jiffies when we should
> +	 * exit polled mode
> +	 */
> +	if (!tg3_has_work(tp))
> +		tp->exit_poll_time = jiffies + 2;
>  
>  	return work_done;
>  }
> Index: linux-2.6/drivers/net/tg3.h
> ===================================================================
> --- linux-2.6.orig/drivers/net/tg3.h
> +++ linux-2.6/drivers/net/tg3.h
> @@ -2163,6 +2163,7 @@ struct tg3 {
>  	u32				last_tag;
>  
>  	u32				msg_enable;
> +	unsigned long			exit_poll_time;
>  
>  	/* begin "tx thread" cacheline section */
>  	void				(*write32_tx_mbox) (struct tg3 *, u32,
> Index: linux-2.6/net/core/dev.c
> ===================================================================
> --- linux-2.6.orig/net/core/dev.c
> +++ linux-2.6/net/core/dev.c
> @@ -2073,6 +2073,8 @@ static void net_rx_action(struct softirq
>  	unsigned long start_time = jiffies;
>  	int budget = netdev_budget;
>  	void *have;
> +	struct napi_struct *last_hold = NULL;
> +	int done = 0;
>  
>  	local_irq_disable();
>  	list_replace_init(&__get_cpu_var(softnet_data).poll_list, &list);
> @@ -2082,7 +2084,7 @@ static void net_rx_action(struct softirq
>  		struct napi_struct *n;
>  
>  		/* if softirq window is exhuasted then punt */
> -		if (unlikely(budget <= 0 || jiffies != start_time)) {
> +		if (unlikely(budget <= 0 || jiffies != start_time || done)) {
>  			local_irq_disable();
>  			list_splice(&list, &__get_cpu_var(softnet_data).poll_list);
>  			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
> @@ -2096,12 +2098,28 @@ static void net_rx_action(struct softirq
>  
>  		list_del(&n->poll_list);
>  
> -		/* if quota not exhausted process work */
> +		/* if quota not exhausted process work. We special
> +		 * case on n->poll() returning 0 here when the driver
> +		 * doesn't do a napi_complete(), which indicates that
> +		 * the device wants to stay on the poll list although
> +		 * it did no work. We remember the first device to go
> +		 * into this state in order to terminate this loop if
> +		 * we see the same device again without doing any
> +		 * other work.
> +		 */
>  		if (likely(n->quota > 0)) {
>  			int work = n->poll(n, min(budget, n->quota));
>  
> -			budget -= work;
> -			n->quota -= work;
> +			if (likely(work)) {
> +				budget -= work;
> +				n->quota -= work;
> +				last_hold = NULL;
> +			} else if (test_bit(NAPI_STATE_SCHED, &n->state)) {
> +				if (unlikely(n == last_hold))
> +					done = 1;
> +				if (likely(!last_hold))
> +					last_hold = n;
> +			}
>  		}
>  
>  		/* if napi_complete not called, reschedule */

^ permalink raw reply

* wrt Age Entry For IPv4 & IPv6 Route Table
From: Varun Chandramohan @ 2007-09-07  5:31 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Varun Chandramohan

Hi Dave,

                  Last week i sent out the rev-3 of the age patch set,
and i got no comments . So i assume the people who reviewed it earlier
are ok with it. Can you please go through it and push it upstream?


Regards,
Varun

^ permalink raw reply

* Re: request for information about the "ath5k" licensing
From: Reyk Floeter @ 2007-09-07  6:47 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: Reyk Floeter, linux-kernel, linux-wireless, netdev, jirislaby,
	moglen
In-Reply-To: <43e72e890709051000m74252604paff3eb1651f55d22@mail.gmail.com>

On Wed, Sep 05, 2007 at 01:00:15PM -0400, Luis R. Rodriguez wrote:
> On 9/5/07, Michael Buesch <mb@bu3sch.de> wrote:
> > On Wednesday 05 September 2007, Reyk Floeter wrote:
> > > I'm the author of the free hardware driver layer for wireless Atheros
> > > devices in OpenBSD, also known as "OpenHAL".
> > >
> > > I'm still trying to get an idea about the facts and the latest state
> > > of the incidence that violated the copyright of my code, because I
> > > just returned from vacation.
> >
> > > Could you please give me some feedback about the latest state? Please
> > > reply in private, I'm not subscribed to any of the Linux lists and I'm
> > > rather interested in facts than in the usual trolling.
> > >
> > > - Has this issue been fixed?
> >
> > It has never been applied to any repository.
> > -> No issue and no copyright violation.
> >
> > > - Is there any repository available with the "ath5k" code using a
> > > modified/"extended" license?
> >
> > No.
> 
> Well that is not accurate. Please give us a few we are working on
> verifying some information for you.
> 
> > > I don't know how to find the relevant bits in the various Linux git
> > > repositories. Sorry, I don't get the structure of it. Are there any
> > > other sources online except this diff on the linux kernel mailing
> > > list?
> > >
> > > - Are there any plans to release the "ath5k" code using a
> > > modified/"extended" license?
> >
> > No.
> >
> 
> Same here. Apologies for this taking so long. It'll all be sorted out soon.
> 

I'm still waiting for an answer. Your process is taking too long.

Reyk

^ permalink raw reply

* Re: [PATCH][MIPS][7/7] AR7: ethernet
From: Geert Uytterhoeven @ 2007-09-07  7:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matteo Croce, linux-mips, ejka, jgarzik, netdev, davem, kuznet,
	pekkas, jmorris, yoshfuji, kaber
In-Reply-To: <20070906153025.7cb71cb1.akpm@linux-foundation.org>

On Thu, 6 Sep 2007, Andrew Morton wrote:
> > On Thu, 6 Sep 2007 17:34:10 +0200 Matteo Croce <technoboy85@gmail.com> wrote:
> > Driver for the cpmac 100M ethernet driver.
> > It works fine disabling napi support, enabling it gives a kernel panic
> > when the first IPv6 packet has to be forwarded.
> > Other than that works fine.
> 
> The driver does a lot of open-coded dma_cache_inv() calls (in a way which
> assumes a 32-bit bus, too).  I assume that dma_cache_inv() is some mips

No, even i386 has it ;-)

> thing.  I'd have thought that it would be better to use the dma mapping API
> thoughout the driver, and its associated dma invalidation APIs.

However, Ralf just posted a patch to remove it on all architectures, and
driver writers should consider it gone.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply

* 2.6.23-rc4-mm1: e1000e napi lockup
From: Jiri Slaby @ 2007-09-07  7:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: netdev, e1000-devel, Auke Kok, David S. Miller

Hi,

I found a regression in 2.6.23-rc4-mm1 (since -rc3-mm1) in e1000e driver.
napi_disable(&adapter->napi) in e1000_probe freezes the kernel on boot.

regards,
-- 
Jiri Slaby (jirislaby@gmail.com)
Faculty of Informatics, Masaryk University

^ permalink raw reply

* PROBLEM: Oops with forcedeth and netkey in 2.6.21 and 22
From: Alexandre Ghisoli @ 2007-09-06 22:49 UTC (permalink / raw)
  To: netdev

Hi there, 

I cannot get my AMD64 working with forcedeth network chip and Netkey.

While recompiling kernel to get iptables and OpenSWAN working, I cannot
anymore boot my computer, it freeze on network setup.

After some reboots / recompile, I've traced the problem arround NetKEY.
If I enable it in the kernel, I'm getting oops.

Starting in single, I've been able to see errors comming because
dhcpclient process and af_packet module. If I dont load af_packet at
boot, i can setup manually an ip address. Unfortunaty, when lauching
gnome, my computer hang (probably some process tries to load
af_packets ?)

The NIC is on-board NIC on the MSI Neo4 Platinum motherboard (product
MS-7125)

I've tried thoses kernel version, same behavior :
 - 2.6.21.6
 - 2.6.22.1
 - 2.6.22.6


Here is the oops I got (dmesg captured) :

skb_under_panic: text:c02b089c len:14 put:14 head:f74e8410 data:f74e8402
tail:f74e8400 end:f74e8580 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:111!
invalid opcode: 0000 [#1]
PREEMPT SMP
Modules linked in: af_packet usbhid sha256 sha1 hmac crypto_hash des
crypto_algapi af_key xfrm_user ohci_hcd parport_pc ehci_hcd usbcore
parport ohci1394 ieee1394 nvidia(P) floppy forcedeth sg
CPU:    1
EIP:    0060:[<c029f0c9>]    Tainted: P       VLI
EFLAGS: 00010292   (2.6.21.6 #2)
EIP is at skb_under_panic+0x59/0x60
eax: 00000072   ebx: f74e8410   ecx: f72a2000   edx: 00000000
esi: 00000000   edi: 00000800   ebp: 00000000   esp: f72a3d6c
ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
Process dhcpcd (pid: 3723, ti=f72a2000 task=f76bd030 task.ti=f72a2000)
Stack: c0398bd0 c02b089c 0000000e 0000000e f74e8410 f74e8402 f74e8400
f74e8580
       c037d386 f74e8402 f7fa4c80 c02b08a1 f79b4000 f72a3f40 f79b4000
c215a900
       f7fa4c80 f8b95cca f72a3ecc 00000000 00000148 c016c455 f7a92600
0008cae0
Call Trace:
 [<c02b089c>] eth_header+0x10c/0x120
 [<c02b08a1>] eth_header+0x111/0x120
 [<f8b95cca>] packet_sendmsg+0x14a/0x260 [af_packet]
 [<c016c455>] link_path_walk+0x65/0xc0
 [<c0299dbe>] sock_sendmsg+0xce/0x100
 [<c0131180>] autoremove_wake_function+0x0/0x40
 [<c016d3d4>] path_lookup+0x14/0x20
 [<c02fdb10>] unix_find_other+0x30/0x1a0
 [<c029a193>] sys_sendto+0x133/0x180
 [<c029b1ce>] sys_socketcall+0x14e/0x280
 [<c0102c7e>] sysenter_past_esp+0x5f/0x85
 =======================
Code: 00 00 89 5c 24 14 8b 98 90 00 00 00 89 54 24 0c 89 5c 24 10 8b 40
60 89 4c 24 04 c7 04 24 d0 8b 39 c0 89 44 24 08 e8 57 f4 e7 ff <0f> 0b
eb fe 8d 76 00 56 53 bb 86 d3 37 c0 83 ec 24 8b 70 14 85
EIP: [<c029f0c9>] skb_under_panic+0x59/0x60 SS:ESP 0068:f72a3d6c

Here is my running environement :
Linux amd64-linux 2.6.21.6 #1 SMP PREEMPT Thu Sep 6 23:33:04 CEST 2007
i686 AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD
GNU/Linux
 
Gnu C                  4.2.0
Gnu make               3.81
binutils               Binutils
util-linux             2.12r
mount                  2.12r
module-init-tools      3.2.2
e2fsprogs              1.40.2
reiserfsprogs          3.6.19
PPP                    2.4.4
Linux C Library        > libc.2.6
Dynamic linker (ldd)   2.6
Procps                 3.2.7
Net-tools              1.60
Kbd                    1.13
Sh-utils               6.9
udev                   114

And my on-board NIC :
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
        Subsystem: Micro-Star International Co., Ltd. Unknown device
7125
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 0 (250ns min, 5000ns max)
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at fe029000 (32-bit, non-prefetchable)
[size=4K]
        Region 1: I/O ports at b400 [size=8]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1
+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=0 PME-

Thanks for your help.

--Alexandre


^ permalink raw reply

* Re: wrt Age Entry For IPv4 & IPv6 Route Table
From: David Miller @ 2007-09-07  8:00 UTC (permalink / raw)
  To: varunc; +Cc: netdev, varuncha
In-Reply-To: <46E0E241.4000101@linux.vnet.ibm.com>

I'm trevelling otherwise I would have reviewed and integrated
or given feedback for changes.

I'll be back late next week.

^ permalink raw reply

* Re: 2.6.23-rc4-mm1: e1000e napi lockup
From: David Miller @ 2007-09-07  8:03 UTC (permalink / raw)
  To: jirislaby; +Cc: akpm, netdev, e1000-devel, auke-jan.h.kok
In-Reply-To: <46E0FB82.2040000@gmail.com>

From: Jiri Slaby <jirislaby@gmail.com>
Date: Fri, 07 Sep 2007 09:19:30 +0200

> I found a regression in 2.6.23-rc4-mm1 (since -rc3-mm1) in e1000e driver.
> napi_disable(&adapter->napi) in e1000_probe freezes the kernel on boot.

Yes, the semantics changed slightly in the net-2.6.24 tree the
other week and someone needs to fix it up.

The netif_napi_add() implicitly does a napi_disable() call.  Device
open must explicitly napi_enable() and device close must explicitly
napi_disable(), and if done elsewhere these calls must be strictly
balanced.

^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: James Chapman @ 2007-09-07  9:31 UTC (permalink / raw)
  To: hadi; +Cc: netdev, davem, jeff, mandeep.baines, ossthema, Stephen Hemminger
In-Reply-To: <1189120020.4259.68.camel@localhost>

jamal wrote:
> On Thu, 2007-06-09 at 15:16 +0100, James Chapman wrote:
> 
 >> First, do we need to encourage consistency in NAPI poll drivers?

> not to stiffle the discussion, but Stephen Hemminger is planning to
> write a new howto; that would be a good time to bring up the topic. The 
> challenge is that there may be hardware issues that will result in small
> deviations.

Ok.

>> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
>>
>> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx 
>> softirq 
>> will loop until the next jiffy tick or when quota is exceeded, calling each device 
>> in its polled 
>> list. Since the idle device is still in the poll list, it will be polled very rapidly.
> 
> One suggestion on limiting the amount of polls is to actually have the
> driver chew something off the quota even on empty polls - easier by just
> changing the driver. A simple case will be say 1 packet (more may make
> more sense, machine dependent) every time poll is invoked by the core.

I wanted to minimize the impact on devices that do have work to do. But 
it's worth investigating. Thanks for the suggestion.

>> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, 
>> for example, keeps the device in polled mode, generating no interrupts. 
> 
> Must be a fast machine.

Not really. I used 3-year-old, single CPU x86 boxes with e100 
interfaces. The idle poll change keeps them in polled mode. Without idle 
poll, I get twice as many interrupts as packets, one for txdone and one 
for rx. NAPI is continuously scheduled in/out.

>> In a test, 8510 packets are sent/received versus 6200 previously; 
> 
> The other packets are dropped? 

No. Since I did a flood ping from the machine under test, the improved 
latency meant that the ping response was handled more quickly, causing 
the next packet to be sent sooner. So more packets were transmitted in 
the allotted time (10 seconds).

> What are the rtt numbers like?

With current NAPI:
rtt min/avg/max/mdev = 0.902/1.843/101.727/4.659 ms, pipe 9, ipg/ewma 
1.611/1.421 ms

With idle poll changes:
rtt min/avg/max/mdev = 0.898/1.117/28.371/0.689 ms, pipe 3, ipg/ewma 
1.175/1.236 ms

>> CPU load is 100% versus 62% previously; 
> 
> not good.

But the CPU has done more work. The flood ping will always show 
increased CPU with these changes because the driver always stays in the 
NAPI poll list. For typical LAN traffic, the average CPU usage doesn't 
increase as much, though more measurements would be useful.

> Your results above showed decreased tput and increased cpu - did you
> mistype that?

I didn't use clear English. :) I'm seeing increased throughput, mostly 
because latency is improved. The increased cpu is partly because of the 
increased throughput, and partly because ksoftirqd stays busy longer.

>> despite the CPU load being increased. For a system whose main job is processing network 
>> traffic quickly, like an embedded router or a network server, this approach might be very 
>> beneficial.
> 
> I am not sure i buy that James;-> The router types really have not much
> of a challenge in this area.

The problem I started thinking about was the one where NAPI thrashes 
in/out of polled mode at higher and higher rates as network interface 
speeds and CPU speeds increase. A flood ping demonstrates this even on 
100M links on my boxes. Networking boxes want consistent 
performance/latency for all traffic patterns and they need to avoid 
interrupt livelock. Current practice seems to be to use hardware 
interrupt mitigation or timers to limit interrupt rate but this just 
hurts latency, as you noted. So I'm trying to find a way to limit the 
NAPI interrupt rate without increasing latency. My comment about this 
approach being suitable for routers and networked servers is that these 
boxes care more about minimizing packet latency than they do about 
wasting CPU cycles by polling idle devices.

> You are doing the right thing by following the path on perfomance
> analysis. I hope you dont get discouraged because the return on
> investment may be very low in such work - the majority of the work is in
> the testing and analysis (not in puking code endlessly). 

Thanks for your feedback. The challenge will be finding the time to do 
this work. :)

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

^ permalink raw reply

* your free lifetime membership
From: Vivian @ 2007-09-07  9:28 UTC (permalink / raw)
  To: netdev

uncrucified

no cent to pay! http://shesissy.cn

sailing


^ permalink raw reply

* auto recycling of TIME_WAIT connections
From: Pádraig Brady @ 2007-09-07  9:21 UTC (permalink / raw)
  To: netdev

As I see it, TIME_WAIT state is required for 2 reasons:

  to handle wandering duplicate packets
  (so a reincarnation of a connection will not be corrupted by these packets)

  To handle last ack from active closer (client) not being received by remote.
  If that happened, the server which is in LAST_ACK state would retransmit its FIN
  (which may contain data also) so the client must be in TIME_WAIT state to handle that.
  If client is not in TIME_WAIT state, then it could only indicate to the server
  that data was maybe lost (with an RST).

The first issue, requires a large timeout, and
the TIME_WAIT timeout is currently 60 seconds on linux.
That timeout effectively limits the connection rate between
local TCP clients and a server to 32k/60s or around 500 connections/second.

But that issue can't really happen when the client
and server are on the same machine can it, and
even if it could, the timeouts involved would be shorter.

Now linux does have an (undocumented) /proc/sys/net/ipv4/tcp_tw_recycle flag
to enable recycling of TIME_WAIT connections. This is global however and could cause
problems in general for external connections.

So how about auto enabling recycling for local connections?

cheers,
Pádraig.

^ permalink raw reply

* new NAPI interface broken
From: Jan-Bernd Themann @ 2007-09-07  9:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, themann, Christoph Raisch

Hi Stephen,

I saw that you developed most of the new NAPI interface.
I already addressed this issue a while ago. Please correct me if I got
it wrong. I think there is still a serious problem with the NAPI
changes to make NAPI polling independent of struct net_device objects.
Its about the question who inserts and removes devices from the poll list.

netif_rx_schedule: sets NAPI_STATE_SCHED flag, insert device in poll list.
netif_rx_complete: clears NAPI_STATE_SCHED
netif_rx_reschedule: sets NAPI_STATE_SCHED, insert device in poll list.
net_rx_action: 
 -removes dev from poll list
 -calls poll function
 -adds dev to poll list if NAPI_STATE_SCHED still set

1) netif_rx_complete and netif_rx_reschedule don't work together
2) On SMP systems: after netif_rx_complete has been called on CPU1
   (+interruts enabled), netif_rx_schedule could be called on CPU2 
   (irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. 
   In that case the device would be added to poll lists of CPU1 and CPU2
   as net_rx_action would see NAPI_STATE_SCHED set.
   This must not happen. It will be caught when netif_rx_complete is
   called the second time (BUG() called)

This would mean we have a problem on all SMP machines right now.

If I got all this right then we probably need a further flag to tell
net_rx_action whether to poll again or to stop (with the possibility
that the device has been scheduled on a different CPU in between).
The "old" NAPI interface uses the return value of poll to determine
if the device has to be polled again or not. 
We can either switch back or in case we want to stick to
the new return value, we might have to add something similar to 
the NAPI_STATE_SCHED flag or a new parameter...

Regards,
Jan-Bernd

^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: James Chapman @ 2007-09-07  9:38 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: netdev, hadi, davem, jeff, ossthema, Stephen Hemminger
In-Reply-To: <20070907035528.GA3755@ludhiana>

Hi Mandeep,

Mandeep Singh Baines wrote:
> Hi James,
> 
> I like the idea of staying in poll longer.
> 
> My comments are similar to what Jamal and Stephen have already said.
> 
> A tunable (via sysfs) would be nice.
> 
> A timer might be preferred to jiffy polling. Jiffy polling will not increase 
> latency the way a timer would. However, jiffy polling will consume a lot more
> CPU than a timer would. Hence more power. For jiffy polling, you could have 
> thousands of calls to poll for a single packet received. While in a timer 
> approach the numbers of polls per packet is upper bound to 2. 

Why would using a timer to hold off the napi_complete() rather than 
jiffy count limit the polls per packet to 2?

> I think it may difficult to make poll efficient for the no packet case because,
> at a minimum, you have to poll the device state via the has_work method.

Why wouldn't it be efficient? It would usually be done by reading an 
"interrupt pending" register.

> If you go to a timer implementation then having a tunable will be important.
> Different appications will have different requirements on delay and jitter.
> Some applications may want to trade delay/jitter for less CPU/power 
> consumption and some may not.

I agree. I'm leaning towards a new ethtool parameter to control this to 
be consistent with other per-device tunables.

> imho, the work should definately be pursued further:)

Thanks Mandeep. I'll try. :)

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* Re: [PATCH 1/1] ipv6: corrects sended rtnetlink message
From: Thomas Graf @ 2007-09-07 10:11 UTC (permalink / raw)
  To: Milan Kocian; +Cc: David Miller, netdev
In-Reply-To: <20070906210550.GA13542@wq.cz>

* Milan Kocian <milon@wq.cz> 2007-09-06 23:05
> I agree but ipv6 sends on device change (NETDEV_DOWN) RTM_DELLINK message.
> BTW when ipv6 send LINK message on NETDEV_UNREGISTER event, why doesn't 
> send message on NETDEV_REGISTER event? No symmetry ?

You should be seeing two RTM_DELLINK upon NETDEV_UNREGISTER if the
interface carried any IPv6 addresess. Once with ifi_change=~0
notifying you that the device is disappearing and once with
ifi_change=0 coming from the IPv6 protocol shutdown.

> ok. However, if I understand, LINK messages handle device changes not
> protocol changes. Or not ?

Yes, I personally think this behaviour is wrong but we can't remove it
unless we are sure it doesn't break anything.

> Now I ignore (RTM_DELLINK && ifi_family==AF_INET6) :-). But it's only workaround
> till next change. 

That's also correct, pure netdevice notification will always be sent
with the ifi_family set to AF_UNSPEC.

> Hard to find it. I can try to look at other routing sw (probably most using it)
> about handling with RTM_DELLINK.
> Howerer when was made change from RTM_NEWLINK to RTM_DELLINK without protests,
> we can try remove message :-)

I'd give the USAGI software tree a short peek, if there was a specific
reason for adding this notification in the first place, the software is
most probably found in that tree.

^ permalink raw reply

* [PATCH 1/2][RESEND] ehea: propagate physical port state
From: Jan-Bernd Themann @ 2007-09-07 10:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: netdev, Christoph Raisch, Jan-Bernd Themann, linux-kernel,
	linux-ppc, Marcus Eder, Thomas Klein, Stefan Roscher

Introduces a module parameter to decide whether the physical
port link state is propagated to the network stack or not.
It makes sense not to take the physical port state into account
on machines with more logical partitions that communicate
with each other. This is always possible no matter what the physical
port state is. Thus eHEA can be considered as a switch there.

Signed-off-by: Jan-Bernd Themann <themann@de.ibm.com>

---
 drivers/net/ehea/ehea.h      |    5 ++++-
 drivers/net/ehea/ehea_main.c |   14 +++++++++++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index d67f97b..8d58be5 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -39,7 +39,7 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0073"
+#define DRV_VERSION	"EHEA_0074"
 
 /* eHEA capability flags */
 #define DLPAR_PORT_ADD_REM 1
@@ -402,6 +402,8 @@ struct ehea_mc_list {
 
 #define EHEA_PORT_UP 1
 #define EHEA_PORT_DOWN 0
+#define EHEA_PHY_LINK_UP 1
+#define EHEA_PHY_LINK_DOWN 0
 #define EHEA_MAX_PORT_RES 16
 struct ehea_port {
 	struct ehea_adapter *adapter;	 /* adapter that owns this port */
@@ -427,6 +429,7 @@ struct ehea_port {
 	u32 msg_enable;
 	u32 sig_comp_iv;
 	u32 state;
+	u8 phy_link;
 	u8 full_duplex;
 	u8 autoneg;
 	u8 num_def_qps;
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index db57474..1e9fd6f 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -53,17 +53,21 @@ static int rq3_entries = EHEA_DEF_ENTRIES_RQ3;
 static int sq_entries = EHEA_DEF_ENTRIES_SQ;
 static int use_mcs = 0;
 static int num_tx_qps = EHEA_NUM_TX_QP;
+static int prop_carrier_state = 0;
 
 module_param(msg_level, int, 0);
 module_param(rq1_entries, int, 0);
 module_param(rq2_entries, int, 0);
 module_param(rq3_entries, int, 0);
 module_param(sq_entries, int, 0);
+module_param(prop_carrier_state, int, 0);
 module_param(use_mcs, int, 0);
 module_param(num_tx_qps, int, 0);
 
 MODULE_PARM_DESC(num_tx_qps, "Number of TX-QPS");
 MODULE_PARM_DESC(msg_level, "msg_level");
+MODULE_PARM_DESC(prop_carrier_state, "Propagate carrier state of physical "
+		 "port to stack. 1:yes, 0:no.  Default = 0 ");
 MODULE_PARM_DESC(rq3_entries, "Number of entries for Receive Queue 3 "
 		 "[2^x - 1], x = [6..14]. Default = "
 		 __MODULE_STRING(EHEA_DEF_ENTRIES_RQ3) ")");
@@ -814,7 +818,9 @@ int ehea_set_portspeed(struct ehea_port *port, u32 port_speed)
 			ehea_error("Failed setting port speed");
 		}
 	}
-	netif_carrier_on(port->netdev);
+	if (!prop_carrier_state || (port->phy_link == EHEA_PHY_LINK_UP))
+		netif_carrier_on(port->netdev);
+
 	kfree(cb4);
 out:
 	return ret;
@@ -869,13 +875,19 @@ static void ehea_parse_eqe(struct ehea_adapter *adapter, u64 eqe)
 			}
 
 		if (EHEA_BMASK_GET(NEQE_EXTSWITCH_PORT_UP, eqe)) {
+			port->phy_link = EHEA_PHY_LINK_UP;
 			if (netif_msg_link(port))
 				ehea_info("%s: Physical port up",
 					  port->netdev->name);
+			if (prop_carrier_state)
+				netif_carrier_on(port->netdev);
 		} else {
+			port->phy_link = EHEA_PHY_LINK_DOWN;
 			if (netif_msg_link(port))
 				ehea_info("%s: Physical port down",
 					  port->netdev->name);
+			if (prop_carrier_state)
+				netif_carrier_off(port->netdev);
 		}
 
 		if (EHEA_BMASK_GET(NEQE_EXTSWITCH_PRIMARY, eqe))
-- 
1.5.2


^ permalink raw reply related

* [PATCH 2/2][RESEND] ehea: fix last_rx update
From: Jan-Bernd Themann @ 2007-09-07 10:30 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: netdev, Christoph Raisch, Jan-Bernd Themann, linux-kernel,
	linux-ppc, Marcus Eder, Thomas Klein, Stefan Roscher

Update last_rx in registered device struct instead of
in the dummy device.

Signed-off-by: Jan-Bernd Themann <themann@de.ibm.com>

---
 drivers/net/ehea/ehea_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 1e9fd6f..717b129 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -471,7 +471,7 @@ static struct ehea_cqe *ehea_proc_rwqes(struct net_device *dev,
 			else
 				netif_receive_skb(skb);
 
-			dev->last_rx = jiffies;
+			port->netdev->last_rx = jiffies;
 		} else {
 			pr->p_stats.poll_receive_errors++;
 			port_reset = ehea_treat_poll_error(pr, rq, cqe,
-- 
1.5.2


^ permalink raw reply related

* [PATCH] Freeing alive inet6 address
From: Denis V. Lunev @ 2007-09-07 10:21 UTC (permalink / raw)
  To: den, adobriyan, xemul, dev, kuznet, yoshfuji, davem; +Cc: netdev, devel

From: Denis V. Lunev <den@openvz.org>

addrconf_dad_failure calls addrconf_dad_stop which takes referenced address
and drops the count. So, in6_ifa_put perrformed at out: is extra. This
results in message: "Freeing alive inet6 address" and not released dst entries.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>

--- ./net/ipv6/ndisc.c.ipv6dad	2007-09-03 16:54:32.000000000 +0400
+++ ./net/ipv6/ndisc.c	2007-09-07 13:34:30.000000000 +0400
@@ -736,7 +736,7 @@ static void ndisc_recv_ns(struct sk_buff
 				 * so fail our DAD process
 				 */
 				addrconf_dad_failure(ifp);
-				goto out;
+				return;
 			} else {
 				/*
 				 * This is not a dad solicitation.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox