Netdev List


* Re: ixgbe and mac-vlans problem
From: Ben Greear @ 2010-05-07  3:12 UTC (permalink / raw)
  To: Tantilov, Emil S; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A1365FEA26A969@rrsmsx501.amr.corp.intel.com>

On 05/06/2010 05:06 PM, Tantilov, Emil S wrote:
> Ben Greear wrote:
>> On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:
>>
>>> Hi Ben,
>>>
>>> We do have a patch in testing (see attached). It may not apply
>>> cleanly as it is on top of some other patches currently in
>>> validation. Let me know if it works for you.
>>
>> It wasn't difficult to backport this patch to 2.6.31.12....
>>
>> I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.
>>
>> The NIC doesn't show as PROMISC in any way I can detect, but I guess
>> it must actually be in PROMISC mode:
>
> Yes the interface is in promisc mode. The driver sets the FCTRL.UPE bit
> (unicast promisc mode) when the number of allowed rar_entries is exceeded.

Is there any way to get this setting from ethtool or similar?  It would be nice
to know the actual PROMISC state of the NIC regardless of what user-space has or has not
configured.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


* linux-next: build failure after merge of the suspend tree
From: Stephen Rothwell @ 2010-05-07  3:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-next, linux-kernel, Helmut Schaa, John W. Linville,
	David Miller, netdev

Hi Rafael,

After merging the suspend tree, today's linux-next build (x86_64
allmodconfig) failed like this:

net/mac80211/scan.c: In function 'ieee80211_scan_state_decision':
net/mac80211/scan.c:510: error: implicit declaration of function 'pm_qos_requirement'

Caused by commit 62bad14fc6e0911a99882c261390968977d43283 ("PM QOS
update") from the suspend tree interacting with commit
df13cce53a7b28a81460e6bfc4857e9df4956141 ("mac80211: Improve software
scan timing") from the net tree.

I have added the following merge fixup patch and can carry it as
necessary:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Fri, 7 May 2010 13:02:54 +1000
Subject: [PATCH] wireless: update for pm_qos_requirement to pm_qos_request rename

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/mac80211/scan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/mac80211/scan.c b/net/mac80211/scan.c
index e14c441..e1b0be7 100644
--- a/net/mac80211/scan.c
+++ b/net/mac80211/scan.c
@@ -510,7 +510,7 @@ static int ieee80211_scan_state_decision(struct ieee80211_local *local,
 		bad_latency = time_after(jiffies +
 				ieee80211_scan_get_channel_time(next_chan),
 				local->leave_oper_channel_time +
-				usecs_to_jiffies(pm_qos_requirement(PM_QOS_NETWORK_LATENCY)));
+				usecs_to_jiffies(pm_qos_request(PM_QOS_NETWORK_LATENCY)));
 
 		listen_int_exceeded = time_after(jiffies +
 				ieee80211_scan_get_channel_time(next_chan),
-- 
1.7.1

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: virtio: put last_used and last_avail index into ring itself.
From: Rusty Russell @ 2010-05-07  3:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, virtualization, kvm, linux-kernel, mingo, linux-mm, akpm,
	hpa, gregory.haskins, s.hetze, Daniel Walker, Eric Dumazet
In-Reply-To: <20100506062755.GC8363@redhat.com>

On Thu, 6 May 2010 03:57:55 pm Michael S. Tsirkin wrote:
> On Thu, May 06, 2010 at 10:22:12AM +0930, Rusty Russell wrote:
> > On Wed, 5 May 2010 03:52:36 am Michael S. Tsirkin wrote:
> > > What do you think?
> > 
> > I think everyone is settled on 128 byte cache lines for the foreseeable
> > future, so it's not really an issue.
> 
> You mean with 64 bit descriptors we will be bouncing a cache line
> between host and guest, anyway?

I'm confused by this entire thread.

Descriptors are 16 bytes.  They are at the start, so presumably aligned to
cache boundaries.

Available ring follows that at 2 bytes per entry, so it's also packed nicely
into cachelines.

Then there's padding to page boundary.  That puts us on a cacheline again
for the used ring; also 2 bytes per entry.

I don't see how any change in layout could be more cache friendly?
Rusty.
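
The layout described above can be checked with a little arithmetic. The following is a minimal user-space sketch, assuming the ring layout of include/linux/virtio_ring.h at the time of this thread (no event-index fields): 16-byte descriptors, then a 2-byte flags field and a 2-byte index ahead of the 2-byte available entries, then a used ring aligned to `align`. The helper names (`ring_avail_offset`, `ring_used_offset`, `ring_total_size`) are hypothetical. One caveat: in that header each used-ring element holds a 32-bit id and a 32-bit length, so the sketch counts 8 bytes per used entry rather than 2.

```c
#include <stddef.h>

/* Sizes assumed from the virtio ring layout of the time (illustrative). */
enum {
	DESC_SIZE      = 16, /* u64 addr, u32 len, u16 flags, u16 next */
	AVAIL_HDR_SIZE = 4,  /* u16 flags + u16 idx */
	AVAIL_ENT_SIZE = 2,  /* u16 descriptor index */
	USED_HDR_SIZE  = 4,  /* u16 flags + u16 idx */
	USED_ENT_SIZE  = 8   /* u32 id + u32 len */
};

/* Round x up to a multiple of align (align must be a power of two). */
static size_t round_up_pow2(size_t x, size_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* The available ring starts right after the descriptor table. */
static size_t ring_avail_offset(unsigned int num)
{
	return (size_t)DESC_SIZE * num;
}

/* The used ring starts at the next `align` boundary after the available ring. */
static size_t ring_used_offset(unsigned int num, size_t align)
{
	size_t avail_end = ring_avail_offset(num) + AVAIL_HDR_SIZE +
			   (size_t)AVAIL_ENT_SIZE * num;

	return round_up_pow2(avail_end, align);
}

/* Total bytes occupied by descriptors, available ring, padding and used ring. */
static size_t ring_total_size(unsigned int num, size_t align)
{
	return ring_used_offset(num, align) + USED_HDR_SIZE +
	       (size_t)USED_ENT_SIZE * num;
}
```

For a 256-entry ring with 4096-byte alignment this puts the available ring at offset 4096, the used ring at offset 8192, and the total size at 10244 bytes, so both rings do start on a cache-line boundary for any power-of-two line size up to 4096 bytes.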


* [PATCH 1/2] netdev/fec: fix performance impact from mdio poll operation
From: Bryan Wu @ 2010-05-07  2:27 UTC (permalink / raw)
  To: davem, Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev,
	linux-kernel
In-Reply-To: <1273199239-11057-1-git-send-email-bryan.wu@canonical.com>

BugLink: http://bugs.launchpad.net/bugs/546649
BugLink: http://bugs.launchpad.net/bugs/457878

After phylib support was introduced, users experienced a performance drop. This is
caused by the mdio polling operation of phylib. Replace the busy-waiting
cpu_relax() with msleep() and remove the warning message.

Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
---
 drivers/net/fec.c |   45 +++++++++++++++++++++------------------------
 1 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index 2b1651a..9c58f6b 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -203,7 +203,7 @@ static void fec_stop(struct net_device *dev);
 #define FEC_MMFR_TA		(2 << 16)
 #define FEC_MMFR_DATA(v)	(v & 0xffff)
 
-#define FEC_MII_TIMEOUT		10000
+#define FEC_MII_TIMEOUT		10
 
 /* Transmitter timeout */
 #define TX_TIMEOUT (2 * HZ)
@@ -611,13 +611,29 @@ spin_unlock:
 /*
  * NOTE: a MII transaction is during around 25 us, so polling it...
  */
-static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+static int fec_enet_mdio_poll(struct fec_enet_private *fep)
 {
-	struct fec_enet_private *fep = bus->priv;
 	int timeout = FEC_MII_TIMEOUT;
 
 	fep->mii_timeout = 0;
 
+	/* wait for end of transfer */
+	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
+		msleep(1);
+		if (timeout-- < 0) {
+			fep->mii_timeout = 1;
+			break;
+		}
+	}
+
+	return 0;
+}
+
+static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+{
+	struct fec_enet_private *fep = bus->priv;
+
+
 	/* clear MII end of transfer bit*/
 	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
 
@@ -626,15 +642,7 @@ static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 		FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
 		FEC_MMFR_TA, fep->hwp + FEC_MII_DATA);
 
-	/* wait for end of transfer */
-	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
-		cpu_relax();
-		if (timeout-- < 0) {
-			fep->mii_timeout = 1;
-			printk(KERN_ERR "FEC: MDIO read timeout\n");
-			return -ETIMEDOUT;
-		}
-	}
+	fec_enet_mdio_poll(fep);
 
 	/* return value */
 	return FEC_MMFR_DATA(readl(fep->hwp + FEC_MII_DATA));
@@ -644,9 +652,6 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 			   u16 value)
 {
 	struct fec_enet_private *fep = bus->priv;
-	int timeout = FEC_MII_TIMEOUT;
-
-	fep->mii_timeout = 0;
 
 	/* clear MII end of transfer bit*/
 	writel(FEC_ENET_MII, fep->hwp + FEC_IEVENT);
@@ -657,15 +662,7 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 		FEC_MMFR_TA | FEC_MMFR_DATA(value),
 		fep->hwp + FEC_MII_DATA);
 
-	/* wait for end of transfer */
-	while (!(readl(fep->hwp + FEC_IEVENT) & FEC_ENET_MII)) {
-		cpu_relax();
-		if (timeout-- < 0) {
-			fep->mii_timeout = 1;
-			printk(KERN_ERR "FEC: MDIO write timeout\n");
-			return -ETIMEDOUT;
-		}
-	}
+	fec_enet_mdio_poll(fep);
 
 	return 0;
 }
-- 
1.7.0.1
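
The polling pattern in this patch (sleep between checks, give up after a bounded number of tries) can be sketched in user space. This is only an illustration under assumptions: `poll_until`, `ready_fn` and the fake device below are hypothetical names, `nanosleep()` stands in for the kernel's `msleep()`, and unlike the patch the sketch reports a timeout to the caller as `-ETIMEDOUT`.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Callback that returns nonzero once the (hypothetical) transfer is done. */
typedef int (*ready_fn)(void *ctx);

/*
 * Check ready() up to max_tries times, sleeping about 1 ms after each
 * failed check instead of spinning, then check one last time.
 * Returns 0 on success, -ETIMEDOUT if the condition never became true.
 */
static int poll_until(ready_fn ready, void *ctx, int max_tries)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000L };
	int i;

	for (i = 0; i < max_tries; i++) {
		if (ready(ctx))
			return 0;
		nanosleep(&one_ms, NULL);
	}
	return ready(ctx) ? 0 : -ETIMEDOUT;
}

/* Fake device for demonstration: becomes ready after *ctx more checks. */
static int fake_ready(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 0;
	}
	return 1;
}
```

The trade-off is latency for CPU time: the driver comment puts a MII transaction at around 25 us, while each sleep here costs at least a millisecond, so a caller that issues many transactions back to back would see each one take longer even though the CPU is free in between.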



* [PATCH 2/2] netdev/fec: fix ifconfig eth0 down hang issue
From: Bryan Wu @ 2010-05-07  2:27 UTC (permalink / raw)
  To: davem, Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev,
	linux-kernel
In-Reply-To: <1273199239-11057-1-git-send-email-bryan.wu@canonical.com>

BugLink: http://bugs.launchpad.net/bugs/559065

In the fec open/close functions, we need to call phy_connect and phy_disconnect
before we start/stop the phy. Otherwise the system will hang.

Only call fec_enet_mii_probe() in the open function, because the first open
action will otherwise cause a NULL pointer error.

Signed-off-by: Bryan Wu <bryan.wu@canonical.com>
---
 drivers/net/fec.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index 9c58f6b..af4243f 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -678,6 +678,8 @@ static int fec_enet_mii_probe(struct net_device *dev)
 	struct phy_device *phy_dev = NULL;
 	int phy_addr;
 
+	fep->phy_dev = NULL;
+
 	/* find the first phy */
 	for (phy_addr = 0; phy_addr < PHY_MAX_ADDR; phy_addr++) {
 		if (fep->mii_bus->phy_map[phy_addr]) {
@@ -708,6 +710,11 @@ static int fec_enet_mii_probe(struct net_device *dev)
 	fep->link = 0;
 	fep->full_duplex = 0;
 
+	printk(KERN_INFO "%s: Freescale FEC PHY driver [%s] "
+		"(mii_bus:phy_addr=%s, irq=%d)\n", dev->name,
+		fep->phy_dev->drv->name, dev_name(&fep->phy_dev->dev),
+		fep->phy_dev->irq);
+
 	return 0;
 }
 
@@ -753,13 +760,8 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	if (mdiobus_register(fep->mii_bus))
 		goto err_out_free_mdio_irq;
 
-	if (fec_enet_mii_probe(dev) != 0)
-		goto err_out_unregister_bus;
-
 	return 0;
 
-err_out_unregister_bus:
-	mdiobus_unregister(fep->mii_bus);
 err_out_free_mdio_irq:
 	kfree(fep->mii_bus->irq);
 err_out_free_mdiobus:
@@ -912,7 +914,12 @@ fec_enet_open(struct net_device *dev)
 	if (ret)
 		return ret;
 
-	/* schedule a link state check */
+	/* Probe and connect to PHY when open the interface */
+	ret = fec_enet_mii_probe(dev);
+	if (ret) {
+		fec_enet_free_buffers(dev);
+		return ret;
+	}
 	phy_start(fep->phy_dev);
 	netif_start_queue(dev);
 	fep->opened = 1;
@@ -926,10 +933,12 @@ fec_enet_close(struct net_device *dev)
 
 	/* Don't know what to do yet. */
 	fep->opened = 0;
-	phy_stop(fep->phy_dev);
 	netif_stop_queue(dev);
 	fec_stop(dev);
 
+	if (fep->phy_dev)
+		phy_disconnect(fep->phy_dev);
+
         fec_enet_free_buffers(dev);
 
 	return 0;
@@ -1293,11 +1302,6 @@ fec_probe(struct platform_device *pdev)
 	if (ret)
 		goto failed_register;
 
-	printk(KERN_INFO "%s: Freescale FEC PHY driver [%s] "
-		"(mii_bus:phy_addr=%s, irq=%d)\n", ndev->name,
-		fep->phy_dev->drv->name, dev_name(&fep->phy_dev->dev),
-		fep->phy_dev->irq);
-
 	return 0;
 
 failed_register:
-- 
1.7.0.1


* [PATCH 0/2] net-next/fec: bug fixing after introduced phylib supporting
From: Bryan Wu @ 2010-05-07  2:27 UTC (permalink / raw)
  To: davem, Sascha Hauer, Greg Ungerer, Amit Kucheria, netdev,
	linux-kernel

After phylib support was introduced, we found some critical issues in Ubuntu on
Freescale iMX51. The following 2 patches fix those bugs, which were recorded in our
Launchpad bug tracker.

Bryan Wu (2):
  netdev/fec: fix performance impact from mdio poll operation
  netdev/fec: fix ifconfig eth0 down hang issue

 drivers/net/fec.c |   73 +++++++++++++++++++++++++++--------------------------
 1 files changed, 37 insertions(+), 36 deletions(-)


* RE: ixgbe and mac-vlans problem
From: Tantilov, Emil S @ 2010-05-07  0:06 UTC (permalink / raw)
  To: Ben Greear; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <4BE32B4A.2030409@candelatech.com>

Ben Greear wrote:
> On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:
> 
>> Hi Ben,
>> 
>> We do have a patch in testing (see attached). It may not apply
>> cleanly as it is on top of some other patches currently in
>> validation. Let me know if it works for you.  
> 
> It wasn't difficult to backport this patch to 2.6.31.12....
> 
> I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.
> 
> The NIC doesn't show as PROMISC in any way I can detect, but I guess
> it must actually be in PROMISC mode:

Yes the interface is in promisc mode. The driver sets the FCTRL.UPE bit
(unicast promisc mode) when the number of allowed rar_entries is exceeded. 

> 
> [root@i7-1qc-1 ~]# cat /sys/class/net/eth11/flags
> 0x1003
> 
> [root@i7-1qc-1 ~]# ip link show dev eth11
> 2: eth11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>      state UP qlen 1000 link/ether 00:e0:ed:11:25:12 brd
> ff:ff:ff:ff:ff:ff 

The IFF_PROMISC flag is not set in this case. That's how the driver knows when promisc mode is turned on by the user.

Thanks,
Emil
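
The flags value quoted above can be decoded by hand. This is a minimal sketch, assuming the IFF_* bit values from <linux/if.h> (copied here under local FLAG_* names so the snippet builds anywhere); `has_flag` is a hypothetical helper. It only inspects the software-visible flags word, so it cannot reveal a hardware-only setting such as FCTRL.UPE.

```c
/* Interface flag bits, with the values used by IFF_* in <linux/if.h>. */
#define FLAG_UP        0x0001u
#define FLAG_BROADCAST 0x0002u
#define FLAG_PROMISC   0x0100u
#define FLAG_MULTICAST 0x1000u

/* Return 1 if every bit in `bit` is set in `flags`, else 0. */
static int has_flag(unsigned int flags, unsigned int bit)
{
	return (flags & bit) == bit;
}
```

Applied to 0x1003, this reports UP, BROADCAST and MULTICAST as set and PROMISC (0x100) as clear, which matches the `ip link show` output in the thread.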


* Re: [PATCH] ipv4: remove ip_rt_secret timer (v2)
From: nhorman @ 2010-05-07  0:02 UTC (permalink / raw)
  To: Eric Dumazet, Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273180085.2222.33.camel@edumazet-laptop>


On Thu, 6 May 2010 17:08:05 -0400, Eric Dumazet wrote:

> On Thursday 06 May 2010 at 16:29 -0400, Neil Horman wrote:
> > Version 2 of this patch, taking into account Eric's comment about making the
> > rt_genid non-zero when a netns is created.  This makes sense, and helps prevent
> > attackers from guessing our initial secret value.
> > 
> > 
> > 
> > A while back there was a discussion regarding the rt_secret_interval timer.
> > Given that we've had the ability to do emergency route cache rebuilds for awhile
> > now, based on a statistical analysis of the various hash chain lengths in the
> > cache, the use of the flush timer is somewhat redundant.  This patch removes the
> > rt_secret_interval sysctl, allowing us to rely solely on the statistical
> > analysis mechanism to determine the need for route cache flushes.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > 
> 
> > -
> > -static __net_init int rt_secret_timer_init(struct net *net)
> > +static __net_init int rt_genid_init(struct net *net)
> >  {
> > -	atomic_set(&net->ipv4.rt_genid,
> > -			(int) ((num_physpages ^ (num_physpages>>8)) ^
> > -			(jiffies ^ (jiffies >> 7))));
> > -
> 
> 
> > +	/*
> > +	 * This just serves to start off each new net namespace
> > +	 * with a non-zero rt_genid value, making it harder to guess
> > +	 */
> > +	rt_cache_invalidate(net);
> >  	return 0;
> >  }
> >  
> 
> I am _sorry_ to be such a paranoiac guy.
> 
Don't be sorry, I think your concern is valid, I just don't want to keep old code around when 
> Could you please feed more than 8 bits here ?
> 
> like :
> 
> get_random_bytes(&net->ipv4.rt_genid, sizeof(net->ipv4.rt_genid));
> 
Sure, I'm good with that. I'm not at my desk right now, but I'll do that in the morning.

> There is no need to comment this in the code, this kind of rnd init is
> very common in net tree.
>
Ok, copy that, I'll fix that up at the same time.

Thanks & regards
Neil


* Re: 3 packet TCP window limit?
From: Jerry Chu @ 2010-05-06 23:15 UTC (permalink / raw)
  To: dormando; +Cc: Lars Eggert, Rick Jones, Brian Bloniarz, netdev@vger.kernel.org
In-Reply-To: <q2ud1c2719f1005061613yf90cd7c6r46ee23cc49858e74@mail.gmail.com>

From: dormando <dormando@rydia.net>
>
> Date: Thu, May 6, 2010 at 1:51 AM
> Subject: Re: 3 packet TCP window limit?
> To: Lars Eggert <lars.eggert@nokia.com>
> Cc: Rick Jones <rick.jones2@hp.com>, Brian Bloniarz
> <bmb@athenacr.com>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>
>
>
> > On 2010-5-5, at 23:31, dormando wrote:
> > > The RFC clearly states "around 4k",
> >
> > no, it doesn't. RFC3390 gives a very precise formula for calculating the initial window:
> >
> >       min (4*MSS, max (2*MSS, 4380 bytes))
> >
> > Please see the RFC for why. More reading at http://www.icir.org/floyd/tcp_init_win.html I believe that Linux implements this behavior pretty faithfully.
>
> Sorry, paraphrasing :) Web nerds have been working around this for a long
> time now. Google talks about using HTTP chunked encoding responses to send
> an initial "frame" of a webpage in under 3 packets. Which immediately
> gives the browser something to render and primes the TCP connection for
> more web junk.
>
> > I'm surprised to hear that OpenBSD doesn't follow the RFC. Can you share a measurement? Are you sure the box you are measuring is using the default configuration?
>
> Yeah, default config. OBSD was giving me back 4 packets in the first
> window, while linux always gives back 3. The Big/IP is based on linux
> 2.4.21. If that kernel didn't have it wrong, they tuned it.
>
> Already nuked my dumps. If you're curious I'll re-create.
>
> > I don't think the RFC can be misread (it's pretty clear), and the
> > formula is also not exactly complicated. My guess would be that some
> > vendors have convinced themselves that using a slightly larger value is
> > OK, esp. if they can show customers that "their" TCP is "faster" than
> > some competitors' TCPs. An arms race between vendors in this space would
> > really not be good for anyone - it's clear that at some point, problems
> > due to overshoot will occur.
>
> I clearly remember some vendors bragging about doing this. That was a long
> time ago? Perhaps they stopped? If it's true they've been doing it for
> half a decade or more, and haven't broken anything someone would notice.
>
> The only reason why I set about tuning this is because our latency jumped
> while moving traffic from a commercial machine to a linux machine, and I
> had to figure out what they changed to do that. I've since turned the
> setting *back* to the standard, having confirmed what they did.
>
> Almost tempted to test this against a bunch of websites...
>
> > (We can definitely argue about whether the current RFC-recommended value
> > is too low, and Google and others are gathering data in support of
> > making a convincing and backed-up argument for increasing the initial
> > window to the IETF. Which is exactly the correct way of going about
> > this.)
>
> This sounds like fun. We have some diverse traffic, so I'm hoping we can
> contribute to that conversation. Still have a lot of reading to catch up
> with first :)

Yes please do.  Our presentation at Anaheim IETF can be found at
http://www.ietf.org/proceedings/10mar/slides/tcpm-4.pdf, with a paper describing
the details of our experiments at
http://code.google.com/speed/articles/tcp_initcwnd_paper.pdf.

We've gotten a lot of feedback from IETF and are planning to collect more data to
justify the proposal. But at this point we really need help from others as the
scope of the work is certainly not a one-company job. Help can be in the form of
more experiments/tests and/or simulations to study the effect of a larger initcwnd.
Please contact me directly or send your data to IETF's TCPM WG list
(http://www.ietf.org/mail-archive/web/tcpm/current/maillist.html).

Thanks,

Jerry
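
The RFC 3390 formula quoted earlier in this message, min (4*MSS, max (2*MSS, 4380 bytes)), is easy to work through for a given MSS. A minimal sketch follows; the function names are hypothetical and the code is a direct transcription of the formula, not of any kernel implementation.

```c
/*
 * Initial window in bytes per the RFC 3390 formula:
 *   min(4*MSS, max(2*MSS, 4380 bytes))
 */
static unsigned int rfc3390_initial_window(unsigned int mss)
{
	unsigned int lower = 2 * mss > 4380 ? 2 * mss : 4380;

	return 4 * mss < lower ? 4 * mss : lower;
}

/* Number of full-sized segments that fit in the initial window (mss > 0). */
static unsigned int rfc3390_initial_segments(unsigned int mss)
{
	return rfc3390_initial_window(mss) / mss;
}
```

With the common Ethernet MSS of 1460 bytes the window is 4380 bytes, i.e. three full segments, which is the three-packet limit in the subject line. With an MSS of 536 the window is 2144 bytes, or four segments.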


* [PATCH 2.6.34-rc6] net: Improve ks8851 snl transmit performance
From: Ha, Tristram @ 2010-05-06 22:50 UTC (permalink / raw)
  To: Ben Dooks; +Cc: David Miller, netdev, linux-kernel, Abraham Arce, Sebastien Jan

From: Tristram Ha <Tristram.Ha@micrel.com>

Under heavy transmission the driver puts four 1514-byte packets in the queue and stops the device transmit queue.  Only the last packet triggers the transmit done interrupt and wakes up the device transmit queue.  That means time is wasted while the CPU cannot send any more packets.

The new implementation triggers the transmit interrupt when the remaining transmit buffer space is less than 3 packets.  The maximum transmit buffer size is 6144 bytes.  This allows the device transmit queue to be restarted sooner so that the CPU can send more packets.

For TCP receiving it also has the benefit of not triggering any transmit interrupt at all.

There is a driver option no_tx_opt so that the driver can revert to the original implementation.  This allows users to verify whether the transmit performance actually improves.

Signed-off-by: Tristram Ha <Tristram.Ha@micrel.com>
---
This replaces the [patch 01/13] patch I submitted, which David objected to.

Other users with the Micrel KSZ8851 SNL chip, please verify whether the transmit performance improves.

--- a/drivers/net/ks8851.c	2010-04-29 20:02:05.000000000 -0700
+++ b/drivers/net/ks8851.c	2010-05-06 15:30:40.000000000 -0700
@@ -74,6 +74,9 @@ union ks8851_tx_hdr {
  * @rxd: Space for receiving SPI data, in DMA-able space.
  * @txd: Space for transmitting SPI data, in DMA-able space.
  * @msg_enable: The message flags controlling driver output (see ethtool).
+ * @tx_space: The current available transmit buffer size.
+ * @tx_avail: The maximum available transmit buffer size.
+ * @tx_chk_cnt: Used to indicate how often to check the transmit buffer.
  * @fid: Incrementing frame id tag.
  * @rc_ier: Cached copy of KS_IER.
  * @rc_rxqcr: Cached copy of KS_RXQCR.
@@ -103,6 +106,8 @@ struct ks8851_net {
 
 	u32			msg_enable ____cacheline_aligned;
 	u16			tx_space;
+	u16			tx_avail;
+	u8			tx_chk_cnt;
 	u8			fid;
 
 	u16			rc_ier;
@@ -124,6 +129,7 @@ struct ks8851_net {
 };
 
 static int msg_enable;
+static int no_tx_opt;
 
 #define ks_info(_ks, _msg...) dev_info(&(_ks)->spidev->dev, _msg)
 #define ks_warn(_ks, _msg...) dev_warn(&(_ks)->spidev->dev, _msg)
@@ -580,10 +586,21 @@ static void ks8851_irq_work(struct work_
 
 		/* update our idea of how much tx space is available to the
 		 * system */
+		ks->tx_chk_cnt = 0;
 		ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
 
 		if (netif_msg_intr(ks))
 			ks_dbg(ks, "%s: txspace %d\n", __func__, ks->tx_space);
+
+	/* Update tx space when packets are being transmitted. */
+	} else if (ks->tx_space < ks->tx_avail) {
+		ks->tx_chk_cnt++;
+
+		/* Read the transmit buffer register every 4th rx interrupt. */
+		if (4 == ks->tx_chk_cnt) {
+			ks->tx_chk_cnt = 0;
+			ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
+		}
 	}
 
 	if (status & IRQ_RXI)
@@ -715,6 +732,7 @@ static void ks8851_tx_work(struct work_s
 	struct ks8851_net *ks = container_of(work, struct ks8851_net, tx_work);
 	struct sk_buff *txb;
 	bool last = skb_queue_empty(&ks->txq);
+	bool tx_irq;
 
 	mutex_lock(&ks->lock);
 
@@ -724,7 +742,11 @@ static void ks8851_tx_work(struct work_s
 
 		if (txb != NULL) {
 			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr | RXQCR_SDA);
-			ks8851_wrpkt(ks, txb, last);
+			if (ks->tx_avail)
+				tx_irq = (CHECKSUM_UNNECESSARY == txb->ip_summed);
+			else
+				tx_irq = last;
+			ks8851_wrpkt(ks, txb, tx_irq);
 			ks8851_wrreg16(ks, KS_RXQCR, ks->rc_rxqcr);
 			ks8851_wrreg16(ks, KS_TXQCR, TXQCR_METFE);
 
@@ -917,11 +939,17 @@ static netdev_tx_t ks8851_start_xmit(str
 		ret = NETDEV_TX_BUSY;
 	} else {
 		ks->tx_space -= needed;
+		/*
+		 * Indicate to enable transmit done interrupt when transmit
+		 * buffer is less than a certain size.
+		 */
+		if (ks->tx_avail && ks->tx_space < 1514 * 3)
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
 		skb_queue_tail(&ks->txq, skb);
+		schedule_work(&ks->tx_work);
 	}
 
 	spin_unlock(&ks->statelock);
-	schedule_work(&ks->tx_work);
 
 	return ret;
 }
@@ -1224,7 +1252,6 @@ static int __devinit ks8851_probe(struct
 
 	ks->netdev = ndev;
 	ks->spidev = spi;
-	ks->tx_space = 6144;
 
 	mutex_init(&ks->lock);
 	spin_lock_init(&ks->statelock);
@@ -1279,6 +1306,10 @@ static int __devinit ks8851_probe(struct
 		goto err_id;
 	}
 
+	ks->tx_space = ks8851_rdreg16(ks, KS_TXMIR);
+	if (!no_tx_opt)
+		ks->tx_avail = ks->tx_space;
+
 	ks8851_read_selftest(ks);
 	ks8851_init_mac(ks);
 
@@ -1351,6 +1382,8 @@ MODULE_DESCRIPTION("KS8851 Network drive
 MODULE_AUTHOR("Ben Dooks <ben@simtec.co.uk>");
 MODULE_LICENSE("GPL");
 
+module_param(no_tx_opt, int, 0);
+MODULE_PARM_DESC(no_tx_opt, "No TX optimization");
 module_param_named(message, msg_enable, int, 0);
 MODULE_PARM_DESC(message, "Message verbosity level (0=none, 31=all)");
 MODULE_ALIAS("spi:ks8851");
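
The threshold logic in the description can be traced with a small user-space model. This is a sketch under assumptions, not driver code: `queue_frame` and `tx_space_needed` are hypothetical names, and the per-frame cost (a 4-byte header, rounded up to a 4-byte boundary) is an assumption about how the driver accounts for buffer space.

```c
#define TX_BUF_SIZE   6144u               /* maximum transmit buffer size */
#define MAX_FRAME_LEN 1514u
#define IRQ_THRESHOLD (3u * MAX_FRAME_LEN) /* "less than 3 packets" left */

/* Buffer space consumed by one frame (assumed: 4-byte header, 4-byte aligned). */
static unsigned int tx_space_needed(unsigned int len)
{
	return (len + 4u + 3u) & ~3u;
}

/*
 * Account for one frame of `len` bytes against *tx_space.
 * Returns -1 if the frame does not fit (queue would be stopped),
 * 1 if the frame should request a transmit-done interrupt,
 * 0 if it can be sent without one.
 */
static int queue_frame(unsigned int *tx_space, unsigned int len)
{
	unsigned int needed = tx_space_needed(len);

	if (needed > *tx_space)
		return -1;
	*tx_space -= needed;
	return *tx_space < IRQ_THRESHOLD;
}
```

Starting from 6144 bytes, four 1514-byte frames fit (leaving 4624, 3104, 1584 and 64 bytes) and the fifth is refused. Under this model the second, third and fourth frames each request an interrupt, instead of only the last one as in the original behaviour.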


* Re: 2.6.33.2: Turn tx power off/on for Atheros card
From: Luis R. Rodriguez @ 2010-05-06 22:16 UTC (permalink / raw)
  To: Yegor Yefremov
  Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <g2wf69abfc31005060752w6876439cm45f5be68001c8382-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Thu, May 6, 2010 at 7:52 AM, Yegor Yefremov
<yegorslists-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> wrote:
> On Wed, May 5, 2010 at 12:26 PM, Yegor Yefremov
> <yegorslists-gM/Ye1E23mwN+BqQ9rBEUg@public.gmane.org> wrote:
>> I'm using kernel 2.6.33.2 with AR2413 WLAN card. Issuing
>>
>> iwconfig wlan0 txpower off
>>
>> turns txpower off. I can see this status by iwconfig wlan0 and the
>> communication with AP terminates. But when I turn the txpower on
>>
>> iwconfig wlan0 txpower on
>>
>> nothing happens. Though iwconfig shows the previous tx power value.
>> Only ifconfig wlan0 down and then up recovers the transmission.
>>
>> Is it a known bug or am I doing something wrong?
>
> I did some debugging and found out that after iwconfig wlan0 txpower
> off, dev_close() is invoked, so that local->open_count becomes 0.
> The next time txpower on is called, the code checks whether
> local->open_count > 0; this condition fails, so no hardware
> configuration is done.
>
> I've made a quick and dirty hack that opens the wireless device when
> enabling the txpower, if it was closed before. Is there any proper
> solution? Is it really necessary to close the device to turn txpower off?

Depends on the type of interfaces you have. For a monitor device it
makes no sense to close the device as you should be able to still RX.
It also is possible to TX over a monitor device using frame injection
so technically setting tx power to off would just mute it and would
seem useful.

  Luis

* Re: [PATCH] ipv4: remove ip_rt_secret timer (v2)
From: Eric Dumazet @ 2010-05-06 21:08 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506202957.GE5063@hmsreliant.think-freely.org>

On Thursday 06 May 2010 at 16:29 -0400, Neil Horman wrote:
> Version 2 of this patch, taking into account Eric's comment about making the
> rt_genid non-zero when a netns is created.  This makes sense, and helps prevent
> attackers from guessing our initial secret value.
> 
> 
> 
> A while back there was a discussion regarding the rt_secret_interval timer.
> Given that we've had the ability to do emergency route cache rebuilds for awhile
> now, based on a statistical analysis of the various hash chain lengths in the
> cache, the use of the flush timer is somewhat redundant.  This patch removes the
> rt_secret_interval sysctl, allowing us to rely solely on the statistical
> analysis mechanism to determine the need for route cache flushes.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 

> -
> -static __net_init int rt_secret_timer_init(struct net *net)
> +static __net_init int rt_genid_init(struct net *net)
>  {
> -	atomic_set(&net->ipv4.rt_genid,
> -			(int) ((num_physpages ^ (num_physpages>>8)) ^
> -			(jiffies ^ (jiffies >> 7))));
> -


> +	/*
> +	 * This just serves to start off each new net namespace
> +	 * with a non-zero rt_genid value, making it harder to guess
> +	 */
> +	rt_cache_invalidate(net);
>  	return 0;
>  }
>  

I am _sorry_ to be such a paranoiac guy.

Could you please feed more than 8 bits here ?

like :

get_random_bytes(&net->ipv4.rt_genid, sizeof(net->ipv4.rt_genid));

There is no need to comment this in the code, this kind of rnd init is
very common in net tree.
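
The difference between the two initialisations can be sketched in user space. This is an illustration under assumptions: `getrandom()` from <sys/random.h> (Linux, glibc 2.25 or later) stands in for the kernel's `get_random_bytes()`, both function names are hypothetical, and `bump_generation_id` assumes, as the "8 bits" remark suggests, that rt_cache_invalidate() advances the counter by one random byte plus one.

```c
#include <sys/random.h>
#include <sys/types.h>

/*
 * Advance a generation id by a random byte plus one. Starting from 0,
 * the result is always in 1..256, i.e. only 8 bits of uncertainty.
 */
static void bump_generation_id(unsigned int *genid, unsigned char shuffle)
{
	*genid += shuffle + 1u;
}

/*
 * Fill the whole generation id with random bytes, so every bit of the
 * initial value is unpredictable. Returns 0 on success, -1 on failure.
 */
static int seed_generation_id(unsigned int *genid)
{
	ssize_t n = getrandom(genid, sizeof(*genid), 0);

	return n == (ssize_t)sizeof(*genid) ? 0 : -1;
}
```

A freshly zeroed counter that is only bumped once can take one of 256 values, whereas seeding the full word leaves 2^32 possibilities on a platform with a 32-bit `unsigned int`.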







* Re: ixgbe and mac-vlans problem
From: Ben Greear @ 2010-05-06 20:49 UTC (permalink / raw)
  To: Tantilov, Emil S; +Cc: Arnd Bergmann, NetDev, Patrick McHardy
In-Reply-To: <EA929A9653AAE14F841771FB1DE5A1365FEA26A2D0@rrsmsx501.amr.corp.intel.com>

On 05/06/2010 10:51 AM, Tantilov, Emil S wrote:

> Hi Ben,
>
> We do have a patch in testing (see attached). It may not apply cleanly as it is on top of some other patches currently in validation. Let me know if it works for you.

It wasn't difficult to backport this patch to 2.6.31.12....

I just tested this on an 85998 NIC and 50 MAC-VLANs worked fine.

The NIC doesn't show as PROMISC in any way I can detect, but I guess
it must actually be in PROMISC mode:

[root@i7-1qc-1 ~]# cat /sys/class/net/eth11/flags
0x1003

[root@i7-1qc-1 ~]# ip link show dev eth11
2: eth11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
     link/ether 00:e0:ed:11:25:12 brd ff:ff:ff:ff:ff:ff

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: [PATCH  kernel 2.6.34-rc5] lib8390: to be SMP safe
From: Ken Kawasaki @ 2010-05-06 20:47 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20100503194316.60c98272.ken_kawasaki@spring.nifty.jp>


Sorry, I am withdrawing this patch
and will test it again.


Best Regards
Ken

> 
> lib8390:
> 	write the value "ENISR_ALL" to register "EN0_IMR"
> 	after enable_irq_lockdep_irqrestore. 
> 
> 	This patch avoids frequent transmit error on SMP system.
> 
> 
> Signed-off-by: Ken Kawasaki <ken_kawasaki@spring.nifty.jp>
> 
> ---
> 
> --- linux-2.6.34-rc6/drivers/net/lib8390.c.orig	2010-05-02 16:49:57.000000000 +0900
> +++ linux-2.6.34-rc6/drivers/net/lib8390.c	2010-05-02 18:09:18.000000000 +0900
> @@ -367,9 +367,9 @@ static netdev_tx_t __ei_start_xmit(struc
>  				dev->name, ei_local->tx1, ei_local->tx2, ei_local->lasttx);
>  		ei_local->irqlock = 0;
>  		netif_stop_queue(dev);
> -		ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  		spin_unlock(&ei_local->page_lock);
>  		enable_irq_lockdep_irqrestore(dev->irq, &flags);
> +		ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  		dev->stats.tx_errors++;
>  		return NETDEV_TX_BUSY;
>  	}
> @@ -407,10 +407,10 @@ static netdev_tx_t __ei_start_xmit(struc
>  
>  	/* Turn 8390 interrupts back on. */
>  	ei_local->irqlock = 0;
> -	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  
>  	spin_unlock(&ei_local->page_lock);
>  	enable_irq_lockdep_irqrestore(dev->irq, &flags);
> +	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
>  
>  	dev_kfree_skb (skb);
>  	dev->stats.tx_bytes += send_length;



* Re: [PATCH] ipv4: remove ip_rt_secret timer (v2)
From: Neil Horman @ 2010-05-06 20:29 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506171639.GA5063@hmsreliant.think-freely.org>

Version 2 of this patch, taking into account Eric's comment about making the
rt_genid non-zero when a netns is created.  This makes sense, and helps prevent
attackers from guessing our initial secret value.



A while back there was a discussion regarding the rt_secret_interval timer.
Given that we've had the ability to do emergency route cache rebuilds for awhile
now, based on a statistical analysis of the various hash chain lengths in the
cache, the use of the flush timer is somewhat redundant.  This patch removes the
rt_secret_interval sysctl, allowing us to rely solely on the statistical
analysis mechanism to determine the need for route cache flushes.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/net/netns/ipv4.h |    1 
 net/ipv4/route.c         |  111 ++++-------------------------------------------
 2 files changed, 11 insertions(+), 101 deletions(-)


diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ae07fee..d68c3f1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,7 +55,6 @@ struct netns_ipv4 {
 	int sysctl_rt_cache_rebuild_count;
 	int current_rt_cache_rebuild_count;
 
-	struct timer_list rt_secret_timer;
 	atomic_t rt_genid;
 
 #ifdef CONFIG_IP_MROUTE
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..e55a066 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,7 +129,6 @@ static int ip_rt_gc_elasticity __read_mostly	= 8;
 static int ip_rt_mtu_expires __read_mostly	= 10 * 60 * HZ;
 static int ip_rt_min_pmtu __read_mostly		= 512 + 20 + 20;
 static int ip_rt_min_advmss __read_mostly	= 256;
-static int ip_rt_secret_interval __read_mostly	= 10 * 60 * HZ;
 static int rt_chain_length_max __read_mostly	= 20;
 
 static struct delayed_work expires_work;
@@ -918,32 +917,11 @@ void rt_cache_flush_batch(void)
 	rt_do_flush(!in_softirq());
 }
 
-/*
- * We change rt_genid and let gc do the cleanup
- */
-static void rt_secret_rebuild(unsigned long __net)
-{
-	struct net *net = (struct net *)__net;
-	rt_cache_invalidate(net);
-	mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
-static void rt_secret_rebuild_oneshot(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-	rt_cache_invalidate(net);
-	if (ip_rt_secret_interval)
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
-}
-
 static void rt_emergency_hash_rebuild(struct net *net)
 {
-	if (net_ratelimit()) {
+	if (net_ratelimit())
 		printk(KERN_WARNING "Route hash chain too long!\n");
-		printk(KERN_WARNING "Adjust your secret_interval!\n");
-	}
-
-	rt_secret_rebuild_oneshot(net);
+	rt_cache_invalidate(net);
 }
 
 /*
@@ -3101,48 +3079,6 @@ static int ipv4_sysctl_rtcache_flush(ctl_table *__ctl, int write,
 	return -EINVAL;
 }
 
-static void rt_secret_reschedule(int old)
-{
-	struct net *net;
-	int new = ip_rt_secret_interval;
-	int diff = new - old;
-
-	if (!diff)
-		return;
-
-	rtnl_lock();
-	for_each_net(net) {
-		int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-		long time;
-
-		if (!new)
-			continue;
-
-		if (deleted) {
-			time = net->ipv4.rt_secret_timer.expires - jiffies;
-
-			if (time <= 0 || (time += diff) <= 0)
-				time = 0;
-		} else
-			time = new;
-
-		mod_timer(&net->ipv4.rt_secret_timer, jiffies + time);
-	}
-	rtnl_unlock();
-}
-
-static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
-					  void __user *buffer, size_t *lenp,
-					  loff_t *ppos)
-{
-	int old = ip_rt_secret_interval;
-	int ret = proc_dointvec_jiffies(ctl, write, buffer, lenp, ppos);
-
-	rt_secret_reschedule(old);
-
-	return ret;
-}
-
 static ctl_table ipv4_route_table[] = {
 	{
 		.procname	= "gc_thresh",
@@ -3251,13 +3187,6 @@ static ctl_table ipv4_route_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
-	{
-		.procname	= "secret_interval",
-		.data		= &ip_rt_secret_interval,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= ipv4_sysctl_rt_secret_interval,
-	},
 	{ }
 };
 
@@ -3336,34 +3265,18 @@ static __net_initdata struct pernet_operations sysctl_route_ops = {
 };
 #endif
 
-
-static __net_init int rt_secret_timer_init(struct net *net)
+static __net_init int rt_genid_init(struct net *net)
 {
-	atomic_set(&net->ipv4.rt_genid,
-			(int) ((num_physpages ^ (num_physpages>>8)) ^
-			(jiffies ^ (jiffies >> 7))));
-
-	net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
-	net->ipv4.rt_secret_timer.data = (unsigned long)net;
-	init_timer_deferrable(&net->ipv4.rt_secret_timer);
-
-	if (ip_rt_secret_interval) {
-		net->ipv4.rt_secret_timer.expires =
-			jiffies + net_random() % ip_rt_secret_interval +
-			ip_rt_secret_interval;
-		add_timer(&net->ipv4.rt_secret_timer);
-	}
+	/*
+	 * This just serves to start off each new net namespace
+	 * with a non-zero rt_genid value, making it harder to guess
+	 */
+	rt_cache_invalidate(net);
 	return 0;
 }
 
-static __net_exit void rt_secret_timer_exit(struct net *net)
-{
-	del_timer_sync(&net->ipv4.rt_secret_timer);
-}
-
-static __net_initdata struct pernet_operations rt_secret_timer_ops = {
-	.init = rt_secret_timer_init,
-	.exit = rt_secret_timer_exit,
+static __net_initdata struct pernet_operations rt_genid_ops = {
+	.init = rt_genid_init,
 };
 
 
@@ -3424,9 +3337,6 @@ int __init ip_rt_init(void)
 	schedule_delayed_work(&expires_work,
 		net_random() % ip_rt_gc_interval + ip_rt_gc_interval);
 
-	if (register_pernet_subsys(&rt_secret_timer_ops))
-		printk(KERN_ERR "Unable to setup rt_secret_timer\n");
-
 	if (ip_rt_proc_init())
 		printk(KERN_ERR "Unable to create route proc files\n");
 #ifdef CONFIG_XFRM
@@ -3438,6 +3348,7 @@ int __init ip_rt_init(void)
 #ifdef CONFIG_SYSCTL
 	register_pernet_subsys(&sysctl_route_ops);
 #endif
+	register_pernet_subsys(&rt_genid_ops);
 	return rc;
 }
 

^ permalink raw reply related

* Re: [PATCH v21 020/100] c/r: documentation
From: Randy Dunlap @ 2010-05-06 20:27 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrew Morton, containers, linux-kernel, Serge Hallyn,
	Matt Helsley, Pavel Emelyanov, linux-api, linux-mm, linux-fsdevel,
	netdev, Dave Hansen
In-Reply-To: <1272723382-19470-21-git-send-email-orenl@cs.columbia.edu>

On Sat,  1 May 2010 10:15:02 -0400 Oren Laadan wrote:

> Covers application checkpoint/restart, overall design, interfaces,
> usage, shared objects, and and checkpoint image format.
> 
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> Tested-by: Serge E. Hallyn <serue@us.ibm.com>
> ---
>  Documentation/checkpoint/checkpoint.c      |   38 +++
>  Documentation/checkpoint/readme.txt        |  370 ++++++++++++++++++++++++++++
>  Documentation/checkpoint/self_checkpoint.c |   69 +++++
>  Documentation/checkpoint/self_restart.c    |   40 +++
>  Documentation/checkpoint/usage.txt         |  247 +++++++++++++++++++
>  5 files changed, 764 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/checkpoint/checkpoint.c
>  create mode 100644 Documentation/checkpoint/readme.txt
>  create mode 100644 Documentation/checkpoint/self_checkpoint.c
>  create mode 100644 Documentation/checkpoint/self_restart.c
>  create mode 100644 Documentation/checkpoint/usage.txt

> diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
> new file mode 100644
> index 0000000..4fa5560
> --- /dev/null
> +++ b/Documentation/checkpoint/readme.txt
> @@ -0,0 +1,370 @@
> +
...
> +In contrast, when checkpointing a subtree of a container it is up to
> +the user to ensure that dependencies either don't exist or can be
> +safely ignored. This is useful, for instance, for HPC scenarios or
> +even a user that would like to periodically checkpoint a long-running

               who

> +batch job.
> +
...

> +
> +Checkpoint image format
> +=======================
> +
...

> +
> +The container configuration section containers information that is

                                       contains

> +global to the container. Security (LSM) configuration is one example.
> +Network configuration and container-wide mounts may also go here, so
> +that the userspace restart coordinator can re-create a suitable
> +environment.
> +
...

> +
> +Then the state of all tasks is saved, in the order that they appear in
> +the tasks array above. For each state, we save data like task_struct,
> +namespaces, open files, memory layout, memory contents, cpu state,

                                                           CPU (throughout, please)

> +signals and signal handlers, etc. For resources that are shared among
> +multiple processes, we first checkpoint said resource (and only once),
> +and in the task data we give a reference to it. More about shared
> +resources below.
> +
...

> +
> +Shared objects
> +==============
> +
> +Many resources may be shared by multiple tasks (e.g. file descriptors,
> +memory address space, etc), or even have multiple references from

                         etc.),

> +other resources (e.g. a single inode that represents two ends of a
> +pipe).
> +
...

> +Memory contents format
> +======================
> +
> +The memory contents of a given memory address space (->mm) is dumped

                                                              are (I think)

> +as a sequence of vma objects, represented by 'struct ckpt_hdr_vma'.
> +This header details the vma properties, and a reference to a file
> +(if file backed) or an inode (or shared memory) object.
> +
> +The vma header is followed by the actual contents - but only those
> +pages that need to be saved, i.e. dirty pages. They are written in
> +chunks of data, where each chunks contains a header that indicates

                              chunk

> +that number of pages in the chunk, followed by an array of virtual

   the

> +addresses and then an array of actual page contents. The last chunk
> +holds zero pages.
> +
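
The chunk layout described above can be sketched as a small userspace walker.
This is only an illustration under stated assumptions: the field widths (a
32-bit page count, 64-bit virtual addresses), the 4096-byte page size, and the
name count_chunk_pages() are hypothetical and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration */

/*
 * Walk a buffer of chunks and count the pages it carries.
 * Assumed layout per chunk: uint32_t nr_pages, then nr_pages uint64_t
 * virtual addresses, then nr_pages * SKETCH_PAGE_SIZE bytes of contents.
 * A chunk with nr_pages == 0 terminates the sequence.
 * Returns the total page count, or -1 if the buffer is truncated.
 */
static long count_chunk_pages(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	long total = 0;

	for (;;) {
		uint32_t nr_pages;
		size_t body;

		if (len - off < sizeof(nr_pages))
			return -1;
		memcpy(&nr_pages, buf + off, sizeof(nr_pages));
		off += sizeof(nr_pages);

		if (nr_pages == 0)
			return total;	/* last chunk holds zero pages */

		/* address array followed by the page contents */
		body = (size_t)nr_pages * (sizeof(uint64_t) + SKETCH_PAGE_SIZE);
		if (len - off < body)
			return -1;
		off += body;
		total += (long)nr_pages;
	}
}
```

A reader built this way never needs to know the total page count up front; it
simply stops at the zero-page chunk.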
...

> +Kernel interfaces
> +=================
> +
> +* To checkpoint a vma, the 'struct vm_operations_struct' needs to
> +  provide a method ->checkpoint:
> +    int checkpoint(struct ckpt_ctx *, struct vma_struct *)
> +  Restart requires a matching (exported) restore:
> +    int restore(struct ckpt_ctx *, struct mm_struct *, struct ckpt_hdr_vma *)
> +
> +* To checkpoint a file, the 'struct file_operations' needs to provide
> +  the methods ->checkpoint and ->collect:
> +    int checkpoint(struct ckpt_ctx *, struct file *)
> +    int collect(struct ckpt_ctx *, struct file *)
> +  Restart requires a matching (exported) restore:
> +    int restore(struct ckpt_ctx *, struct ckpt_hdr_file *)
> +  For most file systems, generic_file_{checkpoint,restore}() can be
> +  used.
> +
> +* To checkpoint a socket, the 'struct proto_ops' needs to provide

     To checkpoint/restart a socket,

> +  the methods ->checkpoint, ->collect and ->restore:
> +    int checkpoint(struct ckpt_ctx *ctx, struct socket *sock);
> +    int collect(struct ckpt_ctx *ctx, struct socket *sock);
> +    int restore(struct ckpt_ctx *, struct socket *sock, struct ckpt_hdr_socket *h)


> diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
> new file mode 100644
> index 0000000..c6fc045
> --- /dev/null
> +++ b/Documentation/checkpoint/usage.txt
> @@ -0,0 +1,247 @@
> +
> +	      How to use Checkpoint-Restart
> +	=========================================
> +
> +
> +API
> +===
> +
> +The API consists of three new system calls:
> +
> +* long checkpoint(pid_t pid, int fd, unsigned long flag, int logfd);

                                                      flags,

> +
> + Checkpoint a (sub-)container whose root task is identified by @pid,
> + to the open file indicated by @fd. If @logfd isn't -1, it indicates
> + an open file to which error and debug messages are written. @flags
> + may be one or more of:
> +   - CHECKPOINT_SUBTREE : allow checkpoint of sub-container
> + (other value are not allowed).
> +
> + Returns: a positive checkpoint identifier (ckptid) upon success, 0 if
> + it returns from a restart, and -1 if an error occurs. The ckptid will
> + uniquely identify a checkpoint image, for as long as the checkpoint
> + is kept in the kernel (e.g. if one wishes to keep a checkpoint, or a
> + partial checkpoint, residing in kernel memory).
> +
> +* long sys_restart(pid_t pid, int fd, unsigned long flags, int logfd);
> +
> + Restart a process hierarchy from a checkpoint image that is read from
> + the blob stored in the file indicated by @fd.  If @logfd isn't -1, it
> + indicates an open file to which error and debug messages are written.
> + @flags will have future meaning (must be 0 for now). @pid indicates
> + the root of the hierarchy as seen in the coordinator's pid-namespace,
> + and is expected to be a child of the coordinator. @flags may be one
> + or more of:
> +   - RESTART_TASKSELF : (self) restart of a single process
> +   - RESTART_FROEZN : processes remain frozen once restart completes

                FROZEN ?

> +   - RESTART_GHOST : process is a ghost (placeholder for a pid)

about @flags:  Above says both of these:
a) @flags will have future meaning (must be 0 for now)
b) @flags may be one or more of:

so please decide which one it is ;)

> + (Note that this argument may mean 'ckptid' to identify an in-kernel
> + checkpoint image, with some @flags in the future).
> +
> + Returns: -1 if an error occurs, 0 on success when restarting from a
> + "self" checkpoint, and return value of system call at the time of the
> + checkpoint when restarting from an "external" checkpoint.
> +
...
> +
> +Sysctl/proc
> +===========
> +
> +/proc/sys/kernel/ckpt_unpriv_allowed		[default = 1]
> +  controls whether c/r operation is allowed for unprivileged users

                      C/R

> +
> +
> +Operation
> +=========
> +
> +The granularity of a checkpoint usually is a process hierarchy. The
> +'pid' argument is interpreted in the caller's pid namespace. So to
> +checkpoint a container whose init task (pid 1 in that pidns) appears
> +as pid 3497 the caller's pidns, the caller must use pid 3497. Passing
> +pid 1 will attempt to checkpoint the caller's container, and if the
> +caller isn't privileged and init is owned by root, it will fail.
> +
> +Unless the CHECKPOINT_SUBTREE flag is set, if the caller passes a pid
> +which does not refer to a container's init task, then sys_checkpoint()
> +would return -EINVAL.

   returns -EINVAL.

...

> +
> +
> +User tools
> +==========
> +
> +* checkpoint(1): a tool to perform a checkpoint of a container/subtree
> +* restart(1): a tool to restart a container/subtree
> +* ckptinfo: a tool to examine a checkpoint image
> +
> +It is best to use the dedicated user tools for checkpoint and restart.
> +
> +If you insist, then here is a code snippet that illustrates how a
> +checkpoint is initiated by a process inside a container - the logic is
> +similar to fork():
> +	...
> +	ckptid = checkpoint(0, ...);
> +	switch (crid) {

	       (ckptid) ?

> +	case -1:
> +		perror("checkpoint failed");
> +		break;
> +	default:
> +		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", ret);

s/ret/ckptid/ ?

> +		/* proceed with execution after checkpoint */
> +		...
> +		break;
> +	case 0:
> +		fprintf(stderr, "returned after restart\n");
> +		/* proceed with action required following a restart */
> +		...
> +		break;
> +	}
> +	...
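
For reference, the three-way return handling described in the snippet can be
sketched as below, with the variable named consistently. checkpoint() is not
available outside this patch set, so the sketch only classifies a return value;
the enum and the function name classify_checkpoint() are hypothetical.

```c
/* Hypothetical labels for the three documented outcomes. */
enum ckpt_outcome {
	CKPT_FAILED,	/* checkpoint() returned -1 */
	CKPT_RESTARTED,	/* 0: we are returning from a restart */
	CKPT_TAKEN	/* positive ckptid: checkpoint succeeded */
};

/*
 * Classify the value returned by checkpoint(), following the usage text:
 * a positive ckptid on success, 0 when returning from a restart, and -1
 * on error.  Any negative value is treated as failure here, which is an
 * assumption of this sketch.
 */
static enum ckpt_outcome classify_checkpoint(long ckptid)
{
	if (ckptid < 0)
		return CKPT_FAILED;
	if (ckptid == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}
```

A caller would switch on the result the same way the fork()-style snippet
switches on ckptid.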
> +
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> +	...
> +	if (restart(pid, ...) < 0)
> +		perror("restart failed");
> +	/* only get here if restart failed */
> +	...
> +
> +Note, that the code also supports "self" checkpoint, where a process

   Note that

> +can checkpoint itself. This mode does not capture the relationships of
> +the task with other tasks, or any shared resources. It is useful for
> +application that wish to be able to save and restore their state.

   applications

> +They will either not use (or care about) shared resources, or they
> +will be aware of the operations and adapt suitably after a restart.
> +The code above can also be used for "self" checkpoint.
> +
> +
> +You may find the following sample programs useful:
> +
> +* checkpoint.c: accepts a 'pid' and checkpoint that task to stdout

                                       checkpoints

> +* self_checkpoint.c: a simple test program doing self-checkpoint
> +* self_restart.c: restarts a (self-) checkpoint image from stdin
> +
> +See also the utilities 'checkpoint' and 'restart' (from user-cr).
> +
> +
> +"External" checkpoint
> +=====================
> +
> +To do "external" checkpoint, you need to first freeze that other task
> +either using the freezer cgroup.

eh?  cannot parse that.

> +
> +Restart does not preserve the original PID yet, (because we haven't
> +solved yet the fork-with-specific-pid issue). In a real scenario, you
> +probably want to first create a new names space, and have the init

                                       namespace,

> +task there call 'sys_restart()'.
> +
> +I tested it this way:

...

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 20:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273176614.2222.21.camel@edumazet-laptop>

On Thu, May 06, 2010 at 10:10:14PM +0200, Eric Dumazet wrote:
> 
> > Doing that doesn't solve my aim however, which is to avoid performing rt_genid
> > updates when no one is attacking you at all.  I completely agree that we can
> > start the gen_id at some random value (by forcing an initial invalidation),
> > however.  Beyond that however, if someone is managing to guess our secret value,
> > then we need to make our secret value more complex to determine.  Perhaps given
> > the reduction in the number of times we need to iterate our gen_id with the
> > timer gone, we can use something more heavyweight to determine the the hash
> > secret (the cprng perhaps?).
> 
> Secrets that dont change are known to be honey pots for hackers.
> 
> I just dont see why we want to risk security regressions for something
> that proved to work well.
> 
Because we have two ways of doing the same thing now, and I don't see why we
should maintain code for both.  I get that a timer based invalidation works
well.  So does the statistical analysis.

> Cache invalidation is just a genid change nowadays, and dont have side
> effects.
> 
I disagree with this.  Changing a genid in and of itself is fast, yes, but it
creates a need for the cache to get repopulated, sending packets through the
slow routing path.  On high volume systems this causes a performance
degradation.  The timer approach makes that a periodic degradation, one that I
would like to avoid if possible.

I get that hackers like secrets to stay unchanged so that they can figure out
what they are.  It's not like we're leaving ourselves vulnerable here, we're just
rebuilding only when we need it, not every X seconds.  And if someone is
_really_ in need of a periodic rebuild, and can cope with the performance hit,
then they can still do that from user space, as I've pointed out.  We just don't
need to keep the code in the kernel any more.
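
The "rebuild only when we need it" idea can be sketched in userspace as a
simple threshold check. This is a deliberate simplification and not the kernel
code: the real limit is derived statistically from the cache, while the sketch
uses the fixed default of 20 from the patch context (rt_chain_length_max). The
function name needs_emergency_rebuild() is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Default taken from the patch context: rt_chain_length_max = 20. */
#define CHAIN_LENGTH_MAX 20u

/*
 * Return true if any hash chain is longer than the allowed maximum,
 * i.e. the condition under which an on-demand invalidation would fire.
 * With no periodic timer, a cache whose chains stay short is left alone.
 */
static bool needs_emergency_rebuild(const unsigned int *chain_len,
				    size_t nbuckets, unsigned int max)
{
	size_t i;

	for (i = 0; i < nbuckets; i++)
		if (chain_len[i] > max)
			return true;
	return false;
}
```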

> Considering we do cache invalidation when routes are changed anyway, I
> dont get why we should avoid the invalidation once every xxx seconds...
> 
Who says routes are going to change that often?  I know you don't believe that
the former is a substitute for the latter.  As for why we should avoid periodic
invalidation, I've said it several times now.

> If you believe this cache invalidation has problems, maybe we should
> address them and not hide them ?
> 
Now you're just being intentionally obtuse.  Eric, you know perfectly good and
well what my reasons are for wanting to remove the rt_secret timer.  It's why we
did the statistical analysis code in the first place.  There's just not a large
need for it.  If you want to do periodic invalidation, fine, do it.  Just do it
in user space.  We have an on-demand strategy in the kernel that has been
working well for quite some time, and is superior in performance for 99% of the
use cases out there.  So let's lighten the maintenance workload for the code
that's not strictly needed anymore by getting rid of it.


^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-06 20:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Dmitry Torokhov, Christoph Hellwig, pv-drivers@vmware.com,
	Pankaj Thakkar, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100505105253.0a8bc465@nehalam>

On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote:
> Let me put it bluntly. Any design that allows external code to run
> in the kernel is not going to be accepted.  Out of tree kernel modules are enough
> of a pain already, why do you expect the developers to add another
> interface.

Exactly.  Until our friends at VMware get this basic fact it's useless
to continue arguing.

Pankaj and Dmitry: you're fine to waste your time on this, but it's not
going to go anywhere until you address that fundamental problem.  The
first thing you need to fix in your architecture is to integrate the VF
function code into the kernel tree, and we can work from there.

Please post patches doing this if you want to resume the discussion.

^ permalink raw reply

* Re: RTL-8110SC lockup with r8169
From: Francois Romieu @ 2010-05-06 20:20 UTC (permalink / raw)
  To: Pádraig Brady; +Cc: netdev, Glen Gray
In-Reply-To: <4BE1973D.8080502@draigBrady.com>

Pádraig Brady <P@draigBrady.com> :
[...]
> However the above code wasn't in the 2.6.32.10-90.fc12 driver we used.
> Also I've back-ported the latest r8169 driver from git to our kernel
> and it still has the same issue.

"latest" as "includes 908ba2bfd22253f26fa910cd855e4ccffb1467d0" ?

Otherwise you may save some time and try directly the backport at :
http://userweb.kernel.org/~romieu/r8169/2.6.32.11-99.fc12/

> # dmesg | grep 8169
> # lspci -n | grep -v 8086:
> 01:04.0 0200: 10ec:8167 (rev 10)

8167, how come...

Which hardware (lspci) is the host computer made of ?

--
Ueimor

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-06 20:19 UTC (permalink / raw)
  To: Pankaj Thakkar
  Cc: Gleb Natapov, Christoph Hellwig, Dmitry Torokhov,
	pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100506180411.GC25364@vmware.com>

On Thu, May 06, 2010 at 11:04:11AM -0700, Pankaj Thakkar wrote:
> Plugin is x86 or x64 machine code. You write the plugin in C and compile it using gcc/ld to get the object file, we map the relevant sections only to the OS space. 

Which is simply not supportable for a cross-platform operating system
like Linux.

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-06 20:17 UTC (permalink / raw)
  To: Pankaj Thakkar
  Cc: Christoph Hellwig, Dmitry Torokhov, pv-drivers@vmware.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <F1354E79A137A24CBA60059AA65CB1B802A235C535@EXCH-MBX-2.vmware.com>

On Wed, May 05, 2010 at 10:47:10AM -0700, Pankaj Thakkar wrote:
> > Forget about the licensing.  Loading binary blobs written to a shim
> > layer is a complete pain in the ass and totally unsupportable, and
> > also uninteresting because of the overhead.
> 
> [PT] Why do you think it is unsupportable? How different is it from any module written against a well maintained interface? What overhead are you talking about?

We only support in-kernel drivers, everything else is subject to changes
in the kernel API and ABI.  What you do is basically introducing another
wrapper layer not allowing full access to the normal Linux API.  People
have tried this before and we're not willing to add it.  Do a little
research on Project UDI if you're curious.

> >  (1) move the limited VF drivers directly into the kernel tree,
> >      talk to them through a normal ops vector
> [PT] This assumes that all the VF drivers would always be available.

Yes, absolutely.  Just as we assume that for every other driver.

> Also we have to support windows and our current design supports it nicely in an OS agnostic manner.

And that's not something we care about at all.  The Linux kernel has
traditionally taken a very hostile position towards cross-platform drivers,
for reasons that have been explained well on many occasions.

> >  (2) get rid of the whole shim crap and instead integrate the limited
> >      VF driver with the full VF driver we already have, instead of
> >      duplicating the code
> [PT] Having a full VF driver adds a lot of dependency on the guest VM and this is what NPA tries to avoid.

Yes, of course it does.  It's a normal driver at that point, which it
should have been from day one.

> >  (3) don't make the PV to VF integration VMware-specific but also
> >      provide an open reference implementation like virtio.  We're not
> >      going to add massive amount of infrastructure that is not actually
> >      useable in a free software stack.
> [PT] Today this is tied to vmxnet3 device and is intended to work on ESX hypervisor only (vmxnet3 works on VMware hypervisor only). All the loading support is inside the ESX hypervisor. I am going to post the interface between the shell and the plugin soon and you can see that there is not a whole lot of dependency or infrastructure requirements from the Linux kernel. Please keep in mind that we don't use Linux as a hypervisor but as a guest VM.

But we use Linux as the hypervisor, too.  So if you want to target a
major infrastructure you might better make it available for that case.

^ permalink raw reply

* Re: r8169 transmit queue time outs
From: Francois Romieu @ 2010-05-06 20:10 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: netdev
In-Reply-To: <20100506141715.GC4480@ihatethathostname.lab.bos.redhat.com>

Kyle McMartin <kmcmartin@redhat.com> :
[...]
> Some of our users have been seeing their r8169 cards just up and stop
> transmitting packets pretty quickly after boot with recent kernels.
[...]
> Pid: 0, comm: swapper Not tainted 2.6.31.5-127.fc12.i686.PAE #1

Can they upgrade to 2.6.32.11-99.fc12.i686 and try an out-of-tree build
of the driver at http://userweb.kernel.org/~romieu/r8169/2.6.32.11-99.fc12/ ?

It should be quite close to the current git kernel.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Eric Dumazet @ 2010-05-06 20:10 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20100506195442.GC5063@hmsreliant.think-freely.org>

> Doing that doesn't solve my aim however, which is to avoid performing rt_genid
> updates when no one is attacking you at all.  I completely agree that we can
> start the gen_id at some random value (by forcing an initial invalidation),
> however.  Beyond that however, if someone is managing to guess our secret value,
> then we need to make our secret value more complex to determine.  Perhaps given
> the reduction in the number of times we need to iterate our gen_id with the
> timer gone, we can use something more heavyweight to determine the the hash
> secret (the cprng perhaps?).

Secrets that don't change are known to be honey pots for hackers.

I just don't see why we want to risk security regressions for something
that proved to work well.

Cache invalidation is just a genid change nowadays, and doesn't have side
effects.

Considering we do cache invalidation when routes are changed anyway, I
don't get why we should avoid the invalidation once every xxx seconds...

If you believe this cache invalidation has problems, maybe we should
address them and not hide them?

^ permalink raw reply

* Re: [PATCH 0/6] netns support in the kobject layer
From: Greg KH @ 2010-05-06 20:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Tejun Heo,
	Cornelia Huck, Eric Dumazet, Benjamin LaHaise, Serge Hallyn,
	netdev, David Miller
In-Reply-To: <m1d3xb4239.fsf@fess.ebiederm.org>

On Tue, May 04, 2010 at 05:35:54PM -0700, Eric W. Biederman wrote:
> 
> With the tagged sysfs support finally merged into Greg's tree,
> it is time for the last little bits of work to get the kobject
> layer and network namespaces to play together properly.
> 
> These patches are roughly evenly divided between network layer work
> and sysfs layer work.  Last time this conundrum came up I believe
> we decided that the easiest way to handle this was for Greg to carry
> all of the patches.  David, Greg does that still make sense?

That's fine, if I get David's ack on these.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH] ipv4: remove ip_rt_secret timer
From: Neil Horman @ 2010-05-06 19:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1273170813.2222.10.camel@edumazet-laptop>

On Thu, May 06, 2010 at 08:33:33PM +0200, Eric Dumazet wrote:
> Le jeudi 06 mai 2010 à 14:02 -0400, Neil Horman a écrit :
> > On Thu, May 06, 2010 at 07:32:35PM +0200, Eric Dumazet wrote:
> > > Le jeudi 06 mai 2010 à 13:16 -0400, Neil Horman a écrit :
> > > > A while back there was a discussion regarding the rt_secret_interval timer.
> > > > Given that we've had the ability to do emergency route cache rebuilds for awhile
> > > > now, based on a statistical analysis of the various hash chain lengths in the
> > > > cache, the use of the flush timer is somewhat redundant.  This patch removes the
> > > > rt_secret_interval sysctl, allowing us to rely solely on the statistical
> > > > analysis mechanism to determine the need for route cache flushes.
> > > > 
> > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > 
> > > > 
> > > 
> > > Nice cleanup try Neil, but this gives to attackers more time to hit the
> > > cache (infinite time should be enough as a matter of fact ;) )
> > > 
> > Not sure I follow what your complaint is.  I get that this gives attackers
> > plenty of time to try to attack the cache, but thats rather the point of the
> > statistics gathering for the cache, and why I don't think we need the secret
> > timer any more.   With the statistical analysis we do on the route cache every
> > gc cycle, we can tell if an attacker has guessed our rt_genid value, and is
> > making any chains in the cache abnormally long.  Thats when we do the rebuild,
> > modifying the rt_genid, forcing the attacker to re-discover it (which should be
> > difficult).  Theres no need to change this periodically if you're not being
> > attacked.
> >  
> > > Hints : 
> > > 
> > > - What is the initial value of rt_genid ?
> > > 
> > > - How/When is it changed (full 32 bits are changed or small
> > > perturbations ? check rt_cache_invalidate())
> > > 
> > /*
> >  * Pertubation of rt_genid by a small quantity [1..256]
> >  * Using 8 bits of shuffling ensure we can call rt_cache_invalidate()
> >  * many times (2^24) without giving recent rt_genid.
> >  * Jenkins hash is strong enough that litle changes of rt_genid are OK.
> >  */
> > static void rt_cache_invalidate(struct net *net)
> > {
> >         unsigned char shuffle;
> > 
> >         get_random_bytes(&shuffle, sizeof(shuffle));
> >         atomic_add(shuffle + 1U, &net->ipv4.rt_genid);
> > }
> > 
> > Clearly, its small changes.  To paraphrase the comment, Changes to rt_genid are
> > small enough to be confident that we don't repetatively use a gen_id often, but
> > sufficiently random that attackers cannot easily guess the next gen_id based on
> > the current value.  Both the timer and the statistics code use this invalidation
> > technique previously, and the latter continues to do so.
> > 
> > I've not changed anything regarding how we
> > invalidate, only when we choose to invalidate.  Invalidation can lead to
> > slowdowns during routing, since it send frames through the slow path after an
> > invalidation.  It behooves us to avoid preforming this invalidation without
> > need, and since we have a mechanism in place to do that invalidation specfically
> > when we need to, lets get rid of the code that handles that, and make it a bit
> > cleaner.  If there are users that feel strongly that they need to defend against
> > potential attacks by periodically changing their rt_genid, its still possible.
> > Its as simple as putting:
> > echo -1 > /proc/sys/net/ipv4/route/flush
> > in a cron job.
> > 
> 
> I have some customers that will simply kill me if their routing cache is
> disabled by a smart attack, slowing down their server by a 4x factor.
> 
> I know its possible, it has been done.
> 
Ok, I can understand that, but I don't think a single user profile needs to
dictate policy here.  The statistics code has a configurable threshold for where
to disable caching, so I would think you can just set that to be high.  And if
you need even more than that, you can do as I suggested, adding a:
echo -1 > /proc/sys/net/ipv4/route/flush
to a cron job on a short interval.  That will perform the _exact_ same operation
that the in-kernel timer did previously.  And the other 99% of our users don't
have to suffer a periodic cache invalidation when they don't need it.
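
For readers following along, here is a rough userspace sketch of the
generation-counter idea discussed above. It is not the kernel code: the toy_*
names are hypothetical, rand() stands in for get_random_bytes(), and a plain
integer stands in for the atomic net->ipv4.rt_genid. It assumes that a cached
entry is considered stale as soon as its stamped genid differs from the current
one, which is why every invalidation pushes traffic back through the slow path.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-namespace generation id. */
struct toy_net {
	uint32_t rt_genid;
};

/* Hypothetical stand-in for a cached route entry. */
struct toy_entry {
	uint32_t genid;	/* generation the entry was created in */
};

/*
 * Same shape as the quoted rt_cache_invalidate(): bump the generation
 * id by a small random step in the range 1..256.
 * rand() is only a placeholder for get_random_bytes().
 */
static void toy_cache_invalidate(struct toy_net *net)
{
	unsigned char shuffle = (unsigned char)(rand() & 0xff);

	net->rt_genid += shuffle + 1U;
}

/* Stamp a freshly created entry with the current generation. */
static void toy_entry_stamp(const struct toy_net *net, struct toy_entry *e)
{
	e->genid = net->rt_genid;
}

/* An entry is usable only while its stamp matches the current generation. */
static int toy_entry_is_fresh(const struct toy_net *net,
			      const struct toy_entry *e)
{
	return e->genid == net->rt_genid;
}
```

After a single toy_cache_invalidate() call, every previously stamped entry
fails toy_entry_is_fresh(), without the cache having to be walked.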

> For a quiet machine, the possible range of rt_genid values is known to the
> attacker, and the hash size is also known. That's really too easy for the bad
> guys...
> 
Well, ok, I can buy your argument, but I think the problem you are describing is
orthogonal to what my change here does.  If it's too easy for attackers to guess
our genid secret, then we need to make it more complex to guess, not force
everyone to change it more frequently.

> Neil, I think your cleanup should stay a cleanup for the moment, or you
> must make sure rt_genid initial value is not 0 (read your patch
> again...)
> 
Yeah, I get that we start the gen_id at 0; that doesn't come from this change,
it has always been that way in the net_alloc initialization.  I certainly don't
have a problem with randomizing that value during initialization.  I'll
post an updated patch shortly.
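
As an illustration only (this is not the updated patch), a random starting
value could look roughly like the userspace sketch below. The demo_* names are
hypothetical, /dev/urandom stands in for the kernel's get_random_bytes(), and
drawing a full 32 bits rather than a single 1..256 step is an assumption made
for the sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-namespace generation id. */
struct demo_net {
	uint32_t rt_genid;
};

/* Assemble a 32-bit starting value from four random bytes. */
static uint32_t demo_genid_from_bytes(const unsigned char rnd[4])
{
	return (uint32_t)rnd[0] |
	       ((uint32_t)rnd[1] << 8) |
	       ((uint32_t)rnd[2] << 16) |
	       ((uint32_t)rnd[3] << 24);
}

/*
 * Start the generation id at a random value instead of 0.
 * Returns 0 on success, -1 if no random bytes could be read.
 */
static int demo_genid_init(struct demo_net *net)
{
	unsigned char rnd[4];
	size_t n;
	FILE *f = fopen("/dev/urandom", "rb");

	if (!f)
		return -1;
	n = fread(rnd, 1, sizeof(rnd), f);
	fclose(f);
	if (n != sizeof(rnd))
		return -1;

	net->rt_genid = demo_genid_from_bytes(rnd);
	return 0;
}
```

With a random start, an attacker watching a quiet machine can no longer assume
the counter is still at or near 0.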

> I agree we don't need the complex timer logic anymore. We could keep the
> secret_interval (default to 0 if you really want) and force a
> rt_cache_invalidate() call once in a while from the periodic
> rt_check_expire() for example.
> 
Doing that doesn't achieve my aim, however, which is to avoid performing rt_genid
updates when no one is attacking you at all.  I completely agree that we can
start the gen_id at some random value (by forcing an initial invalidation).
Beyond that, if someone is managing to guess our secret value,
then we need to make our secret value more complex to determine.  Perhaps, given
the reduction in the number of times we need to iterate our gen_id with the
timer gone, we can use something more heavyweight to determine the hash
secret (the cprng perhaps?).

Regards
Neil
> 
> 
> 
