Netdev List


* Re: e1000 full-duplex TCP performance well below wire speed
From: Rick Jones @ 2008-01-31 18:38 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Bruce Allen, Brandeburg, Jesse, netdev, Carsten Aulbert,
	Henning Fehrmann, Bruce Allen
In-Reply-To: <47A20E9E.7070503@intel.com>

> A lot of people tend to forget that the pci-express bus has enough bandwidth on
> first glance - 2.5gbit/sec for 1gbit of traffic, but apart from data going over it
> there is significant overhead going on: each packet requires transmit, cleanup and
> buffer transactions, and there are many irq register clears per second (slow
> ioread/writes). The transactions double for TCP ack processing, and this all
> accumulates and starts to introduce latency, higher cpu utilization etc...

Sounds like tools to show PCI* bus utilization would be helpful...

rick jones


* Re: hard hang through qdisc
From: Patrick McHardy @ 2008-01-31 18:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: hadi, netdev
In-Reply-To: <47A1E556.3050900@trash.net>

Patrick McHardy wrote:
> Andi Kleen wrote:
>
>> Can you please try with the above config?
> 
> 
> I'll give it a try later.

I took all options from that config that seemed possibly
related (qdiscs, no hrtimers, no nohz, slab, ...), but
still can't reproduce it.

Does it also crash if you use more reasonable parameters?


* Re: [PATCH] Disable TSO for non standard qdiscs
From: Patrick McHardy @ 2008-01-31 18:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stephen Hemminger, netdev
In-Reply-To: <20080131190125.GE4671@one.firstfloor.org>

Andi Kleen wrote:
> On Thu, Jan 31, 2008 at 07:21:20PM +0100, Patrick McHardy wrote:
>> Andi Kleen wrote:
>>>> Then change TBF to use skb_gso_segment?  Be careful, the fact that
>>> That doesn't help because it wants to interleave packets
>>> from different streams to get everything fair and smooth. The only 
>>> good way to handle that is to split it up and the simplest way to do 
>>> this is to just tell TCP to not do GSO in the first place.
>>
>> That's not correct, TBF keeps packets strictly ordered unless
> 
> My point was that without TSO different submitters will interleave
> their streams (because they compete about the qdisc submission) 
> and then you end up with a smooth rate over time for all of them.
> 
> If you submit in large chunks only (as TSO does) it will always 
> be more bursty and that works against the TBF goal.
> 
> For a single submitter you would be correct.


The TBF goal is not really fairness among different flows, but
I agree, avoiding TSO in the first place seems to make more sense.
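
The burstiness argument in this exchange can be illustrated with a generic
token bucket. What follows is a minimal user-space sketch, not the kernel's
sch_tbf code: the struct and function names (token_bucket, tb_init,
tb_try_send) and the choice of units are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical token bucket; tokens are counted in bytes. */
struct token_bucket {
	double rate;	/* refill rate in bytes per second */
	double burst;	/* bucket depth in bytes */
	double tokens;	/* tokens currently available */
	double last;	/* time of the last refill, in seconds */
};

static void tb_init(struct token_bucket *tb, double rate, double burst)
{
	assert(rate > 0.0 && burst > 0.0);
	tb->rate = rate;
	tb->burst = burst;
	tb->tokens = burst;	/* start with a full bucket */
	tb->last = 0.0;
}

/* Add the tokens earned since the last refill, capped at the depth. */
static void tb_refill(struct token_bucket *tb, double now)
{
	if (now > tb->last) {
		tb->tokens += (now - tb->last) * tb->rate;
		if (tb->tokens > tb->burst)
			tb->tokens = tb->burst;
		tb->last = now;
	}
}

/* Return true and consume tokens if a packet of len bytes conforms. */
static bool tb_try_send(struct token_bucket *tb, double now, double len)
{
	tb_refill(tb, now);
	if (len > tb->tokens)
		return false;	/* includes any len larger than the depth */
	tb->tokens -= len;
	return true;
}
```

With a rate of 125000 bytes/s (1 Mbit/s) and a 3000-byte bucket, two
1500-byte packets pass immediately and a third has to wait for the refill.
A single 64 KB chunk never conforms in this sketch because it is larger than
the bucket depth, which is one way to see why large TSO chunks and a tightly
configured shaper do not mix well.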


* Re: [PATCH] Disable TSO for non standard qdiscs
From: Andi Kleen @ 2008-01-31 19:25 UTC (permalink / raw)
  To: Rick Jones; +Cc: Andi Kleen, netdev, davem
In-Reply-To: <47A214FE.3050200@hp.com>

> So, at what timescale do people using these qdiscs expect things to 
> appear "smooth?"  64KB of data at GbE speeds is something just north of 
> half a millisecond unless I've botched my units somewhere.

One typical use case for TBF is you talking to a DSL bridge that 
is connected using a GBit Ethernet switch. For these DSL connections it gives
much better behaviour to shape the traffic to slightly below
your external link speed so that you can e.g. prioritize packets properly.
But the actual external link speed is much lower than GbE.
A lot of GbE NICs enable TSO by default.

-Andi
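
The timescale question quoted above is plain arithmetic: the time a chunk
occupies the wire is its size in bits divided by the link rate. A small
sketch, where the function name serialization_ms and the 1 Mbit/s uplink
figure in the note below are assumptions chosen for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Time in milliseconds needed to put `bytes` on a link of `link_bps`. */
static double serialization_ms(uint64_t bytes, double link_bps)
{
	assert(link_bps > 0.0);
	return (double)bytes * 8.0 / link_bps * 1e3;
}
```

serialization_ms(64 * 1024, 1e9) comes to roughly 0.52 ms, matching the
"just north of half a millisecond" estimate. On an assumed 1 Mbit/s uplink
the same 64 KB chunk takes roughly 524 ms, against about 12 ms for a single
1500-byte packet, which is why the shaping rate rather than the GbE rate is
the relevant timescale for the DSL case.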


* Re: e1000 full-duplex TCP performance well below wire speed
From: Kok, Auke @ 2008-01-31 18:47 UTC (permalink / raw)
  To: Rick Jones
  Cc: Bruce Allen, Brandeburg, Jesse, netdev, Carsten Aulbert,
	Henning Fehrmann, Bruce Allen
In-Reply-To: <47A215A8.2090104@hp.com>

Rick Jones wrote:
>> A lot of people tend to forget that the pci-express bus has enough
>> bandwidth on
>> first glance - 2.5gbit/sec for 1gbit of traffic, but apart from data
>> going over it
>> there is significant overhead going on: each packet requires transmit,
>> cleanup and
>> buffer transactions, and there are many irq register clears per second
>> (slow
>> ioread/writes). The transactions double for TCP ack processing, and
>> this all
>> accumulates and starts to introduce latency, higher cpu utilization
>> etc...
> 
> Sounds like tools to show PCI* bus utilization would be helpful...

that would be a hardware profiling thing and highly dependent on the part sticking
out of the slot, vendor bus implementation etc... Perhaps Intel has some tools for
this already but I personally do not know of any :/

Auke
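
As a rough illustration of how the 2.5 Gbit/s headline figure shrinks before
any driver overhead is counted: first-generation PCIe uses 8b/10b line
encoding, so one lane carries at most 2.0 Gbit/s of data per direction, and
each transaction adds framing and header bytes on top of its payload. The
function name and the per-transaction overhead passed in are assumptions for
illustration; the real overhead depends on the chipset and the configured
maximum payload size.

```c
#include <assert.h>

/*
 * Upper bound on payload throughput for a first-generation PCIe link,
 * per direction. overhead_bytes is an assumed per-transaction cost
 * (framing, sequence number, header, CRC).
 */
static double pcie_gen1_payload_bps(unsigned lanes, unsigned payload_bytes,
				    unsigned overhead_bytes)
{
	const double raw_bps = 2.5e9;			/* signalling rate per lane */
	const double data_bps = raw_bps * 8.0 / 10.0;	/* after 8b/10b encoding */
	double efficiency;

	assert(lanes > 0 && payload_bytes > 0);
	efficiency = (double)payload_bytes /
		     (double)(payload_bytes + overhead_bytes);
	return lanes * data_bps * efficiency;
}
```

With one lane, 128-byte payloads and an assumed 24 bytes of overhead per
transaction this gives roughly 1.68 Gbit/s per direction. Descriptor
fetches, write-backs and register accesses come out of that same budget.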


* RE: [PATCH] Disable TSO for non standard qdiscs
From: Waskiewicz Jr, Peter P @ 2008-01-31 18:47 UTC (permalink / raw)
  To: Andi Kleen, Patrick McHardy; +Cc: Stephen Hemminger, netdev
In-Reply-To: <20080131190125.GE4671@one.firstfloor.org>

> My point was that without TSO different submitters will 
> interleave their streams (because they compete about the 
> qdisc submission) and then you end up with a smooth rate over 
> time for all of them.
> 
> If you submit in large chunks only (as TSO does) it will 
> always be more bursty and that works against the TBF goal.
> 
> For a single submitter you would be correct.
> 
> -Andi

TSO by nature is bursty.  But disabling TSO without the option of turning
it back on seems too aggressive to me.  If TSO is interfering with the
effectiveness of a qdisc's traffic shaping, then the user should turn off
TSO via ethtool on the target device.  Some
people may want TSO with certain rate limiter settings.  That way (as
Stephen said) you can even allow the stack to GSO, then segment before
calling hard_start_xmit(), which still saves a number of cycles.

I'd rather not see this patch, but instead a documented recommendation
explaining why TSO could be bad for some traffic shaping qdiscs.  Give the
power to the user to either shoot themselves in the foot or disable TSO
when needed.

-PJ Waskiewicz
<peter.p.waskiewicz.jr@intel.com>


* Re: hard hang through qdisc
From: Andi Kleen @ 2008-01-31 18:55 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Andi Kleen, hadi, netdev
In-Reply-To: <47A2176A.7080606@trash.net>

On Thu, Jan 31, 2008 at 07:46:02PM +0100, Patrick McHardy wrote:
> Patrick McHardy wrote:
>> Andi Kleen wrote:
>>
>>> Can you please try with the above config?
>> I'll give it a try later.
>
>
> I took all options from that config that seemed possibly
> related (qdiscs, no hrtimers, no nohz, slab, ...), but
> still can't reproduce it.

OK, I'll bisect it later then (probably not today).

> Does it also crash if you use more reasonable parameters?

I managed to make it crash with different parameters too,
but with good parameters it did set a qdisc successfully
and appeared to work.

-Andi


* Re: [PATCH] Disable TSO for non standard qdiscs
From: Andi Kleen @ 2008-01-31 19:34 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter P
  Cc: Andi Kleen, Patrick McHardy, Stephen Hemminger, netdev
In-Reply-To: <D5C1322C3E673F459512FB59E0DDC32904700E40@orsmsx414.amr.corp.intel.com>

> TSO by nature is bursty.  But disabling TSO without the option of having
> it on or off to me seems too aggressive.  If someone is using a qdisc
> that TSO is interfering with the effectiveness of the traffic shaping,
> then they should turn off TSO via ethtool on the target device.  Some

The philosophical problem I have with this suggestion is that I expect
the large majority of users to be happier with TSO disabled if they
use non-standard qdiscs, and defaults that do not fit the majority
use case are bad.

Basically you're suggesting that nearly everyone using tc should learn about
another obscure command.

-Andi


* Re: e1000 full-duplex TCP performance well below wire speed
From: Rick Jones @ 2008-01-31 19:07 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Bruce Allen, Brandeburg, Jesse, netdev, Carsten Aulbert,
	Henning Fehrmann, Bruce Allen
In-Reply-To: <47A217CE.4000002@intel.com>

>>Sounds like tools to show PCI* bus utilization would be helpful...
> 
> 
> that would be a hardware profiling thing and highly dependent on the part sticking
> out of the slot, vendor bus implementation etc... Perhaps Intel has some tools for
> this already but I personally do not know of any :/

Small matter of getting specs for the various LBA's (is that the correct 
term? - lower bus adaptors) and then abstracting them a la the CPU perf 
counters as done by say perfmon and then used by papi :)

rick jones


* Re: Null pointer dereference when bringing up bonding device on kernel-2.6.24-2.fc9.i686
From: Jay Vosburgh @ 2008-01-31 19:09 UTC (permalink / raw)
  To: Siim Põder; +Cc: netdev
In-Reply-To: <47A1C45C.5080906@p6drad-teel.net>

Siim Põder <siim@p6drad-teel.net> wrote:

>Jay Vosburgh wrote:
>> Benny Amorsen <benny+usenet@amorsen.dk> wrote:
>> 
>>> https://bugzilla.redhat.com/show_bug.cgi?id=430391
>> 
>> 	I know what this is, I'll fix it.
>
>do you know when this happened, so we would know which kernel is ok to
>use (not to start trying blindly)?

	It was something I changed up during the 2.6.24-rc development
somewhere.  Also, I posted the fix for this a couple days ago, it's
commit 5eb71eec3616b0a62e63197016576a74da240c6b in netdev-2.6#upstream.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* [PATCH] [RFC] 3c509: convert to isa_driver and pnp_driver
From: Ondrej Zary @ 2008-01-31 19:12 UTC (permalink / raw)
  To: netdev; +Cc: Linux Kernel

Hello,
this patch is an (incomplete and probably wrong) attempt to convert 3c509 
driver to isa_driver and pnp_driver model. It works but is not finished. My 
original goal was to make 3c509 resume correctly after hibernation - this 
still does not work (in fact, now it hangs during hibernation - I need to do 
some more debugging). Having udev load the driver automatically when a 3c509 
PnP card is detected would be very nice too (assuming udev can do that for 
PnP cards).

I think that the isa_driver part is mostly OK but don't know about the PnP 
one.

Should I use pnp_driver or pnp_card_driver? 3c509 cards have no subdevices. 
I'm confused about usage of pnp_driver together with word "bios" in ALSA 
drivers - I think that pnp_driver should work with all 3 PnP kinds (isapnp, 
pnpbios, acpipnp).

Is it OK to introduce a limit of 8 3c509 cards? The driver had no explicit 
limit before - but the irq[] module parameter had space for only 8 numbers 
(and xcvr[] had 12?)

--- linux-2.6.24-orig/drivers/net/3c509.c	2008-01-27 19:48:19.000000000 +0100
+++ linux-2.6.24-pentium/drivers/net/3c509.c	2008-01-30 20:44:48.000000000 +0100
@@ -69,10 +69,9 @@
 static int max_interrupt_work = 10;
 
 #include <linux/module.h>
-#ifdef CONFIG_MCA
 #include <linux/mca.h>
-#endif
-#include <linux/isapnp.h>
+#include <linux/isa.h>
+#include <linux/pnp.h>
 #include <linux/string.h>
 #include <linux/interrupt.h>
 #include <linux/errno.h>
@@ -97,20 +96,17 @@
 
 static char version[] __initdata = DRV_NAME ".c:" DRV_VERSION " " DRV_RELDATE " becker@scyld.com\n";
 
-#if defined(CONFIG_PM) && (defined(CONFIG_MCA) || defined(CONFIG_EISA))
-#define EL3_SUSPEND
-#endif
-
 #ifdef EL3_DEBUG
 static int el3_debug = EL3_DEBUG;
 #else
-static int el3_debug = 2;
+static int el3_debug = 20;
 #endif
 
 /* Used to do a global count of all the cards in the system.  Must be
  * a global variable so that the mca/eisa probe routines can increment
  * it */
 static int el3_cards = 0;
+#define EL3_MAX_CARDS 8
 
 /* To minimize the size of the driver source I only define operating
    constants if they are used several times.  You'll need the manual
@@ -119,7 +115,7 @@
 #define EL3_DATA 0x00
 #define EL3_CMD 0x0e
 #define EL3_STATUS 0x0e
-#define	 EEPROM_READ 0x80
+#define	EEPROM_READ 0x80
 
 #define EL3_IO_EXTENT	16
 
@@ -170,7 +166,6 @@
 
 struct el3_private {
 	struct net_device_stats stats;
-	struct net_device *next_dev;
 	spinlock_t lock;
 	/* skb send-queue */
 	int head, size;
@@ -179,12 +174,29 @@
 		EL3_MCA,
 		EL3_PNP,
 		EL3_EISA,
+		EL3_ISA,
 	} type;						/* type of device */
 	struct device *dev;
 };
 static int id_port __initdata = 0x110;	/* Start with 0x110 to avoid new sound cards.*/
-static struct net_device *el3_root_dev;
+static struct net_device *el3_devs[EL3_MAX_CARDS];
+//static __be16 el3_phys_addr[EL3_MAX_CARDS][3];
+
+static int isa_registered;
+#ifdef CONFIG_PNP
+static int pnp_registered;
+static int nopnp;
+#endif
+#ifdef CONFIG_EISA
+static int eisa_registered;
+#endif
+#ifdef CONFIG_MCA
+static int mca_registered;
+#endif
+
 
+static int __init el3_common_init(struct net_device *dev);
+static void el3_common_remove (struct net_device *dev);
 static ushort id_read_eeprom(int index);
 static ushort read_eeprom(int ioaddr, int index);
 static int el3_open(struct net_device *dev);
@@ -199,23 +211,288 @@
 static void el3_down(struct net_device *dev);
 static void el3_up(struct net_device *dev);
 static const struct ethtool_ops ethtool_ops;
-#ifdef EL3_SUSPEND
+#ifdef CONFIG_PM
 static int el3_suspend(struct device *, pm_message_t);
 static int el3_resume(struct device *);
-#else
-#define el3_suspend NULL
-#define el3_resume NULL
 #endif
 
 
 /* generic device remove for all device types */
-#if defined(CONFIG_EISA) || defined(CONFIG_MCA)
+//#if defined(CONFIG_MCA) || defined(CONFIG_EISA)
 static int el3_device_remove (struct device *device);
-#endif
+//#endif
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void el3_poll_controller(struct net_device *dev);
 #endif
 
+#ifdef CONFIG_ISA
+static int __devinit el3_isa_match(struct device *pdev,
+				   unsigned int ndev)
+{
+	struct net_device *dev;
+	struct el3_private *lp;
+	short lrs_state = 0xff, i;
+	int ioaddr, irq, if_port;
+	__be16 phys_addr[3];
+	static int current_tag;
+
+again:
+	/* Select an open I/O location at 0x1*0 to do contention select. */
+	for ( ; id_port < 0x200; id_port += 0x10) {
+		if (!request_region(id_port, 1, "3c509"))
+			continue;
+		outb(0x00, id_port);
+		outb(0xff, id_port);
+		if (inb(id_port) & 0x01){
+			release_region(id_port, 1);
+			break;
+		} else
+			release_region(id_port, 1);
+	}
+	if (id_port >= 0x200) {
+		/* Rare -- do we really need a warning? */
+		printk(" WARNING: No I/O port available for 3c509 activation.\n");
+		return 0;
+	}
+
+	/* Next check for all ISA bus boards by sending the ID sequence to the
+	   ID_PORT.  We find cards past the first by setting the 'current_tag'
+	   on cards as they are found.  Cards with their tag set will not
+	   respond to subsequent ID sequences. */
+
+	outb(0x00, id_port);
+	outb(0x00, id_port);
+	for(i = 0; i < 255; i++) {
+		outb(lrs_state, id_port);
+		lrs_state <<= 1;
+		lrs_state = lrs_state & 0x100 ? lrs_state ^ 0xcf : lrs_state;
+	}
+
+	/* For the first probe, clear all board's tag registers. */
+	if (current_tag == 0)
+		outb(0xd0, id_port);
+	else				/* Otherwise kill off already-found boards. */
+		outb(0xd8, id_port);
+
+	if (id_read_eeprom(7) != 0x6d50)
+		return 0;
+
+	/* Read in EEPROM data, which does contention-select.
+	   Only the lowest address board will stay "on-line".
+	   3Com got the byte order backwards. */
+	for (i = 0; i < 3; i++)
+		phys_addr[i] = htons(id_read_eeprom(i));
+
+#ifdef CONFIG_PNP
+	if (!nopnp) {
+		/* The ISA PnP 3c509 cards respond to the ID sequence.
+		   This check is needed in order not to register them twice. */
+		for (i = 0; i < el3_cards; i++) {
+			if (!memcmp(phys_addr, el3_devs[i]->dev_addr, ETH_ALEN))
+			{
+				if (el3_debug > 3)
+					printk("3c509 with address %02x %02x %02x %02x %02x %02x was found by ISAPnP\n",
+						phys_addr[0] & 0xff, phys_addr[0] >> 8,
+						phys_addr[1] & 0xff, phys_addr[1] >> 8,
+						phys_addr[2] & 0xff, phys_addr[2] >> 8);
+				/* Set the adaptor tag so that the next card can be found. */
+				outb(0xd0 + ++current_tag, id_port);
+				goto again;
+			}
+		}
+	}
+#endif /* CONFIG_PNP */
+
+	{
+		unsigned int iobase = id_read_eeprom(8);
+		if_port = iobase >> 14;
+		ioaddr = 0x200 + ((iobase & 0x1f) << 4);
+	}
+	irq = id_read_eeprom(9) >> 12;
+
+	dev = alloc_etherdev(sizeof (struct el3_private));
+	if (!dev)
+		return -ENOMEM;
+
+	netdev_boot_setup_check(dev);
+
+	/* Set passed-in IRQ or I/O Addr. */
+	if (dev->irq > 1 && dev->irq < 16)
+			irq = dev->irq;
+
+	if (dev->base_addr) {
+		if (dev->mem_end == 0x3c509 	/* Magic key */
+		    && dev->base_addr >= 0x200  &&  dev->base_addr <= 0x3e0)
+			ioaddr = dev->base_addr & 0x3f0;
+		else if (dev->base_addr != ioaddr) {
+			free_netdev(dev);
+			return 0;
+		}
+	}
+
+	if (!request_region(ioaddr, EL3_IO_EXTENT, "3c509")) {
+		free_netdev(dev);
+		return 0;
+	}
+
+	/* Set the adaptor tag so that the next card can be found. */
+	outb(0xd0 + ++current_tag, id_port);
+
+	/* Activate the adaptor at the EEPROM location. */
+	outb((ioaddr >> 4) | 0xe0, id_port);
+
+	EL3WINDOW(0);
+	if (inw(ioaddr) != 0x6d50) {
+		free_netdev(dev);
+		return 0;
+	}
+
+	/* Free the interrupt so that some other card can use it. */
+	outw(0x0f00, ioaddr + WN0_IRQ);
+
+
+	memcpy(dev->dev_addr, phys_addr, sizeof(phys_addr));
+	dev->base_addr = ioaddr;
+	dev->irq = irq;
+	dev->if_port = if_port;
+	lp = netdev_priv(dev);
+	lp->type = EL3_ISA;
+	lp->dev = pdev;
+	dev_set_drvdata(pdev, dev);
+	if (el3_common_init(dev)) {
+		free_netdev(dev);
+		return 0;
+	}
+
+	el3_devs[el3_cards] = dev;
+	el3_cards++;
+	return 1;
+}
+
+static int __devexit el3_isa_remove(struct device *pdev,
+				    unsigned int ndev)
+{
+	el3_device_remove(pdev);
+	dev_set_drvdata(pdev, NULL);
+	return 0;
+}
+
+#ifdef CONFIG_PM
+static int el3_isa_suspend(struct device *dev, unsigned int n,
+			   pm_message_t state)
+{
+	return el3_suspend(dev, state);
+}
+
+static int el3_isa_resume(struct device *dev, unsigned int n)
+{
+	return el3_resume(dev);
+}
+#endif
+
+static struct isa_driver el3_isa_driver = {
+	.match		= el3_isa_match,
+	.remove		= __devexit_p(el3_isa_remove),
+#ifdef CONFIG_PM
+	.suspend	= el3_isa_suspend,
+	.resume		= el3_isa_resume,
+#endif
+	.driver		= {
+		.name	= "3c509"
+	},
+};
+#endif
+
+#ifdef CONFIG_PNP
+static struct pnp_device_id el3_pnp_ids[] = {
+	{ .id = "TCM5090" }, /* 3Com Etherlink III (TP) */
+	{ .id = "TCM5091" }, /* 3Com Etherlink III */
+	{ .id = "TCM5094" }, /* 3Com Etherlink III (combo) */
+	{ .id = "TCM5095" }, /* 3Com Etherlink III (TPO) */
+	{ .id = "TCM5098" }, /* 3Com Etherlink III (TPC) */
+	{ .id = "PNP80f7" }, /* 3Com Etherlink III compatible */
+	{ .id = "PNP80f8" }, /* 3Com Etherlink III compatible */
+	{ .id = "" }
+};
+MODULE_DEVICE_TABLE(pnp, el3_pnp_ids);
+
+static int __devinit el3_pnp_probe(struct pnp_dev *pdev,
+				    const struct pnp_device_id *id)
+{
+	struct el3_private *lp;
+	short i;
+	int ioaddr, irq, if_port;
+	u16 phys_addr[3];
+	struct net_device *dev = NULL;
+	int err;
+
+	ioaddr = pnp_port_start(pdev, 0);
+	if (!request_region(ioaddr, EL3_IO_EXTENT, "3c509 PnP"))
+		return -EBUSY;
+	irq = pnp_irq(pdev, 0);
+	EL3WINDOW(0);
+	for (i = 0; i < 3; i++)
+		phys_addr[i] = htons(read_eeprom(ioaddr, i));
+	if_port = read_eeprom(ioaddr, 8) >> 14;
+	dev = alloc_etherdev(sizeof (struct el3_private));
+	if (!dev) {
+		release_region(ioaddr, EL3_IO_EXTENT);
+		return -ENOMEM;
+	}
+	SET_NETDEV_DEV(dev, &pdev->dev);
+	netdev_boot_setup_check(dev);
+
+	memcpy(dev->dev_addr, phys_addr, sizeof(phys_addr));
+	dev->base_addr = ioaddr;
+	dev->irq = irq;
+	dev->if_port = if_port;
+	lp = netdev_priv(dev);
+	lp->dev = &pdev->dev;
+	lp->type = EL3_PNP;
+	pnp_set_drvdata (pdev, dev);
+	err = el3_common_init(dev);
+
+	if (err) {
+		pnp_set_drvdata (pdev, NULL);
+		free_netdev(dev);
+		return err;
+	}
+
+	el3_devs[el3_cards] = dev;
+	el3_cards++;
+	return 0;
+}
+
+static void __devexit el3_pnp_remove(struct pnp_dev *pdev)
+{
+	el3_common_remove(pnp_get_drvdata(pdev));
+	pnp_set_drvdata(pdev, NULL);
+}
+
+#ifdef CONFIG_PM
+static int el3_pnp_suspend(struct pnp_dev *pdev, pm_message_t state)
+{
+	return el3_suspend(&pdev->dev, state);
+}
+
+static int el3_pnp_resume(struct pnp_dev *pdev)
+{
+	return el3_resume(&pdev->dev);
+}
+#endif
+
+static struct pnp_driver el3_pnp_driver = {
+	.name		= "3c509",
+	.id_table	= el3_pnp_ids,
+	.probe		= el3_pnp_probe,
+	.remove		= __devexit_p(el3_pnp_remove),
+#ifdef CONFIG_PM
+	.suspend	= el3_pnp_suspend,
+	.resume		= el3_pnp_resume,
+#endif
+};
+#endif /* CONFIG_PNP */
+
 #ifdef CONFIG_EISA
 static struct eisa_device_id el3_eisa_ids[] = {
 		{ "TCM5092" },
@@ -273,43 +550,6 @@
 };
 #endif /* CONFIG_MCA */
 
-#if defined(__ISAPNP__)
-static struct isapnp_device_id el3_isapnp_adapters[] __initdata = {
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('T', 'C', 'M'), ISAPNP_FUNCTION(0x5090),
-		(long) "3Com Etherlink III (TP)" },
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('T', 'C', 'M'), ISAPNP_FUNCTION(0x5091),
-		(long) "3Com Etherlink III" },
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('T', 'C', 'M'), ISAPNP_FUNCTION(0x5094),
-		(long) "3Com Etherlink III (combo)" },
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('T', 'C', 'M'), ISAPNP_FUNCTION(0x5095),
-		(long) "3Com Etherlink III (TPO)" },
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('T', 'C', 'M'), ISAPNP_FUNCTION(0x5098),
-		(long) "3Com Etherlink III (TPC)" },
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('P', 'N', 'P'), ISAPNP_FUNCTION(0x80f7),
-		(long) "3Com Etherlink III compatible" },
-	{	ISAPNP_ANY_ID, ISAPNP_ANY_ID,
-		ISAPNP_VENDOR('P', 'N', 'P'), ISAPNP_FUNCTION(0x80f8),
-		(long) "3Com Etherlink III compatible" },
-	{ }	/* terminate list */
-};
-
-static __be16 el3_isapnp_phys_addr[8][3];
-static int nopnp;
-#endif /* __ISAPNP__ */
-
-/* With the driver model introduction for EISA devices, both init
- * and cleanup have been split :
- * - EISA devices probe/remove starts in el3_eisa_probe/el3_device_remove
- * - MCA/ISA still use el3_probe
- *
- * Both call el3_common_init/el3_common_remove. */
-
 static int __init el3_common_init(struct net_device *dev)
 {
 	struct el3_private *lp = netdev_priv(dev);
@@ -373,218 +613,6 @@
 	free_netdev (dev);
 }
 
-static int __init el3_probe(int card_idx)
-{
-	struct net_device *dev;
-	struct el3_private *lp;
-	short lrs_state = 0xff, i;
-	int ioaddr, irq, if_port;
-	__be16 phys_addr[3];
-	static int current_tag;
-	int err = -ENODEV;
-#if defined(__ISAPNP__)
-	static int pnp_cards;
-	struct pnp_dev *idev = NULL;
-	int pnp_found = 0;
-
-	if (nopnp == 1)
-		goto no_pnp;
-
-	for (i=0; el3_isapnp_adapters[i].vendor != 0; i++) {
-		int j;
-		while ((idev = pnp_find_dev(NULL,
-					    el3_isapnp_adapters[i].vendor,
-					    el3_isapnp_adapters[i].function,
-					    idev))) {
-			if (pnp_device_attach(idev) < 0)
-				continue;
-			if (pnp_activate_dev(idev) < 0) {
-__again:
-				pnp_device_detach(idev);
-				continue;
-			}
-			if (!pnp_port_valid(idev, 0) || !pnp_irq_valid(idev, 0))
-				goto __again;
-			ioaddr = pnp_port_start(idev, 0);
-			if (!request_region(ioaddr, EL3_IO_EXTENT, "3c509 PnP")) {
-				pnp_device_detach(idev);
-				return -EBUSY;
-			}
-			irq = pnp_irq(idev, 0);
-			if (el3_debug > 3)
-				printk ("ISAPnP reports %s at i/o 0x%x, irq %d\n",
-					(char*) el3_isapnp_adapters[i].driver_data, ioaddr, irq);
-			EL3WINDOW(0);
-			for (j = 0; j < 3; j++)
-				el3_isapnp_phys_addr[pnp_cards][j] =
-					phys_addr[j] =
-						htons(read_eeprom(ioaddr, j));
-			if_port = read_eeprom(ioaddr, 8) >> 14;
-			dev = alloc_etherdev(sizeof (struct el3_private));
-			if (!dev) {
-					release_region(ioaddr, EL3_IO_EXTENT);
-					pnp_device_detach(idev);
-					return -ENOMEM;
-			}
-
-			SET_NETDEV_DEV(dev, &idev->dev);
-			pnp_cards++;
-
-			netdev_boot_setup_check(dev);
-			pnp_found = 1;
-			goto found;
-		}
-	}
-no_pnp:
-#endif /* __ISAPNP__ */
-
-	/* Select an open I/O location at 0x1*0 to do contention select. */
-	for ( ; id_port < 0x200; id_port += 0x10) {
-		if (!request_region(id_port, 1, "3c509"))
-			continue;
-		outb(0x00, id_port);
-		outb(0xff, id_port);
-		if (inb(id_port) & 0x01){
-			release_region(id_port, 1);
-			break;
-		} else
-			release_region(id_port, 1);
-	}
-	if (id_port >= 0x200) {
-		/* Rare -- do we really need a warning? */
-		printk(" WARNING: No I/O port available for 3c509 activation.\n");
-		return -ENODEV;
-	}
-
-	/* Next check for all ISA bus boards by sending the ID sequence to the
-	   ID_PORT.  We find cards past the first by setting the 'current_tag'
-	   on cards as they are found.  Cards with their tag set will not
-	   respond to subsequent ID sequences. */
-
-	outb(0x00, id_port);
-	outb(0x00, id_port);
-	for(i = 0; i < 255; i++) {
-		outb(lrs_state, id_port);
-		lrs_state <<= 1;
-		lrs_state = lrs_state & 0x100 ? lrs_state ^ 0xcf : lrs_state;
-	}
-
-	/* For the first probe, clear all board's tag registers. */
-	if (current_tag == 0)
-		outb(0xd0, id_port);
-	else				/* Otherwise kill off already-found boards. */
-		outb(0xd8, id_port);
-
-	if (id_read_eeprom(7) != 0x6d50) {
-		return -ENODEV;
-	}
-
-	/* Read in EEPROM data, which does contention-select.
-	   Only the lowest address board will stay "on-line".
-	   3Com got the byte order backwards. */
-	for (i = 0; i < 3; i++) {
-		phys_addr[i] = htons(id_read_eeprom(i));
-	}
-
-#if defined(__ISAPNP__)
-	if (nopnp == 0) {
-		/* The ISA PnP 3c509 cards respond to the ID sequence.
-		   This check is needed in order not to register them twice. */
-		for (i = 0; i < pnp_cards; i++) {
-			if (phys_addr[0] == el3_isapnp_phys_addr[i][0] &&
-			    phys_addr[1] == el3_isapnp_phys_addr[i][1] &&
-			    phys_addr[2] == el3_isapnp_phys_addr[i][2])
-			{
-				if (el3_debug > 3)
-					printk("3c509 with address %02x %02x %02x %02x %02x %02x was found by ISAPnP\n",
-						phys_addr[0] & 0xff, phys_addr[0] >> 8,
-						phys_addr[1] & 0xff, phys_addr[1] >> 8,
-						phys_addr[2] & 0xff, phys_addr[2] >> 8);
-				/* Set the adaptor tag so that the next card can be found. */
-				outb(0xd0 + ++current_tag, id_port);
-				goto no_pnp;
-			}
-		}
-	}
-#endif /* __ISAPNP__ */
-
-	{
-		unsigned int iobase = id_read_eeprom(8);
-		if_port = iobase >> 14;
-		ioaddr = 0x200 + ((iobase & 0x1f) << 4);
-	}
-	irq = id_read_eeprom(9) >> 12;
-
-	dev = alloc_etherdev(sizeof (struct el3_private));
-	if (!dev)
-		return -ENOMEM;
-
-	netdev_boot_setup_check(dev);
-
-	/* Set passed-in IRQ or I/O Addr. */
-	if (dev->irq > 1  &&  dev->irq < 16)
-			irq = dev->irq;
-
-	if (dev->base_addr) {
-		if (dev->mem_end == 0x3c509 	/* Magic key */
-		    && dev->base_addr >= 0x200  &&  dev->base_addr <= 0x3e0)
-			ioaddr = dev->base_addr & 0x3f0;
-		else if (dev->base_addr != ioaddr)
-			goto out;
-	}
-
-	if (!request_region(ioaddr, EL3_IO_EXTENT, "3c509")) {
-		err = -EBUSY;
-		goto out;
-	}
-
-	/* Set the adaptor tag so that the next card can be found. */
-	outb(0xd0 + ++current_tag, id_port);
-
-	/* Activate the adaptor at the EEPROM location. */
-	outb((ioaddr >> 4) | 0xe0, id_port);
-
-	EL3WINDOW(0);
-	if (inw(ioaddr) != 0x6d50)
-		goto out1;
-
-	/* Free the interrupt so that some other card can use it. */
-	outw(0x0f00, ioaddr + WN0_IRQ);
-
-#if defined(__ISAPNP__)
- found:							/* PNP jumps here... */
-#endif /* __ISAPNP__ */
-
-	memcpy(dev->dev_addr, phys_addr, sizeof(phys_addr));
-	dev->base_addr = ioaddr;
-	dev->irq = irq;
-	dev->if_port = if_port;
-	lp = netdev_priv(dev);
-#if defined(__ISAPNP__)
-	lp->dev = &idev->dev;
-	if (pnp_found)
-		lp->type = EL3_PNP;
-#endif
-	err = el3_common_init(dev);
-
-	if (err)
-		goto out1;
-
-	el3_cards++;
-	lp->next_dev = el3_root_dev;
-	el3_root_dev = dev;
-	return 0;
-
-out1:
-#if defined(__ISAPNP__)
-	if (idev)
-		pnp_device_detach(idev);
-#endif
-out:
-	free_netdev(dev);
-	return err;
-}
-
 #ifdef CONFIG_MCA
 static int __init el3_mca_probe(struct device *device)
 {
@@ -721,10 +749,10 @@
 }
 #endif
 
-#if defined(CONFIG_EISA) || defined(CONFIG_MCA)
 /* This remove works for all device types.
  *
  * The net dev must be stored in the driver_data field */
+//#if defined(CONFIG_MCA) || defined(CONFIG_EISA)
 static int __devexit el3_device_remove (struct device *device)
 {
 	struct net_device *dev;
@@ -734,7 +762,7 @@
 	el3_common_remove (dev);
 	return 0;
 }
-#endif
+//#endif
 
 /* Read a word from the EEPROM using the regular EEPROM access register.
    Assume that we are in register window zero.
@@ -1450,7 +1478,7 @@
 }
 
 /* Power Management support functions */
-#ifdef EL3_SUSPEND
+#ifdef CONFIG_PM
 
 static int
 el3_suspend(struct device *pdev, pm_message_t state)
@@ -1500,12 +1528,12 @@
 	return 0;
 }
 
-#endif /* EL3_SUSPEND */
+#endif /* CONFIG_PM */
 
 /* Parameters that may be passed into the module. */
 static int debug = -1;
 static int irq[] = {-1, -1, -1, -1, -1, -1, -1, -1};
-static int xcvr[] = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1};
+static int xcvr[] = {-1, -1, -1, -1, -1, -1, -1, -1};
 
 module_param(debug,int, 0);
 module_param_array(irq, int, NULL, 0);
@@ -1515,61 +1543,87 @@
 MODULE_PARM_DESC(irq, "IRQ number(s) (assigned)");
 MODULE_PARM_DESC(xcvr,"transceiver(s) (0=internal, 1=external)");
 MODULE_PARM_DESC(max_interrupt_work, "maximum events handled per interrupt");
-#if defined(__ISAPNP__)
+#ifdef CONFIG_PNP
 module_param(nopnp, int, 0);
 MODULE_PARM_DESC(nopnp, "disable ISA PnP support (0-1)");
-MODULE_DEVICE_TABLE(isapnp, el3_isapnp_adapters);
-#endif	/* __ISAPNP__ */
-MODULE_DESCRIPTION("3Com Etherlink III (3c509, 3c509B) ISA/PnP ethernet driver");
+#endif	/* CONFIG_PNP */
+MODULE_DESCRIPTION("3Com Etherlink III (3c509, 3c509B) ethernet driver");
 MODULE_LICENSE("GPL");
 
 static int __init el3_init_module(void)
 {
 	int ret = 0;
-	el3_cards = 0;
 
 	if (debug >= 0)
 		el3_debug = debug;
 
-	el3_root_dev = NULL;
-	while (el3_probe(el3_cards) == 0) {
-		if (irq[el3_cards] > 1)
-			el3_root_dev->irq = irq[el3_cards];
-		if (xcvr[el3_cards] >= 0)
-			el3_root_dev->if_port = xcvr[el3_cards];
-		el3_cards++;
+//	while (el3_probe(el3_cards) == 0) {
+//		if (irq[el3_cards] > 1)
+//			el3_root_dev->irq = irq[el3_cards];
+//		if (xcvr[el3_cards] >= 0)
+//			el3_root_dev->if_port = xcvr[el3_cards];
+//		el3_cards++;
+//	}
+
+#ifdef CONFIG_PNP
+	if (!nopnp) {
+		ret = pnp_register_driver(&el3_pnp_driver);
+		if (!ret)
+			pnp_registered = 1;
 	}
-
+#endif
+	ret = isa_register_driver(&el3_isa_driver, EL3_MAX_CARDS);
+	if (!ret)
+		isa_registered = 1;
 #ifdef CONFIG_EISA
 	ret = eisa_driver_register(&el3_eisa_driver);
+	if (!ret)
+		eisa_registered = 1;
 #endif
 #ifdef CONFIG_MCA
-	{
-		int err = mca_register_driver(&el3_mca_driver);
-		if (ret == 0)
-			ret = err;
-	}
+	ret = mca_register_driver(&el3_mca_driver);
+	if (!ret)
+		mca_registered = 1;
+#endif
+
+#ifdef CONFIG_PNP
+	if (pnp_registered)
+		ret = 0;
+#endif
+	if (isa_registered)
+		ret = 0;
+#ifdef CONFIG_EISA
+	if (eisa_registered)
+		ret = 0;
 #endif
+#ifdef CONFIG_MCA
+	if (mca_registered)
+		ret = 0;
+#endif
+	printk("el3_cards=%d\n", el3_cards);
 	return ret;
 }
 
 static void __exit el3_cleanup_module(void)
 {
-	struct net_device *next_dev;
-
-	while (el3_root_dev) {
-		struct el3_private *lp = netdev_priv(el3_root_dev);
-
-		next_dev = lp->next_dev;
-		el3_common_remove (el3_root_dev);
-		el3_root_dev = next_dev;
-	}
-
+	int i;
+	
+//	for (i = 0; i < el3_cards; i++)
+//		el3_common_remove(el3_devs[i]);
+
+#ifdef CONFIG_PNP
+	if (pnp_registered)
+		pnp_unregister_driver(&el3_pnp_driver);
+#endif
+	if (isa_registered)
+		isa_unregister_driver(&el3_isa_driver);
 #ifdef CONFIG_EISA
-	eisa_driver_unregister (&el3_eisa_driver);
+	if (eisa_registered)
+		eisa_driver_unregister (&el3_eisa_driver);
 #endif
 #ifdef CONFIG_MCA
-	mca_unregister_driver(&el3_mca_driver);
+	if (mca_registered)
+		mca_unregister_driver(&el3_mca_driver);
 #endif
 }
 



-- 
Ondrej Zary
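
For readers following the ISA probe in the patch above: the loop in
el3_isa_match() that writes lrs_state to id_port 255 times is a small
shift-register sequence. The sketch below reproduces the bytes written in
user space so the sequence can be inspected without hardware. The helper
names el3_id_next and el3_id_sequence are hypothetical and are not part of
the driver; only the low eight bits of lrs_state reach the port via outb(),
so the sketch tracks just those.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One step of the ID sequence: shift left by one and, if the bit shifted
 * out of the low byte was set, XOR with 0xcf. This mirrors the
 * "lrs_state & 0x100 ? lrs_state ^ 0xcf : lrs_state" test in the driver.
 */
static uint8_t el3_id_next(uint8_t state)
{
	uint8_t next = (uint8_t)(state << 1);

	if (state & 0x80)
		next ^= 0xcf;
	return next;
}

/* Fill out[0..n-1] with the first n bytes the probe writes to id_port. */
static void el3_id_sequence(uint8_t *out, size_t n)
{
	uint8_t state = 0xff;	/* same seed as lrs_state in the driver */
	size_t i;

	assert(out != NULL);
	for (i = 0; i < n; i++) {
		out[i] = state;
		state = el3_id_next(state);
	}
}
```

The first five bytes produced this way are 0xff, 0x31, 0x62, 0xc4 and 0x47.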


* Re: e1000 full-duplex TCP performance well below wire speed
From: Bruce Allen @ 2008-01-31 19:13 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
	Bruce Allen
In-Reply-To: <47A20E9E.7070503@intel.com>

Hi Auke,

>>>> Important note: we ARE able to get full duplex wire speed (over 900
>>>> Mb/s simultaneously in both directions) using UDP.  The problems occur
>>>> only with TCP connections.
>>>
>>> That eliminates bus bandwidth issues, probably, but small packets take
>>> up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
>>
>> I see.  Your concern is the extra ACK packets associated with TCP.  Even
>> though these represent a small volume of data (around 5% with MTU=1500,
>> and less at larger MTU) they double the number of packets that must be
>> handled by the system compared to UDP transmission at the same data
>> rate. Is that correct?
>
> A lot of people tend to forget that the pci-express bus has enough 
> bandwidth on first glance - 2.5gbit/sec for 1gbit of traffic, but apart 
> from data going over it there is significant overhead going on: each 
> packet requires transmit, cleanup and buffer transactions, and there are 
> many irq register clears per second (slow ioread/writes). The 
> transactions double for TCP ack processing, and this all accumulates and 
> starts to introduce latency, higher cpu utilization etc...

Based on the discussion in this thread, I am inclined to believe that lack 
of PCI-e bus bandwidth is NOT the issue.  The theory is that the extra 
packet handling associated with TCP acknowledgements is pushing the PCI-e 
x1 bus past its limits.  However, the evidence seems to show otherwise:

(1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit 
PCI connection.  That connection can transfer data at 8Gb/s.

(2) If the theory is right, then doubling the MTU from 1500 to 3000 should 
have significantly reduced the problem, since it halves the number of 
ACKs.  Similarly, going from MTU 1500 to MTU 9000 should reduce the 
number of ACKs by a factor of six, practically eliminating the problem. 
But changing the MTU size does not help.

(3) The interrupt counts are quite reasonable.  Broadcom NICs without 
interrupt aggregation generate an order of magnitude more irq/s and this 
doesn't prevent wire speed performance there.

(4) The CPUs on the system are largely idle.  There are plenty of 
computing resources available.

(5) I don't think that the overhead will increase the bandwidth needed by 
more than a factor of two.  Of course you and the other e1000 developers 
are the experts, but the dominant bus cost should be copying data buffers 
across the bus. Everything else is minimal in comparison.
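
The arithmetic behind points (1) and (2) can be sketched in a few lines of 
userspace C.  This is only a rough illustration, not part of the original 
message: it assumes a one-way stream of full-size frames saturating the 
link, a fixed 38 bytes of Ethernet framing overhead per frame, and one ACK 
per segs_per_ack data segments (2 models ordinary delayed ACKs).  All 
function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Per-frame Ethernet overhead on the wire, in bytes:
 * 14 header + 4 FCS + 8 preamble/SFD + 12 inter-frame gap. */
#define ETH_WIRE_OVERHEAD 38

/* Full-size frames per second that fit on a link of link_bps at this MTU. */
double frames_per_sec(double link_bps, unsigned int mtu)
{
	assert(mtu > 0);
	return link_bps / (8.0 * (mtu + ETH_WIRE_OVERHEAD));
}

/* ACKs per second, assuming one ACK per segs_per_ack data segments.
 * The ACK ratio is an assumption; real stacks vary. */
double acks_per_sec(double link_bps, unsigned int mtu,
		    unsigned int segs_per_ack)
{
	assert(segs_per_ack > 0);
	return frames_per_sec(link_bps, mtu) / segs_per_ack;
}

/* Raw (theoretical) bandwidth of a parallel PCI bus: clock times width. */
double pci_raw_bps(double clock_hz, unsigned int width_bits)
{
	return clock_hz * width_bits;
}

/* Print frame and ACK rates for the three MTUs discussed above. */
void print_ack_rates(double link_bps, unsigned int segs_per_ack)
{
	static const unsigned int mtus[] = { 1500, 3000, 9000 };
	unsigned int i;

	for (i = 0; i < sizeof(mtus) / sizeof(mtus[0]); i++)
		printf("MTU %4u: %8.0f frames/s, %8.0f ACKs/s\n", mtus[i],
		       frames_per_sec(link_bps, mtus[i]),
		       acks_per_sec(link_bps, mtus[i], segs_per_ack));
}
```

Calling print_ack_rates(1e9, 2) from a main() shows the ACK rate dropping 
by roughly two and roughly six for MTU 3000 and 9000 relative to MTU 1500, 
and pci_raw_bps(133e6, 64) gives the raw figure for the 133 MHz 64-bit bus, 
which is a little over 8 Gb/s.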

Intel insiders: isn't there some simple instrumentation available (which 
reads registers or statistics counters on the PCI-e interface chip) to tell 
us statistics such as how many bits have moved over the link in each 
direction? This plus some accurate timing would make it easy to see if the 
TCP case is saturating the PCI-e bus.  Then the theory could be addressed 
with data rather than with opinions.

Cheers,
 	Bruce


* Re: [PATCH] Disable TSO for non standard qdiscs
From: Rick Jones @ 2008-01-31 19:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: netdev, davem
In-Reply-To: <20080131192521.GG4671@one.firstfloor.org>

Andi Kleen wrote:
>>So, at what timescale do people using these qdiscs expect things to 
>>appear "smooth?"  64KB of data at GbE speeds is something just north of 
>>half a millisecond unless I've botched my units somewhere.
> 
> 
> One typical use case for TBF is you talking to a DSL bridge that 
> is connected using a GBit Ethernet switch. For these DSL connections it gives
> much better behaviour to shape the traffic to slightly below
> your external link speed so that you can e.g. prioritize packets properly.

Sounds like the functionality needs to be in the DSL bridge :) (or the 
"router" in the same case) Particularly since it might be getting used 
by more than one host on the GbE switch.

> But the actual external link speed is much lower than GbE.
> A lot of GbE NICs enable TSO by default.

bluesky typing...

then the qdisc could/should place a cap on the size of a 'TSO' based on 
the bitrate (and perhaps input as to how much time any one "burst" of 
data should be allowed to consume on the network) and pass that up the 
stack?  right now you seem to be proposing what is effectively a cap of 
1 MSS.
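
A minimal sketch of what such a cap could look like, in plain userspace C.  
Nothing here is an existing kernel interface: the function names, the 64KB 
upper bound and the rounding to whole segments are all assumptions made for 
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on a single TSO burst in bytes; 64KB is the figure used
 * in this thread, taken here as an assumption. */
#define TOY_TSO_MAX_BYTES 65536u

/* Time in microseconds needed to serialize `bytes` at rate_bps. */
double serialize_usec(uint64_t bytes, uint64_t rate_bps)
{
	assert(rate_bps > 0);
	return (double)bytes * 8.0 * 1e6 / (double)rate_bps;
}

/* Largest burst, in bytes, that occupies a link shaped to rate_bps for
 * at most max_burst_usec.  The result is a whole number of MSS-sized
 * segments, never less than one MSS and never more than the TSO bound. */
uint32_t tso_size_cap(uint64_t rate_bps, uint32_t max_burst_usec,
		      uint32_t mss)
{
	uint64_t bytes;

	assert(mss > 0);

	/* bits/s * us / (8 bits per byte * 1e6 us per s) = bytes */
	bytes = rate_bps * max_burst_usec / 8000000ULL;

	if (bytes > TOY_TSO_MAX_BYTES)
		bytes = TOY_TSO_MAX_BYTES;

	/* Round down to whole segments, but always allow one. */
	bytes = (bytes / mss) * mss;
	if (bytes < mss)
		bytes = mss;

	return (uint32_t)bytes;
}
```

With these definitions, serialize_usec(65536, 1000000000) comes out just 
over 524 microseconds, matching the "north of half a millisecond" estimate 
quoted above, while a 2 Mbit/s shaped rate with a 10 ms burst budget 
degenerates to a single 1460-byte segment.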

rick jones


* [1/2] POHMELFS - network filesystem with local coherent cache.
From: Evgeniy Polyakov @ 2008-01-31 19:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, linux-fsdevel

Hi.

POHMELFS stands for Parallel Optimized Host Message Exchange
Layered File System. It allows mounting a remote server on a local
directory over the network. The filesystem supports local caching
and writeback flushing.
POHMELFS is a brick in a future distributed filesystem.

This set includes two patches:
 * a network filesystem with a write-through cache (slow, but works
 	with a remote userspace server)
 * a hack to show how the local cache works and how much faster it is
 	compared to async NFS (see below). The hack disables writeback
	flushing and only allocates objects locally.

Now, some vaporware aka food for thought.

A small benchmark of the locally cached mode (the above hack):

$ time tar -xf /home/zbr/threading.tar

	POHMELFS	NFS v3 (async)
real    0m0.043s	0m1.679s

Which is damn near 40 times faster!

Excited? Now get a huge bucket of ice.

The generic problem with a writeback cache is that all local objects
have to have IDs in sync with the remote side. For example, if the remote
side is ext3, the local one should not overwrite the inode with number 0.
In contrast, a write-through cache can ask the remote side what ID the
given data should have and thus stay in sync. This one is slow.

Of course the difference will not be _that_ huge in the real world, where
tested archives are larger (this one is a git archive of my userspace
threading library, which is very small). Since it is so small, writeback
cache flushing never happens, and thus the remote side never receives data.

Actually one can consider this as tmpfs or something like that. The code
supports sync, but since the inode generation process is very different, files
and dirs cannot be blindly synced to ext3. So, this release of POHMELFS
consists of two patches: the first one is a network filesystem implementation
with a write-through cache, where an object is first created on the remote side
and then populated into the local cache. This one is slow.

The second patch is a hack to disable writeback caching and implement local
caching only, which is very fast.

The next task is to think about how to generically solve the problem of
syncing local changes with the remote server when the remote server maintains
inodes with completely different numbers.
This, among other things, will allow offline work with automatic syncing after
reconnect.

This is not intended for inclusion. CRFS by Zach Brown is a bit ahead of POHMELFS,
but it is not generic enough (because of the above problem), works only with BTRFS,
and has been kept closed by Oracle so far :)
So, anyone who managed to read this far and happens to be at LCA 08 just has to
go to his presentation this Friday.

POHMELFS TODO list includes:

 * a mechanism for keeping it coherent with other users
 * a unified method of syncing with various remote filesystems

Thank you.

P.S. POHMELFS is about one month old, so do not be too hard on it :)

Crappy-stuff-created-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/Kconfig b/fs/Kconfig
index f9eed6d..c40f2c5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -1519,6 +1519,8 @@ endmenu
 menu "Network File Systems"
 	depends on NET
 
+source "fs/pohmelfs/Kconfig"
+
 config NFS_FS
 	tristate "NFS file system support"
 	depends on INET
diff --git a/fs/Makefile b/fs/Makefile
index 720c29d..8fff82a 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HPPFS)		+= hppfs/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
+obj-$(CONFIG_POHMELFS)          += pohmelfs/
diff --git a/fs/pohmelfs/Kconfig b/fs/pohmelfs/Kconfig
new file mode 100644
index 0000000..ac19aac
--- /dev/null
+++ b/fs/pohmelfs/Kconfig
@@ -0,0 +1,6 @@
+config POHMELFS
+	tristate "POHMELFS filesystem support"
+	help
+	  POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System.
+	  This is a network filesystem which supports coherent caching of data and metadata
+	  on clients.
diff --git a/fs/pohmelfs/Makefile b/fs/pohmelfs/Makefile
new file mode 100644
index 0000000..8a87f46
--- /dev/null
+++ b/fs/pohmelfs/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_POHMELFS)	+= pohmelfs.o
+
+pohmelfs-y := inode.o config.o dir.o net.o
diff --git a/fs/pohmelfs/config.c b/fs/pohmelfs/config.c
new file mode 100644
index 0000000..10eabe1
--- /dev/null
+++ b/fs/pohmelfs/config.c
@@ -0,0 +1,120 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/connector.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+#include "netfs.h"
+
+struct pohmelfs_config
+{
+	struct list_head	config_entry;
+	struct pohmelfs_ctl	cmd;
+};
+
+static struct cb_id pohmelfs_cn_id = {.idx = POHMELFS_CN_IDX, .val = POHMELFS_CN_VAL};
+static LIST_HEAD(pohmelfs_config_list);
+static DEFINE_MUTEX(pohmelfs_config_lock);
+
+int pohmelfs_copy_config(struct pohmelfs_ctl *dst, unsigned int idx)
+{
+	struct pohmelfs_config *c;
+	int err = -ENODEV;
+
+	mutex_lock(&pohmelfs_config_lock);
+	list_for_each_entry(c, &pohmelfs_config_list, config_entry) {
+		if (c->cmd.idx != idx)
+			continue;
+
+		memcpy(dst, &c->cmd, sizeof(struct pohmelfs_ctl));
+		err = 0;
+		break;
+	}
+	mutex_unlock(&pohmelfs_config_lock);
+
+	return err;
+}
+
+static void pohmelfs_cn_callback(void *data)
+{
+	struct cn_msg *msg = data;
+	struct pohmelfs_ctl *cmd;
+	struct pohmelfs_cn_ack *ack;
+	struct pohmelfs_config *cfg, *c;
+	int err;
+
+	if (msg->len < sizeof(struct pohmelfs_ctl)) {
+		err = -EBADMSG;
+		goto out;
+	}
+
+	cfg = kmalloc(sizeof(struct pohmelfs_config), GFP_KERNEL);
+	if (!cfg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	cmd = (struct pohmelfs_ctl *)msg->data;
+
+	memcpy(&cfg->cmd, cmd, sizeof(struct pohmelfs_ctl));
+
+	err = 0;
+	mutex_lock(&pohmelfs_config_lock);
+	list_for_each_entry(c, &pohmelfs_config_list, config_entry) {
+		if (c->cmd.idx == cmd->idx) {
+			err = -EEXIST;
+			break;
+		}
+	}
+	if (!err)
+		list_add_tail(&cfg->config_entry, &pohmelfs_config_list);
+	mutex_unlock(&pohmelfs_config_lock);
+
+out:
+	ack = kmalloc(sizeof(struct pohmelfs_cn_ack), GFP_KERNEL);
+	if (!ack)
+		return;
+
+	memcpy(&ack->msg, msg, sizeof(struct cn_msg));
+
+	ack->msg.ack = msg->ack + 1;
+	ack->msg.len = sizeof(struct pohmelfs_cn_ack) - sizeof(struct cn_msg);
+
+	ack->error = err;
+
+	cn_netlink_send(&ack->msg, 0, GFP_KERNEL);
+	kfree(ack);
+}
+
+int __init pohmelfs_config_init(void)
+{
+	return cn_add_callback(&pohmelfs_cn_id, "pohmelfs", pohmelfs_cn_callback);
+}
+
+void __exit pohmelfs_config_exit(void)
+{
+	struct pohmelfs_config *c, *tmp;
+
+	cn_del_callback(&pohmelfs_cn_id);
+	
+	mutex_lock(&pohmelfs_config_lock);
+	list_for_each_entry_safe(c, tmp, &pohmelfs_config_list, config_entry) {
+		list_del(&c->config_entry);
+		kfree(c);
+	}
+	mutex_unlock(&pohmelfs_config_lock);
+}
diff --git a/fs/pohmelfs/dir.c b/fs/pohmelfs/dir.c
new file mode 100644
index 0000000..23f9ecd
--- /dev/null
+++ b/fs/pohmelfs/dir.c
@@ -0,0 +1,892 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/fs.h>
+
+#include "netfs.h"
+
+static int pohmelfs_cmp_offset(struct pohmelfs_name *inode, u64 offset)
+{
+	if (inode->offset > offset)
+		return -1;
+	if (inode->offset < offset)
+		return 1;
+	return 0;
+}
+
+static struct pohmelfs_name *pohmelfs_search_offset(struct pohmelfs_inode *pi, u64 offset)
+{
+	struct rb_node *n = pi->offset_root.rb_node;
+	struct pohmelfs_name *tmp;
+	int cmp;
+
+	while (n) {
+		tmp = rb_entry(n, struct pohmelfs_name, offset_node);
+
+		cmp = pohmelfs_cmp_offset(tmp, offset);
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else
+			return tmp;
+	}
+
+	return NULL;
+}
+
+static struct pohmelfs_name *pohmelfs_insert_offset(struct pohmelfs_inode *pi,
+		struct pohmelfs_name *new)
+{
+	struct rb_node **n = &pi->offset_root.rb_node, *parent = NULL;
+	struct pohmelfs_name *ret = NULL, *tmp;
+	int cmp;
+
+	while (*n) {
+		parent = *n;
+
+		tmp = rb_entry(parent, struct pohmelfs_name, offset_node);
+
+		cmp = pohmelfs_cmp_offset(tmp, new->offset);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			ret = tmp;
+			break;
+		}
+	}
+
+	if (ret) {
+		dprintk("%s: exist: offset: %llu, ino: %llu, hash: %x, data: '%s', new: ino: %llu, hash: %x, data: '%s'.\n",
+			__func__, ret->offset, ret->ino, ret->hash, ret->data, new->ino, new->hash, new->data);
+		return ret;
+	}
+
+	rb_link_node(&new->offset_node, parent, n);
+	rb_insert_color(&new->offset_node, &pi->offset_root);
+
+	return NULL;
+}
+
+static struct pohmelfs_name *pohmelfs_insert_name_hash(struct rb_root *root,
+		struct pohmelfs_name *new)
+{
+	struct rb_node **n = &root->rb_node, *parent = NULL;
+	struct pohmelfs_name *ret = NULL, *tmp;
+	int cmp;
+
+	while (*n) {
+		parent = *n;
+
+		tmp = rb_entry(parent, struct pohmelfs_name, hash_node);
+
+		cmp = pohmelfs_cmp_hash(tmp, new->parent, new->hash, new->len);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			ret = tmp;
+			break;
+		}
+	}
+
+	if (ret) {
+		dprintk("%s: exist: ino: %llu, hash: %x, data: '%s', new: ino: %llu, hash: %x, data: '%s'.\n",
+				__func__, ret->ino, ret->hash, ret->data, new->ino, new->hash, new->data);
+		return ret;
+	}
+
+	rb_link_node(&new->hash_node, parent, n);
+	rb_insert_color(&new->hash_node, root);
+	
+	dprintk("%s: inserted: ino: %llu, hash: %x, data: '%s'.\n",
+			__func__, new->ino, new->hash, new->data);
+
+	return NULL;
+}
+
+static struct pohmelfs_name *pohmelfs_search_name_hash(struct rb_root *root,
+		u64 parent, u32 hash, u32 len)
+{
+	struct rb_node *n = root->rb_node;
+	struct pohmelfs_name *tmp;
+	int cmp;
+
+	while (n) {
+		tmp = rb_entry(n, struct pohmelfs_name, hash_node);
+
+		cmp = pohmelfs_cmp_hash(tmp, parent, hash, len);
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else
+			return tmp;
+	}
+
+	dprintk("%s: Failed to find a name for parent %llu, hash: %x, len: %u.\n",
+			__func__, parent, hash, len);
+	return NULL;
+}
+
+
+void pohmelfs_name_del(struct pohmelfs_inode *parent, struct pohmelfs_name *node)
+{
+	struct rb_node *rb_node;
+	int decr = 0;
+
+	for (rb_node = rb_next(&node->offset_node); rb_node; rb_node = rb_next(rb_node)) {
+		struct pohmelfs_name *n = container_of(rb_node, struct pohmelfs_name, offset_node);
+
+		n->offset -= node->len;
+		decr++;
+	}
+
+	dprintk("%s: parent: '%s', name: %p/'%s', decr: %d.\n",
+			__func__, parent->name.data, node, node->data, decr);
+
+	rb_erase(&node->offset_node, &parent->offset_root);
+	rb_erase(&node->hash_node, &parent->hash_root);
+
+	kfree(node);
+}
+
+static struct pohmelfs_name *pohmelfs_name_clone(unsigned int len)
+{
+	struct pohmelfs_name *n;
+
+	n = kzalloc(sizeof(struct pohmelfs_name) + len, GFP_KERNEL);
+	if (!n)
+		return NULL;
+
+	n->data = (char *)(n+1);
+
+	return n;
+}
+
+//#define POHMELFS_NEW_INODES	1
+
+static struct pohmelfs_inode *pohmelfs_new_inode(struct pohmelfs_sb *psb, struct pohmelfs_inode *parent,
+		char *data, struct netfs_cmd *cmd, struct netfs_inode_info *info)
+{
+	struct inode *new;
+	struct pohmelfs_inode *npi, *ret;
+	int err = -ENOMEM;
+
+	dprintk("%s: creating inode for data: '%s', info_ino: %llu, cmd_ino: %llu, parent_ino: %llu.\n",
+			__func__, data, info->ino, cmd->ino, (parent)?parent->ino:0);
+#ifdef POHMELFS_NEW_INODES
+	new = new_inode(psb->sb);
+#else
+	new = iget_locked(psb->sb, cmd->ino);
+#endif
+	if (!new) {
+		kfree(data);
+		goto err_out_exit;
+	}
+
+	npi = POHMELFS_I(new);
+
+	new->i_ino = cmd->ino;
+
+#ifdef POHMELFS_NEW_INODES
+	if (1) {
+#else
+	if (new->i_state & I_NEW) {
+#endif
+		npi->name.ino = npi->ino = info->ino;
+		npi->name.parent = npi->parent = (parent)?parent->ino:0;
+		npi->name.hash = netfs_get_inode_hash(cmd);
+		npi->name.len = cmd->size;
+		npi->name.offset = cmd->start;
+		npi->name.data = data;
+		npi->name.mode = info->mode;
+
+		err = -EEXIST;
+		dprintk("%s: filling VFS inode for data: '%s'.\n", __func__, data);
+		ret = pohmelfs_fill_inode(npi, info);
+		if (ret != npi)
+			goto err_out_put;
+	}
+
+	if (parent) {
+		struct pohmelfs_name *n, *name;
+
+		err = -ENOMEM;
+		n = pohmelfs_name_clone(cmd->size);
+		if (!n)
+			goto err_out_put;
+
+		n->parent = parent->ino;
+		n->ino = npi->ino;
+		n->offset = cmd->start;
+		n->hash = netfs_get_inode_hash(cmd);
+		n->mode = info->mode;
+		n->len = cmd->size;
+		strncpy(n->data, data, n->len);
+
+		mutex_lock(&parent->offset_lock);
+		name = pohmelfs_insert_offset(parent, n);
+
+		if (!name) {
+			name = pohmelfs_insert_name_hash(&parent->hash_root, n);
+			if (name)
+				rb_erase(&n->offset_node, &parent->offset_root);
+		}
+		mutex_unlock(&parent->offset_lock);
+
+		dprintk("%s: %s inserted name: %p/'%s', offset: %llu, ino: %llu, parent: %llu.\n",
+				__func__, (name)?"unsuccessfully":"successfully",
+				n, n->data, n->offset, n->ino, n->parent);
+
+		err = 0;
+		if (name) {
+			err = -EEXIST;
+			kfree(n);
+			goto err_out_put;
+		}
+	}
+
+#ifdef POHMELFS_NEW_INODES
+	insert_inode_hash(new);
+#else
+	if (new->i_state & I_NEW)
+		unlock_new_inode(new);
+	else
+		iput(new);
+#endif
+	mark_inode_dirty(new);
+
+	if (parent)
+		mark_inode_dirty(&parent->vfs_inode);
+
+	return npi;
+
+err_out_put:
+	unlock_new_inode(new);
+	printk("%s: putting inode: %p, npi: %p, error: %d, count: %d, nlink: %u.\n",
+			__func__, new, npi, err, atomic_read(&new->i_count), new->i_nlink);
+	iput(new);
+err_out_exit:
+	return ERR_PTR(err);
+}
+
+static int netfs_recv_inode_info(struct pohmelfs_sb *psb, struct pohmelfs_inode *parent,
+		struct pohmelfs_inode **newp, char *data)
+{
+	struct netfs_state *st = &psb->state;
+	struct netfs_cmd *cmd = &st->cmd;
+	struct pohmelfs_inode *npi;
+	int err, total_size = 0, alloc = 0;
+
+	dprintk("%s: receiving inode info, data: %p, parent: %llu, st: %p.\n",
+			__func__, data, (parent)?parent->ino:0, st);
+
+	err = netfs_data_recv(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_exit;
+	}
+
+	netfs_convert_cmd(cmd);
+	total_size += sizeof(struct netfs_cmd) + cmd->size;
+
+	dprintk("%s: start: %llu, size: %llu.\n", __func__, cmd->start, cmd->size);
+
+	if (cmd->start == ~0ULL) {
+		err = -cmd->size;
+		goto err_out_exit;
+	}
+
+	if (!cmd->size) {
+		err = 0;
+		goto err_out_exit;
+	}
+
+	/*
+	 * Each directory entry can not exceed 256 bytes for path
+	 * plus header overhead, so PAGE_SIZE is more than enough.
+	 */
+
+	if (cmd->size >= PAGE_SIZE) {
+		printk("%s: wrong received data size: %llu, ino: %llu.\n",
+			__func__, cmd->size, cmd->ino);
+		BUG_ON(1);
+		err = -E2BIG;
+		goto err_out_exit;
+	}
+
+	if (!data) {
+		err = -ENOMEM;
+		data = kzalloc(cmd->size + 1, GFP_KERNEL);
+		if (!data)
+			goto err_out_exit;
+		alloc = 1;
+
+		dprintk("%s: receiving data, size: %llu.\n", __func__, cmd->size);
+		err = netfs_data_recv(st, data, cmd->size);
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto err_out_free;
+		}
+		data[cmd->size] = '\0';
+	}
+
+	dprintk("%s: receiving info.\n", __func__);
+	err = netfs_data_recv(st, &st->info, sizeof(struct netfs_inode_info));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_free;
+	}
+	
+	total_size += sizeof(struct netfs_inode_info);
+
+	netfs_convert_inode_info(&st->info);
+
+	npi = pohmelfs_new_inode(psb, parent, data, cmd, &st->info);
+	if (IS_ERR(npi)) {
+		err = PTR_ERR(npi);
+		if (err != -EEXIST)
+			goto err_out_exit;
+		npi = NULL;
+	} else
+		err = 0;
+	
+	dprintk("%s: all is ok, total_size: %d, err: %d.\n",
+			__func__, total_size, err);
+
+	*newp = npi;
+	return total_size;
+
+err_out_free:
+	if (alloc)
+		kfree(data);
+err_out_exit:
+	*newp = NULL;
+	dprintk("%s: returning err: %d.\n", __func__, err);
+	return err;
+}
+
+static int netfs_sync_inode(struct pohmelfs_inode *pi, u64 start)
+{
+	struct inode *inode = &pi->vfs_inode;
+	struct pohmelfs_sb *psb = POHMELFS_SB(inode->i_sb);
+	struct netfs_state *st = &psb->state;
+	struct netfs_cmd *cmd = &st->cmd;
+	struct pohmelfs_inode *npi;
+	int err, added = 0;
+	u64 size, ps;
+
+	dprintk("%s: start: %llu, inode: %p [%lu].\n",
+			__func__, start, inode, inode->i_ino);
+
+	mutex_lock(&st->lock);
+
+	while (1) {
+		cmd->cmd = NETFS_READDIR;
+		cmd->ino = inode->i_ino;
+		cmd->ts = 0;
+		cmd->size = PAGE_SIZE;
+		cmd->start = start;
+
+		netfs_convert_cmd(cmd);
+
+		err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto err_out_unlock;
+		}
+
+		dprintk("%s: receiving reply.\n", __func__);
+		err = netfs_data_recv(st, cmd, sizeof(struct netfs_cmd));
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto err_out_unlock;
+		}
+
+		netfs_convert_cmd(cmd);
+
+		dprintk("%s: received size: %llu.\n", __func__, cmd->size);
+		ps = size = cmd->size;
+
+		if (!size)
+			break;
+
+		while (size != 0) {
+			err = netfs_recv_inode_info(psb, pi, &npi, NULL);
+			if (err < 0)
+				goto err_out_unlock;
+
+			size -= err;
+			start += err;
+			added++;
+		}
+
+		if (ps < PAGE_SIZE - 256 - sizeof(struct netfs_cmd) -
+				sizeof(struct netfs_inode_info))
+			break;
+	}
+
+	mutex_unlock(&st->lock);
+
+	return added;
+
+err_out_unlock:
+	mutex_unlock(&st->lock);
+	dprintk("%s: returning err: %d.\n", __func__, err);
+	return err;
+}
+
+static inline int pohmelfs_sync_inode(struct pohmelfs_inode *pi, u64 start)
+{
+	int err = 0;
+
+	dprintk("%s: start: %llu, state: %lu.\n", __func__, start, pi->state);
+
+	if (!test_and_set_bit(NETFS_STATE_SYNC, &pi->state)) {
+		err = netfs_sync_inode(pi, start);
+		if (err < 0) {
+			clear_bit(NETFS_STATE_SYNC, &pi->state);
+			return err;
+		}
+	}
+
+	return err;
+}
+
+static int pohmelfs_readdir(struct file *file, void *dirent, filldir_t filldir)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	struct pohmelfs_inode *pi = POHMELFS_I(inode);
+	struct pohmelfs_name *n;
+	int err = 0;
+	u64 len;
+
+	pohmelfs_sync_inode(pi, file->f_pos);
+
+	while (1) {
+		mutex_lock(&pi->offset_lock);
+		n = pohmelfs_search_offset(pi, file->f_pos);
+		dprintk("%s: offset: %llu, parent ino: %lu, n: %p.\n",
+				__func__, file->f_pos, pi->vfs_inode.i_ino, n);
+		if (!n) {
+			mutex_unlock(&pi->offset_lock);
+			err = 0;
+			break;
+		}
+
+		len = n->len;
+		err = filldir(dirent, n->data, n->len, file->f_pos,
+				n->ino, (n->mode >> 12) & 15);
+		mutex_unlock(&pi->offset_lock);
+
+		if (err < 0) {
+			dprintk("%s: err: %d.\n", __func__, err);
+			break;
+		}
+
+		file->f_pos += len;
+	}
+
+	return err;
+}
+
+const struct file_operations pohmelfs_dir_fops = {
+	.read = generic_read_dir,
+	.readdir = pohmelfs_readdir,
+};
+
+struct pohmelfs_inode *pohmelfs_process_lookup_request(struct pohmelfs_sb *psb,
+		struct pohmelfs_inode *parent, char *name, __u32 len, __u32 hash)
+{
+	struct pohmelfs_inode *npi;
+	void *data;
+	struct netfs_cmd *cmd;
+	struct netfs_state *st = &psb->state;
+	int err = -ENOMEM;
+
+	data = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!data)
+		goto err_out_exit;
+
+	cmd = data;
+
+	cmd->cmd = NETFS_LOOKUP;
+	cmd->ino = (parent)?parent->ino:0;
+	cmd->ts = 0;
+	cmd->size = len;
+	cmd->start = hash;
+	cmd->flags = 0;
+
+	memcpy(data + sizeof(struct netfs_cmd), name, len);
+	
+	netfs_convert_cmd(cmd);
+
+	mutex_lock(&st->lock);
+
+	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd) + len);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_unlock;
+	}
+
+	npi = NULL;
+	err = netfs_recv_inode_info(psb, parent, &npi, NULL);
+	if (err < 0)
+		goto err_out_unlock;
+	
+	mutex_unlock(&st->lock);
+	kfree(data);
+
+	return npi;
+
+err_out_unlock:
+	mutex_unlock(&st->lock);
+	kfree(data);
+err_out_exit:
+	return ERR_PTR(err);
+}
+
+struct dentry *pohmelfs_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
+{
+	struct inode *inode = NULL;
+	struct pohmelfs_sb *psb = POHMELFS_SB(dir->i_sb);
+	struct pohmelfs_inode *npi, *parent = POHMELFS_I(dir);
+	int err = -ENOMEM;
+	struct pohmelfs_name *n;
+
+	dentry->d_op = dir->i_sb->s_root->d_op;
+
+	dprintk("%s: dir: %p, dir_ino: %lu, dentry: %p, dinode: %p, "
+			"name: %s, len: %u.\n",
+			__func__, dir, dir->i_ino, dentry, dentry->d_inode,
+			dentry->d_name.name, dentry->d_name.len);
+
+	do {
+		mutex_lock(&psb->hash_lock);
+		npi = pohmelfs_search_hash(&psb->hash_root, dir->i_ino, dentry->d_name.hash, dentry->d_name.len);
+		if (npi) {
+			inode = &npi->vfs_inode;
+			mutex_unlock(&psb->hash_lock);
+			goto out_add;
+		}
+		mutex_unlock(&psb->hash_lock);
+
+		mutex_lock(&parent->offset_lock);
+		n = pohmelfs_search_name_hash(&parent->hash_root, parent->ino, dentry->d_name.hash, dentry->d_name.len);
+		if (n) {
+			inode = ilookup(dir->i_sb, n->ino);
+
+			if (inode) {
+				iput(inode);
+				mutex_unlock(&parent->offset_lock);
+				goto out_add;
+			}
+		}
+		mutex_unlock(&parent->offset_lock);
+
+		err = pohmelfs_sync_inode(POHMELFS_I(dir), 0);
+		if (err < 0)
+			goto err_out_exit;
+	} while (err > 0);
+
+	if (inode == NULL)
+		return NULL;
+
+out_add:
+	return d_splice_alias(inode, dentry);
+
+err_out_exit:
+	return ERR_PTR(err);
+}
+
+static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 start, int mode)
+{
+	struct pohmelfs_sb *psb = POHMELFS_SB(dir->i_sb);
+	struct pohmelfs_inode *npi;
+	struct netfs_state *st = &psb->state;
+	struct netfs_cmd *cmd = &st->cmd;
+	int err = -ENOMEM;
+	char *data;
+
+	dprintk("%s: dir_ino: %lu, name: '%s', mode: %o, start: %llu.\n",
+			__func__, dir->i_ino, dentry->d_name.name, mode, start);
+
+	data = kstrdup(dentry->d_name.name, GFP_KERNEL);
+	if (!data)
+		goto err_out_exit;
+
+	mutex_lock(&st->lock);
+
+	cmd->ino = dir->i_ino;
+	cmd->cmd = NETFS_CREATE;
+	cmd->ts = 0;
+	cmd->size = dentry->d_name.len;
+	cmd->start = start;
+	netfs_set_cmd_flags(cmd, dentry->d_name.hash, mode);
+
+	netfs_convert_cmd(cmd);
+
+	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_free;
+	}
+
+	err = netfs_data_send(st, data, dentry->d_name.len);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_free;
+	}
+
+	err = netfs_recv_inode_info(psb, POHMELFS_I(dir), &npi, data);
+	if (err < 0)
+		goto err_out_unlock;
+	mutex_unlock(&st->lock);
+
+	d_add(dentry, &npi->vfs_inode);
+	dprintk("%s: dir: '%s', nlink: %u, inode: '%s', nlink: %u, d_count: %d, d_unhashed: %d, dentry: %p.\n",
+			__func__,
+			POHMELFS_I(dir)->name.data, dir->i_nlink,
+			npi->name.data, npi->vfs_inode.i_nlink,
+			atomic_read(&dentry->d_count), d_unhashed(dentry), dentry);
+
+	return 0;
+
+err_out_free:
+	kfree(data);
+err_out_unlock:
+	mutex_unlock(&st->lock);
+err_out_exit:
+	dprintk("%s: err: %d.\n", __func__, err);
+	return err;
+}
+
+static int pohmelfs_create(struct inode *dir, struct dentry *dentry, int mode,
+		struct nameidata *nd)
+{
+	return pohmelfs_create_entry(dir, dentry, 0, mode);
+}
+
+static int pohmelfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+	int err;
+
+	inode_inc_link_count(dir);
+	err = pohmelfs_create_entry(dir, dentry, 0, mode | S_IFDIR);
+	if (err)
+		inode_dec_link_count(dir);
+
+	return err;
+}
+
+static int pohmelfs_remove_entry(struct inode *dir, struct dentry *dentry)
+{
+	struct pohmelfs_sb *psb = POHMELFS_SB(dir->i_sb);
+	struct netfs_state *st = &psb->state;
+	struct inode *inode = dentry->d_inode;
+	struct pohmelfs_inode *parent = POHMELFS_I(dir), *pi = POHMELFS_I(inode);
+	struct netfs_cmd *cmd = &st->cmd;
+	struct pohmelfs_name *n;
+	int err = -ENOENT;
+
+	dprintk("%s: dir_ino: %lu, inode: %lu, name: '%s', nlink: %u.\n",
+			__func__, dir->i_ino, inode->i_ino, dentry->d_name.name, inode->i_nlink);
+
+	mutex_lock(&st->lock);
+
+	cmd->ino = inode->i_ino;
+	cmd->cmd = NETFS_REMOVE;
+	cmd->ts = 0;
+	cmd->size = 0;
+	cmd->start = 0;
+	netfs_set_cmd_flags(cmd, dentry->d_name.hash, dentry->d_inode->i_mode);
+
+	netfs_convert_cmd(cmd);
+
+	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_unlock;
+	}
+	
+	err = netfs_data_recv(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_unlock;
+	}
+	
+	netfs_convert_cmd(cmd);
+
+	err = 0;
+	if (cmd->start == ~0ULL)
+		err = -cmd->size;
+	mutex_unlock(&st->lock);
+
+	dprintk("%s: dir_ino: %lu, inode: %lu, name: '%s', err: %d.\n",
+			__func__, dir->i_ino, inode->i_ino, dentry->d_name.name, err);
+
+	if (!err) {
+		inode->i_ctime = dir->i_ctime;
+
+		err = -ENOENT;
+		mutex_lock(&parent->offset_lock);
+		n = pohmelfs_search_name_hash(&parent->hash_root, parent->ino, pi->name.hash, pi->name.len);
+		if (n) {
+			pohmelfs_name_del(parent, n);
+			err = 0;
+		}
+		mutex_unlock(&parent->offset_lock);
+		
+		if (!err) {
+			pohmelfs_inode_del_inode(psb, pi);
+		}
+		
+		inode_dec_link_count(inode);
+		dprintk("%s: inode: %lu, lock: %ld, unhashed: %d.\n",
+				__func__, inode->i_ino, inode->i_state & I_LOCK, hlist_unhashed(&inode->i_hash));
+	}
+
+	return err;
+
+err_out_unlock:
+	mutex_unlock(&st->lock);
+	return err;
+}
+
+static int pohmelfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+	return pohmelfs_remove_entry(dir, dentry);
+}
+
+static int pohmelfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	int err;
+	struct inode *inode = dentry->d_inode;
+
+	err = pohmelfs_remove_entry(dir, dentry);
+	if (!err) {
+		inode_dec_link_count(dir);
+		inode_dec_link_count(inode);
+	}
+
+	dprintk("%s: dentry: %p, dir: '%s', nlink: %u, inode: '%s', nlink: %u, d_count: %d, d_unhashed: %d, err: %d.\n",
+			__func__, dentry,
+			POHMELFS_I(dir)->name.data, dir->i_nlink,
+			POHMELFS_I(inode)->name.data, inode->i_nlink,
+			atomic_read(&dentry->d_count), d_unhashed(dentry), err);
+
+	return err;
+}
+
+static int pohmelfs_link(struct dentry *old_dentry, struct inode *dir,
+	struct dentry *dentry)
+{
+	struct inode *inode = old_dentry->d_inode;
+
+	return pohmelfs_create_entry(dir, dentry, inode->i_ino, inode->i_mode);
+}
+
+static int pohmelfs_symlink(struct inode *dir, struct dentry *dentry, const char *symname)
+{
+	struct pohmelfs_sb *psb = POHMELFS_SB(dir->i_sb);
+	struct pohmelfs_inode *npi;
+	struct netfs_state *st = &psb->state;
+	struct netfs_cmd *cmd = &st->cmd;
+	int err = -ENOMEM;
+	unsigned int len = strlen(symname);
+	char *data;
+
+	dprintk("%s: dir_ino: %lu, dentry: '%s', dino: %p, symname: '%s'.\n",
+			__func__, dir->i_ino, dentry->d_name.name, dentry->d_inode, symname);
+
+	data = kstrdup(dentry->d_name.name, GFP_KERNEL);
+	if (!data)
+		goto err_out_exit;
+
+	mutex_lock(&st->lock);
+
+	cmd->ino = dir->i_ino;
+	cmd->cmd = NETFS_CREATE;
+	cmd->ts = 0;
+	cmd->size = dentry->d_name.len + len;
+	cmd->start = dentry->d_name.len;
+	netfs_set_cmd_flags(cmd, dentry->d_name.hash, S_IFLNK | S_IRWXUGO);
+
+	netfs_convert_cmd(cmd);
+
+	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_free;
+	}
+
+	err = netfs_data_send(st, data, dentry->d_name.len);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_free;
+	}
+
+	err = netfs_data_send(st, (void *)symname, len);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_free;
+	}
+
+	err = netfs_recv_inode_info(psb, POHMELFS_I(dir), &npi, data);
+	if (err < 0)
+		goto err_out_unlock;
+	mutex_unlock(&st->lock);
+
+	d_add(dentry, &npi->vfs_inode);
+
+	return 0;
+
+err_out_free:
+	kfree(data);
+err_out_unlock:
+	mutex_unlock(&st->lock);
+err_out_exit:
+	dprintk("%s: err: %d.\n", __func__, err);
+	return err;
+}
+
+const struct inode_operations pohmelfs_dir_inode_ops = {
+	.link		= pohmelfs_link,
+	.symlink	= pohmelfs_symlink,
+	.unlink		= pohmelfs_unlink,
+	.mkdir		= pohmelfs_mkdir,
+	.rmdir		= pohmelfs_rmdir,
+	.create		= pohmelfs_create,
+	.lookup		= pohmelfs_lookup,
+};
+
diff --git a/fs/pohmelfs/inode.c b/fs/pohmelfs/inode.c
new file mode 100644
index 0000000..b0ee0b3
--- /dev/null
+++ b/fs/pohmelfs/inode.c
@@ -0,0 +1,603 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/ktime.h>
+#include <linux/fs.h>
+#include <linux/jhash.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
+#include <linux/mm.h>
+
+#include "netfs.h"
+
+static struct kmem_cache *pohmelfs_inode_cache;
+
+struct pohmelfs_inode *pohmelfs_search_hash(struct rb_root *root, u64 parent,
+		u32 hash, u32 len)
+{
+	struct rb_node *n = root->rb_node;
+	struct pohmelfs_inode *tmp;
+	int cmp;
+
+	while (n) {
+		tmp = rb_entry(n, struct pohmelfs_inode, hash_node);
+
+		cmp = pohmelfs_cmp_hash(&tmp->name, parent, hash, len);
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else
+			return tmp;
+	}
+
+	dprintk("%s: Failed to find a name for parent %llu, hash: %x, len: %u.\n",
+			__func__, parent, hash, len);
+	return NULL;
+}
+
+static struct pohmelfs_inode *pohmelfs_insert_hash(struct rb_root *root,
+		struct pohmelfs_inode *new)
+{
+	struct rb_node **n = &root->rb_node, *parent = NULL;
+	struct pohmelfs_inode *ret = NULL, *tmp;
+	int cmp;
+
+	while (*n) {
+		parent = *n;
+
+		tmp = rb_entry(parent, struct pohmelfs_inode, hash_node);
+
+		cmp = pohmelfs_cmp_hash(&tmp->name, new->parent, new->name.hash, new->name.len);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			ret = tmp;
+			break;
+		}
+	}
+
+	if (ret) {
+		dprintk("%s: exist: ino: %lu, hash: %x, data: '%s', new: ino: %lu, hash: %x, data: '%s'.\n",
+				__func__,
+				ret->vfs_inode.i_ino, ret->name.hash, ret->name.data,
+				new->vfs_inode.i_ino, new->name.hash, new->name.data);
+		return ret;
+	}
+
+	rb_link_node(&new->hash_node, parent, n);
+	rb_insert_color(&new->hash_node, root);
+	
+	dprintk("%s: inserted: ino: %lu, hash: %x, data: '%s'.\n",
+			__func__, new->vfs_inode.i_ino, new->name.hash, new->name.data);
+
+	return new;
+}
+
+void pohmelfs_inode_del_inode(struct pohmelfs_sb *psb, struct pohmelfs_inode *pi)
+{
+	struct pohmelfs_name *n;
+	struct rb_node *rb_node;
+
+	mutex_lock(&psb->hash_lock);
+	if (pi->hash_node.rb_parent_color) {
+		rb_erase(&pi->hash_node, &psb->hash_root);
+		pi->hash_node.rb_parent_color = 0;
+	}
+	mutex_unlock(&psb->hash_lock);
+
+	mutex_lock(&pi->offset_lock);
+	for (rb_node = rb_first(&pi->offset_root); rb_node;) {
+		n = rb_entry(rb_node, struct pohmelfs_name, offset_node);
+		rb_node = rb_next(rb_node);
+
+		pohmelfs_name_del(pi, n);
+	}
+	mutex_unlock(&pi->offset_lock);
+
+	dprintk("%s: pi: %p, ino: %llu, name: '%s'.\n",
+			__func__, pi, pi->ino, pi->name.data);
+}
+
+static int netfs_process_page(struct file *file, struct page *page, __u64 cmd_op, __u64 size)
+{
+	struct inode *inode = page->mapping->host;
+	struct pohmelfs_sb *psb = POHMELFS_SB(inode->i_sb);
+	struct netfs_state *st = &psb->state;
+	struct netfs_cmd *cmd = &st->cmd;
+	int err;
+	void *addr;
+
+	mutex_lock(&st->lock);
+
+	cmd->ino = inode->i_ino;
+	cmd->start = page->index << PAGE_CACHE_SHIFT;
+	cmd->size = size;
+	cmd->cmd = cmd_op;
+	cmd->ts = 0;
+
+	dprintk("%s: page: %p, ino: %lu, start: %llu, idx: %lu, cmd: %llu, size: %llu.\n",
+			__func__, page, inode->i_ino, cmd->start, page->index, cmd_op, size);
+
+	netfs_convert_cmd(cmd);
+
+	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto out_unlock;
+	}
+
+	addr = kmap(page);
+
+	if (cmd_op == NETFS_READ_PAGE) {
+		err = netfs_data_recv(st, cmd, sizeof(struct netfs_cmd));
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto out_unmap;
+		}
+
+		netfs_convert_cmd(cmd);
+
+		if (cmd->start == ~0ULL) {
+			err = -cmd->size;
+			goto out_unmap;
+		}
+
+		if (cmd->size == 0) {
+			err = -EINVAL;
+			goto out_unmap;
+		}
+
+		err = netfs_data_recv(st, addr, cmd->size);
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto out_unmap;
+		}
+		
+		if (file)
+			file->f_pos += cmd->size;
+	} else {
+		err = netfs_data_send(st, addr, size);
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto out_unmap;
+		}
+
+		err = netfs_data_recv(st, cmd, sizeof(struct netfs_cmd));
+		if (err <= 0) {
+			if (err == 0)
+				err = -ECONNRESET;
+			goto out_unmap;
+		}
+
+		netfs_convert_cmd(cmd);
+
+		if (cmd->start == ~0ULL) {
+			err = -cmd->size;
+			goto out_unmap;
+		}
+	}
+
+	err = 0;
+	SetPageUptodate(page);
+
+out_unmap:
+	kunmap(page);
+out_unlock:
+	mutex_unlock(&st->lock);
+
+	if (err)
+		SetPageError(page);
+	unlock_page(page);
+
+	dprintk("%s: page: %p, start: %llu/%llx, size: %llu, err: %d.\n",
+			__func__, page, cmd->start, cmd->start, cmd->size, err);
+
+	return err;
+}
+
+static int pohmelfs_readpage(struct file *file, struct page *page)
+{
+	return netfs_process_page(file, page, NETFS_READ_PAGE, PAGE_CACHE_SIZE);
+}
+
+static int pohmelfs_writepage(struct page *page, struct writeback_control *wbc)
+{
+	return netfs_process_page(NULL, page, NETFS_WRITE_PAGE, page_private(page));
+}
+
+static int pohmelfs_prepare_write(struct file *file, struct page *page,
+			unsigned from, unsigned to)
+{
+	dprintk("%s: ino: %lu, from: %u, to: %u.\n",
+			__func__, page->mapping->host->i_ino, from, to);
+	SetPagePrivate(page);
+	return 0;
+}
+
+static int pohmelfs_commit_write(struct file *file, struct page *page,
+		unsigned from, unsigned to)
+{
+	struct inode *inode = page->mapping->host;
+	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	unsigned end = page_private(page);
+
+	dprintk("%s: ino: %lu, from: %u, to: %u, end: %u, pos: %llu, i_size: %llu.\n",
+			__func__, inode->i_ino, from, to, end, pos, inode->i_size);
+
+	ClearPagePrivate(page);
+	SetPageUptodate(page);
+
+	if (to > end)
+		set_page_private(page, to);
+	set_page_dirty(page);
+
+	/*
+	 * No need to use i_size_read() here, the i_size
+	 * cannot change under us because we hold i_mutex.
+	 */
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		mark_inode_dirty(inode);
+	}
+	return 0;
+}
+
+const struct address_space_operations pohmelfs_aops = {
+	.readpage		= pohmelfs_readpage,
+	.writepage		= pohmelfs_writepage,
+	.prepare_write		= pohmelfs_prepare_write,
+	.commit_write		= pohmelfs_commit_write,
+};
+
+static void pohmelfs_destroy_inode(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	struct pohmelfs_sb *psb = POHMELFS_SB(sb);
+	struct pohmelfs_inode *pi = POHMELFS_I(inode);
+
+	dprintk("%s: inode: %p, vfs_inode: %p.\n",
+			__func__, pi, inode);
+	pohmelfs_inode_del_inode(psb, pi);
+	if (pi->name.data)
+		kfree(pi->name.data);
+	kmem_cache_free(pohmelfs_inode_cache, POHMELFS_I(inode));
+}
+
+static struct inode *pohmelfs_alloc_inode(struct super_block *sb)
+{
+	struct pohmelfs_inode *inode;
+
+	inode = kmem_cache_alloc(pohmelfs_inode_cache, GFP_KERNEL);
+	if (!inode)
+		return NULL;
+	dprintk("%s: inode: %p, vfs_inode: %p.\n",
+			__func__, inode, &inode->vfs_inode);
+
+	inode->hash_node.rb_parent_color = 0;
+
+	inode->offset_root = RB_ROOT;
+	inode->hash_root = RB_ROOT;
+	mutex_init(&inode->offset_lock);
+
+	memset(&inode->name, 0, sizeof(struct pohmelfs_name));
+
+	inode->state = 0;
+	inode->parent = 0;
+
+	return &inode->vfs_inode;
+}
+
+static const struct file_operations pohmelfs_file_ops = {
+	.llseek		= generic_file_llseek,
+
+	.read		= do_sync_read,
+	.aio_read	= generic_file_aio_read,
+
+	.mmap		= generic_file_mmap,
+
+	.splice_read	= generic_file_splice_read,
+	.splice_write	= generic_file_splice_write,
+
+	.write		= do_sync_write,
+	.aio_write	= generic_file_aio_write,
+};
+
+const struct inode_operations pohmelfs_symlink_inode_operations = {
+	.readlink	= generic_readlink,
+	.follow_link	= page_follow_link_light,
+	.put_link	= page_put_link,
+};
+
+struct pohmelfs_inode *pohmelfs_fill_inode(struct pohmelfs_inode *pi, struct netfs_inode_info *info)
+{
+	struct inode *inode = &pi->vfs_inode;
+	struct pohmelfs_sb *psb = POHMELFS_SB(inode->i_sb);
+	struct pohmelfs_inode *ret;
+
+	inode->i_mode = info->mode;
+	inode->i_nlink = info->nlink;
+	inode->i_uid = info->uid;
+	inode->i_gid = info->gid;
+	inode->i_blocks = info->blocks;
+	inode->i_rdev = info->rdev;
+	inode->i_size = info->size;
+	inode->i_version = info->version;
+	inode->i_blkbits = ffs(info->blocksize) - 1;
+
+	dprintk("%s: inode: %p, num: %lu, parent: %llu hash: %x, data: '%s', "
+			"inode is regular: %d, dir: %d, link: %d, mode: %o.\n",
+			__func__, inode, inode->i_ino, pi->parent,
+			pi->name.hash, pi->name.data,
+			S_ISREG(inode->i_mode), S_ISDIR(inode->i_mode),
+			S_ISLNK(inode->i_mode), inode->i_mode);
+
+	inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME_SEC;
+
+	inode->i_blocks = DIV_ROUND_UP(inode->i_size, 512);
+
+	/*
+	 * i_mapping is a pointer to i_data during inode initialization.
+	 */
+	inode->i_data.a_ops = &pohmelfs_aops;
+	
+	if (S_ISREG(inode->i_mode)) {
+		inode->i_fop = &pohmelfs_file_ops;
+	} else if (S_ISDIR(inode->i_mode)) {
+		inode->i_fop = &pohmelfs_dir_fops;
+		inode->i_op = &pohmelfs_dir_inode_ops;
+	} else if (S_ISLNK(inode->i_mode)) {
+		inode->i_op = &pohmelfs_symlink_inode_operations;
+		inode->i_fop = &pohmelfs_file_ops;
+	} else {
+		inode->i_fop = &generic_ro_fops;
+	}
+
+	mutex_lock(&psb->hash_lock);
+	ret = pohmelfs_insert_hash(&psb->hash_root, pi);
+	mutex_unlock(&psb->hash_lock);
+
+	return ret;
+}
+
+static void pohmelfs_read_inode(struct inode *inode)
+{
+	struct pohmelfs_sb *psb = POHMELFS_SB(inode->i_sb);
+	struct netfs_state *st = &psb->state;
+	struct netfs_cmd *cmd = &st->cmd;
+	struct pohmelfs_inode *pi = POHMELFS_I(inode), *ret;
+	int err;
+
+	mutex_lock(&st->lock);
+	
+	cmd->ino = inode->i_ino;
+	cmd->cmd = NETFS_READ_INODE;
+	cmd->size = 0;
+	cmd->start = 0;
+	cmd->ts = 0;
+
+	netfs_convert_cmd(cmd);
+	
+	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_unlock;
+	}
+
+	err = netfs_data_recv(st, cmd, sizeof(struct netfs_cmd));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_unlock;
+	}
+
+	netfs_convert_cmd(cmd);
+
+	err = -EINVAL;
+	if (cmd->size != sizeof(struct netfs_inode_info))
+		goto err_out_unlock;
+
+	err = netfs_data_recv(st, &st->info, sizeof(struct netfs_inode_info));
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		goto err_out_unlock;
+	}
+
+	err = -ENODEV;
+	if (cmd->start == ~0ULL)
+		goto err_out_unlock;
+
+	netfs_convert_inode_info(&st->info);
+
+	ret = pohmelfs_fill_inode(pi, &st->info);
+	if (ret != pi)
+		goto err_out_unlock;
+
+	mutex_unlock(&st->lock);
+
+	return;
+
+err_out_unlock:
+	mutex_unlock(&st->lock);
+	make_bad_inode(inode);
+	return;
+}
+
+static void pohmelfs_put_super(struct super_block *sb)
+{
+	struct pohmelfs_sb *psb = POHMELFS_SB(sb);
+	struct rb_node *rb_node;
+	struct pohmelfs_inode *pi;
+
+	while ((rb_node = rb_first(&psb->hash_root)) != NULL) {
+		pi = rb_entry(rb_node, struct pohmelfs_inode, hash_node);
+
+		iput(&pi->vfs_inode);
+	}
+	kfree(psb);
+	sb->s_fs_info = NULL;
+}
+
+static int pohmelfs_remount(struct super_block *sb, int *flags, char *data)
+{
+	*flags |= MS_RDONLY;
+	return 0;
+}
+
+static const struct super_operations pohmelfs_sb_ops = {
+	.alloc_inode	= pohmelfs_alloc_inode,
+	.destroy_inode	= pohmelfs_destroy_inode,
+	.read_inode	= pohmelfs_read_inode,
+	.put_super	= pohmelfs_put_super,
+	.remount_fs	= pohmelfs_remount,
+};
+
+static int pohmelfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+	struct pohmelfs_sb *psb;
+	int err = -ENOMEM;
+	struct inode *root;
+	struct pohmelfs_inode *npi;
+
+	psb = kzalloc(sizeof(struct pohmelfs_sb), GFP_KERNEL);
+	if (!psb)
+		goto err_out_exit;
+
+	sb->s_fs_info = psb;
+	sb->s_op = &pohmelfs_sb_ops;
+
+	psb->sb = sb;
+	psb->hash_root = RB_ROOT;
+
+	mutex_init(&psb->hash_lock);
+
+	err = pohmelfs_state_init(&psb->state, 0);
+	if (err)
+		goto err_out_free_sb;
+
+	npi = pohmelfs_process_lookup_request(psb, NULL, "/", 1, full_name_hash("/", 1));
+	if (IS_ERR(npi) || !npi) {
+		err = PTR_ERR(npi);
+		if (!err)
+			err = -ENODEV;
+		goto err_out_state_exit;
+	}
+
+	root = &npi->vfs_inode;
+	err = -ENOMEM;
+	sb->s_root = d_alloc_root(root);
+	if (!sb->s_root)
+		goto err_out_put_root;
+
+	return 0;
+
+err_out_put_root:
+	iput(root);
+err_out_state_exit:
+	pohmelfs_state_exit(&psb->state);
+err_out_free_sb:
+	kfree(psb);
+err_out_exit:
+	return err;
+}
+
+static int pohmelfs_get_sb(struct file_system_type *fs_type,
+	int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+	return get_sb_nodev(fs_type, flags, data, pohmelfs_fill_super,
+				mnt);
+}
+
+static struct file_system_type pohmel_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "pohmel",
+	.get_sb		= pohmelfs_get_sb,
+	.kill_sb	= kill_anon_super,
+};
+
+static void pohmelfs_init_once(void *data, struct kmem_cache *cachep, unsigned long flags)
+{
+	struct pohmelfs_inode *inode = data;
+
+	inode_init_once(&inode->vfs_inode);
+}
+
+static int pohmelfs_init_inodecache(void)
+{
+	pohmelfs_inode_cache = kmem_cache_create("pohmelfs_inode_cache",
+				sizeof(struct pohmelfs_inode),
+				0, (SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD),
+				pohmelfs_init_once);
+	if (!pohmelfs_inode_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void pohmelfs_destroy_inodecache(void)
+{
+	kmem_cache_destroy(pohmelfs_inode_cache);
+}
+
+static int __init init_pohmel_fs(void)
+{
+	int err;
+
+	err = pohmelfs_config_init();
+	if (err)
+		goto err_out_exit;
+
+	err = pohmelfs_init_inodecache();
+	if (err)
+		goto err_out_config_exit;
+
+	err = register_filesystem(&pohmel_fs_type);
+	if (err)
+		goto err_out_destroy;
+
+	return 0;
+
+err_out_destroy:
+	pohmelfs_destroy_inodecache();
+err_out_config_exit:
+	pohmelfs_config_exit();
+err_out_exit:
+	return err;
+}
+
+static void __exit exit_pohmel_fs(void)
+{
+	unregister_filesystem(&pohmel_fs_type);
+	pohmelfs_destroy_inodecache();
+	pohmelfs_config_exit();
+}
+
+module_init(init_pohmel_fs);
+module_exit(exit_pohmel_fs);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_DESCRIPTION("Pohmel filesystem");
diff --git a/fs/pohmelfs/net.c b/fs/pohmelfs/net.c
new file mode 100644
index 0000000..b886ad3
--- /dev/null
+++ b/fs/pohmelfs/net.c
@@ -0,0 +1,112 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include "netfs.h"
+
+int netfs_data_recv(struct netfs_state *st, void *buf, u64 size)
+{
+	struct msghdr msg;
+	struct kvec iov;
+	int err;
+
+	BUG_ON(!size);
+
+	iov.iov_base = buf;
+	iov.iov_len = size;
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL;
+
+	err = kernel_recvmsg(st->socket, &msg, &iov, 1, iov.iov_len,
+			msg.msg_flags);
+	if (err <= 0) {
+		printk("%s: failed to receive data: size: %llu, err: %d.\n", __func__, size, err);
+	}
+
+	return err;
+}
+
+int netfs_data_send(struct netfs_state *st, void *buf, u64 size)
+{
+	struct msghdr msg;
+	struct kvec iov;
+	int err;
+
+	BUG_ON(!size);
+
+	iov.iov_base = buf;
+	iov.iov_len = size;
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL;
+
+	err = kernel_sendmsg(st->socket, &msg, &iov, 1, iov.iov_len);
+	if (err <= 0) {
+		printk("%s: failed to send data: size: %llu, err: %d.\n", __func__, size, err);
+	}
+
+	return err;
+}
+
+int pohmelfs_state_init(struct netfs_state *st, unsigned int idx)
+{
+	int err = -ENOMEM;
+	struct pohmelfs_ctl *ctl;
+
+	mutex_init(&st->lock);
+
+	ctl = kzalloc(sizeof(struct pohmelfs_ctl), GFP_KERNEL);
+	if (!ctl)
+		goto err_out_exit;
+
+	err = pohmelfs_copy_config(ctl, idx);
+	if (err)
+		goto err_out_free;
+
+	err = sock_create(ctl->addr.sa_family, ctl->type, ctl->proto, &st->socket);
+	if (err)
+		goto err_out_free;
+
+	err = st->socket->ops->connect(st->socket,
+			(struct sockaddr *)&ctl->addr, ctl->addrlen, 0);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_free:
+	kfree(ctl);
+err_out_exit:
+	return err;
+}
+
+void pohmelfs_state_exit(struct netfs_state *st)
+{
+	sock_release(st->socket);
+}
diff --git a/fs/pohmelfs/netfs.h b/fs/pohmelfs/netfs.h
new file mode 100644
index 0000000..23aa953
--- /dev/null
+++ b/fs/pohmelfs/netfs.h
@@ -0,0 +1,254 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __NETFS_H
+#define __NETFS_H
+
+#include <linux/types.h>
+#include <linux/connector.h>
+
+#define POHMELFS_CN_IDX			5
+#define POHMELFS_CN_VAL			0
+
+struct netfs_cmd
+{
+	__u64			cmd;
+	__u64			ino;
+	__u64			flags;
+	__u64			ts;
+	__u64			size;
+	__u64			start;
+	__u8			data[];
+};
+
+static inline void netfs_convert_cmd(struct netfs_cmd *cmd)
+{
+	cmd->cmd = __be64_to_cpu(cmd->cmd);
+	cmd->ino = __be64_to_cpu(cmd->ino);
+	cmd->ts = __be64_to_cpu(cmd->ts);
+	cmd->size = __be64_to_cpu(cmd->size);
+	cmd->start = __be64_to_cpu(cmd->start);
+	cmd->flags = __be64_to_cpu(cmd->flags);
+}
+
+enum {
+	NETFS_READDIR	= 1,	/* Read directory for given inode number */
+	NETFS_LOOKUP,		/* Lookup inode for given name */
+	NETFS_READ_INODE,	/* Read inode data */
+	NETFS_READ_PAGE,	/* Read data page from the server */
+	NETFS_WRITE_PAGE,	/* Write data page to the server */
+	NETFS_CREATE,		/* Create directory entry */
+	NETFS_REMOVE,		/* Remove directory entry */
+	NETFS_STATE,		/* State change message */
+	NETFS_CMD_MAX
+};
+
+#define _K_SS_MAXSIZE	128
+
+struct saddr
+{
+	unsigned short		sa_family;
+	char			addr[_K_SS_MAXSIZE];
+};
+
+struct pohmelfs_ctl
+{
+	unsigned int		idx;
+	unsigned int		type;
+	unsigned int		proto;
+	unsigned int		addrlen;
+	struct saddr		addr;
+};
+
+struct pohmelfs_cn_ack
+{
+	struct cn_msg		msg;
+	int			error;
+	int			unused[3];
+};
+
+struct netfs_inode_info
+{
+	unsigned int		mode;
+	unsigned int		nlink;
+	unsigned int		uid;
+	unsigned int		gid;
+	unsigned int		blocksize;
+	unsigned int		padding;
+	__u64			ino;
+	__u64			blocks;
+	__u64			rdev;
+	__u64			size;
+	__u64			version;
+};
+
+static inline void netfs_convert_inode_info(struct netfs_inode_info *info)
+{
+	info->mode = __cpu_to_be32(info->mode);
+	info->nlink = __cpu_to_be32(info->nlink);
+	info->uid = __cpu_to_be32(info->uid);
+	info->gid = __cpu_to_be32(info->gid);
+	info->blocksize = __cpu_to_be32(info->blocksize);
+	info->blocks = __cpu_to_be64(info->blocks);
+	info->rdev = __cpu_to_be64(info->rdev);
+	info->size = __cpu_to_be64(info->size);
+	info->version = __cpu_to_be64(info->version);
+	info->ino = __cpu_to_be64(info->ino);
+}
+
+static inline __u32 netfs_get_inode_hash(struct netfs_cmd *cmd)
+{
+	return cmd->flags >> 32;
+}
+
+static inline __u32 netfs_get_inode_mode(struct netfs_cmd *cmd)
+{
+	return cmd->flags & 0xffffffff;
+}
+
+static inline void netfs_set_cmd_flags(struct netfs_cmd *cmd, __u32 hash, __u32 type)
+{
+	cmd->flags = hash;
+	cmd->flags <<= 32;
+	cmd->flags |= type;
+}
+
+enum {
+	NETFS_STATE_SYNC = 0,		/* Inode is in sync */
+};
+
+#ifdef __KERNEL__
+
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/net.h>
+
+struct pohmelfs_name
+{
+	struct rb_node		offset_node;
+	struct rb_node		hash_node;
+
+	u64			ino, parent;
+
+	u64			offset;
+
+	u32			mode;
+	u32			hash;
+	u32			len;
+	char			*data;
+};
+
+struct pohmelfs_inode
+{
+	struct rb_node		hash_node;
+
+	struct rb_root		hash_root;
+	struct rb_root		offset_root;
+	struct mutex		offset_lock;
+
+	long			state;
+
+	u64			ino;
+	u64			parent;
+
+	struct pohmelfs_name	name;
+
+	struct inode		vfs_inode;
+};
+
+struct netfs_state
+{
+	struct mutex		lock;
+	struct netfs_cmd	cmd;
+	struct netfs_inode_info	info;
+	struct socket		*socket;
+};
+
+struct pohmelfs_sb
+{
+	struct rb_root		hash_root;
+	struct mutex		hash_lock;
+
+	struct super_block	*sb;
+
+	struct netfs_state	state;
+};
+
+static inline struct pohmelfs_sb *POHMELFS_SB(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+static inline struct pohmelfs_inode *POHMELFS_I(struct inode *inode)
+{
+	return container_of(inode, struct pohmelfs_inode, vfs_inode);
+}
+
+extern int __init pohmelfs_config_init(void);
+extern void __exit pohmelfs_config_exit(void);
+extern int pohmelfs_copy_config(struct pohmelfs_ctl *dst, unsigned int idx);
+
+extern const struct file_operations pohmelfs_dir_fops;
+extern const struct inode_operations pohmelfs_dir_inode_ops;
+
+extern int netfs_data_recv(struct netfs_state *st, void *buf, u64 size);
+extern int netfs_data_send(struct netfs_state *st, void *buf, u64 size);
+extern int pohmelfs_state_init(struct netfs_state *st, unsigned int idx);
+extern void pohmelfs_state_exit(struct netfs_state *st);
+
+extern struct pohmelfs_inode *pohmelfs_fill_inode(struct pohmelfs_inode *pi,
+		struct netfs_inode_info *info);
+
+static inline int pohmelfs_cmp_hash(struct pohmelfs_name *n, u64 parent, u32 hash, u32 len)
+{
+	if (n->parent > parent)
+		return -1;
+	if (n->parent < parent)
+		return 1;
+
+	if (n->hash > hash)
+		return -1;
+	if (n->hash < hash)
+		return 1;
+	
+	if (n->len > len)
+		return -1;
+	if (n->len < len)
+		return 1;
+	
+	return 0;
+}
+
+extern struct pohmelfs_inode *pohmelfs_search_hash(struct rb_root *root,
+		u64 parent, u32 hash, u32 len);
+
+extern struct pohmelfs_inode *pohmelfs_process_lookup_request(struct pohmelfs_sb *psb,
+		struct pohmelfs_inode *parent, char *name, __u32 len, __u32 hash);
+
+void pohmelfs_name_del(struct pohmelfs_inode *parent, struct pohmelfs_name *n);
+
+void pohmelfs_inode_del_inode(struct pohmelfs_sb *psb, struct pohmelfs_inode *pi);
+
+#define CONFIG_POHMELFS_DEBUG
+
+#ifdef CONFIG_POHMELFS_DEBUG
+#define dprintk(f, a...) printk(f, ##a)
+#else
+#define dprintk(f, a...) do {} while (0)
+#endif
+
+#endif /* __KERNEL__ */
+
+#endif /* __NETFS_H */

-- 
	Evgeniy Polyakov

^ permalink raw reply related

* [2/2] POHMELFS: hack to disable writeback.
From: Evgeniy Polyakov @ 2008-01-31 19:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, linux-fsdevel
In-Reply-To: <11972872493911@2ka.mipt.ru>

This patch disables writeback in POHMELFS and creates all objects
on its own, without syncing with the remote side.
This mode is _very_ fast.

If POHMELFS were bound to a single remote filesystem, it could
use that filesystem's inode allocation policy and be very happy with a
write-back cache. By design, POHMELFS is a transport layer in a distributed
filesystem, which will work with one or another remote filesystem (likely a
completely new one), so instead of the stupid algorithm shown here, it will
contain correct object creation.

Likely the way to go is to use the name hash together with the parent inode
number as a unique ID, which can be matched to a filesystem path, so that the
remote side could create objects without _any_ knowledge of inode numbers on
the local fs.
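
The unique-ID idea in the paragraph above could look roughly like the
following user-space sketch. It is only an illustration under stated
assumptions: struct obj_id, toy_name_hash(), obj_id_make() and obj_id_cmp()
are hypothetical names that do not exist in the patch, and the FNV-1a hash
merely stands in for whatever name hash the filesystem really uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical object ID; none of these names come from the patch. */
struct obj_id {
	uint64_t parent;	/* inode number of the parent directory */
	uint32_t hash;		/* hash of the entry name */
	uint32_t len;		/* length of the entry name */
};

/* Toy FNV-1a string hash, a stand-in for the real name hash. */
static uint32_t toy_name_hash(const char *name, uint32_t len)
{
	uint32_t h = 2166136261u;
	uint32_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 16777619u;
	}
	return h;
}

/* Build an ID from the parent inode number and the entry name. */
static struct obj_id obj_id_make(uint64_t parent, const char *name)
{
	struct obj_id id;

	assert(name != NULL);
	id.parent = parent;
	id.len = (uint32_t)strlen(name);
	id.hash = toy_name_hash(name, id.len);
	return id;
}

/* Total order over IDs: parent first, then hash, then name length. */
static int obj_id_cmp(const struct obj_id *a, const struct obj_id *b)
{
	if (a->parent != b->parent)
		return a->parent < b->parent ? -1 : 1;
	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	if (a->len != b->len)
		return a->len < b->len ? -1 : 1;
	return 0;
}
```

Two different names in the same directory can still collide on hash and
length, so a real implementation would presumably have to compare the name
itself as a final tie-breaker.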

Crappy-stuff-created-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/fs/pohmelfs/dir.c b/fs/pohmelfs/dir.c
index 23f9ecd..5aec593 100644
--- a/fs/pohmelfs/dir.c
+++ b/fs/pohmelfs/dir.c
@@ -80,6 +80,8 @@ static struct pohmelfs_name *pohmelfs_insert_offset(struct pohmelfs_inode *pi,
 	rb_link_node(&new->offset_node, parent, n);
 	rb_insert_color(&new->offset_node, &pi->offset_root);
 
+	pi->total_len += new->len;
+
 	return NULL;
 }
 
@@ -647,6 +649,7 @@ static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 s
 	cmd->start = start;
 	netfs_set_cmd_flags(cmd, dentry->d_name.hash, mode);
 
+#if 0
 	netfs_convert_cmd(cmd);
 
 	err = netfs_data_send(st, cmd, sizeof(struct netfs_cmd));
@@ -666,6 +669,30 @@ static int pohmelfs_create_entry(struct inode *dir, struct dentry *dentry, u64 s
 	err = netfs_recv_inode_info(psb, POHMELFS_I(dir), &npi, data);
 	if (err < 0)
 		goto err_out_unlock;
+#else
+	{
+		static u64 pohmelfs_ino = 123;
+
+		st->info.mode = netfs_get_inode_mode(cmd);
+		st->info.ino = pohmelfs_ino++;
+		st->info.nlink = 2;
+		st->info.uid = 2319;
+		st->info.gid = 100;
+
+		cmd->ino = st->info.ino;
+		cmd->start = POHMELFS_I(dir)->total_len;
+	}
+
+	npi = pohmelfs_new_inode(psb, POHMELFS_I(dir), data, cmd, &st->info);
+	if (IS_ERR(npi)) {
+		err = PTR_ERR(npi);
+		if (err != -EEXIST)
+			goto err_out_unlock;
+		npi = NULL;
+	} else
+		err = 0;
+	npi->state = 1;
+#endif
 	mutex_unlock(&st->lock);
 
 	d_add(dentry, &npi->vfs_inode);
diff --git a/fs/pohmelfs/inode.c b/fs/pohmelfs/inode.c
index b0ee0b3..6a81bdc 100644
--- a/fs/pohmelfs/inode.c
+++ b/fs/pohmelfs/inode.c
@@ -125,6 +125,16 @@ static int netfs_process_page(struct file *file, struct page *page, __u64 cmd_op
 	int err;
 	void *addr;
 
+	{
+		if (cmd_op == NETFS_READ_PAGE) {
+			if (file)
+				file->f_pos += cmd->size;
+		}
+		SetPageUptodate(page);
+		unlock_page(page);
+		return 0;
+	}
+
 	mutex_lock(&st->lock);
 
 	cmd->ino = inode->i_ino;
@@ -305,6 +315,7 @@ static struct inode *pohmelfs_alloc_inode(struct super_block *sb)
 
 	inode->state = 0;
 	inode->parent = 0;
+	inode->total_len = 0;
 
 	return &inode->vfs_inode;
 }
diff --git a/fs/pohmelfs/netfs.h b/fs/pohmelfs/netfs.h
index 23aa953..b719fbe 100644
--- a/fs/pohmelfs/netfs.h
+++ b/fs/pohmelfs/netfs.h
@@ -163,6 +163,8 @@ struct pohmelfs_inode
 	u64			ino;
 	u64			parent;
 
+	u64			total_len;
+
 	struct pohmelfs_name	name;
 
 	struct inode		vfs_inode;
 

-- 
	Evgeniy Polyakov

^ permalink raw reply related

* Re: e1000 full-duplex TCP performance well below wire speed
From: Kok, Auke @ 2008-01-31 19:32 UTC (permalink / raw)
  To: Bruce Allen
  Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
	Bruce Allen
In-Reply-To: <Pine.LNX.4.63.0801311251040.14403@trinity.phys.uwm.edu>

Bruce Allen wrote:
> Hi Auke,
> 
>>>>> Important note: we ARE able to get full duplex wire speed (over 900
>>>>> Mb/s simultaneously in both directions) using UDP.  The problems occur
>>>>> only with TCP connections.
>>>>
>>>> That eliminates bus bandwidth issues, probably, but small packets take
>>>> up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.
>>>
>>> I see.  Your concern is the extra ACK packets associated with TCP.  Even
>>> though these represent a small volume of data (around 5% with MTU=1500,
>>> and less at larger MTU) they double the number of packets that must be
>>> handled by the system compared to UDP transmission at the same data
>>> rate. Is that correct?
>>
>> A lot of people tend to forget that the pci-express bus has enough
>> bandwidth on first glance - 2.5gbit/sec for 1gbit of traffic, but
>> apart from data going over it there is significant overhead going on:
>> each packet requires transmit, cleanup and buffer transactions, and
>> there are many irq register clears per second (slow ioread/writes).
>> The transactions double for TCP ack processing, and this all
>> accumulates and starts to introduce latency, higher cpu utilization
>> etc...
> 
> Based on the discussion in this thread, I am inclined to believe that
> lack of PCI-e bus bandwidth is NOT the issue.  The theory is that the
> extra packet handling associated with TCP acknowledgements is pushing
> the PCI-e x1 bus past its limits.  However the evidence seems to show
> otherwise:
> 
> (1) Bill Fink has reported the same problem on a NIC with a 133 MHz
> 64-bit PCI connection.  That connection can transfer data at 8Gb/s.

That was even a PCI-X connection, which is known to have extremely good latency
numbers, IIRC better than PCI-e (?), which could account for a lot of the
latency-induced lower performance...
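
For reference, the 8 Gb/s figure quoted in point (1) is simply clock rate
times bus width. The sketch below reproduces that arithmetic; the function
names are made up for illustration, and the PCIe helper assumes a
first-generation link (2.5 GT/s per lane with 8b/10b encoding), which is an
assumption about the hardware under discussion rather than something stated
in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Peak rate of a parallel bus in bits/s: clock (Hz) times width (bits). */
static uint64_t bus_peak_bps(uint64_t clock_hz, unsigned int width_bits)
{
	assert(width_bits > 0);
	return clock_hz * width_bits;
}

/*
 * Payload rate of a first-generation PCIe link in bits/s per direction:
 * 2.5 GT/s per lane, of which 8 out of every 10 bits carry data.
 */
static uint64_t pcie_gen1_payload_bps(unsigned int lanes)
{
	assert(lanes > 0);
	return lanes * 2500000000ULL * 8 / 10;
}
```

With these helpers, bus_peak_bps(133000000, 64) gives 8512000000 (about
8.5 Gb/s), and pcie_gen1_payload_bps(1) gives 2000000000 (2 Gb/s in each
direction before any protocol overhead).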

Also, 82573's are _not_ a server part and were not designed for this usage. 82546's
are, and that really does make a difference. 82573's are full of power-saving
features, and all that does make a difference even with some of them turned off.
It's not for nothing that these 82573's are used in a ton of laptops from
Toshiba, Lenovo, etc. A lot of this has to do with the card's internal clock
timings, as usual.

So, you'd really have to compare the 82546 to an 82571 card to be fair. You get
what you pay for, so to speak.

> (2) If the theory is right, then doubling the MTU from 1500 to 3000
> should have significantly reduced the problem, since it drops the number
> of ACKs by a factor of two.  Similarly, going from MTU 1500 to MTU 9000
> should reduce the number of ACKs by a factor of six, practically
> eliminating the problem. But changing the MTU size does not help.
> 
> (3) The interrupt counts are quite reasonable.  Broadcom NICs without
> interrupt aggregation generate an order of magnitude more irq/s and this
> doesn't prevent wire speed performance there.
> 
> (4) The CPUs on the system are largely idle.  There are plenty of
> computing resources available.
> 
> (5) I don't think that the overhead will increase the bandwidth needed
> by more than a factor of two.  Of course you and the other e1000
> developers are the experts, but the dominant bus cost should be copying
> data buffers across the bus. Everything else is minimal in comparison.
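
A rough back-of-the-envelope for points (1) and (2), as a sketch only.  It
assumes 40 bytes of IPv4+TCP headers per segment (no options), a receiver
that ACKs every second full-size segment, and 1 Gb/s of TCP payload.  The
function names are made up for illustration and none of this is measured
data:

```c
#include <assert.h>

/* Assumption: 20-byte IPv4 header + 20-byte TCP header, no options. */
#define IP_TCP_HDR_BYTES 40

/* Full-size data segments per second needed to carry payload_bps
 * bits/s of TCP payload at the given MTU. */
double segments_per_sec(double payload_bps, unsigned int mtu)
{
	assert(mtu > IP_TCP_HDR_BYTES);
	return payload_bps / (8.0 * (mtu - IP_TCP_HDR_BYTES));
}

/* ACKs per second if the receiver sends one ACK for every
 * segs_per_ack data segments (2 models classic delayed ACKs). */
double acks_per_sec(double payload_bps, unsigned int mtu,
		    unsigned int segs_per_ack)
{
	assert(segs_per_ack > 0);
	return segments_per_sec(payload_bps, mtu) / segs_per_ack;
}

/* Theoretical peak of a parallel PCI/PCI-X bus in bits/s:
 * one transfer of width_bits per clock, ignoring protocol overhead. */
double parallel_bus_bps(double clock_hz, unsigned int width_bits)
{
	return clock_hz * width_bits;
}
```

Under those assumptions, MTU 1500 works out to roughly 43k ACKs/s per
direction and MTU 9000 to roughly 7k ACKs/s, a ratio of about 6.1, while a
133 MHz 64-bit bus peaks at about 8.5 Gb/s.
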
> 
> Intel insiders: isn't there some simple instrumentation available (which
> read registers or statistics counters on the PCI-e interface chip) to
> tell us statistics such as how many bits have moved over the link in
> each direction? This plus some accurate timing would make it easy to see
> if the TCP case is saturating the PCI-e bus.  Then the theory could be
> addressed with data rather than with opinions.

The only tools we have are expensive bus analyzers. As said in the thread with
Rick Jones, I think there might be some tools available from Intel for this, but
I have never seen them.

Auke


^ permalink raw reply

* Re: e1000 full-duplex TCP performance well below wire speed
From: Bruce Allen @ 2008-01-31 19:37 UTC (permalink / raw)
  To: Bill Fink
  Cc: SANGTAE HA, Linux Kernel Mailing List, netdev, Stephen Hemminger
In-Reply-To: <20080131123627.599be68f.billfink@mindspring.com>

Hi Bill,

>>> I see similar results on my test systems
>>
>> Thanks for this report and for confirming our observations.  Could you
>> please confirm that a single-port bidirectional UDP link runs at wire
>> speed?  This helps to localize the problem to the TCP stack or interaction
>> of the TCP stack with the e1000 driver and hardware.
>
> Yes, a single-port bidirectional UDP test gets full GigE line rate
> in both directions with no packet loss.

Thanks for confirming this.  And thanks also for nuttcp!  I just 
recognized you as the author.

Cheers,
 	Bruce

^ permalink raw reply

* RE: [PATCH] Disable TSO for non standard qdiscs
From: Waskiewicz Jr, Peter P @ 2008-01-31 19:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Patrick McHardy, Stephen Hemminger, netdev
In-Reply-To: <20080131193406.GH4671@one.firstfloor.org>

> The philosophical problem I have with this suggestion is that 
> I expect that the large majority of users will be more happy 
> with disabled TSO if they use non standard qdiscs and 
> defaults that do not fit the majority use case are bad.
> 
> Basically you're suggesting that nearly everyone using tc 
> should learn about another obscure command.

If someone is using tc to load and configure a qdisc, I'd really hope
knowing or learning ethtool wouldn't be a stretch for them...  And I'm
not arguing about whether the majority of people will want this, but taking
away the ability to use TSO at the kernel level here without allowing a
tunable is making that decision for them, which is wrong IMO.

Cheers,

-PJ Waskiewicz
<peter.p.waskiewicz.jr@intel.com>

^ permalink raw reply

* Re: e1000 full-duplex TCP performance well below wire speed
From: Bruce Allen @ 2008-01-31 19:48 UTC (permalink / raw)
  To: Kok, Auke
  Cc: Brandeburg, Jesse, netdev, Carsten Aulbert, Henning Fehrmann,
	Bruce Allen
In-Reply-To: <47A22241.70600@intel.com>

Hi Auke,

>> Based on the discussion in this thread, I am inclined to believe that
>> lack of PCI-e bus bandwidth is NOT the issue.  The theory is that the
>> extra packet handling associated with TCP acknowledgements is pushing
>> the PCI-e x1 bus past its limits.  However the evidence seems to show
>> otherwise:
>>
>> (1) Bill Fink has reported the same problem on a NIC with a 133 MHz
>> 64-bit PCI connection.  That connection can transfer data at 8Gb/s.
>
> That was even a PCI-X connection, which is known to have extremely good latency
> numbers, IIRC better than PCI-e (?), which could account for a lot of the
> latency-induced lower performance...
>
> Also, 82573's are _not_ a server part and were not designed for this
> usage. 82546's are, and that really does make a difference.

I'm confused.  It DOESN'T make a difference! Using 'server grade' 82546's 
on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP 
full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1 
bus.

Just like us, when Bill goes from TCP to UDP, he gets wire speed back.

Cheers,
 	Bruce

^ permalink raw reply

* [PATCH] Add addrlabel subsystem.
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2008-01-31 19:56 UTC (permalink / raw)
  To: shemminger; +Cc: yoshfuji, netdev

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/linux/if_addrlabel.h |   32 +++++
 ip/Makefile                  |    2 +-
 ip/ip.c                      |    5 +-
 ip/ip_common.h               |    4 +
 ip/ipaddrlabel.c             |  260 ++++++++++++++++++++++++++++++++++++++++++
 ip/ipmonitor.c               |    4 +
 6 files changed, 304 insertions(+), 3 deletions(-)

diff --git a/include/linux/if_addrlabel.h b/include/linux/if_addrlabel.h
new file mode 100644
index 0000000..9fe79c9
--- /dev/null
+++ b/include/linux/if_addrlabel.h
@@ -0,0 +1,32 @@
+/*
+ * if_addrlabel.h - netlink interface for address labels
+ *
+ * Copyright (C)2007 USAGI/WIDE Project,  All Rights Reserved.
+ *
+ * Authors:
+ *	YOSHIFUJI Hideaki @ USAGI/WIDE <yoshfuji@linux-ipv6.org>
+ */
+
+#ifndef __LINUX_IF_ADDRLABEL_H
+#define __LINUX_IF_ADDRLABEL_H
+
+struct ifaddrlblmsg
+{
+	__u8		ifal_family;		/* Address family */
+	__u8		__ifal_reserved;	/* Reserved */
+	__u8		ifal_prefixlen;		/* Prefix length */
+	__u8		ifal_flags;		/* Flags */
+	__u32		ifal_index;		/* Link index */
+	__u32		ifal_seq;		/* sequence number */
+};
+
+enum
+{
+	IFAL_ADDRESS = 1,
+	IFAL_LABEL = 2,
+	__IFAL_MAX
+};
+
+#define IFAL_MAX	(__IFAL_MAX - 1)
+
+#endif
diff --git a/ip/Makefile b/ip/Makefile
index b427d58..d908817 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -1,4 +1,4 @@
-IPOBJ=ip.o ipaddress.o iproute.o iprule.o \
+IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
     rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
diff --git a/ip/ip.c b/ip/ip.c
index aeb8c68..c4c773f 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -46,8 +46,8 @@ static void usage(void)
 	fprintf(stderr,
 "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n"
 "       ip [ -force ] [-batch filename\n"
-"where  OBJECT := { link | addr | route | rule | neigh | ntable | tunnel |\n"
-"                   maddr | mroute | monitor | xfrm }\n"
+"where  OBJECT := { link | addr | addrlabel | route | rule | neigh | ntable |\n"
+"                   tunnel | maddr | mroute | monitor | xfrm }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | link } |\n"
 "                    -o[neline] | -t[imestamp] }\n");
@@ -64,6 +64,7 @@ static const struct cmd {
 	int (*func)(int argc, char **argv);
 } cmds[] = {
 	{ "address", 	do_ipaddr },
+	{ "addrlabel",	do_ipaddrlabel },
 	{ "maddress",	do_multiaddr },
 	{ "route",	do_iproute },
 	{ "rule",	do_iprule },
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 39f2507..1bbd50d 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -4,6 +4,9 @@ extern int print_linkinfo(const struct sockaddr_nl *who,
 extern int print_addrinfo(const struct sockaddr_nl *who,
 			  struct nlmsghdr *n,
 			  void *arg);
+extern int print_addrlabel(const struct sockaddr_nl *who,
+			   struct nlmsghdr *n,
+			   void *arg);
 extern int print_neigh(const struct sockaddr_nl *who,
 		       struct nlmsghdr *n, void *arg);
 extern int print_ntable(const struct sockaddr_nl *who,
@@ -23,6 +26,7 @@ extern int print_prefix(const struct sockaddr_nl *who,
 extern int print_rule(const struct sockaddr_nl *who,
 		      struct nlmsghdr *n, void *arg);
 extern int do_ipaddr(int argc, char **argv);
+extern int do_ipaddrlabel(int argc, char **argv);
 extern int do_iproute(int argc, char **argv);
 extern int do_iprule(int argc, char **argv);
 extern int do_ipneigh(int argc, char **argv);
diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
new file mode 100644
index 0000000..1c873e9
--- /dev/null
+++ b/ip/ipaddrlabel.c
@@ -0,0 +1,260 @@
+/*
+ * ipaddrlabel.c	"ip addrlabel"
+ *
+ * Copyright (C)2007 USAGI/WIDE Project
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ * Based on iprule.c.
+ *
+ * Authors:	YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <linux/types.h>
+#include <linux/if_addrlabel.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+#define IFAL_RTA(r)	((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifaddrlblmsg))))
+#define IFAL_PAYLOAD(n)	NLMSG_PAYLOAD(n,sizeof(struct ifaddrlblmsg))
+
+extern struct rtnl_handle rth;
+
+static void usage(void) __attribute__((noreturn));
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip addrlabel [ list | add | del | flush ] prefix PREFIX [ dev DEV ] [ label LABEL ]\n");
+	exit(-1);
+}
+
+int print_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	FILE *fp = (FILE*)arg;
+	struct ifaddrlblmsg *ifal = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	int host_len = -1;
+	struct rtattr *tb[IFAL_MAX+1];
+	char abuf[256];
+
+	if (n->nlmsg_type != RTM_NEWADDRLABEL && n->nlmsg_type != RTM_DELADDRLABEL)
+		return 0;
+
+	len -= NLMSG_LENGTH(sizeof(*ifal));
+	if (len < 0)
+		return -1;
+
+	parse_rtattr(tb, IFAL_MAX, IFAL_RTA(ifal), len);
+
+	if (ifal->ifal_family == AF_INET)
+		host_len = 32;
+	else if (ifal->ifal_family == AF_INET6)
+		host_len = 128;
+
+	if (n->nlmsg_type == RTM_DELADDRLABEL)
+		fprintf(fp, "Deleted ");
+
+	if (tb[IFAL_ADDRESS]) {
+		fprintf(fp, "prefix %s/%u ",
+			format_host(ifal->ifal_family,
+				    RTA_PAYLOAD(tb[IFAL_ADDRESS]),
+				    RTA_DATA(tb[IFAL_ADDRESS]),
+				    abuf, sizeof(abuf)),
+			ifal->ifal_prefixlen);
+	}
+
+	if (ifal->ifal_index)
+		fprintf(fp, "dev %s ", ll_index_to_name(ifal->ifal_index));
+
+	if (tb[IFAL_LABEL] && RTA_PAYLOAD(tb[IFAL_LABEL]) == sizeof(int32_t)) {
+		int32_t label;
+		memcpy(&label, RTA_DATA(tb[IFAL_LABEL]), sizeof(label));
+		fprintf(fp, "label %d ", label);
+	}
+
+	fprintf(fp, "\n");
+	fflush(fp);
+	return 0;
+}
+
+static int ipaddrlabel_list(int argc, char **argv)
+{
+	int af = preferred_family;
+
+	if (af == AF_UNSPEC)
+		af = AF_INET6;
+
+	if (argc > 0) {
+		fprintf(stderr, "\"ip addrlabel show\" does not take any arguments.\n");
+		return -1;
+	}
+
+	if (rtnl_wilddump_request(&rth, af, RTM_GETADDRLABEL) < 0) {
+		perror("Cannot send dump request");
+		return 1;
+	}
+
+	if (rtnl_dump_filter(&rth, print_addrlabel, stdout, NULL, NULL) < 0) {
+		fprintf(stderr, "Dump terminated\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+
+static int ipaddrlabel_modify(int cmd, int argc, char **argv)
+{
+	struct {
+		struct nlmsghdr 	n;
+		struct ifaddrlblmsg	ifal;
+		char   			buf[1024];
+	} req;
+
+	inet_prefix prefix;
+	uint32_t label = 0xffffffffUL;
+
+	memset(&req, 0, sizeof(req));
+	memset(&prefix, 0, sizeof(prefix));
+
+	req.n.nlmsg_type = cmd;
+	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrlblmsg));
+	req.n.nlmsg_flags = NLM_F_REQUEST;
+	req.ifal.ifal_family = preferred_family;
+	req.ifal.ifal_prefixlen = 0;
+	req.ifal.ifal_index = 0;
+
+	if (cmd == RTM_NEWADDRLABEL) {
+		req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
+	}
+
+	while (argc > 0) {
+		if (strcmp(*argv, "prefix") == 0) {
+			NEXT_ARG();
+			get_prefix(&prefix, *argv, preferred_family);
+		} else if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+			if ((req.ifal.ifal_index = ll_name_to_index(*argv)) == 0)
+				invarg("dev is invalid\n", *argv);
+		} else if (strcmp(*argv, "label") == 0) {
+			NEXT_ARG();
+			if (get_u32(&label, *argv, 0) || label == 0xffffffffUL)
+				invarg("label is invalid\n", *argv);
+		}
+		argc--;
+		argv++;
+	}
+
+	addattr32(&req.n, sizeof(req), IFAL_LABEL, label);
+	addattr_l(&req.n, sizeof(req), IFAL_ADDRESS, &prefix.data, prefix.bytelen);
+
+	if (req.ifal.ifal_family == AF_UNSPEC)
+		req.ifal.ifal_family = AF_INET6;
+
+	if (rtnl_talk(&rth, &req.n, 0, 0, NULL, NULL, NULL) < 0)
+		return 2;
+
+	return 0;
+}
+
+
+static int flush_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	struct rtnl_handle rth2;
+	struct ifaddrlblmsg *ifal = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	struct rtattr * tb[IFAL_MAX+1];
+
+	len -= NLMSG_LENGTH(sizeof(*ifal));
+	if (len < 0)
+		return -1;
+
+	parse_rtattr(tb, IFAL_MAX, IFAL_RTA(ifal), len);
+
+	if (tb[IFAL_ADDRESS]) {
+		n->nlmsg_type = RTM_DELADDRLABEL;
+		n->nlmsg_flags = NLM_F_REQUEST;
+
+		if (rtnl_open(&rth2, 0) < 0)
+			return -1;
+
+		if (rtnl_talk(&rth2, n, 0, 0, NULL, NULL, NULL) < 0)
+			return -2;
+
+		rtnl_close(&rth2);
+	}
+
+	return 0;
+}
+
+static int ipaddrlabel_flush(int argc, char **argv)
+{
+	int af = preferred_family;
+
+	if (af == AF_UNSPEC)
+		af = AF_INET6;
+
+	if (argc > 0) {
+		fprintf(stderr, "\"ip addrlabel flush\" does not allow extra arguments\n");
+		return -1;
+	}
+
+	if (rtnl_wilddump_request(&rth, af, RTM_GETADDRLABEL) < 0) {
+		perror("Cannot send dump request");
+		return 1;
+	}
+
+	if (rtnl_dump_filter(&rth, flush_addrlabel, NULL, NULL, NULL) < 0) {
+		fprintf(stderr, "Flush terminated\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+int do_ipaddrlabel(int argc, char **argv)
+{
+	if (argc < 1) {
+		return ipaddrlabel_list(0, NULL);
+	} else if (matches(argv[0], "list") == 0 ||
+		   matches(argv[0], "show") == 0) {
+		return ipaddrlabel_list(argc-1, argv+1);
+	} else if (matches(argv[0], "add") == 0) {
+		return ipaddrlabel_modify(RTM_NEWADDRLABEL, argc-1, argv+1);
+	} else if (matches(argv[0], "delete") == 0) {
+		return ipaddrlabel_modify(RTM_DELADDRLABEL, argc-1, argv+1);
+	} else if (matches(argv[0], "flush") == 0) {
+		return ipaddrlabel_flush(argc-1, argv+1);
+	} else if (matches(argv[0], "help") == 0)
+		usage();
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip addrlabel help\".\n", *argv);
+	exit(-1);
+}
+
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index f1a1f27..df0fd91 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -54,6 +54,10 @@ int accept_msg(const struct sockaddr_nl *who,
 		print_addrinfo(who, n, arg);
 		return 0;
 	}
+	if (n->nlmsg_type == RTM_NEWADDRLABEL || n->nlmsg_type == RTM_DELADDRLABEL) {
+		print_addrlabel(who, n, arg);
+		return 0;
+	}
 	if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH) {
 		print_neigh(who, n, arg);
 		return 0;
-- 
1.4.4.4


^ permalink raw reply related

* [PATCH] IPROUTE2: Add addrlabel subsystem.
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2008-01-31 19:57 UTC (permalink / raw)
  To: shemminger; +Cc: yoshfuji, netdev

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
---
 include/linux/if_addrlabel.h |   32 +++++
 ip/Makefile                  |    2 +-
 ip/ip.c                      |    5 +-
 ip/ip_common.h               |    4 +
 ip/ipaddrlabel.c             |  260 ++++++++++++++++++++++++++++++++++++++++++
 ip/ipmonitor.c               |    4 +
 6 files changed, 304 insertions(+), 3 deletions(-)

diff --git a/include/linux/if_addrlabel.h b/include/linux/if_addrlabel.h
new file mode 100644
index 0000000..9fe79c9
--- /dev/null
+++ b/include/linux/if_addrlabel.h
@@ -0,0 +1,32 @@
+/*
+ * if_addrlabel.h - netlink interface for address labels
+ *
+ * Copyright (C)2007 USAGI/WIDE Project,  All Rights Reserved.
+ *
+ * Authors:
+ *	YOSHIFUJI Hideaki @ USAGI/WIDE <yoshfuji@linux-ipv6.org>
+ */
+
+#ifndef __LINUX_IF_ADDRLABEL_H
+#define __LINUX_IF_ADDRLABEL_H
+
+struct ifaddrlblmsg
+{
+	__u8		ifal_family;		/* Address family */
+	__u8		__ifal_reserved;	/* Reserved */
+	__u8		ifal_prefixlen;		/* Prefix length */
+	__u8		ifal_flags;		/* Flags */
+	__u32		ifal_index;		/* Link index */
+	__u32		ifal_seq;		/* sequence number */
+};
+
+enum
+{
+	IFAL_ADDRESS = 1,
+	IFAL_LABEL = 2,
+	__IFAL_MAX
+};
+
+#define IFAL_MAX	(__IFAL_MAX - 1)
+
+#endif
diff --git a/ip/Makefile b/ip/Makefile
index b427d58..d908817 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -1,4 +1,4 @@
-IPOBJ=ip.o ipaddress.o iproute.o iprule.o \
+IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
     rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
diff --git a/ip/ip.c b/ip/ip.c
index aeb8c68..c4c773f 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -46,8 +46,8 @@ static void usage(void)
 	fprintf(stderr,
 "Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n"
 "       ip [ -force ] [-batch filename\n"
-"where  OBJECT := { link | addr | route | rule | neigh | ntable | tunnel |\n"
-"                   maddr | mroute | monitor | xfrm }\n"
+"where  OBJECT := { link | addr | addrlabel | route | rule | neigh | ntable |\n"
+"                   tunnel | maddr | mroute | monitor | xfrm }\n"
 "       OPTIONS := { -V[ersion] | -s[tatistics] | -d[etails] | -r[esolve] |\n"
 "                    -f[amily] { inet | inet6 | ipx | dnet | link } |\n"
 "                    -o[neline] | -t[imestamp] }\n");
@@ -64,6 +64,7 @@ static const struct cmd {
 	int (*func)(int argc, char **argv);
 } cmds[] = {
 	{ "address", 	do_ipaddr },
+	{ "addrlabel",	do_ipaddrlabel },
 	{ "maddress",	do_multiaddr },
 	{ "route",	do_iproute },
 	{ "rule",	do_iprule },
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 39f2507..1bbd50d 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -4,6 +4,9 @@ extern int print_linkinfo(const struct sockaddr_nl *who,
 extern int print_addrinfo(const struct sockaddr_nl *who,
 			  struct nlmsghdr *n,
 			  void *arg);
+extern int print_addrlabel(const struct sockaddr_nl *who,
+			   struct nlmsghdr *n,
+			   void *arg);
 extern int print_neigh(const struct sockaddr_nl *who,
 		       struct nlmsghdr *n, void *arg);
 extern int print_ntable(const struct sockaddr_nl *who,
@@ -23,6 +26,7 @@ extern int print_prefix(const struct sockaddr_nl *who,
 extern int print_rule(const struct sockaddr_nl *who,
 		      struct nlmsghdr *n, void *arg);
 extern int do_ipaddr(int argc, char **argv);
+extern int do_ipaddrlabel(int argc, char **argv);
 extern int do_iproute(int argc, char **argv);
 extern int do_iprule(int argc, char **argv);
 extern int do_ipneigh(int argc, char **argv);
diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
new file mode 100644
index 0000000..1c873e9
--- /dev/null
+++ b/ip/ipaddrlabel.c
@@ -0,0 +1,260 @@
+/*
+ * ipaddrlabel.c	"ip addrlabel"
+ *
+ * Copyright (C)2007 USAGI/WIDE Project
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ *
+ * Based on iprule.c.
+ *
+ * Authors:	YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <linux/types.h>
+#include <linux/if_addrlabel.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+#define IFAL_RTA(r)	((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct ifaddrlblmsg))))
+#define IFAL_PAYLOAD(n)	NLMSG_PAYLOAD(n,sizeof(struct ifaddrlblmsg))
+
+extern struct rtnl_handle rth;
+
+static void usage(void) __attribute__((noreturn));
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: ip addrlabel [ list | add | del | flush ] prefix PREFIX [ dev DEV ] [ label LABEL ]\n");
+	exit(-1);
+}
+
+int print_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	FILE *fp = (FILE*)arg;
+	struct ifaddrlblmsg *ifal = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	int host_len = -1;
+	struct rtattr *tb[IFAL_MAX+1];
+	char abuf[256];
+
+	if (n->nlmsg_type != RTM_NEWADDRLABEL && n->nlmsg_type != RTM_DELADDRLABEL)
+		return 0;
+
+	len -= NLMSG_LENGTH(sizeof(*ifal));
+	if (len < 0)
+		return -1;
+
+	parse_rtattr(tb, IFAL_MAX, IFAL_RTA(ifal), len);
+
+	if (ifal->ifal_family == AF_INET)
+		host_len = 32;
+	else if (ifal->ifal_family == AF_INET6)
+		host_len = 128;
+
+	if (n->nlmsg_type == RTM_DELADDRLABEL)
+		fprintf(fp, "Deleted ");
+
+	if (tb[IFAL_ADDRESS]) {
+		fprintf(fp, "prefix %s/%u ",
+			format_host(ifal->ifal_family,
+				    RTA_PAYLOAD(tb[IFAL_ADDRESS]),
+				    RTA_DATA(tb[IFAL_ADDRESS]),
+				    abuf, sizeof(abuf)),
+			ifal->ifal_prefixlen);
+	}
+
+	if (ifal->ifal_index)
+		fprintf(fp, "dev %s ", ll_index_to_name(ifal->ifal_index));
+
+	if (tb[IFAL_LABEL] && RTA_PAYLOAD(tb[IFAL_LABEL]) == sizeof(int32_t)) {
+		int32_t label;
+		memcpy(&label, RTA_DATA(tb[IFAL_LABEL]), sizeof(label));
+		fprintf(fp, "label %d ", label);
+	}
+
+	fprintf(fp, "\n");
+	fflush(fp);
+	return 0;
+}
+
+static int ipaddrlabel_list(int argc, char **argv)
+{
+	int af = preferred_family;
+
+	if (af == AF_UNSPEC)
+		af = AF_INET6;
+
+	if (argc > 0) {
+		fprintf(stderr, "\"ip addrlabel show\" does not take any arguments.\n");
+		return -1;
+	}
+
+	if (rtnl_wilddump_request(&rth, af, RTM_GETADDRLABEL) < 0) {
+		perror("Cannot send dump request");
+		return 1;
+	}
+
+	if (rtnl_dump_filter(&rth, print_addrlabel, stdout, NULL, NULL) < 0) {
+		fprintf(stderr, "Dump terminated\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+
+static int ipaddrlabel_modify(int cmd, int argc, char **argv)
+{
+	struct {
+		struct nlmsghdr 	n;
+		struct ifaddrlblmsg	ifal;
+		char   			buf[1024];
+	} req;
+
+	inet_prefix prefix;
+	uint32_t label = 0xffffffffUL;
+
+	memset(&req, 0, sizeof(req));
+	memset(&prefix, 0, sizeof(prefix));
+
+	req.n.nlmsg_type = cmd;
+	req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifaddrlblmsg));
+	req.n.nlmsg_flags = NLM_F_REQUEST;
+	req.ifal.ifal_family = preferred_family;
+	req.ifal.ifal_prefixlen = 0;
+	req.ifal.ifal_index = 0;
+
+	if (cmd == RTM_NEWADDRLABEL) {
+		req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
+	}
+
+	while (argc > 0) {
+		if (strcmp(*argv, "prefix") == 0) {
+			NEXT_ARG();
+			get_prefix(&prefix, *argv, preferred_family);
+		} else if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+			if ((req.ifal.ifal_index = ll_name_to_index(*argv)) == 0)
+				invarg("dev is invalid\n", *argv);
+		} else if (strcmp(*argv, "label") == 0) {
+			NEXT_ARG();
+			if (get_u32(&label, *argv, 0) || label == 0xffffffffUL)
+				invarg("label is invalid\n", *argv);
+		}
+		argc--;
+		argv++;
+	}
+
+	addattr32(&req.n, sizeof(req), IFAL_LABEL, label);
+	addattr_l(&req.n, sizeof(req), IFAL_ADDRESS, &prefix.data, prefix.bytelen);
+
+	if (req.ifal.ifal_family == AF_UNSPEC)
+		req.ifal.ifal_family = AF_INET6;
+
+	if (rtnl_talk(&rth, &req.n, 0, 0, NULL, NULL, NULL) < 0)
+		return 2;
+
+	return 0;
+}
+
+
+static int flush_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	struct rtnl_handle rth2;
+	struct ifaddrlblmsg *ifal = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	struct rtattr * tb[IFAL_MAX+1];
+
+	len -= NLMSG_LENGTH(sizeof(*ifal));
+	if (len < 0)
+		return -1;
+
+	parse_rtattr(tb, IFAL_MAX, IFAL_RTA(ifal), len);
+
+	if (tb[IFAL_ADDRESS]) {
+		n->nlmsg_type = RTM_DELADDRLABEL;
+		n->nlmsg_flags = NLM_F_REQUEST;
+
+		if (rtnl_open(&rth2, 0) < 0)
+			return -1;
+
+		if (rtnl_talk(&rth2, n, 0, 0, NULL, NULL, NULL) < 0)
+			return -2;
+
+		rtnl_close(&rth2);
+	}
+
+	return 0;
+}
+
+static int ipaddrlabel_flush(int argc, char **argv)
+{
+	int af = preferred_family;
+
+	if (af == AF_UNSPEC)
+		af = AF_INET6;
+
+	if (argc > 0) {
+		fprintf(stderr, "\"ip addrlabel flush\" does not allow extra arguments\n");
+		return -1;
+	}
+
+	if (rtnl_wilddump_request(&rth, af, RTM_GETADDRLABEL) < 0) {
+		perror("Cannot send dump request");
+		return 1;
+	}
+
+	if (rtnl_dump_filter(&rth, flush_addrlabel, NULL, NULL, NULL) < 0) {
+		fprintf(stderr, "Flush terminated\n");
+		return 1;
+	}
+
+	return 0;
+}
+
+int do_ipaddrlabel(int argc, char **argv)
+{
+	if (argc < 1) {
+		return ipaddrlabel_list(0, NULL);
+	} else if (matches(argv[0], "list") == 0 ||
+		   matches(argv[0], "show") == 0) {
+		return ipaddrlabel_list(argc-1, argv+1);
+	} else if (matches(argv[0], "add") == 0) {
+		return ipaddrlabel_modify(RTM_NEWADDRLABEL, argc-1, argv+1);
+	} else if (matches(argv[0], "delete") == 0) {
+		return ipaddrlabel_modify(RTM_DELADDRLABEL, argc-1, argv+1);
+	} else if (matches(argv[0], "flush") == 0) {
+		return ipaddrlabel_flush(argc-1, argv+1);
+	} else if (matches(argv[0], "help") == 0)
+		usage();
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"ip addrlabel help\".\n", *argv);
+	exit(-1);
+}
+
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index f1a1f27..df0fd91 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -54,6 +54,10 @@ int accept_msg(const struct sockaddr_nl *who,
 		print_addrinfo(who, n, arg);
 		return 0;
 	}
+	if (n->nlmsg_type == RTM_NEWADDRLABEL || n->nlmsg_type == RTM_DELADDRLABEL) {
+		print_addrlabel(who, n, arg);
+		return 0;
+	}
 	if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH) {
 		print_neigh(who, n, arg);
 		return 0;
-- 
1.4.4.4


^ permalink raw reply related

* Re: [PATCH 6/6] [TCP]: Reorganize struct tcp_sock to save 16 bytes on 64-bit arch
From: Eric Dumazet @ 2008-01-31 19:57 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: David S. Miller, netdev, dccp
In-Reply-To: <1201804304-28777-7-git-send-email-acme@redhat.com>

Arnaldo Carvalho de Melo wrote:
> /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
>   struct tcp_sock  |  -16
>   struct tcp6_sock |  -16
>  2 structs changed
> 
> Now it is at:
> 
> /* size: 1552, cachelines: 25 */
> /* paddings: 2, sum paddings: 8 */
> /* last cacheline: 16 bytes */
> 
> As soon as we stop using skb_queue_list we'll get it down to 24 cachelines.
> 
> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  include/linux/tcp.h |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 08027f1..f48644d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -304,7 +304,6 @@ struct tcp_sock {
>  	u32	rtt_seq;	/* sequence number to update rttvar	*/
>  
>  	u32	packets_out;	/* Packets which are "in flight"	*/
> -	u32	retrans_out;	/* Retransmitted packets out		*/
>  /*
>   *      Options received (usually on last packet, some only on SYN packets).
>   */
> @@ -332,6 +331,8 @@ struct tcp_sock {
>  
>  	struct tcp_sack_block recv_sack_cache[4];
>  
> +	u32	retrans_out;	/* Retransmitted packets out		*/
> +

Hum... retrans_out should sit close to packets_out (or lost_out/sacked_out 
???), please.

'struct tcp_sock' is very large on 64 bits, so I would prefer to make sure 
most paths don't need to touch all 24 cache lines (or 25 cache lines).
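
To illustrate the trade-off with a toy example (a hypothetical struct, not
the real tcp_sock layout, and assuming 64-byte cache lines): moving a u32
to the end can close a padding hole and shrink the struct, but it can also
move the field onto a different cache line from the counter it is normally
used with.  The struct and helper names below are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64	/* assumption: 64-byte cache lines */

/* Hot counters adjacent; on typical 64-bit ABIs there is a 4-byte
 * hole before 'cold' and 4 bytes of tail padding. */
struct demo_before {
	uint32_t packets_out;
	uint32_t retrans_out;
	uint32_t flags;
	uint64_t cold[8];	/* stand-in for rarely touched state */
	uint32_t tail;
};

/* retrans_out moved next to 'tail': no holes left, but it now lives
 * far away from packets_out. */
struct demo_after {
	uint32_t packets_out;
	uint32_t flags;
	uint64_t cold[8];
	uint32_t tail;
	uint32_t retrans_out;
};

/* Index of the cache line that a given struct offset falls into. */
size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_BYTES;
}

/* Non-zero if both offsets fall into the same cache line. */
int same_cacheline(size_t off_a, size_t off_b)
{
	return cacheline_of(off_a) == cacheline_of(off_b);
}
```

On x86-64, sizeof shrinks from 88 to 80 bytes, but retrans_out moves from
offset 4 to offset 76, i.e. from the cache line holding packets_out to the
next one.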


^ permalink raw reply

* Re: [PATCH] Add addrlabel subsystem.
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2008-01-31 19:59 UTC (permalink / raw)
  To: shemminger; +Cc: yoshfuji, netdev
In-Reply-To: <20080201.065610.16428092.yoshfuji@linux-ipv6.org>

In article <20080201.065610.16428092.yoshfuji@linux-ipv6.org> (at Fri, 01 Feb 2008 06:56:10 +1100 (EST)), YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> says:

> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
> ---
>  include/linux/if_addrlabel.h |   32 +++++
>  ip/Makefile                  |    2 +-
>  ip/ip.c                      |    5 +-
>  ip/ip_common.h               |    4 +
>  ip/ipaddrlabel.c             |  260 ++++++++++++++++++++++++++++++++++++++++++
>  ip/ipmonitor.c               |    4 +
>  6 files changed, 304 insertions(+), 3 deletions(-)

Sorry, "iproute2" was missing in the subject...resent.

--yoshfuji

^ permalink raw reply

* Re: [PATCH 6/6] [TCP]: Reorganize struct tcp_sock to save 16 bytes on 64-bit arch
From: Arnaldo Carvalho de Melo @ 2008-01-31 20:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Arnaldo Carvalho de Melo, David S. Miller, netdev, dccp
In-Reply-To: <47A22841.6090508@cosmosbay.com>

On Thu, Jan 31, 2008 at 08:57:53PM +0100, Eric Dumazet wrote:
> Arnaldo Carvalho de Melo wrote:
>> /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
>>   struct tcp_sock  |  -16
>>   struct tcp6_sock |  -16
>>  2 structs changed
>>
>> Now it is at:
>>
>> /* size: 1552, cachelines: 25 */
>> /* paddings: 2, sum paddings: 8 */
>> /* last cacheline: 16 bytes */
>>
>> As soon as we stop using skb_queue_list we'll get it down to 24 cachelines.
>>
>> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
>> ---
>>  include/linux/tcp.h |    6 ++++--
>>  1 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index 08027f1..f48644d 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -304,7 +304,6 @@ struct tcp_sock {
>>  	u32	rtt_seq;	/* sequence number to update rttvar	*/
>>  
>>  	u32	packets_out;	/* Packets which are "in flight"	*/
>> -	u32	retrans_out;	/* Retransmitted packets out		*/
>>  /*
>>   *      Options received (usually on last packet, some only on SYN packets).
>>   */
>> @@ -332,6 +331,8 @@ struct tcp_sock {
>>  
>>  	struct tcp_sack_block recv_sack_cache[4];
>>  
>> +	u32	retrans_out;	/* Retransmitted packets out		*/
>> +
>
> Hum... retrans_out should sit close to packets_out (or lost_out/sacked_out 
> ???), please.
>
> 'struct tcp_sock' is very large on 64 bits, so I would prefer to make sure 
> most paths don't need to touch all 24 cache lines (or 25 cache lines).

That is perfectly fine; I'll replace my patch with another one that states
this beyond doubt.

- Arnaldo

^ permalink raw reply

* Re: [PATCH] Disable TSO for non standard qdiscs
From: Jarek Poplawski @ 2008-01-31 20:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Waskiewicz Jr, Peter P, Patrick McHardy, Stephen Hemminger,
	netdev
In-Reply-To: <20080131193406.GH4671@one.firstfloor.org>

Andi Kleen wrote, On 01/31/2008 08:34 PM:

>> TSO by nature is bursty.  But disabling TSO without the option of having
>> it on or off to me seems too aggressive.  If someone is using a qdisc
>> and TSO is interfering with the effectiveness of the traffic shaping,
>> then they should turn off TSO via ethtool on the target device.  Some
> 
> The philosophical problem I have with this suggestion is that I expect
> that the large majority of users will be more happy with disabled TSO
> if they use non standard qdiscs and defaults that do not fit 
> the majority use case are bad.

If you mean the large majority of the large minority of users, who use
non standard qdiscs - I agree - this is really the philosophical problem!

> Basically you're suggesting that nearly everyone using tc should learn about
> another obscure command.

...So, it sounds like tc is used by nearly everyone now...

It seems my distro really isn't up to date:

"Package: iproute
 ...
 Description: Professional tools to control the networking in Linux kernels
 This is `iproute', the professional set of tools to control the
 networking behavior in kernels 2.2.x and later."

And ethtool doesn't have to be learnt at all: "most friendly distros"
could use this in config or add some graphical wrapper.

Regards,
Jarek P.

^ permalink raw reply
