Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2] [02/10] pasemi_mac: Stop using the pci config space accessors for register read/writes
From: Jeff Garzik @ 2007-08-31 13:54 UTC (permalink / raw)
  To: Olof Johansson; +Cc: netdev, linuxppc-dev
In-Reply-To: <20070823181310.GB31882@lixom.net>

Olof Johansson wrote:
> Move away from using the pci config access functions for simple register
> access.  Our device has all of the registers in the config space (hey,
> from the hardware point of view it looks reasonable :-), so we need to
> somehow get to it. Newer firmwares have it in the device tree such that
> we can just get it and ioremap it there (in case it ever moves in future
> products). For now, provide a hardcoded fallback for older firmwares.
> 
> 
> Signed-off-by: Olof Johansson <olof@lixom.net>

applied 2-10



^ permalink raw reply

* Re: [PATCH 07/12] sky2: use net_device internal stats
From: Stephen Hemminger @ 2007-08-31 13:59 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev, akpm
In-Reply-To: <46D81B09.6080107@pobox.com>

On Fri, 31 Aug 2007 09:43:37 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> NAK -- grep around net/core, you want ->get_stats()
> 
> 
> 

Unneeded if device leaves get_stats as NULL, then register_netdevice
sets it to internal get stats (in net-2.6.24)

^ permalink raw reply

* Re: [PATCH 1/2 v2] PS3: changed the way to handle tx skbs
From: Jeff Garzik @ 2007-08-31 13:59 UTC (permalink / raw)
  To: Masakazu Mokuno
  Cc: Linas Vepstas, netdev, cbe-oss-dev, geoffrey.levand,
	Geert Uytterhoeven
In-Reply-To: <20070831221800.C323.MOKUNO@sm.sony.co.jp>

Masakazu Mokuno wrote:
> The PS3 virtual network device requires a vlan tag in the sending packet
> to select the destination device, ethernet port or wireless. 
> As the vlan tag field is in the middle of the passed data, 
> we should insert it into the packet data.
> To avoid copying much of the packet data, the driver used two tx descriptors
> for one tx skb; one descriptor was for sending a small static 
> buffer which contained vlan tag and copied header (two mac addresses),
> one was for the residual data after the vlan field.
> 
> This patch changes the way to insert the vlan tag.  By changing 
> netdev->hard_header_len, we can make the headroom for moving mac address
> fields in the skb buffer. Then we can send one tx skb with
> one tx descriptor.  This also gives us a tx throughut gain of approx.
> 20% according to netperf results.
> 
> Signed-off-by: Masakazu Mokuno <mokuno@sm.sony.co.jp>
> CC: Geoff Levand <geoffrey.levand@am.sony.com>
> ---
>  drivers/net/ps3_gelic_net.c |  140 +++++++++++++++++++-------------------------
>  1 file changed, 63 insertions(+), 77 deletions(-)

applied 1-2, thanks for the update



^ permalink raw reply

* Re: [0/7] [PPP]: Fix shared/cloned/non-linear skb bugs (was: malformed captured packets)
From: Toralf Förster @ 2007-08-31 14:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: James Chapman, David S. Miller, Michal Ostrowski, Paul Mackerras,
	netdev
In-Reply-To: <20070831090625.GA28175@gondor.apana.org.au>

[-- Attachment #1: Type: text/plain, Size: 1362 bytes --]

Am Freitag, 31. August 2007 schrieb Herbert Xu:
> On Thu, Aug 30, 2007 at 09:51:31AM +0000, James Chapman wrote:
> >
> > The captured PPPoE stream seems to show incorrect data lengths in the
> > PPPoE header for some captured PPPoE packets. The kernel's PPPoE
> > datapath uses this length to extract the PPP frame and send it through
> > to the ppp interface. Since your ppp stream is fine, the actual PPPoE
> > header contents must be correct when it is parsed by the kernel PPPoE
> > code. It seems more likely that this is a wireshark bug to me.
> 
> If he were using the kernel pppoe driver, then this is because
> PPP filtering is writing over a cloned skb without copying it.
> 
> In fact, there seems to be quite a few bugs of this kind in
> the various ppp*.c files.
> 
> Please try the following patches to see if they make a
> difference.
> 
> I've audited ppp_generic.c and pppoe.c.  I'll do pppol2tp
> tomorrow.
> 
> Cheers,


Herbert,

your patches - applied against 2.6.23-rc4-g2d8348b4 - works like a charm :-)

Among many other places at least
http://bugzilla.kernel.org/show_bug.cgi?id=8409
but probably also
http://bugzilla.kernel.org/show_bug.cgi?id=7938 are solved by your 7 patches.

Many thanks

-- 
MfG/Sincerely

Toralf Förster
pgp finger print: 7B1A 07F4 EC82 0F90 D4C2 8936 872A E508 7DB6 9DA3

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: net-2.6.24 rebased
From: John W. Linville @ 2007-08-31 13:44 UTC (permalink / raw)
  To: David Miller; +Cc: johannes, joe, netdev
In-Reply-To: <20070830.222129.102575067.davem@davemloft.net>

On Thu, Aug 30, 2007 at 10:21:29PM -0700, David Miller wrote:
> From: Johannes Berg <johannes@sipsolutions.net>
> Date: Thu, 30 Aug 2007 13:58:12 +0200
> 
> > Huh? I'm fairly sure I sent a patch to remove it from that driver, no
> > idea where it got lost. FWIW, you can simply delete the "|
> > IEEE80211_HW_DATA_NULLFUNC_ACK" part.
> 
> It might have been a mis-merge between John and myself.
> A few times because I had rebased I had to take his
> directory of patches and one of us might have made a
> mistake somewhere along the line.
> 
> In any event, I've removed the offending line from
> the driver.

Sorry, I sent the patch through Jeff.

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-jgarzik/0010-rtl8187-remove-IEEE80211_HW_DATA_NULLFUNC_ACK.patch

But, it sounds like you have taken care of it...?

Let me know if I need to fix this.

John
-- 
John W. Linville
linville@tuxdriver.com

^ permalink raw reply

* Re: Tc bug (kernel crash) more info
From: Badalian Vyacheslav @ 2007-08-31 14:31 UTC (permalink / raw)
  To: Jarek Poplawski, netdev
In-Reply-To: <20070831125917.GC3108@ff.dom.local>

Ok =) I hope in next week you found bug place and fix it!

PS. if you ask where i can read "kernel panic dump logic" literature and 
try find bugline in code.
I read dump and see that bug in function "rb_insert_color" + some shift 
(in asm?) that called from htb_dequeue? But in htb_dequeue not have 
calling rb_insert_color =( Or some nodes in trace was skipped?

Its for change up my education ;)
> I'll not be able to assist you until monday (but I'll try to look
> into the code and maybe to prepare some new patch - but it needs
> a lot of checking to not add too much of this locking as well).
>
> I think you can stay with a kernel whichever you like - I'm not sure
> any config changes or even more debugging can change much, but maybe
> I'm wrong. It looks to me like some locking is missing or interrupted.
> If you are working weekends and find something new, don't wait: maybe
> somebody else here could be interested too.
>  
> Cheers,
> Jarek P. 
>
>   


^ permalink raw reply

* Re: [PATCH 5/5] Net: ath5k, kconfig changes
From: Dan Williams @ 2007-08-31 14:32 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: John W. Linville, Nick Kossifidis, Christoph Hellwig, Jiri Slaby,
	linux-kernel, linux-wireless, netdev
In-Reply-To: <46D8180A.5070902@garzik.org>

On Fri, 2007-08-31 at 09:30 -0400, Jeff Garzik wrote:
> Dan Williams wrote:
> > On Thu, 2007-08-30 at 08:36 -0400, John W. Linville wrote:
> >> On Thu, Aug 30, 2007 at 04:38:09AM +0300, Nick Kossifidis wrote:
> >>> 2007/8/28, Christoph Hellwig <hch@infradead.org>:
> >>>> Also this whole patch seems rather pointless.  It saves only
> >>>> very little and turns the driver into a complete ifdef maze.
> >>> Also most
> >>> people will use 5212 code only, 5211 cards are on some old laptops and
> >>> 5210, well i couldn't even find  a 5210 for actual testing :P
> >> FWIW, I'd bet dollars to donuts that distros will enable them all
> >> together.
> > 
> > I would certainly _hope_ that distros enable everything -that is in the
> > kernel- that they can get their hands on, otherwise when you stick a
> > card in, it doesn't just work.
> 
> Distros definitely -do not- do this.  Plenty of ancient ISA drivers are 
> disabled at build time, for example, in many distros.

Ok, so let me qualify to "within reason".  All 802.11-compliant wireless
cards would fall within the "within reason" IMHO, but, for example,
older non 802.11 wireless cards (early Aironet products for example)
probably don't.  ISA clearly does not for mainstream distros.

Dan



^ permalink raw reply

* Re: Tc bug (kernel crash) more info
From: Badalian Vyacheslav @ 2007-08-31 14:51 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: netdev
In-Reply-To: <46D8265B.6000405@bigtelecom.ru>

I found that bug in this place

(gdb) l *0xc01c8973
0xc01c8973 is in rb_insert_color (lib/rbtree.c:80).
75
76              while ((parent = rb_parent(node)) && rb_is_red(parent))
77              {
78                      gparent = rb_parent(parent);
79
80                      if (parent == gparent->rb_left)
81                      {
82                              {
83                                      register struct rb_node *uncle = 
gparent->rb_right;
84                                      if (uncle && rb_is_red(uncle))


if i not wrong understand message "unable to handle kernel NULL pointer 
dereference at virtual address 00000008" its was known that "gparent == 
Null"?
Or i hope or i try find a mare's-nest?

> Ok =) I hope in next week you found bug place and fix it!
>
> PS. if you ask where i can read "kernel panic dump logic" literature 
> and try find bugline in code.
> I read dump and see that bug in function "rb_insert_color" + some 
> shift (in asm?) that called from htb_dequeue? But in htb_dequeue not 
> have calling rb_insert_color =( Or some nodes in trace was skipped?
>
> Its for change up my education ;)
>> I'll not be able to assist you until monday (but I'll try to look
>> into the code and maybe to prepare some new patch - but it needs
>> a lot of checking to not add too much of this locking as well).
>>
>> I think you can stay with a kernel whichever you like - I'm not sure
>> any config changes or even more debugging can change much, but maybe
>> I'm wrong. It looks to me like some locking is missing or interrupted.
>> If you are working weekends and find something new, don't wait: maybe
>> somebody else here could be interested too.
>>  
>> Cheers,
>> Jarek P.
>>   
>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply

* Re: [PATCH 1/7] Generic bitbanged MDIO library
From: Scott Wood @ 2007-08-31 15:16 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: netdev, linuxppc-dev, Paul Mackerras
In-Reply-To: <46D81638.6050509@pobox.com>

On Fri, Aug 31, 2007 at 09:23:04AM -0400, Jeff Garzik wrote:
> I cannot ACK this, nor do I want to see it merged, until users appear 
> and have been reviewed alongside this.  I do not see any fs_enet patches 
> that actually use this.

The fs_enet patchset does use it in mii-bitbang.c, in patch 6/7 (or 7/8
in the revised patchset).  I'll also be using it on the ep8248e board
(which is why I factored it out), which I'll attach below.

> * the delay (where you call ndelay()) is not guaranteed without a flush 
> of some sort

Hmm...  I'm inclined to push that into the client's implementation of set
mdc/data, primarily because there's no get mdc to do a read-back
generically.

> * how widely applicable is this "generic" library?

It implements MDIO as described in the ethernet specification; there
should be no implementation dependencies.

>  have you converted any non-embedded drivers over to it?

No; most hardware doesn't need bitbanging.  Sunhme does, but I don't have
hardware to test a conversion.

Here's the ep8248e user of the bitbang code:

/*
 * Embedded Planet EP8248E support
 *
 * Copyright 2007 Freescale Semiconductor, Inc.
 * Author: Scott Wood <scottwood@freescale.com>
 *
 * This program is free software; you can redistribute  it and/or modify
 * it under the terms of version 2 of the GNU General Public License as
 * published by the Free Software Foundation.
 */

#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/fsl_devices.h>
#include <linux/mdio-bitbang.h>
#include <linux/of_platform.h>

#include <asm/io.h>
#include <asm/cpm2.h>
#include <asm/udbg.h>
#include <asm/machdep.h>
#include <asm/time.h>
#include <asm/mpc8260.h>

#include <sysdev/fsl_soc.h>
#include <sysdev/cpm2_pic.h>

#include "pq2.h"

static u8 __iomem *ep8248e_bcsr;
static struct device_node *ep8248e_bcsr_node;

#define BCSR7_SCC2_ENABLE 0x10

#define BCSR8_PHY1_ENABLE 0x80
#define BCSR8_PHY1_POWER  0x40
#define BCSR8_PHY2_ENABLE 0x20
#define BCSR8_PHY2_POWER  0x10
#define BCSR8_MDIO_READ   0x04
#define BCSR8_MDIO_CLOCK  0x02
#define BCSR8_MDIO_DATA   0x01

#define BCSR9_USB_ENABLE  0x80
#define BCSR9_USB_POWER   0x40
#define BCSR9_USB_HOST    0x20
#define BCSR9_USB_FULL_SPEED_TARGET 0x10

static void __init ep8248_pic_init(void)
{
	struct device_node *np = of_find_compatible_node(NULL, NULL, "fsl,pq2-pic");
	if (!np) {
		printk(KERN_ERR "PIC init: can not find cpm-pic node\n");
		return;
	}

	cpm2_pic_init(np);
	of_node_put(np);
}

static void ep8248e_set_mdc(struct mdio_bitbang_ctrl *ctrl, int level)
{
	if (level)
		setbits8(&ep8248e_bcsr[8], BCSR8_MDIO_CLOCK);
	else
		clrbits8(&ep8248e_bcsr[8], BCSR8_MDIO_CLOCK);

	/* Flush the write before delaying. */
	in_8(&ep8248e_bcsr[8]);
}

static void ep8248e_set_mdio_dir(struct mdio_bitbang_ctrl *ctrl, int output)
{
	if (output)
		clrbits8(&ep8248e_bcsr[8], BCSR8_MDIO_READ);
	else
		setbits8(&ep8248e_bcsr[8], BCSR8_MDIO_READ);

	/* Flush the write before delaying. */
	in_8(&ep8248e_bcsr[8]);
}

static void ep8248e_set_mdio_data(struct mdio_bitbang_ctrl *ctrl, int data)
{
	if (data)
		setbits8(&ep8248e_bcsr[8], BCSR8_MDIO_DATA);
	else
		clrbits8(&ep8248e_bcsr[8], BCSR8_MDIO_DATA);

	/* Flush the write before delaying. */
	in_8(&ep8248e_bcsr[8]);
}

static int ep8248e_get_mdio_data(struct mdio_bitbang_ctrl *ctrl)
{
	return in_8(&ep8248e_bcsr[8]) & BCSR8_MDIO_DATA;
}

static struct mdio_bitbang_ops ep8248e_mdio_ops = {
	.set_mdc = ep8248e_set_mdc,
	.set_mdio_dir = ep8248e_set_mdio_dir,
	.set_mdio_data = ep8248e_set_mdio_data,
	.get_mdio_data = ep8248e_get_mdio_data,
	.owner = THIS_MODULE,
};

static struct mdio_bitbang_ctrl ep8248e_mdio_ctrl = {
	.ops = &ep8248e_mdio_ops,
};

static int __devinit ep8248e_mdio_probe(struct of_device *ofdev,
                                        const struct of_device_id *match)
{
	struct mii_bus *bus;
	struct resource res;
	int ret, i;

	if (of_get_parent(ofdev->node) != ep8248e_bcsr_node)
		return -ENODEV;

	ret = of_address_to_resource(ofdev->node, 0, &res);
	if (ret)
		return ret;

	bus = alloc_mdio_bitbang(&ep8248e_mdio_ctrl);
	if (!bus)
		return -ENOMEM;

	bus->phy_mask = 0;
	bus->irq = kmalloc(sizeof(int) * PHY_MAX_ADDR, GFP_KERNEL);

	for (i = 0; i < PHY_MAX_ADDR; i++)
		bus->irq[i] = -1;

	bus->name = "ep8248e-mdio-bitbang";
	bus->dev = &ofdev->dev;
	bus->id = res.start;

	return mdiobus_register(bus);
}

static int ep8248e_mdio_remove(struct of_device *ofdev)
{
	BUG();
	return 0;
}

static struct of_device_id ep8248e_mdio_match[] = {
	{
		.compatible = "fsl,ep8248e-mdio-bitbang",
	},
	{},
};

static struct of_platform_driver ep8248e_mdio_driver = {
	.name = "ep8248e-mdio-bitbang",
	.match_table = ep8248e_mdio_match,
	.probe = ep8248e_mdio_probe,
	.remove = ep8248e_mdio_remove,
};

struct cpm_pin {
	int port, pin, flags;
};

static struct cpm_pin ep8248_pins[] = {
	/* SMC1 */
	{2, 4, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{2, 5, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},

	/* SCC1 */
	{2, 14, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{2, 15, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{3, 29, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{3, 30, CPM_PIN_OUTPUT | CPM_PIN_SECONDARY},
	{3, 31, CPM_PIN_INPUT | CPM_PIN_PRIMARY},

	/* FCC1 */
	{0, 14, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{0, 15, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{0, 16, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{0, 17, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{0, 18, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{0, 19, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{0, 20, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{0, 21, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{0, 26, CPM_PIN_INPUT | CPM_PIN_SECONDARY},
	{0, 27, CPM_PIN_INPUT | CPM_PIN_SECONDARY},
	{0, 28, CPM_PIN_OUTPUT | CPM_PIN_SECONDARY},
	{0, 29, CPM_PIN_OUTPUT | CPM_PIN_SECONDARY},
	{0, 30, CPM_PIN_INPUT | CPM_PIN_SECONDARY},
	{0, 31, CPM_PIN_INPUT | CPM_PIN_SECONDARY},
	{2, 21, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{2, 22, CPM_PIN_INPUT | CPM_PIN_PRIMARY},

	/* FCC2 */
	{1, 18, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 19, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 20, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 21, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 22, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{1, 23, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{1, 24, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{1, 25, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{1, 26, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 27, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 28, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 29, CPM_PIN_OUTPUT | CPM_PIN_SECONDARY},
	{1, 30, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{1, 31, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{2, 18, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{2, 19, CPM_PIN_INPUT | CPM_PIN_PRIMARY},

	/* I2C */
	{4, 14, CPM_PIN_INPUT | CPM_PIN_SECONDARY},
	{4, 15, CPM_PIN_INPUT | CPM_PIN_SECONDARY},

	/* USB */
	{2, 10, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{2, 11, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{2, 20, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{2, 24, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
	{3, 23, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{3, 24, CPM_PIN_OUTPUT | CPM_PIN_PRIMARY},
	{3, 25, CPM_PIN_INPUT | CPM_PIN_PRIMARY},
};

static void __init init_ioports(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(ep8248_pins); i++) {
		struct cpm_pin *pin = &ep8248_pins[i];
		cpm2_set_pin(pin->port, pin->pin, pin->flags);
	}

	cpm2_smc_clk_setup(CPM_CLK_SMC1, CPM_BRG7);
	cpm2_clk_setup(CPM_CLK_SCC1, CPM_BRG1, CPM_CLK_RX);
	cpm2_clk_setup(CPM_CLK_SCC1, CPM_BRG1, CPM_CLK_TX);
	cpm2_clk_setup(CPM_CLK_SCC3, CPM_CLK8, CPM_CLK_TX); /* USB */
	cpm2_clk_setup(CPM_CLK_FCC1, CPM_CLK11, CPM_CLK_RX);
	cpm2_clk_setup(CPM_CLK_FCC1, CPM_CLK10, CPM_CLK_TX);
	cpm2_clk_setup(CPM_CLK_FCC2, CPM_CLK13, CPM_CLK_RX);
	cpm2_clk_setup(CPM_CLK_FCC2, CPM_CLK14, CPM_CLK_TX);
}

static void __init ep8248_setup_arch(void)
{
	if (ppc_md.progress)
		ppc_md.progress("ep8248_setup_arch()", 0);

	cpm2_reset();

	/* When this is set, snooping CPM DMA from RAM causes
	 * machine checks.  See erratum SIU18.
	 */
	clrbits32(&cpm2_immr->im_siu_conf.siu_82xx.sc_bcr, MPC82XX_BCR_PLDP);

	ep8248e_bcsr_node =
		of_find_compatible_node(NULL, NULL, "fsl,ep8248e-bcsr");
	if (!ep8248e_bcsr_node) {
		printk(KERN_ERR "No bcsr in device tree\n");
		return;
	}

	ep8248e_bcsr = of_iomap(ep8248e_bcsr_node, 0);
	if (!ep8248e_bcsr) {
		printk(KERN_ERR "Cannot map BCSR registers\n");
		return;
	}

	setbits8(&ep8248e_bcsr[7], BCSR7_SCC2_ENABLE);
	setbits8(&ep8248e_bcsr[8], BCSR8_PHY1_ENABLE | BCSR8_PHY1_POWER |
	                           BCSR8_PHY2_ENABLE | BCSR8_PHY2_POWER);

	init_ioports();

	if (ppc_md.progress)
		ppc_md.progress("ep8248_setup_arch(), finish", 0);
}

static struct of_device_id __initdata of_bus_ids[] = {
	{ .name = "soc", },
	{ .name = "cpm", },
	{ .compatible = "fsl,pq2-chipselect", },
	{ .compatible = "fsl,ep8248e-bcsr", },
	{},
};

static int __init declare_of_platform_devices(void)
{
	if (!machine_is(ep8248))
		return 0;

	of_platform_bus_probe(NULL, of_bus_ids, NULL);
	of_register_platform_driver(&ep8248e_mdio_driver);

	return 0;
}
device_initcall(declare_of_platform_devices);

/*
 * Called very early, device-tree isn't unflattened
 */
static int __init ep8248_probe(void)
{
	unsigned long root = of_get_flat_dt_root();
	return of_flat_dt_is_compatible(root, "fsl,ep8248e");
}

define_machine(ep8248)
{
	.name = "Embedded Planet EP8248E",
	.probe = ep8248_probe,
	.setup_arch = ep8248_setup_arch,
	.init_IRQ = ep8248_pic_init,
	.get_irq = cpm2_get_irq,
	.calibrate_decr = generic_calibrate_decr,
	.restart = pq2_restart,
	.progress = udbg_progress,
};

^ permalink raw reply

* Distributed storage. Security attributes and ducumentation update.
From: Evgeniy Polyakov @ 2007-08-31 16:06 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, linux-fsdevel

On Tue, Jul 31, 2007 at 09:13:47PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
Hi.

I'm pleased to announce third release of the distributed storage
subsystem, which allows to form a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.

This release includes following changes:
* security attributes (permission mask assigned to addresses, allowed to
	connect to given local export node)
* big documentation update (userspace documentation on the site also
	includes various usage case examples and descirption of the
	configuration utilitiy, protocols and userspace target)
* mirror algorithm has been moved from per-page to per-sector dirty
	bitmask

Further TODO list includes:
* implement optional saving of mirroring/linear information on the remote
	nodes (simple)
* implement netlink based setup (simple)
* new redundancy algorithm (complex)

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 0000000..bfc6984
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
+  /----- Node 1 ---\                         /------ Node 3 ----\
+start              end                     start               end
+ |==================|========================|==================|
+ |                start                     end                 |
+ |                  \------- Node 2 ---------/                  |
+ |                                                              |
+start                                                          end
+ \-------------------------- DST storage ----------------------/
+
+			        /\
+			        ||
+			        ||
+
+			   IO operations
+
+			    Figure 1. 
+     3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithms nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algoritm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the 
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+                  IO operations
+                       ||
+                       ||
+                       \/
+
+|---------------- DST storate -------------------|
+|      prev position                             |
+|-------|------------ Node 1 --------------------|
+|                              prev pos          |
+|-------------------- Node 2 -----|--------------|
+|prev pos                                        |
+|---|---------------- Node 3 --------------------|
+
+		Figure 2.
+   3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+	int			(*add_node)(struct dst_node *n);
+	void			(*del_node)(struct dst_node *n);
+	int 			(*remap)(struct dst_request *req);
+	int			(*error)(struct kst_state *state, int err);
+	struct module 		*owner;
+};
+
+@add_node.
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
+@del_node.
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+It must return zero on success or negative value otherwise.
+
+@remap.
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is upto
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note, that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+@error.
+This callback is invoked for each error, which happend when processed
+requests for remote nodes or when talking to remote size
+of the local export node (state contains data related to data
+transfers over the network).
+If this function has fixed given error, it must return 0 or negative
+error value otherwise.
+
+@owner.
+This is module reference counter updated automatically by DST core.
+
+Algorithm must provide its name and above structure to the 
+dst_alloc_alg() function, which will return a reference to the newly
+created algorithm.
+To remove it, one needs to call dst_remove_alg() with given algorithm
+pointer.
diff --git a/Documentation/dst/dst.txt b/Documentation/dst/dst.txt
new file mode 100644
index 0000000..3b326aa
--- /dev/null
+++ b/Documentation/dst/dst.txt
@@ -0,0 +1,66 @@
+Distributed storage. Design and implementation.
+http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
+
+	     Evgeniy Polyakov
+
+This document is intended to briefly describe design and
+implementation details of the distributed storage project,
+aimed to create ability to group physically and/or logically
+distributed storages into single device.
+
+Main operational unit in the storage is node. Node can represent
+either remote storage, connected to local machine, or local
+device, or storage exported to the outside of the system.
+Here goes small explaination of basic therms.
+
+Local node.
+This node is just a logical link between block device (with given
+major and minor numbers) and structure in the DST hierarchy,
+which represents number of sectors on the area, corresponding to given
+block device. it can be a disk, a device mapper node or stacked
+block device on top of another underlying DST nodes.
+
+Local export node.
+Essentially the same as local node, but it allows to access
+to its data via network. Remote clients can connect to given local 
+export node and read or write blocks according to its size.
+Blocks are then forwarded to underlying local node and processed
+there accordingly to the nature of the local node.
+
+Remote node.
+This type of nodes contain remotely accessible devices. One can think
+about remote nodes as remote disks, which can be connected to
+local system and combined into single storage. Remote nodes
+are presented as number of sectors accessed over the network
+by the local machine, where distributed storage is being formed.
+
+
+Each node or set of them can be formed into single array, which
+in turn becomes a local node, which can be exported further by stacking
+a local export node on top of it.
+
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+One can find more details in Documentation/dst/algorithms.txt file.
+
+Main goal of the distributed storage is to combine remote nodes into
+single device, so each block IO request is being sent over the network
+(contrary requests for local nodes are handled by the gneric block
+layer features). Each network connection has number of variables which
+describe it (socket, list of requests, error handling and so on),
+which form kst_state structure. This network state is added into per-socket
+polling state machine, and can be processed by dedicated thread when
+becomes ready. This system forms asynchronous IO for given block
+requests. If block request can be processed without blocking, then
+no new structures are allocated and async part of the state is not used.
+
+When connection to the remote peer breaks, DST core tries to reconnect
+to failed node and no requests are marked as errorneous, instead
+they live in the queue until reconnectin is established.
+
+Userspace code, setup documentation and examples can be found on project's
+homepage above.
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
 	This driver provides Support for ATA over Ethernet block
 	devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD)		+= viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)	+= sx8.o
 obj-$(CONFIG_BLK_DEV_UB)	+= ub.o
 
+obj-$(CONFIG_DST)		+= dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 0000000..9c5eba2
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,19 @@
+config DST
+	tristate "Distributed storage"
+	depends on NET
+	---help---
+	This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+	tristate "Linear distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create linear mapping of the nodes
+	in the distributed storage.
+
+config DST_ALG_MIRROR
+	tristate "Mirror distribution algorithm"
+	depends on DST
+	---help---
+	This module allows to create a mirror of the noes in the
+	distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 0000000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o
diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 0000000..584f99e
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,99 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	n->start = st->disk_size;
+	st->disk_size += n->size;
+
+	return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+	int err;
+
+	if (req->node->bdev) {
+		generic_make_request(req->bio);
+		return 0;
+	}
+
+	err = kst_check_permissions(req->state, req->bio);
+	if (err)
+		return err;
+
+	return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+	if (err)
+		set_bit(DST_NODE_FROZEN, &st->node->flags);
+	else
+		clear_bit(DST_NODE_FROZEN, &st->node->flags);
+	return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+	.remap		= dst_linear_remap,
+	.add_node 	= dst_linear_add_node,
+	.del_node 	= dst_linear_del_node,
+	.error		= dst_linear_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+	alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+	if (!alg_linear)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+	dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 0000000..be42350
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,765 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <linux/dst.h>
+
+#define DST_MIRROR_MAX_CHUNKS		4096
+
+struct dst_mirror_priv
+{
+	unsigned int		chunk_num;
+
+	u64			last_start;
+
+	spinlock_t		backlog_lock;
+	struct list_head	backlog_list;
+
+	unsigned long		*chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static ssize_t dst_mirror_chunk_mask_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+	struct dst_mirror_priv *priv = n->priv;
+	unsigned int i;
+	int rest = PAGE_SIZE;
+
+	for (i = 0; i < priv->chunk_num/BITS_PER_LONG; ++i) {
+		int bit, j;
+
+		for (j = 0; j < BITS_PER_LONG; ++j) {
+			bit = (priv->chunk[i] >> j) & 1;
+			sprintf(buf, "%c", (bit)?'+':'-');
+			buf++;
+		}
+
+		rest -= BITS_PER_LONG;
+
+		if (rest < BITS_PER_LONG)
+			break;
+	}
+
+	return PAGE_SIZE - rest;
+}
+
+static DEVICE_ATTR(chunks, 0444, dst_mirror_chunk_mask_show, NULL);
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_mirror_del_node(struct dst_node *n)
+{
+	struct dst_mirror_priv *priv = n->priv;
+
+	vfree(priv->chunk);
+	kfree(priv);
+	n->priv = NULL;
+
+	if (n->device.parent == &n->st->device)
+		device_remove_file(&n->device, &dev_attr_chunks);
+}
+
+static void dst_mirror_handle_priv(struct dst_node *n)
+{
+	if (n->priv) {
+		int err;
+		err = device_create_file(&n->device, &dev_attr_chunks);
+	}
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_mirror_add_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+	struct dst_mirror_priv *priv;
+
+	if (st->disk_size)
+		st->disk_size = min(n->size, st->disk_size);
+	else
+		st->disk_size = n->size;
+
+	priv = kzalloc(sizeof(struct dst_mirror_priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->chunk_num = st->disk_size;
+
+	priv->chunk = vmalloc(priv->chunk_num/BITS_PER_LONG * sizeof(long));
+	if (!priv->chunk)
+		goto err_out_free;
+
+	memset(priv->chunk, 0, priv->chunk_num/BITS_PER_LONG * sizeof(long));
+
+	spin_lock_init(&priv->backlog_lock);
+	INIT_LIST_HEAD(&priv->backlog_list);
+
+	dprintk("%s: %llu:%llu, chunk_num: %u, disk_size: %llu.\n",
+			__func__, n->start, n->size,
+			priv->chunk_num, st->disk_size);
+
+	n->priv_callback = &dst_mirror_handle_priv;
+	n->priv = priv;
+
+	return 0;
+
+err_out_free:
+	kfree(priv);
+	return -ENOMEM;
+}
+
+static void dst_mirror_sync_destructor(struct bio *bio)
+{
+	struct bio_vec *bv;
+	int i;
+
+	bio_for_each_segment(bv, bio, i)
+		__free_page(bv->bv_page);
+	bio_free(bio, dst_mirror_bio_set);
+}
+
+static void dst_mirror_sync_requeue(struct dst_node *n)
+{
+	struct dst_mirror_priv *p = n->priv;
+	struct dst_request *req;
+	unsigned int num, idx, i;
+	u64 start;
+	unsigned long flags;
+	int err;
+
+	while (!list_empty(&p->backlog_list)) {
+		req = NULL;
+		spin_lock_irqsave(&p->backlog_lock, flags);
+		if (!list_empty(&p->backlog_list)) {
+			req = list_entry(p->backlog_list.next,
+					struct dst_request,
+					request_list_entry);
+			list_del(&req->request_list_entry);
+		}
+		spin_unlock_irqrestore(&p->backlog_lock, flags);
+
+		if (!req)
+			break;
+
+		start = req->start - to_sector(req->orig_size - req->size);
+
+		idx = start;
+		num = to_sector(req->orig_size);
+
+		for (i=0; i<num; ++i)
+			if (test_bit(idx+i, p->chunk))
+				break;
+
+		dprintk("%s: idx: %u, num: %u, i: %u, req: %p, "
+				"start: %llu, size: %llu.\n",
+				__func__, idx, num, i, req, 
+				req->start, req->orig_size);
+
+		err = -1;
+		if (i != num) {
+			err = kst_enqueue_req(n->state, req);
+			if (err) {
+				printk("%s: congestion [%c]: req: %p, "
+						"start: %llu, size: %llu.\n",
+					__func__,
+					(bio_rw(req->bio) == WRITE)?'W':'R',
+					req, req->start, req->size);
+				kst_del_req(req);
+			}
+		}
+		if (err) {
+			req->bio_endio(req, err);
+			dst_free_request(req);
+		}
+	}
+
+	kst_wake(n->state);
+}
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+	if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		clear_bit(DST_NODE_NOTSYNC, &n->flags);
+		printk("%s: node: %p, %llu:%llu synchronization "
+				"has been completed.\n",
+			__func__, n, n->start, n->size);
+	}
+}
+
+static void dst_mirror_mark_notsync(struct dst_node *n)
+{
+	if (!test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+		set_bit(DST_NODE_NOTSYNC, &n->flags);
+		printk("%s: not synced node n: %p.\n", __func__, n);
+	}
+}
+
+/*
+ * Without errors it is always called under node's request lock,
+ * so it is safe to requeue them.
+ */
+static void dst_mirror_bio_error(struct dst_request *req, int err)
+{
+	int i;
+	struct dst_mirror_priv *priv = req->node->priv;
+	unsigned int num, idx;
+	void (*process_bit[])(int nr, volatile void *addr) =
+		{&__clear_bit, &__set_bit};
+	u64 start = req->start - to_sector(req->orig_size - req->size);
+
+	if (err)
+		dst_mirror_mark_notsync(req->node);
+	else
+		dst_mirror_sync_requeue(req->node);
+
+	priv->last_start = req->start;
+
+	idx = start;
+	num = to_sector(req->orig_size);
+
+	dprintk("%s: req_priv: %p, chunk %p, %llu:%llu start: %llu, size: %llu, "
+		"chunk_num: %u, idx: %d, num: %d, err: %d.\n",
+		__func__, req->priv, priv->chunk, req->node->start, 
+		req->node->size, start, req->orig_size, priv->chunk_num, 
+		idx, num, err);
+
+	if (unlikely(idx >= priv->chunk_num || idx + num > priv->chunk_num)) {
+		printk("%s: %llu:%llu req: %p, start: %llu, orig_size: %llu, "
+			"req_start: %llu, req_size: %llu, "
+			"chunk_num: %u, idx: %d, num: %d, err: %d.\n",
+			__func__, req->node->start, req->node->size, req,
+			start, req->orig_size, 
+			req->start, req->size,
+			priv->chunk_num, idx, num, err);
+		return;
+	}
+
+	for (i=0; i<num; ++i)
+		process_bit[!!err](idx+i, priv->chunk);
+}
+
+static void dst_mirror_sync_req_endio(struct dst_request *req, int err)
+{
+	unsigned long notsync = 0;
+	struct dst_mirror_priv *priv = req->node->priv;
+	int i;
+
+	dst_mirror_bio_error(req, err);
+
+	printk("%s: freeing bio: %p, bi_size: %u, "
+			"orig_size: %llu, req: %p, node: %p.\n",
+		__func__, req->bio, req->bio->bi_size, req->orig_size, req,
+		req->node);
+
+	bio_put(req->bio);
+
+	for (i = 0; i < priv->chunk_num/BITS_PER_LONG; ++i) {
+		notsync = priv->chunk[i];
+
+		if (notsync)
+			break;
+	}
+
+	if (!notsync)
+		dst_mirror_mark_sync(req->node);
+}
+
+static int dst_mirror_sync_endio(struct bio *bio, unsigned int size, int err)
+{
+	struct dst_request *req = bio->bi_private;
+	struct dst_node *n = req->node;
+	struct dst_mirror_priv *priv = n->priv;
+	unsigned long flags;
+
+	printk("%s: bio: %p, err: %d, size: %u, req: %p.\n",
+			__func__, bio, err, bio->bi_size, req);
+
+	if (bio->bi_size)
+		return 1;
+
+	bio->bi_rw = WRITE;
+	bio->bi_size = req->orig_size;
+	bio->bi_sector = req->start;
+
+	if (!err) {
+		spin_lock_irqsave(&priv->backlog_lock, flags);
+		list_add_tail(&req->request_list_entry, &priv->backlog_list);
+		spin_unlock_irqrestore(&priv->backlog_lock, flags);
+		kst_wake(req->state);
+	} else {
+		req->bio_endio(req, err);
+		dst_free_request(req);
+	}
+	return 0;
+}
+
+static int dst_mirror_sync_block(struct dst_node *n,
+		int bit_start, int bit_num)
+{
+	u64 start = to_bytes(bit_start);
+	struct bio *bio;
+	unsigned int nr_pages = to_bytes(bit_num)/PAGE_SIZE, i;
+	struct page *page;
+	int err = -ENOMEM;
+	struct dst_request *req;
+
+	printk("%s: bit_start: %d, bit_num: %d, start: %llu, nr_pages: %u, "
+			"disk_size: %llu.\n",
+			__func__, bit_start, bit_num, start, nr_pages,
+			n->st->disk_size);
+
+	while (nr_pages) {
+		req = dst_clone_request(NULL, n->w->req_pool);
+		if (!req)
+			return -ENOMEM;
+
+		bio = bio_alloc_bioset(GFP_NOIO, nr_pages, dst_mirror_bio_set);
+		if (!bio)
+			goto err_out_free_req;
+
+		bio->bi_rw = READ;
+		bio->bi_private = req;
+		bio->bi_sector = to_sector(start);
+		bio->bi_bdev = NULL;
+		bio->bi_destructor = dst_mirror_sync_destructor;
+		bio->bi_end_io = dst_mirror_sync_endio;
+
+		for (i = 0; i < nr_pages; ++i) {
+			err = -ENOMEM;
+
+			page = alloc_page(GFP_NOIO);
+			if (!page)
+				break;
+
+			err = bio_add_pc_page(n->st->queue, bio,
+					page, PAGE_SIZE, 0);
+			if (err <= 0)
+				break;
+			err = 0;
+		}
+
+		if (err && !bio->bi_vcnt)
+			goto err_out_put_bio;
+
+		req->node = n;
+		req->state = n->state;
+		req->start = bio->bi_sector;
+		req->size = req->orig_size = bio->bi_size;
+		req->bio = bio;
+		req->idx = bio->bi_idx;
+		req->num = bio->bi_vcnt;
+		req->bio_endio = &dst_mirror_sync_req_endio;
+		req->callback = &kst_data_callback;
+
+		dprintk("%s: start: %llu, size(pages): %u, bio: %p, "
+				"size: %u, cnt: %d, req: %p, size: %llu.\n",
+				__func__, bio->bi_sector, nr_pages, bio,
+				bio->bi_size, bio->bi_vcnt, req, req->size);
+
+		err = n->st->queue->make_request_fn(n->st->queue, bio);
+		if (err)
+			goto err_out_put_bio;
+
+		nr_pages -= bio->bi_vcnt;
+		start += bio->bi_size;
+	}
+
+	return 0;
+
+err_out_put_bio:
+	bio_put(bio);
+err_out_free_req:
+	dst_free_request(req);
+	return err;
+}
+
+/*
+ * Resync logic.
+ *
+ * System allocates and queues requests for number of regions.
+ * Each request initially is reading from the one of the nodes.
+ * When it is completed, system checks if given region was already
+ * written to, and in such case just drops read request, otherwise
+ * it writes it to the node being updated. Any write clears not-uptodate
+ * bit, which is used as a flag that region must be synchronized or not.
+ * Reading is never performed from the node under resync.
+ */
+static int dst_mirror_resync(struct dst_node *n)
+{
+	int err = 0, sync = 0;
+	struct dst_mirror_priv *priv = n->priv;
+	unsigned int i;
+
+	printk("%s: node: %p, %llu:%llu synchronization has been started.\n",
+			__func__, n, n->start, n->size);
+
+	for (i = 0; i < priv->chunk_num/BITS_PER_LONG; ++i) {
+		int bit, num, start;
+		unsigned long word = priv->chunk[i];
+
+		if (!word)
+			continue;
+
+		num = 0;
+		start = -1;
+		while (word && num < BITS_PER_LONG) {
+			bit = __ffs(word);
+			if (start == -1)
+				start = bit;
+			num++;
+			word >>= (bit+1);
+		}
+
+		if (start != -1) {
+			err = dst_mirror_sync_block(n, start + i*BITS_PER_LONG,
+					num);
+			if (err)
+				break;
+			sync++;
+		}
+	}
+
+	if (!sync && !err)
+		dst_mirror_mark_sync(n);
+
+	return err;
+}
+
+static void dst_mirror_destructor(struct bio *bio)
+{
+	dprintk("%s: bio: %p.\n", __func__, bio);
+	bio_free(bio, dst_mirror_bio_set);
+}
+
+static int dst_mirror_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct dst_request *req = bio->bi_private;
+
+	if (bio->bi_size)
+		return 0;
+
+	dprintk("%s: req: %p, bio: %p, req->bio: %p, err: %d.\n",
+			__func__, req, bio, req->bio, err);
+	req->bio_endio(req, err);
+	bio_put(bio);
+	return 0;
+}
+
+static void dst_mirror_read_endio(struct dst_request *req, int err)
+{
+	dst_mirror_bio_error(req, err);
+
+	if (!err)
+		kst_bio_endio(req, 0);
+}
+
+static void dst_mirror_write_endio(struct dst_request *req, int err)
+{
+	dst_mirror_bio_error(req, err);
+
+	req = req->priv;
+
+	dprintk("%s: req: %p, priv: %p err: %d, bio: %p, "
+			"cnt: %d, orig_size: %llu.\n",
+		__func__, req, req->priv, err, req->bio,
+		atomic_read(&req->refcnt), req->orig_size);
+
+	if (atomic_dec_and_test(&req->refcnt)) {
+		dprintk("%s: freeing bio %p.\n", __func__, req->bio);
+		bio_endio(req->bio, req->orig_size, 0);
+		dst_free_request(req);
+	}
+}
+
+static int dst_mirror_process_request(struct dst_request *req,
+		struct dst_node *n)
+{
+	int err = 0;
+
+	/*
+	 * Block layer requires to clone a bio.
+	 */
+	if (n->bdev) {
+		struct bio *clone = bio_alloc_bioset(GFP_NOIO,
+			req->bio->bi_max_vecs, dst_mirror_bio_set);
+
+		__bio_clone(clone, req->bio);
+
+		clone->bi_bdev = n->bdev;
+		clone->bi_destructor = dst_mirror_destructor;
+		clone->bi_private = req;
+		clone->bi_end_io = &dst_mirror_end_io;
+
+		dprintk("%s: clone: %p, bio: %p, req: %p.\n",
+				__func__, clone, req->bio, req);
+
+		generic_make_request(clone);
+	} else {
+		struct dst_request nr;
+		/*
+		 * Network state processing engine will clone request 
+		 * by itself if needed. We can not use the same structure
+		 * here, since number of its fields will be modified.
+		 */
+		memcpy(&nr, req, sizeof(struct dst_request));
+
+		nr.node = n;
+		nr.state = n->state;
+		nr.priv = req;
+
+		err = kst_check_permissions(n->state, req->bio);
+		if (!err)
+			err = req->state->ops->push(&nr);
+	}
+
+	dprintk("%s: req: %p, n: %p, bdev: %p, err: %d.\n",
+			__func__, req, n, n->bdev, err);
+	return err;
+}
+
+static int dst_mirror_write(struct dst_request *oreq)
+{
+	struct dst_node *n, *node = req->node;
+	int num, err = 0, err_num = 0, orig_num;
+
+	req = dst_clone_request(oreq, oreq->node->w->req_pool);
+	if (!req) {
+		kst_bio_endio(oreq, -ENOMEM);
+		return -ENOMEM;
+	}
+
+	req->priv = req;
+
+	/*
+	 * This logic is pretty simple - req->bio_endio will not
+	 * call bio_endio() until all mirror devices completed
+	 * processing of the request (no matter with or without error).
+	 * Mirror's req->bio_endio callback will take care of that.
+	 */
+	orig_num = num = atomic_read(&req->node->shared_num) + 1;
+	atomic_set(&req->refcnt, num);
+
+	req->bio_endio = &dst_mirror_write_endio;
+
+	dprintk("\n%s: req: %p, mirror to %d nodes.\n",
+			__func__, req, num);
+
+	err = dst_mirror_process_request(req, node);
+	if (err)
+		err_num++;
+
+	if (--num) {
+		list_for_each_entry_rcu(n, &node->shared, shared) {
+			dprintk("\n%s: req: %p, start: %llu, size: %llu, "
+					"num: %d, n: %p.\n",
+				__func__, req, req->start, 
+				req->size, num, n);
+
+			err = dst_mirror_process_request(req, n);
+			if (err)
+				err_num++;
+
+			if (--num <= 0)
+				break;
+		}
+	}
+
+	if (err_num == orig_num) {
+		dprintk("%s: req: %p, num: %d, err: %d.\n",
+				__func__, req, num, err);
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+static int dst_mirror_read(struct dst_request *req)
+{
+	struct dst_node *node = req->node, *n, *min_dist_node;
+	struct dst_mirror_priv *priv = node->priv;
+	u64 dist, d;
+	int err;
+
+	req->bio_endio = &dst_mirror_read_endio;
+
+	do {
+		err = -ENODEV;
+		min_dist_node = NULL;
+		dist = -1ULL;
+ 
+		/*
+		 * Reading is never performed from the node under resync.
+		 * If this will cause any troubles (like all nodes must be
+		 * resynced between each other), this check can be removed
+		 * and per-chunk dirty bit can be tested instead.
+		 */
+
+		if (!test_bit(DST_NODE_NOTSYNC, &node->flags)) {
+			priv = node->priv;
+			if (req->start > priv->last_start)
+				dist = req->start - priv->last_start;
+			else
+				dist = priv->last_start - req->start;
+			min_dist_node = req->node;
+		}
+
+		list_for_each_entry_rcu(n, &node->shared, shared) {
+			if (test_bit(DST_NODE_NOTSYNC, &n->flags))
+				continue;
+
+			priv = n->priv;
+
+			if (req->start > priv->last_start)
+				d = req->start - priv->last_start;
+			else
+				d = priv->last_start - req->start;
+
+			if (d < dist)
+				min_dist_node = n;
+		}
+
+		if (!min_dist_node)
+			break;
+
+		req->node = min_dist_node;
+		req->state = req->node->state;
+
+		if (req->node->bdev) {
+			req->bio->bi_bdev = req->node->bdev;
+			generic_make_request(req->bio);
+			err = 0;
+			break;
+		}
+
+		err = req->state->ops->push(req);
+		if (err) {
+			printk("%s: 1 req: %p, bio: %p, node: %p, err: %d.\n",
+				__func__, req, req->bio, min_dist_node, err);
+			dst_mirror_mark_notsync(req->node);
+		}
+	} while (err && min_dist_node);
+
+	if (err) {
+		printk("%s: req: %p, bio: %p, node: %p, err: %d.\n",
+			__func__, req, req->bio, min_dist_node, err);
+		kst_bio_endio(req, err);
+	}
+	return err;
+}
+
+/*
+ * This callback is invoked from block layer request processing function,
+ * its task is to remap block request to different nodes.
+ */
+static int dst_mirror_remap(struct dst_request *req)
+{
+	int (*remap[])(struct dst_request *) = 
+		{&dst_mirror_read, &dst_mirror_write};
+
+	return remap[bio_rw(req->bio) == WRITE](req);
+}
+
+static int dst_mirror_error(struct kst_state *st, int err)
+{
+	struct dst_request *req, *tmp;
+	unsigned int revents = st->socket->ops->poll(NULL, st->socket, NULL);
+
+	if (err == -EEXIST)
+		return err;
+
+	if (!(revents & (POLLERR | POLLHUP))) {
+		if (test_bit(DST_NODE_NOTSYNC, &st->node->flags)) {
+			return dst_mirror_resync(st->node);
+		}
+		return 0;
+	}
+
+	dst_mirror_mark_notsync(st->node);
+
+	mutex_lock(&st->request_lock);
+	list_for_each_entry_safe(req, tmp, &st->request_list,
+					request_list_entry) {
+		kst_del_req(req);
+		dprintk("%s: requeue [%c], start: %llu, idx: %d,"
+				" num: %d, size: %llu, offset: %u, err: %d.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size,
+			req->offset, err);
+
+		if (bio_rw(req->bio) == READ) {
+			req->start -= to_sector(req->orig_size - req->size);
+			req->size = req->orig_size;
+			req->flags &= ~DST_REQ_HEADER_SENT;
+			req->idx = 0;
+			if (dst_mirror_read(req))
+				kst_complete_req(req, err);
+			else
+				dst_free_request(req);
+		} else {
+			kst_complete_req(req, err);
+		}
+	}
+	mutex_unlock(&st->request_lock);
+	return err;
+}
+
+static struct dst_alg_ops alg_mirror_ops = {
+	.remap		= dst_mirror_remap,
+	.add_node	= dst_mirror_add_node,
+	.del_node	= dst_mirror_del_node,
+	.error		= dst_mirror_error,
+	.owner		= THIS_MODULE,
+};
+
+static int __devinit alg_mirror_init(void)
+{
+	int err = -ENOMEM;
+
+	dst_mirror_bio_set = bioset_create(256, 256);
+	if (!dst_mirror_bio_set)
+		return -ENOMEM;
+
+	alg_mirror = dst_alloc_alg("alg_mirror", &alg_mirror_ops);
+	if (!alg_mirror)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	bioset_free(dst_mirror_bio_set);
+	return err;
+}
+
+static void __devexit alg_mirror_exit(void)
+{
+	dst_remove_alg(alg_mirror);
+	bioset_free(dst_mirror_bio_set);
+}
+
+module_init(alg_mirror_init);
+module_exit(alg_mirror_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_DESCRIPTION("Mirror distributed algorithm.");
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 0000000..2bf7fc1
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1526 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+
+struct kmem_cache *dst_request_cache;
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+	return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+	.name 		= "dst",
+	.match 		= &dst_dev_match,
+};
+
+static struct device dst_dev = {
+	.bus 		= &dst_dev_bus_type,
+	.release 	= &dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+	.release 	= &dst_node_release
+};
+
+static struct bio_set *dst_bio_set;
+
+static void dst_destructor(struct bio *bio)
+{
+	bio_free(bio, dst_bio_set);
+}
+
+/*
+ * Internal callback for local requests (i.e. for local disk),
+ * which are splitted between nodes (part with local node destination
+ * ends up with this ->bi_end_io() callback).
+ */
+static int dst_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct bio *orig_bio = bio->bi_private;
+
+	if (bio->bi_size)
+		return 0;
+
+	dprintk("%s: bio: %p, orig_bio: %p, size: %u, orig_size: %u.\n",
+		__func__, bio, orig_bio, size, orig_bio->bi_size);
+
+	bio_endio(orig_bio, size, 0);
+	bio_put(bio);
+	return 0;
+}
+
+/*
+ * This function sends processing request down to block layer (for local node)
+ * or to network state machine (for remote node).
+ */
+static int dst_node_push(struct dst_request *req)
+{
+	int err = 0;
+	struct dst_node *n = req->node;
+
+	if (n->bdev) {
+		struct bio *bio = req->bio;
+
+		dprintk("%s: start: %llu, num: %d, idx: %d, offset: %u, "
+				"size: %llu, bi_idx: %d, bi_vcnt: %d.\n",
+			__func__, req->start, req->num, req->idx,
+			req->offset, req->size,	bio->bi_idx, bio->bi_vcnt);
+
+		if (likely(bio->bi_idx == req->idx &&
+					bio->bi_vcnt == req->num)) {
+			bio->bi_bdev = n->bdev;
+			bio->bi_sector = req->start;
+		} else {
+			struct bio *clone = bio_alloc_bioset(GFP_NOIO,
+					bio->bi_max_vecs, dst_bio_set);
+			struct bio_vec *bv;
+
+			err = -ENOMEM;
+			if (!clone)
+				goto out_put;
+
+			__bio_clone(clone, bio);
+
+			bv = bio_iovec_idx(clone, req->idx);
+			bv->bv_offset += req->offset;
+			clone->bi_idx = req->idx;
+			clone->bi_vcnt = req->num;
+			clone->bi_bdev = n->bdev;
+			clone->bi_sector = req->start;
+			clone->bi_destructor = dst_destructor;
+			clone->bi_private = bio;
+			clone->bi_size = req->orig_size;
+			clone->bi_end_io = &dst_end_io;
+			req->bio = clone;
+
+			dprintk("%s: start: %llu, num: %d, idx: %d, "
+				"offset: %u, size: %llu, "
+				"bi_idx: %d, bi_vcnt: %d, req: %p, bio: %p.\n",
+				__func__, req->start, req->num, req->idx,
+				req->offset, req->size,
+				clone->bi_idx, clone->bi_vcnt, req, req->bio);
+
+		}
+	}
+
+	err = n->st->alg->ops->remap(req);
+
+out_put:
+	dst_node_put(n);
+	return err;
+}
+
+/*
+ * This function is invoked from block layer request processing function,
+ * its task is to remap block request to different nodes.
+ */
+static int dst_remap(struct dst_storage *st, struct bio *bio)
+{
+	struct dst_node *n;
+	int err = -EINVAL, i, cnt;
+	unsigned int bio_sectors = bio->bi_size>>9;
+	struct bio_vec *bv;
+	struct dst_request req;
+	u64 rest_in_node, start, total_size;
+
+	mutex_lock(&st->tree_lock);
+	n = dst_storage_tree_search(st, bio->bi_sector);
+	mutex_unlock(&st->tree_lock);
+
+	if (!n) {
+		dprintk("%s: failed to find a node for bio: %p, "
+				"sector: %llu.\n",
+				__func__, bio, bio->bi_sector);
+		return -ENODEV;
+	}
+
+	dprintk("%s: bio: %llu-%llu, dev: %llu-%llu, in sectors.\n",
+			__func__, bio->bi_sector, bio->bi_sector+bio_sectors,
+			n->start, n->start+n->size);
+
+	memset(&req, 0, sizeof(struct dst_request));
+
+	start = bio->bi_sector;
+	total_size = bio->bi_size;
+
+	req.flags = (test_bit(DST_NODE_FROZEN, &n->flags))?
+				DST_REQ_ALWAYS_QUEUE:0;
+	req.start = start - n->start;
+	req.offset = 0;
+	req.state = n->state;
+	req.node = n;
+	req.bio = bio;
+
+	req.size = bio->bi_size;
+	req.orig_size = bio->bi_size;
+	req.idx = bio->bi_idx;
+	req.num = bio->bi_vcnt;
+
+	req.bio_endio = &kst_bio_endio;
+
+	/*
+	 * Common fast path - block request does not cross
+	 * boundaries between nodes.
+	 */
+	if (likely(bio->bi_sector + bio_sectors <= n->start + n->size))
+		return dst_node_push(&req);
+
+	req.size = 0;
+	req.idx = 0;
+	req.num = 1;
+
+	cnt = bio->bi_vcnt;
+
+	rest_in_node = to_bytes(n->size - req.start);
+
+	for (i = 0; i < cnt; ++i) {
+		bv = bio_iovec_idx(bio, i);
+
+		if (req.size + bv->bv_len >= rest_in_node) {
+			unsigned int diff = req.size + bv->bv_len -
+				rest_in_node;
+
+			req.size += bv->bv_len - diff;
+			req.start = start - n->start;
+			req.orig_size = req.size;
+			req.bio = bio;
+			req.bio_endio = &kst_bio_endio;
+
+			dprintk("%s: split: start: %llu/%llu, size: %llu, "
+					"total_size: %llu, diff: %u, idx: %d, "
+					"num: %d, bv_len: %u, bv_offset: %u.\n",
+					__func__, start, req.start, req.size,
+					total_size, diff, req.idx, req.num,
+					bv->bv_len, bv->bv_offset);
+
+			err = dst_node_push(&req);
+			if (err)
+				break;
+
+			total_size -= req.orig_size;
+
+			if (!total_size)
+				break;
+
+			start += to_sector(req.orig_size);
+
+			req.flags = (test_bit(DST_NODE_FROZEN, &n->flags))?
+				DST_REQ_ALWAYS_QUEUE:0;
+			req.orig_size = req.size = diff;
+
+			if (diff) {
+				req.offset = bv->bv_len - diff;
+				req.idx = req.num - 1;
+			} else {
+				req.idx = req.num;
+				req.offset = 0;
+			}
+
+			dprintk("%s: next: start: %llu, size: %llu, "
+				"total_size: %llu, diff: %u, idx: %d, "
+				"num: %d, offset: %u, bv_len: %u, "
+				"bv_offset: %u.\n",
+				__func__, start, req.size, total_size, diff,
+				req.idx, req.num, req.offset,
+				bv->bv_len, bv->bv_offset);
+
+			mutex_lock(&st->tree_lock);
+			n = dst_storage_tree_search(st, start);
+			mutex_unlock(&st->tree_lock);
+
+			if (!n) {
+				err = -ENODEV;
+				dprintk("%s: failed to find a split node for "
+				  "bio: %p, sector: %llu, start: %llu.\n",
+						__func__, bio, bio->bi_sector,
+						req.start);
+				break;
+			}
+
+			req.state = n->state;
+			req.node = n;
+			req.start = start - n->start;
+			rest_in_node = to_bytes(n->size - req.start);
+
+			dprintk("%s: req.start: %llu, start: %llu, "
+					"dev_start: %llu, dev_size: %llu, "
+					"rest_in_node: %llu.\n",
+				__func__, req.start, start, n->start,
+				n->size, rest_in_node);
+		} else {
+			req.size += bv->bv_len;
+			req.num++;
+		}
+	}
+
+	dprintk("%s: last request: start: %llu, size: %llu, "
+			"total_size: %llu.\n", __func__,
+			req.start, req.size, total_size);
+	if (total_size) {
+		req.orig_size = req.size;
+		req.bio = bio;
+		req.bio_endio = &kst_bio_endio;
+
+		dprintk("%s: last: start: %llu/%llu, size: %llu, "
+				"total_size: %llu, idx: %d, num: %d.\n",
+			__func__, start, req.start, req.size,
+			total_size, req.idx, req.num);
+
+		err = dst_node_push(&req);
+		if (!err) {
+			total_size -= req.orig_size;
+
+			BUG_ON(total_size != 0);
+		}
+	}
+
+	dprintk("%s: end bio: %p, err: %d.\n", __func__, bio, err);
+	return err;
+}
+
+
+/*
+ * Distributed storage erquest processing function.
+ * It calls algorithm spcific remapping code only.
+ */
+static int dst_request(request_queue_t *q, struct bio *bio)
+{
+	struct dst_storage *st = q->queuedata;
+	int err;
+
+	dprintk("\n%s: start: st: %p, bio: %p, cnt: %u.\n",
+			__func__, st, bio, bio->bi_vcnt);
+
+	err = dst_remap(st, bio);
+
+	dprintk("%s: end: st: %p, bio: %p, err: %d.\n",
+			__func__, st, bio, err);
+	return 0;
+}
+
+static void dst_unplug(request_queue_t *q)
+{
+}
+
+static int dst_flush(request_queue_t *q, struct gendisk *disk, sector_t *sec)
+{
+	return 0;
+}
+
+static struct block_device_operations dst_blk_ops = {
+	.owner =	THIS_MODULE,
+};
+
+/*
+ * Block layer binding - disk is created when array is fully configured
+ * by userspace request.
+ */
+static int dst_create_disk(struct dst_storage *st)
+{
+	int err = -ENOMEM;
+
+	st->queue = blk_alloc_queue(GFP_KERNEL);
+	if (!st->queue)
+		goto err_out_exit;
+
+	st->queue->queuedata = st;
+	blk_queue_make_request(st->queue, dst_request);
+	blk_queue_bounce_limit(st->queue, BLK_BOUNCE_ANY);
+	st->queue->unplug_fn = dst_unplug;
+	st->queue->issue_flush_fn = dst_flush;
+
+	err = -EINVAL;
+	st->disk = alloc_disk(1);
+	if (!st->disk)
+		goto err_out_free_queue;
+
+	st->disk->major = dst_major;
+	st->disk->first_minor = (((unsigned long)st->disk) ^
+		(((unsigned long)st->disk) >> 31)) & 0xff;
+	st->disk->fops = &dst_blk_ops;
+	st->disk->queue = st->queue;
+	st->disk->private_data = st;
+	snprintf(st->disk->disk_name, sizeof(st->disk->disk_name),
+			"dst-%s-%d", st->name, st->disk->first_minor);
+
+	return 0;
+
+err_out_free_queue:
+	blk_cleanup_queue(st->queue);
+err_out_exit:
+	return err;
+}
+
+static void dst_remove_disk(struct dst_storage *st)
+{
+	del_gendisk(st->disk);
+	put_disk(st->disk);
+	blk_cleanup_queue(st->queue);
+}
+
+/*
+ * Shows node name in sysfs.
+ */
+static ssize_t dst_name_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+
+	return sprintf(buf, "%s\n", st->name);
+}
+
+static void dst_remove_all_nodes(struct dst_storage *st)
+{
+	struct dst_node *n, *node, *tmp;
+	struct rb_node *rb_node;
+
+	mutex_lock(&st->tree_lock);
+	while ((rb_node = rb_first(&st->tree_root)) != NULL) {
+		n = rb_entry(rb_node, struct dst_node, tree_node);
+		dprintk("%s: n: %p, start: %llu, size: %llu.\n",
+				__func__, n, n->start, n->size);
+		rb_erase(&n->tree_node, &st->tree_root);
+		if (!n->shared_head && atomic_read(&n->shared_num)) {
+			list_for_each_entry_safe(node, tmp, &n->shared, shared) {
+				list_del_rcu(&node->shared);
+				atomic_dec(&node->shared_head->refcnt);
+				node->shared_head = NULL;
+				dst_node_put(node);
+			}
+		}
+		dst_node_put(n);
+	}
+	mutex_unlock(&st->tree_lock);
+}
+
+/*
+ * Shows node layout in syfs.
+ */
+static ssize_t dst_nodes_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+	int size = PAGE_CACHE_SIZE, sz;
+	struct dst_node *n;
+	struct rb_node *rb_node;
+
+	sz = sprintf(buf, "sectors (start [size]): ");
+	size -= sz;
+	buf += sz;
+
+	mutex_lock(&st->tree_lock);
+	for (rb_node = rb_first(&st->tree_root); rb_node;
+			rb_node = rb_next(rb_node)) {
+		n = rb_entry(rb_node, struct dst_node, tree_node);
+		if (size < 32)
+			break;
+		sz = sprintf(buf, "%llu [%llu]", n->start, n->size);
+		buf += sz;
+		size -= sz;
+
+		if (!rb_next(rb_node))
+			break;
+
+		sz = sprintf(buf, " | ");
+		buf += sz;
+		size -= sz;
+	}
+	mutex_unlock(&st->tree_lock);
+	size -= sprintf(buf, "\n");
+	return PAGE_CACHE_SIZE - size;
+}
+
+/*
+ * Algorithm currently being used by given storage.
+ */
+static ssize_t dst_alg_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+	return sprintf(buf, "%s\n", st->alg->name);
+}
+
+/*
+ * Writing to this sysfs file allows to remove all nodes
+ * and storage itself automatically.
+ */
+static ssize_t dst_remove_nodes(struct device *dev,
+		struct device_attribute *attr,
+		const char *buf, size_t count)
+{
+	struct dst_storage *st = container_of(dev, struct dst_storage, device);
+	dst_remove_all_nodes(st);
+	return count;
+}
+
+static DEVICE_ATTR(name, 0444, dst_name_show, NULL);
+static DEVICE_ATTR(nodes, 0444, dst_nodes_show, NULL);
+static DEVICE_ATTR(alg, 0444, dst_alg_show, NULL);
+static DEVICE_ATTR(remove_all_nodes, 0644, NULL, dst_remove_nodes);
+
+static int dst_create_storage_attributes(struct dst_storage *st)
+{
+	int err;
+
+	err = device_create_file(&st->device, &dev_attr_name);
+	err = device_create_file(&st->device, &dev_attr_nodes);
+	err = device_create_file(&st->device, &dev_attr_alg);
+	err = device_create_file(&st->device, &dev_attr_remove_all_nodes);
+	return 0;
+}
+
+static void dst_remove_storage_attributes(struct dst_storage *st)
+{
+	device_remove_file(&st->device, &dev_attr_name);
+	device_remove_file(&st->device, &dev_attr_nodes);
+	device_remove_file(&st->device, &dev_attr_alg);
+	device_remove_file(&st->device, &dev_attr_remove_all_nodes);
+}
+
+static void dst_storage_sysfs_exit(struct dst_storage *st)
+{
+	dst_remove_storage_attributes(st);
+	device_unregister(&st->device);
+}
+
+static int dst_storage_sysfs_init(struct dst_storage *st)
+{
+	int err;
+
+	memcpy(&st->device, &dst_dev, sizeof(struct device));
+	snprintf(st->device.bus_id, sizeof(st->device.bus_id), "%s", st->name);
+
+	err = device_register(&st->device);
+	if (err) {
+		dprintk(KERN_ERR "Failed to register dst device %s, err: %d.\n",
+			st->name, err);
+		goto err_out_exit;
+	}
+
+	dst_create_storage_attributes(st);
+
+	return 0;
+
+err_out_exit:
+	return err;
+}
+
+/*
+ * This functions shows size and start of the appropriate node.
+ * Both are in sectors.
+ */
+static ssize_t dst_show_start(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+
+	return sprintf(buf, "%llu\n", n->start);
+}
+
+static ssize_t dst_show_size(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+
+	return sprintf(buf, "%llu\n", n->size);
+}
+
+/*
+ * Shows type of the remote node - device major/minor number
+ * for local nodes and address (af_inet ipv4/ipv6 only) for remote nodes.
+ */
+static ssize_t dst_show_type(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dst_node *n = container_of(dev, struct dst_node, device);
+	struct sockaddr addr;
+	struct socket *sock;
+	int addrlen;
+
+	if (!n->state && !n->bdev)
+		return 0;
+
+	if (n->bdev)
+		return sprintf(buf, "L: %d:%d\n",
+				MAJOR(n->bdev->bd_dev), MINOR(n->bdev->bd_dev));
+
+	sock = n->state->socket;
+	if (sock->ops->getname(sock, &addr, &addrlen, 2))
+		return 0;
+
+	if (sock->ops->family == AF_INET) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+		return sprintf(buf, "R: %u.%u.%u.%u:%d\n",
+			NIPQUAD(sin->sin_addr.s_addr), ntohs(sin->sin_port));
+	} else if (sock->ops->family == AF_INET6) {
+		struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+		return sprintf(buf,
+			"R: %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d\n",
+			NIP6(sin->sin6_addr), ntohs(sin->sin6_port));
+	}
+	return 0;
+}
+
+static DEVICE_ATTR(start, 0444, dst_show_start, NULL);
+static DEVICE_ATTR(size, 0444, dst_show_size, NULL);
+static DEVICE_ATTR(type, 0444, dst_show_type, NULL);
+
+static int dst_create_node_attributes(struct dst_node *n)
+{
+	int err;
+
+	err = device_create_file(&n->device, &dev_attr_start);
+	err = device_create_file(&n->device, &dev_attr_size);
+	err = device_create_file(&n->device, &dev_attr_type);
+	return 0;
+}
+
+static void dst_remove_node_attributes(struct dst_node *n)
+{
+	device_remove_file(&n->device, &dev_attr_start);
+	device_remove_file(&n->device, &dev_attr_size);
+	device_remove_file(&n->device, &dev_attr_type);
+}
+
+static void dst_node_sysfs_exit(struct dst_node *n)
+{
+	if (n->device.parent == &n->st->device) {
+		dst_remove_node_attributes(n);
+		device_unregister(&n->device);
+		n->device.parent = NULL;
+	}
+}
+
+static int dst_node_sysfs_init(struct dst_node *n)
+{
+	int err;
+
+	memcpy(&n->device, &dst_node_dev, sizeof(struct device));
+
+	n->device.parent = &n->st->device;
+
+	snprintf(n->device.bus_id, sizeof(n->device.bus_id),
+			"n-%llu-%p", n->start, n);
+	err = device_register(&n->device);
+	if (err) {
+		dprintk(KERN_ERR "Failed to register node, err: %d.\n", err);
+		goto err_out_exit;
+	}
+
+	dst_create_node_attributes(n);
+
+	return 0;
+
+err_out_exit:
+	n->device.parent = NULL;
+	return err;
+}
+
+/*
+ * Gets a reference for given storage, if
+ * storage with given name and algorithm being used
+ * does not exist it is created.
+ */
+static struct dst_storage *dst_get_storage(char *name, char *aname, int alloc)
+{
+	struct dst_storage *st, *rst = NULL;
+	int err;
+	struct dst_alg *alg;
+
+	mutex_lock(&dst_storage_lock);
+	list_for_each_entry(st, &dst_storage_list, entry) {
+		if (!strcmp(name, st->name) && !strcmp(st->alg->name, aname)) {
+			rst = st;
+			atomic_inc(&st->refcnt);
+			break;
+		}
+	}
+	mutex_unlock(&dst_storage_lock);
+
+	if (rst || !alloc)
+		return rst;
+
+	st = kzalloc(sizeof(struct dst_storage), GFP_KERNEL);
+	if (!st)
+		return NULL;
+
+	mutex_init(&st->tree_lock);
+	/*
+	 * One for storage itself,
+	 * another one for attached node below.
+	 */
+	atomic_set(&st->refcnt, 2);
+	snprintf(st->name, DST_NAMELEN, "%s", name);
+	st->tree_root.rb_node = NULL;
+
+	err = dst_storage_sysfs_init(st);
+	if (err)
+		goto err_out_free;
+
+	err = dst_create_disk(st);
+	if (err)
+		goto err_out_sysfs_exit;
+
+	mutex_lock(&dst_alg_lock);
+	list_for_each_entry(alg, &dst_alg_list, entry) {
+		if (!strcmp(alg->name, aname)) {
+			atomic_inc(&alg->refcnt);
+			try_module_get(alg->ops->owner);
+			st->alg = alg;
+			break;
+		}
+	}
+	mutex_unlock(&dst_alg_lock);
+
+	if (!st->alg)
+		goto err_out_disk_remove;
+
+	mutex_lock(&dst_storage_lock);
+	list_add_tail(&st->entry, &dst_storage_list);
+	mutex_unlock(&dst_storage_lock);
+
+	return st;
+
+err_out_disk_remove:
+	dst_remove_disk(st);
+err_out_sysfs_exit:
+	dst_storage_sysfs_init(st);
+err_out_free:
+	kfree(st);
+	return NULL;
+}
+
+/*
+ * Allows to allocate and add new algorithm by external modules.
+ */
+struct dst_alg *dst_alloc_alg(char *name, struct dst_alg_ops *ops)
+{
+	struct dst_alg *alg;
+
+	alg = kzalloc(sizeof(struct dst_alg), GFP_KERNEL);
+	if (!alg)
+		return NULL;
+	snprintf(alg->name, DST_NAMELEN, "%s", name);
+	atomic_set(&alg->refcnt, 1);
+	alg->ops = ops;
+
+	mutex_lock(&dst_alg_lock);
+	list_add_tail(&alg->entry, &dst_alg_list);
+	mutex_unlock(&dst_alg_lock);
+
+	return alg;
+}
+EXPORT_SYMBOL_GPL(dst_alloc_alg);
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+	dprintk("%s: alg: %p.\n", __func__, alg);
+	kfree(alg);
+}
+
+/*
+ * Algorithm is never freed directly,
+ * since its module reference counter is increased
+ * by storage when it is created - just like network protocols.
+ */
+static inline void dst_put_alg(struct dst_alg *alg)
+{
+	dprintk("%s: alg: %p, refcnt: %d.\n",
+			__func__, alg, atomic_read(&alg->refcnt));
+	module_put(alg->ops->owner);
+	if (atomic_dec_and_test(&alg->refcnt))
+		dst_free_alg(alg);
+}
+
+/*
+ * Removing algorithm from main list of supported algorithms.
+ */
+void dst_remove_alg(struct dst_alg *alg)
+{
+	mutex_lock(&dst_alg_lock);
+	list_del_init(&alg->entry);
+	mutex_unlock(&dst_alg_lock);
+
+	dst_put_alg(alg);
+}
+EXPORT_SYMBOL_GPL(dst_remove_alg);
+
+static void dst_cleanup_node(struct dst_node *n)
+{
+	struct dst_storage *st = n->st;
+
+	dprintk("%s: node: %p.\n", __func__, n);
+
+	n->st->alg->ops->del_node(n);
+
+	if (n->shared_head) {
+		mutex_lock(&st->tree_lock);
+		list_del_rcu(&n->shared);
+		mutex_unlock(&st->tree_lock);
+
+		atomic_dec(&n->shared_head->refcnt);
+		dst_node_put(n->shared_head);
+		n->shared_head = NULL;
+	}
+
+	if (n->cleanup)
+		n->cleanup(n);
+	dst_node_sysfs_exit(n);
+	kfree(n);
+}
+
+static void dst_free_storage(struct dst_storage *st)
+{
+	dprintk("%s: st: %p.\n", __func__, st);
+
+	BUG_ON(rb_first(&st->tree_root) != NULL);
+
+	dst_put_alg(st->alg);
+	kfree(st);
+}
+
+static inline void dst_put_storage(struct dst_storage *st)
+{
+	dprintk("%s: st: %p, refcnt: %d.\n",
+			__func__, st, atomic_read(&st->refcnt));
+	if (atomic_dec_and_test(&st->refcnt))
+		dst_free_storage(st);
+}
+
+void dst_node_put(struct dst_node *n)
+{
+	dprintk("%s: node: %p, start: %llu, size: %llu, refcnt: %d.\n",
+			__func__, n, n->start, n->size,
+			atomic_read(&n->refcnt));
+
+	if (atomic_dec_and_test(&n->refcnt)) {
+		struct dst_storage *st = n->st;
+
+		dprintk("%s: freeing node: %p, start: %llu, size: %llu, "
+				"refcnt: %d.\n",
+				__func__, n, n->start, n->size,
+				atomic_read(&n->refcnt));
+
+		dst_cleanup_node(n);
+		dst_put_storage(st);
+	}
+}
+EXPORT_SYMBOL_GPL(dst_node_put);
+
+static inline int dst_compare_id(struct dst_node *old, u64 new)
+{
+	if (old->start + old->size <= new)
+		return 1;
+	if (old->start > new)
+		return -1;
+	return 0;
+}
+
+/*
+ * Tree of of the nodes, which form the storage.
+ * Tree is indexed via start of the node and its size.
+ * Comparison function above.
+ */
+struct dst_node *dst_storage_tree_search(struct dst_storage *st, u64 start)
+{
+	struct rb_node *n = st->tree_root.rb_node;
+	struct dst_node *dn;
+	int cmp;
+
+	while (n) {
+		dn = rb_entry(n, struct dst_node, tree_node);
+
+		cmp = dst_compare_id(dn, start);
+		dprintk("%s: tree: %llu-%llu, new: %llu.\n",
+			__func__, dn->start, dn->start+dn->size, start);
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else {
+			return dst_node_get(dn);
+		}
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(dst_storage_tree_search);
+
+/*
+ * This function allows to remove a node with given start address
+ * from the storage.
+ */
+static struct dst_node *dst_storage_tree_del(struct dst_storage *st, u64 start)
+{
+	struct dst_node *n = dst_storage_tree_search(st, start);
+
+	if (!n)
+		return NULL;
+
+	rb_erase(&n->tree_node, &st->tree_root);
+	dst_node_put(n);
+	return n;
+}
+
+/*
+ * This function allows to add given node to the storage.
+ * Returns -EEXIST if the same area is already covered by another node.
+ * This is return must be checked for redundancy algorithms.
+ */
+static struct dst_node *dst_storage_tree_add(struct dst_node *new,
+		struct dst_storage *st)
+{
+	struct rb_node **n = &st->tree_root.rb_node, *parent = NULL;
+	struct dst_node *dn;
+	int cmp;
+
+	while (*n) {
+		parent = *n;
+		dn = rb_entry(parent, struct dst_node, tree_node);
+
+		cmp = dst_compare_id(dn, new->start);
+		dprintk("%s: tree: %llu-%llu, new: %llu.\n",
+				__func__, dn->start, dn->start+dn->size,
+				new->start);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			return dn;
+		}
+	}
+
+	rb_link_node(&new->tree_node, parent, n);
+	rb_insert_color(&new->tree_node, &st->tree_root);
+
+	return NULL;
+}
+
+/*
+ * This function finds devices major/minor numbers for given pathname.
+ */
+static int dst_lookup_device(const char *path, dev_t *dev)
+{
+	int err;
+	struct nameidata nd;
+	struct inode *inode;
+
+	err = path_lookup(path, LOOKUP_FOLLOW, &nd);
+	if (err)
+		return err;
+
+	inode = nd.dentry->d_inode;
+	if (!inode) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	if (!S_ISBLK(inode->i_mode)) {
+		err = -ENOTBLK;
+		goto out;
+	}
+
+	*dev = inode->i_rdev;
+
+out:
+	path_release(&nd);
+	return err;
+}
+
+/*
+ * Cleanup routings for local, local exporting and remote nodes.
+ */
+static void dst_cleanup_remote(struct dst_node *n)
+{
+	if (n->state) {
+		kst_state_exit(n->state);
+		n->state = NULL;
+	}
+}
+
+static void dst_cleanup_local(struct dst_node *n)
+{
+	if (n->bdev) {
+		sync_blockdev(n->bdev);
+		blkdev_put(n->bdev);
+		n->bdev = NULL;
+	}
+}
+
+static void dst_cleanup_local_export(struct dst_node *n)
+{
+	dst_cleanup_local(n);
+	dst_cleanup_remote(n);
+}
+
+/*
+ * Setup routings for local, local exporting and remote nodes.
+ */
+static int dst_setup_local(struct dst_node *n, struct dst_ctl *ctl,
+		struct dst_local_ctl *l)
+{
+	dev_t dev;
+	int err;
+
+	err = dst_lookup_device(l->name, &dev);
+	if (err)
+		return err;
+
+	n->bdev = open_by_devnum(dev, FMODE_READ|FMODE_WRITE);
+	if (!n->bdev)
+		return -ENODEV;
+
+	if (!n->size)
+		n->size = get_capacity(n->bdev->bd_disk);
+
+	return 0;
+}
+
+static int dst_setup_local_export(struct dst_node *n, struct dst_ctl *ctl,
+		struct dst_le_template *tmp)
+{
+	int err;
+
+	err = dst_setup_local(n, ctl, &tmp->le.lctl);
+	if (err)
+		goto err_out_exit;
+
+	n->state = kst_listener_state_init(n, tmp);
+	if (IS_ERR(n->state)) {
+		err = PTR_ERR(n->state);
+		goto err_out_cleanup;
+	}
+
+	return 0;
+
+err_out_cleanup:
+	dst_cleanup_local(n);
+err_out_exit:
+	return err;
+}
+
+static int dst_request_remote_config(struct dst_node *n, struct socket *sock)
+{
+	struct dst_remote_request cfg;
+	struct msghdr msg;
+	struct kvec iov;
+	int err;
+
+	memset(&cfg, 0, sizeof(struct dst_remote_request));
+	cfg.cmd = cpu_to_be32(DST_REMOTE_CFG);
+
+	iov.iov_base = &cfg;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL;
+
+	err = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		return err;
+	}
+
+	iov.iov_base = &cfg;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL;
+
+	err = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len, msg.msg_flags);
+	if (err <= 0) {
+		if (err == 0)
+			err = -ECONNRESET;
+		return err;
+	}
+
+	if (be32_to_cpu(cfg.cmd) != DST_REMOTE_CFG)
+		return -EINVAL;
+
+	n->size = be64_to_cpu(cfg.sector);
+
+	return 0;
+}
+
+static int dst_setup_remote(struct dst_node *n, struct dst_ctl *ctl,
+		struct dst_remote_ctl *r)
+{
+	int err;
+	struct socket *sock;
+
+	err = sock_create(r->addr.sa_family, r->type, r->proto, &sock);
+	if (err < 0)
+		goto err_out_exit;
+
+	sock->sk->sk_sndtimeo = sock->sk->sk_rcvtimeo =
+		msecs_to_jiffies(DST_DEFAULT_TIMEO);
+
+	err = sock->ops->connect(sock, (struct sockaddr *)&r->addr,
+			r->addr.sa_data_len, 0);
+	if (err)
+		goto err_out_destroy;
+
+	if (!n->size) {
+		err = dst_request_remote_config(n, sock);
+		if (err)
+			goto err_out_destroy;
+	}
+
+	n->state = kst_data_state_init(n, sock);
+	if (IS_ERR(n->state)) {
+		err = PTR_ERR(n->state);
+		goto err_out_destroy;
+	}
+
+	return 0;
+
+err_out_destroy:
+	sock_release(sock);
+err_out_exit:
+	return err;
+}
+
+/*
+ * This function inserts node into storage.
+ */
+static int dst_insert_node(struct dst_node *n)
+{
+	int err;
+	struct dst_storage *st = n->st;
+	struct dst_node *dn;
+
+	err = st->alg->ops->add_node(n);
+	if (err)
+		return err;
+
+	err = dst_node_sysfs_init(n);
+	if (err)
+		goto err_out_remove_node;
+
+	mutex_lock(&st->tree_lock);
+	dn = dst_storage_tree_add(n, st);
+	if (dn) {
+		err = -EINVAL;
+		dn->size = st->disk_size;
+		if (dn->start == n->start) {
+			err = 0;
+			n->shared_head = dst_node_get(dn);
+			atomic_inc(&dn->shared_num);
+			list_add_tail_rcu(&n->shared, &dn->shared);
+		}
+	}
+	mutex_unlock(&st->tree_lock);
+	if (err)
+		goto err_out_sysfs_exit;
+
+	if (n->priv_callback)
+		n->priv_callback(n);
+
+	return 0;
+
+err_out_sysfs_exit:
+	dst_node_sysfs_exit(n);
+err_out_remove_node:
+	st->alg->ops->del_node(n);
+	return err;
+}
+
+static struct dst_node *dst_alloc_node(struct dst_ctl *ctl,
+		void (*cleanup)(struct dst_node *))
+{
+	struct dst_storage *st;
+	struct dst_node *n;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 1);
+	if (!st)
+		goto err_out_exit;
+
+	n = kzalloc(sizeof(struct dst_node), GFP_KERNEL);
+	if (!n)
+		goto err_out_put_storage;
+
+	n->w = kst_main_worker;
+	n->st = st;
+	n->cleanup = cleanup;
+	n->start = ctl->start;
+	n->size = ctl->size;
+	INIT_LIST_HEAD(&n->shared);
+	n->shared_head = NULL;
+	atomic_set(&n->shared_num, 0);
+	atomic_set(&n->refcnt, 1);
+
+	return n;
+
+err_out_put_storage:
+	mutex_lock(&dst_storage_lock);
+	list_del_init(&st->entry);
+	mutex_unlock(&dst_storage_lock);
+
+	dst_put_storage(st);
+err_out_exit:
+	return NULL;
+}
+
+/*
+ * Control callback for userspace commands to setup
+ * different nodes and start/stop array.
+ */
+static int dst_add_remote(struct dst_ctl *ctl, void __user *data)
+{
+	struct dst_node *n;
+	int err;
+	struct dst_remote_ctl rctl;
+
+	if (copy_from_user(&rctl, data, sizeof(struct dst_remote_ctl)))
+		return -EFAULT;
+
+	n = dst_alloc_node(ctl, &dst_cleanup_remote);
+	if (!n)
+		return -ENOMEM;
+
+	err = dst_setup_remote(n, ctl, &rctl);
+	if (err < 0)
+		goto err_out_free;
+
+	err = dst_insert_node(n);
+	if (err)
+		goto err_out_free;
+
+	return 0;
+
+err_out_free:
+	dst_node_put(n);
+	return err;
+}
+
+static int dst_add_local_export(struct dst_ctl *ctl, void __user *data)
+{
+	struct dst_node *n;
+	int err;
+	struct dst_le_template tmp;
+
+	if (copy_from_user(&tmp.le, data, sizeof(struct dst_local_export_ctl)))
+		return -EFAULT;
+
+	tmp.data = data + sizeof(struct dst_local_export_ctl);
+
+	n = dst_alloc_node(ctl, &dst_cleanup_local_export);
+	if (!n)
+		return -EINVAL;
+
+	err = dst_setup_local_export(n, ctl, &tmp);
+	if (err < 0)
+		goto err_out_free;
+
+	err = dst_insert_node(n);
+	if (err)
+		goto err_out_free;
+
+
+	return 0;
+
+err_out_free:
+	dst_node_put(n);
+	return err;
+}
+
+static int dst_add_local(struct dst_ctl *ctl, void __user *data)
+{
+	struct dst_node *n;
+	int err;
+	struct dst_local_ctl lctl;
+
+	if (copy_from_user(&lctl, data, sizeof(struct dst_local_ctl)))
+		return -EFAULT;
+
+	n = dst_alloc_node(ctl, &dst_cleanup_local);
+	if (!n)
+		return -EINVAL;
+
+	err = dst_setup_local(n, ctl, &lctl);
+	if (err < 0)
+		goto err_out_free;
+
+	err = dst_insert_node(n);
+	if (err)
+		goto err_out_free;
+
+	return 0;
+
+err_out_free:
+	dst_node_put(n);
+	return err;
+}
+
+static int dst_del_node(struct dst_ctl *ctl, void __user *data)
+{
+	struct dst_node *n;
+	struct dst_storage *st;
+	int err = -ENODEV;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 0);
+	if (!st)
+		goto err_out_exit;
+
+	mutex_lock(&st->tree_lock);
+	n = dst_storage_tree_del(st, ctl->start);
+	mutex_unlock(&st->tree_lock);
+	if (!n)
+		goto err_out_put;
+
+	dst_node_put(n);
+	dst_put_storage(st);
+
+	return 0;
+
+err_out_put:
+	dst_put_storage(st);
+err_out_exit:
+	return err;
+}
+
+static int dst_start_storage(struct dst_ctl *ctl, void __user *data)
+{
+	struct dst_storage *st;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 0);
+	if (!st)
+		return -ENODEV;
+
+	mutex_lock(&st->tree_lock);
+	if (!(st->flags & DST_ST_STARTED)) {
+		set_capacity(st->disk, st->disk_size);
+		add_disk(st->disk);
+		st->flags |= DST_ST_STARTED;
+		dprintk("%s: STARTED st: %p, disk_size: %llu.\n",
+				__func__, st, st->disk_size);
+	}
+	mutex_unlock(&st->tree_lock);
+
+	dst_put_storage(st);
+
+	return 0;
+}
+
+static int dst_stop_storage(struct dst_ctl *ctl, void __user *data)
+{
+	struct dst_storage *st;
+
+	st = dst_get_storage(ctl->st, ctl->alg, 0);
+	if (!st)
+		return -ENODEV;
+
+	dprintk("%s: STOPPED storage: %s.\n", __func__, st->name);
+
+	dst_storage_sysfs_exit(st);
+
+	mutex_lock(&dst_storage_lock);
+	list_del_init(&st->entry);
+	mutex_unlock(&dst_storage_lock);
+
+	if (st->flags & DST_ST_STARTED)
+		dst_remove_disk(st);
+
+	dst_remove_all_nodes(st);
+	dst_put_storage(st); /* One reference got above */
+	dst_put_storage(st); /* Another reference set during initialization */
+
+	return 0;
+}
+
+typedef int (*dst_command_func)(struct dst_ctl *ctl, void __user *data);
+
+/*
+ * List of userspace commands.
+ */
+static dst_command_func dst_commands[] = {
+	[DST_ADD_REMOTE] = &dst_add_remote,
+	[DST_ADD_LOCAL] = &dst_add_local,
+	[DST_ADD_LOCAL_EXPORT] = &dst_add_local_export,
+	[DST_DEL_NODE] = &dst_del_node,
+	[DST_START_STORAGE] = &dst_start_storage,
+	[DST_STOP_STORAGE] = &dst_stop_storage,
+};
+
+/*
+ * Move to connector for configuration is in TODO list.
+ */
+static int dst_ioctl(struct inode *inode, struct file *file,
+		unsigned int command, unsigned long data)
+{
+	struct dst_ctl ctl;
+	unsigned int cmd = _IOC_NR(command);
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (_IOC_TYPE(command) != DST_IOCTL)
+		return -ENOTTY;
+
+	if (cmd >= DST_CMD_MAX)
+		return -EINVAL;
+
+	if (copy_from_user(&ctl, (void __user *)data, sizeof(struct dst_ctl)))
+		return -EFAULT;
+
+	data += sizeof(struct dst_ctl);
+
+	return dst_commands[cmd](&ctl, (void __user *)data);
+}
+
+static const struct file_operations dst_fops = {
+	.ioctl	 = dst_ioctl,
+	.owner	 = THIS_MODULE,
+};
+
+static struct miscdevice dst_misc = {
+	.minor 		= MISC_DYNAMIC_MINOR,
+	.name  		= DST_NAME,
+	.fops  		= &dst_fops
+};
+
+static int dst_sysfs_init(void)
+{
+	return bus_register(&dst_dev_bus_type);
+}
+
+static void dst_sysfs_exit(void)
+{
+	bus_unregister(&dst_dev_bus_type);
+}
+
+static int __devinit dst_sys_init(void)
+{
+	int err = -ENOMEM;
+
+	dst_request_cache = kmem_cache_create("dst", sizeof(struct dst_request),
+				       0, 0, NULL, NULL);
+	if (!dst_request_cache)
+		return -ENOMEM;
+
+	dst_bio_set = bioset_create(32, 32);
+	if (!dst_bio_set)
+		goto err_out_destroy;
+
+	err = register_blkdev(dst_major, DST_NAME);
+	if (err < 0)
+		goto err_out_destroy_bioset;
+	if (err)
+		dst_major = err;
+
+	err = dst_sysfs_init();
+	if (err)
+		goto err_out_unregister;
+
+	kst_main_worker = kst_worker_init(0);
+	if (IS_ERR(kst_main_worker)) {
+		err = PTR_ERR(kst_main_worker);
+		goto err_out_sysfs_exit;
+	}
+
+	err = misc_register(&dst_misc);
+	if (err)
+		goto err_out_worker_exit;
+
+	return 0;
+
+err_out_worker_exit:
+	kst_worker_exit(kst_main_worker);
+err_out_sysfs_exit:
+	dst_sysfs_exit();
+err_out_unregister:
+	unregister_blkdev(dst_major, DST_NAME);
+err_out_destroy_bioset:
+	bioset_free(dst_bio_set);
+err_out_destroy:
+	kmem_cache_destroy(dst_request_cache);
+	return err;
+}
+
+static void __devexit dst_sys_exit(void)
+{
+	misc_deregister(&dst_misc);
+	dst_sysfs_exit();
+	unregister_blkdev(dst_major, DST_NAME);
+	kst_exit_all();
+	bioset_free(dst_bio_set);
+	kmem_cache_destroy(dst_request_cache);
+}
+
+module_init(dst_sys_init);
+module_exit(dst_sys_exit);
+
+MODULE_DESCRIPTION("Distributed storage");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..b739402
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1609 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+	poll_table 		pt;
+	struct kst_state	*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+		int type, int proto, int backlog)
+{
+	int err;
+
+	err = sock_create(addr->sa_family, type, proto, &st->socket);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+			addr->sa_data_len);
+
+	err = st->socket->ops->listen(st->socket, backlog);
+	if (err)
+		goto err_out_release;
+
+	st->socket->sk->sk_allocation = GFP_NOIO;
+
+	return 0;
+
+err_out_release:
+	sock_release(st->socket);
+err_out_exit:
+	return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+	if (st->socket) {
+		sock_release(st->socket);
+		st->socket = NULL;
+	}
+}
+
+void kst_wake(struct kst_state *st)
+{
+	struct kst_worker *w = st->node->w;
+	unsigned long flags;
+
+	spin_lock_irqsave(&w->ready_lock, flags);
+	if (list_empty(&st->ready_entry))
+		list_add_tail(&st->ready_entry, &w->ready_list);
+	spin_unlock_irqrestore(&w->ready_lock, flags);
+
+	wake_up(&w->wait);
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+		int sync, void *key)
+{
+	struct kst_state *st = container_of(wait, struct kst_state, wait);
+	kst_wake(st);
+	return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+				 poll_table *pt)
+{
+	struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+	st->whead = whead;
+	init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+	add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+	if (st->whead) {
+		remove_wait_queue(st->whead, &st->wait);
+		st->whead = NULL;
+	}
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+	struct kst_state *st = req->state;
+
+	rb_erase(&req->request_entry, &st->request_root);
+	RB_CLEAR_NODE(&req->request_entry);
+	list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+	struct dst_request *req = NULL;
+
+	if (!list_empty(&st->request_list))
+		req = list_entry(st->request_list.next, struct dst_request,
+				request_list_entry);
+	return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	mutex_lock(&st->request_lock);
+	req = kst_req_first(st);
+	if (req)
+		kst_del_req(req);
+	mutex_unlock(&st->request_lock);
+	return req;
+}
+
+static inline int dst_compare_request_id(struct dst_request *old,
+		struct dst_request *new)
+{
+	int cmd = 0;
+
+	if (old->start + to_sector(old->orig_size) <= new->start)
+		cmd = 1;
+	if (old->start >= new->start + to_sector(new->orig_size))
+		cmd = -1;
+
+	dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+		"new: op: %lu, start: %llu, size: %llu, off: %u, cmp: %d.\n",
+		__func__, bio_rw(old->bio), old->start, old->orig_size,
+		old->offset,
+		bio_rw(new->bio), new->start, new->orig_size,
+		new->offset, cmd);
+
+	return cmd;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+	struct rb_node **n = &st->request_root.rb_node, *parent = NULL;
+	struct dst_request *old = NULL;
+	int cmp, err = 0;
+
+	while (*n) {
+		parent = *n;
+		old = rb_entry(parent, struct dst_request, request_entry);
+
+		cmp = dst_compare_request_id(old, req);
+		if (cmp < 0)
+			n = &parent->rb_left;
+		else if (cmp > 0)
+			n = &parent->rb_right;
+		else {
+			printk("%s: [%c] old_req: %p, start: %llu, "
+					"size: %llu.\n",
+					__func__, 
+					(bio_rw(old->bio) == WRITE)?'W':'R',
+					old, old->start, old->orig_size);
+			err = -EEXIST;
+			break;
+		}
+	}
+
+	if (!err) {
+		rb_link_node(&req->request_entry, parent, n);
+		rb_insert_color(&req->request_entry, &st->request_root);
+	}
+
+	if (req->size != req->orig_size)
+		list_add(&req->request_list_entry, &st->request_list);
+	else
+		list_add_tail(&req->request_list_entry, &st->request_list);
+	return err;
+}
+EXPORT_SYMBOL_GPL(kst_enqueue_req);
+
+/*
+ * BIOs for local exporting node are freed via this function.
+ */
+static void kst_export_put_bio(struct bio *bio)
+{
+	int i;
+	struct bio_vec *bv;
+
+	dprintk("%s: bio: %p, size: %u, idx: %d, num: %d.\n",
+			__func__, bio, bio->bi_size, bio->bi_idx,
+			bio->bi_vcnt);
+
+	bio_for_each_segment(bv, bio, i)
+		__free_page(bv->bv_page);
+	bio_put(bio);
+}
+
+/*
+ * This is a generic request completion function for requests,
+ * queued for async processing.
+ * If it is local export node, state machine is different,
+ * see details below.
+ */
+void kst_complete_req(struct dst_request *req, int err)
+{
+	dprintk("%s: bio: %p, req: %p, size: %llu, orig_size: %llu, "
+			"bi_size: %u, err: %d, flags: %u.\n",
+			__func__, req->bio, req, req->size, req->orig_size,
+			req->bio->bi_size, err, req->flags);
+
+	if (req->flags & DST_REQ_EXPORT) {
+		if (req->flags & DST_REQ_EXPORT_WRITE) {
+			req->bio->bi_rw = WRITE;
+			generic_make_request(req->bio);
+		} else
+			kst_export_put_bio(req->bio);
+	} else {
+		req->bio_endio(req, err);
+	}
+	dst_free_request(req);
+}
+EXPORT_SYMBOL_GPL(kst_complete_req);
+
+static void kst_flush_requests(struct kst_state *st)
+{
+	struct dst_request *req;
+
+	while ((req = kst_dequeue_req(st)) != NULL)
+		kst_complete_req(req, -EIO);
+}
+
+static int kst_poll_init(struct kst_state *st)
+{
+	struct kst_poll_helper ph;
+
+	ph.st = st;
+	init_poll_funcptr(&ph.pt, &kst_queue_func);
+
+	st->socket->ops->poll(NULL, st->socket, &ph.pt);
+	return 0;
+}
+
+/*
+ * Main state creation function.
+ * It creates new state according to given operations
+ * and links it into worker structure and node.
+ */
+static struct kst_state *kst_state_init(struct dst_node *node,
+		unsigned int permissions,
+		struct kst_state_ops *ops, void *data)
+{
+	struct kst_state *st;
+	int err;
+
+	st = kzalloc(sizeof(struct kst_state), GFP_KERNEL);
+	if (!st)
+		return ERR_PTR(-ENOMEM);
+
+	st->permissions = permissions;
+	st->node = node;
+	st->ops = ops;
+	INIT_LIST_HEAD(&st->ready_entry);
+	INIT_LIST_HEAD(&st->entry);
+	st->request_root.rb_node = NULL;
+	INIT_LIST_HEAD(&st->request_list);
+	mutex_init(&st->request_lock);
+
+	err = st->ops->init(st, data);
+	if (err)
+		goto err_out_free;
+	mutex_lock(&node->w->state_mutex);
+	list_add_tail(&st->entry, &node->w->state_list);
+	mutex_unlock(&node->w->state_mutex);
+
+	kst_wake(st);
+
+	return st;
+
+err_out_free:
+	kfree(st);
+	return ERR_PTR(err);
+}
+
+/*
+ * This function is called when node is removed,
+ * or when state is destroyed for connected to local exporting
+ * node client.
+ */
+void kst_state_exit(struct kst_state *st)
+{
+	struct kst_worker *w = st->node->w;
+
+	dprintk("%s: st: %p.\n", __func__, st);
+
+	mutex_lock(&w->state_mutex);
+	list_del_init(&st->entry);
+	mutex_unlock(&w->state_mutex);
+
+	st->ops->exit(st);
+
+	st->node->state = NULL;
+
+	kfree(st);
+}
+
+static int kst_error(struct kst_state *st, int err)
+{
+	if ((err == -ECONNRESET || err == -EPIPE) && st->ops->recovery(st, err))
+		err = st->ops->recovery(st, err);
+
+	return st->node->st->alg->ops->error(st, err);
+}
+
+/*
+ * This is main state processing function.
+ * It tries to complete request and invoke appropriate
+ * callbacks in case of errors or successfull operation finish.
+ */
+static int kst_thread_process_state(struct kst_state *st)
+{
+	int err, empty;
+	unsigned int revents;
+	struct dst_request *req, *tmp;
+
+	mutex_lock(&st->request_lock);
+	if (st->ops->ready) {
+		err = st->ops->ready(st);
+		if (err) {
+			mutex_unlock(&st->request_lock);
+			if (err < 0)
+				kst_state_exit(st);
+			return err;
+		}
+	}
+
+	err = 0;
+	empty = 1;
+	req = NULL;
+	list_for_each_entry_safe(req, tmp, &st->request_list,
+			request_list_entry) {
+		empty = 0;
+		revents = st->socket->ops->poll(st->socket->file,
+				st->socket, NULL);
+		dprintk("\n%s: st: %p, revents: %x.\n", __func__, st, revents);
+		if (!revents)
+			break;
+		err = req->callback(req, revents);
+		dprintk("%s: callback returned, st: %p, err: %d.\n",
+				__func__, st, err);
+		if (err)
+			break;
+	}
+	mutex_unlock(&st->request_lock);
+
+	dprintk("%s: req: %p, err: %d.\n", __func__, req, err);
+	if (err < 0) {
+		err = kst_error(st, err);
+		if (err && (st != st->node->state)) {
+			dprintk("%s: err: %d, st: %p, node->state: %p.\n",
+					__func__, err, st, st->node->state);
+			/*
+			 * Accepted client has state not related to storage
+			 * node, so it must be freed explicitely.
+			 */
+
+			kst_state_exit(st);
+			return err;
+		}
+
+		kst_wake(st);
+	}
+
+	if (list_empty(&st->request_list) && !empty)
+		kst_wake(st);
+
+	return err;
+}
+
+/*
+ * Main worker thread - one per storage.
+ */
+static int kst_thread_func(void *data)
+{
+	struct kst_worker *w = data;
+	struct kst_state *st;
+	unsigned long flags;
+	int err = 0;
+
+	while (!kthread_should_stop()) {
+		wait_event_interruptible_timeout(w->wait,
+				!list_empty(&w->ready_list) ||
+				kthread_should_stop(),
+				HZ);
+
+		st = NULL;
+		spin_lock_irqsave(&w->ready_lock, flags);
+		if (!list_empty(&w->ready_list)) {
+			st = list_entry(w->ready_list.next, struct kst_state,
+					ready_entry);
+			list_del_init(&st->ready_entry);
+		}
+		spin_unlock_irqrestore(&w->ready_lock, flags);
+
+		if (!st)
+			continue;
+
+		err = kst_thread_process_state(st);
+	}
+
+	return err;
+}
+
+/*
+ * Worker initialization - this object will host andprocess all states,
+ * which in turn host requests for remote targets.
+ */
+struct kst_worker *kst_worker_init(int id)
+{
+	struct kst_worker *w;
+	int err;
+
+	w = kzalloc(sizeof(struct kst_worker), GFP_KERNEL);
+	if (!w)
+		return ERR_PTR(-ENOMEM);
+
+	w->id = id;
+	init_waitqueue_head(&w->wait);
+	spin_lock_init(&w->ready_lock);
+	mutex_init(&w->state_mutex);
+
+	INIT_LIST_HEAD(&w->ready_list);
+	INIT_LIST_HEAD(&w->state_list);
+
+	w->req_pool = mempool_create_slab_pool(256, dst_request_cache);
+	if (!w->req_pool) {
+		err = -ENOMEM;
+		goto err_out_free;
+	}
+
+	w->thread = kthread_run(&kst_thread_func, w, "kst%d", w->id);
+	if (IS_ERR(w->thread)) {
+		err = PTR_ERR(w->thread);
+		goto err_out_destroy;
+	}
+
+	mutex_lock(&kst_worker_mutex);
+	list_add_tail(&w->entry, &kst_worker_list);
+	mutex_unlock(&kst_worker_mutex);
+
+	return w;
+
+err_out_destroy:
+	mempool_destroy(w->req_pool);
+err_out_free:
+	kfree(w);
+	return ERR_PTR(err);
+}
+
+void kst_worker_exit(struct kst_worker *w)
+{
+	struct kst_state *st, *n;
+
+	mutex_lock(&kst_worker_mutex);
+	list_del(&w->entry);
+	mutex_unlock(&kst_worker_mutex);
+
+	kthread_stop(w->thread);
+
+	list_for_each_entry_safe(st, n, &w->state_list, entry) {
+		kst_state_exit(st);
+	}
+
+	mempool_destroy(w->req_pool);
+	kfree(w);
+}
+
+/*
+ * Common state exit callback.
+ * Removes itself from worker's list of states,
+ * releases socket and flushes all requests.
+ */
+static void kst_common_exit(struct kst_state *st)
+{
+	unsigned long flags;
+
+	dprintk("%s: st: %p.\n", __func__, st);
+	kst_poll_exit(st);
+
+	spin_lock_irqsave(&st->node->w->ready_lock, flags);
+	list_del_init(&st->ready_entry);
+	spin_unlock_irqrestore(&st->node->w->ready_lock, flags);
+
+	kst_sock_release(st);
+	kst_flush_requests(st);
+}
+
+/*
+ * Listen socket contains security attributes in request_list,
+ * so it can not be flushed via usual way.
+ */
+static void kst_listen_flush(struct kst_state *st)
+{
+	struct dst_secure *s, *tmp;
+
+	list_for_each_entry_safe(s, tmp, &st->request_list, sec_entry) {
+		list_del(&s->sec_entry);
+		kfree(s);
+	}
+}
+
+static void kst_listen_exit(struct kst_state *st)
+{
+	kst_listen_flush(st);
+	kst_common_exit(st);
+}
+
+/*
+ * Header sending function - may block.
+ */
+static int kst_data_send_header(struct kst_state *st,
+		struct dst_remote_request *r)
+{
+	struct msghdr msg;
+	struct kvec iov;
+
+	iov.iov_base = r;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+	return kernel_sendmsg(st->socket, &msg, &iov, 1, iov.iov_len);
+}
+
+/*
+ * BIO vector receiving function - does not block, but may sleep because
+ * of scheduling policy.
+ */
+static int kst_data_recv_bio_vec(struct kst_state *st, struct bio_vec *bv,
+		unsigned int offset, unsigned int size)
+{
+	struct msghdr msg;
+	struct kvec iov;
+	void *kaddr;
+	int err;
+
+	kaddr = kmap(bv->bv_page);
+
+	iov.iov_base = kaddr + bv->bv_offset + offset;
+	iov.iov_len = size;
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
+
+	err = kernel_recvmsg(st->socket, &msg, &iov, 1, iov.iov_len,
+			msg.msg_flags);
+	kunmap(bv->bv_page);
+
+	return err;
+}
+
+/*
+ * BIO vector sending function - does not block, but may sleep because
+ * of scheduling policy.
+ */
+static int kst_data_send_bio_vec(struct kst_state *st, struct bio_vec *bv,
+		unsigned int offset, unsigned int size)
+{
+	return kernel_sendpage(st->socket, bv->bv_page,
+			bv->bv_offset + offset, size,
+			MSG_DONTWAIT | MSG_NOSIGNAL);
+}
+
+typedef int (*kst_data_process_bio_vec_t)(struct kst_state *st,
+		struct bio_vec *bv, unsigned int offset, unsigned int size);
+
+/*
+ * @req: processing request.
+ * Contains BIO and all related to its processing info.
+ *
+ * This function sends or receives requested number of pages from given BIO.
+ *
+ * In case of errors negative return value is returned and @size,
+ * @index and @off are set to the:
+ * - number of bytes not yet processed (i.e. the rest of the bytes to be
+ *   processed).
+ * - index of the last bio_vec started to be processed (header sent).
+ * - offset of the first byte to be processed in the bio_vec.
+ *
+ * If there are no errors, zero is returned.
+ * -EAGAIN is not an error and is transformed into zero return value,
+ * called must check if @size is zero, in that case whole BIO is processed
+ * and thus req->bio_endio() can be called, othervise new request must be allocated
+ * to be processed later.
+ */
+static int kst_data_process_bio(struct dst_request *req)
+{
+	int err = -ENOSPC, partial = (req->size != req->orig_size);
+	struct dst_remote_request r;
+	kst_data_process_bio_vec_t func;
+	unsigned int cur_size;
+
+	r.flags = cpu_to_be32(((unsigned long)req->bio) & 0xffffffff);
+
+	if (bio_rw(req->bio) == WRITE) {
+		r.cmd = cpu_to_be32(DST_WRITE);
+		func = kst_data_send_bio_vec;
+	} else {
+		r.cmd = cpu_to_be32(DST_READ);
+		func = kst_data_recv_bio_vec;
+	}
+
+	dprintk("%s: start: [%c], start: %llu, idx: %d, num: %d, "
+			"size: %llu, offset: %u.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size, req->offset);
+
+	while (req->idx < req->num) {
+		struct bio_vec *bv = bio_iovec_idx(req->bio, req->idx);
+
+		cur_size = min_t(u64, bv->bv_len - req->offset, req->size);
+
+		if (cur_size == 0) {
+			printk("%s: %d/%d: start: %llu, "
+				"bv_offset: %u, bv_len: %u, "
+				"req_offset: %u, req_size: %llu, "
+				"req: %p, bio: %p, err: %d.\n",
+				__func__, req->idx, req->num, req->start, 
+				bv->bv_offset, bv->bv_len,
+				req->offset, req->size,
+				req, req->bio, err);
+			BUG();
+		}
+
+		if (!(req->flags & DST_REQ_HEADER_SENT)) {
+			r.sector = cpu_to_be64(req->start);
+			r.offset = cpu_to_be32(bv->bv_offset + req->offset);
+			r.size = cpu_to_be32(cur_size);
+
+			err = kst_data_send_header(req->state, &r);
+			if (err != sizeof(struct dst_remote_request)) {
+				dprintk("%s: %d/%d: header: start: %llu, "
+					"bv_offset: %u, bv_len: %u, "
+					"a offset: %u, offset: %u, "
+					"cur_size: %u, err: %d.\n",
+					__func__, req->idx, req->num,
+					req->start, bv->bv_offset, bv->bv_len,
+					bv->bv_offset + req->offset,
+					req->offset, cur_size, err);
+				if (err >= 0)
+					err = -EINVAL;
+				break;
+			}
+
+			req->flags |= DST_REQ_HEADER_SENT;
+		}
+
+		err = func(req->state, bv, req->offset, cur_size);
+		if (err <= 0)
+			break;
+
+		req->offset += err;
+		req->size -= err;
+
+		if (req->offset != bv->bv_len) {
+			dprintk("%s: %d/%d: this: start: %llu, bv_offset: %u, "
+				"bv_len: %u, a offset: %u, offset: %u, "
+				"cur_size: %u, err: %d.\n",
+				__func__, req->idx, req->num, req->start,
+				bv->bv_offset, bv->bv_len,
+				bv->bv_offset + req->offset,
+				req->offset, cur_size, err);
+			err = -EAGAIN;
+			break;
+		}
+		req->offset = 0;
+		req->idx++;
+		req->flags &= ~DST_REQ_HEADER_SENT;
+
+		req->start += to_sector(bv->bv_len);
+	}
+
+	if (err <= 0 && err != -EAGAIN) {
+		if (err == 0)
+			err = -ECONNRESET;
+	} else
+		err = 0;
+
+	if (req->size) {
+		req->state->flags |= KST_FLAG_PARTIAL;
+	} else if (partial) {
+		req->state->flags &= ~KST_FLAG_PARTIAL;
+	}
+
+	if (err < 0 || (req->idx == req->num && req->size)) {
+		dprintk("%s: return: idx: %d, num: %d, offset: %u, "
+				"size: %llu, err: %d.\n",
+			__func__, req->idx, req->num, req->offset,
+			req->size, err);
+	}
+	dprintk("%s: end: start: %llu, idx: %d, num: %d, "
+			"size: %llu, offset: %u.\n",
+		__func__, req->start, req->idx, req->num,
+		req->size, req->offset);
+
+	return err;
+}
+
+void kst_bio_endio(struct dst_request *req, int err)
+{
+	if (err)
+		printk("%s: freeing bio: %p, bi_size: %u, "
+			"orig_size: %llu, req: %p.\n",
+		__func__, req->bio, req->bio->bi_size, req->orig_size, req);
+	bio_endio(req->bio, req->orig_size, err);
+}
+EXPORT_SYMBOL_GPL(kst_bio_endio);
+
+/*
+ * This callback is invoked by worker thread to process given request.
+ */
+int kst_data_callback(struct dst_request *req, unsigned int revents)
+{
+	int err;
+
+	dprintk("%s: req: %p, num: %d, idx: %d, bio: %p, "
+			"revents: %x, flags: %x.\n",
+			__func__, req, req->num, req->idx, req->bio,
+			revents, req->flags);
+
+	if (req->flags & DST_REQ_EXPORT_READ)
+		return 1;
+
+	err = kst_data_process_bio(req);
+	if (err < 0)
+		goto err_out;
+
+	if (!req->size) {
+		dprintk("%s: complete: req: %p, bio: %p.\n",
+				__func__, req, req->bio);
+		kst_del_req(req);
+		kst_complete_req(req, 0);
+		return 0;
+	}
+
+	if (revents & (POLLERR | POLLHUP | POLLRDHUP)) {
+		err = -EPIPE;
+		goto err_out;
+	}
+
+	return 1;
+
+err_out:
+	return err;
+}
+EXPORT_SYMBOL_GPL(kst_data_callback);
+
+#define KST_CONG_COMPLETED		(0)
+#define KST_CONG_NOT_FOUND		(1)
+#define KST_CONG_QUEUE			(-1)
+
+/*
+ * kst_congestion - checks for data congestion, i.e. the case, when given
+ * 	block request crosses an area of the another block request which
+ * 	is not yet sent to the remote node.
+ *
+ * @req: dst request containing block io related information.
+ *
+ * Return value:
+ * %KST_CONG_COMPLETED  - congestion was found and processed,
+ * 	bio must be ended, request is completed.
+ * %KST_CONG_NOT_FOUND  - no congestion found,
+ * 	request must be processed as usual
+ * %KST_CONG_QUEUE - congestion has been found, but bio is not completed,
+ * 	new request must be allocated and processed.
+ */
+static int kst_congestion(struct dst_request *req)
+{
+	int cmp, i;
+	struct kst_state *st = req->state;
+	struct rb_node *n = st->request_root.rb_node;
+	struct dst_request *old = NULL, *dst_req, *src_req;
+
+	while (n) {
+		src_req = rb_entry(n, struct dst_request, request_entry);
+		cmp = dst_compare_request_id(src_req, req);
+
+		if (cmp < 0)
+			n = n->rb_left;
+		else if (cmp > 0)
+			n = n->rb_right;
+		else {
+			old = src_req;
+			break;
+		}
+	}
+
+	if (likely(!old))
+		return KST_CONG_NOT_FOUND;
+
+	dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+			"new: op: %lu, start: %llu, size: %llu, off: %u.\n",
+		__func__, bio_rw(old->bio), old->start, old->orig_size,
+		old->offset,
+		bio_rw(req->bio), req->start, req->orig_size, req->offset);
+
+	if ((bio_rw(old->bio) != WRITE) && (bio_rw(req->bio) != WRITE)) {
+		return KST_CONG_QUEUE;
+	}
+
+	if (unlikely(req->offset != old->offset))
+		return KST_CONG_QUEUE;
+
+	src_req = old;
+	dst_req = req;
+	if (bio_rw(req->bio) == WRITE) {
+		dst_req = old;
+		src_req = req;
+	}
+
+	/* Actually we could partially complete new request by copying
+	 * part of the first one, but not now, consider this as a
+	 * (low-priority) todo item.
+	 */
+	if (src_req->start + src_req->orig_size <
+			dst_req->start + dst_req->orig_size)
+		return KST_CONG_QUEUE;
+
+	/*
+	 * So, only process if new request is differnt from old one,
+	 * or subsequent write, i.e.:
+	 * - not completed write and request to read
+	 * - not completed read and request to write
+	 * - not completed write and request to (over)write
+	 */
+	for (i = old->idx; i < old->num; ++i) {
+		struct bio_vec *bv_src, *bv_dst;
+		void *src, *dst;
+		u64 len;
+
+		bv_src = bio_iovec_idx(src_req->bio, i);
+		bv_dst = bio_iovec_idx(dst_req->bio, i);
+
+		if (unlikely(bv_dst->bv_offset != bv_src->bv_offset))
+			return KST_CONG_QUEUE;
+
+		if (unlikely(bv_dst->bv_len != bv_src->bv_len))
+			return KST_CONG_QUEUE;
+
+		src = kmap_atomic(bv_src->bv_page, KM_USER0);
+		dst = kmap_atomic(bv_dst->bv_page, KM_USER1);
+
+		len = min_t(u64, bv_dst->bv_len, dst_req->size);
+
+		memcpy(dst + bv_dst->bv_offset, src + bv_src->bv_offset, len);
+
+		kunmap_atomic(src, KM_USER0);
+		kunmap_atomic(dst, KM_USER1);
+
+		dst_req->idx++;
+		dst_req->size -= len;
+		dst_req->offset = 0;
+		dst_req->start += to_sector(len);
+
+		if (!dst_req->size)
+			break;
+	}
+
+	if (req == dst_req)
+		return KST_CONG_COMPLETED;
+
+	kst_del_req(dst_req);
+	kst_complete_req(dst_req, 0);
+
+	return KST_CONG_NOT_FOUND;
+}
+
+struct dst_request *dst_clone_request(struct dst_request *req, mempool_t *pool)
+{
+	struct dst_request *new_req;
+
+	new_req = mempool_alloc(pool, GFP_NOIO);
+	if (!new_req)
+		return NULL;
+
+	memset(new_req, 0, sizeof(struct dst_request));
+
+	dprintk("%s: req: %p, new_req: %p, bio: %p.\n",
+			__func__, req, new_req, req->bio);
+
+	RB_CLEAR_NODE(&new_req->request_entry);
+
+	if (req) {
+		new_req->bio = req->bio;
+		new_req->state = req->state;
+		new_req->node = req->node;
+		new_req->idx = req->idx;
+		new_req->num = req->num;
+		new_req->size = req->size;
+		new_req->orig_size = req->orig_size;
+		new_req->offset = req->offset;
+		new_req->start = req->start;
+		new_req->flags = req->flags;
+		new_req->bio_endio = req->bio_endio;
+		new_req->priv = req->priv;
+	}
+
+	return new_req;
+}
+EXPORT_SYMBOL_GPL(dst_clone_request);
+
+void dst_free_request(struct dst_request *req)
+{
+	dprintk("%s: free req: %p, pool: %p, bio: %p, state: %p, node: %p.\n",
+			__func__, req, req->node->w->req_pool,
+			req->bio, req->state, req->node);
+	mempool_free(req, req->node->w->req_pool);
+}
+EXPORT_SYMBOL_GPL(dst_free_request);
+
+/*
+ * This is main data processing function, eventually invoked from block layer.
+ * It tries to complte request, but if it is about to block, it allocates
+ * new request and queues it to main worker to be processed when events allow.
+ */
+static int kst_data_push(struct dst_request *req)
+{
+	struct kst_state *st = req->state;
+	struct dst_request *new_req;
+	unsigned int revents;
+	int err, locked = 0;
+
+	dprintk("%s: start: %llu, size: %llu, bio: %p.\n",
+			__func__, req->start, req->size, req->bio);
+
+	if (mutex_trylock(&st->request_lock)) {
+		locked = 1;
+
+		if (st->flags & (KST_FLAG_PARTIAL | DST_REQ_ALWAYS_QUEUE))
+			goto alloc_new_req;
+
+		err = kst_congestion(req);
+		if (err == KST_CONG_COMPLETED) {
+			err = 0;
+			goto out_bio_endio;
+		}
+
+		if (err == KST_CONG_NOT_FOUND) {
+			revents = st->socket->ops->poll(NULL, st->socket, NULL);
+			dprintk("%s: st: %p, bio: %p, revents: %x.\n",
+					__func__, st, req->bio, revents);
+			if (revents & POLLOUT) {
+				err = kst_data_process_bio(req);
+				if (err < 0)
+					goto out_unlock;
+
+				if (!req->size) {
+					err = 0;
+					goto out_bio_endio;
+				}
+			}
+		}
+	}
+
+alloc_new_req:
+	err = -ENOMEM;
+	new_req = dst_clone_request(req, req->node->w->req_pool);
+	if (!new_req)
+		goto out_unlock;
+
+	new_req->callback = &kst_data_callback;
+
+	if (!locked)
+		mutex_lock(&st->request_lock);
+	locked = 1;
+
+	err = kst_enqueue_req(st, new_req);
+	mutex_unlock(&st->request_lock);
+	locked = 0;
+	if (err) {
+		printk(KERN_NOTICE "%s: congestion [%c], start: %llu, idx: %d,"
+				" num: %d, size: %llu, offset: %u, err: %d.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size,
+			req->offset, err);
+	}
+
+	kst_wake(st);
+
+	return 0;
+
+out_bio_endio:
+	req->bio_endio(req, err);
+out_unlock:
+	if (locked)
+		mutex_unlock(&st->request_lock);
+	locked = 0;
+
+	if (err) {
+		err = kst_error(st, err);
+		if (!err)
+			goto alloc_new_req;
+	}
+
+	if (err) {
+		printk("%s: error [%c], start: %llu, idx: %d, num: %d, "
+				"size: %llu, offset: %u, err: %d.\n",
+			__func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+			req->start, req->idx, req->num, req->size,
+			req->offset, err);
+		req->bio_endio(req, err);
+	}
+
+	kst_wake(st);
+	return err;
+}
+
+/*
+ * Remote node initialization callback.
+ */
+static int kst_data_init(struct kst_state *st, void *data)
+{
+	int err;
+
+	st->socket = data;
+	st->socket->sk->sk_allocation = GFP_NOIO;
+	/*
+	 * Why not?
+	 */
+	st->socket->sk->sk_sndbuf = st->socket->sk->sk_sndbuf = 1024*1024*10;
+
+	err = kst_poll_init(st);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/*
+ * Remote node recovery function - tries to reconnect to given target.
+ */
+static int kst_data_recovery(struct kst_state *st, int err)
+{
+	struct socket *sock;
+	struct sockaddr addr;
+	int addrlen;
+	struct dst_request *req;
+
+	if (err != -ECONNRESET && err != -EPIPE) {
+		dprintk("%s: state %p does not know how "
+				"to recover from error %d.\n",
+				__func__, st, err);
+		return err;
+	}
+
+	err = sock_create(st->socket->ops->family, st->socket->type,
+			st->socket->sk->sk_protocol, &sock);
+	if (err < 0)
+		goto err_out_exit;
+
+	sock->sk->sk_sndtimeo = sock->sk->sk_rcvtimeo =
+		msecs_to_jiffies(DST_DEFAULT_TIMEO);
+
+	err = sock->ops->getname(st->socket, &addr, &addrlen, 2);
+	if (err)
+		goto err_out_destroy;
+
+	err = sock->ops->connect(sock, &addr, addrlen, 0);
+	if (err)
+		goto err_out_destroy;
+
+	kst_poll_exit(st);
+	kst_sock_release(st);
+
+	mutex_lock(&st->request_lock);
+	err = st->ops->init(st, sock);
+	if (!err) {
+		/*
+		 * After reconnection is completed all requests
+		 * must be resent from the state they were finished previously,
+		 * but with new headers.
+		 */
+		list_for_each_entry(req, &st->request_list, request_list_entry)
+			req->flags &= ~DST_REQ_HEADER_SENT;
+	}
+	mutex_unlock(&st->request_lock);
+	if (err < 0)
+		goto err_out_destroy;
+
+	kst_wake(st);
+	dprintk("%s: recovery completed.\n", __func__);
+
+	return 0;
+
+err_out_destroy:
+	sock_release(sock);
+err_out_exit:
+	dprintk("%s: revovery failed: st: %p, err: %d.\n", __func__, st, err);
+	return err;
+}
+
+static inline void kst_convert_header(struct dst_remote_request *r)
+{
+	r->cmd = be32_to_cpu(r->cmd);
+	r->sector = be64_to_cpu(r->sector);
+	r->offset = be32_to_cpu(r->offset);
+	r->size = be32_to_cpu(r->size);
+	r->flags = be32_to_cpu(r->flags);
+}
+
+/*
+ * Local exporting node end IO callbacks.
+ */
+static int kst_export_write_end_io(struct bio *bio, unsigned int size, int err)
+{
+	dprintk("%s: bio: %p, size: %u, idx: %d, num: %d, err: %d.\n",
+		__func__, bio, bio->bi_size, bio->bi_idx, bio->bi_vcnt, err);
+
+	if (bio->bi_size)
+		return 1;
+
+	kst_export_put_bio(bio);
+	return 0;
+}
+
+static int kst_export_read_end_io(struct bio *bio, unsigned int size, int err)
+{
+	struct dst_request *req = bio->bi_private;
+	struct kst_state *st = req->state;
+
+	dprintk("%s: bio: %p, req: %p, size: %u, idx: %d, num: %d, err: %d.\n",
+		__func__, bio, req, bio->bi_size, bio->bi_idx,
+		bio->bi_vcnt, err);
+
+	if (bio->bi_size)
+		return 1;
+
+	bio->bi_size = req->size = req->orig_size;
+	bio->bi_rw = WRITE;
+	req->flags &= ~DST_REQ_EXPORT_READ;
+	kst_wake(st);
+	return 0;
+}
+
+/*
+ * This callback is invoked each time new request from remote
+ * node to given local export node is received.
+ * It allocates new block IO request and queues it for processing.
+ */
+static int kst_export_ready(struct kst_state *st)
+{
+	struct dst_remote_request r;
+	struct msghdr msg;
+	struct kvec iov;
+	struct bio *bio;
+	int err, nr, i;
+	struct dst_request *req;
+	sector_t data_size;
+	unsigned int revents = st->socket->ops->poll(NULL, st->socket, NULL);
+
+	if (revents & (POLLERR | POLLHUP)) {
+		err = -EPIPE;
+		goto err_out_exit;
+	}
+
+	if (!(revents & POLLIN) || !list_empty(&st->request_list))
+		return 0;
+
+	iov.iov_base = &r;
+	iov.iov_len = sizeof(struct dst_remote_request);
+
+	msg.msg_iov = (struct iovec *)&iov;
+	msg.msg_iovlen = 1;
+	msg.msg_name = NULL;
+	msg.msg_namelen = 0;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+	err = kernel_recvmsg(st->socket, &msg, &iov, 1,
+			iov.iov_len, msg.msg_flags);
+	if (err != sizeof(struct dst_remote_request)) {
+		err = -EINVAL;
+		goto err_out_exit;
+	}
+
+	kst_convert_header(&r);
+
+	dprintk("\n%s: cmd: %u, sector: %llu, size: %u, "
+			"flags: %x, offset: %u.\n",
+			__func__, r.cmd, r.sector, r.size, r.flags, r.offset);
+
+	err = -EINVAL;
+	if (r.cmd != DST_READ && r.cmd != DST_WRITE && r.cmd != DST_REMOTE_CFG)
+		goto err_out_exit;
+
+	data_size = get_capacity(st->node->bdev->bd_disk);
+	if ((signed)(r.sector + to_sector(r.size)) < 0 ||
+			(signed)(r.sector + to_sector(r.size)) > data_size ||
+			(signed)r.sector > data_size)
+		goto err_out_exit;
+
+	if (r.cmd == DST_REMOTE_CFG) {
+		r.sector = data_size;
+		kst_convert_header(&r);
+
+		iov.iov_base = &r;
+		iov.iov_len = sizeof(struct dst_remote_request);
+
+		msg.msg_iov = (struct iovec *)&iov;
+		msg.msg_iovlen = 1;
+		msg.msg_name = NULL;
+		msg.msg_namelen = 0;
+		msg.msg_control = NULL;
+		msg.msg_controllen = 0;
+		msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+		err = kernel_sendmsg(st->socket, &msg, &iov, 1, iov.iov_len);
+		if (err != sizeof(struct dst_remote_request)) {
+			err = -EINVAL;
+			goto err_out_exit;
+		}
+		kst_wake(st);
+		return 0;
+	}
+
+	nr = r.size/PAGE_SIZE + 1;
+
+	while (r.size) {
+		int nr_pages = min(BIO_MAX_PAGES, nr);
+		unsigned int size;
+		struct page *page;
+
+		err = -ENOMEM;
+		req = dst_clone_request(NULL, st->node->w->req_pool);
+		if (!req)
+			goto err_out_exit;
+
+		dprintk("%s: alloc req: %p, pool: %p.\n",
+				__func__, req, st->node->w->req_pool);
+
+		bio = bio_alloc(GFP_NOIO, nr_pages);
+		if (!bio)
+			goto err_out_free_req;
+
+		req->flags = DST_REQ_EXPORT | DST_REQ_HEADER_SENT;
+		req->bio = bio;
+		req->state = st;
+		req->node = st->node;
+		req->callback = &kst_data_callback;
+		req->bio_endio = &kst_bio_endio;
+
+		/*
+		 * Yes, looks a bit weird.
+		 * Logic is simple - for local exporting node all operations
+		 * are reversed compared to usual nodes, since usual nodes
+		 * process remote data and local export node process remote
+		 * requests, so that writing data means sending data to
+		 * remote node and receiving on the local export one.
+		 *
+		 * So, to process writing to the exported node we need first 
+		 * to receive data from the net (i.e. to perform READ 
+		 * operationin terms of usual node), and then put it to the 
+		 * storage (WRITE command, so it will be changed before 
+		 * calling generic_make_request()).
+		 *
+		 * To process read request from the exported node we need
+		 * first to read it from storage (READ command for BIO)
+		 * and then send it over the net (perform WRITE operation
+		 * in terms of network).
+		 */
+		if (r.cmd == DST_WRITE) {
+			req->flags |= DST_REQ_EXPORT_WRITE;
+			bio->bi_end_io = kst_export_write_end_io;
+		} else {
+			req->flags |= DST_REQ_EXPORT_READ;
+			bio->bi_end_io = kst_export_read_end_io;
+		}
+		bio->bi_rw = READ;
+		bio->bi_private = req;
+		bio->bi_sector = r.sector;
+		bio->bi_bdev = st->node->bdev;
+
+		for (i = 0; i < nr_pages; ++i) {
+			page = alloc_page(GFP_NOIO);
+			if (!page)
+				break;
+
+			size = min_t(u32, PAGE_SIZE, r.size);
+
+			err = bio_add_page(bio, page, size, r.offset);
+			dprintk("%s: %d/%d: page: %p, size: %u, offset: %u, "
+					"err: %d.\n",
+					__func__, i, nr_pages, page, size,
+					r.offset, err);
+			if (err <= 0)
+				break;
+
+			if (err == size) {
+				r.offset = 0;
+				nr--;
+			} else {
+				r.offset += err;
+			}
+
+			r.size -= err;
+			r.sector += to_sector(err);
+
+			if (!r.size)
+				break;
+		}
+
+		if (!bio->bi_vcnt) {
+			err = -ENOMEM;
+			goto err_out_put;
+		}
+
+		req->size = req->orig_size = bio->bi_size;
+		req->start = bio->bi_sector;
+		req->idx = 0;
+		req->num = bio->bi_vcnt;
+
+		dprintk("%s: submitting: bio: %p, req: %p, start: %llu, "
+			"size: %llu, idx: %d, num: %d, offset: %u, err: %d.\n",
+			__func__, bio, req, req->start, req->size,
+			req->idx, req->num, req->offset, err);
+
+		err = kst_enqueue_req(st, req);
+		if (err)
+			goto err_out_put;
+
+		if (r.cmd == DST_READ) {
+			generic_make_request(bio);
+		}
+	}
+
+	kst_wake(st);
+	return 0;
+
+err_out_put:
+	bio_put(bio);
+err_out_free_req:
+	dst_free_request(req);
+err_out_exit:
+	dprintk("%s: error: %d.\n", __func__, err);
+	return err;
+}
+
+static void kst_export_exit(struct kst_state *st)
+{
+	struct dst_node *n = st->node;
+
+	dprintk("%s: st: %p.\n", __func__, st);
+
+	kst_common_exit(st);
+	dst_node_put(n);
+}
+
+static struct kst_state_ops kst_data_export_ops = {
+	.init = &kst_data_init,
+	.push = &kst_data_push,
+	.exit = &kst_export_exit,
+	.ready = &kst_export_ready,
+};
+
+/*
+ * This callback is invoked each time listening socket for
+ * given local export node becomes ready.
+ * It creates new state for connected client and queues for processing.
+ */
+static int kst_listen_ready(struct kst_state *st)
+{
+	struct socket *newsock;
+	struct saddr addr;
+	struct kst_state *newst;
+	int err;
+	unsigned int revents, permissions = 0;
+	struct dst_secure *s;
+
+	revents = st->socket->ops->poll(NULL, st->socket, NULL);
+	if (!(revents & POLLIN))
+		return 1;
+
+	err = sock_create(st->socket->ops->family, st->socket->type,
+			st->socket->sk->sk_protocol, &newsock);
+	if (err)
+		goto err_out_exit;
+
+	err = st->socket->ops->accept(st->socket, newsock, 0);
+	if (err)
+		goto err_out_put;
+
+	if (newsock->ops->getname(newsock, (struct sockaddr *)&addr,
+				  (int *)&addr.sa_data_len, 2) < 0) {
+		err = -ECONNABORTED;
+		goto err_out_put;
+	}
+
+	list_for_each_entry(s, &st->request_list, sec_entry) {
+		void *sec_addr, *new_addr;
+
+		sec_addr = ((void *)&s->sec.addr) + s->sec.check_offset;
+		new_addr = ((void *)&addr) + s->sec.check_offset;
+
+		if (!memcmp(sec_addr, new_addr,	
+				addr.sa_data_len - s->sec.check_offset)) {
+			permissions = s->sec.permissions;
+			break;
+		}
+	}
+
+	/*
+	 * So far only reading and writing are supported.
+	 * Block device does not know about anything else,
+	 * but as far as I recall, there was a prognosis,
+	 * that computer will never require more than 640kb of RAM.
+	 */
+	if (permissions == 0) {
+		err = -EPERM;
+		goto err_out_put;
+	}
+
+	if (st->socket->ops->family == AF_INET) {
+		struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+		printk(KERN_INFO "%s: Client: %u.%u.%u.%u:%d.\n", __func__,
+			NIPQUAD(sin->sin_addr.s_addr), ntohs(sin->sin_port));
+	} else if (st->socket->ops->family == AF_INET6) {
+		struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+		printk(KERN_INFO "%s: Client: "
+			"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d",
+			__func__, 
+			NIP6(sin->sin6_addr), ntohs(sin->sin6_port));
+	}
+
+	dst_node_get(st->node);
+	newst = kst_state_init(st->node, permissions,
+			&kst_data_export_ops, newsock);
+	if (IS_ERR(newst)) {
+		err = PTR_ERR(newst);
+		goto err_out_put;
+	}
+
+	/*
+	 * Negative return value means error, positive - stop this state 
+	 * processing. Zero allows to check state for pending requests.
+	 * Listening socket contains security objects in request list,
+	 * since it does not have any requests.
+	 */
+	return 1;
+
+err_out_put:
+	sock_release(newsock);
+err_out_exit:
+	return 1;
+}
+
+static int kst_listen_init(struct kst_state *st, void *data)
+{
+	int err = -ENOMEM, i;
+	struct dst_le_template *tmp = data;
+	struct dst_secure *s;
+
+	for (i=0; i<tmp->le.secure_attr_num; ++i) {
+		s = kmalloc(sizeof(struct dst_secure), GFP_KERNEL);
+		if (!s)
+			goto err_out_exit;
+
+		if (copy_from_user(&s->sec, tmp->data,
+				sizeof(struct dst_secure_user))) {
+			kfree(s);
+			err = -EFAULT;
+			goto err_out_exit;
+		}
+
+		list_add_tail(&s->sec_entry, &st->request_list);
+		tmp->data += sizeof(struct dst_secure_user);
+
+		if (s->sec.addr.sa_family == AF_INET) {
+			struct sockaddr_in *sin = 
+				(struct sockaddr_in *)&s->sec.addr;
+			printk(KERN_INFO "%s: Client: %u.%u.%u.%u:%d, "
+					"permissions: %x.\n", 
+				__func__, NIPQUAD(sin->sin_addr.s_addr), 
+				ntohs(sin->sin_port), s->sec.permissions);
+		} else if (s->sec.addr.sa_family == AF_INET6) {
+			struct sockaddr_in6 *sin = 
+				(struct sockaddr_in6 *)&s->sec.addr;
+			printk(KERN_INFO "%s: Client: "
+				"%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d, "
+				"permissions: %x.\n", 
+				__func__, NIP6(sin->sin6_addr), 
+				ntohs(sin->sin6_port), s->sec.permissions);
+		}
+	}
+
+	err = kst_sock_create(st, &tmp->le.rctl.addr, tmp->le.rctl.type,
+			tmp->le.rctl.proto, tmp->le.backlog);
+	if (err)
+		goto err_out_exit;
+
+	err = kst_poll_init(st);
+	if (err)
+		goto err_out_release;
+
+	return 0;
+
+err_out_release:
+	kst_sock_release(st);
+err_out_exit:
+	kst_listen_flush(st);
+	return err;
+}
+
+/*
+ * Operations for different types of states.
+ * There are three:
+ * data state - created for remote node, when distributed storage connects
+ * 	to remote node, which contain data.
+ * listen state - created for local export node, when remote distributed
+ * 	storage's node connects to given node to get/put data.
+ * data export state - created for each client connected to above listen
+ * 	state.
+ */
+static struct kst_state_ops kst_listen_ops = {
+	.init = &kst_listen_init,
+	.exit = &kst_listen_exit,
+	.ready = &kst_listen_ready,
+};
+static struct kst_state_ops kst_data_ops = {
+	.init = &kst_data_init,
+	.push = &kst_data_push,
+	.exit = &kst_common_exit,
+	.recovery = &kst_data_recovery,
+};
+
+struct kst_state *kst_listener_state_init(struct dst_node *node,
+		struct dst_le_template *tmp)
+{
+	return kst_state_init(node, DST_PERM_READ | DST_PERM_WRITE,
+			&kst_listen_ops, tmp);
+}
+
+struct kst_state *kst_data_state_init(struct dst_node *node,
+		struct socket *newsock)
+{
+	return kst_state_init(node, DST_PERM_READ | DST_PERM_WRITE,
+			&kst_data_ops, newsock);
+}
+
+/*
+ * Remove all workers and associated states.
+ */
+void kst_exit_all(void)
+{
+	struct kst_worker *w, *n;
+
+	list_for_each_entry_safe(w, n, &kst_worker_list, entry) {
+		kst_worker_exit(w);
+	}
+}
diff --git a/include/linux/dst.h b/include/linux/dst.h
new file mode 100644
index 0000000..7b0feb1
--- /dev/null
+++ b/include/linux/dst.h
@@ -0,0 +1,354 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __DST_H
+#define __DST_H
+
+#include <linux/types.h>
+
+#define DST_NAMELEN		32
+#define DST_NAME		"dst"
+#define DST_IOCTL		0xba
+
+enum {
+	DST_DEL_NODE	= 0,	/* Remove node with given id from storage */
+	DST_ADD_REMOTE,		/* Add remote node with given id to the storage */
+	DST_ADD_LOCAL,		/* Add local node with given id to the storage */
+	DST_ADD_LOCAL_EXPORT,	/* Add local node with given id to the storage to be exported and used by remote peers */
+	DST_START_STORAGE,	/* Array is ready and storage can be started, if there will be new nodes
+				 * added to the storage, they will be checked against existing size and
+				 * probably be dropped (for example in mirror format when new node has smaller
+				 * size than array created) or inserted.
+				 */
+	DST_STOP_STORAGE,	/* Remove array and all nodes. */
+	DST_CMD_MAX
+};
+
+#define DST_CTL_FLAGS_REMOTE	(1<<0)
+#define DST_CTL_FLAGS_EXPORT	(1<<1)
+
+struct dst_ctl
+{
+	char			st[DST_NAMELEN];
+	char			alg[DST_NAMELEN];
+	__u32			flags;
+	__u64			start, size;
+};
+
+struct dst_local_ctl
+{
+	char			name[DST_NAMELEN];
+};
+
+#define SADDR_MAX_DATA	128
+
+struct saddr {
+	unsigned short		sa_family;			/* address family, AF_xxx	*/
+	char			sa_data[SADDR_MAX_DATA];	/* 14 bytes of protocol address	*/
+	unsigned short		sa_data_len;			/* Number of bytes used in sa_data */
+};
+
+struct dst_remote_ctl
+{
+	__u16			type;
+	__u16			proto;
+	struct saddr		addr;
+};
+
+#define DST_PERM_READ		(1<<0)
+#define DST_PERM_WRITE		(1<<1)
+
+/*
+ * Right now it is simple model, where each remote address
+ * is assigned to set of permissions it is allowed to perform.
+ * In real world block device does not know anything but
+ * reading and writing, so it should be more than enough.
+ */
+struct dst_secure_user
+{
+	unsigned int		permissions;
+	unsigned short		check_offset;
+	struct saddr		addr;
+};
+
+struct dst_local_export_ctl
+{
+	__u32			backlog;
+	int			secure_attr_num;
+	struct dst_local_ctl	lctl;
+	struct dst_remote_ctl	rctl;
+};
+
+enum {
+	DST_REMOTE_CFG		= 1, 		/* Request remote configuration */
+	DST_WRITE,				/* Writing */
+	DST_READ,				/* Reading */
+	DST_NCMD_MAX,
+};
+
+struct dst_remote_request
+{
+	__u32			cmd;
+	__u32			flags;
+	__u64			sector;
+	__u32			offset;
+	__u32			size;
+};
+
+#ifdef __KERNEL__
+
+#include <linux/rbtree.h>
+#include <linux/net.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/mempool.h>
+#include <linux/device.h>
+
+//#define DST_DEBUG
+
+#ifdef DST_DEBUG
+#define dprintk(f, a...) printk(KERN_NOTICE f, ##a)
+#else
+#define dprintk(f, a...) do {} while (0)
+#endif
+
+struct kst_worker
+{
+	struct list_head	entry;
+
+	struct list_head	state_list;
+	struct mutex		state_mutex;
+
+	struct list_head	ready_list;
+	spinlock_t		ready_lock;
+
+	mempool_t		*req_pool;
+
+	struct task_struct	*thread;
+
+	wait_queue_head_t 	wait;
+
+	int			id;
+};
+
+struct kst_state;
+struct dst_node;
+
+#define DST_REQ_HEADER_SENT	(1<<0)
+#define DST_REQ_EXPORT		(1<<1)
+#define DST_REQ_EXPORT_WRITE	(1<<2)
+#define DST_REQ_EXPORT_READ	(1<<3)
+#define DST_REQ_ALWAYS_QUEUE	(1<<4)
+
+struct dst_request
+{
+	struct rb_node		request_entry;
+	struct list_head	request_list_entry;
+	struct bio		*bio;
+	struct kst_state 	*state;
+	struct dst_node 	*node;
+
+	u32			flags;
+
+	int 			(*callback)(struct dst_request *dst,
+						unsigned int revents);
+	void			(*bio_endio)(struct dst_request *dst, 
+						int err);
+
+	void			*priv;
+	atomic_t		refcnt;
+
+	u64			size, orig_size, start;
+	int			idx, num;
+	u32			offset;
+};
+
+struct kst_state_ops
+{
+	int 		(*init)(struct kst_state *, void *);
+	int 		(*push)(struct dst_request *req);
+	int		(*ready)(struct kst_state *);
+	int		(*recovery)(struct kst_state *, int err);
+	void 		(*exit)(struct kst_state *);
+};
+
+#define KST_FLAG_PARTIAL		(1<<0)
+
+struct kst_state
+{
+	struct list_head	entry;
+	struct list_head	ready_entry;
+
+	wait_queue_t 		wait;
+	wait_queue_head_t 	*whead;
+
+	struct dst_node		*node;
+	struct socket		*socket;
+
+	u32			flags, permissions;
+
+	struct rb_root		request_root;
+	struct mutex		request_lock;
+	struct list_head	request_list;
+
+	struct kst_state_ops	*ops;
+};
+
+#define DST_DEFAULT_TIMEO	2000
+
+struct dst_storage;
+
+struct dst_alg_ops
+{
+	int			(*add_node)(struct dst_node *n);
+	void			(*del_node)(struct dst_node *n);
+	int 			(*remap)(struct dst_request *req);
+	int			(*error)(struct kst_state *state, int err);
+	struct module 		*owner;
+};
+
+struct dst_alg
+{
+	struct list_head	entry;
+	char			name[DST_NAMELEN];
+	atomic_t		refcnt;
+	struct dst_alg_ops	*ops;
+};
+
+#define DST_ST_STARTED		(1<<0)
+
+struct dst_storage
+{
+	struct list_head	entry;
+	char			name[DST_NAMELEN];
+	struct dst_alg		*alg;
+	atomic_t		refcnt;
+	struct mutex		tree_lock;
+	struct rb_root		tree_root;
+
+	request_queue_t		*queue;
+	struct gendisk		*disk;
+
+	long			flags;
+	u64			disk_size;
+
+	struct device		device;
+};
+
+#define DST_NODE_FROZEN		0
+#define DST_NODE_NOTSYNC	1
+
+struct dst_node
+{
+	struct rb_node		tree_node;
+
+	struct list_head	shared;
+	struct dst_node		*shared_head;
+
+	struct block_device 	*bdev;
+	struct dst_storage	*st;
+	struct kst_state	*state;
+	struct kst_worker	*w;
+
+	atomic_t		refcnt;
+	atomic_t		shared_num;
+
+	void			(*cleanup)(struct dst_node *);
+
+	long			flags;
+
+	u64			start, size;
+
+	void			(*priv_callback)(struct dst_node *);
+	void			*priv;
+
+	struct device		device;
+};
+
+struct dst_le_template
+{
+	struct dst_local_export_ctl	le;
+	void __user			*data;
+};
+
+struct dst_secure
+{
+	struct list_head	sec_entry;
+	struct dst_secure_user	sec;
+};
+
+void kst_state_exit(struct kst_state *st);
+
+struct kst_worker *kst_worker_init(int id);
+void kst_worker_exit(struct kst_worker *w);
+
+struct kst_state *kst_listener_state_init(struct dst_node *node,
+		struct dst_le_template *tmp);
+struct kst_state *kst_data_state_init(struct dst_node *node,
+		struct socket *newsock);
+
+void kst_wake(struct kst_state *st);
+
+void kst_exit_all(void);
+
+struct dst_alg *dst_alloc_alg(char *name, struct dst_alg_ops *ops);
+void dst_remove_alg(struct dst_alg *alg);
+
+struct dst_node *dst_storage_tree_search(struct dst_storage *st, u64 start);
+
+void dst_node_put(struct dst_node *n);
+
+static inline struct dst_node *dst_node_get(struct dst_node *n)
+{
+	atomic_inc(&n->refcnt);
+	return n;
+}
+
+struct dst_request *dst_clone_request(struct dst_request *req, mempool_t *pool);
+void dst_free_request(struct dst_request *req);
+
+void kst_complete_req(struct dst_request *req, int err);
+void kst_bio_endio(struct dst_request *req, int err);
+void kst_del_req(struct dst_request *req);
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req);
+
+int kst_data_callback(struct dst_request *req, unsigned int revents);
+
+extern struct kmem_cache *dst_request_cache;
+
+static inline sector_t to_sector(unsigned long n)
+{
+	return (n >> 9);
+}
+
+static inline unsigned long to_bytes(sector_t n)
+{
+	return (n << 9);
+}
+
+/*
+ * Checks state's permissions.
+ * Returns -EPERM if check failed.
+ */
+static inline int kst_check_permissions(struct kst_state *st, struct bio *bio)
+{
+	if ((bio_rw(bio) == WRITE) && !(st->permissions & DST_PERM_WRITE))
+		return -EPERM;
+
+	return 0;
+}
+
+#endif /* __KERNEL__ */
+#endif /* __DST_H */

-- 
	Evgeniy Polyakov

^ permalink raw reply related

* Re: [PATCH] Fix a lock problem in generic phy code
From: Hans-Jürgen Koch @ 2007-08-31 16:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Jeff Garzik, afleming
In-Reply-To: <200708311430.09590.hjk@linutronix.de>

Am Freitag 31 August 2007 schrieb Hans-Jürgen Koch:
> Lock debugging finds a problem in phy.c and phy_device.c,
> this patch fixes it. Tested on an AT91SAM9263-EK board,
> kernel 2.6.23-rc4.

FYI, here's the log message without that patch:

[    3.420000] =================================
[    3.420000] [ INFO: inconsistent lock state ]
[    3.420000] 2.6.23-rc3-mm1 #21
[    3.420000] ---------------------------------
[    3.420000] inconsistent {softirq-on-W} -> {in-softirq-W} usage.
[    3.420000] swapper/1 [HC0[0]:SC1[1]:HE1:SE0] takes:
[    3.420000]  (&dev->lock){-+..}, at: [<c017da1c>] phy_timer+0x1c/0x4c8
[    3.420000] {softirq-on-W} state was registered at:
[    3.420000]   [<c006021c>] lock_acquire+0x94/0xac
[    3.420000]   [<c026423c>] _spin_lock+0x40/0x50
[    3.420000]   [<c017e2b4>] phy_probe+0x40/0x88
[    3.420000]   [<c017e60c>] phy_attach+0xc8/0xfc
[    3.420000]   [<c017e6c4>] phy_connect+0x1c/0x58
[    3.420000]   [<c017fc44>] macb_probe+0x4fc/0x5bc
[    3.420000]   [<c0177b44>] platform_drv_probe+0x20/0x24
[    3.420000]   [<c0176210>] driver_probe_device+0xb0/0x1bc
[    3.420000]   [<c01764dc>] __driver_attach+0xe4/0xe8
[    3.420000]   [<c017550c>] bus_for_each_dev+0x54/0x80
[    3.420000]   [<c0176074>] driver_attach+0x20/0x28
[    3.420000]   [<c01758d4>] bus_add_driver+0x84/0x1e0
[    3.420000]   [<c0176708>] driver_register+0x54/0x90
[    3.420000]   [<c0177de0>] platform_driver_register+0x6c/0x88
[    3.420000]   [<c00167ec>] macb_init+0x14/0x1c
[    3.420000]   [<c00087e4>] kernel_init+0x9c/0x2b4
[    3.420000]   [<c0040338>] do_exit+0x0/0x8e8
[    3.420000] irq event stamp: 115025
[    3.420000] hardirqs last  enabled at (115025): [<c0264958>] _spin_unlock_irq+0x30/0x60
[    3.420000] hardirqs last disabled at (115024): [<c02642c8>] _spin_lock_irq+0x28/0x60
[    3.420000] softirqs last  enabled at (114999): [<c0042910>] __do_softirq+0xfc/0x12c
[    3.420000] softirqs last disabled at (115022): [<c0042ee8>] irq_exit+0x68/0x7c
[    3.420000]
[    3.420000] other info that might help us debug this:
[    3.420000] no locks held by swapper/1.
[    3.420000]
[    3.420000] stack backtrace:
[    3.420000] [<c0028af4>] (dump_stack+0x0/0x14) from [<c005d8e4>] (print_usage_bug+0x120/0x150)
[    3.420000] [<c005d7c4>] (print_usage_bug+0x0/0x150) from [<c005e5e8>] (mark_lock+0x574/0x6bc)
[    3.420000] [<c005e074>] (mark_lock+0x0/0x6bc) from [<c005f4c0>] (__lock_acquire+0x4d0/0x1198)
[    3.420000] [<c005eff0>] (__lock_acquire+0x0/0x1198) from [<c006021c>] (lock_acquire+0x94/0xac)
[    3.420000] [<c0060188>] (lock_acquire+0x0/0xac) from [<c026423c>] (_spin_lock+0x40/0x50)
[    3.420000] [<c02641fc>] (_spin_lock+0x0/0x50) from [<c017da1c>] (phy_timer+0x1c/0x4c8)
[    3.420000]  r5:c098a800 r4:00000104
[    3.420000] [<c017da00>] (phy_timer+0x0/0x4c8) from [<c0046ee4>] (run_timer_softirq+0x1a0/0x218)
[    3.420000]  r7:c081ddd0 r6:c0331e80 r5:c098aa50 r4:00000104
[    3.420000] [<c0046d44>] (run_timer_softirq+0x0/0x218) from [<c00428a8>] (__do_softirq+0x94/0x12c)
[    3.420000] [<c0042814>] (__do_softirq+0x0/0x12c) from [<c0042ee8>] (irq_exit+0x68/0x7c)
[    3.420000] [<c0042e80>] (irq_exit+0x0/0x7c) from [<c0023060>] (__exception_text_start+0x60/0x74)
[    3.420000]  r4:00000001
[    3.420000] [<c0023000>] (__exception_text_start+0x0/0x74) from [<c0023b28>] (__irq_svc+0x48/0x74)
[    3.420000] Exception stack(0xc081de70 to 0xc081deb8)
[    3.420000] de60:                                     c030b2a8 c081a800 60000013 c081c000
[    3.420000] de80: c032db40 c081dee0 c032d758 c081dece 00000012 c030b398 00000034 c081df2c
[    3.420000] dea0: c081deb8 c081deb8 c003d6f0 c003d6f8 60000013 ffffffff
[    3.420000]  r7:00000003 r6:00000001 r5:fefff000 r4:ffffffff
[    3.420000] [<c003d494>] (vprintk+0x0/0x474) from [<c003d930>] (printk+0x28/0x30)
[    3.420000] [<c003d908>] (printk+0x0/0x30) from [<c01d3558>] (sock_register+0x64/0x84)
[    3.420000]  r3:00000001 r2:00000002 r1:00000001 r0:c02e4c88
[    3.420000] [<c01d34f4>] (sock_register+0x0/0x84) from [<c001b7bc>] (af_unix_init+0x40/0x88)
[    3.420000]  r5:00000000 r4:00000000
[    3.420000] [<c001b77c>] (af_unix_init+0x0/0x88) from [<c00087e4>] (kernel_init+0x9c/0x2b4)
[    3.420000]  r4:00000000
[    3.420000] [<c0008748>] (kernel_init+0x0/0x2b4) from [<c0040338>] (do_exit+0x0/0x8e8)

^ permalink raw reply

* [PATCH] ucc_geth: suppress 'qe_bd_t' undeclared error
From: Kim Phillips @ 2007-08-31 16:35 UTC (permalink / raw)
  To: netdev; +Cc: Li Yang

drivers/net/ucc_geth.c:2151: error: ‘qe_bd_t’ undeclared (first use in this function)

Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
---
Leo, this is for 2.6.23.

 drivers/net/ucc_geth.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 12e01b2..9a38dfe 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -2148,7 +2148,7 @@ static void ucc_geth_memclean(struct ucc_geth_private *ugeth)
 		for (j = 0; j < ugeth->ug_info->bdRingLenTx[i]; j++) {
 			if (ugeth->tx_skbuff[i][j]) {
 				dma_unmap_single(NULL,
-						 ((qe_bd_t *)bd)->buf,
+						 ((struct qe_bd *)bd)->buf,
 						 (in_be32((u32 *)bd) &
 						  BD_LENGTH_MASK),
 						 DMA_TO_DEVICE);
-- 
1.5.2.2


^ permalink raw reply related

* Re: wither bounds checking for networking sysctls
From: Rick Jones @ 2007-08-31 17:14 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Linux Network Development list
In-Reply-To: <20070830205939.2f85a567@localhost>

Stephen Hemminger wrote:
> On Thu, 30 Aug 2007 18:09:17 -0700
> Rick Jones <rick.jones2@hp.com> wrote:
> 
> 
>>While messing about with "sysctl_tcp_rto_min" I went back and forth a 
>>bit as to whether there should have been bounds checking (as did some of 
>>the folks who did some internal review for me).  That leads to the 
>>question - is it considered worthwhile to add a bit more bounds checking 
>>to sundry networking sysctls?
>>
>>rick jones
> 
> 
> IMHO As long as the any value from sysctl doesn't crash kernel, we
> should let it go. Enforcing RFC policy or inter-dependencies seems
> likes a useless exercise.

I was thinking more along the lines of more fundamental things - like 
precluding negative values when something is clearly a positive.

rick jones

^ permalink raw reply

* Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2
From: Rick Jones @ 2007-08-31 17:19 UTC (permalink / raw)
  To: John Heffner; +Cc: netdev
In-Reply-To: <46D79E91.2040004@psc.edu>

John Heffner wrote:
> Rick Jones wrote:
> 
>> Like I said the consumers of this are a triffle well, "anxious" :)
> 
> 
> Just curious, did you or this customer try with F-RTO enabled?  Or is 
> this case you're dealing with truly hopeless?

F-RTO was mentioned to the customer and I'm awaiting their response as 
to its efficacy in their situation.  Everything I've seen thusfar is 
leading me to believe that we'll still need a higher than 200 
millisecond minimum rto though.

rick

^ permalink raw reply

* Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus
From: Randy Dunlap @ 2007-08-31 17:25 UTC (permalink / raw)
  To: Simon Arlott, sam
  Cc: Linux Kernel Mailing List, Robert P. J. Day, Randy Dunlap,
	Stefan Richter, Adrian Bunk, Jeff Garzik, Gabriel C, netdev
In-Reply-To: <469FE045.3070403@simon.arlott.org.uk>

On Thu, 19 Jul 2007 23:05:57 +0100 Simon Arlott wrote:

> On 19/07/07 17:19, Robert P. J. Day wrote:
> > On Thu, 19 Jul 2007, Randy Dunlap wrote:
> >> I think that Stefan means a patch to the kconfig source code,
> >> not the the Kconfig files.  Good luck.  I'd still like to see it.
> > 
> > yes, i understand what he wanted now.  as a first step (that
> > theoretically shouldn't change any behaviour), i'd patch the Kconfig
> > structure to add a new attribute ("maturity") which would be allowed
> > to be set to *exactly one* of a pre-defined set of values (say,
> > OBSOLETE, DEPRECATED, EXPERIMENTAL, and STILLBLEEDING).  and that's
> > it, nothing more.
> > 
> > don't try to do anything with any of that just yet, just add the
> > infrastructure to support the (optional) association of a maturity
> > level with a config option.  that's step one.
> 
> What about something like this? I'm not sure if the addition to sym_init
> is desirable... I also had to prefix _ to the name for now otherwise it
> conflicts badly with the current symbols. It probably should stop
> "depends on _BROKEN" etc. too.

Hi Simon,

(sorry for the delay)

I like this patch very much... I just can't get it to build (on
2.6.23-rc4).

Thanks.

> ---
> diff --git a/init/Kconfig b/init/Kconfig
> diff --git a/net/Kconfig b/net/Kconfig
> index cdba08c..5e2f4db 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -6,6 +6,7 @@ menu "Networking"
> 
>  config NET
>  	bool "Networking support"
> +	maturity _BROKEN
>  	---help---
>  	  Unless you really know what you are doing, you should say Y here.
>  	  The reason is that some programs need kernel networking support even
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
> index fb2bb30..1dea08e 100644
> --- a/scripts/kconfig/Makefile
> +++ b/scripts/kconfig/Makefile
> @@ -256,7 +256,7 @@ $(obj)/lkc_defs.h: $(src)/lkc_proto.h
>  # The following requires flex/bison/gperf
>  # By default we use the _shipped versions, uncomment the following line if
>  # you are modifying the flex/bison src.
> -# LKC_GENPARSER := 1
> +LKC_GENPARSER := 1
> 
>  ifdef LKC_GENPARSER
> 
> diff --git a/scripts/kconfig/conf.c b/scripts/kconfig/conf.c
> index 1199baf..cc9be0e 100644
> --- a/scripts/kconfig/conf.c
> +++ b/scripts/kconfig/conf.c
> @@ -211,6 +211,22 @@ static int conf_sym(struct menu *menu)
> 
>  	while (1) {
>  		printf("%*s%s ", indent - 1, "", menu->prompt->text);
> +		switch (sym->maturity) {
> +		case M_EXPERIMENTAL:
> +			printf("(EXPERIMENTAL) ");
> +			break;
> +		case M_DEPRECATED:
> +			printf("(DEPRECATED) ");
> +			break;
> +		case M_OBSOLETE:
> +			printf("(OBSOLETE) ");
> +			break;
> +		case M_BROKEN:
> +			printf("(BROKEN) ");
> +			break;
> +		default:
> +			break;
> +		}
>  		if (sym->name)
>  			printf("(%s) ", sym->name);
>  		type = sym_get_type(sym);
> diff --git a/scripts/kconfig/expr.h b/scripts/kconfig/expr.h
> index 6084525..a22b6c1 100644
> --- a/scripts/kconfig/expr.h
> +++ b/scripts/kconfig/expr.h
> @@ -60,7 +60,11 @@ struct symbol_value {
>  };
> 
>  enum symbol_type {
> -	S_UNKNOWN, S_BOOLEAN, S_TRISTATE, S_INT, S_HEX, S_STRING, S_OTHER
> +	S_UNKNOWN, S_BOOLEAN, S_TRISTATE, S_INT, S_HEX, S_STRING, S_MATURITY,
> S_OTHER
> +};
> +
> +enum maturity_level {
> +	M_NONE, M_EXPERIMENTAL, M_DEPRECATED, M_OBSOLETE, M_BROKEN
>  };
> 
>  enum {
> @@ -72,6 +76,7 @@ struct symbol {
>  	struct symbol *next;
>  	char *name;
>  	char *help;
> +	enum maturity_level maturity;
>  	enum symbol_type type;
>  	struct symbol_value curr;
>  	struct symbol_value def[4];
> diff --git a/scripts/kconfig/lex.zconf.c_shipped
> b/scripts/kconfig/lex.zconf.c_shipped
> diff --git a/scripts/kconfig/lkc.h b/scripts/kconfig/lkc.h
> index 8a07ee4..9add1cd 100644
> --- a/scripts/kconfig/lkc.h
> +++ b/scripts/kconfig/lkc.h
> @@ -79,6 +79,7 @@ void menu_end_menu(void);
>  void menu_add_entry(struct symbol *sym);
>  void menu_end_entry(void);
>  void menu_add_dep(struct expr *dep);
> +void menu_set_maturity(struct symbol *sym);
>  struct property *menu_add_prop(enum prop_type type, char *prompt,
> struct expr *expr, struct expr *dep);
>  struct property *menu_add_prompt(enum prop_type type, char *prompt,
> struct expr *dep);
>  void menu_add_expr(enum prop_type type, struct expr *expr, struct expr
> *dep);
> diff --git a/scripts/kconfig/mconf.c b/scripts/kconfig/mconf.c
> index d0e4fa5..eaf199b 100644
> --- a/scripts/kconfig/mconf.c
> +++ b/scripts/kconfig/mconf.c
> @@ -238,6 +238,7 @@ search_help[] = N_(
>  	"Result:\n"
>  	"-----------------------------------------------------------------\n"
>  	"Symbol: FOO [=m]\n"
> +	"Maturity: FOO\n"
>  	"Prompt: Foo bus is used to drive the bar HW\n"
>  	"Defined at drivers/pci/Kconfig:47\n"
>  	"Depends on: X86_LOCAL_APIC && X86_IO_APIC || IA64\n"
> @@ -359,6 +360,27 @@ static void get_symbol_str(struct gstr *r, struct
> symbol *sym)
> 
>  	str_printf(r, "Symbol: %s [=%s]\n", sym->name,
>  	                               sym_get_string_value(sym));
> +
> +	if (sym->maturity != M_NONE) {
> +		str_append(r, "Maturity: ");
> +		switch (sym->maturity) {
> +		case M_EXPERIMENTAL:
> +			str_append(r, "EXPERIMENTAL\n");
> +			break;
> +		case M_DEPRECATED:
> +			str_append(r, "DEPRECATED\n");
> +			break;
> +		case M_OBSOLETE:
> +			str_append(r, "OBSOLETE\n");
> +			break;
> +		case M_BROKEN:
> +			str_append(r, "BROKEN\n");
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +
>  	for_all_prompts(sym, prop)
>  		get_prompt_str(r, prop);
>  	hit = false;
> diff --git a/scripts/kconfig/menu.c b/scripts/kconfig/menu.c
> index f14aeac..3e37e5b 100644
> --- a/scripts/kconfig/menu.c
> +++ b/scripts/kconfig/menu.c
> @@ -104,6 +104,15 @@ void menu_add_dep(struct expr *dep)
>  	current_entry->dep = expr_alloc_and(current_entry->dep,
> menu_check_dep(dep));
>  }
> 
> +void menu_set_maturity(struct symbol *sym)
> +{
> +	if (sym->type != S_MATURITY) {
> +		zconf_error("'%s' is an invalid maturity level", sym->name);
> +	} else {
> +		current_entry->sym->maturity = sym->maturity;
> +	}
> +}
> +
>  void menu_set_type(int type)
>  {
>  	struct symbol *sym = current_entry->sym;
> diff --git a/scripts/kconfig/symbol.c b/scripts/kconfig/symbol.c
> index c35dcc5..280ee8f 100644
> --- a/scripts/kconfig/symbol.c
> +++ b/scripts/kconfig/symbol.c
> @@ -72,6 +72,22 @@ void sym_init(void)
>  	sym->type = S_STRING;
>  	sym->flags |= SYMBOL_AUTO;
>  	sym_add_default(sym, uts.release);
> +
> +	sym = sym_lookup("_EXPERIMENTAL", 0);
> +	sym->type = S_MATURITY;
> +	sym->maturity = M_EXPERIMENTAL;
> +
> +	sym = sym_lookup("_DEPRECATED", 0);
> +	sym->type = S_MATURITY;
> +	sym->maturity = M_DEPRECATED;
> +
> +	sym = sym_lookup("_OBSOLETE", 0);
> +	sym->type = S_MATURITY;
> +	sym->maturity = M_OBSOLETE;
> +
> +	sym = sym_lookup("_BROKEN", 0);
> +	sym->type = S_MATURITY;
> +	sym->maturity = M_BROKEN;
>  }
> 
>  enum symbol_type sym_get_type(struct symbol *sym)
> @@ -100,6 +116,8 @@ const char *sym_type_name(enum symbol_type type)
>  		return "hex";
>  	case S_STRING:
>  		return "string";
> +	case S_MATURITY:
> +		return "maturity";
>  	case S_UNKNOWN:
>  		return "unknown";
>  	case S_OTHER:
> @@ -679,6 +697,7 @@ struct symbol *sym_lookup(const char *name, int isconst)
>  	memset(symbol, 0, sizeof(*symbol));
>  	symbol->name = new_name;
>  	symbol->type = S_UNKNOWN;
> +	symbol->maturity = M_NONE;
>  	if (isconst)
>  		symbol->flags |= SYMBOL_CONST;
> 
> diff --git a/scripts/kconfig/zconf.gperf b/scripts/kconfig/zconf.gperf
> index 9b44c80..756d559 100644
> --- a/scripts/kconfig/zconf.gperf
> +++ b/scripts/kconfig/zconf.gperf
> @@ -25,6 +25,7 @@ endif,		T_ENDIF,	TF_COMMAND
>  depends,	T_DEPENDS,	TF_COMMAND
>  requires,	T_REQUIRES,	TF_COMMAND
>  optional,	T_OPTIONAL,	TF_COMMAND
> +maturity,	T_MATURITY,	TF_COMMAND
>  default,	T_DEFAULT,	TF_COMMAND, S_UNKNOWN
>  prompt,		T_PROMPT,	TF_COMMAND
>  tristate,	T_TYPE,		TF_COMMAND, S_TRISTATE
> diff --git a/scripts/kconfig/zconf.y b/scripts/kconfig/zconf.y
> index 92eb02b..1b47966 100644
> --- a/scripts/kconfig/zconf.y
> +++ b/scripts/kconfig/zconf.y
> @@ -66,6 +66,7 @@ static struct menu *current_menu, *current_entry;
>  %token <id>T_DEPENDS
>  %token <id>T_REQUIRES
>  %token <id>T_OPTIONAL
> +%token <id>T_MATURITY
>  %token <id>T_PROMPT
>  %token <id>T_TYPE
>  %token <id>T_DEFAULT
> @@ -120,7 +121,7 @@ stmt_list:
>  ;
> 
>  option_name:
> -	T_DEPENDS | T_PROMPT | T_TYPE | T_SELECT | T_OPTIONAL | T_RANGE |
> T_DEFAULT
> +	T_DEPENDS | T_MATURITY | T_PROMPT | T_TYPE | T_SELECT | T_OPTIONAL |
> T_RANGE | T_DEFAULT
>  ;
> 
>  common_stmt:
> @@ -177,6 +178,7 @@ config_option_list:
>  	| config_option_list config_option
>  	| config_option_list symbol_option
>  	| config_option_list depends
> +	| config_option_list maturity
>  	| config_option_list help
>  	| config_option_list option_error
>  	| config_option_list T_EOL
> @@ -269,6 +271,7 @@ choice_option_list:
>  	  /* empty */
>  	| choice_option_list choice_option
>  	| choice_option_list depends
> +	| choice_option_list maturity
>  	| choice_option_list help
>  	| choice_option_list T_EOL
>  	| choice_option_list option_error
> @@ -349,7 +352,7 @@ menu: T_MENU prompt T_EOL
>  	printd(DEBUG_PARSE, "%s:%d:menu\n", zconf_curname(), zconf_lineno());
>  };
> 
> -menu_entry: menu depends_list
> +menu_entry: menu maturity_set_opt depends_list
>  {
>  	$$ = menu_add_menu();
>  };
> @@ -430,6 +433,19 @@ depends: T_DEPENDS T_ON expr T_EOL
>  	printd(DEBUG_PARSE, "%s:%d:requires\n", zconf_curname(), zconf_lineno());
>  };
> 
> +/* maturity setting */
> +
> +maturity_set_opt:
> +	  /* empty */
> +	| maturity
> +;
> +
> +maturity: T_MATURITY symbol T_EOL
> +{
> +	menu_set_maturity($2);
> +	printd(DEBUG_PARSE, "%s:%d:maturity\n", zconf_curname(), zconf_lineno());
> +};
> +
>  /* prompt statement */
> 
>  prompt_stmt_opt:
> @@ -519,6 +535,7 @@ const char *zconf_tokenname(int token)
>  	case T_IF:		return "if";
>  	case T_ENDIF:		return "endif";
>  	case T_DEPENDS:		return "depends";
> +	case T_MATURITY:	return "maturity";
>  	}
>  	return "<token>";
>  }
> @@ -615,6 +632,9 @@ void print_symbol(FILE *out, struct menu *menu)
>  	case S_HEX:
>  		fputs("  hex\n", out);
>  		break;
> +	case S_MATURITY:
> +		fputs("  maturity\n", out);
> +		break;
>  	default:
>  		fputs("  ???\n", out);
>  		break;
> 
> 
> -- 

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: [1/1] Block device throttling [Re: Distributed storage.]
From: Evgeniy Polyakov @ 2007-08-31 17:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jens Axboe, netdev, linux-kernel, linux-fsdevel, Peter Zijlstra
In-Reply-To: <200708301620.37965.phillips@phunq.net>

Hi Daniel.

On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips (phillips@phunq.net) wrote:
> On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> > Then, if of course you will want, which I doubt, you can reread
> > previous mails and find that it was pointed to that race and
> > possibilities to solve it way too long ago.
> 
> What still bothers me about your response is that, while you know the 
> race exists and do not disagree with my example, you don't seem to see 
> that that race can eventually lock up the block device by repeatedly 
> losing throttle counts which are never recovered.  What prevents that?

I posted a trivial hack with pointed possible errors and a question
about should it be further extended (and race fixed by any of the
possible methods and so on) or new one should be developed (like in your
approach when only high level device is charged), instead I got replies
that it contains bugs, whcih will stop system and kill gene pool of the
mankind. I know how it works and where problems are. And if we are going
with this approach I will fix pointed issues.

> > > --- 2.6.22.clean/block/ll_rw_blk.c	2007-07-08 16:32:17.000000000
> > > -0700 +++ 2.6.22/block/ll_rw_blk.c	2007-08-24 12:07:16.000000000
> > > -0700 @@ -3237,6 +3237,15 @@ end_io:
> > >   */
> > >  void generic_make_request(struct bio *bio)
> > >  {
> > > +	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > > +
> > > +	if (q && q->metric) {
> > > +		int need = bio->bi_reserved = q->metric(bio);
> > > +		bio->queue = q;
> >
> > In case you have stacked device, this entry will be rewritten and you
> > will lost all your account data.
> 
> It is a weakness all right.  Well,
> 
> -	if (q && q->metric) {
> +	if (q && q->metric && !bio->queue) {
> 
> which fixes that problem.  Maybe there is a better fix possible.  Thanks 
> for the catch!

Yes, it should.

> The original conception was that this block throttling would apply only 
> to the highest level submission of the bio, the one that crosses the 
> boundary between filesystem (or direct block device application) and 
> block layer.  Resubmitting a bio or submitting a dependent bio from 
> inside a block driver does not need to be throttled because all 
> resources required to guarantee completion must have been obtained 
> _before_ the bio was allowed to proceed into the block layer.

We still did not come to the conclusion, but I do not want to start a
flamewar, you believe that throttling must be done on the top level
device, so you need to extend bio and convince others that idea worth
it.

> The other principle we are trying to satisfy is that the throttling 
> should not be released until bio->endio, which I am not completely sure 
> about with the patch as modified above.  Your earlier idea of having 
> the throttle protection only cover the actual bio submission is 
> interesting and may be effective in some cases, in fact it may cover 
> the specific case of ddsnap.  But we don't have to look any further 
> than ddraid (distributed raid) to find a case it doesn't cover - the 
> additional memory allocated to hold parity data has to be reserved 
> until parity data is deallocated, long after the submission completes.
> So while you manage to avoid some logistical difficulties, it also looks 
> like you didn't solve the general problem.

Block layer does not know and should not be bothered with underlying
device nature - if you think that in endio callback limit should not be
rechardged, then provide your own layer on top of bio and thus call
endio callback only when you think it is ready to be completed.

> Hopefully I will be able to report on whether my patch actually works 
> soon, when I get back from vacation.  The mechanism in ddsnap this is 
> supposed to replace is effective, it is just ugly and tricky to verify.
> 
> Regards,
> 
> Daniel

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus
From: Robert P. J. Day @ 2007-08-31 17:23 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Simon Arlott, sam, Linux Kernel Mailing List, Stefan Richter,
	Adrian Bunk, Jeff Garzik, Gabriel C, netdev
In-Reply-To: <20070831102527.09fb42c0.rdunlap@xenotime.net>

On Fri, 31 Aug 2007, Randy Dunlap wrote:

> On Thu, 19 Jul 2007 23:05:57 +0100 Simon Arlott wrote:
>
> > On 19/07/07 17:19, Robert P. J. Day wrote:
> > > On Thu, 19 Jul 2007, Randy Dunlap wrote:
> > >> I think that Stefan means a patch to the kconfig source code,
> > >> not the the Kconfig files.  Good luck.  I'd still like to see it.
> > >
> > > yes, i understand what he wanted now.  as a first step (that
> > > theoretically shouldn't change any behaviour), i'd patch the Kconfig
> > > structure to add a new attribute ("maturity") which would be allowed
> > > to be set to *exactly one* of a pre-defined set of values (say,
> > > OBSOLETE, DEPRECATED, EXPERIMENTAL, and STILLBLEEDING).  and that's
> > > it, nothing more.
> > >
> > > don't try to do anything with any of that just yet, just add the
> > > infrastructure to support the (optional) association of a maturity
> > > level with a config option.  that's step one.
> >
> > What about something like this? I'm not sure if the addition to sym_init
> > is desirable... I also had to prefix _ to the name for now otherwise it
> > conflicts badly with the current symbols. It probably should stop
> > "depends on _BROKEN" etc. too.

i'm sure i'm going to get shouted down here, but i really disagree
with "BROKEN" being considered a "maturity level".  IMHO, things like
EXPERIMENTAL, DEPRECATED and OBSOLETE represent maturity levels, for
what i think are obvious reasons.

something like BROKEN, though, has *nothing* to do with maturity.  a
feature can be any of those maturity levels, and simultaneously be
BROKEN.  i consider BROKEN to be what i call a "status", and different
status levels might be the default of normal, or KIND_OF_FLAKY or
TOTALLY_BORKED -- that's where BROKEN would fit in.

*please* don't start mixing these different attributes.  it's just
going to make an irreversible mess of things.  leave BROKEN out of
this for the time being.

rday
-- 
========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://crashcourse.ca
========================================================================

^ permalink raw reply

* Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus
From: Jeff Garzik @ 2007-08-31 18:06 UTC (permalink / raw)
  To: Robert P. J. Day
  Cc: Randy Dunlap, Simon Arlott, sam, Linux Kernel Mailing List,
	Stefan Richter, Adrian Bunk, Gabriel C, netdev
In-Reply-To: <Pine.LNX.4.64.0708311318220.18158@localhost.localdomain>

Robert P. J. Day wrote:
> On Fri, 31 Aug 2007, Randy Dunlap wrote:
> 
>> On Thu, 19 Jul 2007 23:05:57 +0100 Simon Arlott wrote:
>>
>>> On 19/07/07 17:19, Robert P. J. Day wrote:
>>>> On Thu, 19 Jul 2007, Randy Dunlap wrote:
>>>>> I think that Stefan means a patch to the kconfig source code,
>>>>> not the the Kconfig files.  Good luck.  I'd still like to see it.
>>>> yes, i understand what he wanted now.  as a first step (that
>>>> theoretically shouldn't change any behaviour), i'd patch the Kconfig
>>>> structure to add a new attribute ("maturity") which would be allowed
>>>> to be set to *exactly one* of a pre-defined set of values (say,
>>>> OBSOLETE, DEPRECATED, EXPERIMENTAL, and STILLBLEEDING).  and that's
>>>> it, nothing more.
>>>>
>>>> don't try to do anything with any of that just yet, just add the
>>>> infrastructure to support the (optional) association of a maturity
>>>> level with a config option.  that's step one.
>>> What about something like this? I'm not sure if the addition to sym_init
>>> is desirable... I also had to prefix _ to the name for now otherwise it
>>> conflicts badly with the current symbols. It probably should stop
>>> "depends on _BROKEN" etc. too.
> 
> i'm sure i'm going to get shouted down here, but i really disagree
> with "BROKEN" being considered a "maturity level".  IMHO, things like
> EXPERIMENTAL, DEPRECATED and OBSOLETE represent maturity levels, for
> what i think are obvious reasons.
> 
> something like BROKEN, though, has *nothing* to do with maturity.  a
> feature can be any of those maturity levels, and simultaneously be
> BROKEN.  i consider BROKEN to be what i call a "status", and different
> status levels might be the default of normal, or KIND_OF_FLAKY or
> TOTALLY_BORKED -- that's where BROKEN would fit in.

BROKEN is definitely a maturity level.  A more accurate description 
would be BITROTTING perhaps.  The code in question has passed through 
bleeding -> experimental -> stable, and come out the other side.

In contrast, OBSOLETE and DEPRECATED reflect high-level status not code 
quality/maturity.

	Jeff




^ permalink raw reply

* Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2
From: Rick Jones @ 2007-08-31 18:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20070830.220911.41008322.davem@davemloft.net>

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 30 Aug 2007 18:07:13 -0700
> 
> 
>>Anyhow, I'll try grubbing around the source code (already doing that to 
>>see about writing a pet tcp cong module) but if pointers to the likely 
>>relevant files were available I could try to help thrash-out the routing 
>>metric version.  Like I said the consumers of this are a triffle well, 
>>"anxious" :)
> 
> 
> The change is actually a lot simpler than the sysctl version.
> 
> In fact it borders on trivial :-)
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index c91476c..dff3192 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -351,6 +351,8 @@ enum
>  #define RTAX_INITCWND RTAX_INITCWND
>  	RTAX_FEATURES,
>  #define RTAX_FEATURES RTAX_FEATURES
> +	RTAX_RTO_MIN,
> +#define RTAX_RTO_MIN RTAX_RTO_MIN
>  	__RTAX_MAX
>  };
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9785df3..1ee7212 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -555,6 +555,16 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
>  		tcp_grow_window(sk, skb);
>  }
>  
> +static u32 tcp_rto_min(struct sock *sk)
> +{
> +	struct dst_entry *dst = __sk_dst_get(sk);
> +	u32 rto_min = TCP_RTO_MIN;
> +
> +	if (dst_metric_locked(dst, RTAX_RTO_MIN))
> +		rto_min = dst->metrics[RTAX_RTO_MIN-1];
> +	return rto_min;
> +}
> +
>  /* Called to compute a smoothed rtt estimate. The data fed to this
>   * routine either comes from timestamps, or from segments that were
>   * known _not_ to have been retransmitted [see Karn/Partridge
> @@ -616,13 +626,13 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
>  			if (tp->mdev_max < tp->rttvar)
>  				tp->rttvar -= (tp->rttvar-tp->mdev_max)>>2;
>  			tp->rtt_seq = tp->snd_nxt;
> -			tp->mdev_max = TCP_RTO_MIN;
> +			tp->mdev_max = tcp_rto_min(sk);
>  		}
>  	} else {
>  		/* no previous measure. */
>  		tp->srtt = m<<3;	/* take the measured time to be rtt */
>  		tp->mdev = m<<1;	/* make sure rto = 3*rtt */
> -		tp->mdev_max = tp->rttvar = max(tp->mdev, TCP_RTO_MIN);
> +		tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
>  		tp->rtt_seq = tp->snd_nxt;
>  	}
>  }

At the risk of showing my ignorance (what me worry about that?-) I 
presume this is then an interface expecting to take-in jiffies?  That 
means the user has to know the value of HZ which can be (IIRC) one of 
three different values?

rick jones

^ permalink raw reply

* Re: [PATCH 2.6.24 2/3]S2io: Support for add/delete/store/restore ethernet addresses
From: David Miller @ 2007-08-31 18:36 UTC (permalink / raw)
  To: jeff; +Cc: Sreenivasa.Honnur, netdev, support
In-Reply-To: <46D81777.2090700@garzik.org>

From: Jeff Garzik <jeff@garzik.org>
Date: Fri, 31 Aug 2007 09:28:23 -0400

> Sreenivasa Honnur wrote:
> > - Support to add/delete/store/restore 64 and 128 Ethernet addresses for
> > Xframe I and Xframe II respectively.
> > 
> > Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@neterion.com>
> > Signed-off-by: Ramkrishna Vepa <ram.vepa@neterion.com>
> 
> What is the purpose of this?  We do not support more than one unicast 
> MAC address at present...

Yes we do, the card can register with the card in ->set_rx_mode() or
enable promiscuous mode if it has no alternate MAC facility.

See netdev->uc_list.

^ permalink raw reply

* Re: [PATCH 2.6.24 2/3]S2io: Support for add/delete/store/restore ethernet addresses
From: Jeff Garzik @ 2007-08-31 18:38 UTC (permalink / raw)
  To: David Miller; +Cc: Sreenivasa.Honnur, netdev, support
In-Reply-To: <20070831.113604.29572801.davem@davemloft.net>

David Miller wrote:
> From: Jeff Garzik <jeff@garzik.org>
> Date: Fri, 31 Aug 2007 09:28:23 -0400
> 
>> Sreenivasa Honnur wrote:
>>> - Support to add/delete/store/restore 64 and 128 Ethernet addresses for
>>> Xframe I and Xframe II respectively.
>>>
>>> Signed-off-by: Sreenivasa Honnur <sreenivasa.honnur@neterion.com>
>>> Signed-off-by: Ramkrishna Vepa <ram.vepa@neterion.com>
>> What is the purpose of this?  We do not support more than one unicast 
>> MAC address at present...
> 
> Yes we do, the card can register with the card in ->set_rx_mode() or
> enable promiscuous mode if it has no alternate MAC facility.
> 
> See netdev->uc_list.

Ah, well objection removed then.  I stand corrected.  Didn't know it was 
in yet.

	Jeff




^ permalink raw reply

* [git patches] net driver fixes
From: Jeff Garzik @ 2007-08-31 18:46 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: netdev, LKML


Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream-linus

to receive the following updates:

 drivers/infiniband/hw/cxgb3/cxio_hal.c |    2 +-
 drivers/net/cxgb3/adapter.h            |    2 +
 drivers/net/cxgb3/common.h             |    3 +-
 drivers/net/cxgb3/cxgb3_main.c         |  252 +++++++++++++++++++-------------
 drivers/net/cxgb3/cxgb3_offload.c      |   16 ++-
 drivers/net/cxgb3/cxgb3_offload.h      |    2 +
 drivers/net/cxgb3/sge.c                |   23 ++-
 drivers/net/cxgb3/t3_hw.c              |   46 +++++-
 drivers/net/cxgb3/t3cdev.h             |    3 -
 drivers/net/ioc3-eth.c                 |   80 ++++++++---
 drivers/net/netxen/netxen_nic_hdr.h    |    6 +-
 drivers/net/netxen/netxen_nic_hw.c     |    8 +-
 drivers/net/netxen/netxen_nic_main.c   |   19 +--
 drivers/net/ps3_gelic_net.c            |    1 -
 drivers/s390/net/qeth.h                |    4 +-
 drivers/s390/net/qeth_main.c           |  158 +++++++++++++++-----
 drivers/s390/net/qeth_mpc.h            |    1 +
 drivers/s390/net/qeth_sys.c            |    8 +-
 18 files changed, 428 insertions(+), 206 deletions(-)

Divy Le Ray (2):
      cxgb3 - Fix dev->priv usage
      - cxgb3 engine microcode load

Frank Blaschka (2):
      qeth: enforce a rate limit for inbound scatter gather messages
      qeth: Announce tx checksumming for qeth devices in TSO/EDDP mode

Heiko Carstens (1):
      qeth: dont return the return values of void functions.

Klaus D. Wacker (1):
      qeth: Drop ARP packages on HiperSockets interface with NOARP attribute.

Masakazu Mokuno (1):
      PS3: fix the bug that 'ifconfig down' would hang

Ralf Baechle (1):
      IOC3: Program UART predividers.

Ursula Braun (3):
      qeth: ungrouping a device must not be interruptible
      qeth: crash during reboot after failing online setting
      qeth: provide specific message for OSA-adapters exclusively used

dhananjay@netxen.com (2):
      netxen: Avoid firmware load in PCI probe
      netxen: fix crashes during module unload

diff --git a/drivers/infiniband/hw/cxgb3/cxio_hal.c b/drivers/infiniband/hw/cxgb3/cxio_hal.c
index 1518b41..beb2a38 100644
--- a/drivers/infiniband/hw/cxgb3/cxio_hal.c
+++ b/drivers/infiniband/hw/cxgb3/cxio_hal.c
@@ -916,7 +916,7 @@ int cxio_rdev_open(struct cxio_rdev *rdev_p)
 	PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name);
 	memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp));
 	if (!rdev_p->t3cdev_p)
-		rdev_p->t3cdev_p = T3CDEV(netdev_p);
+		rdev_p->t3cdev_p = dev2t3cdev(netdev_p);
 	rdev_p->t3cdev_p->ulp = (void *) rdev_p;
 	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS,
 					 &(rdev_p->rnic_info));
diff --git a/drivers/net/cxgb3/adapter.h b/drivers/net/cxgb3/adapter.h
index ab72563..20e887d 100644
--- a/drivers/net/cxgb3/adapter.h
+++ b/drivers/net/cxgb3/adapter.h
@@ -50,7 +50,9 @@ typedef irqreturn_t(*intr_handler_t) (int, void *);
 
 struct vlan_group;
 
+struct adapter;
 struct port_info {
+	struct adapter *adapter;
 	struct vlan_group *vlan_grp;
 	const struct port_type_info *port_type;
 	u8 port_id;
diff --git a/drivers/net/cxgb3/common.h b/drivers/net/cxgb3/common.h
index 1637800..2129210 100644
--- a/drivers/net/cxgb3/common.h
+++ b/drivers/net/cxgb3/common.h
@@ -679,7 +679,8 @@ const struct adapter_info *t3_get_adapter_info(unsigned int board_id);
 int t3_seeprom_read(struct adapter *adapter, u32 addr, u32 *data);
 int t3_seeprom_write(struct adapter *adapter, u32 addr, u32 data);
 int t3_seeprom_wp(struct adapter *adapter, int enable);
-int t3_check_tpsram_version(struct adapter *adapter);
+int t3_get_tp_version(struct adapter *adapter, u32 *vers);
+int t3_check_tpsram_version(struct adapter *adapter, int *must_load);
 int t3_check_tpsram(struct adapter *adapter, u8 *tp_ram, unsigned int size);
 int t3_set_proto_sram(struct adapter *adap, u8 *data);
 int t3_read_flash(struct adapter *adapter, unsigned int addr,
diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index dc5d269..5ab319c 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -358,11 +358,14 @@ static int init_dummy_netdevs(struct adapter *adap)
 
 		for (j = 0; j < pi->nqsets - 1; j++) {
 			if (!adap->dummy_netdev[dummy_idx]) {
-				nd = alloc_netdev(0, "", ether_setup);
+				struct port_info *p;
+
+				nd = alloc_netdev(sizeof(*p), "", ether_setup);
 				if (!nd)
 					goto free_all;
 
-				nd->priv = adap;
+				p = netdev_priv(nd);
+				p->adapter = adap;
 				nd->weight = 64;
 				set_bit(__LINK_STATE_START, &nd->state);
 				adap->dummy_netdev[dummy_idx] = nd;
@@ -482,7 +485,8 @@ static ssize_t attr_store(struct device *d, struct device_attribute *attr,
 #define CXGB3_SHOW(name, val_expr) \
 static ssize_t format_##name(struct net_device *dev, char *buf) \
 { \
-	struct adapter *adap = dev->priv; \
+	struct port_info *pi = netdev_priv(dev); \
+	struct adapter *adap = pi->adapter; \
 	return sprintf(buf, "%u\n", val_expr); \
 } \
 static ssize_t show_##name(struct device *d, struct device_attribute *attr, \
@@ -493,7 +497,8 @@ static ssize_t show_##name(struct device *d, struct device_attribute *attr, \
 
 static ssize_t set_nfilters(struct net_device *dev, unsigned int val)
 {
-	struct adapter *adap = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adap = pi->adapter;
 	int min_tids = is_offload(adap) ? MC5_MIN_TIDS : 0;
 
 	if (adap->flags & FULL_INIT_DONE)
@@ -515,7 +520,8 @@ static ssize_t store_nfilters(struct device *d, struct device_attribute *attr,
 
 static ssize_t set_nservers(struct net_device *dev, unsigned int val)
 {
-	struct adapter *adap = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adap = pi->adapter;
 
 	if (adap->flags & FULL_INIT_DONE)
 		return -EBUSY;
@@ -556,9 +562,10 @@ static struct attribute_group cxgb3_attr_group = {.attrs = cxgb3_attrs };
 static ssize_t tm_attr_show(struct device *d, struct device_attribute *attr,
 			    char *buf, int sched)
 {
-	ssize_t len;
+	struct port_info *pi = netdev_priv(to_net_dev(d));
+	struct adapter *adap = pi->adapter;
 	unsigned int v, addr, bpt, cpt;
-	struct adapter *adap = to_net_dev(d)->priv;
+	ssize_t len;
 
 	addr = A_TP_TX_MOD_Q1_Q0_RATE_LIMIT - sched / 2;
 	rtnl_lock();
@@ -581,10 +588,11 @@ static ssize_t tm_attr_show(struct device *d, struct device_attribute *attr,
 static ssize_t tm_attr_store(struct device *d, struct device_attribute *attr,
 			     const char *buf, size_t len, int sched)
 {
+	struct port_info *pi = netdev_priv(to_net_dev(d));
+	struct adapter *adap = pi->adapter;
+	unsigned int val;
 	char *endp;
 	ssize_t ret;
-	unsigned int val;
-	struct adapter *adap = to_net_dev(d)->priv;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
@@ -721,6 +729,7 @@ static void bind_qsets(struct adapter *adap)
 }
 
 #define FW_FNAME "t3fw-%d.%d.%d.bin"
+#define TPSRAM_NAME "t3%c_protocol_sram-%d.%d.%d.bin"
 
 static int upgrade_fw(struct adapter *adap)
 {
@@ -739,6 +748,71 @@ static int upgrade_fw(struct adapter *adap)
 	}
 	ret = t3_load_fw(adap, fw->data, fw->size);
 	release_firmware(fw);
+
+	if (ret == 0)
+		dev_info(dev, "successful upgrade to firmware %d.%d.%d\n",
+			 FW_VERSION_MAJOR, FW_VERSION_MINOR, FW_VERSION_MICRO);
+	else
+		dev_err(dev, "failed to upgrade to firmware %d.%d.%d\n",
+			FW_VERSION_MAJOR, FW_VERSION_MINOR, FW_VERSION_MICRO);
+	
+	return ret;
+}
+
+static inline char t3rev2char(struct adapter *adapter)
+{
+	char rev = 0;
+
+	switch(adapter->params.rev) {
+	case T3_REV_B:
+	case T3_REV_B2:
+		rev = 'b';
+		break;
+	}
+	return rev;
+}
+
+int update_tpsram(struct adapter *adap)
+{
+	const struct firmware *tpsram;
+	char buf[64];
+	struct device *dev = &adap->pdev->dev;
+	int ret;
+	char rev;
+	
+	rev = t3rev2char(adap);
+	if (!rev)
+		return 0;
+
+	snprintf(buf, sizeof(buf), TPSRAM_NAME, rev,
+		 TP_VERSION_MAJOR, TP_VERSION_MINOR, TP_VERSION_MICRO);
+
+	ret = request_firmware(&tpsram, buf, dev);
+	if (ret < 0) {
+		dev_err(dev, "could not load TP SRAM: unable to load %s\n",
+			buf);
+		return ret;
+	}
+	
+	ret = t3_check_tpsram(adap, tpsram->data, tpsram->size);
+	if (ret)
+		goto release_tpsram;	
+
+	ret = t3_set_proto_sram(adap, tpsram->data);
+	if (ret == 0)
+		dev_info(dev,
+			 "successful update of protocol engine "
+			 "to %d.%d.%d\n",
+			 TP_VERSION_MAJOR, TP_VERSION_MINOR, TP_VERSION_MICRO);
+	else
+		dev_err(dev, "failed to update of protocol engine %d.%d.%d\n",
+			TP_VERSION_MAJOR, TP_VERSION_MINOR, TP_VERSION_MICRO);
+	if (ret)
+		dev_err(dev, "loading protocol SRAM failed\n");
+
+release_tpsram:
+	release_firmware(tpsram);
+	
 	return ret;
 }
 
@@ -755,6 +829,7 @@ static int upgrade_fw(struct adapter *adap)
 static int cxgb_up(struct adapter *adap)
 {
 	int err = 0;
+	int must_load;
 
 	if (!(adap->flags & FULL_INIT_DONE)) {
 		err = t3_check_fw_version(adap);
@@ -763,6 +838,13 @@ static int cxgb_up(struct adapter *adap)
 		if (err)
 			goto out;
 
+		err = t3_check_tpsram_version(adap, &must_load);
+		if (err == -EINVAL) {
+			err = update_tpsram(adap);
+			if (err && must_load)
+				goto out;
+		}
+
 		err = init_dummy_netdevs(adap);
 		if (err)
 			goto out;
@@ -858,8 +940,9 @@ static void schedule_chk_task(struct adapter *adap)
 
 static int offload_open(struct net_device *dev)
 {
-	struct adapter *adapter = dev->priv;
-	struct t3cdev *tdev = T3CDEV(dev);
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
+	struct t3cdev *tdev = dev2t3cdev(dev);
 	int adap_up = adapter->open_device_map & PORT_MASK;
 	int err = 0;
 
@@ -924,10 +1007,10 @@ static int offload_close(struct t3cdev *tdev)
 
 static int cxgb_open(struct net_device *dev)
 {
-	int err;
-	struct adapter *adapter = dev->priv;
 	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	int other_ports = adapter->open_device_map & PORT_MASK;
+	int err;
 
 	if (!adapter->open_device_map && (err = cxgb_up(adapter)) < 0)
 		return err;
@@ -951,17 +1034,17 @@ static int cxgb_open(struct net_device *dev)
 
 static int cxgb_close(struct net_device *dev)
 {
-	struct adapter *adapter = dev->priv;
-	struct port_info *p = netdev_priv(dev);
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 
-	t3_port_intr_disable(adapter, p->port_id);
+	t3_port_intr_disable(adapter, pi->port_id);
 	netif_stop_queue(dev);
-	p->phy.ops->power_down(&p->phy, 1);
+	pi->phy.ops->power_down(&pi->phy, 1);
 	netif_carrier_off(dev);
-	t3_mac_disable(&p->mac, MAC_DIRECTION_TX | MAC_DIRECTION_RX);
+	t3_mac_disable(&pi->mac, MAC_DIRECTION_TX | MAC_DIRECTION_RX);
 
 	spin_lock(&adapter->work_lock);	/* sync with update task */
-	clear_bit(p->port_id, &adapter->open_device_map);
+	clear_bit(pi->port_id, &adapter->open_device_map);
 	spin_unlock(&adapter->work_lock);
 
 	if (!(adapter->open_device_map & PORT_MASK))
@@ -976,13 +1059,13 @@ static int cxgb_close(struct net_device *dev)
 
 static struct net_device_stats *cxgb_get_stats(struct net_device *dev)
 {
-	struct adapter *adapter = dev->priv;
-	struct port_info *p = netdev_priv(dev);
-	struct net_device_stats *ns = &p->netstats;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
+	struct net_device_stats *ns = &pi->netstats;
 	const struct mac_stats *pstats;
 
 	spin_lock(&adapter->stats_lock);
-	pstats = t3_mac_update_stats(&p->mac);
+	pstats = t3_mac_update_stats(&pi->mac);
 	spin_unlock(&adapter->stats_lock);
 
 	ns->tx_bytes = pstats->tx_octets;
@@ -1015,14 +1098,16 @@ static struct net_device_stats *cxgb_get_stats(struct net_device *dev)
 
 static u32 get_msglevel(struct net_device *dev)
 {
-	struct adapter *adapter = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 
 	return adapter->msg_enable;
 }
 
 static void set_msglevel(struct net_device *dev, u32 val)
 {
-	struct adapter *adapter = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 
 	adapter->msg_enable = val;
 }
@@ -1096,10 +1181,13 @@ static int get_eeprom_len(struct net_device *dev)
 
 static void get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
 {
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	u32 fw_vers = 0;
-	struct adapter *adapter = dev->priv;
+	u32 tp_vers = 0;
 
 	t3_get_fw_version(adapter, &fw_vers);
+	t3_get_tp_version(adapter, &tp_vers);
 
 	strcpy(info->driver, DRV_NAME);
 	strcpy(info->version, DRV_VERSION);
@@ -1108,11 +1196,14 @@ static void get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
 		strcpy(info->fw_version, "N/A");
 	else {
 		snprintf(info->fw_version, sizeof(info->fw_version),
-			 "%s %u.%u.%u",
+			 "%s %u.%u.%u TP %u.%u.%u",
 			 G_FW_VERSION_TYPE(fw_vers) ? "T" : "N",
 			 G_FW_VERSION_MAJOR(fw_vers),
 			 G_FW_VERSION_MINOR(fw_vers),
-			 G_FW_VERSION_MICRO(fw_vers));
+			 G_FW_VERSION_MICRO(fw_vers),
+			 G_TP_VERSION_MAJOR(tp_vers),
+			 G_TP_VERSION_MINOR(tp_vers),
+			 G_TP_VERSION_MICRO(tp_vers));
 	}
 }
 
@@ -1136,8 +1227,8 @@ static unsigned long collect_sge_port_stats(struct adapter *adapter,
 static void get_stats(struct net_device *dev, struct ethtool_stats *stats,
 		      u64 *data)
 {
-	struct adapter *adapter = dev->priv;
 	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	const struct mac_stats *s;
 
 	spin_lock(&adapter->stats_lock);
@@ -1205,7 +1296,8 @@ static inline void reg_block_dump(struct adapter *ap, void *buf,
 static void get_regs(struct net_device *dev, struct ethtool_regs *regs,
 		     void *buf)
 {
-	struct adapter *ap = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *ap = pi->adapter;
 
 	/*
 	 * Version scheme:
@@ -1246,8 +1338,9 @@ static int restart_autoneg(struct net_device *dev)
 
 static int cxgb3_phys_id(struct net_device *dev, u32 data)
 {
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	int i;
-	struct adapter *adapter = dev->priv;
 
 	if (data == 0)
 		data = 2;
@@ -1408,8 +1501,8 @@ static int set_rx_csum(struct net_device *dev, u32 data)
 
 static void get_sge_param(struct net_device *dev, struct ethtool_ringparam *e)
 {
-	const struct adapter *adapter = dev->priv;
-	const struct port_info *pi = netdev_priv(dev);
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	const struct qset_params *q = &adapter->params.sge.qset[pi->first_qset];
 
 	e->rx_max_pending = MAX_RX_BUFFERS;
@@ -1425,10 +1518,10 @@ static void get_sge_param(struct net_device *dev, struct ethtool_ringparam *e)
 
 static int set_sge_param(struct net_device *dev, struct ethtool_ringparam *e)
 {
-	int i;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	struct qset_params *q;
-	struct adapter *adapter = dev->priv;
-	const struct port_info *pi = netdev_priv(dev);
+	int i;
 
 	if (e->rx_pending > MAX_RX_BUFFERS ||
 	    e->rx_jumbo_pending > MAX_RX_JUMBO_BUFFERS ||
@@ -1457,7 +1550,8 @@ static int set_sge_param(struct net_device *dev, struct ethtool_ringparam *e)
 
 static int set_coalesce(struct net_device *dev, struct ethtool_coalesce *c)
 {
-	struct adapter *adapter = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	struct qset_params *qsp = &adapter->params.sge.qset[0];
 	struct sge_qset *qs = &adapter->sge.qs[0];
 
@@ -1471,7 +1565,8 @@ static int set_coalesce(struct net_device *dev, struct ethtool_coalesce *c)
 
 static int get_coalesce(struct net_device *dev, struct ethtool_coalesce *c)
 {
-	struct adapter *adapter = dev->priv;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	struct qset_params *q = adapter->params.sge.qset;
 
 	c->rx_coalesce_usecs = q->coalesce_usecs;
@@ -1481,8 +1576,9 @@ static int get_coalesce(struct net_device *dev, struct ethtool_coalesce *c)
 static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 		      u8 * data)
 {
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	int i, err = 0;
-	struct adapter *adapter = dev->priv;
 
 	u8 *buf = kmalloc(EEPROMSIZE, GFP_KERNEL);
 	if (!buf)
@@ -1501,10 +1597,11 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 		      u8 * data)
 {
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
+	u32 aligned_offset, aligned_len, *p;
 	u8 *buf;
 	int err = 0;
-	u32 aligned_offset, aligned_len, *p;
-	struct adapter *adapter = dev->priv;
 
 	if (eeprom->magic != EEPROM_MAGIC)
 		return -EINVAL;
@@ -1592,9 +1689,10 @@ static int in_range(int val, int lo, int hi)
 
 static int cxgb_extension_ioctl(struct net_device *dev, void __user *useraddr)
 {
-	int ret;
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	u32 cmd;
-	struct adapter *adapter = dev->priv;
+	int ret;
 
 	if (copy_from_user(&cmd, useraddr, sizeof(cmd)))
 		return -EFAULT;
@@ -1923,10 +2021,10 @@ static int cxgb_extension_ioctl(struct net_device *dev, void __user *useraddr)
 
 static int cxgb_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
 {
-	int ret, mmd;
-	struct adapter *adapter = dev->priv;
-	struct port_info *pi = netdev_priv(dev);
 	struct mii_ioctl_data *data = if_mii(req);
+	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
+	int ret, mmd;
 
 	switch (cmd) {
 	case SIOCGMIIPHY:
@@ -1994,9 +2092,9 @@ static int cxgb_ioctl(struct net_device *dev, struct ifreq *req, int cmd)
 
 static int cxgb_change_mtu(struct net_device *dev, int new_mtu)
 {
-	int ret;
-	struct adapter *adapter = dev->priv;
 	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
+	int ret;
 
 	if (new_mtu < 81)	/* accommodate SACK */
 		return -EINVAL;
@@ -2013,8 +2111,8 @@ static int cxgb_change_mtu(struct net_device *dev, int new_mtu)
 
 static int cxgb_set_mac_addr(struct net_device *dev, void *p)
 {
-	struct adapter *adapter = dev->priv;
 	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	struct sockaddr *addr = p;
 
 	if (!is_valid_ether_addr(addr->sa_data))
@@ -2050,8 +2148,8 @@ static void t3_synchronize_rx(struct adapter *adap, const struct port_info *p)
 
 static void vlan_rx_register(struct net_device *dev, struct vlan_group *grp)
 {
-	struct adapter *adapter = dev->priv;
 	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 
 	pi->vlan_grp = grp;
 	if (adapter->params.rev > 0)
@@ -2070,8 +2168,8 @@ static void vlan_rx_register(struct net_device *dev, struct vlan_group *grp)
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void cxgb_netpoll(struct net_device *dev)
 {
-	struct adapter *adapter = dev->priv;
 	struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	int qidx;
 
 	for (qidx = pi->first_qset; qidx < pi->first_qset + pi->nqsets; qidx++) {
@@ -2088,42 +2186,6 @@ static void cxgb_netpoll(struct net_device *dev)
 }
 #endif
 
-#define TPSRAM_NAME "t3%c_protocol_sram-%d.%d.%d.bin"
-int update_tpsram(struct adapter *adap)
-{
-	const struct firmware *tpsram;
-	char buf[64];
-	struct device *dev = &adap->pdev->dev;
-	int ret;
-	char rev;
-	
-	rev = adap->params.rev == T3_REV_B2 ? 'b' : 'a';
-
-	snprintf(buf, sizeof(buf), TPSRAM_NAME, rev,
-		 TP_VERSION_MAJOR, TP_VERSION_MINOR, TP_VERSION_MICRO);
-
-	ret = request_firmware(&tpsram, buf, dev);
-	if (ret < 0) {
-		dev_err(dev, "could not load TP SRAM: unable to load %s\n",
-			buf);
-		return ret;
-	}
-	
-	ret = t3_check_tpsram(adap, tpsram->data, tpsram->size);
-	if (ret)
-		goto release_tpsram;	
-
-	ret = t3_set_proto_sram(adap, tpsram->data);
-	if (ret)
-		dev_err(dev, "loading protocol SRAM failed\n");
-
-release_tpsram:
-	release_firmware(tpsram);
-	
-	return ret;
-}
-
-
 /*
  * Periodic accumulation of MAC statistics.
  */
@@ -2433,6 +2495,7 @@ static int __devinit init_one(struct pci_dev *pdev,
 
 		adapter->port[i] = netdev;
 		pi = netdev_priv(netdev);
+		pi->adapter = adapter;
 		pi->rx_csum_offload = 1;
 		pi->nqsets = 1;
 		pi->first_qset = i;
@@ -2442,7 +2505,6 @@ static int __devinit init_one(struct pci_dev *pdev,
 		netdev->irq = pdev->irq;
 		netdev->mem_start = mmio_start;
 		netdev->mem_end = mmio_start + mmio_len - 1;
-		netdev->priv = adapter;
 		netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
 		netdev->features |= NETIF_F_LLTX;
 		if (pci_using_dac)
@@ -2467,18 +2529,11 @@ static int __devinit init_one(struct pci_dev *pdev,
 		SET_ETHTOOL_OPS(netdev, &cxgb_ethtool_ops);
 	}
 
-	pci_set_drvdata(pdev, adapter->port[0]);
+	pci_set_drvdata(pdev, adapter);
 	if (t3_prep_adapter(adapter, ai, 1) < 0) {
 		err = -ENODEV;
 		goto out_free_dev;
 	}
-
-	err = t3_check_tpsram_version(adapter);
-	if (err == -EINVAL)
-		err = update_tpsram(adapter);
-
-	if (err)
-		goto out_free_dev;
 		
 	/*
 	 * The card is now ready to go.  If any errors occur during device
@@ -2547,11 +2602,10 @@ out_release_regions:
 
 static void __devexit remove_one(struct pci_dev *pdev)
 {
-	struct net_device *dev = pci_get_drvdata(pdev);
+	struct adapter *adapter = pci_get_drvdata(pdev);
 
-	if (dev) {
+	if (adapter) {
 		int i;
-		struct adapter *adapter = dev->priv;
 
 		t3_sge_stop(adapter);
 		sysfs_remove_group(&adapter->port[0]->dev.kobj,
diff --git a/drivers/net/cxgb3/cxgb3_offload.c b/drivers/net/cxgb3/cxgb3_offload.c
index e620ed4..bdff7ba 100644
--- a/drivers/net/cxgb3/cxgb3_offload.c
+++ b/drivers/net/cxgb3/cxgb3_offload.c
@@ -593,6 +593,16 @@ int cxgb3_alloc_stid(struct t3cdev *tdev, struct cxgb3_client *client,
 
 EXPORT_SYMBOL(cxgb3_alloc_stid);
 
+/* Get the t3cdev associated with a net_device */
+struct t3cdev *dev2t3cdev(struct net_device *dev)
+{
+	const struct port_info *pi = netdev_priv(dev);
+
+	return (struct t3cdev *)pi->adapter;
+}
+
+EXPORT_SYMBOL(dev2t3cdev);
+
 static int do_smt_write_rpl(struct t3cdev *dev, struct sk_buff *skb)
 {
 	struct cpl_smt_write_rpl *rpl = cplhdr(skb);
@@ -925,7 +935,7 @@ void cxgb_neigh_update(struct neighbour *neigh)
 	struct net_device *dev = neigh->dev;
 
 	if (dev && (is_offloading(dev))) {
-		struct t3cdev *tdev = T3CDEV(dev);
+		struct t3cdev *tdev = dev2t3cdev(dev);
 
 		BUG_ON(!tdev);
 		t3_l2t_update(tdev, neigh);
@@ -973,9 +983,9 @@ void cxgb_redirect(struct dst_entry *old, struct dst_entry *new)
 		       "device ignored.\n", __FUNCTION__);
 		return;
 	}
-	tdev = T3CDEV(olddev);
+	tdev = dev2t3cdev(olddev);
 	BUG_ON(!tdev);
-	if (tdev != T3CDEV(newdev)) {
+	if (tdev != dev2t3cdev(newdev)) {
 		printk(KERN_WARNING "%s: Redirect to different "
 		       "offload device ignored.\n", __FUNCTION__);
 		return;
diff --git a/drivers/net/cxgb3/cxgb3_offload.h b/drivers/net/cxgb3/cxgb3_offload.h
index f15446a..7a37913 100644
--- a/drivers/net/cxgb3/cxgb3_offload.h
+++ b/drivers/net/cxgb3/cxgb3_offload.h
@@ -51,6 +51,8 @@ void cxgb3_offload_deactivate(struct adapter *adapter);
 
 void cxgb3_set_dummy_ops(struct t3cdev *dev);
 
+struct t3cdev *dev2t3cdev(struct net_device *dev);
+
 /*
  * Client registration.  Users of T3 driver must register themselves.
  * The T3 driver will call the add function of every client for each T3
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index a2cfd68..58a5f60 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1073,7 +1073,7 @@ int t3_eth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	unsigned int ndesc, pidx, credits, gen, compl;
 	const struct port_info *pi = netdev_priv(dev);
-	struct adapter *adap = dev->priv;
+	struct adapter *adap = pi->adapter;
 	struct sge_qset *qs = dev2qset(dev);
 	struct sge_txq *q = &qs->txq[TXQ_ETH];
 
@@ -1326,7 +1326,8 @@ static void restart_ctrlq(unsigned long data)
 	struct sk_buff *skb;
 	struct sge_qset *qs = (struct sge_qset *)data;
 	struct sge_txq *q = &qs->txq[TXQ_CTRL];
-	struct adapter *adap = qs->netdev->priv;
+	const struct port_info *pi = netdev_priv(qs->netdev);
+	struct adapter *adap = pi->adapter;
 
 	spin_lock(&q->lock);
       again:reclaim_completed_tx_imm(q);
@@ -1531,7 +1532,8 @@ static void restart_offloadq(unsigned long data)
 	struct sk_buff *skb;
 	struct sge_qset *qs = (struct sge_qset *)data;
 	struct sge_txq *q = &qs->txq[TXQ_OFLD];
-	struct adapter *adap = qs->netdev->priv;
+	const struct port_info *pi = netdev_priv(qs->netdev);
+	struct adapter *adap = pi->adapter;
 
 	spin_lock(&q->lock);
       again:reclaim_completed_tx(adap, q);
@@ -1675,7 +1677,8 @@ static inline void deliver_partial_bundle(struct t3cdev *tdev,
  */
 static int ofld_poll(struct net_device *dev, int *budget)
 {
-	struct adapter *adapter = dev->priv;
+	const struct port_info *pi = netdev_priv(dev);
+	struct adapter *adapter = pi->adapter;
 	struct sge_qset *qs = dev2qset(dev);
 	struct sge_rspq *q = &qs->rspq;
 	int work_done, limit = min(*budget, dev->quota), avail = limit;
@@ -2075,7 +2078,8 @@ static inline int is_pure_response(const struct rsp_desc *r)
  */
 static int napi_rx_handler(struct net_device *dev, int *budget)
 {
-	struct adapter *adap = dev->priv;
+	const struct port_info *pi = netdev_priv(dev);
+	struct adapter *adap = pi->adapter;
 	struct sge_qset *qs = dev2qset(dev);
 	int effective_budget = min(*budget, dev->quota);
 
@@ -2205,7 +2209,8 @@ static inline int handle_responses(struct adapter *adap, struct sge_rspq *q)
 irqreturn_t t3_sge_intr_msix(int irq, void *cookie)
 {
 	struct sge_qset *qs = cookie;
-	struct adapter *adap = qs->netdev->priv;
+	const struct port_info *pi = netdev_priv(qs->netdev);
+	struct adapter *adap = pi->adapter;
 	struct sge_rspq *q = &qs->rspq;
 
 	spin_lock(&q->lock);
@@ -2224,7 +2229,8 @@ irqreturn_t t3_sge_intr_msix(int irq, void *cookie)
 irqreturn_t t3_sge_intr_msix_napi(int irq, void *cookie)
 {
 	struct sge_qset *qs = cookie;
-	struct adapter *adap = qs->netdev->priv;
+	const struct port_info *pi = netdev_priv(qs->netdev);
+	struct adapter *adap = pi->adapter;
 	struct sge_rspq *q = &qs->rspq;
 
 	spin_lock(&q->lock);
@@ -2508,7 +2514,8 @@ static void sge_timer_cb(unsigned long data)
 {
 	spinlock_t *lock;
 	struct sge_qset *qs = (struct sge_qset *)data;
-	struct adapter *adap = qs->netdev->priv;
+	const struct port_info *pi = netdev_priv(qs->netdev);
+	struct adapter *adap = pi->adapter;
 
 	if (spin_trylock(&qs->txq[TXQ_ETH].lock)) {
 		reclaim_completed_tx(adap, &qs->txq[TXQ_ETH]);
diff --git a/drivers/net/cxgb3/t3_hw.c b/drivers/net/cxgb3/t3_hw.c
index dd3149d..b02d15d 100644
--- a/drivers/net/cxgb3/t3_hw.c
+++ b/drivers/net/cxgb3/t3_hw.c
@@ -848,16 +848,15 @@ static int t3_write_flash(struct adapter *adapter, unsigned int addr,
 }
 
 /**
- *	t3_check_tpsram_version - read the tp sram version
+ *	t3_get_tp_version - read the tp sram version
  *	@adapter: the adapter
+ *	@vers: where to place the version
  *
- *	Reads the protocol sram version from serial eeprom.
+ *	Reads the protocol sram version from sram.
  */
-int t3_check_tpsram_version(struct adapter *adapter)
+int t3_get_tp_version(struct adapter *adapter, u32 *vers)
 {
 	int ret;
-	u32 vers;
-	unsigned int major, minor;
 
 	/* Get version loaded in SRAM */
 	t3_write_reg(adapter, A_TP_EMBED_OP_FIELD0, 0);
@@ -866,7 +865,32 @@ int t3_check_tpsram_version(struct adapter *adapter)
 	if (ret)
 		return ret;
 	
-	vers = t3_read_reg(adapter, A_TP_EMBED_OP_FIELD1);
+	*vers = t3_read_reg(adapter, A_TP_EMBED_OP_FIELD1);
+
+	return 0;
+}
+
+/**
+ *	t3_check_tpsram_version - read the tp sram version
+ *	@adapter: the adapter
+ *	@must_load: set to 1 if loading a new microcode image is required
+ *
+ *	Reads the protocol sram version from flash.
+ */
+int t3_check_tpsram_version(struct adapter *adapter, int *must_load)
+{
+	int ret;
+	u32 vers;
+	unsigned int major, minor;
+
+	if (adapter->params.rev == T3_REV_A)
+		return 0;
+
+	*must_load = 1;
+
+	ret = t3_get_tp_version(adapter, &vers);
+	if (ret)
+		return ret;
 
 	major = G_TP_VERSION_MAJOR(vers);
 	minor = G_TP_VERSION_MINOR(vers);
@@ -874,6 +898,16 @@ int t3_check_tpsram_version(struct adapter *adapter)
 	if (major == TP_VERSION_MAJOR && minor == TP_VERSION_MINOR) 
 		return 0;
 
+	if (major != TP_VERSION_MAJOR)
+		CH_ERR(adapter, "found wrong TP version (%u.%u), "
+		       "driver needs version %d.%d\n", major, minor,
+		       TP_VERSION_MAJOR, TP_VERSION_MINOR);
+	else {
+		*must_load = 0;
+		CH_ERR(adapter, "found wrong TP version (%u.%u), "
+		       "driver compiled for version %d.%d\n", major, minor,
+		       TP_VERSION_MAJOR, TP_VERSION_MINOR);
+	}
 	return -EINVAL;
 }
 
diff --git a/drivers/net/cxgb3/t3cdev.h b/drivers/net/cxgb3/t3cdev.h
index fa4099b..77fcc1a 100644
--- a/drivers/net/cxgb3/t3cdev.h
+++ b/drivers/net/cxgb3/t3cdev.h
@@ -42,9 +42,6 @@
 
 #define T3CNAMSIZ 16
 
-/* Get the t3cdev associated with a net_device */
-#define T3CDEV(netdev) (struct t3cdev *)(netdev->priv)
-
 struct cxgb3_client;
 
 enum t3ctype {
diff --git a/drivers/net/ioc3-eth.c b/drivers/net/ioc3-eth.c
index 3ca1e8e..0834ef0 100644
--- a/drivers/net/ioc3-eth.c
+++ b/drivers/net/ioc3-eth.c
@@ -48,6 +48,7 @@
 #ifdef CONFIG_SERIAL_8250
 #include <linux/serial_core.h>
 #include <linux/serial_8250.h>
+#include <linux/serial_reg.h>
 #endif
 
 #include <linux/netdevice.h>
@@ -1151,13 +1152,41 @@ static int ioc3_is_menet(struct pci_dev *pdev)
  * Also look in ip27-pci.c:pci_fixup_ioc3() for some comments on working
  * around ioc3 oddities in this respect.
  *
- * The IOC3 serials use a 22MHz clock rate with an additional divider by 3.
+ * The IOC3 serials use a 22MHz clock rate with an additional divider which
+ * can be programmed in the SCR register if the DLAB bit is set.
+ *
+ * Register to interrupt zero because we share the interrupt with
+ * the serial driver which we don't properly support yet.
+ *
+ * Can't use UPF_IOREMAP as the whole of IOC3 resources have already been
+ * registered.
  */
+static void __devinit ioc3_8250_register(struct ioc3_uartregs __iomem *uart)
+{
+#define COSMISC_CONSTANT 6
+
+	struct uart_port port = {
+		.irq		= 0,
+		.flags		= UPF_SKIP_TEST | UPF_BOOT_AUTOCONF,
+		.iotype		= UPIO_MEM,
+		.regshift	= 0,
+		.uartclk	= (22000000 << 1) / COSMISC_CONSTANT,
+
+		.membase	= (unsigned char __iomem *) uart,
+		.mapbase	= (unsigned long) uart,
+	};
+	unsigned char lcr;
+
+	lcr = uart->iu_lcr;
+	uart->iu_lcr = lcr | UART_LCR_DLAB;
+	uart->iu_scr = COSMISC_CONSTANT,
+	uart->iu_lcr = lcr;
+	uart->iu_lcr;
+	serial8250_register_port(&port);
+}
 
 static void __devinit ioc3_serial_probe(struct pci_dev *pdev, struct ioc3 *ioc3)
 {
-	struct uart_port port;
-
 	/*
 	 * We need to recognice and treat the fourth MENET serial as it
 	 * does not have an SuperIO chip attached to it, therefore attempting
@@ -1171,24 +1200,35 @@ static void __devinit ioc3_serial_probe(struct pci_dev *pdev, struct ioc3 *ioc3)
 		return;
 
 	/*
-	 * Register to interrupt zero because we share the interrupt with
-	 * the serial driver which we don't properly support yet.
-	 *
-	 * Can't use UPF_IOREMAP as the whole of IOC3 resources have already
-	 * been registered.
+	 * Switch IOC3 to PIO mode.  It probably already was but let's be
+	 * paranoid
 	 */
-	memset(&port, 0, sizeof(port));
-	port.irq      = 0;
-	port.flags    = UPF_SKIP_TEST | UPF_BOOT_AUTOCONF;
-	port.iotype   = UPIO_MEM;
-	port.regshift = 0;
-	port.uartclk  = 22000000 / 3;
-
-	port.membase  = (unsigned char *) &ioc3->sregs.uarta;
-	serial8250_register_port(&port);
-
-	port.membase  = (unsigned char *) &ioc3->sregs.uartb;
-	serial8250_register_port(&port);
+	ioc3->gpcr_s = GPCR_UARTA_MODESEL | GPCR_UARTB_MODESEL;
+	ioc3->gpcr_s;
+	ioc3->gppr_6 = 0;
+	ioc3->gppr_6;
+	ioc3->gppr_7 = 0;
+	ioc3->gppr_7;
+	ioc3->sscr_a = ioc3->sscr_a & ~SSCR_DMA_EN;
+	ioc3->sscr_a;
+	ioc3->sscr_b = ioc3->sscr_b & ~SSCR_DMA_EN;
+	ioc3->sscr_b;
+	/* Disable all SA/B interrupts except for SA/B_INT in SIO_IEC. */
+	ioc3->sio_iec &= ~ (SIO_IR_SA_TX_MT | SIO_IR_SA_RX_FULL |
+			    SIO_IR_SA_RX_HIGH | SIO_IR_SA_RX_TIMER |
+			    SIO_IR_SA_DELTA_DCD | SIO_IR_SA_DELTA_CTS |
+			    SIO_IR_SA_TX_EXPLICIT | SIO_IR_SA_MEMERR);
+	ioc3->sio_iec |= SIO_IR_SA_INT;
+	ioc3->sscr_a = 0;
+	ioc3->sio_iec &= ~ (SIO_IR_SB_TX_MT | SIO_IR_SB_RX_FULL |
+			    SIO_IR_SB_RX_HIGH | SIO_IR_SB_RX_TIMER |
+			    SIO_IR_SB_DELTA_DCD | SIO_IR_SB_DELTA_CTS |
+			    SIO_IR_SB_TX_EXPLICIT | SIO_IR_SB_MEMERR);
+	ioc3->sio_iec |= SIO_IR_SB_INT;
+	ioc3->sscr_b = 0;
+
+	ioc3_8250_register(&ioc3->sregs.uarta);
+	ioc3_8250_register(&ioc3->sregs.uartb);
 }
 #endif
 
diff --git a/drivers/net/netxen/netxen_nic_hdr.h b/drivers/net/netxen/netxen_nic_hdr.h
index 3276866..d72f8f8 100644
--- a/drivers/net/netxen/netxen_nic_hdr.h
+++ b/drivers/net/netxen/netxen_nic_hdr.h
@@ -649,9 +649,11 @@ enum {
 #define PCIX_INT_VECTOR		(0x10100)
 #define PCIX_INT_MASK		(0x10104)
 
-#define PCIX_MN_WINDOW		(0x10200)
+#define PCIX_MN_WINDOW_F0	(0x10200)
+#define PCIX_MN_WINDOW(_f)	(PCIX_MN_WINDOW_F0 + (0x20 * (_f)))
 #define PCIX_MS_WINDOW		(0x10204)
-#define PCIX_SN_WINDOW		(0x10208)
+#define PCIX_SN_WINDOW_F0	(0x10208)
+#define PCIX_SN_WINDOW(_f)	(PCIX_SN_WINDOW_F0 + (0x20 * (_f)))
 #define PCIX_CRB_WINDOW		(0x10210)
 #define PCIX_CRB_WINDOW_F0	(0x10210)
 #define PCIX_CRB_WINDOW_F1	(0x10230)
diff --git a/drivers/net/netxen/netxen_nic_hw.c b/drivers/net/netxen/netxen_nic_hw.c
index aac1542..a7b8d7f 100644
--- a/drivers/net/netxen/netxen_nic_hw.c
+++ b/drivers/net/netxen/netxen_nic_hw.c
@@ -904,11 +904,11 @@ netxen_nic_pci_set_window(struct netxen_adapter *adapter,
 			ddr_mn_window = window;
 			writel(window, PCI_OFFSET_SECOND_RANGE(adapter,
 							       NETXEN_PCIX_PH_REG
-							       (PCIX_MN_WINDOW)));
+							       (PCIX_MN_WINDOW(adapter->ahw.pci_func))));
 			/* MUST make sure window is set before we forge on... */
 			readl(PCI_OFFSET_SECOND_RANGE(adapter,
 						      NETXEN_PCIX_PH_REG
-						      (PCIX_MN_WINDOW)));
+						      (PCIX_MN_WINDOW(adapter->ahw.pci_func))));
 		}
 		addr -= (window * NETXEN_WINDOW_ONE);
 		addr += NETXEN_PCI_DDR_NET;
@@ -929,11 +929,11 @@ netxen_nic_pci_set_window(struct netxen_adapter *adapter,
 			writel((window << 22),
 			       PCI_OFFSET_SECOND_RANGE(adapter,
 						       NETXEN_PCIX_PH_REG
-						       (PCIX_SN_WINDOW)));
+						       (PCIX_SN_WINDOW(adapter->ahw.pci_func))));
 			/* MUST make sure window is set before we forge on... */
 			readl(PCI_OFFSET_SECOND_RANGE(adapter,
 						      NETXEN_PCIX_PH_REG
-						      (PCIX_SN_WINDOW)));
+						      (PCIX_SN_WINDOW(adapter->ahw.pci_func))));
 		}
 		addr -= (window * 0x400000);
 		addr += NETXEN_PCI_QDR_NET;
diff --git a/drivers/net/netxen/netxen_nic_main.c b/drivers/net/netxen/netxen_nic_main.c
index 08a62ac..3122d01 100644
--- a/drivers/net/netxen/netxen_nic_main.c
+++ b/drivers/net/netxen/netxen_nic_main.c
@@ -639,10 +639,6 @@ netxen_nic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 			NETXEN_CRB_NORMALIZE(adapter,
 				NETXEN_ROMUSB_GLB_PEGTUNE_DONE));
 		/* Handshake with the card before we register the devices. */
-		writel(0, NETXEN_CRB_NORMALIZE(adapter, CRB_CMDPEG_STATE));
-		netxen_pinit_from_rom(adapter, 0);
-		msleep(1);
-		netxen_load_firmware(adapter);
 		netxen_phantom_init(adapter, NETXEN_NIC_PEG_TUNE);
 	}
 
@@ -750,9 +746,6 @@ static void __devexit netxen_nic_remove(struct pci_dev *pdev)
 
 	netxen_nic_disable_int(adapter);
 
-	if (adapter->irq)
-		free_irq(adapter->irq, adapter);
-
 	if (adapter->is_up == NETXEN_ADAPTER_UP_MAGIC) {
 		init_firmware_done++;
 		netxen_free_hw_resources(adapter);
@@ -776,13 +769,8 @@ static void __devexit netxen_nic_remove(struct pci_dev *pdev)
 		}
 	}
 
-	if (adapter->flags & NETXEN_NIC_MSI_ENABLED)
-		pci_disable_msi(pdev);
-
 	vfree(adapter->cmd_buf_arr);
 
-	pci_disable_device(pdev);
-
 	if (adapter->portnum == 0) {
 		if (init_firmware_done) {
 			i = 100;
@@ -833,12 +821,19 @@ static void __devexit netxen_nic_remove(struct pci_dev *pdev)
 		}
 	}
 
+	if (adapter->irq)
+		free_irq(adapter->irq, adapter);
+
+	if (adapter->flags & NETXEN_NIC_MSI_ENABLED)
+		pci_disable_msi(pdev);
+
 	iounmap(adapter->ahw.db_base);
 	iounmap(adapter->ahw.pci_base0);
 	iounmap(adapter->ahw.pci_base1);
 	iounmap(adapter->ahw.pci_base2);
 
 	pci_release_regions(pdev);
+	pci_disable_device(pdev);
 	pci_set_drvdata(pdev, NULL);
 
 	free_netdev(netdev);
diff --git a/drivers/net/ps3_gelic_net.c b/drivers/net/ps3_gelic_net.c
index 13d1c0a..e565039 100644
--- a/drivers/net/ps3_gelic_net.c
+++ b/drivers/net/ps3_gelic_net.c
@@ -556,7 +556,6 @@ static int gelic_net_stop(struct net_device *netdev)
 {
 	struct gelic_net_card *card = netdev_priv(netdev);
 
-	netif_poll_disable(netdev);
 	netif_stop_queue(netdev);
 
 	/* turn off DMA, force end */
diff --git a/drivers/s390/net/qeth.h b/drivers/s390/net/qeth.h
index ec18bae..6d49598 100644
--- a/drivers/s390/net/qeth.h
+++ b/drivers/s390/net/qeth.h
@@ -1178,9 +1178,9 @@ qeth_ipaddr_to_string(enum qeth_prot_versions proto, const __u8 *addr,
 		      char *buf)
 {
 	if (proto == QETH_PROT_IPV4)
-		return qeth_ipaddr4_to_string(addr, buf);
+		qeth_ipaddr4_to_string(addr, buf);
 	else if (proto == QETH_PROT_IPV6)
-		return qeth_ipaddr6_to_string(addr, buf);
+		qeth_ipaddr6_to_string(addr, buf);
 }
 
 static inline int
diff --git a/drivers/s390/net/qeth_main.c b/drivers/s390/net/qeth_main.c
index 57f6943..f3e6fbe 100644
--- a/drivers/s390/net/qeth_main.c
+++ b/drivers/s390/net/qeth_main.c
@@ -561,7 +561,7 @@ qeth_set_offline(struct ccwgroup_device *cgdev)
 }
 
 static int
-qeth_wait_for_threads(struct qeth_card *card, unsigned long threads);
+qeth_threads_running(struct qeth_card *card, unsigned long threads);
 
 
 static void
@@ -576,8 +576,7 @@ qeth_remove_device(struct ccwgroup_device *cgdev)
 	if (!card)
 		return;
 
-	if (qeth_wait_for_threads(card, 0xffffffff))
-		return;
+	wait_event(card->wait_q, qeth_threads_running(card, 0xffffffff) == 0);
 
 	if (cgdev->state == CCWGROUP_ONLINE){
 		card->use_hard_stop = 1;
@@ -1542,16 +1541,21 @@ qeth_idx_write_cb(struct qeth_channel *channel, struct qeth_cmd_buffer *iob)
 	card = CARD_FROM_CDEV(channel->ccwdev);
 
 	if (!(QETH_IS_IDX_ACT_POS_REPLY(iob->data))) {
-		PRINT_ERR("IDX_ACTIVATE on write channel device %s: negative "
-			  "reply\n", CARD_WDEV_ID(card));
+		if (QETH_IDX_ACT_CAUSE_CODE(iob->data) == 0x19)
+			PRINT_ERR("IDX_ACTIVATE on write channel device %s: "
+				"adapter exclusively used by another host\n",
+				CARD_WDEV_ID(card));
+		else
+			PRINT_ERR("IDX_ACTIVATE on write channel device %s: "
+				"negative reply\n", CARD_WDEV_ID(card));
 		goto out;
 	}
 	memcpy(&temp, QETH_IDX_ACT_FUNC_LEVEL(iob->data), 2);
 	if ((temp & ~0x0100) != qeth_peer_func_level(card->info.func_level)) {
 		PRINT_WARN("IDX_ACTIVATE on write channel device %s: "
-			   "function level mismatch "
-			   "(sent: 0x%x, received: 0x%x)\n",
-			   CARD_WDEV_ID(card), card->info.func_level, temp);
+			"function level mismatch "
+			"(sent: 0x%x, received: 0x%x)\n",
+			CARD_WDEV_ID(card), card->info.func_level, temp);
 		goto out;
 	}
 	channel->state = CH_STATE_UP;
@@ -1597,8 +1601,13 @@ qeth_idx_read_cb(struct qeth_channel *channel, struct qeth_cmd_buffer *iob)
 			goto out;
 	}
 	if (!(QETH_IS_IDX_ACT_POS_REPLY(iob->data))) {
-		PRINT_ERR("IDX_ACTIVATE on read channel device %s: negative "
-			  "reply\n", CARD_RDEV_ID(card));
+		if (QETH_IDX_ACT_CAUSE_CODE(iob->data) == 0x19)
+			PRINT_ERR("IDX_ACTIVATE on read channel device %s: "
+				"adapter exclusively used by another host\n",
+				CARD_RDEV_ID(card));
+		else
+			PRINT_ERR("IDX_ACTIVATE on read channel device %s: "
+				"negative reply\n", CARD_RDEV_ID(card));
 		goto out;
 	}
 
@@ -1613,8 +1622,8 @@ qeth_idx_read_cb(struct qeth_channel *channel, struct qeth_cmd_buffer *iob)
 	memcpy(&temp, QETH_IDX_ACT_FUNC_LEVEL(iob->data), 2);
 	if (temp != qeth_peer_func_level(card->info.func_level)) {
 		PRINT_WARN("IDX_ACTIVATE on read channel device %s: function "
-			   "level mismatch (sent: 0x%x, received: 0x%x)\n",
-			   CARD_RDEV_ID(card), card->info.func_level, temp);
+			"level mismatch (sent: 0x%x, received: 0x%x)\n",
+			CARD_RDEV_ID(card), card->info.func_level, temp);
 		goto out;
 	}
 	memcpy(&card->token.issuer_rm_r,
@@ -2496,7 +2505,7 @@ qeth_rebuild_skb_fake_ll_tr(struct qeth_card *card, struct sk_buff *skb,
 	struct iphdr *ip_hdr;
 
 	QETH_DBF_TEXT(trace,5,"skbfktr");
-	skb_set_mac_header(skb, -QETH_FAKE_LL_LEN_TR);
+	skb_set_mac_header(skb, (int)-QETH_FAKE_LL_LEN_TR);
 	/* this is a fake ethernet header */
 	fake_hdr = tr_hdr(skb);
 
@@ -2804,13 +2813,16 @@ qeth_queue_input_buffer(struct qeth_card *card, int index)
 		if (newcount < count) {
 			/* we are in memory shortage so we switch back to
 			   traditional skb allocation and drop packages */
-			if (atomic_cmpxchg(&card->force_alloc_skb, 0, 1))
-				printk(KERN_WARNING
-					"qeth: switch to alloc skb\n");
+			if (!atomic_read(&card->force_alloc_skb) &&
+			    net_ratelimit())
+				PRINT_WARN("Switch to alloc skb\n");
+			atomic_set(&card->force_alloc_skb, 3);
 			count = newcount;
 		} else {
-			if (atomic_cmpxchg(&card->force_alloc_skb, 1, 0))
-				printk(KERN_WARNING "qeth: switch to sg\n");
+			if ((atomic_read(&card->force_alloc_skb) == 1) &&
+			    net_ratelimit())
+				PRINT_WARN("Switch to sg\n");
+			atomic_add_unless(&card->force_alloc_skb, -1, 0);
 		}
 
 		/*
@@ -3354,10 +3366,12 @@ out_freeoutq:
 	while (i > 0)
 		kfree(card->qdio.out_qs[--i]);
 	kfree(card->qdio.out_qs);
+	card->qdio.out_qs = NULL;
 out_freepool:
 	qeth_free_buffer_pool(card);
 out_freeinq:
 	kfree(card->qdio.in_q);
+	card->qdio.in_q = NULL;
 out_nomem:
 	atomic_set(&card->qdio.state, QETH_QDIO_UNINITIALIZED);
 	return -ENOMEM;
@@ -3373,16 +3387,20 @@ qeth_free_qdio_buffers(struct qeth_card *card)
 		QETH_QDIO_UNINITIALIZED)
 		return;
 	kfree(card->qdio.in_q);
+	card->qdio.in_q = NULL;
 	/* inbound buffer pool */
 	qeth_free_buffer_pool(card);
 	/* free outbound qdio_qs */
-	for (i = 0; i < card->qdio.no_out_queues; ++i){
-		for (j = 0; j < QDIO_MAX_BUFFERS_PER_Q; ++j)
-			qeth_clear_output_buffer(card->qdio.out_qs[i],
-					&card->qdio.out_qs[i]->bufs[j]);
-		kfree(card->qdio.out_qs[i]);
+	if (card->qdio.out_qs) {
+		for (i = 0; i < card->qdio.no_out_queues; ++i) {
+			for (j = 0; j < QDIO_MAX_BUFFERS_PER_Q; ++j)
+				qeth_clear_output_buffer(card->qdio.out_qs[i],
+						&card->qdio.out_qs[i]->bufs[j]);
+			kfree(card->qdio.out_qs[i]);
+		}
+		kfree(card->qdio.out_qs);
+		card->qdio.out_qs = NULL;
 	}
-	kfree(card->qdio.out_qs);
 }
 
 static void
@@ -3393,7 +3411,7 @@ qeth_clear_qdio_buffers(struct qeth_card *card)
 	QETH_DBF_TEXT(trace, 2, "clearqdbf");
 	/* clear outbound buffers to free skbs */
 	for (i = 0; i < card->qdio.no_out_queues; ++i)
-		if (card->qdio.out_qs[i]){
+		if (card->qdio.out_qs && card->qdio.out_qs[i]) {
 			for (j = 0; j < QDIO_MAX_BUFFERS_PER_Q; ++j)
 				qeth_clear_output_buffer(card->qdio.out_qs[i],
 						&card->qdio.out_qs[i]->bufs[j]);
@@ -4553,6 +4571,53 @@ qeth_get_elements_no(struct qeth_card *card, void *hdr,
         return elements_needed;
 }
 
+static void qeth_tx_csum(struct sk_buff *skb)
+{
+	int tlen;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		tlen = ntohs(ip_hdr(skb)->tot_len) - (ip_hdr(skb)->ihl << 2);
+		switch (ip_hdr(skb)->protocol) {
+		case IPPROTO_TCP:
+			tcp_hdr(skb)->check = 0;
+			tcp_hdr(skb)->check = csum_tcpudp_magic(
+				ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
+				tlen, ip_hdr(skb)->protocol,
+				skb_checksum(skb, skb_transport_offset(skb),
+					tlen, 0));
+			break;
+		case IPPROTO_UDP:
+			udp_hdr(skb)->check = 0;
+			udp_hdr(skb)->check = csum_tcpudp_magic(
+				ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
+				tlen, ip_hdr(skb)->protocol,
+				skb_checksum(skb, skb_transport_offset(skb),
+					tlen, 0));
+			break;
+		}
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		switch (ipv6_hdr(skb)->nexthdr) {
+		case IPPROTO_TCP:
+			tcp_hdr(skb)->check = 0;
+			tcp_hdr(skb)->check = csum_ipv6_magic(
+				&ipv6_hdr(skb)->saddr, &ipv6_hdr(skb)->daddr,
+				ipv6_hdr(skb)->payload_len,
+				ipv6_hdr(skb)->nexthdr,
+				skb_checksum(skb, skb_transport_offset(skb),
+					ipv6_hdr(skb)->payload_len, 0));
+			break;
+		case IPPROTO_UDP:
+			udp_hdr(skb)->check = 0;
+			udp_hdr(skb)->check = csum_ipv6_magic(
+				&ipv6_hdr(skb)->saddr, &ipv6_hdr(skb)->daddr,
+				ipv6_hdr(skb)->payload_len,
+				ipv6_hdr(skb)->nexthdr,
+				skb_checksum(skb, skb_transport_offset(skb),
+					ipv6_hdr(skb)->payload_len, 0));
+			break;
+		}
+	}
+}
 
 static int
 qeth_send_packet(struct qeth_card *card, struct sk_buff *skb)
@@ -4638,12 +4703,22 @@ qeth_send_packet(struct qeth_card *card, struct sk_buff *skb)
 		elements_needed += elems;
 	}
 
+	if ((large_send == QETH_LARGE_SEND_NO) &&
+	    (skb->ip_summed == CHECKSUM_PARTIAL))
+		qeth_tx_csum(new_skb);
+
 	if (card->info.type != QETH_CARD_TYPE_IQD)
 		rc = qeth_do_send_packet(card, queue, new_skb, hdr,
 					 elements_needed, ctx);
-	else
+	else {
+		if ((skb->protocol == htons(ETH_P_ARP)) &&
+		    (card->dev->flags & IFF_NOARP)) {
+			__qeth_free_new_skb(skb, new_skb);
+			return -EPERM;
+		}
 		rc = qeth_do_send_packet_fast(card, queue, new_skb, hdr,
 					      elements_needed, ctx);
+	}
 	if (!rc) {
 		card->stats.tx_packets++;
 		card->stats.tx_bytes += tx_bytes;
@@ -6385,20 +6460,18 @@ qeth_deregister_addr_entry(struct qeth_card *card, struct qeth_ipaddr *addr)
 static u32
 qeth_ethtool_get_tx_csum(struct net_device *dev)
 {
-	/* We may need to say that we support tx csum offload if
-	 * we do EDDP or TSO. There are discussions going on to
-	 * enforce rules in the stack and in ethtool that make
-	 * SG and TSO depend on HW_CSUM. At the moment there are
-	 * no such rules....
-	 * If we say yes here, we have to checksum outbound packets
-	 * any time. */
-	return 0;
+	return (dev->features & NETIF_F_HW_CSUM) != 0;
 }
 
 static int
 qeth_ethtool_set_tx_csum(struct net_device *dev, u32 data)
 {
-	return -EINVAL;
+	if (data)
+		dev->features |= NETIF_F_HW_CSUM;
+	else
+		dev->features &= ~NETIF_F_HW_CSUM;
+
+	return 0;
 }
 
 static u32
@@ -7412,7 +7485,8 @@ qeth_start_ipa_tso(struct qeth_card *card)
 	}
 	if (rc && (card->options.large_send == QETH_LARGE_SEND_TSO)){
 		card->options.large_send = QETH_LARGE_SEND_NO;
-		card->dev->features &= ~ (NETIF_F_TSO | NETIF_F_SG);
+		card->dev->features &= ~(NETIF_F_TSO | NETIF_F_SG |
+						NETIF_F_HW_CSUM);
 	}
 	return rc;
 }
@@ -7552,22 +7626,26 @@ qeth_set_large_send(struct qeth_card *card, enum qeth_large_send_types type)
 	card->options.large_send = type;
 	switch (card->options.large_send) {
 	case QETH_LARGE_SEND_EDDP:
-		card->dev->features |= NETIF_F_TSO | NETIF_F_SG;
+		card->dev->features |= NETIF_F_TSO | NETIF_F_SG |
+					NETIF_F_HW_CSUM;
 		break;
 	case QETH_LARGE_SEND_TSO:
 		if (qeth_is_supported(card, IPA_OUTBOUND_TSO)){
-			card->dev->features |= NETIF_F_TSO | NETIF_F_SG;
+			card->dev->features |= NETIF_F_TSO | NETIF_F_SG |
+						NETIF_F_HW_CSUM;
 		} else {
 			PRINT_WARN("TSO not supported on %s. "
 				   "large_send set to 'no'.\n",
 				   card->dev->name);
-			card->dev->features &= ~(NETIF_F_TSO | NETIF_F_SG);
+			card->dev->features &= ~(NETIF_F_TSO | NETIF_F_SG |
+						NETIF_F_HW_CSUM);
 			card->options.large_send = QETH_LARGE_SEND_NO;
 			rc = -EOPNOTSUPP;
 		}
 		break;
 	default: /* includes QETH_LARGE_SEND_NO */
-		card->dev->features &= ~(NETIF_F_TSO | NETIF_F_SG);
+		card->dev->features &= ~(NETIF_F_TSO | NETIF_F_SG |
+					NETIF_F_HW_CSUM);
 		break;
 	}
 	if (card->state == CARD_STATE_UP)
diff --git a/drivers/s390/net/qeth_mpc.h b/drivers/s390/net/qeth_mpc.h
index 1d8083c..6de2da5 100644
--- a/drivers/s390/net/qeth_mpc.h
+++ b/drivers/s390/net/qeth_mpc.h
@@ -565,6 +565,7 @@ extern unsigned char IDX_ACTIVATE_WRITE[];
 #define QETH_IDX_ACT_QDIO_DEV_REALADDR(buffer) (buffer+0x20)
 #define QETH_IS_IDX_ACT_POS_REPLY(buffer) (((buffer)[0x08]&3)==2)
 #define QETH_IDX_REPLY_LEVEL(buffer) (buffer+0x12)
+#define QETH_IDX_ACT_CAUSE_CODE(buffer) (buffer)[0x09]
 
 #define PDU_ENCAPSULATION(buffer) \
 	(buffer + *(buffer + (*(buffer+0x0b)) + \
diff --git a/drivers/s390/net/qeth_sys.c b/drivers/s390/net/qeth_sys.c
index bb0287a..2cc3f3a 100644
--- a/drivers/s390/net/qeth_sys.c
+++ b/drivers/s390/net/qeth_sys.c
@@ -1760,10 +1760,10 @@ qeth_remove_device_attributes(struct device *dev)
 {
 	struct qeth_card *card = dev->driver_data;
 
-	if (card->info.type == QETH_CARD_TYPE_OSN)
-		return sysfs_remove_group(&dev->kobj,
-					  &qeth_osn_device_attr_group);
-
+	if (card->info.type == QETH_CARD_TYPE_OSN) {
+		sysfs_remove_group(&dev->kobj, &qeth_osn_device_attr_group);
+		return;
+	}
 	sysfs_remove_group(&dev->kobj, &qeth_device_attr_group);
 	sysfs_remove_group(&dev->kobj, &qeth_device_ipato_group);
 	sysfs_remove_group(&dev->kobj, &qeth_device_vipa_group);

^ permalink raw reply related

* Re: [PATCH] make _minimum_ TCP retransmission timeout configurable take 2
From: David Miller @ 2007-08-31 18:57 UTC (permalink / raw)
  To: rick.jones2; +Cc: netdev
In-Reply-To: <46D859D9.6030407@hp.com>

From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 31 Aug 2007 11:11:37 -0700

> At the risk of showing my ignorance (what me worry about that?-) I 
> presume this is then an interface expecting to take-in jiffies?  That 
> means the user has to know the value of HZ which can be (IIRC) one of 
> three different values?

The iproute2 changes might look something like this:

--- ./include/linux/rtnetlink.h.orig	2007-08-31 11:55:30.000000000 -0700
+++ ./include/linux/rtnetlink.h	2007-08-31 11:52:22.000000000 -0700
@@ -351,6 +351,8 @@ enum
 #define RTAX_INITCWND RTAX_INITCWND
 	RTAX_FEATURES,
 #define RTAX_FEATURES RTAX_FEATURES
+	RTAX_RTO_MIN,
+#define RTAX_RTO_MIN RTAX_RTO_MIN
 	__RTAX_MAX
 };
 
--- ./ip/iproute.c.orig	2007-08-31 11:55:30.000000000 -0700
+++ ./ip/iproute.c	2007-08-31 11:53:29.000000000 -0700
@@ -51,6 +51,7 @@ static const char *mx_names[RTAX_MAX+1] 
 	[RTAX_HOPLIMIT] = "hoplimit",
 	[RTAX_INITCWND] = "initcwnd",
 	[RTAX_FEATURES] = "features",
+	[RTAX_RTO_MIN]	= "rto_min",
 };
 static void usage(void) __attribute__((noreturn));
 
@@ -74,6 +75,7 @@ static void usage(void)
 	fprintf(stderr, "           [ rtt NUMBER ] [ rttvar NUMBER ]\n");
 	fprintf(stderr, "           [ window NUMBER] [ cwnd NUMBER ] [ initcwnd NUMBER ]\n");
 	fprintf(stderr, "           [ ssthresh NUMBER ] [ realms REALM ]\n");
+	fprintf(stderr, "           [ rto_min NUMBER ]\n");
 	fprintf(stderr, "TYPE := [ unicast | local | broadcast | multicast | throw |\n");
 	fprintf(stderr, "          unreachable | prohibit | blackhole | nat ]\n");
 	fprintf(stderr, "TABLE_ID := [ local | main | default | all | NUMBER ]\n");
@@ -520,7 +522,8 @@ int print_route(const struct sockaddr_nl
 			if (mxlock & (1<<i))
 				fprintf(fp, " lock");
 
-			if (i != RTAX_RTT && i != RTAX_RTTVAR)
+			if (i != RTAX_RTT && i != RTAX_RTTVAR &&
+			    i != RTAX_RTO_MIN)
 				fprintf(fp, " %u", *(unsigned*)RTA_DATA(mxrta[i]));
 			else {
 				unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
@@ -528,7 +531,7 @@ int print_route(const struct sockaddr_nl
 				val *= 1000;
 				if (i == RTAX_RTT)
 					val /= 8;
-				else
+				else if (i == RTAX_RTTVAR)
 					val /= 4;
 				if (val >= hz)
 					fprintf(fp, " %ums", val/hz);
@@ -803,6 +806,15 @@ int iproute_modify(int cmd, unsigned fla
 			if (get_unsigned(&rtt, *argv, 0))
 				invarg("\"rtt\" value is invalid\n", *argv);
 			rta_addattr32(mxrta, sizeof(mxbuf), RTAX_RTT, rtt);
+		} else if (strcmp(*argv, "rto_min") == 0) {
+			unsigned rto_min;
+			NEXT_ARG();
+			mxlock |= (1<<RTAX_RTO_MIN);
+			if (get_unsigned(&rto_min, *argv, 0))
+				invarg("\"rto_min\" value is invalid\n",
+				       *argv);
+			rta_addattr32(mxrta, sizeof(mxbuf), RTAX_RTO_MIN,
+				      rto_min);
 		} else if (matches(*argv, "window") == 0) {
 			unsigned win;
 			NEXT_ARG();

^ permalink raw reply

* Re: net-2.6.24 rebased
From: David Miller @ 2007-08-31 18:58 UTC (permalink / raw)
  To: linville; +Cc: johannes, joe, netdev
In-Reply-To: <20070831134432.GA3352@tuxdriver.com>

From: "John W. Linville" <linville@tuxdriver.com>
Date: Fri, 31 Aug 2007 09:44:32 -0400

> Sorry, I sent the patch through Jeff.
> 
> 	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/upstream-jgarzik/0010-rtl8187-remove-IEEE80211_HW_DATA_NULLFUNC_ACK.patch
> 
> But, it sounds like you have taken care of it...?
> 
> Let me know if I need to fix this.

Thanks for tracking this down, that is exactly the change
I checked into 2.6.24 so everything is good now.

^ permalink raw reply

* Re: [PATCH] net/, drivers/net/ , missing EXPERIMENTAL in menus
From: Jan Engelhardt @ 2007-08-31 19:29 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Robert P. J. Day, Randy Dunlap, Simon Arlott, sam,
	Linux Kernel Mailing List, Stefan Richter, Adrian Bunk, Gabriel C,
	netdev
In-Reply-To: <46D858A9.4080506@garzik.org>


On Aug 31 2007 14:06, Jeff Garzik wrote:
>> something like BROKEN, though, has *nothing* to do with maturity.  a
>> feature can be any of those maturity levels, and simultaneously be
>> BROKEN.  i consider BROKEN to be what i call a "status", and different
>> status levels might be the default of normal, or KIND_OF_FLAKY or
>> TOTALLY_BORKED -- that's where BROKEN would fit in.
>
> BROKEN is definitely a maturity level.  A more accurate description would be
> BITROTTING perhaps.  The code in question has passed through bleeding ->
> experimental -> stable, and come out the other side.

Software is quite unlike the art of medicine. There, you would go from
'stable' to 'broken', then someone 'experiments' with you, and if
it's a bad day, you bleed out. :)


	Jan
-- 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox