* [BUG] ISIC + 2.6.22 (via-rhine)
From: Xose Vazquez Perez @ 2007-07-31 21:11 UTC
To: linux-kernel; +Cc: netdev
hi,
While running ISIC -- IP Stack Integrity Checker ( http://isic.sf.net ) --
on Fedora-7-i386 with 2.6.22, the NIC stopped sending packets.
One second later it began sending them again.
dmesg shows the bug.
command is:
# tcpsic -s rand -d 172.26.0.2 -I100
driver is:
via-rhine.c:v1.10-LK1.4.3 2007-03-06 Written by Donald Becker
eth0: VIA Rhine II at 0xbc000000, 00:11:d8:54:e9:3c, IRQ 19.
eth0: MII PHY found at address 1, status 0x786d advertising 01e1 Link 45e1.
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
eth0: no IPv6 routers present
---lspci--
00:12.0 0200: 1106:3065 (rev 78)
Subsystem: 1043:80ed
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping+ SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 32 (750ns min, 2000ns max), Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 19
Region 0: I/O ports at 7000 [size=256]
Region 1: Memory at bc000000 (32-bit, non-prefetchable) [size=256]
Capabilities: [40] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
--end--
--dmesg output--
[...]
NETDEV WATCHDOG: eth0: transmit timed out
eth0: Transmit timed out, status 0000, PHY status 786d, resetting...
=================================
[ INFO: inconsistent lock state ]
2.6.22 #1
---------------------------------
inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(&rp->lock){++..}, at: [<f8c890db>] rhine_tx_timeout+0x6f/0xf4 [via_rhine]
{in-hardirq-W} state was registered at:
[<c04440d4>] __lock_acquire+0x38c/0xb12
[<c0444c1b>] lock_acquire+0x56/0x6f
[<c0615279>] _spin_lock+0x2b/0x38
[<f8c87b49>] rhine_interrupt+0x16b/0x69b [via_rhine]
[<c045aac6>] handle_IRQ_event+0x1a/0x46
[<c045bbc0>] handle_fasteoi_irq+0x7d/0xb6
[<c0407089>] do_IRQ+0xb1/0xd8
[<ffffffff>] 0xffffffff
irq event stamp: 18052892
hardirqs last enabled at (18052892): [<c061567d>] _spin_unlock_irqrestore+0x36/0x3c
hardirqs last disabled at (18052891): [<c061558d>] _spin_lock_irqsave+0x12/0x44
softirqs last enabled at (18052876): [<c042d272>] __do_softirq+0xe3/0xe9
softirqs last disabled at (18052887): [<c0406f72>] do_softirq+0x61/0xc7
other info that might help us debug this:
1 lock held by swapper/0:
#0: (_xmit_ETHER){-+..}, at: [<c05c042a>] dev_watchdog+0x14/0xbf
stack backtrace:
[<c0405e6a>] show_trace_log_lvl+0x1a/0x2f
[<c04068cf>] show_trace+0x12/0x14
[<c0406928>] dump_stack+0x16/0x18
[<c0442ccd>] print_usage_bug+0x141/0x14b
[<c04434fd>] mark_lock+0x1fd/0x409
[<c0444144>] __lock_acquire+0x3fc/0xb12
[<c0444c1b>] lock_acquire+0x56/0x6f
[<c0615279>] _spin_lock+0x2b/0x38
[<f8c890db>] rhine_tx_timeout+0x6f/0xf4 [via_rhine]
[<c05c048c>] dev_watchdog+0x76/0xbf
[<c04303be>] run_timer_softirq+0x11a/0x182
[<c042d1fe>] __do_softirq+0x6f/0xe9
[<c0406f72>] do_softirq+0x61/0xc7
=======================
via-rhine: Reset not complete yet. Trying harder.
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
--end---
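The report above says that `rp->lock` was first seen being taken in hard-IRQ context (`rhine_interrupt`) and is now taken by `rhine_tx_timeout` from the timer softirq with hard IRQs still enabled, so the interrupt handler could run on the same CPU while the lock is held and spin on it. The following is a minimal user-space sketch of that consistency rule, not lockdep's actual implementation; `lock_class`, `note_acquire`, and the two flag bits are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical usage bits, loosely modelled on the states named in the report. */
enum {
    USED_IN_HARDIRQ   = 1u << 0, /* lock was taken from a hard-IRQ handler */
    TAKEN_HARDIRQS_ON = 1u << 1  /* lock was taken with hard IRQs enabled  */
};

struct lock_class {
    const char *name;
    unsigned int usage;
};

/*
 * Record one acquisition and return true while the accumulated usage is
 * still consistent. The two bits must never both be set: a lock that is
 * taken in hard-IRQ context must be taken with hard IRQs disabled
 * everywhere else.
 */
static bool note_acquire(struct lock_class *lc, bool in_hardirq,
                         bool hardirqs_enabled)
{
    const unsigned int both = USED_IN_HARDIRQ | TAKEN_HARDIRQS_ON;

    assert(lc != NULL);

    if (in_hardirq)
        lc->usage |= USED_IN_HARDIRQ;
    else if (hardirqs_enabled)
        lc->usage |= TAKEN_HARDIRQS_ON;

    return (lc->usage & both) != both;
}
```

In terms of the trace, the first call would correspond to `rhine_interrupt` (`in_hardirq` true) and the second to `rhine_tx_timeout` (`hardirqs_enabled` true), at which point the function returns false. Taking the lock with interrupts disabled in the timeout path, for example with `spin_lock_irq()`, is the usual way to satisfy this rule, though whether that is the right fix for via-rhine is not established in this report.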
-thanks-
--
Shitty politicians, I am not a terrorist.
* Re: [PATCH 39] net/ipv4/ip_options.c: kmalloc + memset conversion to kzalloc
From: David Miller @ 2007-07-31 21:07 UTC
To: m.kozlowski; +Cc: linux-kernel, kernel-janitors, akpm, netdev
In-Reply-To: <200707312016.59970.m.kozlowski@tuxland.pl>
From: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Date: Tue, 31 Jul 2007 20:16:59 +0200
> Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Applied, but note that this patch changes behavior: previously
only the base ip_options structure was cleared, but now
the whole memory region is cleared.
I think it's OK for this case, but I'm just making note of it.
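The behavior change noted above can be illustrated in user space. This is only a sketch under stated assumptions: `struct opts` is a hypothetical stand-in for a fixed header followed by variable-length option bytes, `malloc`/`calloc` stand in for `kmalloc`/`kzalloc`, and the 0xAA poison fill simulates whatever an uninitialized allocation might contain so that the result is deterministic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: a fixed base followed by option bytes. */
struct opts {
    unsigned char optlen;
    unsigned char data[]; /* variable-length tail */
};

/* Old pattern: allocate, then clear only the fixed-size base. */
static struct opts *alloc_clear_base(size_t optlen)
{
    struct opts *o;

    assert(optlen <= 40); /* IPv4 options never exceed 40 bytes */
    o = malloc(sizeof(*o) + optlen);
    if (!o)
        return NULL;
    /* Poison so the demo is deterministic; malloc guarantees nothing. */
    memset(o, 0xAA, sizeof(*o) + optlen);
    memset(o, 0, sizeof(*o));
    return o;
}

/* New pattern: the whole region, tail included, comes back zeroed. */
static struct opts *alloc_clear_all(size_t optlen)
{
    assert(optlen <= 40);
    return calloc(1, sizeof(struct opts) + optlen);
}
```

With the old pattern the tail keeps its previous contents until the caller fills it in; with the new one it starts out as zeros. The extra clearing costs a few more bytes of memset, which matches the "OK for this case" assessment.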
* Re: [PATCH 20] net/decnet/dn_route.c: kmalloc + memset conversion to kzalloc
From: David Miller @ 2007-07-31 21:06 UTC
To: m.kozlowski; +Cc: linux-kernel, kernel-janitors, akpm, netdev, patrick
In-Reply-To: <200707311933.33745.m.kozlowski@tuxland.pl>
From: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Date: Tue, 31 Jul 2007 19:33:33 +0200
> Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
Patch applied, thanks.
* Re: [RESEND][PATCH 1/3] PPPoE: improved hashing routine
From: David Miller @ 2007-07-31 20:51 UTC
To: florz; +Cc: mostrows, netdev
In-Reply-To: <20070731110547.GC12071@florz.florz.dyndns.org>
From: Florian Zumbiehl <florz@florz.de>
Date: Tue, 31 Jul 2007 13:05:47 +0200
> A few variations I tried back when I created the patch, using larger
> things than a char for accumulating the pieces and then folding down
> from that, turned out to be slower than what I finally submitted, at
> least on the machines I tested it on. I didn't do comprehensive testing,
> as it really doesn't matter, after all, but I think the version I
> submitted is pretty fast, plus it's quite readable, and it keeps the
> flexibility of different hash sizes, but still should allow the
> compiler to optimize away the loops that allow for this flexibility.
Therefore, I've put your original patch into the tree :-)
Thanks.
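The patch itself is not quoted in this reply, so the following is only a sketch of the kind of routine the quoted paragraph describes: bytes are XOR-accumulated into a single `unsigned char`, which is then folded down to a configurable number of hash bits. `HASH_BITS`, `hash_item`, and the choice of a 6-byte address plus a 16-bit session ID as inputs are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_LEN  6                 /* Ethernet address length */
#define HASH_BITS 4                 /* assumed; intended values are 1, 2, 4, 8 */
#define HASH_SIZE (1u << HASH_BITS)
#define HASH_MASK (HASH_SIZE - 1)

static unsigned int hash_item(uint16_t sid, const unsigned char *addr)
{
    unsigned char hash = 0;
    unsigned int i;

    assert(addr != NULL);

    /* Accumulate every address byte into one char. */
    for (i = 0; i < ADDR_LEN; i++)
        hash ^= addr[i];

    /* Mix in the session ID one byte at a time. */
    for (i = 0; i < sizeof(sid) * 8; i += 8)
        hash ^= (unsigned char)(sid >> i);

    /* Fold the 8-bit value in halves until it fits in HASH_BITS bits. */
    for (i = 8; (i >>= 1) >= HASH_BITS;)
        hash ^= hash >> i;

    return hash & HASH_MASK;
}
```

Because `HASH_BITS` is a compile-time constant, a compiler can unroll or drop the folding loop entirely, which is the flexibility-without-cost property the quoted text mentions.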
* [RFC][PATCH] Get rid of dead code in net/ipv4/fib_semantics.c
From: Michal Piotrowski @ 2007-07-31 20:36 UTC
To: Andrew Morton, Alexey Kuznetsov, Netdev, LKML
Hi,
File /home/devel/linux-rdc/net/ipv4/fib_semantics.c line 525
Unknown CONFIG option! CONFIG_IP_ROUTE_PERVASIVE
Regards,
Michal
--
LOG
http://www.stardust.webpages.pl/log/
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
--- linux-rdc-clean/net/ipv4/fib_semantics.c 2007-07-31 17:17:31.000000000 +0200
+++ linux-rdc/net/ipv4/fib_semantics.c 2007-07-31 18:22:54.000000000 +0200
@@ -522,10 +522,6 @@ static int fib_check_nh(struct fib_confi
if (nh->nh_gw) {
struct fib_result res;
-#ifdef CONFIG_IP_ROUTE_PERVASIVE
- if (nh->nh_flags&RTNH_F_PERVASIVE)
- return 0;
-#endif
if (nh->nh_flags&RTNH_F_ONLINK) {
struct net_device *dev;
* [RFC][PATCH] Get rid of dead code in net/wanrouter/wanmain.c
From: Michal Piotrowski @ 2007-07-31 20:36 UTC
To: Andrew Morton, Netdev, LKML
Hi,
File /home/devel/linux-rdc/net/wanrouter/wanmain.c line 569
Unknown CONFIG option! CONFIG_WANPIPE_MULTPPP
File /home/devel/linux-rdc/net/wanrouter/wanmain.c line 590
Unknown CONFIG option! CONFIG_WANPIPE_MULTPPP
File /home/devel/linux-rdc/net/wanrouter/wanmain.c line 663
Unknown CONFIG option! CONFIG_WANPIPE_MULTPPP
Regards,
Michal
--
LOG
http://www.stardust.webpages.pl/log/
Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
--- linux-rdc-clean/net/wanrouter/wanmain.c 2007-07-09 01:32:17.000000000 +0200
+++ linux-rdc/net/wanrouter/wanmain.c 2007-07-31 18:04:58.000000000 +0200
@@ -566,9 +566,6 @@ static int wanrouter_device_new_if(struc
{
wanif_conf_t *cnf;
struct net_device *dev = NULL;
-#ifdef CONFIG_WANPIPE_MULTPPP
- struct ppp_device *pppdev=NULL;
-#endif
int err;
if ((wandev->state == WAN_UNCONFIGURED) || (wandev->new_if == NULL))
@@ -587,25 +584,10 @@ static int wanrouter_device_new_if(struc
goto out;
if (cnf->config_id == WANCONFIG_MPPP) {
-#ifdef CONFIG_WANPIPE_MULTPPP
- pppdev = kzalloc(sizeof(struct ppp_device), GFP_KERNEL);
- err = -ENOBUFS;
- if (pppdev == NULL)
- goto out;
- pppdev->dev = kzalloc(sizeof(struct net_device), GFP_KERNEL);
- if (pppdev->dev == NULL) {
- kfree(pppdev);
- err = -ENOBUFS;
- goto out;
- }
- err = wandev->new_if(wandev, (struct net_device *)pppdev, cnf);
- dev = pppdev->dev;
-#else
printk(KERN_INFO "%s: Wanpipe Mulit-Port PPP support has not been compiled in!\n",
wandev->name);
err = -EPROTONOSUPPORT;
goto out;
-#endif
} else {
dev = kzalloc(sizeof(struct net_device), GFP_KERNEL);
err = -ENOBUFS;
@@ -660,16 +642,9 @@ static int wanrouter_device_new_if(struc
kfree(dev->priv);
dev->priv = NULL;
-#ifdef CONFIG_WANPIPE_MULTPPP
- if (cnf->config_id == WANCONFIG_MPPP)
- kfree(pppdev);
- else
- kfree(dev);
-#else
/* Sync PPP is disabled */
if (cnf->config_id != WANCONFIG_MPPP)
kfree(dev);
-#endif
out:
kfree(cnf);
* Re: [Bugme-new] [Bug 8754] New: Kernel addrconf modifies MTU of non-kernel routes
From: Simon Arlott @ 2007-07-31 20:32 UTC
To: netdev; +Cc: bugme-daemon
In-Reply-To: <4699E915.5040904@simon.arlott.org.uk>
On 15/07/07 10:29, Simon Arlott wrote:
> On 14/07/07 23:09, Andrew Morton wrote:
>> On Sat, 14 Jul 2007 14:54:32 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=8754
>>>
>>> I have an MTU of 16110 set on eth0 on a network where the MTU is 1500 as set by
>>> RAs. One of the other hosts on the network has an MRU/MTU of 7200 so I have a
>>> specific route to it with this MTU.
>>>
>>> If I add the route early (i.e. on startup) before address autoconfiguration
>>> takes place, when the first RA is received the kernel changes the MTU on my
>>> route - this should not happen.
>
> This also happens whenever I change the MTU on eth0 - it will alter the
> MTU on routes *I* have added too. While this is valid behaviour for a
> new MTU that is too low for the route it is not for an MTU above the route.
>
> Changing the MTU also allows the "next RA with MTU set changes
> non-kernel routes too" problem to occur again.
The problem seems to be that the IPv6 code maintains its own MTU for each
interface, which is set from RAs (router advertisements) and whenever the
interface MTU is set (it's also improperly modifiable via sysctl when it
shouldn't be, but that's another bug), and it uses that value to limit the
MTU of every route.
I propose that it should use the real interface MTU as the limit for non-kernel
routes and the RA MTU for kernel routes.
Since IPv6 routes (appear to) always have an MTU (IPv4 routes don't and hence
inherit from the interface) this would have the side effect that a user-added
route's automatically set MTU would not be lowered by the RA MTU.
New user IPv6 routes without an explicit MTU should not have one set and use
the RA MTU automatically.
Is this ok? I'll send a patch to do this some time this week when I get around
to it.
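The proposal above can be sketched as a small selection rule. This is not the promised patch and not kernel code; `route_mtu_limit` and `clamp_route_mtu` are hypothetical helpers, and routes are reduced to a single "kernel route or not" flag for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Pick the ceiling for a route's MTU: kernel-created routes follow the
 * RA-advertised MTU, user-added routes follow the real interface MTU.
 */
static unsigned int route_mtu_limit(bool kernel_route, unsigned int dev_mtu,
                                    unsigned int ra_mtu)
{
    assert(dev_mtu > 0 && ra_mtu > 0);
    return kernel_route ? ra_mtu : dev_mtu;
}

/* Lower a route's MTU only if it exceeds the applicable ceiling. */
static unsigned int clamp_route_mtu(unsigned int route_mtu, bool kernel_route,
                                    unsigned int dev_mtu, unsigned int ra_mtu)
{
    unsigned int limit = route_mtu_limit(kernel_route, dev_mtu, ra_mtu);

    return route_mtu > limit ? limit : route_mtu;
}
```

With the numbers from the bug report (interface MTU 16110, RA MTU 1500), a user-added route with MTU 7200 would keep 7200 under this rule, while a kernel route would still be limited to 1500.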
--
Simon Arlott
* [PATCH] ethtool: Add support for setting multiple rx/tx queues
From: Auke Kok @ 2007-07-31 20:21 UTC
To: netdev; +Cc: jgarzik, davem
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---
ethtool-copy.h | 23 +++++++++++++
ethtool.8 | 23 +++++++++++++
ethtool.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 149 insertions(+), 0 deletions(-)
diff --git a/ethtool-copy.h b/ethtool-copy.h
index ab9d688..aefd580 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -196,6 +196,23 @@ struct ethtool_ringparam {
__u32 tx_pending;
};
+/* for configuring RX/TX queue count */
+struct ethtool_queueparam {
+ __u32 cmd; /* ETHTOOL_{G,S}QUEUEPARAM */
+
+ /* Read only attributes. These indicate the maximum number
+ * of RX/TX queues the driver will allow the user to set.
+ */
+ __u32 rx_max;
+ __u32 tx_max;
+
+ /* Values changeable by the user. The valid values are
+ * in the range 1 to the "*_max" counterpart above.
+ */
+ __u32 rx;
+ __u32 tx;
+};
+
/* for configuring link flow control parameters */
struct ethtool_pauseparam {
__u32 cmd; /* ETHTOOL_{G,S}PAUSEPARAM */
@@ -295,6 +312,8 @@ int ethtool_op_set_lro(struct net_device *dev, u32 data);
* set_coalesce: Set interrupt coalescing parameters
* get_ringparam: Report ring sizes
* set_ringparam: Set ring sizes
+ * get_queueparam: Report RX/TX queue counts
+ * set_queueparam: Set RX/TX queue counts
* get_pauseparam: Report pause parameters
* set_pauseparam: Set pause paramters
* get_rx_csum: Report whether receive checksums are turned on or off
@@ -356,6 +375,8 @@ struct ethtool_ops {
int (*set_coalesce)(struct net_device *, struct ethtool_coalesce *);
void (*get_ringparam)(struct net_device *, struct ethtool_ringparam *);
int (*set_ringparam)(struct net_device *, struct ethtool_ringparam *);
+ void (*get_queueparam)(struct net_device *, struct ethtool_queueparam *);
+ int (*set_queueparam)(struct net_device *, struct ethtool_queueparam *);
void (*get_pauseparam)(struct net_device *, struct ethtool_pauseparam*);
int (*set_pauseparam)(struct net_device *, struct ethtool_pauseparam*);
u32 (*get_rx_csum)(struct net_device *);
@@ -422,6 +443,8 @@ struct ethtool_ops {
#define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */
#define ETHTOOL_GLRO 0x00000025 /* Get LRO enable (ethtool_value) */
#define ETHTOOL_SLRO 0x00000026 /* Set LRO enable (ethtool_value) */
+#define ETHTOOL_GQUEUEPARAM 0x00000027 /* Get queue parameters */
+#define ETHTOOL_SQUEUEPARAM 0x00000028 /* Set queue parameters. */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/ethtool.8 b/ethtool.8
index 89abf08..3f2e0c0 100644
--- a/ethtool.8
+++ b/ethtool.8
@@ -120,6 +120,17 @@ ethtool \- Display or change ethernet card settings
.RB [ tx
.IR N ]
+
+.B ethtool \-q|\-\-show\-queue
+.I ethX
+
+.B ethtool \-Q|\-\-set\-queue
+.I ethX
+.RB [ rx
+.IR N ]
+.RB [ tx
+.IR N ]
+
.B ethtool \-i|\-\-driver
.I ethX
@@ -243,6 +254,18 @@ Changes the number of ring entries for the Rx Jumbo ring.
.BI tx \ N
Changes the number of ring entries for the Tx ring.
.TP
+.B \-q \-\-show\-queue
+Queries the specified ethernet device for rx/tx queue parameter information.
+.TP
+.B \-Q \-\-set\-queue
+Changes the rx/tx queue parameters of the specified ethernet device.
+.TP
+.BI rx \ N
+Changes the number of Rx queues.
+.TP
+.BI tx \ N
+Changes the number of Tx queues.
+.TP
.B \-i \-\-driver
Queries the specified ethernet device for associated driver information.
.TP
diff --git a/ethtool.c b/ethtool.c
index 4c9844a..227349f 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -64,6 +64,8 @@ static int do_gpause(int fd, struct ifreq *ifr);
static int do_spause(int fd, struct ifreq *ifr);
static int do_gring(int fd, struct ifreq *ifr);
static int do_sring(int fd, struct ifreq *ifr);
+static int do_gqueue(int fd, struct ifreq *ifr);
+static int do_squeue(int fd, struct ifreq *ifr);
static int do_gcoalesce(int fd, struct ifreq *ifr);
static int do_scoalesce(int fd, struct ifreq *ifr);
static int do_goffload(int fd, struct ifreq *ifr);
@@ -87,6 +89,8 @@ static enum {
MODE_SCOALESCE,
MODE_GRING,
MODE_SRING,
+ MODE_GQUEUE,
+ MODE_SQUEUE,
MODE_GOFFLOAD,
MODE_SOFFLOAD,
MODE_GSTATS,
@@ -144,6 +148,10 @@ static struct option {
" [ rx-mini N ]\n"
" [ rx-jumbo N ]\n"
" [ tx N ]\n" },
+ { "-q", "--show-queue", MODE_GQUEUE, "Query RX/TX queue parameters" },
+ { "-Q", "--set-queue", MODE_SQUEUE, "Set RX/TX queue parameters",
+ " [ rx N ]\n"
+ " [ tx N ]\n" },
{ "-k", "--show-offload", MODE_GOFFLOAD, "Get protocol offload information" },
{ "-K", "--offload", MODE_SOFFLOAD, "Set protocol offload",
" [ rx on|off ]\n"
@@ -216,6 +224,11 @@ static int ring_rx_mini_wanted = -1;
static int ring_rx_jumbo_wanted = -1;
static int ring_tx_wanted = -1;
+static struct ethtool_queueparam equeue;
+static int gqueue_changed = 0;
+static int queue_rx = -1;
+static int queue_tx = -1;
+
static struct ethtool_coalesce ecoal;
static int gcoalesce_changed = 0;
static int coal_stats_wanted = -1;
@@ -328,6 +341,11 @@ static struct cmdline_info cmdline_ring[] = {
{ "tx", CMDL_INT, &ring_tx_wanted, &ering.tx_pending },
};
+static struct cmdline_info cmdline_queue[] = {
+ { "rx", CMDL_INT, &queue_rx, &equeue.rx},
+ { "tx", CMDL_INT, &queue_tx, &equeue.tx},
+};
+
static struct cmdline_info cmdline_coalesce[] = {
{ "adaptive-rx", CMDL_BOOL, &coal_adaptive_rx_wanted, &ecoal.use_adaptive_rx_coalesce },
{ "adaptive-tx", CMDL_BOOL, &coal_adaptive_tx_wanted, &ecoal.use_adaptive_tx_coalesce },
@@ -430,6 +448,8 @@ static void parse_cmdline(int argc, char **argp)
(mode == MODE_SCOALESCE) ||
(mode == MODE_GRING) ||
(mode == MODE_SRING) ||
+ (mode == MODE_GQUEUE) ||
+ (mode == MODE_SQUEUE) ||
(mode == MODE_GOFFLOAD) ||
(mode == MODE_SOFFLOAD) ||
(mode == MODE_GSTATS) ||
@@ -496,6 +516,14 @@ static void parse_cmdline(int argc, char **argp)
i = argc;
break;
}
+ if (mode == MODE_SQUEUE) {
+ parse_generic_cmdline(argc, argp, i,
+ &gqueue_changed,
+ cmdline_queue,
+ ARRAY_SIZE(cmdline_queue));
+ i = argc;
+ break;
+ }
if (mode == MODE_SCOALESCE) {
parse_generic_cmdline(argc, argp, i,
&gcoalesce_changed,
@@ -1150,6 +1178,26 @@ static int dump_ring(void)
return 0;
}
+static int dump_queue(void)
+{
+ fprintf(stdout,
+ "Pre-set maximums:\n"
+ "RX: %u\n"
+ "TX: %u\n",
+ equeue.rx_max,
+ equeue.tx_max);
+
+ fprintf(stdout,
+ "Current hardware settings:\n"
+ "RX: %u\n"
+ "TX: %u\n",
+ equeue.rx,
+ equeue.tx);
+
+ fprintf(stdout, "\n");
+ return 0;
+}
+
static int dump_coalesce(void)
{
fprintf(stdout, "Adaptive RX: %s TX: %s\n",
@@ -1278,6 +1326,10 @@ static int doit(void)
return do_gring(fd, &ifr);
} else if (mode == MODE_SRING) {
return do_sring(fd, &ifr);
+ } else if (mode == MODE_GQUEUE) {
+ return do_gqueue(fd, &ifr);
+ } else if (mode == MODE_SQUEUE) {
+ return do_squeue(fd, &ifr);
} else if (mode == MODE_GOFFLOAD) {
return do_goffload(fd, &ifr);
} else if (mode == MODE_SOFFLOAD) {
@@ -1435,6 +1487,57 @@ static int do_gring(int fd, struct ifreq *ifr)
return 0;
}
+static int do_squeue(int fd, struct ifreq *ifr)
+{
+ int err, changed = 0;
+
+ equeue.cmd = ETHTOOL_GQUEUEPARAM;
+ ifr->ifr_data = (caddr_t)&equeue;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err) {
+ perror("Cannot get device queue settings");
+ return 76;
+ }
+
+ do_generic_set(cmdline_queue, ARRAY_SIZE(cmdline_queue), &changed);
+
+ if (!changed) {
+ fprintf(stderr, "no queue parameters changed, aborting\n");
+ return 80;
+ }
+
+ equeue.cmd = ETHTOOL_SQUEUEPARAM;
+ ifr->ifr_data = (caddr_t)&equeue;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err) {
+ perror("Cannot set device queue parameters");
+ return 81;
+ }
+
+ return 0;
+}
+
+static int do_gqueue(int fd, struct ifreq *ifr)
+{
+ int err;
+
+ fprintf(stdout, "Queue parameters for %s:\n", devname);
+
+ equeue.cmd = ETHTOOL_GQUEUEPARAM;
+ ifr->ifr_data = (caddr_t)&equeue;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err == 0) {
+ err = dump_queue();
+ if (err)
+ return err;
+ } else {
+ perror("Cannot get device queue settings");
+ return 76;
+ }
+
+ return 0;
+}
+
static int do_gcoalesce(int fd, struct ifreq *ifr)
{
int err;
* [PATCH] [NET] ethtool: Add support for multiple queues
From: Auke Kok @ 2007-07-31 20:21 UTC
To: netdev; +Cc: jgarzik, davem
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---
include/linux/ethtool.h | 23 +++++++++++++++++++++++
net/core/ethtool.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+), 0 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ab9d688..aefd580 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -196,6 +196,23 @@ struct ethtool_ringparam {
__u32 tx_pending;
};
+/* for configuring RX/TX queue count */
+struct ethtool_queueparam {
+ __u32 cmd; /* ETHTOOL_{G,S}QUEUEPARAM */
+
+ /* Read only attributes. These indicate the maximum number
+ * of RX/TX queues the driver will allow the user to set.
+ */
+ __u32 rx_max;
+ __u32 tx_max;
+
+ /* Values changeable by the user. The valid values are
+ * in the range 1 to the "*_max" counterpart above.
+ */
+ __u32 rx;
+ __u32 tx;
+};
+
/* for configuring link flow control parameters */
struct ethtool_pauseparam {
__u32 cmd; /* ETHTOOL_{G,S}PAUSEPARAM */
@@ -295,6 +312,8 @@ int ethtool_op_set_lro(struct net_device *dev, u32 data);
* set_coalesce: Set interrupt coalescing parameters
* get_ringparam: Report ring sizes
* set_ringparam: Set ring sizes
+ * get_queueparam: Report RX/TX queue counts
+ * set_queueparam: Set RX/TX queue counts
* get_pauseparam: Report pause parameters
* set_pauseparam: Set pause paramters
* get_rx_csum: Report whether receive checksums are turned on or off
@@ -356,6 +375,8 @@ struct ethtool_ops {
int (*set_coalesce)(struct net_device *, struct ethtool_coalesce *);
void (*get_ringparam)(struct net_device *, struct ethtool_ringparam *);
int (*set_ringparam)(struct net_device *, struct ethtool_ringparam *);
+ void (*get_queueparam)(struct net_device *, struct ethtool_queueparam *);
+ int (*set_queueparam)(struct net_device *, struct ethtool_queueparam *);
void (*get_pauseparam)(struct net_device *, struct ethtool_pauseparam*);
int (*set_pauseparam)(struct net_device *, struct ethtool_pauseparam*);
u32 (*get_rx_csum)(struct net_device *);
@@ -422,6 +443,8 @@ struct ethtool_ops {
#define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */
#define ETHTOOL_GLRO 0x00000025 /* Get LRO enable (ethtool_value) */
#define ETHTOOL_SLRO 0x00000026 /* Set LRO enable (ethtool_value) */
+#define ETHTOOL_GQUEUEPARAM 0x00000027 /* Get queue parameters */
+#define ETHTOOL_SQUEUEPARAM 0x00000028 /* Set queue parameters. */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 23ccaa1..f1a1234 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -443,6 +443,33 @@ static int ethtool_set_ringparam(struct net_device *dev, void __user *useraddr)
return dev->ethtool_ops->set_ringparam(dev, &ringparam);
}
+static int ethtool_get_queueparam(struct net_device *dev, void __user *useraddr)
+{
+ struct ethtool_queueparam queueparam = { ETHTOOL_GQUEUEPARAM };
+
+ if (!dev->ethtool_ops->get_queueparam)
+ return -EOPNOTSUPP;
+
+ dev->ethtool_ops->get_queueparam(dev, &queueparam);
+
+ if (copy_to_user(useraddr, &queueparam, sizeof(queueparam)))
+ return -EFAULT;
+ return 0;
+}
+
+static int ethtool_set_queueparam(struct net_device *dev, void __user *useraddr)
+{
+ struct ethtool_queueparam queueparam;
+
+ if (!dev->ethtool_ops->set_queueparam)
+ return -EOPNOTSUPP;
+
+ if (copy_from_user(&queueparam, useraddr, sizeof(queueparam)))
+ return -EFAULT;
+
+ return dev->ethtool_ops->set_queueparam(dev, &queueparam);
+}
+
static int ethtool_get_pauseparam(struct net_device *dev, void __user *useraddr)
{
struct ethtool_pauseparam pauseparam = { ETHTOOL_GPAUSEPARAM };
@@ -875,6 +902,7 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_GMSGLVL:
case ETHTOOL_GCOALESCE:
case ETHTOOL_GRINGPARAM:
+ case ETHTOOL_GQUEUEPARAM:
case ETHTOOL_GPAUSEPARAM:
case ETHTOOL_GRXCSUM:
case ETHTOOL_GTXCSUM:
@@ -946,6 +974,12 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_SRINGPARAM:
rc = ethtool_set_ringparam(dev, useraddr);
break;
+ case ETHTOOL_GQUEUEPARAM:
+ rc = ethtool_get_queueparam(dev, useraddr);
+ break;
+ case ETHTOOL_SQUEUEPARAM:
+ rc = ethtool_set_queueparam(dev, useraddr);
+ break;
case ETHTOOL_GPAUSEPARAM:
rc = ethtool_get_pauseparam(dev, useraddr);
break;
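The header comment in this patch says the user-settable `rx`/`tx` values are valid in the range 1 to the corresponding `*_max`, but the core code shown passes the structure straight to the driver. A minimal sketch of the check a driver's `set_queueparam` hook might perform follows; the struct is a user-space copy using `uint32_t` in place of `__u32`, and `queueparam_valid` is a hypothetical helper, not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space copy of the layout proposed in the patch. */
struct queueparam {
    uint32_t cmd;
    uint32_t rx_max; /* read-only: most RX queues the driver allows */
    uint32_t tx_max; /* read-only: most TX queues the driver allows */
    uint32_t rx;     /* requested RX queue count */
    uint32_t tx;     /* requested TX queue count */
};

/* True if both requested counts lie in 1..max, as the comment requires. */
static bool queueparam_valid(const struct queueparam *qp)
{
    assert(qp != NULL);
    return qp->rx >= 1 && qp->rx <= qp->rx_max &&
           qp->tx >= 1 && qp->tx <= qp->tx_max;
}
```

A driver would presumably return `-EINVAL` when such a check fails, but the patch does not specify that.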
* [PATCH] ethtool: Add LRO support
From: Auke Kok @ 2007-07-31 20:21 UTC
To: netdev; +Cc: jgarzik, davem
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---
ethtool-copy.h | 8 ++++++++
ethtool.8 | 8 ++++++--
ethtool.c | 39 +++++++++++++++++++++++++++++++++------
3 files changed, 47 insertions(+), 8 deletions(-)
diff --git a/ethtool-copy.h b/ethtool-copy.h
index 3a63224..ab9d688 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -274,6 +274,8 @@ int ethtool_op_get_perm_addr(struct net_device *dev,
struct ethtool_perm_addr *addr, u8 *data);
u32 ethtool_op_get_ufo(struct net_device *dev);
int ethtool_op_set_ufo(struct net_device *dev, u32 data);
+u32 ethtool_op_get_lro(struct net_device *dev);
+int ethtool_op_set_lro(struct net_device *dev, u32 data);
/**
* &ethtool_ops - Alter and report network device settings
@@ -305,6 +307,8 @@ int ethtool_op_set_ufo(struct net_device *dev, u32 data);
* set_tso: Turn TCP segmentation offload on or off
* get_ufo: Report whether UDP fragmentation offload is enabled
* set_ufo: Turn UDP fragmentation offload on or off
+ * get_lro: Report whether large receive offload is enabled
+ * set_lro: Turn large receive offload on or off
* self_test: Run specified self-tests
* get_strings: Return a set of strings that describe the requested objects
* phys_id: Identify the device
@@ -373,6 +377,8 @@ struct ethtool_ops {
void (*complete)(struct net_device *);
u32 (*get_ufo)(struct net_device *);
int (*set_ufo)(struct net_device *, u32);
+ u32 (*get_lro)(struct net_device *);
+ int (*set_lro)(struct net_device *, u32);
};
#endif /* __KERNEL__ */
@@ -414,6 +420,8 @@ struct ethtool_ops {
#define ETHTOOL_SUFO 0x00000022 /* Set UFO enable (ethtool_value) */
#define ETHTOOL_GGSO 0x00000023 /* Get GSO enable (ethtool_value) */
#define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GLRO 0x00000025 /* Get LRO enable (ethtool_value) */
+#define ETHTOOL_SLRO 0x00000026 /* Set LRO enable (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/ethtool.8 b/ethtool.8
index af51056..89abf08 100644
--- a/ethtool.8
+++ b/ethtool.8
@@ -158,6 +158,7 @@ ethtool \- Display or change ethernet card settings
.B2 tso on off
.B2 ufo on off
.B2 gso on off
+.B2 lro on off
.B ethtool \-p|\-\-blink
.I ethX
@@ -289,10 +290,13 @@ Specifies whether scatter-gather should be enabled.
Specifies whether TCP segmentation offload should be enabled.
.TP
.A2 ufo on off
-Specifies whether UDP fragmentation offload should be enabled
+Specifies whether UDP fragmentation offload should be enabled.
.TP
.A2 gso on off
-Specifies whether generic segmentation offload should be enabled
+Specifies whether generic segmentation offload should be enabled.
+.TP
+.A2 lro on off
+Specifies whether large receive offload should be enabled.
.TP
.B \-p \-\-identify
Initiates adapter-specific action intended to enable an operator to
diff --git a/ethtool.c b/ethtool.c
index b04f747..4c9844a 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -151,7 +151,8 @@ static struct option {
" [ sg on|off ]\n"
" [ tso on|off ]\n"
" [ ufo on|off ]\n"
- " [ gso on|off ]\n" },
+ " [ gso on|off ]\n"
+ " [ lro on|off ]\n" },
{ "-i", "--driver", MODE_GDRV, "Show driver information" },
{ "-d", "--register-dump", MODE_GREGS, "Do a register dump",
" [ raw on|off ]\n"
@@ -200,6 +201,7 @@ static int off_sg_wanted = -1;
static int off_tso_wanted = -1;
static int off_ufo_wanted = -1;
static int off_gso_wanted = -1;
+static int off_lro_wanted = -1;
static struct ethtool_pauseparam epause;
static int gpause_changed = 0;
@@ -310,6 +312,7 @@ static struct cmdline_info cmdline_offload[] = {
{ "tso", CMDL_BOOL, &off_tso_wanted, NULL },
{ "ufo", CMDL_BOOL, &off_ufo_wanted, NULL },
{ "gso", CMDL_BOOL, &off_gso_wanted, NULL },
+ { "lro", CMDL_BOOL, &off_lro_wanted, NULL },
};
static struct cmdline_info cmdline_pause[] = {
@@ -1207,7 +1210,7 @@ static int dump_coalesce(void)
return 0;
}
-static int dump_offload (int rx, int tx, int sg, int tso, int ufo, int gso)
+static int dump_offload (int rx, int tx, int sg, int tso, int ufo, int gso, int lro)
{
fprintf(stdout,
"rx-checksumming: %s\n"
@@ -1215,13 +1218,15 @@ static int dump_offload (int rx, int tx, int sg, int tso, int ufo, int gso)
"scatter-gather: %s\n"
"tcp segmentation offload: %s\n"
"udp fragmentation offload: %s\n"
- "generic segmentation offload: %s\n",
+ "generic segmentation offload: %s\n"
+ "large receive offload: %s\n",
rx ? "on" : "off",
tx ? "on" : "off",
sg ? "on" : "off",
tso ? "on" : "off",
ufo ? "on" : "off",
- gso ? "on" : "off");
+ gso ? "on" : "off",
+ lro ? "on" : "off");
return 0;
}
@@ -1485,7 +1490,8 @@ static int do_scoalesce(int fd, struct ifreq *ifr)
static int do_goffload(int fd, struct ifreq *ifr)
{
struct ethtool_value eval;
- int err, allfail = 1, rx = 0, tx = 0, sg = 0, tso = 0, ufo = 0, gso = 0;
+ int err, allfail = 1;
+ int rx = 0, tx = 0, sg = 0, tso = 0, ufo = 0, gso = 0, lro = 0;
fprintf(stdout, "Offload parameters for %s:\n", devname);
@@ -1549,12 +1555,22 @@ static int do_goffload(int fd, struct ifreq *ifr)
allfail = 0;
}
+ eval.cmd = ETHTOOL_GLRO;
+ ifr->ifr_data = (caddr_t)&eval;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err)
+ perror("Cannot get device large receive offload settings");
+ else {
+ lro = eval.data;
+ allfail = 0;
+ }
+
if (allfail) {
fprintf(stdout, "no offload info available\n");
return 83;
}
- return dump_offload(rx, tx, sg, tso, ufo, gso);
+ return dump_offload(rx, tx, sg, tso, ufo, gso, lro);
}
static int do_soffload(int fd, struct ifreq *ifr)
@@ -1631,6 +1647,17 @@ static int do_soffload(int fd, struct ifreq *ifr)
return 90;
}
}
+ if (off_lro_wanted >= 0) {
+ changed = 1;
+ eval.cmd = ETHTOOL_SLRO;
+ eval.data = (off_lro_wanted == 1);
+ ifr->ifr_data = (caddr_t)&eval;
+ err = ioctl(fd, SIOCETHTOOL, ifr);
+ if (err) {
+ perror("Cannot set device large receive offload settings");
+ return 91;
+ }
+ }
if (!changed) {
fprintf(stdout, "no offload settings changed\n");
}
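The `lro on|off` switch added here ends up toggling a single feature bit on the device, and in the kernel-side companion patch `__ethtool_set_sg()` also switches LRO off when scatter-gather is disabled. The following user-space model sketches that behavior; `dev_model` and the helper names are hypothetical, `F_LRO` uses the 32768 value from the kernel patch, and `F_SG` is an illustrative value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define F_SG  0x0001u /* scatter-gather (illustrative value) */
#define F_LRO 0x8000u /* 32768, as NETIF_F_LRO in the kernel patch */

struct dev_model {
    uint32_t features;
};

/* Set or clear the LRO bit, mirroring the default set_lro helper. */
static void set_lro(struct dev_model *dev, int on)
{
    assert(dev != NULL);
    if (on)
        dev->features |= F_LRO;
    else
        dev->features &= ~F_LRO;
}

/* Disabling scatter-gather drops LRO first, then updates the SG bit. */
static void set_sg(struct dev_model *dev, int on)
{
    assert(dev != NULL);
    if (!on)
        set_lro(dev, 0);
    if (on)
        dev->features |= F_SG;
    else
        dev->features &= ~F_SG;
}

/* Report LRO as 0 or 1, like the default get_lro helper. */
static int get_lro(const struct dev_model *dev)
{
    assert(dev != NULL);
    return (dev->features & F_LRO) != 0;
}
```

Turning SG back on does not re-enable LRO in this model; the user would have to request `lro on` again.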
* [PATCH] [NET] ethtool: Add LRO support
From: Auke Kok @ 2007-07-31 20:21 UTC
To: netdev; +Cc: jgarzik, davem
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
---
include/linux/ethtool.h | 8 +++++++
include/linux/netdevice.h | 1 +
net/core/ethtool.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 62 insertions(+), 1 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 3a63224..ab9d688 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -274,6 +274,8 @@ int ethtool_op_get_perm_addr(struct net_device *dev,
struct ethtool_perm_addr *addr, u8 *data);
u32 ethtool_op_get_ufo(struct net_device *dev);
int ethtool_op_set_ufo(struct net_device *dev, u32 data);
+u32 ethtool_op_get_lro(struct net_device *dev);
+int ethtool_op_set_lro(struct net_device *dev, u32 data);
/**
* &ethtool_ops - Alter and report network device settings
@@ -305,6 +307,8 @@ int ethtool_op_set_ufo(struct net_device *dev, u32 data);
* set_tso: Turn TCP segmentation offload on or off
* get_ufo: Report whether UDP fragmentation offload is enabled
* set_ufo: Turn UDP fragmentation offload on or off
+ * get_lro: Report whether large receive offload is enabled
+ * set_lro: Turn large receive offload on or off
* self_test: Run specified self-tests
* get_strings: Return a set of strings that describe the requested objects
* phys_id: Identify the device
@@ -373,6 +377,8 @@ struct ethtool_ops {
void (*complete)(struct net_device *);
u32 (*get_ufo)(struct net_device *);
int (*set_ufo)(struct net_device *, u32);
+ u32 (*get_lro)(struct net_device *);
+ int (*set_lro)(struct net_device *, u32);
};
#endif /* __KERNEL__ */
@@ -414,6 +420,8 @@ struct ethtool_ops {
#define ETHTOOL_SUFO 0x00000022 /* Set UFO enable (ethtool_value) */
#define ETHTOOL_GGSO 0x00000023 /* Get GSO enable (ethtool_value) */
#define ETHTOOL_SGSO 0x00000024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GLRO 0x00000025 /* Get LRO enable (ethtool_value) */
+#define ETHTOOL_SLRO 0x00000026 /* Set LRO enable (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4a616d7..4863ffc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -341,6 +341,7 @@ struct net_device
#define NETIF_F_GSO 2048 /* Enable software GSO. */
#define NETIF_F_LLTX 4096 /* LockLess TX */
#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */
+#define NETIF_F_LRO 32768 /* Has large receive offload */
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 0b531e9..23ccaa1 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -104,7 +104,6 @@ int ethtool_op_get_perm_addr(struct net_device *dev, struct ethtool_perm_addr *a
return 0;
}
-
u32 ethtool_op_get_ufo(struct net_device *dev)
{
return (dev->features & NETIF_F_UFO) != 0;
@@ -119,6 +118,20 @@ int ethtool_op_set_ufo(struct net_device *dev, u32 data)
return 0;
}
+u32 ethtool_op_get_lro(struct net_device *dev)
+{
+ return (dev->features & NETIF_F_LRO) != 0;
+}
+
+int ethtool_op_set_lro(struct net_device *dev, u32 data)
+{
+ if (data)
+ dev->features |= NETIF_F_LRO;
+ else
+ dev->features &= ~NETIF_F_LRO;
+ return 0;
+}
+
/* Handlers for each ethtool command */
static int ethtool_get_settings(struct net_device *dev, void __user *useraddr)
@@ -514,6 +527,13 @@ static int __ethtool_set_sg(struct net_device *dev, u32 data)
if (err)
return err;
}
+
+ if (!data && dev->ethtool_ops->set_lro) {
+ err = dev->ethtool_ops->set_lro(dev, 0);
+ if (err)
+ return err;
+ }
+
return dev->ethtool_ops->set_sg(dev, data);
}
@@ -625,6 +645,29 @@ static int ethtool_set_ufo(struct net_device *dev, char __user *useraddr)
return dev->ethtool_ops->set_ufo(dev, edata.data);
}
+static int ethtool_get_lro(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata = { ETHTOOL_GLRO };
+
+ edata.data = dev->features & NETIF_F_LRO;
+ if (copy_to_user(useraddr, &edata, sizeof(edata)))
+ return -EFAULT;
+ return 0;
+}
+
+static int ethtool_set_lro(struct net_device *dev, char __user *useraddr)
+{
+ struct ethtool_value edata;
+
+ if (copy_from_user(&edata, useraddr, sizeof(edata)))
+ return -EFAULT;
+ if (edata.data)
+ dev->features |= NETIF_F_LRO;
+ else
+ dev->features &= ~NETIF_F_LRO;
+ return 0;
+}
+
static int ethtool_get_gso(struct net_device *dev, char __user *useraddr)
{
struct ethtool_value edata = { ETHTOOL_GGSO };
@@ -840,6 +883,7 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_GTSO:
case ETHTOOL_GPERMADDR:
case ETHTOOL_GUFO:
+ case ETHTOOL_GLRO:
case ETHTOOL_GGSO:
break;
default:
@@ -953,6 +997,12 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_SUFO:
rc = ethtool_set_ufo(dev, useraddr);
break;
+ case ETHTOOL_GLRO:
+ rc = ethtool_get_lro(dev, useraddr);
+ break;
+ case ETHTOOL_SLRO:
+ rc = ethtool_set_lro(dev, useraddr);
+ break;
case ETHTOOL_GGSO:
rc = ethtool_get_gso(dev, useraddr);
break;
@@ -994,3 +1044,5 @@ EXPORT_SYMBOL(ethtool_op_set_tx_hw_csum);
EXPORT_SYMBOL(ethtool_op_set_tx_ipv6_csum);
EXPORT_SYMBOL(ethtool_op_set_ufo);
EXPORT_SYMBOL(ethtool_op_get_ufo);
+EXPORT_SYMBOL(ethtool_op_set_lro);
+EXPORT_SYMBOL(ethtool_op_get_lro);
* [PATCHES] RFC: Ethtool patches to add multi queue support, LRO
From: Kok, Auke @ 2007-07-31 20:20 UTC
To: NetDev; +Cc: Jeff Garzik, David S. Miller
All,
Recently new features have been written to add multiqueue support and LRO.
However, none of the patches touch on a basic configuration scheme and most use
module parameters.
I propose several patches to add support to change these features for LRO and
multiqueue. Currently these patches are implemented in the most generic way
possible (largely copy/paste from current ethtool code) - and add 'ethtool -k
lro on|off' support that toggles the NETIF_F_LRO generic device flag, and a new
ethtool_queueparam struct to get/pass rx and tx queue count.
I'm contemplating adding a "usecs-irq" non-rx non-tx parameter for the Intel
(1/10) gigabit adapters since our adapters share an interrupt moderation
counter for both rx and tx, but that will come later.
Cheers,
Auke
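
To make the flag handling concrete, here is a minimal userspace sketch of what the 'ethtool -k lro on|off' path boils down to. `struct fake_net_device` and the `model_*` helpers are hypothetical stand-ins, not kernel API; only the `NETIF_F_LRO` value (32768) is taken from the patch, and the value used for `NETIF_F_SG` is an assumption. The sketch also models the `__ethtool_set_sg()` hunk, where turning scatter-gather off first turns LRO off.

```c
#include <assert.h>
#include <stdint.h>

#define NETIF_F_SG   1u      /* assumed value; not shown in the patch */
#define NETIF_F_LRO  32768u  /* value added by the patch */

/* Hypothetical stand-in for struct net_device: only the feature word. */
struct fake_net_device {
	uint32_t features;
};

/* Mirrors ethtool_op_get_lro(): report the flag as 0 or 1. */
static uint32_t model_get_lro(const struct fake_net_device *dev)
{
	return (dev->features & NETIF_F_LRO) != 0;
}

/* Mirrors ethtool_op_set_lro(): set or clear the flag, always succeeds. */
static int model_set_lro(struct fake_net_device *dev, uint32_t data)
{
	if (data)
		dev->features |= NETIF_F_LRO;
	else
		dev->features &= ~NETIF_F_LRO;
	return 0;
}

/* Models the __ethtool_set_sg() hunk: disabling SG disables LRO first. */
static int model_set_sg(struct fake_net_device *dev, uint32_t data)
{
	if (!data) {
		int err = model_set_lro(dev, 0);

		if (err)
			return err;
	}
	if (data)
		dev->features |= NETIF_F_SG;
	else
		dev->features &= ~NETIF_F_SG;
	return 0;
}
```

Starting from a device with only SG set, enabling LRO and then disabling SG should leave the feature word empty, which is the coupling the patch introduces.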
* Re: [PATCH] [NET]: fix multicast list when cloning sockets
From: Flavio Leitner @ 2007-07-31 18:29 UTC
To: netdev; +Cc: David Miller, dlstevens, Arnaldo Carvalho de Melo
In-Reply-To: <39e6f6c70707302000l52926c9ar927fd550467ce3e3@mail.gmail.com>
On Tue, Jul 31, 2007 at 12:00:41AM -0300, Arnaldo Carvalho de Melo wrote:
> On 7/30/07, David Miller <davem@davemloft.net> wrote:
> > Allowing non-datagram sockets to end up with a non-NULL inet->mc_list
> > in the first place is a bug.
> >
> > Multicast subscriptions cannot even be used with TCP and DCCP, which
> > are the only two users of these connection oriented socket functions.
> >
> > The first thing that TCP and DCCP do, in fact, for input packet
> > processing is drop the packet if it is not unicast.
> >
> > Therefore the fix really is for the inet layer to reject multicast
> > subscription requests on sockets for which that absolutely does not
> > make sense. There is no reason these functions in
> > inet_connection_sock.c should need to be mindful of multicast
> > state. :-)
>
> Well, we can add a BUG_ON there then 8)
>
> Flavio, take a look at do_ip_setsockopt in net/ipv4/ip_sockglue.c, in
> the IP_{ADD,DROP}_MEMBERSHIP labels.
>
> Don't forget IPV6 (net/ipv6/ipv6_sockglue.c)
yes, right. What about the one below?
[NET]: Fix IP_ADD/DROP_MEMBERSHIP to handle only connectionless
Fix IP[V6]_ADD_MEMBERSHIP and IP[V6]_DROP_MEMBERSHIP to
return -EPROTO for connection oriented sockets.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 4d54457..6b420ae 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -625,6 +625,10 @@ static int do_ip_setsockopt(struct sock *sk, int level,
{
struct ip_mreqn mreq;
+ err = -EPROTO;
+ if (inet_sk(sk)->is_icsk)
+ break;
+
if (optlen < sizeof(struct ip_mreq))
goto e_inval;
err = -EFAULT;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index d684639..350e584 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -554,6 +554,10 @@ done:
{
struct ipv6_mreq mreq;
+ retv = -EPROTO;
+ if (inet_sk(sk)->is_icsk)
+ break;
+
retv = -EFAULT;
if (copy_from_user(&mreq, optval, sizeof(struct ipv6_mreq)))
break;
--
1.5.2.4
--
Flavio
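
The order of checks in the patched IPv4 path can be modelled in a few lines of userspace C. This is only a sketch: `struct sock_model` and `model_membership_check()` are hypothetical names, and `min_len` plays the role of `sizeof(struct ip_mreq)`. It shows that a connection-oriented socket is refused with -EPROTO before the option length is even looked at.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket state the check looks at. */
struct sock_model {
	int is_icsk; /* non-zero for connection-oriented sockets (TCP, DCCP) */
};

/*
 * Models the patched IP_ADD/DROP_MEMBERSHIP entry: refuse
 * connection-oriented sockets first, then validate the option length.
 */
static int model_membership_check(const struct sock_model *sk,
				  size_t optlen, size_t min_len)
{
	if (sk->is_icsk)
		return -EPROTO;
	if (optlen < min_len)
		return -EINVAL;
	return 0;
}
```

A datagram socket with a short option buffer still gets -EINVAL, as before the patch.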
* [PATCH 39] net/ipv4/ip_options.c: kmalloc + memset conversion to kzalloc
From: Mariusz Kozlowski @ 2007-07-31 18:16 UTC
To: linux-kernel; +Cc: kernel-janitors, Andrew Morton, netdev, davem
In-Reply-To: <200707311845.48807.m.kozlowski@tuxland.pl>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
net/ipv4/ip_options.c | 15425 -> 15368 (-57 bytes)
net/ipv4/ip_options.o | 133668 -> 133588 (-80 bytes)
net/ipv4/ip_options.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
--- linux-2.6.23-rc1-mm1-a/net/ipv4/ip_options.c 2007-07-26 13:07:44.000000000 +0200
+++ linux-2.6.23-rc1-mm1-b/net/ipv4/ip_options.c 2007-07-31 15:17:22.000000000 +0200
@@ -513,11 +513,8 @@ void ip_options_undo(struct ip_options *
static struct ip_options *ip_options_get_alloc(const int optlen)
{
- struct ip_options *opt = kmalloc(sizeof(*opt) + ((optlen + 3) & ~3),
- GFP_KERNEL);
- if (opt)
- memset(opt, 0, sizeof(*opt));
- return opt;
+ return kzalloc(sizeof(struct ip_options) + ((optlen + 3) & ~3),
+ GFP_KERNEL);
}
static int ip_options_get_finish(struct ip_options **optp,
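
For readers unfamiliar with the pattern, the conversion can be tried out in userspace with calloc() standing in for kzalloc(). The names `round_up4`, `opts_alloc` and the `struct opts_model` layout are made up for this sketch. One visible difference from the old code: the old memset() cleared only sizeof(*opt), while a zeroing allocator clears the trailing option bytes as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ip_options: header plus trailing data. */
struct opts_model {
	unsigned char optlen;
	unsigned char data[]; /* option bytes follow the header */
};

/* Round n up to the next multiple of 4, as (optlen + 3) & ~3 does. */
static size_t round_up4(size_t n)
{
	return (n + 3) & ~(size_t)3;
}

/* calloc() stands in for kzalloc(): the whole allocation comes back zeroed. */
static struct opts_model *opts_alloc(size_t optlen)
{
	return calloc(1, sizeof(struct opts_model) + round_up4(optlen));
}
```

An option length of 5 is padded to 8 bytes, and all 8 bytes read back as zero without a separate memset().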
* Re: [PATCH 2.6.23 1/2] Make the iw_cxgb3 module parameters writable.
From: Steve Wise @ 2007-07-31 17:57 UTC
To: Roland Dreier; +Cc: general, linux-kernel, netdev
In-Reply-To: <adad4y9skxm.fsf@cisco.com>
Roland Dreier wrote:
> ugh, missed these before my last merge...
>
> anyway:
>
> why do we want to make the parameters writable? a good changelog tells me
> what, why and how, and this changelog just covered the "what". Also,
> I assume you've checked that it's OK for these variables to change at
> any time?
I want to be able to change these parameters at run time. Eventually,
we might want these parameters as rdma connection setup parameters.
For now, it's useful to be able to set them without reloading.
Also, it is safe to change them at any time. All of these are read once
and utilized at connection setup. So changing them is safe in that
existing connections aren't affected, and only subsequent connections
will utilize the new values.
Sorry for the terse changelog...
Steve.
* Re: RFC: on [ab]use of skb->cb by VLAN code
From: Roland Dreier @ 2007-07-31 17:50 UTC
To: Rick Jones; +Cc: Ben Greear, David Miller, hadi, kaber, netdev, mcarlson
In-Reply-To: <46AF69C7.1090504@hp.com>
> > Do we really need an 'unsigned int' for mac_len? Maybe we could use
> > a 16-bit counter here, and then use the other 16 bits for the VLAN bits?
>
> Not knowing exactly if/how it interacts with that specific field I
> will point-out that IPoIB in OFED 1.2 just took their MTU to 65520.
> While that doesn't break the bitbank it does get rather close.
Leaving aside OFED releases, the IPoIB connected mode code in the
standard kernel also allows the MTU to go up to 65520. And there's
nothing magic about that value -- we could easily do bigger packets.
However, this is irrelevant for two reasons: mac_len is the length of
the LL header, not the packet overall, *and* mac_len is already 16
bits as of commit 334a8132.
- R.
* [PATCH 20] net/decnet/dn_route.c: kmalloc + memset conversion to kzalloc
From: Mariusz Kozlowski @ 2007-07-31 17:33 UTC
To: linux-kernel; +Cc: kernel-janitors, Andrew Morton, netdev, patrick
In-Reply-To: <200707311845.48807.m.kozlowski@tuxland.pl>
Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
net/decnet/dn_route.c | 45013 -> 44991 (-22 bytes)
net/decnet/dn_route.o | 199388 -> 199580 (+192 bytes)
net/decnet/dn_route.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- linux-2.6.23-rc1-mm1-a/net/decnet/dn_route.c 2007-07-26 13:07:44.000000000 +0200
+++ linux-2.6.23-rc1-mm1-b/net/decnet/dn_route.c 2007-07-31 15:15:11.000000000 +0200
@@ -1737,8 +1737,9 @@ static int dn_rt_cache_seq_open(struct i
{
struct seq_file *seq;
int rc = -ENOMEM;
- struct dn_rt_cache_iter_state *s = kmalloc(sizeof(*s), GFP_KERNEL);
-
+ struct dn_rt_cache_iter_state *s;
+
+ s = kzalloc(sizeof(*s), GFP_KERNEL);
if (!s)
goto out;
rc = seq_open(file, &dn_rt_cache_seq_ops);
@@ -1746,7 +1747,6 @@ static int dn_rt_cache_seq_open(struct i
goto out_kfree;
seq = file->private_data;
seq->private = s;
- memset(s, 0, sizeof(*s));
out:
return rc;
out_kfree:
* Re: [Lksctp-developers] [PATCH] SCTP: drop SACK if ctsn is not less than the next tsn of assoc
From: Sridhar Samudrala @ 2007-07-31 17:28 UTC
To: Neil Horman; +Cc: Wei Yongjun, netdev, lksctp-developers
In-Reply-To: <20070731113709.GA28333@hmsreliant.homelinux.net>
On Tue, 2007-07-31 at 07:37 -0400, Neil Horman wrote:
> On Tue, Jul 31, 2007 at 12:44:27PM +0800, Wei Yongjun wrote:
> > If the SCTP data sender receives a SACK whose Cumulative TSN Ack is not
> > less than the Cumulative TSN Ack Point, and that TSN has not yet been
> > used by the data sender, the sender still accepts this SACK. The next
> > SACK that is correctly sent to the data sender is then dropped, because
> > it is less than the new Cumulative TSN Ack Point.
> > After receiving such a SACK, data is retransmitted again and again even
> > if a correct SACK is received.
> > So I think this SACK must be dropped to let data be transmitted correctly.
> >
> > Following is the tcpdump of my test. The patch in this mail avoids
> > this problem.
> >
> > 02:19:38.233278 sctp (1) [INIT] [init tag: 1250461886] [rwnd: 54784] [OS: 10] [MIS: 65535] [init TSN: 217114040]
> > 02:19:39.782160 sctp (1) [INIT ACK] [init tag: 1] [rwnd: 54784] [OS: 100] [MIS: 65535] [init TSN: 100]
> > 02:19:39.798583 sctp (1) [COOKIE ECHO]
> > 02:19:40.082125 sctp (1) [COOKIE ACK]
> > 02:19:40.097859 sctp (1) [DATA] (B)(E) [TSN: 217114040] [SID: 0] [SSEQ 0] [PPID 0xf192090b]
> > 02:19:40.100162 sctp (1) [DATA] (B)(E) [TSN: 217114041] [SID: 0] [SSEQ 1] [PPID 0x3e467007]
> > 02:19:40.100779 sctp (1) [DATA] (B)(E) [TSN: 217114042] [SID: 0] [SSEQ 2] [PPID 0x11b12a0a]
> > 02:19:40.101200 sctp (1) [DATA] (B)(E) [TSN: 217114043] [SID: 0] [SSEQ 3] [PPID 0x30e7d979]
> > 02:19:40.561147 sctp (1) [SACK] [cum ack 217114040] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:40.568498 sctp (1) [DATA] (B)(E) [TSN: 217114044] [SID: 0] [SSEQ 4] [PPID 0x251ff86f]
> > 02:19:40.569308 sctp (1) [DATA] (B)(E) [TSN: 217114045] [SID: 0] [SSEQ 5] [PPID 0xe5d5da5d]
> > 02:19:40.700584 sctp (1) [SACK] [cum ack 290855864] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:40.701562 sctp (1) [DATA] (B)(E) [TSN: 217114046] [SID: 0] [SSEQ 6] [PPID 0x87d8b423]
> > 02:19:40.701567 sctp (1) [DATA] (B)(E) [TSN: 217114047] [SID: 0] [SSEQ 7] [PPID 0xca47e645]
> > 02:19:40.701569 sctp (1) [DATA] (B)(E) [TSN: 217114048] [SID: 0] [SSEQ 8] [PPID 0x6c0ea150]
> > 02:19:40.701576 sctp (1) [DATA] (B)(E) [TSN: 217114049] [SID: 0] [SSEQ 9] [PPID 0x9cc1994f]
> > 02:19:40.701585 sctp (1) [DATA] (B)(E) [TSN: 217114050] [SID: 0] [SSEQ 10] [PPID 0xb1df4129]
> > 02:19:41.098201 sctp (1) [SACK] [cum ack 217114041] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:41.283257 sctp (1) [SACK] [cum ack 217114042] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:41.457217 sctp (1) [SACK] [cum ack 217114043] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:41.691528 sctp (1) [SACK] [cum ack 217114044] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:41.849636 sctp (1) [SACK] [cum ack 217114045] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:41.975473 sctp (1) [DATA] (B)(E) [TSN: 217114046] [SID: 0] [SSEQ 6] [PPID 0x87d8b423]
> > 02:19:42.021229 sctp (1) [SACK] [cum ack 217114046] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:42.196495 sctp (1) [SACK] [cum ack 217114047] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:42.424319 sctp (1) [SACK] [cum ack 217114048] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:42.586924 sctp (1) [SACK] [cum ack 217114049] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:42.744810 sctp (1) [SACK] [cum ack 217114050] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:42.965536 sctp (1) [SACK] [cum ack 217114046] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:43.106385 sctp (1) [DATA] (B)(E) [TSN: 217114046] [SID: 0] [SSEQ 6] [PPID 0x87d8b423]
> > 02:19:43.218969 sctp (1) [SACK] [cum ack 217114046] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:45.374101 sctp (1) [DATA] (B)(E) [TSN: 217114046] [SID: 0] [SSEQ 6] [PPID 0x87d8b423]
> > 02:19:45.489258 sctp (1) [SACK] [cum ack 217114046] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:49.830116 sctp (1) [DATA] (B)(E) [TSN: 217114046] [SID: 0] [SSEQ 6] [PPID 0x87d8b423]
> > 02:19:49.984577 sctp (1) [SACK] [cum ack 217114046] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> > 02:19:58.760300 sctp (1) [DATA] (B)(E) [TSN: 217114046] [SID: 0] [SSEQ 6] [PPID 0x87d8b423]
> > 02:19:58.931690 sctp (1) [SACK] [cum ack 217114046] [a_rwnd 54784] [#gap acks 0] [#dup tsns 0]
> >
> >
> > Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
> >
> > --- net/sctp/sm_statefuns.c.orig 2007-07-29 18:11:01.000000000 -0400
> > +++ net/sctp/sm_statefuns.c 2007-07-29 18:14:49.000000000 -0400
> > @@ -2880,6 +2880,15 @@ sctp_disposition_t sctp_sf_eat_sack_6_2(
> > return SCTP_DISPOSITION_DISCARD;
> > }
> >
> > + /* If Cumulative TSN Ack is not less than the next TSN that
> > + * will be sent with new data, drop the SACK.
> > + */
> > + if (!TSN_lt(ctsn, asoc->next_tsn)) {
> > + SCTP_DEBUG_PRINTK("ctsn %x\n", ctsn);
> > + SCTP_DEBUG_PRINTK("next_tsn %x\n", asoc->next_tsn);
> > + return SCTP_DISPOSITION_DISCARD;
> > + }
> > +
> > /* Return this SACK for further processing. */
> > sctp_add_cmd_sf(commands, SCTP_CMD_PROCESS_SACK, SCTP_SACKH(sackh));
> >
> >
> >
> What's the behavior on this in the event that a sack is received in which the
> ctsn falls within a missing space in a stream of gap acks? I.e. what if the
> sack being sent falls into a hole between the ack point and the first gap ack
> range? Does this patch impact that at all?
>
> Also, what is this:
> 02:19:40.700584 sctp (1) [SACK] [cum ack 290855864] ....
>
> That ack value seems rather out of range for the rest of the trace. Was that
> part of your test? If so, what caused it?
Yes. This SACK seems to be totally out of range and may be causing the problem.
I would expect the following check in sctp_sf_eat_sack_6_2() to drop any SACKs
with CTSN value lower than the earlier SACKs.
/* i) If Cumulative TSN Ack is less than the Cumulative TSN
* Ack Point, then drop the SACK. Since Cumulative TSN
* Ack is monotonically increasing, a SACK whose
* Cumulative TSN Ack is less than the Cumulative TSN Ack
* Point indicates an out-of-order SACK.
*/
if (TSN_lt(ctsn, asoc->ctsn_ack_point)) {
SCTP_DEBUG_PRINTK("ctsn %x\n", ctsn);
SCTP_DEBUG_PRINTK("ctsn_ack_point %x\n", asoc->ctsn_ack_point);
return SCTP_DISPOSITION_DISCARD;
}
Thanks
Sridhar
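
A small userspace sketch may help show why the existing check lets the odd SACK through. It uses serial-number arithmetic on 32-bit TSNs; `tsn_lt()` here is my own formulation and the kernel's TSN_lt() may be written differently. The numbers come from the trace above, with next_tsn assumed to be 217114046 at the time the bogus SACK arrives (TSNs up to 217114045 had been sent).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Serial-number "less than" on 32-bit TSNs: s is before t when the
 * wrapped difference s - t has its top bit set. Sketch only; not the
 * kernel macro.
 */
static int tsn_lt(uint32_t s, uint32_t t)
{
	return ((s - t) & 0x80000000u) != 0;
}

/* Existing check: drop a SACK whose ctsn is behind the ack point. */
static int drop_old_sack(uint32_t ctsn, uint32_t ctsn_ack_point)
{
	return tsn_lt(ctsn, ctsn_ack_point);
}

/* Proposed check: drop a SACK that acks a TSN not yet sent. */
static int drop_future_sack(uint32_t ctsn, uint32_t next_tsn)
{
	return !tsn_lt(ctsn, next_tsn);
}
```

With the ack point at 217114040, ctsn 290855864 is not "less than" it, so the existing check accepts the SACK. Once the ack point has jumped to 290855864, the later correct SACK for 217114041 compares as older and is discarded. The proposed check against next_tsn rejects the bogus SACK up front.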
* Distributed storage.
From: Evgeniy Polyakov @ 2007-07-31 17:13 UTC
To: netdev; +Cc: linux-kernel, linux-fsdevel
Hi.
I'm pleased to announce the first release of the distributed storage
subsystem, which allows forming a storage on top of remote and local
nodes, which in turn can be exported to another storage as a node to
form tree-like storages.
There are a number of main features this device supports:
* zero additional allocations in the common fast path (only one per node if
network queue is full), not counting network allocations
* zero-copy sending (except header) if supported by device using sendpage()
* ability to use any implemented algorithm (linear algo implemented)
* pluggable mapping algorithms
* failover recovery in case of broken link (reconnection if remote node
is down)
* ability to suspend remote node for maintenance without breaking dataflow
to other nodes (if supported by algorithm and block layer) and
without turning down main node
* initial autoconfiguration (ability to request remote node size and use
that dynamic data during array setup time)
* non-blocking network data processing (except headers, which are
sent/received in blocking mode; this can be changed to non-blocking
too by increasing request size to store state) without busy loops
checking return value of processing functions. Non-blocking data
processing is based on ->poll() state machine with only one working
thread per storage.
* support for any kind of network media (not limited to tcp or inet
protocols) above the MAC layer (socket layer); data consistency must be
part of the protocol (i.e. will lose data with UDP in favour of
performance)
* no need for any special tools for data processing (like special
userspace applications) except for configuration
* userspace and kernelspace targets. Userspace target can work on top of
usual files. (Windows or any other OS userspace target support can be
trivially added on request)
Compared to other similar approaches, namely iSCSI and NBD,
there are the following advantages:
* non-blocking processing without busy loops (compared to both above)
* small, pluggable architecture
* failover recovery (reconnect to remote target)
* autoconfiguration (full absence in NBD and/or device mapper on top of it)
* no additional allocations (not including network part) - at least two in
device mapper for fast path
* very simple - try to compare with iSCSI
* works with different network protocols
* storage can be formed on top of remote nodes and be exported
simultaneously (iSCSI is peer-to-peer only, NBD requires device
mapper and is synchronous)
TODO list currently includes the following main items:
* redundancy algorithm (drop me a request of your own, but it is highly
unlikely that a Reed-Solomon based one will ever be used - it is too slow
for distributed RAID, I am considering WEAVER codes)
* extended autoconfiguration
* move away from ioctl based configuration
Patch, userspace configuration utility and userspace target can be found
on project homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst
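
As a rough illustration of what the linear algorithm does, here is a simplified userspace model. All names (`node_model`, `storage_model`, `model_add_node`, `model_find`, `model_fit`) are invented for this sketch, and a linear scan replaces the tree search used in the patch. Nodes are laid out back to back, a sector is mapped to the node that contains it, and a request that crosses a node boundary is cut at that boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8

/* Hypothetical, simplified stand-ins for dst_node and dst_storage. */
struct node_model {
	uint64_t start; /* first sector of this node within the storage */
	uint64_t size;  /* node size in sectors */
};

struct storage_model {
	struct node_model nodes[MAX_NODES];
	size_t count;
	uint64_t disk_size; /* total sectors, sum of node sizes */
};

/* Same bookkeeping as dst_linear_add_node(): append at the current end. */
static int model_add_node(struct storage_model *st, uint64_t size)
{
	if (st->count == MAX_NODES)
		return -1;
	st->nodes[st->count].start = st->disk_size;
	st->nodes[st->count].size = size;
	st->disk_size += size;
	st->count++;
	return 0;
}

/* Linear scan in place of the tree search; returns NULL if out of range. */
static const struct node_model *model_find(const struct storage_model *st,
					   uint64_t sector)
{
	size_t i;

	for (i = 0; i < st->count; i++) {
		const struct node_model *n = &st->nodes[i];

		if (sector >= n->start && sector < n->start + n->size)
			return n;
	}
	return NULL;
}

/*
 * How many of `sectors` starting at `sector` fit in node n before the
 * request has to be split and continued on the next node.
 */
static uint64_t model_fit(const struct node_model *n, uint64_t sector,
			  uint64_t sectors)
{
	uint64_t rest = n->start + n->size - sector;

	return sectors < rest ? sectors : rest;
}
```

With two 800-sector nodes, a 20-sector request at sector 790 puts 10 sectors on the first node and the remaining 10 at offset 0 of the second.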
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
drivers/block/Kconfig | 2 +
drivers/block/Makefile | 1 +
drivers/block/dst/Kconfig | 12 +
drivers/block/dst/Makefile | 5 +
drivers/block/dst/alg_linear.c | 348 ++++++++++
drivers/block/dst/dcore.c | 1222 ++++++++++++++++++++++++++++++++++
drivers/block/dst/kst.c | 1437 ++++++++++++++++++++++++++++++++++++++++
include/linux/dst.h | 282 ++++++++
8 files changed, 3309 insertions(+), 0 deletions(-)
diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
+source "drivers/block/dst/Kconfig"
+
source "drivers/s390/block/Kconfig"
endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
obj-$(CONFIG_BLK_DEV_SX8) += sx8.o
obj-$(CONFIG_BLK_DEV_UB) += ub.o
+obj-$(CONFIG_DST) += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 0000000..874d2e4
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,12 @@
+config DST
+ tristate "Distributed storage"
+ depends on NET
+ ---help---
+ This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+ tristate "Linear distribution algorithm"
+ depends on DST
+ ---help---
+ This module allows to create linear mapping of the nodes
+ in the distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 0000000..48b7777
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 0000000..9a134fc
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,348 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/dst.h>
+
+static struct dst_alg *alg_linear;
+static struct bio_set *dst_linear_bio_set;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+ struct dst_storage *st = n->st;
+
+ n->start = st->disk_size;
+ st->disk_size += n->size;
+
+ return 0;
+}
+
+/*
+ * Internal callback for local requests (i.e. for local disk),
+ * which are split between nodes (part with local node destination
+ * ends up with this ->bi_end_io() callback).
+ */
+static int dst_linear_end_io(struct bio *bio, unsigned int size, int err)
+{
+ struct bio *orig_bio = bio->bi_private;
+
+ if (err)
+ printk("%s: bio: %p, orig_bio: %p, size: %u, orig_size: %u.\n",
+ __func__, bio, orig_bio, size, orig_bio->bi_size);
+
+ bio_endio(orig_bio, size, 0);
+ bio_put(bio);
+ return 0;
+}
+
+static void dst_linear_destructor(struct bio *bio)
+{
+ bio_free(bio, dst_linear_bio_set);
+}
+
+/*
+ * This function sends processing request down to block layer (for local node)
+ * or to network state machine (for remote node).
+ */
+static int dst_linear_node_push(struct dst_request *req)
+{
+ int err = 0;
+
+ if (req->state->node->bdev) {
+ struct bio *bio = req->bio;
+
+ dprintk("%s: start: %llu, num: %d, idx: %d, offset: %u, "
+ "size: %llu, bi_idx: %d, bi_vcnt: %d.\n",
+ __func__, req->start, req->num, req->idx,
+ req->offset, req->size, bio->bi_idx, bio->bi_vcnt);
+
+ if (likely(bio->bi_idx == req->idx &&
+ bio->bi_vcnt == req->num)) {
+ bio->bi_bdev = req->state->node->bdev;
+ bio->bi_sector = req->start;
+ generic_make_request(bio);
+ goto out_put;
+ } else {
+ struct bio *clone = bio_alloc_bioset(GFP_NOIO,
+ bio->bi_max_vecs, dst_linear_bio_set);
+ struct bio_vec *bv;
+
+ err = -ENOMEM;
+ if (!clone)
+ goto out_put;
+
+ dprintk("%s: start: %llu, num: %d, idx: %d, "
+ "offset: %u, size: %llu, "
+ "bi_idx: %d, bi_vcnt: %d.\n",
+ __func__, req->start, req->num, req->idx,
+ req->offset, req->size,
+ bio->bi_idx, bio->bi_vcnt);
+
+ __bio_clone(clone, bio);
+
+ bv = bio_iovec_idx(clone, req->idx);
+ bv->bv_offset += req->offset;
+ clone->bi_idx = req->idx;
+ clone->bi_vcnt = req->num;
+ clone->bi_bdev = req->state->node->bdev;
+ clone->bi_sector = req->start;
+ clone->bi_destructor = dst_linear_destructor;
+ clone->bi_private = bio;
+ clone->bi_size = req->orig_size;
+ clone->bi_end_io = &dst_linear_end_io;
+
+ generic_make_request(clone);
+ err = 0;
+ goto out_put;
+ }
+ }
+
+ err = req->state->node->state->ops->push(req);
+
+out_put:
+ dst_node_put(req->state->node);
+ return err;
+}
+
+/*
+ * This callback is invoked from block layer request processing function,
+ * its task is to remap block request to different nodes.
+ */
+static int dst_linear_remap(struct dst_storage *st, struct bio *bio)
+{
+ struct dst_node *n;
+ int err = -EINVAL, i, cnt;
+ unsigned int bio_sectors = bio->bi_size>>9;
+ struct bio_vec *bv;
+ struct dst_request req;
+ u64 rest_in_node, start, total_size;
+
+ mutex_lock(&st->tree_lock);
+ n = dst_storage_tree_search(st, bio->bi_sector);
+ mutex_unlock(&st->tree_lock);
+
+ if (!n) {
+ dprintk("%s: failed to find a node for bio: %p, "
+ "sector: %llu.\n",
+ __func__, bio, bio->bi_sector);
+ return -ENODEV;
+ }
+
+ dprintk("%s: bio: %llu-%llu, dev: %llu-%llu, in sectors.\n",
+ __func__, bio->bi_sector, bio->bi_sector+bio_sectors,
+ n->start, n->start+n->size);
+
+ memset(&req, 0, sizeof(struct dst_request));
+
+ start = bio->bi_sector;
+ total_size = bio->bi_size;
+
+ req.flags = (test_bit(DST_NODE_FROZEN, &n->flags))?
+ DST_REQ_ALWAYS_QUEUE:0;
+ req.start = start - n->start;
+ req.offset = 0;
+ req.state = n->state;
+ req.bio = bio;
+
+ req.size = bio->bi_size;
+ req.orig_size = bio->bi_size;
+ req.idx = 0;
+ req.num = bio->bi_vcnt;
+
+ /*
+ * Common fast path - block request does not cross
+ * boundaries between nodes.
+ */
+ if (likely(bio->bi_sector + bio_sectors <= n->start + n->size))
+ return dst_linear_node_push(&req);
+
+ req.size = 0;
+ req.idx = 0;
+ req.num = 1;
+
+ cnt = bio->bi_vcnt;
+
+ rest_in_node = to_bytes(n->size - req.start);
+
+ for (i=0; i<cnt; ++i) {
+ bv = bio_iovec_idx(bio, i);
+
+ if (req.size + bv->bv_len >= rest_in_node) {
+ unsigned int diff = req.size + bv->bv_len -
+ rest_in_node;
+
+ req.size += bv->bv_len - diff;
+ req.start = start - n->start;
+ req.orig_size = req.size;
+
+ dprintk("%s: split: start: %llu/%llu, size: %llu, "
+ "total_size: %llu, diff: %u, idx: %d, "
+ "num: %d, bv_len: %u, bv_offset: %u.\n",
+ __func__, start, req.start, req.size,
+ total_size, diff, req.idx, req.num,
+ bv->bv_len, bv->bv_offset);
+
+ err = dst_linear_node_push(&req);
+ if (err)
+ break;
+
+ total_size -= req.orig_size;
+
+ if (!total_size)
+ break;
+
+ start += to_sector(req.orig_size);
+
+ req.flags = (test_bit(DST_NODE_FROZEN, &n->flags))?
+ DST_REQ_ALWAYS_QUEUE:0;
+ req.orig_size = req.size = diff;
+
+ if (diff) {
+ req.offset = bv->bv_len - diff;
+ req.idx = req.num - 1;
+ } else {
+ req.idx = req.num;
+ req.offset = 0;
+ }
+
+ dprintk("%s: next: start: %llu, size: %llu, "
+ "total_size: %llu, diff: %u, idx: %d, "
+ "num: %d, offset: %u, bv_len: %u, "
+ "bv_offset: %u.\n",
+ __func__, start, req.size, total_size, diff,
+ req.idx, req.num, req.offset,
+ bv->bv_len, bv->bv_offset);
+
+ mutex_lock(&st->tree_lock);
+ n = dst_storage_tree_search(st, start);
+ mutex_unlock(&st->tree_lock);
+
+ if (!n) {
+ err = -ENODEV;
+ dprintk("%s: failed to find a split node for "
+ "bio: %p, sector: %llu, start: %llu.\n",
+ __func__, bio, bio->bi_sector,
+ req.start);
+ break;
+ }
+
+ req.state = n->state;
+ req.start = start - n->start;
+ rest_in_node = to_bytes(n->size - req.start);
+
+ dprintk("%s: req.start: %llu, start: %llu, "
+ "dev_start: %llu, dev_size: %llu, "
+ "rest_in_node: %llu.\n",
+ __func__, req.start, start, n->start,
+ n->size, rest_in_node);
+ } else {
+ req.size += bv->bv_len;
+ req.num++;
+ }
+ }
+
+ dprintk("%s: last request: start: %llu, size: %llu, "
+ "total_size: %llu.\n", __func__,
+ req.start, req.size, total_size);
+ if (total_size) {
+ req.orig_size = req.size;
+
+ dprintk("%s: last: start: %llu/%llu, size: %llu, "
+ "total_size: %llu, idx: %d, num: %d.\n",
+ __func__, start, req.start, req.size,
+ total_size, req.idx, req.num);
+
+ err = dst_linear_node_push(&req);
+ if (!err) {
+ total_size -= req.orig_size;
+
+ BUG_ON(total_size != 0);
+ }
+
+ }
+
+ dprintk("%s: end bio: %p, err: %d.\n", __func__, bio, err);
+ return err;
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+ if (!err)
+ return 0;
+
+ if (err == -ECONNRESET || err == -EPIPE) {
+ if (st->ops->recovery(st, err)) {
+ err = st->ops->recovery(st, err);
+ if (err) {
+ set_bit(DST_NODE_FROZEN, &st->node->flags);
+ } else {
+ clear_bit(DST_NODE_FROZEN, &st->node->flags);
+ }
+ err = 0;
+ }
+ }
+
+ return err;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+ .remap = dst_linear_remap,
+ .add_node = dst_linear_add_node,
+ .del_node = dst_linear_del_node,
+ .error = dst_linear_error,
+ .owner = THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+ dst_linear_bio_set = bioset_create(32, 32);
+ if (!dst_linear_bio_set)
+ panic("bio: can't allocate bios\n");
+
+ alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+ if (!alg_linear)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+ dst_remove_alg(alg_linear);
+ bioset_free(dst_linear_bio_set);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 0000000..fd11f86
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1222 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/miscdevice.h>
+#include <linux/socket.h>
+#include <linux/dst.h>
+#include <linux/device.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/buffer_head.h>
+
+#include <net/sock.h>
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+
+struct kmem_cache *dst_request_cache;
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+ return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+ .name = "dst",
+ .match = &dst_dev_match,
+};
+
+static struct device dst_dev = {
+ .bus = &dst_dev_bus_type,
+ .release = &dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+ .release = &dst_node_release
+};
+
+/*
+ * Distributed storage request processing function.
+ * It calls algorithm specific remapping code only.
+ */
+static int dst_request(request_queue_t *q, struct bio *bio)
+{
+ struct dst_storage *st = q->queuedata;
+ int err;
+
+ dprintk("\n%s: start: st: %p, bio: %p, cnt: %u.\n",
+ __func__, st, bio, bio->bi_vcnt);
+
+ err = st->alg->ops->remap(st, bio);
+
+ dprintk("%s: end: st: %p, bio: %p, err: %d.\n",
+ __func__, st, bio, err);
+
+ if (err) {
+ printk("%s: remap failed: bio: %p, err: %d.\n",
+ __func__, bio, err);
+ bio_endio(bio, bio->bi_size, -EIO);
+ }
+ return 0;
+}
+
+static void dst_unplug(request_queue_t *q)
+{
+}
+
+static int dst_flush(request_queue_t *q, struct gendisk *disk, sector_t *sec)
+{
+ return 0;
+}
+
+static struct block_device_operations dst_blk_ops = {
+ .owner = THIS_MODULE,
+};
+
+/*
+ * Block layer binding - disk is created when array is fully configured
+ * by userspace request.
+ */
+static int dst_create_disk(struct dst_storage *st)
+{
+ int err;
+
+ err = -ENOMEM;
+ st->queue = blk_alloc_queue(GFP_KERNEL);
+ if (!st->queue)
+ goto err_out_exit;
+
+ st->queue->queuedata = st;
+ blk_queue_make_request(st->queue, dst_request);
+ blk_queue_bounce_limit(st->queue, BLK_BOUNCE_ANY);
+ st->queue->unplug_fn = dst_unplug;
+ st->queue->issue_flush_fn = dst_flush;
+
+ err = -EINVAL;
+ st->disk = alloc_disk(1);
+ if (!st->disk)
+ goto err_out_free_queue;
+
+ st->disk->major = dst_major;
+ st->disk->first_minor = 0;
+ st->disk->fops = &dst_blk_ops;
+ st->disk->queue = st->queue;
+ st->disk->private_data = st;
+ snprintf(st->disk->disk_name, sizeof(st->disk->disk_name),
+ "dst-%s-%d", st->name, st->disk->first_minor);
+
+ return 0;
+
+err_out_free_queue:
+ blk_cleanup_queue(st->queue);
+err_out_exit:
+ return err;
+}
+
+static void dst_remove_disk(struct dst_storage *st)
+{
+ del_gendisk(st->disk);
+ put_disk(st->disk);
+ blk_cleanup_queue(st->queue);
+}
+
+/*
+ * Shows node name in sysfs.
+ */
+static ssize_t dst_name_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dst_storage *st = container_of(dev, struct dst_storage, device);
+
+ return sprintf(buf, "%s\n", st->name);
+}
+
+static void dst_remove_all_nodes(struct dst_storage *st)
+{
+ struct dst_node *n;
+ struct rb_node *rb_node;
+
+ mutex_lock(&st->tree_lock);
+ while ((rb_node = rb_first(&st->tree_root)) != NULL) {
+ n = rb_entry(rb_node, struct dst_node, tree_node);
+ dprintk("%s: n: %p, start: %llu, size: %llu.\n",
+ __func__, n, n->start, n->size);
+ rb_erase(&n->tree_node, &st->tree_root);
+ dst_node_put(n);
+ }
+ mutex_unlock(&st->tree_lock);
+}
+
+/*
+ * Shows node layout in sysfs.
+ */
+static ssize_t dst_nodes_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dst_storage *st = container_of(dev, struct dst_storage, device);
+ int size = PAGE_CACHE_SIZE, sz;
+ struct dst_node *n;
+ struct rb_node *rb_node;
+
+ sz = sprintf(buf, "sectors (start [size]): ");
+ size -= sz;
+ buf += sz;
+
+ mutex_lock(&st->tree_lock);
+ for (rb_node = rb_first(&st->tree_root); rb_node;
+ rb_node = rb_next(rb_node)) {
+ n = rb_entry(rb_node, struct dst_node, tree_node);
+ if (size < 32)
+ break;
+ sz = sprintf(buf, "%llu [%llu]", n->start, n->size);
+ buf += sz;
+ size -= sz;
+
+ if (!rb_next(rb_node))
+ break;
+
+ sz = sprintf(buf, " | ");
+ buf += sz;
+ size -= sz;
+ }
+ mutex_unlock(&st->tree_lock);
+ size -= sprintf(buf, "\n");
+ return PAGE_CACHE_SIZE - size;
+}
+
+/*
+ * Algorithm currently being used by given storage.
+ */
+static ssize_t dst_alg_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dst_storage *st = container_of(dev, struct dst_storage, device);
+ return sprintf(buf, "%s\n", st->alg->name);
+}
+
+/*
+ * Writing to this sysfs file removes all nodes
+ * and the storage itself automatically.
+ */
+static ssize_t dst_remove_nodes(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct dst_storage *st = container_of(dev, struct dst_storage, device);
+ dst_remove_all_nodes(st);
+ return count;
+}
+
+static DEVICE_ATTR(name, 0444, dst_name_show, NULL);
+static DEVICE_ATTR(nodes, 0444, dst_nodes_show, NULL);
+static DEVICE_ATTR(alg, 0444, dst_alg_show, NULL);
+static DEVICE_ATTR(remove_all_nodes, 0644, NULL, dst_remove_nodes);
+
+static int dst_create_storage_attributes(struct dst_storage *st)
+{
+ int err;
+
+ err = device_create_file(&st->device, &dev_attr_name);
+ err = device_create_file(&st->device, &dev_attr_nodes);
+ err = device_create_file(&st->device, &dev_attr_alg);
+ err = device_create_file(&st->device, &dev_attr_remove_all_nodes);
+ return 0;
+}
+
+static void dst_remove_storage_attributes(struct dst_storage *st)
+{
+ device_remove_file(&st->device, &dev_attr_name);
+ device_remove_file(&st->device, &dev_attr_nodes);
+ device_remove_file(&st->device, &dev_attr_alg);
+ device_remove_file(&st->device, &dev_attr_remove_all_nodes);
+}
+
+static void dst_storage_sysfs_exit(struct dst_storage *st)
+{
+ dst_remove_storage_attributes(st);
+ device_unregister(&st->device);
+}
+
+static int dst_storage_sysfs_init(struct dst_storage *st)
+{
+ int err;
+
+ memcpy(&st->device, &dst_dev, sizeof(struct device));
+ snprintf(st->device.bus_id, sizeof(st->device.bus_id), "%s", st->name);
+
+ err = device_register(&st->device);
+ if (err) {
+ dprintk(KERN_ERR "Failed to register dst device %s, err: %d.\n",
+ st->name, err);
+ goto err_out_exit;
+ }
+
+ dst_create_storage_attributes(st);
+
+ return 0;
+
+err_out_exit:
+ return err;
+}
+
+/*
+ * These functions show size and start of the appropriate node.
+ * Both are in sectors.
+ */
+static ssize_t dst_show_start(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dst_node *n = container_of(dev, struct dst_node, device);
+
+ return sprintf(buf, "%llu\n", n->start);
+}
+
+static ssize_t dst_show_size(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dst_node *n = container_of(dev, struct dst_node, device);
+
+ return sprintf(buf, "%llu\n", n->size);
+}
+
+/*
+ * Shows type of the node - device major/minor number
+ * for local nodes and address (af_inet ipv4/ipv6 only) for remote nodes.
+ */
+static ssize_t dst_show_type(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dst_node *n = container_of(dev, struct dst_node, device);
+ struct sockaddr addr;
+ struct socket *sock;
+ int addrlen;
+
+ if (!n->state && !n->bdev)
+ return 0;
+
+ if (n->bdev)
+ return sprintf(buf, "L: %d:%d\n",
+ MAJOR(n->bdev->bd_dev), MINOR(n->bdev->bd_dev));
+
+ sock = n->state->socket;
+ if (sock->ops->getname(sock, &addr, &addrlen, 2))
+ return 0;
+
+ if (sock->ops->family == AF_INET) {
+ struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+ return sprintf(buf, "R: %u.%u.%u.%u:%d\n",
+ NIPQUAD(sin->sin_addr.s_addr), ntohs(sin->sin_port));
+ } else if (sock->ops->family == AF_INET6) {
+ struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+ return sprintf(buf,
+ "R: %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d\n",
+ NIP6(sin->sin6_addr), ntohs(sin->sin6_port));
+ }
+ return 0;
+}
+
+static DEVICE_ATTR(start, 0444, dst_show_start, NULL);
+static DEVICE_ATTR(size, 0444, dst_show_size, NULL);
+static DEVICE_ATTR(type, 0444, dst_show_type, NULL);
+
+static int dst_create_node_attributes(struct dst_node *n)
+{
+ int err;
+
+ err = device_create_file(&n->device, &dev_attr_start);
+ err = device_create_file(&n->device, &dev_attr_size);
+ err = device_create_file(&n->device, &dev_attr_type);
+ return 0;
+}
+
+static void dst_remove_node_attributes(struct dst_node *n)
+{
+ device_remove_file(&n->device, &dev_attr_start);
+ device_remove_file(&n->device, &dev_attr_size);
+ device_remove_file(&n->device, &dev_attr_type);
+}
+
+static void dst_node_sysfs_exit(struct dst_node *n)
+{
+ if (n->device.parent == &n->st->device) {
+ dst_remove_node_attributes(n);
+ device_unregister(&n->device);
+ n->device.parent = NULL;
+ }
+}
+
+static int dst_node_sysfs_init(struct dst_node *n)
+{
+ int err;
+
+ memcpy(&n->device, &dst_node_dev, sizeof(struct device));
+
+ n->device.parent = &n->st->device;
+
+ snprintf(n->device.bus_id, sizeof(n->device.bus_id),
+ "n-%llu", n->start);
+ err = device_register(&n->device);
+ if (err) {
+ dprintk(KERN_ERR "Failed to register node, err: %d.\n", err);
+ goto err_out_exit;
+ }
+
+ dst_create_node_attributes(n);
+
+ return 0;
+
+err_out_exit:
+ return err;
+}
+
+/*
+ * Gets a reference for given storage, if
+ * storage with given name and algorithm being used
+ * does not exist it is created.
+ */
+static struct dst_storage *dst_get_storage(char *name, char *aname, int alloc)
+{
+ struct dst_storage *st, *rst = NULL;
+ int err;
+ struct dst_alg *alg;
+
+ mutex_lock(&dst_storage_lock);
+ list_for_each_entry(st, &dst_storage_list, entry) {
+ if (!strcmp(name, st->name) && !strcmp(st->alg->name, aname)) {
+ rst = st;
+ atomic_inc(&st->refcnt);
+ break;
+ }
+ }
+ mutex_unlock(&dst_storage_lock);
+
+ if (rst || !alloc)
+ return rst;
+
+ st = kzalloc(sizeof(struct dst_storage), GFP_KERNEL);
+ if (!st)
+ return NULL;
+
+ mutex_init(&st->tree_lock);
+ /*
+ * One for storage itself,
+ * another one for attached node below.
+ */
+ atomic_set(&st->refcnt, 2);
+ snprintf(st->name, DST_NAMELEN, "%s", name);
+ st->tree_root.rb_node = NULL;
+
+ err = dst_storage_sysfs_init(st);
+ if (err)
+ goto err_out_free;
+
+ err = dst_create_disk(st);
+ if (err)
+ goto err_out_sysfs_exit;
+
+ mutex_lock(&dst_alg_lock);
+ list_for_each_entry(alg, &dst_alg_list, entry) {
+ if (!strcmp(alg->name, aname)) {
+ atomic_inc(&alg->refcnt);
+ try_module_get(alg->ops->owner);
+ st->alg = alg;
+ break;
+ }
+ }
+ mutex_unlock(&dst_alg_lock);
+
+ if (!st->alg)
+ goto err_out_disk_remove;
+
+ mutex_lock(&dst_storage_lock);
+ list_add_tail(&st->entry, &dst_storage_list);
+ mutex_unlock(&dst_storage_lock);
+
+ return st;
+
+err_out_disk_remove:
+ dst_remove_disk(st);
+err_out_sysfs_exit:
+ dst_storage_sysfs_exit(st);
+err_out_free:
+ kfree(st);
+ return NULL;
+}
+
+/*
+ * Allows external modules to allocate and add a new algorithm.
+ */
+struct dst_alg *dst_alloc_alg(char *name, struct dst_alg_ops *ops)
+{
+ struct dst_alg *alg;
+
+ alg = kzalloc(sizeof(struct dst_alg), GFP_KERNEL);
+ if (!alg)
+ return NULL;
+ snprintf(alg->name, DST_NAMELEN, "%s", name);
+ atomic_set(&alg->refcnt, 1);
+ alg->ops = ops;
+
+ mutex_lock(&dst_alg_lock);
+ list_add_tail(&alg->entry, &dst_alg_list);
+ mutex_unlock(&dst_alg_lock);
+
+ return alg;
+}
+EXPORT_SYMBOL_GPL(dst_alloc_alg);
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+ dprintk("%s: alg: %p.\n", __func__, alg);
+ kfree(alg);
+}
+
+/*
+ * Algorithm is never freed directly,
+ * since its module reference counter is increased
+ * by storage when it is created - just like network protocols.
+ */
+static inline void dst_put_alg(struct dst_alg *alg)
+{
+ dprintk("%s: alg: %p, refcnt: %d.\n",
+ __func__, alg, atomic_read(&alg->refcnt));
+ module_put(alg->ops->owner);
+ if (atomic_dec_and_test(&alg->refcnt))
+ dst_free_alg(alg);
+}
+
+/*
+ * Removing algorithm from main list of supported algorithms.
+ */
+void dst_remove_alg(struct dst_alg *alg)
+{
+ mutex_lock(&dst_alg_lock);
+ list_del_init(&alg->entry);
+ mutex_unlock(&dst_alg_lock);
+
+ dst_put_alg(alg);
+}
+
+EXPORT_SYMBOL_GPL(dst_remove_alg);
+
+static void dst_cleanup_node(struct dst_node *n)
+{
+ dprintk("%s: node: %p.\n", __func__, n);
+ n->st->alg->ops->del_node(n);
+ if (n->cleanup)
+ n->cleanup(n);
+ dst_node_sysfs_exit(n);
+ kfree(n);
+}
+
+static void dst_free_storage(struct dst_storage *st)
+{
+ dprintk("%s: st: %p.\n", __func__, st);
+
+ BUG_ON(rb_first(&st->tree_root) != NULL);
+
+ dst_put_alg(st->alg);
+ kfree(st);
+}
+
+static inline void dst_put_storage(struct dst_storage *st)
+{
+ dprintk("%s: st: %p, refcnt: %d.\n",
+ __func__, st, atomic_read(&st->refcnt));
+ if (atomic_dec_and_test(&st->refcnt))
+ dst_free_storage(st);
+}
+
+void dst_node_put(struct dst_node *n)
+{
+ dprintk("%s: node: %p, start: %llu, size: %llu, refcnt: %d.\n",
+ __func__, n, n->start, n->size,
+ atomic_read(&n->refcnt));
+
+ if (atomic_dec_and_test(&n->refcnt)) {
+ struct dst_storage *st = n->st;
+
+ dprintk("%s: freeing node: %p, start: %llu, size: %llu, "
+ "refcnt: %d.\n",
+ __func__, n, n->start, n->size,
+ atomic_read(&n->refcnt));
+
+ dst_cleanup_node(n);
+ dst_put_storage(st);
+ }
+}
+EXPORT_SYMBOL_GPL(dst_node_put);
+
+static inline int dst_compare_id(struct dst_node *old, u64 new)
+{
+ if (old->start + old->size <= new)
+ return 1;
+ if (old->start > new)
+ return -1;
+ return 0;
+}
+
+/*
+ * Tree of the nodes, which form the storage.
+ * Tree is indexed via start of the node and its size.
+ * Comparison function above.
+ */
+struct dst_node *dst_storage_tree_search(struct dst_storage *st, u64 start)
+{
+ struct rb_node *n = st->tree_root.rb_node;
+ struct dst_node *dn;
+ int cmp;
+
+ while (n) {
+ dn = rb_entry(n, struct dst_node, tree_node);
+
+ cmp = dst_compare_id(dn, start);
+ dprintk("%s: tree: %llu-%llu, new: %llu.\n",
+ __func__, dn->start, dn->start+dn->size, start);
+ if (cmp < 0)
+ n = n->rb_left;
+ else if (cmp > 0)
+ n = n->rb_right;
+ else {
+ atomic_inc(&dn->refcnt);
+ return dn;
+ }
+ }
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(dst_storage_tree_search);
+
+/*
+ * This function removes a node with given start address
+ * from the storage.
+ */
+static struct dst_node *dst_storage_tree_del(struct dst_storage *st, u64 start)
+{
+ struct dst_node *n = dst_storage_tree_search(st, start);
+
+ if (!n)
+ return NULL;
+
+ rb_erase(&n->tree_node, &st->tree_root);
+ dst_node_put(n);
+ return n;
+}
+
+/*
+ * This function adds given node to the storage.
+ * Returns -EEXIST if the same area is already covered by another node.
+ * This return value must be checked for redundancy algorithms.
+ */
+static int dst_storage_tree_add(struct dst_node *new, struct dst_storage *st)
+{
+ struct rb_node **n = &st->tree_root.rb_node, *parent = NULL;
+ struct dst_node *dn;
+ int cmp;
+
+ while (*n) {
+ parent = *n;
+ dn = rb_entry(parent, struct dst_node, tree_node);
+
+ cmp = dst_compare_id(dn, new->start);
+ dprintk("%s: tree: %llu-%llu, new: %llu.\n",
+ __func__, dn->start, dn->start+dn->size,
+ new->start);
+ if (cmp < 0)
+ n = &parent->rb_left;
+ else if (cmp > 0)
+ n = &parent->rb_right;
+ else
+ return -EEXIST;
+ }
+
+ rb_link_node(&new->tree_node, parent, n);
+ rb_insert_color(&new->tree_node, &st->tree_root);
+
+ return 0;
+}
+
+/*
+ * This function finds device's major/minor numbers for given pathname.
+ */
+static int dst_lookup_device(const char *path, dev_t *dev)
+{
+ int err;
+ struct nameidata nd;
+ struct inode *inode;
+
+ err = path_lookup(path, LOOKUP_FOLLOW, &nd);
+ if (err)
+ return err;
+
+ inode = nd.dentry->d_inode;
+ if (!inode) {
+ err = -ENOENT;
+ goto out;
+ }
+
+ if (!S_ISBLK(inode->i_mode)) {
+ err = -ENOTBLK;
+ goto out;
+ }
+
+ *dev = inode->i_rdev;
+
+out:
+ path_release(&nd);
+ return err;
+}
+
+/*
+ * Cleanup routines for local, local exporting and remote nodes.
+ */
+static void dst_cleanup_remote(struct dst_node *n)
+{
+ if (n->state) {
+ kst_state_exit(n->state);
+ n->state = NULL;
+ }
+}
+
+static void dst_cleanup_local(struct dst_node *n)
+{
+ if (n->bdev) {
+ sync_blockdev(n->bdev);
+ blkdev_put(n->bdev);
+ n->bdev = NULL;
+ }
+}
+
+static void dst_cleanup_local_export(struct dst_node *n)
+{
+ dst_cleanup_local(n);
+ dst_cleanup_remote(n);
+}
+
+/*
+ * Setup routines for local, local exporting and remote nodes.
+ */
+static int dst_setup_local(struct dst_node *n, struct dst_ctl *ctl,
+ struct dst_local_ctl *l)
+{
+ dev_t dev;
+ int err;
+
+ err = dst_lookup_device(l->name, &dev);
+ if (err)
+ return err;
+
+ n->bdev = open_by_devnum(dev, FMODE_READ|FMODE_WRITE);
+ if (!n->bdev)
+ return -ENODEV;
+
+ if (!n->size)
+ n->size = get_capacity(n->bdev->bd_disk);
+
+ return 0;
+}
+
+static int dst_setup_local_export(struct dst_node *n, struct dst_ctl *ctl,
+ struct dst_local_export_ctl *le)
+{
+ int err;
+
+ err = dst_setup_local(n, ctl, &le->lctl);
+ if (err)
+ goto err_out_exit;
+
+ n->state = kst_listener_state_init(kst_main_worker, n, le);
+ if (IS_ERR(n->state)) {
+ err = PTR_ERR(n->state);
+ goto err_out_cleanup;
+ }
+
+ return 0;
+
+err_out_cleanup:
+ dst_cleanup_local(n);
+err_out_exit:
+ return err;
+}
+
+static int dst_request_remote_config(struct dst_node *n, struct socket *sock)
+{
+ struct dst_remote_request cfg;
+ struct msghdr msg;
+ struct kvec iov;
+ int err;
+
+ memset(&cfg, 0, sizeof(struct dst_remote_request));
+ cfg.cmd = cpu_to_be32(DST_REMOTE_CFG);
+
+ iov.iov_base = &cfg;
+ iov.iov_len = sizeof(struct dst_remote_request);
+
+ msg.msg_iov = (struct iovec *)&iov;
+ msg.msg_iovlen = 1;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_flags = MSG_WAITALL;
+
+ err = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
+ if (err <= 0) {
+ if (err == 0)
+ err = -ECONNRESET;
+ return err;
+ }
+
+ iov.iov_base = &cfg;
+ iov.iov_len = sizeof(struct dst_remote_request);
+
+ msg.msg_iov = (struct iovec *)&iov;
+ msg.msg_iovlen = 1;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_flags = MSG_WAITALL;
+
+ err = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len, msg.msg_flags);
+ if (err <= 0) {
+ if (err == 0)
+ err = -ECONNRESET;
+ return err;
+ }
+
+ n->size = be64_to_cpu(cfg.sector);
+
+ return 0;
+}
+
+static int dst_setup_remote(struct dst_node *n, struct dst_ctl *ctl,
+ struct dst_remote_ctl *r)
+{
+ int err;
+ struct socket *sock;
+
+ err = sock_create(r->addr.sa_family, r->type, r->proto, &sock);
+ if (err < 0)
+ goto err_out_exit;
+
+ sock->sk->sk_sndtimeo = sock->sk->sk_rcvtimeo =
+ msecs_to_jiffies(DST_DEFAULT_TIMEO);
+
+ err = sock->ops->connect(sock, (struct sockaddr *)&r->addr,
+ r->addr.sa_data_len, 0);
+ if (err)
+ goto err_out_destroy;
+
+ if (!n->size) {
+ err = dst_request_remote_config(n, sock);
+ if (err)
+ goto err_out_destroy;
+ }
+
+ n->state = kst_data_state_init(kst_main_worker, n, sock);
+ if (IS_ERR(n->state)) {
+ err = PTR_ERR(n->state);
+ goto err_out_destroy;
+ }
+
+ return 0;
+
+err_out_destroy:
+ sock_release(sock);
+err_out_exit:
+ return err;
+}
+
+/*
+ * This function inserts node into storage.
+ */
+static int dst_insert_node(struct dst_node *n)
+{
+ int err;
+ struct dst_storage *st = n->st;
+
+ err = st->alg->ops->add_node(n);
+ if (err)
+ return err;
+
+ err = dst_node_sysfs_init(n);
+ if (err)
+ goto err_out_remove_node;
+
+ mutex_lock(&st->tree_lock);
+ err = dst_storage_tree_add(n, st);
+ mutex_unlock(&st->tree_lock);
+ if (err)
+ goto err_out_sysfs_exit;
+
+ return 0;
+
+err_out_sysfs_exit:
+ dst_node_sysfs_exit(n);
+err_out_remove_node:
+ st->alg->ops->del_node(n);
+ return err;
+}
+
+static struct dst_node *dst_alloc_node(struct dst_ctl *ctl,
+ void (*cleanup)(struct dst_node *))
+{
+ struct dst_storage *st;
+ struct dst_node *n;
+
+ st = dst_get_storage(ctl->st, ctl->alg, 1);
+ if (!st)
+ goto err_out_exit;
+
+ n = kzalloc(sizeof(struct dst_node), GFP_KERNEL);
+ if (!n)
+ goto err_out_put_storage;
+
+ n->st = st;
+ n->cleanup = cleanup;
+ n->start = ctl->start;
+ n->size = ctl->size;
+ atomic_set(&n->refcnt, 1);
+
+ return n;
+
+err_out_put_storage:
+ mutex_lock(&dst_storage_lock);
+ list_del_init(&st->entry);
+ mutex_unlock(&dst_storage_lock);
+
+ dst_put_storage(st);
+err_out_exit:
+ return NULL;
+}
+
+/*
+ * Control callback for userspace commands to setup
+ * different nodes and start/stop array.
+ */
+static int dst_add_remote(struct dst_ctl *ctl, void __user *data)
+{
+ struct dst_node *n;
+ int err;
+ struct dst_remote_ctl rctl;
+
+ if (copy_from_user(&rctl, data, sizeof(struct dst_remote_ctl)))
+ return -EFAULT;
+
+ n = dst_alloc_node(ctl, &dst_cleanup_remote);
+ if (!n)
+ return -ENOMEM;
+
+ err = dst_setup_remote(n, ctl, &rctl);
+ if (err < 0)
+ goto err_out_free;
+
+ err = dst_insert_node(n);
+ if (err)
+ goto err_out_free;
+
+ return 0;
+
+err_out_free:
+ dst_node_put(n);
+ return err;
+}
+
+static int dst_add_local_export(struct dst_ctl *ctl, void __user *data)
+{
+ struct dst_node *n;
+ int err;
+ struct dst_local_export_ctl le;
+
+ if (copy_from_user(&le, data, sizeof(struct dst_local_export_ctl)))
+ return -EFAULT;
+
+ n = dst_alloc_node(ctl, &dst_cleanup_local_export);
+ if (!n)
+ return -EINVAL;
+
+ err = dst_setup_local_export(n, ctl, &le);
+ if (err < 0)
+ goto err_out_free;
+
+ err = dst_insert_node(n);
+ if (err)
+ goto err_out_free;
+
+
+ return 0;
+
+err_out_free:
+ dst_node_put(n);
+ return err;
+}
+
+static int dst_add_local(struct dst_ctl *ctl, void __user *data)
+{
+ struct dst_node *n;
+ int err;
+ struct dst_local_ctl lctl;
+
+ if (copy_from_user(&lctl, data, sizeof(struct dst_local_ctl)))
+ return -EFAULT;
+
+ n = dst_alloc_node(ctl, &dst_cleanup_local);
+ if (!n)
+ return -EINVAL;
+
+ err = dst_setup_local(n, ctl, &lctl);
+ if (err < 0)
+ goto err_out_free;
+
+ err = dst_insert_node(n);
+ if (err)
+ goto err_out_free;
+
+ return 0;
+
+err_out_free:
+ dst_node_put(n);
+ return err;
+}
+
+static int dst_del_node(struct dst_ctl *ctl, void __user *data)
+{
+ struct dst_node *n;
+ struct dst_storage *st;
+ int err = -ENODEV;
+
+ st = dst_get_storage(ctl->st, ctl->alg, 0);
+ if (!st)
+ goto err_out_exit;
+
+ mutex_lock(&st->tree_lock);
+ n = dst_storage_tree_del(st, ctl->start);
+ mutex_unlock(&st->tree_lock);
+ if (!n)
+ goto err_out_put;
+
+ dst_node_put(n);
+ dst_put_storage(st);
+
+ return 0;
+
+err_out_put:
+ dst_put_storage(st);
+err_out_exit:
+ return err;
+}
+
+static int dst_start_storage(struct dst_ctl *ctl, void __user *data)
+{
+ struct dst_storage *st;
+
+ st = dst_get_storage(ctl->st, ctl->alg, 0);
+ if (!st)
+ return -ENODEV;
+
+ mutex_lock(&st->tree_lock);
+ if (!(st->flags & DST_ST_STARTED)) {
+ set_capacity(st->disk, st->disk_size);
+ add_disk(st->disk);
+ st->flags |= DST_ST_STARTED;
+ dprintk("%s: STARTED st: %p, disk_size: %llu.\n",
+ __func__, st, st->disk_size);
+ }
+ mutex_unlock(&st->tree_lock);
+
+ dst_put_storage(st);
+
+ return 0;
+}
+
+static int dst_stop_storage(struct dst_ctl *ctl, void __user *data)
+{
+ struct dst_storage *st;
+
+ st = dst_get_storage(ctl->st, ctl->alg, 0);
+ if (!st)
+ return -ENODEV;
+
+ dprintk("%s: STOPPED storage: %s.\n", __func__, st->name);
+
+ dst_storage_sysfs_exit(st);
+
+ mutex_lock(&dst_storage_lock);
+ list_del_init(&st->entry);
+ mutex_unlock(&dst_storage_lock);
+
+ if (st->flags & DST_ST_STARTED)
+ dst_remove_disk(st);
+
+ dst_remove_all_nodes(st);
+ dst_put_storage(st); /* One reference got above */
+ dst_put_storage(st); /* Another reference set during initialization */
+
+ return 0;
+}
+
+typedef int (*dst_command_func)(struct dst_ctl *ctl, void __user *data);
+
+/*
+ * List of userspace commands.
+ */
+static dst_command_func dst_commands[] = {
+ [DST_ADD_REMOTE] = &dst_add_remote,
+ [DST_ADD_LOCAL] = &dst_add_local,
+ [DST_ADD_LOCAL_EXPORT] = &dst_add_local_export,
+ [DST_DEL_NODE] = &dst_del_node,
+ [DST_START_STORAGE] = &dst_start_storage,
+ [DST_STOP_STORAGE] = &dst_stop_storage,
+};
+
+/*
+ * Move to connector for configuration is in TODO list.
+ */
+static int dst_ioctl(struct inode *inode, struct file *file,
+ unsigned int command, unsigned long data)
+{
+ struct dst_ctl ctl;
+ unsigned int cmd = _IOC_NR(command);
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (_IOC_TYPE(command) != DST_IOCTL)
+ return -ENOTTY;
+
+ if (cmd >= DST_CMD_MAX)
+ return -EINVAL;
+
+ if (copy_from_user(&ctl, (void __user *)data, sizeof(struct dst_ctl)))
+ return -EFAULT;
+
+ data += sizeof(struct dst_ctl);
+
+ return dst_commands[cmd](&ctl, (void __user *)data);
+}
+
+static const struct file_operations dst_fops = {
+ .ioctl = dst_ioctl,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice dst_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = DST_NAME,
+ .fops = &dst_fops
+};
+
+static int dst_sysfs_init(void)
+{
+ return bus_register(&dst_dev_bus_type);
+}
+
+static void dst_sysfs_exit(void)
+{
+ bus_unregister(&dst_dev_bus_type);
+}
+
+static int __devinit dst_sys_init(void)
+{
+ int err;
+
+ dst_request_cache = kmem_cache_create("dst", sizeof(struct dst_request),
+ 0, 0, NULL, NULL);
+ if (!dst_request_cache)
+ return -ENOMEM;
+
+ err = register_blkdev(dst_major, DST_NAME);
+ if (err < 0)
+ goto err_out_destroy;
+ if (err)
+ dst_major = err;
+
+ err = dst_sysfs_init();
+ if (err)
+ goto err_out_unregister;
+
+ kst_main_worker = kst_worker_init(0);
+ if (IS_ERR(kst_main_worker)) {
+ err = PTR_ERR(kst_main_worker);
+ goto err_out_sysfs_exit;
+ }
+
+ err = misc_register(&dst_misc);
+ if (err)
+ goto err_out_worker_exit;
+
+ return 0;
+
+err_out_worker_exit:
+ kst_worker_exit(kst_main_worker);
+err_out_sysfs_exit:
+ dst_sysfs_exit();
+err_out_unregister:
+ unregister_blkdev(dst_major, DST_NAME);
+err_out_destroy:
+ kmem_cache_destroy(dst_request_cache);
+ return err;
+}
+
+static void __devexit dst_sys_exit(void)
+{
+ misc_deregister(&dst_misc);
+ dst_sysfs_exit();
+ unregister_blkdev(dst_major, DST_NAME);
+ kst_exit_all();
+ kmem_cache_destroy(dst_request_cache);
+}
+
+module_init(dst_sys_init);
+module_exit(dst_sys_exit);
+
+MODULE_DESCRIPTION("Distributed storage");
+MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>");
+MODULE_LICENSE("GPL");
diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 0000000..7193d4c
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1437 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/kthread.h>
+#include <linux/net.h>
+#include <linux/in.h>
+#include <linux/poll.h>
+#include <linux/bio.h>
+#include <linux/dst.h>
+
+#include <net/sock.h>
+
+struct kst_poll_helper
+{
+ poll_table pt;
+ struct kst_state *st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+ int type, int proto, int backlog)
+{
+ int err;
+
+ err = sock_create(addr->sa_family, type, proto, &st->socket);
+ if (err)
+ goto err_out_exit;
+
+ err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+ addr->sa_data_len);
+ if (!err)
+ err = st->socket->ops->listen(st->socket, backlog);
+ if (err)
+ goto err_out_release;
+
+ st->socket->sk->sk_allocation = GFP_NOIO;
+
+ return 0;
+
+err_out_release:
+ sock_release(st->socket);
+err_out_exit:
+ return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+ if (st->socket) {
+ sock_release(st->socket);
+ st->socket = NULL;
+ }
+}
+
+static void kst_wake(struct kst_state *st)
+{
+ struct kst_worker *w = st->w;
+ unsigned long flags;
+
+ spin_lock_irqsave(&w->ready_lock, flags);
+ if (list_empty(&st->ready_entry))
+ list_add_tail(&st->ready_entry, &w->ready_list);
+ spin_unlock_irqrestore(&w->ready_lock, flags);
+
+ wake_up(&w->wait);
+}
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+ int sync, void *key)
+{
+ struct kst_state *st = container_of(wait, struct kst_state, wait);
+ kst_wake(st);
+ return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+ poll_table *pt)
+{
+ struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+ st->whead = whead;
+ init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+ add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+ if (st->whead) {
+ remove_wait_queue(st->whead, &st->wait);
+ st->whead = NULL;
+ }
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+static void kst_del_req(struct dst_request *req)
+{
+ struct kst_state *st = req->state;
+
+ rb_erase(&req->request_entry, &st->request_root);
+ RB_CLEAR_NODE(&req->request_entry);
+ list_del_init(&req->request_list_entry);
+}
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+ struct dst_request *req = NULL;
+
+ if (!list_empty(&st->request_list))
+ req = list_entry(st->request_list.next, struct dst_request,
+ request_list_entry);
+ return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+ struct dst_request *req;
+
+ mutex_lock(&st->request_lock);
+ req = kst_req_first(st);
+ if (req)
+ kst_del_req(req);
+ mutex_unlock(&st->request_lock);
+ return req;
+}
+
+static inline int dst_compare_request_id(struct dst_request *old,
+ struct dst_request *new)
+{
+ int cmd = 0;
+
+ if (old->start + to_sector(old->orig_size) <= new->start)
+ cmd = 1;
+ if (old->start >= new->start + to_sector(new->orig_size))
+ cmd = -1;
+
+ dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+ "new: op: %lu, start: %llu, size: %llu, off: %u, cmp: %d.\n",
+ __func__, bio_rw(old->bio), old->start, old->orig_size,
+ old->offset,
+ bio_rw(new->bio), new->start, new->orig_size,
+ new->offset, cmd);
+
+ return cmd;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+static int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+ struct rb_node **n = &st->request_root.rb_node, *parent = NULL;
+ struct dst_request *old = NULL;
+ int cmp;
+
+ while (*n) {
+ parent = *n;
+ old = rb_entry(parent, struct dst_request, request_entry);
+
+ cmp = dst_compare_request_id(old, req);
+ if (cmp < 0)
+ n = &parent->rb_left;
+ else if (cmp > 0)
+ n = &parent->rb_right;
+ else
+ return -EEXIST;
+ }
+
+ rb_link_node(&req->request_entry, parent, n);
+ rb_insert_color(&req->request_entry, &st->request_root);
+
+ if (req->size != req->orig_size)
+ list_add(&req->request_list_entry, &st->request_list);
+ else
+ list_add_tail(&req->request_list_entry, &st->request_list);
+ return 0;
+}
+
+/*
+ * BIOs for local exporting node are freed via this function.
+ */
+static void kst_export_put_bio(struct bio *bio)
+{
+ int i;
+ struct bio_vec *bv;
+
+ dprintk("%s: bio: %p, size: %u, idx: %d, num: %d.\n",
+ __func__, bio, bio->bi_size, bio->bi_idx,
+ bio->bi_vcnt);
+
+ bio_for_each_segment(bv, bio, i)
+ __free_page(bv->bv_page);
+ bio_put(bio);
+}
+
+/*
+ * This is a generic request completion function.
+ * If it is local export node, state machine is different,
+ * see details below.
+ */
+static void kst_complete_req(struct dst_request *req, int err)
+{
+ if (err)
+ printk("%s: freeing bio: %p, req: %p, size: %llu, "
+ "orig_size: %llu, bi_size: %u, err: %d, flags: %u.\n",
+ __func__, req->bio, req, req->size, req->orig_size,
+ req->bio->bi_size, err, req->flags);
+
+ if (req->flags & DST_REQ_EXPORT) {
+ if (req->flags & DST_REQ_EXPORT_WRITE) {
+ req->bio->bi_rw = WRITE;
+ generic_make_request(req->bio);
+ } else
+ kst_export_put_bio(req->bio);
+ } else {
+ bio_endio(req->bio, req->orig_size, (err)?-EIO:0);
+ }
+ dprintk("%s: free req: %p, pool: %p.\n",
+ __func__, req, req->state->w->req_pool);
+ mempool_free(req, req->state->w->req_pool);
+}
+
+static void kst_flush_requests(struct kst_state *st)
+{
+ struct dst_request *req;
+
+ while ((req = kst_dequeue_req(st)) != NULL)
+ kst_complete_req(req, -EIO);
+}
+
+static int kst_poll_init(struct kst_state *st)
+{
+ struct kst_poll_helper ph;
+
+ ph.st = st;
+ init_poll_funcptr(&ph.pt, &kst_queue_func);
+
+ st->socket->ops->poll(NULL, st->socket, &ph.pt);
+ return 0;
+}
+
+/*
+ * Main state creation function.
+ * It creates new state according to given operations
+ * and links it into worker structure and node.
+ */
+struct kst_state *kst_state_init(struct kst_worker *w, struct dst_node *node,
+ struct kst_state_ops *ops, void *data)
+{
+ struct kst_state *st;
+ int err;
+
+ st = kzalloc(sizeof(struct kst_state), GFP_KERNEL);
+ if (!st)
+ return ERR_PTR(-ENOMEM);
+
+ st->node = node;
+ st->ops = ops;
+ st->w = w;
+ INIT_LIST_HEAD(&st->ready_entry);
+ INIT_LIST_HEAD(&st->entry);
+ st->request_root.rb_node = NULL;
+ INIT_LIST_HEAD(&st->request_list);
+ mutex_init(&st->request_lock);
+
+ err = st->ops->init(st, data);
+ if (err)
+ goto err_out_free;
+ mutex_lock(&w->state_mutex);
+ list_add_tail(&st->entry, &w->state_list);
+ mutex_unlock(&w->state_mutex);
+
+ kst_wake(st);
+
+ return st;
+
+err_out_free:
+ kfree(st);
+ return ERR_PTR(err);
+}
+
+/*
+ * This function is called when a node is removed,
+ * or when the state of a client connected to a local exporting
+ * node is destroyed.
+ */
+void kst_state_exit(struct kst_state *st)
+{
+ struct kst_worker *w = st->w;
+
+ dprintk("%s: st: %p.\n", __func__, st);
+
+ mutex_lock(&w->state_mutex);
+ list_del_init(&st->entry);
+ mutex_unlock(&w->state_mutex);
+
+ st->ops->exit(st);
+ kfree(st);
+}
+
+/*
+ * This is main state processing function.
+ * It tries to complete request and invoke appropriate
+ * callbacks in case of errors or successful completion of the operation.
+ */
+static int kst_thread_process_state(struct kst_state *st)
+{
+ int err, empty;
+ unsigned int revents;
+ struct dst_request *req, *tmp;
+
+ mutex_lock(&st->request_lock);
+ if (st->ops->ready) {
+ err = st->ops->ready(st);
+ if (err) {
+ mutex_unlock(&st->request_lock);
+ if (err < 0)
+ kst_state_exit(st);
+ return err;
+ }
+ }
+
+ err = 0;
+ empty = 1;
+ req = NULL;
+ list_for_each_entry_safe(req, tmp, &st->request_list,
+ request_list_entry) {
+ empty = 0;
+ revents = st->socket->ops->poll(st->socket->file,
+ st->socket, NULL);
+ dprintk("\n%s: st: %p, revents: %x.\n", __func__, st, revents);
+ if (!revents)
+ break;
+ err = req->callback(req, revents);
+ dprintk("%s: callback returned, st: %p, err: %d.\n",
+ __func__, st, err);
+ if (err)
+ break;
+ }
+ mutex_unlock(&st->request_lock);
+
+ dprintk("%s: req: %p, err: %d.\n", __func__, req, err);
+ if (err < 0) {
+ err = st->node->st->alg->ops->error(st, err);
+ if (err && (st != st->node->state)) {
+ dprintk("%s: err: %d, st: %p, node->state: %p.\n",
+ __func__, err, st, st->node->state);
+ /*
+ * Accepted client has state not related to storage
+ * node, so it must be freed explicitly.
+ */
+
+ kst_state_exit(st);
+ return err;
+ }
+
+ kst_wake(st);
+ }
+
+ if (list_empty(&st->request_list) && !empty)
+ kst_wake(st);
+
+ return err;
+}
+
+/*
+ * Main worker thread - one per storage.
+ */
+static int kst_thread_func(void *data)
+{
+ struct kst_worker *w = data;
+ struct kst_state *st;
+ unsigned long flags;
+ int err = 0;
+
+ while (!kthread_should_stop()) {
+ wait_event_interruptible_timeout(w->wait,
+ !list_empty(&w->ready_list) ||
+ kthread_should_stop(),
+ HZ);
+
+ st = NULL;
+ spin_lock_irqsave(&w->ready_lock, flags);
+ if (!list_empty(&w->ready_list)) {
+ st = list_entry(w->ready_list.next, struct kst_state,
+ ready_entry);
+ list_del_init(&st->ready_entry);
+ }
+ spin_unlock_irqrestore(&w->ready_lock, flags);
+
+ if (!st)
+ continue;
+
+ err = kst_thread_process_state(st);
+ }
+
+ return err;
+}
+
+/*
+ * Worker initialization - this object will host and process all states,
+ * which in turn host requests for remote targets.
+ */
+struct kst_worker *kst_worker_init(int id)
+{
+ struct kst_worker *w;
+ int err;
+
+ w = kzalloc(sizeof(struct kst_worker), GFP_KERNEL);
+ if (!w)
+ return ERR_PTR(-ENOMEM);
+
+ w->id = id;
+ init_waitqueue_head(&w->wait);
+ spin_lock_init(&w->ready_lock);
+ mutex_init(&w->state_mutex);
+
+ INIT_LIST_HEAD(&w->ready_list);
+ INIT_LIST_HEAD(&w->state_list);
+
+ w->req_pool = mempool_create_slab_pool(256, dst_request_cache);
+ if (!w->req_pool) {
+ err = -ENOMEM;
+ goto err_out_free;
+ }
+
+ w->thread = kthread_run(&kst_thread_func, w, "kst%d", w->id);
+ if (IS_ERR(w->thread)) {
+ err = PTR_ERR(w->thread);
+ goto err_out_destroy;
+ }
+
+ mutex_lock(&kst_worker_mutex);
+ list_add_tail(&w->entry, &kst_worker_list);
+ mutex_unlock(&kst_worker_mutex);
+
+ return w;
+
+err_out_destroy:
+ mempool_destroy(w->req_pool);
+err_out_free:
+ kfree(w);
+ return ERR_PTR(err);
+}
+
+void kst_worker_exit(struct kst_worker *w)
+{
+ struct kst_state *st, *n;
+
+ mutex_lock(&kst_worker_mutex);
+ list_del(&w->entry);
+ mutex_unlock(&kst_worker_mutex);
+
+ kthread_stop(w->thread);
+
+ list_for_each_entry_safe(st, n, &w->state_list, entry) {
+ kst_state_exit(st);
+ }
+
+ mempool_destroy(w->req_pool);
+ kfree(w);
+}
+
+/*
+ * Common state exit callback.
+ * Removes itself from worker's list of states,
+ * releases socket and flushes all requests.
+ */
+static void kst_common_exit(struct kst_state *st)
+{
+ unsigned long flags;
+
+ dprintk("%s: st: %p.\n", __func__, st);
+ kst_poll_exit(st);
+
+ spin_lock_irqsave(&st->w->ready_lock, flags);
+ list_del_init(&st->ready_entry);
+ spin_unlock_irqrestore(&st->w->ready_lock, flags);
+
+ kst_sock_release(st);
+ kst_flush_requests(st);
+}
+
+/*
+ * Header sending function - may block.
+ */
+static int kst_data_send_header(struct kst_state *st,
+ struct dst_remote_request *r)
+{
+ struct msghdr msg;
+ struct kvec iov;
+
+ iov.iov_base = r;
+ iov.iov_len = sizeof(struct dst_remote_request);
+
+ msg.msg_iov = (struct iovec *)&iov;
+ msg.msg_iovlen = 1;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+ return kernel_sendmsg(st->socket, &msg, &iov, 1, iov.iov_len);
+}
+
+/*
+ * BIO vector receiving function - does not block, but may sleep because
+ * of scheduling policy.
+ */
+static int kst_data_recv_bio_vec(struct kst_state *st, struct bio_vec *bv,
+ unsigned int offset, unsigned int size)
+{
+ struct msghdr msg;
+ struct kvec iov;
+ void *kaddr;
+ int err;
+
+ kaddr = kmap(bv->bv_page);
+
+ iov.iov_base = kaddr + bv->bv_offset + offset;
+ iov.iov_len = size;
+
+ msg.msg_iov = (struct iovec *)&iov;
+ msg.msg_iovlen = 1;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
+
+ err = kernel_recvmsg(st->socket, &msg, &iov, 1, iov.iov_len,
+ msg.msg_flags);
+ kunmap(bv->bv_page);
+
+ return err;
+}
+
+/*
+ * BIO vector sending function - does not block, but may sleep because
+ * of scheduling policy.
+ */
+static int kst_data_send_bio_vec(struct kst_state *st, struct bio_vec *bv,
+ unsigned int offset, unsigned int size)
+{
+ return kernel_sendpage(st->socket, bv->bv_page,
+ bv->bv_offset + offset, size,
+ MSG_DONTWAIT | MSG_NOSIGNAL);
+}
+
+typedef int (*kst_data_process_bio_vec_t)(struct kst_state *st,
+ struct bio_vec *bv, unsigned int offset, unsigned int size);
+
+/*
+ * @req: processing request.
+ * Contains BIO and all related to its processing info.
+ *
+ * This function sends or receives requested number of pages from given BIO.
+ *
+ * In case of errors negative return value is returned and @size,
+ * @index and @off are set to the:
+ * - number of bytes not yet processed (i.e. the rest of the bytes to be
+ * processed).
+ * - index of the last bio_vec started to be processed (header sent).
+ * - offset of the first byte to be processed in the bio_vec.
+ *
+ * If there are no errors, zero is returned.
+ * -EAGAIN is not an error and is transformed into a zero return value.
+ * The caller must check whether @size is zero: if so, the whole BIO has been
+ * processed and bio_endio() can be called, otherwise a new request must be
+ * allocated to be processed later.
+ */
+static int kst_data_process_bio(struct dst_request *req)
+{
+ int err = -ENOSPC, partial = (req->size != req->orig_size);
+ struct dst_remote_request r;
+ kst_data_process_bio_vec_t func;
+ unsigned int cur_size;
+
+ r.flags = cpu_to_be32(((unsigned long)req->bio) & 0xffffffff);
+
+ if (bio_rw(req->bio) == WRITE) {
+ r.cmd = cpu_to_be32(DST_WRITE);
+ func = kst_data_send_bio_vec;
+ } else {
+ r.cmd = cpu_to_be32(DST_READ);
+ func = kst_data_recv_bio_vec;
+ }
+
+ dprintk("%s: start: [%c], start: %llu, idx: %d, num: %d, "
+ "size: %llu, offset: %u.\n",
+ __func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+ req->start, req->idx, req->num, req->size, req->offset);
+
+ while (req->idx < req->num) {
+ struct bio_vec *bv = bio_iovec_idx(req->bio, req->idx);
+
+ cur_size = min_t(u64, bv->bv_len - req->offset, req->size);
+
+ BUG_ON(cur_size == 0);
+
+ if (!(req->flags & DST_REQ_HEADER_SENT)) {
+ r.sector = cpu_to_be64(req->start);
+ r.offset = cpu_to_be32(bv->bv_offset + req->offset);
+ r.size = cpu_to_be32(cur_size);
+
+ err = kst_data_send_header(req->state, &r);
+ if (err != sizeof(struct dst_remote_request)) {
+ dprintk("%s: %d/%d: header: start: %llu, "
+ "bv_offset: %u, bv_len: %u, "
+ "a offset: %u, offset: %u, "
+ "cur_size: %u, err: %d.\n",
+ __func__, req->idx, req->num,
+ req->start, bv->bv_offset, bv->bv_len,
+ bv->bv_offset + req->offset,
+ req->offset, cur_size, err);
+ if (err >= 0)
+ err = -EINVAL;
+ break;
+ }
+
+ req->flags |= DST_REQ_HEADER_SENT;
+ }
+
+ err = func(req->state, bv, req->offset, cur_size);
+ if (err <= 0)
+ break;
+
+ req->offset += err;
+ req->size -= err;
+ req->start += to_sector(err);
+
+ if (req->offset != bv->bv_len) {
+ dprintk("%s: %d/%d: this: start: %llu, bv_offset: %u, "
+ "bv_len: %u, a offset: %u, offset: %u, "
+ "cur_size: %u, err: %d.\n",
+ __func__, req->idx, req->num, req->start,
+ bv->bv_offset, bv->bv_len,
+ bv->bv_offset + req->offset,
+ req->offset, cur_size, err);
+ err = -EAGAIN;
+ break;
+ }
+ req->offset = 0;
+ req->idx++;
+ req->flags &= ~DST_REQ_HEADER_SENT;
+ }
+
+ if (err <= 0 && err != -EAGAIN) {
+ if (err == 0)
+ err = -ECONNRESET;
+ } else
+ err = 0;
+
+ if (req->size) {
+ req->state->flags |= KST_FLAG_PARTIAL;
+ } else if (partial) {
+ req->state->flags &= ~KST_FLAG_PARTIAL;
+ }
+
+ if (err < 0 || (req->idx == req->num && req->size)) {
+ dprintk("%s: return: idx: %d, num: %d, offset: %u, "
+ "size: %llu, err: %d.\n",
+ __func__, req->idx, req->num, req->offset,
+ req->size, err);
+ }
+ dprintk("%s: end: start: %llu, idx: %d, num: %d, "
+ "size: %llu, offset: %u.\n",
+ __func__, req->start, req->idx, req->num,
+ req->size, req->offset);
+
+
+ return err;
+}
+
+/*
+ * This callback is invoked by worker thread to process given request.
+ */
+static int kst_data_callback(struct dst_request *req, unsigned int revents)
+{
+ int err;
+
+ dprintk("%s: req: %p, num: %d, idx: %d, bio: %p, "
+ "revents: %x, flags: %x.\n",
+ __func__, req, req->num, req->idx, req->bio,
+ revents, req->flags);
+
+ if (req->flags & DST_REQ_EXPORT_READ)
+ return 1;
+
+ err = kst_data_process_bio(req);
+ if (err < 0)
+ goto err_out;
+
+ if (!req->size) {
+ dprintk("%s: complete: req: %p, bio: %p.\n",
+ __func__, req, req->bio);
+ kst_del_req(req);
+ kst_complete_req(req, 0);
+ return 0;
+ }
+
+ if (revents & (POLLERR | POLLHUP | POLLRDHUP)) {
+ err = -EPIPE;
+ goto err_out;
+ }
+
+ return 1;
+
+err_out:
+ return err;
+}
+
+#define KST_CONG_COMPLETED (0)
+#define KST_CONG_NOT_FOUND (1)
+#define KST_CONG_QUEUE (-1)
+
+/*
+ * kst_congestion - checks for data congestion, i.e. the case when the given
+ * block request overlaps the area of another block request which
+ * has not yet been sent to the remote node.
+ *
+ * @req: dst request containing block io related information.
+ *
+ * Return value:
+ * %KST_CONG_COMPLETED - congestion was found and processed,
+ * bio must be ended, request is completed.
+ * %KST_CONG_NOT_FOUND - no congestion found,
+ * request must be processed as usual
+ * %KST_CONG_QUEUE - congestion has been found, but bio is not completed,
+ * new request must be allocated and processed.
+ */
+static int kst_congestion(struct dst_request *req)
+{
+ int cmp, i;
+ struct kst_state *st = req->state;
+ struct rb_node *n = st->request_root.rb_node;
+ struct dst_request *old = NULL, *dst_req, *src_req;
+
+ while (n) {
+ src_req = rb_entry(n, struct dst_request, request_entry);
+ cmp = dst_compare_request_id(src_req, req);
+
+ if (cmp < 0)
+ n = n->rb_left;
+ else if (cmp > 0)
+ n = n->rb_right;
+ else {
+ old = src_req;
+ break;
+ }
+ }
+
+ if (likely(!old))
+ return KST_CONG_NOT_FOUND;
+
+ dprintk("%s: old: op: %lu, start: %llu, size: %llu, off: %u, "
+ "new: op: %lu, start: %llu, size: %llu, off: %u.\n",
+ __func__, bio_rw(old->bio), old->start, old->orig_size,
+ old->offset,
+ bio_rw(req->bio), req->start, req->orig_size, req->offset);
+
+ if ((bio_rw(old->bio) != WRITE) && (bio_rw(req->bio) != WRITE)) {
+ return KST_CONG_QUEUE;
+ }
+
+ if (unlikely(req->offset != old->offset))
+ return KST_CONG_QUEUE;
+
+ src_req = old;
+ dst_req = req;
+ if (bio_rw(req->bio) == WRITE) {
+ dst_req = old;
+ src_req = req;
+ }
+
+ /* Actually we could partially complete new request by copying
+ * part of the first one, but not now, consider this as a
+ * (low-priority) todo item.
+ */
+ if (src_req->start + src_req->orig_size <
+ dst_req->start + dst_req->orig_size)
+ return KST_CONG_QUEUE;
+
+ /*
+ * So, only process if new request is different from old one,
+ * or subsequent write, i.e.:
+ * - not completed write and request to read
+ * - not completed read and request to write
+ * - not completed write and request to (over)write
+ */
+ for (i=old->idx; i<old->num; ++i) {
+ struct bio_vec *bv_src, *bv_dst;
+ void *src, *dst;
+ u64 len;
+
+ bv_src = bio_iovec_idx(src_req->bio, i);
+ bv_dst = bio_iovec_idx(dst_req->bio, i);
+
+ if (unlikely(bv_dst->bv_offset != bv_src->bv_offset))
+ return KST_CONG_QUEUE;
+
+ if (unlikely(bv_dst->bv_len != bv_src->bv_len))
+ return KST_CONG_QUEUE;
+
+ src = kmap_atomic(bv_src->bv_page, KM_USER0);
+ dst = kmap_atomic(bv_dst->bv_page, KM_USER1);
+
+ len = min_t(u64, bv_dst->bv_len, dst_req->size);
+
+ memcpy(dst + bv_dst->bv_offset, src + bv_src->bv_offset, len);
+
+ kunmap_atomic(src, KM_USER0);
+ kunmap_atomic(dst, KM_USER1);
+
+ dst_req->idx++;
+ dst_req->size -= len;
+ dst_req->offset = 0;
+ dst_req->start += to_sector(len);
+
+ if (!dst_req->size)
+ break;
+ }
+
+ if (req == dst_req)
+ return KST_CONG_COMPLETED;
+
+ kst_del_req(dst_req);
+ kst_complete_req(dst_req, 0);
+
+ return KST_CONG_NOT_FOUND;
+}
+
+static struct dst_request *dst_clone_request(struct dst_request *req)
+{
+ struct dst_request *new_req;
+
+ new_req = mempool_alloc(req->state->w->req_pool, GFP_NOIO);
+ if (!new_req)
+ return NULL;
+
+ dprintk("%s: req: %p, new_req: %p, bio: %p.\n",
+ __func__, req, new_req, req->bio);
+
+ RB_CLEAR_NODE(&new_req->request_entry);
+
+ new_req->bio = req->bio;
+ new_req->state = req->state;
+ new_req->idx = req->idx;
+ new_req->num = req->num;
+ new_req->size = req->size;
+ new_req->orig_size = req->orig_size;
+ new_req->offset = req->offset;
+ new_req->start = req->start;
+ new_req->flags = req->flags;
+
+ return new_req;
+}
+
+/*
+ * This is main data processing function, eventually invoked from block layer.
+ * It tries to complete request, but if it is about to block, it allocates
+ * new request and queues it to main worker to be processed when events allow.
+ */
+static int kst_data_push(struct dst_request *req)
+{
+ struct kst_state *st = req->state;
+ struct dst_request *new_req;
+ unsigned int revents;
+ int err, locked = 0;
+
+ dprintk("%s: start: %llu, size: %llu, bio: %p.\n",
+ __func__, req->start, req->size, req->bio);
+
+ if (mutex_trylock(&st->request_lock)) {
+ locked = 1;
+
+ if (st->flags & (KST_FLAG_PARTIAL | DST_REQ_ALWAYS_QUEUE))
+ goto alloc_new_req;
+
+ err = kst_congestion(req);
+ if (err == KST_CONG_COMPLETED)
+ goto out_bio_endio;
+
+ if (err == KST_CONG_NOT_FOUND) {
+ revents = st->socket->ops->poll(NULL, st->socket, NULL);
+ dprintk("%s: st: %p, bio: %p, revents: %x.\n",
+ __func__, st, req->bio, revents);
+ if (revents & POLLOUT) {
+ err = kst_data_process_bio(req);
+ if (err < 0)
+ goto out_unlock;
+
+ if (!req->size) {
+ err = 0;
+ goto out_bio_endio;
+ }
+ }
+ }
+ }
+
+alloc_new_req:
+ err = -ENOMEM;
+ new_req = dst_clone_request(req);
+ if (!new_req)
+ goto out_unlock;
+
+ new_req->callback = &kst_data_callback;
+
+ if (!locked)
+ mutex_lock(&st->request_lock);
+ locked = 1;
+
+ err = kst_enqueue_req(st, new_req);
+ mutex_unlock(&st->request_lock);
+ if (err) {
+ printk("%s: free req: %p, pool: %p.\n",
+ __func__, new_req, st->w->req_pool);
+ printk("%s: free [%c], start: %llu, idx: %d, "
+ "num: %d, size: %llu, offset: %u, err: %d.\n",
+ __func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+ req->start, req->idx, req->num, req->size,
+ req->offset, err);
+ mempool_free(new_req, st->w->req_pool);
+ goto err_out;
+ }
+
+ kst_wake(st);
+
+ return 0;
+
+out_bio_endio:
+ if (err)
+ printk("%s: freeing bio: %p, bi_size: %u, orig_size: %llu.\n",
+ __func__, req->bio, req->bio->bi_size, req->orig_size);
+ bio_endio(req->bio, req->orig_size, err);
+out_unlock:
+ mutex_unlock(&st->request_lock);
+ locked = 0;
+err_out:
+ if (err) {
+ err = st->node->st->alg->ops->error(st, err);
+ if (!err)
+ goto alloc_new_req;
+ }
+
+ if (err)
+ printk("%s: [%c], start: %llu, idx: %d, num: %d, "
+ "size: %llu, offset: %u, err: %d.\n",
+ __func__, (bio_rw(req->bio) == WRITE)?'W':'R',
+ req->start, req->idx, req->num, req->size,
+ req->offset, err);
+ kst_wake(st);
+ return err;
+}
+
+/*
+ * Remote node initialization callback.
+ */
+static int kst_data_init(struct kst_state *st, void *data)
+{
+ int err;
+
+ st->socket = data;
+ st->socket->sk->sk_allocation = GFP_NOIO;
+ /*
+ * Why not?
+ */
+ st->socket->sk->sk_sndbuf = st->socket->sk->sk_rcvbuf = 1024*1024*10;
+
+ err = kst_poll_init(st);
+ if (err)
+ return err;
+
+ return 0;
+}
+
+/*
+ * Remote node recovery function - tries to reconnect to given target.
+ */
+static int kst_data_recovery(struct kst_state *st, int err)
+{
+ struct socket *sock;
+ struct sockaddr addr;
+ int addrlen;
+ struct dst_request *req;
+
+ if (err != -ECONNRESET && err != -EPIPE) {
+ dprintk("%s: state %p does not know how "
+ "to recover from error %d.\n",
+ __func__, st, err);
+ return err;
+ }
+
+ err = sock_create(st->socket->ops->family, st->socket->type,
+ st->socket->sk->sk_protocol, &sock);
+ if (err < 0)
+ goto err_out_exit;
+
+ sock->sk->sk_sndtimeo = sock->sk->sk_rcvtimeo =
+ msecs_to_jiffies(DST_DEFAULT_TIMEO);
+
+ err = sock->ops->getname(st->socket, &addr, &addrlen, 2);
+ if (err)
+ goto err_out_destroy;
+
+ err = sock->ops->connect(sock, &addr, addrlen, 0);
+ if (err)
+ goto err_out_destroy;
+
+ kst_poll_exit(st);
+ kst_sock_release(st);
+
+ mutex_lock(&st->request_lock);
+ err = st->ops->init(st, sock);
+ if (!err) {
+ /*
+ * After reconnection is completed all requests
+ * must be resent from the state they were finished previously,
+ * but with new headers.
+ */
+ list_for_each_entry(req, &st->request_list, request_list_entry)
+ req->flags &= ~DST_REQ_HEADER_SENT;
+ }
+ mutex_unlock(&st->request_lock);
+ if (err < 0)
+ goto err_out_destroy;
+
+ kst_wake(st);
+ printk("%s: recovery completed.\n", __func__);
+
+ return 0;
+
+err_out_destroy:
+ sock_release(sock);
+err_out_exit:
+ dprintk("%s: recovery failed: st: %p, err: %d.\n", __func__, st, err);
+ return err;
+}
+
+static inline void kst_convert_header(struct dst_remote_request *r)
+{
+ r->cmd = be32_to_cpu(r->cmd);
+ r->sector = be64_to_cpu(r->sector);
+ r->offset = be32_to_cpu(r->offset);
+ r->size = be32_to_cpu(r->size);
+ r->flags = be32_to_cpu(r->flags);
+}
+
+/*
+ * Local exporting node end IO callbacks.
+ */
+static int kst_export_write_end_io(struct bio *bio, unsigned int size, int err)
+{
+ dprintk("%s: bio: %p, size: %u, idx: %d, num: %d, err: %d.\n",
+ __func__, bio, bio->bi_size, bio->bi_idx, bio->bi_vcnt, err);
+
+ if (bio->bi_size)
+ return 1;
+
+ kst_export_put_bio(bio);
+ return 0;
+}
+
+static int kst_export_read_end_io(struct bio *bio, unsigned int size, int err)
+{
+ struct dst_request *req = bio->bi_private;
+ struct kst_state *st = req->state;
+
+ dprintk("%s: bio: %p, req: %p, size: %u, idx: %d, num: %d, err: %d.\n",
+ __func__, bio, req, bio->bi_size, bio->bi_idx,
+ bio->bi_vcnt, err);
+
+ if (bio->bi_size)
+ return 1;
+
+ bio->bi_size = req->size = req->orig_size;
+ bio->bi_rw = WRITE;
+ req->flags &= ~DST_REQ_EXPORT_READ;
+ kst_wake(st);
+ return 0;
+}
+
+/*
+ * This callback is invoked each time new request from remote
+ * node to given local export node is received.
+ * It allocates new block IO request and queues it for processing.
+ */
+static int kst_export_ready(struct kst_state *st)
+{
+ struct dst_remote_request r;
+ struct msghdr msg;
+ struct kvec iov;
+ struct bio *bio;
+ int err, nr, i;
+ struct dst_request *req;
+ sector_t data_size;
+ unsigned int revents = st->socket->ops->poll(NULL, st->socket, NULL);
+
+ if (revents & (POLLERR | POLLHUP)) {
+ err = -EPIPE;
+ goto err_out_exit;
+ }
+
+ if (!(revents & POLLIN) || !list_empty(&st->request_list))
+ return 0;
+
+ iov.iov_base = &r;
+ iov.iov_len = sizeof(struct dst_remote_request);
+
+ msg.msg_iov = (struct iovec *)&iov;
+ msg.msg_iovlen = 1;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_flags = MSG_WAITALL | MSG_NOSIGNAL;
+
+ err = kernel_recvmsg(st->socket, &msg, &iov, 1,
+ iov.iov_len, msg.msg_flags);
+ if (err != sizeof(struct dst_remote_request)) {
+ err = -EINVAL;
+ goto err_out_exit;
+ }
+
+ kst_convert_header(&r);
+
+ dprintk("\n%s: cmd: %u, sector: %llu, size: %u, "
+ "flags: %x, offset: %u.\n",
+ __func__, r.cmd, r.sector, r.size, r.flags, r.offset);
+
+ /*
+ * Does not support autoconfig yet.
+ */
+ err = -EINVAL;
+ if (r.cmd != DST_READ && r.cmd != DST_WRITE)
+ goto err_out_exit;
+
+ data_size = get_capacity(st->node->bdev->bd_disk);
+ if ((signed)(r.sector + to_sector(r.size)) < 0 ||
+ (signed)(r.sector + to_sector(r.size)) > data_size ||
+ (signed)r.sector > data_size)
+ goto err_out_exit;
+
+ nr = r.size/PAGE_SIZE + 1;
+
+ while (r.size) {
+ int nr_pages = min(BIO_MAX_PAGES, nr);
+ unsigned int size;
+ struct page *page;
+
+ err = -ENOMEM;
+ req = mempool_alloc(st->w->req_pool, GFP_NOIO);
+ if (!req)
+ goto err_out_exit;
+
+ dprintk("%s: alloc req: %p, pool: %p.\n",
+ __func__, req, st->w->req_pool);
+
+ bio = bio_alloc(GFP_NOIO, nr_pages);
+ if (!bio)
+ goto err_out_free_req;
+
+ req->flags = DST_REQ_EXPORT | DST_REQ_HEADER_SENT;
+ req->bio = bio;
+ req->state = st;
+ req->callback = &kst_data_callback;
+
+ /*
+ * Yes, looks a bit weird.
+ * Logic is simple - for local exporting node all operations
+ * are reversed compared to usual nodes, since usual nodes
+ * process remote data and local export node process remote
+ * requests, so that writing data means sending data to
+ * remote node and receiving on the local export one.
+ *
+ * So, to process writing to the exported node we need first to
+ * receive data from the net (i.e. to perform READ operation
+ * in terms of usual node), and then put it to the storage
+ * (WRITE command, so it will be changed before calling
+ * generic_make_request()).
+ *
+ * To process read request from the exported node we need
+ * first to read it from storage (READ command for BIO)
+ * and then send it over the net (perform WRITE operation
+ * in terms of network).
+ */
+ if (r.cmd == DST_WRITE) {
+ req->flags |= DST_REQ_EXPORT_WRITE;
+ bio->bi_end_io = kst_export_write_end_io;
+ } else {
+ req->flags |= DST_REQ_EXPORT_READ;
+ bio->bi_end_io = kst_export_read_end_io;
+ }
+ bio->bi_rw = READ;
+ bio->bi_private = req;
+ bio->bi_sector = r.sector;
+ bio->bi_bdev = st->node->bdev;
+
+ for (i=0; i<nr_pages; ++i) {
+ page = alloc_page(GFP_NOIO);
+ if (!page)
+ break;
+
+ size = min_t(u32, PAGE_SIZE, r.size);
+
+ err = bio_add_page(bio, page, size, r.offset);
+ dprintk("%s: %d/%d: page: %p, size: %u, offset: %u, "
+ "err: %d.\n",
+ __func__, i, nr_pages, page, size,
+ r.offset, err);
+ if (err <= 0)
+ break;
+
+ if (err == size) {
+ r.offset = 0;
+ nr--;
+ } else {
+ r.offset += err;
+ }
+
+ r.size -= err;
+ r.sector += to_sector(err);
+
+ if (!r.size)
+ break;
+ }
+
+ if (!bio->bi_vcnt) {
+ err = -ENOMEM;
+ goto err_out_put;
+ }
+
+ req->size = req->orig_size = bio->bi_size;
+ req->start = bio->bi_sector;
+ req->idx = 0;
+ req->num = bio->bi_vcnt;
+
+ dprintk("%s: submitting: bio: %p, req: %p, start: %llu, "
+ "size: %llu, idx: %d, num: %d, offset: %u, err: %d.\n",
+ __func__, bio, req, req->start, req->size,
+ req->idx, req->num, req->offset, err);
+
+ err = kst_enqueue_req(st, req);
+ if (err)
+ goto err_out_put;
+
+ if (r.cmd == DST_READ) {
+ generic_make_request(bio);
+ }
+ }
+
+ kst_wake(st);
+ return 0;
+
+err_out_put:
+ bio_put(bio);
+err_out_free_req:
+ dprintk("%s: free req: %p, pool: %p.\n",
+ __func__, req, st->w->req_pool);
+ mempool_free(req, st->w->req_pool);
+err_out_exit:
+ dprintk("%s: error: %d.\n", __func__, err);
+ return err;
+}
+
+static void kst_export_exit(struct kst_state *st)
+{
+ struct dst_node *n = st->node;
+
+ dprintk("%s: st: %p.\n", __func__, st);
+
+ kst_common_exit(st);
+ dst_node_put(n);
+}
+
+static struct kst_state_ops kst_data_export_ops = {
+ .init = &kst_data_init,
+ .push = &kst_data_push,
+ .exit = &kst_export_exit,
+ .ready = &kst_export_ready,
+};
+
+/*
+ * This callback is invoked each time listening socket for
+ * given local export node becomes ready.
+ * It creates new state for connected client and queues for processing.
+ */
+static int kst_listen_ready(struct kst_state *st)
+{
+ struct socket *newsock;
+ struct saddr addr;
+ struct kst_state *newst;
+ int err;
+ unsigned int revents;
+
+ revents = st->socket->ops->poll(NULL, st->socket, NULL);
+ if (!(revents & POLLIN))
+ return 1;
+
+ err = sock_create(st->socket->ops->family, st->socket->type,
+ st->socket->sk->sk_protocol, &newsock);
+ if (err)
+ goto err_out_exit;
+
+ err = st->socket->ops->accept(st->socket, newsock, 0);
+ if (err)
+ goto err_out_put;
+
+ if (newsock->ops->getname(newsock, (struct sockaddr *)&addr,
+ (int *)&addr.sa_data_len, 2) < 0) {
+ err = -ECONNABORTED;
+ goto err_out_put;
+ }
+
+ if (st->socket->ops->family == AF_INET) {
+ struct sockaddr_in *sin = (struct sockaddr_in *)&addr;
+ printk("%s: Client: %u.%u.%u.%u:%d.\n", __func__,
+ NIPQUAD(sin->sin_addr.s_addr), ntohs(sin->sin_port));
+ } else if (st->socket->ops->family == AF_INET6) {
+ struct sockaddr_in6 *sin = (struct sockaddr_in6 *)&addr;
+ printk("%s: Client: %04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x:%d.\n",
+ __func__, NIP6(sin->sin6_addr), ntohs(sin->sin6_port));
+ }
+
+ atomic_inc(&st->node->refcnt);
+ newst = kst_state_init(st->w, st->node, &kst_data_export_ops, newsock);
+ if (IS_ERR(newst)) {
+ err = PTR_ERR(newst);
+ goto err_out_put;
+ }
+
+ return 0;
+
+err_out_put:
+ dst_node_put(st->node);
+ sock_release(newsock);
+err_out_exit:
+ return err;
+}
+
+static int kst_listen_init(struct kst_state *st, void *data)
+{
+ int err;
+ struct dst_local_export_ctl *le = data;
+
+ err = kst_sock_create(st, &le->rctl.addr, le->rctl.type,
+ le->rctl.proto, le->backlog);
+ if (err)
+ goto err_out_exit;
+
+ err = kst_poll_init(st);
+ if (err)
+ goto err_out_release;
+
+ return 0;
+
+err_out_release:
+ kst_sock_release(st);
+err_out_exit:
+ return err;
+}
+
+/*
+ * Operations for different types of states.
+ * There are three:
+ * data state - created for remote node, when distributed storage connects
+ * to remote node, which contains data.
+ * listen state - created for local export node, when remote distributed
+ * storage's node connects to given node to get/put data.
+ * data export state - created for each client connected to above listen
+ * state.
+ */
+static struct kst_state_ops kst_listen_ops = {
+ .init = &kst_listen_init,
+ .exit = &kst_common_exit,
+ .ready = &kst_listen_ready,
+};
+static struct kst_state_ops kst_data_ops = {
+ .init = &kst_data_init,
+ .push = &kst_data_push,
+ .exit = &kst_common_exit,
+ .recovery = &kst_data_recovery,
+};
+
+struct kst_state *kst_listener_state_init(struct kst_worker *w,
+ struct dst_node *node, struct dst_local_export_ctl *le)
+{
+ return kst_state_init(w, node, &kst_listen_ops, le);
+}
+
+struct kst_state *kst_data_state_init(struct kst_worker *w,
+ struct dst_node *node, struct socket *newsock)
+{
+ return kst_state_init(w, node, &kst_data_ops, newsock);
+}
+
+/*
+ * Remove all workers and associated states.
+ */
+void kst_exit_all(void)
+{
+ struct kst_worker *w, *n;
+
+ list_for_each_entry_safe(w, n, &kst_worker_list, entry) {
+ kst_worker_exit(w);
+ }
+}
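
A note on the wire format used above: kst_data_process_bio() fills
struct dst_remote_request with cpu_to_be32()/cpu_to_be64() before sending,
and kst_convert_header() reverses that on the exporting side. Below is a
minimal userspace sketch of that round trip; it is not part of the patch.
The names wire_hdr, wire_hdr_encode() and wire_hdr_decode() are hypothetical,
and the sketch assumes the header occupies 24 bytes with no padding, in the
field order of struct dst_remote_request.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct dst_remote_request. */
struct wire_hdr {
	uint32_t cmd;		/* DST_WRITE (2) or DST_READ (3) per the enum in dst.h */
	uint32_t flags;
	uint64_t sector;
	uint32_t offset;
	uint32_t size;
};

/* Assumed on-wire size: 4 + 4 + 8 + 4 + 4 bytes, no padding. */
#define WIRE_HDR_LEN 24

/* Store a 32-bit value most significant byte first. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Sender side: host-order fields to big-endian bytes. */
static void wire_hdr_encode(const struct wire_hdr *h, uint8_t *buf)
{
	assert(h && buf);
	put_be32(buf, h->cmd);
	put_be32(buf + 4, h->flags);
	put_be64(buf + 8, h->sector);
	put_be32(buf + 16, h->offset);
	put_be32(buf + 20, h->size);
}

/* Receiver side: big-endian bytes back to host-order fields. */
static void wire_hdr_decode(struct wire_hdr *h, const uint8_t *buf)
{
	assert(h && buf);
	h->cmd = get_be32(buf);
	h->flags = get_be32(buf + 4);
	h->sector = get_be64(buf + 8);
	h->offset = get_be32(buf + 16);
	h->size = get_be32(buf + 20);
}
```

Encoding byte by byte, rather than casting the struct, keeps the sketch
independent of host endianness and of compiler padding; the patch itself
sends the struct in place after converting each field.
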
diff --git a/include/linux/dst.h b/include/linux/dst.h
new file mode 100644
index 0000000..b92fb55
--- /dev/null
+++ b/include/linux/dst.h
@@ -0,0 +1,282 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __DST_H
+#define __DST_H
+
+#include <linux/types.h>
+
+#define DST_NAMELEN 32
+#define DST_NAME "dst"
+#define DST_IOCTL 0xba
+
+enum {
+ DST_DEL_NODE = 0, /* Remove node with given id from storage */
+ DST_ADD_REMOTE, /* Add remote node with given id to the storage */
+ DST_ADD_LOCAL, /* Add local node with given id to the storage */
+ DST_ADD_LOCAL_EXPORT, /* Add local node with given id to the storage to be exported and used by remote peers */
+ DST_START_STORAGE, /* Array is ready and storage can be started, if there will be new nodes
+ * added to the storage, they will be checked against existing size and
+ * probably be dropped (for example in mirror format when new node has smaller
+ * size than array created) or inserted.
+ */
+ DST_STOP_STORAGE, /* Remove array and all nodes. */
+ DST_CMD_MAX
+};
+
+#define DST_CTL_FLAGS_REMOTE (1<<0)
+#define DST_CTL_FLAGS_EXPORT (1<<1)
+
+struct dst_ctl
+{
+ char st[DST_NAMELEN];
+ char alg[DST_NAMELEN];
+ __u32 flags;
+ __u64 start, size;
+};
+
+struct dst_local_ctl
+{
+ char name[DST_NAMELEN];
+};
+
+#define SADDR_MAX_DATA 128
+
+struct saddr {
+ unsigned short sa_family; /* address family, AF_xxx */
+ char sa_data[SADDR_MAX_DATA]; /* protocol address, up to SADDR_MAX_DATA bytes */
+ unsigned short sa_data_len; /* Number of bytes used in sa_data */
+};
+
+struct dst_remote_ctl
+{
+ __u16 type;
+ __u16 proto;
+ struct saddr addr;
+};
+
+struct dst_local_export_ctl
+{
+ __u32 backlog;
+ struct dst_local_ctl lctl;
+ struct dst_remote_ctl rctl;
+};
+
+
+enum {
+ DST_REMOTE_CFG = 1, /* Request remote configuration */
+ DST_WRITE, /* Writing */
+ DST_READ, /* Reading */
+ DST_NCMD_MAX,
+};
+
+struct dst_remote_request
+{
+ __u32 cmd;
+ __u32 flags;
+ __u64 sector;
+ __u32 offset;
+ __u32 size;
+};
+
+#ifdef __KERNEL__
+
+#include <linux/rbtree.h>
+#include <linux/net.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/mempool.h>
+#include <linux/device.h>
+
+//#define DST_DEBUG
+
+#ifdef DST_DEBUG
+#define dprintk(f, a...) printk(f, ##a)
+#else
+#define dprintk(f, a...) do {} while (0)
+#endif
+
+struct kst_worker
+{
+ struct list_head entry;
+
+ struct list_head state_list;
+ struct mutex state_mutex;
+
+ struct list_head ready_list;
+ spinlock_t ready_lock;
+
+ mempool_t *req_pool;
+
+ struct task_struct *thread;
+
+ wait_queue_head_t wait;
+
+ int id;
+};
+
+struct kst_state;
+struct dst_node;
+
+#define DST_REQ_HEADER_SENT (1<<0)
+#define DST_REQ_EXPORT (1<<1)
+#define DST_REQ_EXPORT_WRITE (1<<2)
+#define DST_REQ_EXPORT_READ (1<<3)
+#define DST_REQ_ALWAYS_QUEUE (1<<4)
+
+struct dst_request
+{
+ struct rb_node request_entry;
+ struct list_head request_list_entry;
+ struct bio *bio;
+ struct kst_state *state;
+
+ u32 flags;
+
+ int (*callback)(struct dst_request *, unsigned int);
+
+ u64 size, orig_size, start;
+ int idx, num;
+ u32 offset;
+};
+
+struct kst_state_ops
+{
+ int (*init)(struct kst_state *, void *);
+ int (*push)(struct dst_request *req);
+ int (*ready)(struct kst_state *);
+ int (*recovery)(struct kst_state *, int err);
+ void (*exit)(struct kst_state *);
+};
+
+#define KST_FLAG_PARTIAL (1<<0)
+
+struct kst_state
+{
+ struct list_head entry;
+ struct list_head ready_entry;
+
+ wait_queue_t wait;
+ wait_queue_head_t *whead;
+
+ struct dst_node *node;
+ struct kst_worker *w;
+ struct socket *socket;
+
+ u32 flags;
+
+ struct rb_root request_root;
+ struct mutex request_lock;
+ struct list_head request_list;
+
+ struct kst_state_ops *ops;
+};
+
+#define DST_DEFAULT_TIMEO 2000
+
+struct dst_storage;
+
+struct dst_alg_ops
+{
+ int (*add_node)(struct dst_node *n);
+ void (*del_node)(struct dst_node *n);
+ int (*remap)(struct dst_storage *st, struct bio *bio);
+ int (*error)(struct kst_state *state, int err);
+ struct module *owner;
+};
+
+struct dst_alg
+{
+ struct list_head entry;
+ char name[DST_NAMELEN];
+ atomic_t refcnt;
+ struct dst_alg_ops *ops;
+};
+
+#define DST_ST_STARTED (1<<0)
+
+struct dst_storage
+{
+ struct list_head entry;
+ char name[DST_NAMELEN];
+ struct dst_alg *alg;
+ atomic_t refcnt;
+ struct mutex tree_lock;
+ struct rb_root tree_root;
+
+ request_queue_t *queue;
+ struct gendisk *disk;
+
+ long flags;
+ u64 disk_size;
+
+ struct device device;
+};
+
+#define DST_NODE_FROZEN 0
+
+struct dst_node
+{
+ struct rb_node tree_node;
+ struct block_device *bdev;
+ struct dst_storage *st;
+ struct kst_state *state;
+
+ atomic_t refcnt;
+
+ void (*cleanup)(struct dst_node *);
+
+ long flags;
+
+ u64 start, size;
+
+ struct device device;
+};
+
+struct kst_state *kst_state_init(struct kst_worker *w, struct dst_node *node,
+ struct kst_state_ops *ops, void *data);
+void kst_state_exit(struct kst_state *st);
+
+struct kst_worker *kst_worker_init(int id);
+void kst_worker_exit(struct kst_worker *w);
+
+struct kst_state *kst_listener_state_init(struct kst_worker *w, struct dst_node *node,
+ struct dst_local_export_ctl *le);
+struct kst_state *kst_data_state_init(struct kst_worker *w, struct dst_node *node,
+ struct socket *newsock);
+
+void kst_exit_all(void);
+
+struct dst_alg *dst_alloc_alg(char *name, struct dst_alg_ops *ops);
+void dst_remove_alg(struct dst_alg *alg);
+
+struct dst_node *dst_storage_tree_search(struct dst_storage *st, u64 start);
+
+void dst_node_put(struct dst_node *n);
+
+extern struct kmem_cache *dst_request_cache;
+
+static inline sector_t to_sector(unsigned long n)
+{
+ return (n >> 9);
+}
+
+static inline unsigned long to_bytes(sector_t n)
+{
+ return (n << 9);
+}
+
+#endif /* __KERNEL__ */
+#endif /* __DST_H */
--
Evgeniy Polyakov
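The to_sector()/to_bytes() helpers at the end of the header assume 512-byte sectors (a shift by 9). A minimal userspace sketch of the same arithmetic follows. The names bytes_to_sectors() and sectors_to_bytes() are hypothetical, and fixed 64-bit types are used here (an assumption; the patch uses sector_t and unsigned long) so that the shift cannot overflow on 32-bit builds:

```c
#include <stdint.h>

/* Hypothetical userspace stand-ins for the patch's to_sector()/to_bytes().
 * A shift of 9 means 512-byte sectors, as in the patch. */
#define SECTOR_SHIFT_SKETCH 9

static inline uint64_t bytes_to_sectors(uint64_t bytes)
{
	/* Rounds down: a partial trailing sector is not counted. */
	return bytes >> SECTOR_SHIFT_SKETCH;
}

static inline uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SECTOR_SHIFT_SKETCH;
}
```

With these types, 0x800000 sectors convert to exactly 4 GiB, a value that a 32-bit unsigned long could not hold.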
* Re: RFC: on [ab]use of skb->cb by VLAN code
From: Rick Jones @ 2007-07-31 16:56 UTC (permalink / raw)
To: Ben Greear; +Cc: David Miller, hadi, kaber, netdev, mcarlson
In-Reply-To: <46AEC99E.10809@candelatech.com>
> Do we really need an 'unsigned int' for mac_len? Maybe we could use
> a 16-bit counter here, and then use the other 16 bits for the VLAN bits?
Not knowing exactly if/how it interacts with that specific field, I will
point out that IPoIB in OFED 1.2 just took their MTU to 65520. While that
doesn't break the bitbank it does get rather close.
rick jones
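To put a number on "rather close": a 16-bit unsigned field tops out at 65535, so a length of 65520 leaves 15 to spare. A small sketch of that check, with hypothetical helper names, and assuming (as the message itself leaves open) that the length in question would ever be stored in such a field:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if len can be stored in a 16-bit unsigned field. */
static bool fits_in_u16(unsigned int len)
{
	return len <= UINT16_MAX;
}

/* Headroom left in a 16-bit field after storing len (0 if it does not fit). */
static unsigned int u16_headroom(unsigned int len)
{
	return fits_in_u16(len) ? UINT16_MAX - len : 0;
}
```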
* Disabling timestamps on AF_PACKET sockets
From: Unai Uribarri @ 2007-07-31 16:24 UTC (permalink / raw)
To: netdev
Hello,
I want to capture huge amounts of packets without timestamps, since the
machine the program is running on has a very slow clock that only yields
200,000 timestamps per second and uses 70% of CPU. But tpacket_rcv
reenables the timestamps every time it receives a packet at af_packet.c:643
	if (skb->tstamp.tv64 == 0) {
		__net_timestamp(skb);
		sock_enable_timestamp(sk);
	}
I suppose that a patch that just removes those four lines won't be
accepted, since it breaks a userspace interface. Is that right?
So I've tried enabling timestamps when setting up the ring (so as not to
affect other programs) and disabling them later from user space by
clearing the SO_TIMESTAMP option. But it doesn't work, since timestamps
can't be disabled until the socket is closed.
If enabling the SO_TIMESTAMP socket option sets SOCK_RCVTSTAMP and calls
sock_enable_timestamp, why does disabling it just clear SOCK_RCVTSTAMP
and not call sock_disable_timestamp?
Thanks.
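For reference, the userspace side of the experiment described above can be sketched as follows. The helper names set_rx_timestamps() and get_rx_timestamps() are hypothetical; setsockopt() and getsockopt() with SOL_SOCKET and SO_TIMESTAMP are the standard socket calls. The sketch only shows that the per-socket flag can be set and cleared. It cannot show whether the kernel keeps stamping packets afterwards, which is the problem raised in this message.

```c
#include <sys/socket.h>

/* Set or clear SO_TIMESTAMP on fd; returns 0 on success, -1 on error. */
static int set_rx_timestamps(int fd, int enable)
{
	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &enable, sizeof(enable));
}

/* Read back the SO_TIMESTAMP flag; returns 0 or 1, or -1 on error. */
static int get_rx_timestamps(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0)
		return -1;
	return val != 0;
}
```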
* Re: [patch] genirq: temporary fix for level-triggered IRQ resend
From: Ingo Molnar @ 2007-07-31 16:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jarek Poplawski, Thomas Gleixner, Jean-Baptiste Vignaud,
linux-kernel, shemminger, linux-net, netdev, Andrew Morton,
Alan Cox, marcin.slusarz
In-Reply-To: <20070731155843.GA7033@elte.hu>
* Ingo Molnar <mingo@elte.hu> wrote:
> Linus,
>
> with -rc2 approaching i think we should apply the minimal fix below to
> get Marcin's ne2k-pci networking back in working order. The
> WARN_ON_ONCE() will not prevent the system from working and it will be
> a reminder.
there's one more test-patch that Marcin has not tested yet (see below) -
perhaps a POST artifact in ne2k could explain this bug.
Ingo
------------------------->
* Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> Ok the logic behind the 8390 is very simple:
thanks for the explanation Alan! A few comments and a question:
> Things to know
> - IRQ delivery is asynchronous to the PCI bus
> - Blocking the local CPU IRQ via spin locks was too slow
> - The chip has register windows needing locking work
>
> So the path was once (I say once as people appear to have changed it
> in the mean time and it now looks rather bogus if the changes to use
> disable_irq_nosync_irqsave are disabling the local IRQ)
>
>
> Take the page lock
> Mask the IRQ on chip
> Disable the IRQ (but not mask locally- someone seems to have
> broken this with the lock validator stuff)
> [This must be _nosync as the page lock may otherwise
> deadlock us]
( side-note: you can ignore the lock validator stuff here, the validator
changes are supposed to be a NOP in the !lockdep case. Local irqs will
only be disabled if the validator is running. This could cause dropped
serial irqs on very old boxes but i doubt anyone will want to run the
validator on those. )
> Drop the page lock and turn IRQs back on
>
> At this point an existing IRQ may still be running but we can't
> get a new one
>
> Take the lock (so we know the IRQ has terminated) but don't mask
> the IRQs on the processor
> Set irqlock [for debug]
>
> Transmit (slow as ****)
>
> re-enable the IRQ
>
>
> We have to use disable_irq because otherwise you will get delayed
> interrupts on the APIC bus deadlocking the transmit path.
>
> Quite hairy but the chip simply wasn't designed for SMP and you can't
> even ACK an interrupt without risking corrupting other parallel
> activities on the chip.
So the whole locking is to be able to keep irqs enabled for a long time,
without risking entry of the same IRQ handler on this same CPU, correct?
Marcin's test results suggest that if an IRQ is resent right at the
enable_irq() point [be that via the hw irq-resend mechanism or the sw
irq-resend mechanism], the hang happens.
In the previous 2.6.20 logic we'd not normally generate an IRQ at that
point (because we masked the irq and the card itself deasserts the line
so any level-triggered irq is now moot).
Once Thomas hacked off this resend mechanism for level-triggered irqs,
Marcin saw the hangs go away.
So it seems to me that maybe the driver could be surprised by these
spurious interrupts that happen right after the enable_irq(). Does the
patch below make any sense in your opinion?
Ingo
Index: linux/drivers/net/lib8390.c
===================================================================
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff
/* Turn 8390 interrupts back on. */
ei_local->irqlock = 0;
ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+ /* force POST: */
+ ei_inb_p(e8390_base + EN0_IMR);
spin_unlock(&ei_local->page_lock);
enable_irq_lockdep_irqrestore(dev->irq, &flags);
* Re: [PATCH net-2.6 1/2] [TCP]: Fix ratehalving with bidirectional flows
From: Ilpo Järvinen @ 2007-07-31 15:59 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: David Miller, Netdev
In-Reply-To: <20070731143726.3fecfe86@oldman.hamilton.local>
On Tue, 31 Jul 2007, Stephen Hemminger wrote:
> I noticed no difference in the two flow tests. That is not a bad thing, just
> that this test doesn't hit that code.
...I'm not too sure about your test setup but the bugs I fixed only cover
cases that occur if flow is bidirectional (and obviously active in both
directions at the same time), they won't occur in a case of unidirectional
transfer or in request-reply style connections (well, in the latter
case if there's some overlap, it can have effect but that's usually
not significant)...
In the case of bidirectional transfers, you *should* see some difference,
as previously the fast recovery was _very_ broken. Of course there could
be some other issue with large-cwnd TCP that hides it by still going to
RTO, but at least over a 384k/200ms link (DBP sized buffers, IIRC), these
fixes change behavior very dramatically, mainly in the initial slow-start
overshoot recovery, because there the number of losses per RTT is so high
compared to what is experienced later on. One or a few losses are usually
recovered without RTO when congestion happens later on.
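As a rough worked figure for that link, assuming "384k" means 384 kbit/s, "200ms" is the round-trip time, and the buffers are sized to the delay-bandwidth product: 384000 bit/s * 0.2 s / 8 = 9600 bytes, or a little over six 1500-byte packets. The helper name below is hypothetical:

```c
#include <stdint.h>

/* Delay-bandwidth product in bytes for a link of rate_bps bits per second
 * and a round-trip time of rtt_ms milliseconds (integer arithmetic,
 * rounds down). */
static uint64_t dbp_bytes(uint64_t rate_bps, uint64_t rtt_ms)
{
	return rate_bps * rtt_ms / 1000 / 8;
}
```

With so few packets in flight, losing several per RTT during the slow-start overshoot is a large fraction of the window.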
> The anomaly is that first flow does slow start then gets loss and ends up
> reducing its window size all the way to the bottom, finally it recovers.
> This happens with Cubic, H-TCP and others as well; if the queue in the
> network is large enough, they don't handle the initial loss well.
...TCP related stuff that changed in /proc/net/netstat might shed
some light on this if none of the given explanations please you... :-)
> See the graph.
What exactly do you mean by "RENO" in the title, I mean what's tcp_sack
set to? There is occasionally a bit of confusion in that respect in the
terminology @ netdev; I'm used to reno referring to non-SACK stuff
elsewhere but here that's not always the case... Usually it's possible
to derive the correct interpretation from the context, but in this case
I'm not too sure... :-)
What I have often seen with non-SACK TCP is that initial slow-start
exhausts even a very large advertised window on a high DBP link and then,
due to draining of ACK feedback, gets RTOed... That usually shows up as a
long lasting recovery where one segment per RTT is recovered and new data
is being sent as duplicate ACKs arrive at a nearly constant rate until the
window limit is hit (but I cannot see such a period in the graph you
posted, so I guess it's not the explanation in this case). And if your
"RENO" refers to something with SACK, that's not going to explain it
anyway.
...Another nasty one I know is RED+ECN, though I'd say it's a bit
far-fetched, as ECN cannot be used nicely in retransmissions: a
retransmission gets dropped instead of marked if RED wanted to mark.
I guess that doesn't occur in your test case either?
--
i.
* [patch] genirq: temporary fix for level-triggered IRQ resend
From: Ingo Molnar @ 2007-07-31 15:58 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jarek Poplawski, Thomas Gleixner, Linus Torvalds,
Jean-Baptiste Vignaud, linux-kernel, shemminger, linux-net,
netdev, Andrew Morton, Alan Cox, marcin.slusarz
In-Reply-To: <4bacf17f0707300029g5116e70bq4808059dc8b069f1@mail.gmail.com>
Linus,
with -rc2 approaching i think we should apply the minimal fix below to
get Marcin's ne2k-pci networking back in working order. The
WARN_ON_ONCE() will not prevent the system from working and it will be a
reminder.
a better workaround would be to inhibit the resent vector via the
IO-APIC irqchip - but i'd still like to have the patch below because the
ne2k driver _should_ be able to survive the spurious irq that happens.
(even on Marcin's system that ne2k-pci irq line is shared with another
networking card, so an irq could happen at any moment - it's just that
with the delayed-disable logic it happens _all the time_.)
Ingo
----------------------->
From: Thomas Gleixner <tglx@linutronix.de>
Subject: genirq: temporary fix for level-triggered IRQ resend
delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.
On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()).
Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.
I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:
17: 12 IO-APIC-fasteoi eth1, eth0
Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.
The code in question, lib8390.c, does:
	disable_irq();
	fiddle_with_the_network_card_hardware();
	enable_irq();
The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again.
Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.
So the difference is that we can get a resend on enable_irq when an
interrupt happens during the time we are in the disabled region.
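The delayed-disable and resend behaviour described above can be illustrated with a toy userspace model. Everything in it (struct irq_model, the model_*() functions, the skip_level_resend switch) is hypothetical and greatly simplified; it is not the kernel's genirq code, only a sketch of the sequence: a software-only disable, a hardware mask applied after the interrupt has already fired, and a replay on enable that the temporary fix suppresses for anything that is not edge type.

```c
#include <stdbool.h>

/* Toy model of delayed disable; all names are hypothetical. */
struct irq_model {
	bool level_triggered;	/* true: level type, false: edge type */
	bool soft_disabled;	/* disable was requested in software */
	bool hw_masked;		/* line is actually masked in hardware */
	bool pending;		/* an interrupt arrived while soft-disabled */
	int handled;		/* number of times the handler ran */
	int resent;		/* number of replays done on enable */
};

static void model_disable(struct irq_model *m)
{
	/* Delayed disable: only note it in software, leave hardware alone. */
	m->soft_disabled = true;
}

static void model_hw_interrupt(struct irq_model *m)
{
	if (m->hw_masked)
		return;			/* masked lines deliver nothing */
	if (m->soft_disabled) {
		/* Mask for real only now, after the interrupt occurred. */
		m->hw_masked = true;
		m->pending = true;
		return;
	}
	m->handled++;
}

static void model_enable(struct irq_model *m, bool skip_level_resend)
{
	m->soft_disabled = false;
	m->hw_masked = false;
	/* Temporary fix: return early for anything that is not edge type. */
	if (skip_level_resend && m->level_triggered)
		return;
	if (!m->pending)
		return;
	m->pending = false;
	m->resent++;
	m->handled++;		/* the replayed interrupt reaches the handler */
}
```

In this model a level-triggered line that fires inside the disabled region is replayed once on enable without the fix and not at all with it, while an edge-triggered line is replayed in both cases.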
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/irq/resend.c | 9 +++++++++
1 file changed, 9 insertions(+)
Index: linux/kernel/irq/resend.c
===================================================================
--- linux.orig/kernel/irq/resend.c
+++ linux/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *d
*/
desc->chip->enable(irq);
+ /*
+ * Temporary hack to figure out more about the problem, which
+ * is causing the ancient network cards to die.
+ */
+ if (desc->handle_irq != handle_edge_irq) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+
if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
* Re: NETPOLL=y , NETDEVICES=n compile error ( Re: 2.6.23-rc1-mm1 )
From: Gabriel C @ 2007-07-31 15:05 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Andrew Morton, linux-kernel, netdev, jason.wessel, amitkale
In-Reply-To: <20070731121735.GA1046@ff.dom.local>
Jarek Poplawski wrote:
> On Tue, Jul 31, 2007 at 12:14:36PM +0200, Gabriel C wrote:
>> Jarek Poplawski wrote:
>>> On 28-07-2007 20:42, Gabriel C wrote:
>>>> Andrew Morton wrote:
>>>>> On Sat, 28 Jul 2007 17:44:45 +0200 Gabriel C <nix.or.die@googlemail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I got this compile error with a randconfig ( http://194.231.229.228/MM/randconfig-auto-82.broken.netpoll.c ).
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> net/core/netpoll.c: In function 'netpoll_poll':
>>>>>> net/core/netpoll.c:155: error: 'struct net_device' has no member named 'poll_controller'
>>>>>> net/core/netpoll.c:159: error: 'struct net_device' has no member named 'poll_controller'
>>>>>> net/core/netpoll.c: In function 'netpoll_setup':
>>>>>> net/core/netpoll.c:670: error: 'struct net_device' has no member named 'poll_controller'
>>>>>> make[2]: *** [net/core/netpoll.o] Error 1
>>>>>> make[1]: *** [net/core] Error 2
>>>>>> make: *** [net] Error 2
>>>>>> make: *** Waiting for unfinished jobs....
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> I think it is because KGDBOE selects just NETPOLL.
>>>>>>
>>>>> Looks like it.
>>>>>
>>>>> Select went and selected NETPOLL and NETPOLL_TRAP but things like
>>>>> CONFIG_NETDEVICES and CONFIG_NET_POLL_CONTROLLER remain unset. `select'
>>>>> remains evil.
>>> ...
>>>> I think there may be a logical issue ( again if I got it right ).
>>>> We need some ethernet card to work with kgdboe right ? but we don't have any if !NETDEVICES && !NET_ETHERNET.
>>>>
>>>> So maybe some ' depends on ... && NETDEVICES!=n && NET_ETHERNET!=n ' is needed too ?
>>> IMHO, the only logical issue here is netpoll.c mustn't use
>>> CONFIG_NET_POLL_CONTROLLER code without #ifdef if it doesn't
>>> add this dependency itself.
>>>
>> Well it does if NETDEVICES && if NET_ETHERNET, which both are N when !NETDEVICES, which is why KGDBOE uses select and not depends on.
>
> "does if XXX" means may "use if XXX".
From what I know, "if XXX" means only this: with !XXX everything inside the "if XXX" block is n, and "depends on <something inside the if XXX>"
does not work.
...
menuconfig FOO
	bool "FOO"
	depends on BAR
	default y
	---help---
	  something
if FOO
config BAZ
	depends on WHATEVR && !NOT_THIS
menuconfig SOMETHING_ELSE
	....
if SOMETHING_ELSE
config BLUBB
	depends on PCI && WHATNOT
endif # SOMETHING_ELSE
config NETPOLL
	def_bool NETCONSOLE
config NETPOLL_TRAP
	bool "Netpoll traffic trapping"
	default n
	depends on NETPOLL
config NET_POLL_CONTROLLER
	def_bool NETPOLL
endif # FOO
Now if you set FOO=n all is gone and your driver has to select whatever it needs from there.
>
>> Now KGDBOE just selects NETPOLL and NETPOLL_TRAP.
>> Adding 'select CONFIG_NET_POLL_CONTROLLER' lets kgdboe compile, but the question is: does it work without any ethernet card?
>
> Why kgdboe should care what netpoll needs? So, I hope, you are adding
> this select under config NETPOLL. On the other hand, if NETPOLL should
> depend on NET_POLL_CONTROLLER there is probably no reason to have them
> both.
NET_POLL_CONTROLLER has def_bool NETPOLL if NETDEVICES.
Net people, ping? :)
>
> The "does it work" question isn't logical issue, so it's irrelevant
> here...
Right, irrelevant for the compile error but relevant for the fix in my opinion.
>
> Jarek P.
>
Gabriel
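Jarek's point about guarding the poll_controller accesses can be sketched in plain C. The structure and function below are hypothetical stand-ins (struct net_device_sketch, netpoll_poll_sketch()), not the real net/core/netpoll.c code; the only thing taken from the thread is that the member exists only when CONFIG_NET_POLL_CONTROLLER is set, so every use of it needs the same #ifdef:

```c
#include <stddef.h>

/* Hypothetical stand-in for struct net_device: poll_controller is only
 * present when CONFIG_NET_POLL_CONTROLLER is defined. */
struct net_device_sketch {
	int polls;
#ifdef CONFIG_NET_POLL_CONTROLLER
	void (*poll_controller)(struct net_device_sketch *dev);
#endif
};

/* Returns 1 if the device was polled, 0 if polling is unavailable. */
static int netpoll_poll_sketch(struct net_device_sketch *dev)
{
#ifdef CONFIG_NET_POLL_CONTROLLER
	if (dev && dev->poll_controller) {
		dev->poll_controller(dev);
		return 1;
	}
#else
	(void)dev;	/* the member does not exist in this configuration */
#endif
	return 0;
}
```

Built without the macro, this compiles and simply reports that no polling happened, instead of failing with "has no member named 'poll_controller'". Whether that is the right fix, as opposed to a Kconfig dependency, is exactly what the thread is debating.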