* TSO and IPoIB performance degradation
From: Michael S. Tsirkin @ 2006-03-06 22:34 UTC (permalink / raw)
To: netdev, openib-general, Linux Kernel Mailing List,
David S. Miller, Matt Leininger
Hello, Dave!
As you might know, the TSO patches merged into mainline kernel
since 2.6.11 have hurt performance for the simple (non-TSO)
high-speed netdevice that is IPoIB driver.
This was discussed at length here
http://openib.org/pipermail/openib-general/2005-October/012271.html
I'm trying to figure out what can be done to improve the situation.
In partucular, I'm looking at the Super TSO patch
http://oss.sgi.com/archives/netdev/2005-05/msg00889.html
merged into mainline here
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc
There, you said:
When we do ucopy receive (ie. copying directly to userspace
during tcp input processing) we attempt to delay the ACK
until cleanup_rbuf() is invoked. Most of the time this
technique works very well, and we emit one ACK advertising
the largest window.
But this explodes if the ucopy prequeue is large enough.
When the receiver is cpu limited and TSO frames are large,
the receiver is inundated with ucopy processing, such that
the ACK comes out very late. Often, this is so late that
by the time the sender gets the ACK the window has emptied
too much to be kept full by the sender.
The existing TSO code mostly avoided this by keeping the
TSO packets no larger than 1/8 of the available window.
But with the new code we can get much larger TSO frames.
So I'm trying to get a handle on it: could a solution be to simply
look at the frame size, and call tcp_send_delayed_ack from
if the frame size is no larger than 1/8?
Does this make sense?
Thanks,
--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
^ permalink raw reply
* Re: Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: David S. Miller @ 2006-03-06 22:39 UTC (permalink / raw)
To: rdreier; +Cc: linux-kernel, openib-general, netdev
In-Reply-To: <aday7zn432b.fsf@cisco.com>
From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 06 Mar 2006 14:32:28 -0800
> The fundamental question seems to be whether things like
>
> struct foo {
> struct sockaddr_in6 src;
> struct sockaddr_in6 dst;
> };
>
> and
>
> struct bar {
> struct sockaddr_in6 a;
> __u32 b;
> };
I wrote a test program and it looks ok:
davem@sunset:~/src/GIT/sparc-2.6.17$ gcc -m32 -O -o foo foo.c
davem@sunset:~/src/GIT/sparc-2.6.17$ ./foo
SPARC32
foo src: 0
foo dst: 28
bar a: 0
bar b: 28
davem@sunset:~/src/GIT/sparc-2.6.17$ gcc -m64 -O -o foo foo.c
davem@sunset:~/src/GIT/sparc-2.6.17$ ./foo
SPARC64
foo src: 0
foo dst: 28
bar a: 0
bar b: 28
^ permalink raw reply
* Re: TSO and IPoIB performance degradation
From: David S. Miller @ 2006-03-06 22:40 UTC (permalink / raw)
To: mst; +Cc: netdev, linux-kernel, openib-general
In-Reply-To: <20060306223438.GA18277@mellanox.co.il>
From: "Michael S. Tsirkin" <mst@mellanox.co.il>
Date: Tue, 7 Mar 2006 00:34:38 +0200
> So I'm trying to get a handle on it: could a solution be to simply
> look at the frame size, and call tcp_send_delayed_ack from
> if the frame size is no larger than 1/8?
>
> Does this make sense?
The comment you mention is very old, and no longer applies.
Get full packet traces from the kernel TSO code in the 2.6.x
kernel, analyze is, and post here what you think is occuring
that is causing the performance problems.
One thing to note is that the newer TSO code really needs to
have large socket buffers, so you can experiment with that.
^ permalink raw reply
* Re: Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: Roland Dreier @ 2006-03-06 22:41 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, openib-general, netdev
In-Reply-To: <20060306.143901.26500391.davem@davemloft.net>
David> I wrote a test program and it looks ok:
Cool, thanks.
I should look into getting some niagara machines to test with -- with
PCIe slots they should actually be good for IB testing.
- R.
^ permalink raw reply
* Re: TSO and IPoIB performance degradation
From: Stephen Hemminger @ 2006-03-06 22:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: netdev, Linux Kernel Mailing List, openib-general,
David S. Miller
In-Reply-To: <20060306223438.GA18277@mellanox.co.il>
On Tue, 7 Mar 2006 00:34:38 +0200
"Michael S. Tsirkin" <mst@mellanox.co.il> wrote:
> Hello, Dave!
> As you might know, the TSO patches merged into mainline kernel
> since 2.6.11 have hurt performance for the simple (non-TSO)
> high-speed netdevice that is IPoIB driver.
>
> This was discussed at length here
> http://openib.org/pipermail/openib-general/2005-October/012271.html
>
> I'm trying to figure out what can be done to improve the situation.
> In partucular, I'm looking at the Super TSO patch
> http://oss.sgi.com/archives/netdev/2005-05/msg00889.html
>
> merged into mainline here
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=314324121f9b94b2ca657a494cf2b9cb0e4a28cc
>
> There, you said:
>
> When we do ucopy receive (ie. copying directly to userspace
> during tcp input processing) we attempt to delay the ACK
> until cleanup_rbuf() is invoked. Most of the time this
> technique works very well, and we emit one ACK advertising
> the largest window.
>
> But this explodes if the ucopy prequeue is large enough.
> When the receiver is cpu limited and TSO frames are large,
> the receiver is inundated with ucopy processing, such that
> the ACK comes out very late. Often, this is so late that
> by the time the sender gets the ACK the window has emptied
> too much to be kept full by the sender.
>
> The existing TSO code mostly avoided this by keeping the
> TSO packets no larger than 1/8 of the available window.
> But with the new code we can get much larger TSO frames.
>
> So I'm trying to get a handle on it: could a solution be to simply
> look at the frame size, and call tcp_send_delayed_ack from
> if the frame size is no larger than 1/8?
>
> Does this make sense?
>
> Thanks,
>
>
More likely you are getting hit by the fact that TSO prevents the congestion
window from increasing properly. This was fixed in 2.6.15 (around mid of Nov 2005).
^ permalink raw reply
* Re: Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: David S. Miller @ 2006-03-06 22:50 UTC (permalink / raw)
To: rdreier; +Cc: linux-kernel, openib-general, netdev
In-Reply-To: <adau0ab42ni.fsf@cisco.com>
From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 06 Mar 2006 14:41:21 -0800
> I should look into getting some niagara machines to test with -- with
> PCIe slots they should actually be good for IB testing.
You'll be cpu limited until we have Van Jacobson net channels.
Also, since our existing Linux "generic" MSI code is so riddled with
x86'isms (it was written by an Intel person, so this is just the
status quo), it will be a while before MSI interrupts are supported on
sparc64.
I haven't gotten around to working on that problem yet, and PPC needs
this work too as they now have MSI capable PCI controllers.
^ permalink raw reply
* Re: [PATCH, RESEND] Add MWI workaround for Tulip DC21143
From: Martin Michlmayr @ 2006-03-06 22:51 UTC (permalink / raw)
To: netdev, linux-mips
In-Reply-To: <20060218220851.GA1601@colonel-panic.org>
* Peter Horton <pdh@colonel-panic.org> [2006-02-18 22:08]:
> This patch works around the MWI bug on the DC21143 rev 65 Tulip by
> ensuring that the receive buffers don't end on a cache line boundary (as
> documented in the errata).
>
> This patch is required for the MIPs based Cobalt Qube/RaQ as supporting
> the extra PCI commands seems to reduce the chance of a hard lockup
> between the Tulip and the PCI bridge.
Does anyone have comments regarding this patch? I received
confirmation from a number of Debian users that this patch
significantly improves the lockup situation on Cobalt, so
it would be nice if it could go in.
--
Martin Michlmayr
tbm@cyrius.com
^ permalink raw reply
* Re: [PATCH, RESEND] Add MWI workaround for Tulip DC21143
From: Francois Romieu @ 2006-03-06 23:15 UTC (permalink / raw)
To: Martin Michlmayr; +Cc: netdev, linux-mips
In-Reply-To: <20060306225131.GA23327@unjust.cyrius.com>
Martin Michlmayr <tbm@cyrius.com> :
[...]
> Does anyone have comments regarding this patch? I received
> confirmation from a number of Debian users that this patch
> significantly improves the lockup situation on Cobalt, so
> it would be nice if it could go in.
I'll queue it with the pending de2104x fix(es ?) during my next
upkeep.
--
Ueimor
^ permalink raw reply
* [PATCH 2/6 v2] IB: match connection requests based on private data
From: Sean Hefty @ 2006-03-06 23:29 UTC (permalink / raw)
To: Hefty, Sean, 'Roland Dreier', netdev, linux-kernel; +Cc: openib-general
In-Reply-To: <ORSMSX401FRaqbC8wSA00000005@orsmsx401.amr.corp.intel.com>
Extend matching connection requests to listens in the Infiniband CM to include
private data checks.
This allows applications to listen on the same service identifier, with private
data directing the request to the appropriate application.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
---
This should be the correct patch that incorporates feedback from the initial
submission. Sorry about the mis-post.
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/drivers/infiniband/core/cm.c
linux-2.6.ib/drivers/infiniband/core/cm.c
--- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.000000000 -0800
@@ -32,7 +32,7 @@
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*
- * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $
+ * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $
*/
#include <linux/dma-mapping.h>
#include <linux/err.h>
@@ -130,6 +130,7 @@ struct cm_id_private {
/* todo: use alternate port on send failure */
struct cm_av av;
struct cm_av alt_av;
+ struct ib_cm_compare_data *compare_data;
void *private_data;
__be64 tid;
@@ -355,6 +356,41 @@ static struct cm_id_private * cm_acquire
return cm_id_priv;
}
+static void cm_mask_copy(u8 *dst, u8 *src, u8 *mask)
+{
+ int i;
+
+ for (i = 0; i < IB_CM_COMPARE_SIZE / sizeof(unsigned long); i++)
+ ((unsigned long *) dst)[i] = ((unsigned long *) src)[i] &
+ ((unsigned long *) mask)[i];
+}
+
+static int cm_compare_data(struct ib_cm_compare_data *src_data,
+ struct ib_cm_compare_data *dst_data)
+{
+ u8 src[IB_CM_COMPARE_SIZE];
+ u8 dst[IB_CM_COMPARE_SIZE];
+
+ if (!src_data || !dst_data)
+ return 0;
+
+ cm_mask_copy(src, src_data->data, dst_data->mask);
+ cm_mask_copy(dst, dst_data->data, src_data->mask);
+ return memcmp(src, dst, IB_CM_COMPARE_SIZE);
+}
+
+static int cm_compare_private_data(u8 *private_data,
+ struct ib_cm_compare_data *dst_data)
+{
+ u8 src[IB_CM_COMPARE_SIZE];
+
+ if (!dst_data)
+ return 0;
+
+ cm_mask_copy(src, private_data, dst_data->mask);
+ return memcmp(src, dst_data->data, IB_CM_COMPARE_SIZE);
+}
+
static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
{
struct rb_node **link = &cm.listen_service_table.rb_node;
@@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_
struct cm_id_private *cur_cm_id_priv;
__be64 service_id = cm_id_priv->id.service_id;
__be64 service_mask = cm_id_priv->id.service_mask;
+ int data_cmp;
while (*link) {
parent = *link;
cur_cm_id_priv = rb_entry(parent, struct cm_id_private,
service_node);
+ data_cmp = cm_compare_data(cm_id_priv->compare_data,
+ cur_cm_id_priv->compare_data);
if ((cur_cm_id_priv->id.service_mask & service_id) ==
(service_mask & cur_cm_id_priv->id.service_id) &&
- (cm_id_priv->id.device == cur_cm_id_priv->id.device))
+ (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
+ !data_cmp)
return cur_cm_id_priv;
if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
@@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_
link = &(*link)->rb_right;
else if (service_id < cur_cm_id_priv->id.service_id)
link = &(*link)->rb_left;
+ else if (service_id > cur_cm_id_priv->id.service_id)
+ link = &(*link)->rb_right;
+ else if (data_cmp < 0)
+ link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
}
@@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_
}
static struct cm_id_private * cm_find_listen(struct ib_device *device,
- __be64 service_id)
+ __be64 service_id,
+ u8 *private_data)
{
struct rb_node *node = cm.listen_service_table.rb_node;
struct cm_id_private *cm_id_priv;
+ int data_cmp;
while (node) {
cm_id_priv = rb_entry(node, struct cm_id_private, service_node);
+ data_cmp = cm_compare_private_data(private_data,
+ cm_id_priv->compare_data);
if ((cm_id_priv->id.service_mask & service_id) ==
cm_id_priv->id.service_id &&
- (cm_id_priv->id.device == device))
+ (cm_id_priv->id.device == device) && !data_cmp)
return cm_id_priv;
if (device < cm_id_priv->id.device)
@@ -405,6 +452,10 @@ static struct cm_id_private * cm_find_li
node = node->rb_right;
else if (service_id < cm_id_priv->id.service_id)
node = node->rb_left;
+ else if (service_id > cm_id_priv->id.service_id)
+ node = node->rb_right;
+ else if (data_cmp < 0)
+ node = node->rb_left;
else
node = node->rb_right;
}
@@ -728,15 +779,14 @@ retest:
wait_event(cm_id_priv->wait, !atomic_read(&cm_id_priv->refcount));
while ((work = cm_dequeue_work(cm_id_priv)) != NULL)
cm_free_work(work);
- if (cm_id_priv->private_data && cm_id_priv->private_data_len)
- kfree(cm_id_priv->private_data);
+ kfree(cm_id_priv->compare_data);
+ kfree(cm_id_priv->private_data);
kfree(cm_id_priv);
}
EXPORT_SYMBOL(ib_destroy_cm_id);
-int ib_cm_listen(struct ib_cm_id *cm_id,
- __be64 service_id,
- __be64 service_mask)
+int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
+ struct ib_cm_compare_data *compare_data)
{
struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
unsigned long flags;
@@ -750,7 +800,19 @@ int ib_cm_listen(struct ib_cm_id *cm_id,
return -EINVAL;
cm_id_priv = container_of(cm_id, struct cm_id_private, id);
- BUG_ON(cm_id->state != IB_CM_IDLE);
+ if (cm_id->state != IB_CM_IDLE)
+ return -EINVAL;
+
+ if (compare_data) {
+ cm_id_priv->compare_data = kzalloc(sizeof *compare_data,
+ GFP_KERNEL);
+ if (!cm_id_priv->compare_data)
+ return -ENOMEM;
+ cm_mask_copy(cm_id_priv->compare_data->data,
+ compare_data->data, compare_data->mask);
+ memcpy(cm_id_priv->compare_data->mask, compare_data->mask,
+ IB_CM_COMPARE_SIZE);
+ }
cm_id->state = IB_CM_LISTEN;
@@ -767,6 +829,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id,
if (cur_cm_id_priv) {
cm_id->state = IB_CM_IDLE;
+ kfree(cm_id_priv->compare_data);
+ cm_id_priv->compare_data = NULL;
ret = -EBUSY;
}
return ret;
@@ -1239,7 +1303,8 @@ static struct cm_id_private * cm_match_r
/* Find matching listen request. */
listen_cm_id_priv = cm_find_listen(cm_id_priv->id.device,
- req_msg->service_id);
+ req_msg->service_id,
+ req_msg->private_data);
if (!listen_cm_id_priv) {
spin_unlock_irqrestore(&cm.lock, flags);
cm_issue_rej(work->port, work->mad_recv_wc,
@@ -2646,7 +2711,8 @@ static int cm_sidr_req_handler(struct cm
goto out; /* Duplicate message. */
}
cur_cm_id_priv = cm_find_listen(cm_id->device,
- sidr_req_msg->service_id);
+ sidr_req_msg->service_id,
+ sidr_req_msg->private_data);
if (!cur_cm_id_priv) {
rb_erase(&cm_id_priv->sidr_id_node, &cm.remote_sidr_table);
spin_unlock_irqrestore(&cm.lock, flags);
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/drivers/infiniband/core/ucm.c
linux-2.6.ib/drivers/infiniband/core/ucm.c
--- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 16:03:08.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 16:03:35.000000000 -0800
@@ -646,6 +646,17 @@ out:
return result;
}
+static int ucm_validate_listen(__be64 service_id, __be64 service_mask)
+{
+ service_id &= service_mask;
+
+ if (((service_id & IB_CMA_SERVICE_ID_MASK) == IB_CMA_SERVICE_ID) ||
+ ((service_id & IB_SDP_SERVICE_ID_MASK) == IB_SDP_SERVICE_ID))
+ return -EINVAL;
+
+ return 0;
+}
+
static ssize_t ib_ucm_listen(struct ib_ucm_file *file,
const char __user *inbuf,
int in_len, int out_len)
@@ -661,7 +672,13 @@ static ssize_t ib_ucm_listen(struct ib_u
if (IS_ERR(ctx))
return PTR_ERR(ctx);
- result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask);
+ result = ucm_validate_listen(cmd.service_id, cmd.service_mask);
+ if (result)
+ goto out;
+
+ result = ib_cm_listen(ctx->cm_id, cmd.service_id, cmd.service_mask,
+ NULL);
+out:
ib_ucm_ctx_put(ctx);
return result;
}
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/include/rdma/ib_cm.h
linux-2.6.ib/include/rdma/ib_cm.h
--- linux-2.6.git/include/rdma/ib_cm.h 2006-01-16 10:26:47.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_cm.h 2006-01-16 16:03:35.000000000 -0800
@@ -32,7 +32,7 @@
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*
- * $Id: ib_cm.h 2730 2005-06-28 16:43:03Z sean.hefty $
+ * $Id: ib_cm.h 4311 2005-12-05 18:42:01Z sean.hefty $
*/
#if !defined(IB_CM_H)
#define IB_CM_H
@@ -102,7 +102,8 @@ enum ib_cm_data_size {
IB_CM_APR_INFO_LENGTH = 72,
IB_CM_SIDR_REQ_PRIVATE_DATA_SIZE = 216,
IB_CM_SIDR_REP_PRIVATE_DATA_SIZE = 136,
- IB_CM_SIDR_REP_INFO_LENGTH = 72
+ IB_CM_SIDR_REP_INFO_LENGTH = 72,
+ IB_CM_COMPARE_SIZE = 64
};
struct ib_cm_id;
@@ -238,7 +239,6 @@ struct ib_cm_sidr_rep_event_param {
u32 qpn;
void *info;
u8 info_len;
-
};
struct ib_cm_event {
@@ -317,6 +317,15 @@ void ib_destroy_cm_id(struct ib_cm_id *c
#define IB_SERVICE_ID_AGN_MASK __constant_cpu_to_be64(0xFF00000000000000ULL)
#define IB_CM_ASSIGN_SERVICE_ID __constant_cpu_to_be64(0x0200000000000000ULL)
+#define IB_CMA_SERVICE_ID __constant_cpu_to_be64(0x0000000001000000ULL)
+#define IB_CMA_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFF000000ULL)
+#define IB_SDP_SERVICE_ID __constant_cpu_to_be64(0x0000000000010000ULL)
+#define IB_SDP_SERVICE_ID_MASK __constant_cpu_to_be64(0xFFFFFFFFFFFF0000ULL)
+
+struct ib_cm_compare_data {
+ u8 data[IB_CM_COMPARE_SIZE];
+ u8 mask[IB_CM_COMPARE_SIZE];
+};
/**
* ib_cm_listen - Initiates listening on the specified service ID for
@@ -330,10 +339,12 @@ void ib_destroy_cm_id(struct ib_cm_id *c
* range of service IDs. If set to 0, the service ID is matched
* exactly. This parameter is ignored if %service_id is set to
* IB_CM_ASSIGN_SERVICE_ID.
+ * @compare_data: This parameter is optional. It specifies data that must
+ * appear in the private data of a connection request for the specified
+ * listen request.
*/
-int ib_cm_listen(struct ib_cm_id *cm_id,
- __be64 service_id,
- __be64 service_mask);
+int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
+ struct ib_cm_compare_data *compare_data);
struct ib_cm_req_param {
struct ib_sa_path_rec *primary_path;
^ permalink raw reply
* [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
From: Sean Hefty @ 2006-03-06 23:31 UTC (permalink / raw)
To: Hefty, Sean, 'Roland Dreier', linux-kernel, netdev; +Cc: openib-general
In-Reply-To: <ORSMSX401FRaqbC8wSA00000007@orsmsx401.amr.corp.intel.com>
Add an address translation service that maps IP addresses to Infiniband
GID addresses using IPoIB.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
---
This should be the correct patch. The only difference between this and the mis-post
is the use of mutex_lock/unlock in place of up/down.
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/drivers/infiniband/core/addr.c
linux-2.6.ib/drivers/infiniband/core/addr.c
--- linux-2.6.git/drivers/infiniband/core/addr.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.000000000 -0800
@@ -0,0 +1,356 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ * available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ * available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ * copy of which is available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include <linux/inetdevice.h>
+#include <linux/workqueue.h>
+#include <net/arp.h>
+#include <net/neighbour.h>
+#include <net/route.h>
+#include <rdma/ib_addr.h>
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+ struct list_head list;
+ struct sockaddr src_addr;
+ struct sockaddr dst_addr;
+ struct rdma_dev_addr *addr;
+ void *context;
+ void (*callback)(int status, struct sockaddr *src_addr,
+ struct rdma_dev_addr *addr, void *context);
+ unsigned long timeout;
+ int status;
+};
+
+static void process_req(void *data);
+
+static DEFINE_MUTEX(lock);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+struct workqueue_struct *rdma_wq;
+EXPORT_SYMBOL(rdma_wq);
+
+static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+ unsigned char *dst_dev_addr)
+{
+ switch (dev->type) {
+ case ARPHRD_INFINIBAND:
+ dev_addr->dev_type = IB_NODE_CA;
+ break;
+ default:
+ return -EADDRNOTAVAIL;
+ }
+
+ memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+ memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
+ if (dst_dev_addr)
+ memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+ return 0;
+}
+
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+{
+ struct net_device *dev;
+ u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+ int ret;
+
+ dev = ip_dev_find(ip);
+ if (!dev)
+ return -EADDRNOTAVAIL;
+
+ ret = copy_addr(dev_addr, dev, NULL);
+ dev_put(dev);
+ return ret;
+}
+EXPORT_SYMBOL(rdma_translate_ip);
+
+static void set_timeout(unsigned long time)
+{
+ unsigned long delay;
+
+ cancel_delayed_work(&work);
+
+ delay = time - jiffies;
+ if ((long)delay <= 0)
+ delay = 1;
+
+ queue_delayed_work(rdma_wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+ struct addr_req *temp_req;
+
+ mutex_lock(&lock);
+ list_for_each_entry_reverse(temp_req, &req_list, list) {
+ if (time_after(req->timeout, temp_req->timeout))
+ break;
+ }
+
+ list_add(&req->list, &temp_req->list);
+
+ if (req_list.next == &req->list)
+ set_timeout(req->timeout);
+ mutex_unlock(&lock);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+ struct rtable *rt;
+ struct flowi fl;
+ u32 dst_ip = dst_in->sin_addr.s_addr;
+
+ memset(&fl, 0, sizeof fl);
+ fl.nl_u.ip4_u.daddr = dst_ip;
+ if (ip_route_output_key(&rt, &fl))
+ return;
+
+ arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
+ rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
+ ip_rt_put(rt);
+}
+
+static int addr_resolve_remote(struct sockaddr_in *src_in,
+ struct sockaddr_in *dst_in,
+ struct rdma_dev_addr *addr)
+{
+ u32 src_ip = src_in->sin_addr.s_addr;
+ u32 dst_ip = dst_in->sin_addr.s_addr;
+ struct flowi fl;
+ struct rtable *rt;
+ struct neighbour *neigh;
+ int ret;
+
+ memset(&fl, 0, sizeof fl);
+ fl.nl_u.ip4_u.daddr = dst_ip;
+ fl.nl_u.ip4_u.saddr = src_ip;
+ ret = ip_route_output_key(&rt, &fl);
+ if (ret)
+ goto out;
+
+ neigh = neigh_lookup(&arp_tbl, &rt->rt_gateway, rt->idev->dev);
+ if (!neigh) {
+ ret = -ENODATA;
+ goto err1;
+ }
+
+ if (!(neigh->nud_state & NUD_VALID)) {
+ ret = -ENODATA;
+ goto err2;
+ }
+
+ if (!src_ip) {
+ src_in->sin_family = dst_in->sin_family;
+ src_in->sin_addr.s_addr = rt->rt_src;
+ }
+
+ ret = copy_addr(addr, neigh->dev, neigh->ha);
+err2:
+ neigh_release(neigh);
+err1:
+ ip_rt_put(rt);
+out:
+ return ret;
+}
+
+static void process_req(void *data)
+{
+ struct addr_req *req, *temp_req;
+ struct sockaddr_in *src_in, *dst_in;
+ struct list_head done_list;
+
+ INIT_LIST_HEAD(&done_list);
+
+ mutex_lock(&lock);
+ list_for_each_entry_safe(req, temp_req, &req_list, list) {
+ if (req->status) {
+ src_in = (struct sockaddr_in *) &req->src_addr;
+ dst_in = (struct sockaddr_in *) &req->dst_addr;
+ req->status = addr_resolve_remote(src_in, dst_in,
+ req->addr);
+ }
+ if (req->status && time_after(jiffies, req->timeout))
+ req->status = -ETIMEDOUT;
+ else if (req->status == -ENODATA)
+ continue;
+
+ list_del(&req->list);
+ list_add_tail(&req->list, &done_list);
+ }
+
+ if (!list_empty(&req_list)) {
+ req = list_entry(req_list.next, struct addr_req, list);
+ set_timeout(req->timeout);
+ }
+ mutex_unlock(&lock);
+
+ list_for_each_entry_safe(req, temp_req, &done_list, list) {
+ list_del(&req->list);
+ req->callback(req->status, &req->src_addr, req->addr,
+ req->context);
+ kfree(req);
+ }
+}
+
+static int addr_resolve_local(struct sockaddr_in *src_in,
+ struct sockaddr_in *dst_in,
+ struct rdma_dev_addr *addr)
+{
+ struct net_device *dev;
+ u32 src_ip = src_in->sin_addr.s_addr;
+ u32 dst_ip = dst_in->sin_addr.s_addr;
+ int ret;
+
+ dev = ip_dev_find(dst_ip);
+ if (!dev)
+ return -EADDRNOTAVAIL;
+
+ if (!src_ip) {
+ src_in->sin_family = dst_in->sin_family;
+ src_in->sin_addr.s_addr = dst_ip;
+ ret = copy_addr(addr, dev, dev->dev_addr);
+ } else {
+ ret = rdma_translate_ip((struct sockaddr *)src_in, addr);
+ if (!ret)
+ memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+ }
+
+ dev_put(dev);
+ return ret;
+}
+
+int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr,
+ struct rdma_dev_addr *addr, int timeout_ms,
+ void (*callback)(int status, struct sockaddr *src_addr,
+ struct rdma_dev_addr *addr, void *context),
+ void *context)
+{
+ struct sockaddr_in *src_in, *dst_in;
+ struct addr_req *req;
+ int ret = 0;
+
+ req = kmalloc(sizeof *req, GFP_KERNEL);
+ if (!req)
+ return -ENOMEM;
+ memset(req, 0, sizeof *req);
+
+ if (src_addr)
+ memcpy(&req->src_addr, src_addr, ip_addr_size(src_addr));
+ memcpy(&req->dst_addr, dst_addr, ip_addr_size(dst_addr));
+ req->addr = addr;
+ req->callback = callback;
+ req->context = context;
+
+ src_in = (struct sockaddr_in *) &req->src_addr;
+ dst_in = (struct sockaddr_in *) &req->dst_addr;
+
+ req->status = addr_resolve_local(src_in, dst_in, addr);
+ if (req->status == -EADDRNOTAVAIL)
+ req->status = addr_resolve_remote(src_in, dst_in, addr);
+
+ switch (req->status) {
+ case 0:
+ req->timeout = jiffies;
+ queue_req(req);
+ break;
+ case -ENODATA:
+ req->timeout = msecs_to_jiffies(timeout_ms) + jiffies;
+ queue_req(req);
+ addr_send_arp(dst_in);
+ break;
+ default:
+ ret = req->status;
+ kfree(req);
+ break;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(rdma_resolve_ip);
+
+void rdma_addr_cancel(struct rdma_dev_addr *addr)
+{
+ struct addr_req *req, *temp_req;
+
+ mutex_lock(&lock);
+ list_for_each_entry_safe(req, temp_req, &req_list, list) {
+ if (req->addr == addr) {
+ req->status = -ECANCELED;
+ req->timeout = jiffies;
+ list_del(&req->list);
+ list_add(&req->list, &req_list);
+ set_timeout(req->timeout);
+ break;
+ }
+ }
+ mutex_unlock(&lock);
+}
+EXPORT_SYMBOL(rdma_addr_cancel);
+
+static int addr_arp_recv(struct sk_buff *skb, struct net_device *dev,
+ struct packet_type *pkt, struct net_device *orig_dev)
+{
+ struct arphdr *arp_hdr;
+
+ arp_hdr = (struct arphdr *) skb->nh.raw;
+
+ if (dev->type == ARPHRD_INFINIBAND &&
+ (arp_hdr->ar_op == __constant_htons(ARPOP_REQUEST) ||
+ arp_hdr->ar_op == __constant_htons(ARPOP_REPLY)))
+ set_timeout(jiffies);
+
+ kfree_skb(skb);
+ return 0;
+}
+
+static struct packet_type addr_arp = {
+ .type = __constant_htons(ETH_P_ARP),
+ .func = addr_arp_recv,
+ .af_packet_priv = (void*) 1,
+};
+
+static int addr_init(void)
+{
+ rdma_wq = create_singlethread_workqueue("rdma_wq");
+ if (!rdma_wq)
+ return -ENOMEM;
+
+ dev_add_pack(&addr_arp);
+ return 0;
+}
+
+static void addr_cleanup(void)
+{
+ dev_remove_pack(&addr_arp);
+ destroy_workqueue(rdma_wq);
+}
+
+module_init(addr_init);
+module_exit(addr_cleanup);
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/drivers/infiniband/core/Makefile
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:03:08.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:14:24.000000000 -0800
@@ -1,5 +1,5 @@
obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
- ib_cm.o
+ ib_cm.o ib_addr.o
obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o
@@ -12,6 +12,8 @@ ib_sa-y := sa_query.o
ib_cm-y := cm.o
+ib_addr-y := addr.o
+
ib_umad-y := user_mad.o
ib_ucm-y := ucm.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/include/rdma/ib_addr.h
linux-2.6.ib/include/rdma/ib_addr.h
--- linux-2.6.git/include/rdma/ib_addr.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/include/rdma/ib_addr.h 2006-01-16 16:14:24.000000000 -0800
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ * available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ * available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ * copy of which is available from the Open Source Initiative, see
+ * http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+
+#if !defined(IB_ADDR_H)
+#define IB_ADDR_H
+
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/netdevice.h>
+#include <linux/socket.h>
+#include <rdma/ib_verbs.h>
+
+extern struct workqueue_struct *rdma_wq;
+
+struct rdma_dev_addr {
+ unsigned char src_dev_addr[MAX_ADDR_LEN];
+ unsigned char dst_dev_addr[MAX_ADDR_LEN];
+ unsigned char broadcast[MAX_ADDR_LEN];
+ enum ib_node_type dev_type;
+};
+
+/**
+ * rdma_translate_ip - Translate a local IP address to an RDMA hardware
+ * address.
+ */
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr);
+
+/**
+ * rdma_resolve_ip - Resolve source and destination IP addresses to
+ * RDMA hardware addresses.
+ * @src_addr: An optional source address to use in the resolution. If a
+ * source address is not provided, a usable address will be returned via
+ * the callback.
+ * @dst_addr: The destination address to resolve.
+ * @addr: A reference to a data location that will receive the resolved
+ * addresses. The data location must remain valid until the callback has
+ * been invoked.
+ * @timeout_ms: Amount of time to wait for the address resolution to complete.
+ * @callback: Call invoked once address resolution has completed, timed out,
+ * or been canceled. A status of 0 indicates success.
+ * @context: User-specified context associated with the call.
+ */
+int rdma_resolve_ip(struct sockaddr *src_addr, struct sockaddr *dst_addr,
+ struct rdma_dev_addr *addr, int timeout_ms,
+ void (*callback)(int status, struct sockaddr *src_addr,
+ struct rdma_dev_addr *addr, void *context),
+ void *context);
+
+void rdma_addr_cancel(struct rdma_dev_addr *addr);
+
+static inline int ip_addr_size(struct sockaddr *addr)
+{
+ return addr->sa_family == AF_INET6 ?
+ sizeof(struct sockaddr_in6) : sizeof(struct sockaddr_in);
+}
+
+static inline u16 ib_addr_get_pkey(struct rdma_dev_addr *dev_addr)
+{
+ return ((u16)dev_addr->broadcast[8] << 8) | (u16)dev_addr->broadcast[9];
+}
+
+static inline void ib_addr_set_pkey(struct rdma_dev_addr *dev_addr, u16 pkey)
+{
+ dev_addr->broadcast[8] = pkey >> 8;
+ dev_addr->broadcast[9] = (unsigned char) pkey;
+}
+
+static inline union ib_gid* ib_addr_get_sgid(struct rdma_dev_addr *dev_addr)
+{
+ return (union ib_gid *) (dev_addr->src_dev_addr + 4);
+}
+
+static inline void ib_addr_set_sgid(struct rdma_dev_addr *dev_addr,
+ union ib_gid *gid)
+{
+ memcpy(dev_addr->src_dev_addr + 4, gid, sizeof *gid);
+}
+
+static inline union ib_gid* ib_addr_get_dgid(struct rdma_dev_addr *dev_addr)
+{
+ return (union ib_gid *) (dev_addr->dst_dev_addr + 4);
+}
+
+static inline void ib_addr_set_dgid(struct rdma_dev_addr *dev_addr,
+ union ib_gid *gid)
+{
+ memcpy(dev_addr->dst_dev_addr + 4, gid, sizeof *gid);
+}
+
+#endif /* IB_ADDR_H */
+
^ permalink raw reply
* Re: Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: Roland Dreier @ 2006-03-06 23:40 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, openib-general, netdev
In-Reply-To: <20060306.145053.129802994.davem@davemloft.net>
Roland> I should look into getting some niagara machines to test
Roland> with -- with PCIe slots they should actually be good for
Roland> IB testing.
David> You'll be cpu limited until we have Van Jacobson net
David> channels.
For IPoIB maybe but not for native IB which offloads all transport to
the HCA... I'd be surprised if the bottleneck were anywhere other than
the bus, even on niagara.
David> Also, since our existing Linux "generic" MSI code is so
David> riddled with x86'isms (it was written by an Intel person,
David> so this is just the status quo), it will be a while before
David> MSI interrupts are supported on sparc64.
Yeah, I've always wanted to make the MSI stuff generic and handle the
embedded ppc chips that have MSI, but I've never had a good enough
reason to really work on it -- it's just been at the level of "that
would be fun." Now that IBM cares I hope it will get done soon.
Anyway IB works fine with standard INTx interrupts -- MSI is just icing.
The Niagara boxes seem like a fun toy if I can get budget for it --
and 32 threads are probably good for flushing out SMP races.
- R.
^ permalink raw reply
* [PATCH 6/6 v2] IB: userspace support for RDMA connection manager
From: Sean Hefty @ 2006-03-06 23:41 UTC (permalink / raw)
To: Hefty, Sean, 'Roland Dreier', netdev, linux-kernel; +Cc: openib-general
In-Reply-To: <ORSMSX4011XvpFVjCRG00000009@orsmsx401.amr.corp.intel.com>
Kernel component necessary to support the userspace RDMA connection management
library.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
---
Discussion on the list suggested giving the userspace interface more time to
develop, which seems reasonable.
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/drivers/infiniband/core/Makefile
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.000000000 -0800
@@ -1,5 +1,5 @@
obj-$(CONFIG_INFINIBAND) += ib_core.o ib_mad.o ib_sa.o \
- ib_cm.o ib_addr.o rdma_cm.o
+ ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o
obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o
@@ -14,6 +14,8 @@ ib_cm-y := cm.o
rdma_cm-y := cma.o
+rdma_ucm-y := ucma.o
+
ib_addr-y := addr.o
ib_umad-y := user_mad.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/drivers/infiniband/core/ucma.c
linux-2.6.ib/drivers/infiniband/core/ucma.c
--- linux-2.6.git/drivers/infiniband/core/ucma.c 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.000000000 -0800
@@ -0,0 +1,788 @@
+/*
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/poll.h>
+#include <linux/idr.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/miscdevice.h>
+
+#include <rdma/rdma_user_cm.h>
+#include <rdma/ib_marshall.h>
+#include <rdma/rdma_cm.h>
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+ UCMA_MAX_BACKLOG = 128
+};
+
+struct ucma_file {
+ struct mutex file_mutex;
+ struct file *filp;
+ struct list_head ctxs;
+ struct list_head events;
+ wait_queue_head_t poll_wait;
+};
+
+struct ucma_context {
+ int id;
+ wait_queue_head_t wait;
+ atomic_t ref;
+ int events_reported;
+ int backlog;
+
+ struct ucma_file *file;
+ struct rdma_cm_id *cm_id;
+ __u64 uid;
+
+ struct list_head events; /* list of pending events. */
+ struct list_head file_list; /* member in file ctx list */
+};
+
+struct ucma_event {
+ struct ucma_context *ctx;
+ struct list_head file_list; /* member in file event list */
+ struct list_head ctx_list; /* member in ctx event list */
+ struct rdma_cm_id *cm_id;
+ struct rdma_ucm_event_resp resp;
+};
+
+static DEFINE_MUTEX(ctx_mutex);
+static DEFINE_IDR(ctx_idr);
+
+static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id)
+{
+ struct ucma_context *ctx;
+
+ mutex_lock(&ctx_mutex);
+ ctx = idr_find(&ctx_idr, id);
+ if (!ctx)
+ ctx = ERR_PTR(-ENOENT);
+ else if (ctx->file != file)
+ ctx = ERR_PTR(-EINVAL);
+ else
+ atomic_inc(&ctx->ref);
+ mutex_unlock(&ctx_mutex);
+
+ return ctx;
+}
+
+static void ucma_put_ctx(struct ucma_context *ctx)
+{
+ if (atomic_dec_and_test(&ctx->ref))
+ wake_up(&ctx->wait);
+}
+
+static void ucma_cleanup_events(struct ucma_context *ctx)
+{
+ struct ucma_event *uevent;
+
+ mutex_lock(&ctx->file->file_mutex);
+ list_del(&ctx->file_list);
+ while (!list_empty(&ctx->events)) {
+
+ uevent = list_entry(ctx->events.next, struct ucma_event,
+ ctx_list);
+ list_del(&uevent->file_list);
+ list_del(&uevent->ctx_list);
+
+ /* clear incoming connections. */
+ if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST)
+ rdma_destroy_id(uevent->cm_id);
+
+ kfree(uevent);
+ }
+ mutex_unlock(&ctx->file->file_mutex);
+}
+
+static struct ucma_context* ucma_alloc_ctx(struct ucma_file *file)
+{
+ struct ucma_context *ctx;
+ int ret;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return NULL;
+
+ atomic_set(&ctx->ref, 1);
+ init_waitqueue_head(&ctx->wait);
+ ctx->file = file;
+ INIT_LIST_HEAD(&ctx->events);
+
+ do {
+ ret = idr_pre_get(&ctx_idr, GFP_KERNEL);
+ if (!ret)
+ goto error;
+
+ mutex_lock(&ctx_mutex);
+ ret = idr_get_new(&ctx_idr, ctx, &ctx->id);
+ mutex_unlock(&ctx_mutex);
+ } while (ret == -EAGAIN);
+
+ if (ret)
+ goto error;
+
+ list_add_tail(&ctx->file_list, &file->ctxs);
+ return ctx;
+
+error:
+ kfree(ctx);
+ return NULL;
+}
+
+static int ucma_event_handler(struct rdma_cm_id *cm_id,
+ struct rdma_cm_event *event)
+{
+ struct ucma_event *uevent;
+ struct ucma_context *ctx = cm_id->context;
+ int ret = 0;
+
+ uevent = kzalloc(sizeof(*uevent), GFP_KERNEL);
+ if (!uevent)
+ return event->event == RDMA_CM_EVENT_CONNECT_REQUEST;
+
+ uevent->ctx = ctx;
+ uevent->cm_id = cm_id;
+ uevent->resp.uid = ctx->uid;
+ uevent->resp.id = ctx->id;
+ uevent->resp.event = event->event;
+ uevent->resp.status = event->status;
+ if ((uevent->resp.private_data_len = event->private_data_len))
+ memcpy(uevent->resp.private_data, event->private_data,
+ event->private_data_len);
+
+ mutex_lock(&ctx->file->file_mutex);
+ if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) {
+ if (!ctx->backlog) {
+ ret = -EDQUOT;
+ goto out;
+ }
+ ctx->backlog--;
+ }
+ list_add_tail(&uevent->file_list, &ctx->file->events);
+ list_add_tail(&uevent->ctx_list, &ctx->events);
+ wake_up_interruptible(&ctx->file->poll_wait);
+out:
+ mutex_unlock(&ctx->file->file_mutex);
+ return ret;
+}
+
+static ssize_t ucma_get_event(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct ucma_context *ctx;
+ struct rdma_ucm_get_event cmd;
+ struct ucma_event *uevent;
+ int ret = 0;
+ DEFINE_WAIT(wait);
+
+ if (out_len < sizeof(struct rdma_ucm_event_resp))
+ return -ENOSPC;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ mutex_lock(&file->file_mutex);
+ while (list_empty(&file->events)) {
+ if (file->filp->f_flags & O_NONBLOCK) {
+ ret = -EAGAIN;
+ break;
+ }
+
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE);
+ mutex_unlock(&file->file_mutex);
+ schedule();
+ mutex_lock(&file->file_mutex);
+ finish_wait(&file->poll_wait, &wait);
+ }
+
+ if (ret)
+ goto done;
+
+ uevent = list_entry(file->events.next, struct ucma_event, file_list);
+
+ if (uevent->resp.event == RDMA_CM_EVENT_CONNECT_REQUEST) {
+ ctx = ucma_alloc_ctx(file);
+ if (!ctx) {
+ ret = -ENOMEM;
+ goto done;
+ }
+ uevent->ctx->backlog++;
+ ctx->cm_id = uevent->cm_id;
+ ctx->cm_id->context = ctx;
+ uevent->resp.id = ctx->id;
+ }
+
+ if (copy_to_user((void __user *)(unsigned long)cmd.response,
+ &uevent->resp, sizeof(uevent->resp))) {
+ ret = -EFAULT;
+ goto done;
+ }
+
+ list_del(&uevent->file_list);
+ list_del(&uevent->ctx_list);
+ uevent->ctx->events_reported++;
+ kfree(uevent);
+done:
+ mutex_unlock(&file->file_mutex);
+ return ret;
+}
+
+static ssize_t ucma_create_id(struct ucma_file *file,
+ const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_create_id cmd;
+ struct rdma_ucm_create_id_resp resp;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (out_len < sizeof(resp))
+ return -ENOSPC;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ mutex_lock(&file->file_mutex);
+ ctx = ucma_alloc_ctx(file);
+ mutex_unlock(&file->file_mutex);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->uid = cmd.uid;
+ ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, RDMA_PS_TCP);
+ if (IS_ERR(ctx->cm_id)) {
+ ret = PTR_ERR(ctx->cm_id);
+ goto err1;
+ }
+
+ resp.id = ctx->id;
+ if (copy_to_user((void __user *)(unsigned long)cmd.response,
+ &resp, sizeof(resp))) {
+ ret = -EFAULT;
+ goto err2;
+ }
+ return 0;
+
+err2:
+ rdma_destroy_id(ctx->cm_id);
+err1:
+ mutex_lock(&ctx_mutex);
+ idr_remove(&ctx_idr, ctx->id);
+ mutex_unlock(&ctx_mutex);
+ kfree(ctx);
+ return ret;
+}
+
+static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_destroy_id cmd;
+ struct rdma_ucm_destroy_id_resp resp;
+ struct ucma_context *ctx;
+ int ret = 0;
+
+ if (out_len < sizeof(resp))
+ return -ENOSPC;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ mutex_lock(&ctx_mutex);
+ ctx = idr_find(&ctx_idr, cmd.id);
+ if (!ctx)
+ ctx = ERR_PTR(-ENOENT);
+ else if (ctx->file != file)
+ ctx = ERR_PTR(-EINVAL);
+ else
+ idr_remove(&ctx_idr, ctx->id);
+ mutex_unlock(&ctx_mutex);
+
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ atomic_dec(&ctx->ref);
+ wait_event(ctx->wait, !atomic_read(&ctx->ref));
+
+ /* No new events will be generated after destroying the id. */
+ rdma_destroy_id(ctx->cm_id);
+ /* Cleanup events not yet reported to the user. */
+ ucma_cleanup_events(ctx);
+
+ resp.events_reported = ctx->events_reported;
+ if (copy_to_user((void __user *)(unsigned long)cmd.response,
+ &resp, sizeof(resp)))
+ ret = -EFAULT;
+
+ kfree(ctx);
+ return ret;
+}
+
+static ssize_t ucma_bind_addr(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_bind_addr cmd;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = rdma_bind_addr(ctx->cm_id, (struct sockaddr *) &cmd.addr);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_resolve_addr(struct ucma_file *file,
+ const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_resolve_addr cmd;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = rdma_resolve_addr(ctx->cm_id, (struct sockaddr *) &cmd.src_addr,
+ (struct sockaddr *) &cmd.dst_addr,
+ cmd.timeout_ms);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_resolve_route(struct ucma_file *file,
+ const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_resolve_route cmd;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = rdma_resolve_route(ctx->cm_id, cmd.timeout_ms);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
+ struct rdma_route *route)
+{
+ struct rdma_dev_addr *dev_addr;
+
+ resp->num_paths = route->num_paths;
+ switch (route->num_paths) {
+ case 0:
+ dev_addr = &route->addr.dev_addr;
+ memcpy(&resp->ib_route[0].dgid, ib_addr_get_dgid(dev_addr),
+ sizeof(union ib_gid));
+ memcpy(&resp->ib_route[0].sgid, ib_addr_get_sgid(dev_addr),
+ sizeof(union ib_gid));
+ resp->ib_route[0].pkey = cpu_to_be16(ib_addr_get_pkey(dev_addr));
+ break;
+ case 2:
+ ib_copy_path_rec_to_user(&resp->ib_route[1],
+ &route->path_rec[1]);
+ /* fall through */
+ case 1:
+ ib_copy_path_rec_to_user(&resp->ib_route[0],
+ &route->path_rec[0]);
+ break;
+ default:
+ break;
+ }
+}
+
+static ssize_t ucma_query_route(struct ucma_file *file,
+ const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_query_route cmd;
+ struct rdma_ucm_query_route_resp resp;
+ struct ucma_context *ctx;
+ struct sockaddr *addr;
+ int ret = 0;
+
+ if (out_len < sizeof(resp))
+ return -ENOSPC;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ if (!ctx->cm_id->device) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ addr = &ctx->cm_id->route.addr.src_addr;
+ memcpy(&resp.src_addr, addr, addr->sa_family == AF_INET ?
+ sizeof(struct sockaddr_in) :
+ sizeof(struct sockaddr_in6));
+ addr = &ctx->cm_id->route.addr.dst_addr;
+ memcpy(&resp.dst_addr, addr, addr->sa_family == AF_INET ?
+ sizeof(struct sockaddr_in) :
+ sizeof(struct sockaddr_in6));
+ resp.node_guid = ctx->cm_id->device->node_guid;
+ resp.port_num = ctx->cm_id->port_num;
+ switch (ctx->cm_id->device->node_type) {
+ case IB_NODE_CA:
+ ucma_copy_ib_route(&resp, &ctx->cm_id->route);
+ default:
+ break;
+ }
+
+ if (copy_to_user((void __user *)(unsigned long)cmd.response,
+ &resp, sizeof(resp)))
+ ret = -EFAULT;
+
+out:
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static void ucma_copy_conn_param(struct rdma_conn_param *dst_conn,
+ struct rdma_ucm_conn_param *src_conn)
+{
+ dst_conn->private_data = src_conn->private_data;
+ dst_conn->private_data_len = src_conn->private_data_len;
+ dst_conn->responder_resources =src_conn->responder_resources;
+ dst_conn->initiator_depth = src_conn->initiator_depth;
+ dst_conn->flow_control = src_conn->flow_control;
+ dst_conn->retry_count = src_conn->retry_count;
+ dst_conn->rnr_retry_count = src_conn->rnr_retry_count;
+ dst_conn->srq = src_conn->srq;
+ dst_conn->qp_num = src_conn->qp_num;
+ dst_conn->qp_type = src_conn->qp_type;
+}
+
+static ssize_t ucma_connect(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_connect cmd;
+ struct rdma_conn_param conn_param;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ if (!cmd.conn_param.valid)
+ return -EINVAL;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ucma_copy_conn_param(&conn_param, &cmd.conn_param);
+ ret = rdma_connect(ctx->cm_id, &conn_param);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_listen(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_listen cmd;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ctx->backlog = cmd.backlog > 0 && cmd.backlog < UCMA_MAX_BACKLOG ?
+ cmd.backlog : UCMA_MAX_BACKLOG;
+ ret = rdma_listen(ctx->cm_id, ctx->backlog);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_accept(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_accept cmd;
+ struct rdma_conn_param conn_param;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ if (cmd.conn_param.valid) {
+ ctx->uid = cmd.uid;
+ ucma_copy_conn_param(&conn_param, &cmd.conn_param);
+ ret = rdma_accept(ctx->cm_id, &conn_param);
+ } else
+ ret = rdma_accept(ctx->cm_id, NULL);
+
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_reject(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_reject cmd;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = rdma_reject(ctx->cm_id, cmd.private_data, cmd.private_data_len);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_disconnect(struct ucma_file *file, const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_disconnect cmd;
+ struct ucma_context *ctx;
+ int ret;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ ret = rdma_disconnect(ctx->cm_id);
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t ucma_init_qp_attr(struct ucma_file *file,
+ const char __user *inbuf,
+ int in_len, int out_len)
+{
+ struct rdma_ucm_init_qp_attr cmd;
+ struct ib_uverbs_qp_attr resp;
+ struct ucma_context *ctx;
+ struct ib_qp_attr qp_attr;
+ int ret;
+
+ if (out_len < sizeof(resp))
+ return -ENOSPC;
+
+ if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+ return -EFAULT;
+
+ ctx = ucma_get_ctx(file, cmd.id);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ resp.qp_attr_mask = 0;
+ memset(&qp_attr, 0, sizeof qp_attr);
+ qp_attr.qp_state = cmd.qp_state;
+ ret = rdma_init_qp_attr(ctx->cm_id, &qp_attr, &resp.qp_attr_mask);
+ if (ret)
+ goto out;
+
+ ib_copy_qp_attr_to_user(&resp, &qp_attr);
+ if (copy_to_user((void __user *)(unsigned long)cmd.response,
+ &resp, sizeof(resp)))
+ ret = -EFAULT;
+
+out:
+ ucma_put_ctx(ctx);
+ return ret;
+}
+
+static ssize_t (*ucma_cmd_table[])(struct ucma_file *file,
+ const char __user *inbuf,
+ int in_len, int out_len) = {
+ [RDMA_USER_CM_CMD_CREATE_ID] = ucma_create_id,
+ [RDMA_USER_CM_CMD_DESTROY_ID] = ucma_destroy_id,
+ [RDMA_USER_CM_CMD_BIND_ADDR] = ucma_bind_addr,
+ [RDMA_USER_CM_CMD_RESOLVE_ADDR] = ucma_resolve_addr,
+ [RDMA_USER_CM_CMD_RESOLVE_ROUTE]= ucma_resolve_route,
+ [RDMA_USER_CM_CMD_QUERY_ROUTE] = ucma_query_route,
+ [RDMA_USER_CM_CMD_CONNECT] = ucma_connect,
+ [RDMA_USER_CM_CMD_LISTEN] = ucma_listen,
+ [RDMA_USER_CM_CMD_ACCEPT] = ucma_accept,
+ [RDMA_USER_CM_CMD_REJECT] = ucma_reject,
+ [RDMA_USER_CM_CMD_DISCONNECT] = ucma_disconnect,
+ [RDMA_USER_CM_CMD_INIT_QP_ATTR] = ucma_init_qp_attr,
+ [RDMA_USER_CM_CMD_GET_EVENT] = ucma_get_event
+};
+
+static ssize_t ucma_write(struct file *filp, const char __user *buf,
+ size_t len, loff_t *pos)
+{
+ struct ucma_file *file = filp->private_data;
+ struct rdma_ucm_cmd_hdr hdr;
+ ssize_t ret;
+
+ if (len < sizeof(hdr))
+ return -EINVAL;
+
+ if (copy_from_user(&hdr, buf, sizeof(hdr)))
+ return -EFAULT;
+
+ if (hdr.cmd < 0 || hdr.cmd >= ARRAY_SIZE(ucma_cmd_table))
+ return -EINVAL;
+
+ if (hdr.in + sizeof(hdr) > len)
+ return -EINVAL;
+
+ ret = ucma_cmd_table[hdr.cmd](file, buf + sizeof(hdr), hdr.in, hdr.out);
+ if (!ret)
+ ret = len;
+
+ return ret;
+}
+
+static unsigned int ucma_poll(struct file *filp, struct poll_table_struct *wait)
+{
+ struct ucma_file *file = filp->private_data;
+ unsigned int mask = 0;
+
+ poll_wait(filp, &file->poll_wait, wait);
+
+ mutex_lock(&file->file_mutex);
+ if (!list_empty(&file->events))
+ mask = POLLIN | POLLRDNORM;
+ mutex_unlock(&file->file_mutex);
+
+ return mask;
+}
+
+static int ucma_open(struct inode *inode, struct file *filp)
+{
+ struct ucma_file *file;
+
+ file = kmalloc(sizeof *file, GFP_KERNEL);
+ if (!file)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&file->events);
+ INIT_LIST_HEAD(&file->ctxs);
+ init_waitqueue_head(&file->poll_wait);
+ mutex_init(&file->file_mutex);
+
+ filp->private_data = file;
+ file->filp = filp;
+ return 0;
+}
+
+static int ucma_close(struct inode *inode, struct file *filp)
+{
+ struct ucma_file *file = filp->private_data;
+ struct ucma_context *ctx;
+
+ mutex_lock(&file->file_mutex);
+ while (!list_empty(&file->ctxs)) {
+ ctx = list_entry(file->ctxs.next, struct ucma_context,
+ file_list);
+ mutex_unlock(&file->file_mutex);
+
+ mutex_lock(&ctx_mutex);
+ idr_remove(&ctx_idr, ctx->id);
+ mutex_unlock(&ctx_mutex);
+
+ rdma_destroy_id(ctx->cm_id);
+ ucma_cleanup_events(ctx);
+ kfree(ctx);
+
+ mutex_lock(&file->file_mutex);
+ }
+ mutex_unlock(&file->file_mutex);
+ kfree(file);
+ return 0;
+}
+
+static struct file_operations ucma_fops = {
+ .owner = THIS_MODULE,
+ .open = ucma_open,
+ .release = ucma_close,
+ .write = ucma_write,
+ .poll = ucma_poll,
+};
+
+static struct miscdevice ucma_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "rdma_cm",
+ .fops = &ucma_fops,
+};
+
+static int __init ucma_init(void)
+{
+ return misc_register(&ucma_misc);
+}
+
+static void __exit ucma_cleanup(void)
+{
+ misc_deregister(&ucma_misc);
+ idr_destroy(&ctx_idr);
+}
+
+module_init(ucma_init);
+module_exit(ucma_cleanup);
diff -uprN -X linux-2.6.git/Documentation/dontdiff
linux-2.6.git/include/rdma/rdma_user_cm.h
linux-2.6.ib/include/rdma/rdma_user_cm.h
--- linux-2.6.git/include/rdma/rdma_user_cm.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.ib/include/rdma/rdma_user_cm.h 2006-01-16 16:54:55.000000000 -0800
@@ -0,0 +1,186 @@
+/*
+ * Copyright (c) 2005 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RDMA_USER_CM_H
+#define RDMA_USER_CM_H
+
+#include <linux/types.h>
+#include <linux/in6.h>
+#include <rdma/ib_user_verbs.h>
+#include <rdma/ib_user_sa.h>
+
+#define RDMA_USER_CM_ABI_VERSION 1
+
+#define RDMA_MAX_PRIVATE_DATA 256
+
+enum {
+ RDMA_USER_CM_CMD_CREATE_ID,
+ RDMA_USER_CM_CMD_DESTROY_ID,
+ RDMA_USER_CM_CMD_BIND_ADDR,
+ RDMA_USER_CM_CMD_RESOLVE_ADDR,
+ RDMA_USER_CM_CMD_RESOLVE_ROUTE,
+ RDMA_USER_CM_CMD_QUERY_ROUTE,
+ RDMA_USER_CM_CMD_CONNECT,
+ RDMA_USER_CM_CMD_LISTEN,
+ RDMA_USER_CM_CMD_ACCEPT,
+ RDMA_USER_CM_CMD_REJECT,
+ RDMA_USER_CM_CMD_DISCONNECT,
+ RDMA_USER_CM_CMD_INIT_QP_ATTR,
+ RDMA_USER_CM_CMD_GET_EVENT
+};
+
+/*
+ * command ABI structures.
+ */
+struct rdma_ucm_cmd_hdr {
+ __u32 cmd;
+ __u16 in;
+ __u16 out;
+};
+
+struct rdma_ucm_create_id {
+ __u64 uid;
+ __u64 response;
+};
+
+struct rdma_ucm_create_id_resp {
+ __u32 id;
+};
+
+struct rdma_ucm_destroy_id {
+ __u64 response;
+ __u32 id;
+ __u32 reserved;
+};
+
+struct rdma_ucm_destroy_id_resp {
+ __u32 events_reported;
+};
+
+struct rdma_ucm_bind_addr {
+ __u64 response;
+ struct sockaddr_in6 addr;
+ __u32 id;
+};
+
+struct rdma_ucm_resolve_addr {
+ struct sockaddr_in6 src_addr;
+ struct sockaddr_in6 dst_addr;
+ __u32 id;
+ __u32 timeout_ms;
+};
+
+struct rdma_ucm_resolve_route {
+ __u32 id;
+ __u32 timeout_ms;
+};
+
+struct rdma_ucm_query_route {
+ __u64 response;
+ __u32 id;
+ __u32 reserved;
+};
+
+struct rdma_ucm_query_route_resp {
+ __u64 node_guid;
+ struct ib_user_path_rec ib_route[2];
+ struct sockaddr_in6 src_addr;
+ struct sockaddr_in6 dst_addr;
+ __u32 num_paths;
+ __u8 port_num;
+ __u8 reserved[3];
+};
+
+struct rdma_ucm_conn_param {
+ __u32 qp_num;
+ __u32 qp_type;
+ __u8 private_data[RDMA_MAX_PRIVATE_DATA];
+ __u8 private_data_len;
+ __u8 srq;
+ __u8 responder_resources;
+ __u8 initiator_depth;
+ __u8 flow_control;
+ __u8 retry_count;
+ __u8 rnr_retry_count;
+ __u8 valid;
+};
+
+struct rdma_ucm_connect {
+ struct rdma_ucm_conn_param conn_param;
+ __u32 id;
+ __u32 reserved;
+};
+
+struct rdma_ucm_listen {
+ __u32 id;
+ __u32 backlog;
+};
+
+struct rdma_ucm_accept {
+ __u64 uid;
+ struct rdma_ucm_conn_param conn_param;
+ __u32 id;
+ __u32 reserved;
+};
+
+struct rdma_ucm_reject {
+ __u32 id;
+ __u8 private_data_len;
+ __u8 reserved[3];
+ __u8 private_data[RDMA_MAX_PRIVATE_DATA];
+};
+
+struct rdma_ucm_disconnect {
+ __u32 id;
+};
+
+struct rdma_ucm_init_qp_attr {
+ __u64 response;
+ __u32 id;
+ __u32 qp_state;
+};
+
+struct rdma_ucm_get_event {
+ __u64 response;
+};
+
+struct rdma_ucm_event_resp {
+ __u64 uid;
+ __u32 id;
+ __u32 event;
+ __u32 status;
+ __u8 private_data_len;
+ __u8 reserved[3];
+ __u8 private_data[RDMA_MAX_PRIVATE_DATA];
+};
+
+#endif /* RDMA_USER_CM_H */
^ permalink raw reply
* Re: [openib-general] Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: Bryan O'Sullivan @ 2006-03-07 0:05 UTC (permalink / raw)
To: Roland Dreier; +Cc: David S. Miller, linux-kernel, openib-general, netdev
In-Reply-To: <adapskz3zw7.fsf@cisco.com>
On Mon, 2006-03-06 at 15:40 -0800, Roland Dreier wrote:
> Anyway IB works fine with standard INTx interrupts -- MSI is just icing.
Depends on the driver. Ours needs the interrupt vector rather than the
number, which means we don't work without CONFIG_PCI_MSI. That is,
unless there's some other way to get the vector that I don't know about
(entirely likely).
<b
^ permalink raw reply
* Re: [openib-general] Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: Roland Dreier @ 2006-03-07 0:10 UTC (permalink / raw)
To: Bryan O'Sullivan
Cc: netdev, David S. Miller, openib-general, linux-kernel
In-Reply-To: <1141689930.3814.69.camel@serpentine.pathscale.com>
Bryan> Depends on the driver. Ours needs the interrupt vector
Bryan> rather than the number, which means we don't work without
Bryan> CONFIG_PCI_MSI. That is, unless there's some other way to
Bryan> get the vector that I don't know about (entirely likely).
OK, fair enough. But that's a quirk of the x86 architecture that
you're pretty tied to -- you don't actually need MSI, you just need
the interrupt numbering that is enabled on x86 with that config option.
Since Niagara boxes don't have HT anyway it's kind of a moot point.
Although I wonder what you would have to do to make your device work
with something like the MIPS SoCs that have HT...
- R.
^ permalink raw reply
* Re: Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
From: David S. Miller @ 2006-03-07 0:26 UTC (permalink / raw)
To: rdreier; +Cc: linux-kernel, openib-general, netdev
In-Reply-To: <adapskz3zw7.fsf@cisco.com>
From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 06 Mar 2006 15:40:56 -0800
> and 32 threads are probably good for flushing out SMP races.
Indeed, guess what I've been spending most of my time working
on lately? :)
^ permalink raw reply
* Re: Re: TSO and IPoIB performance degradation
From: Shirley Ma @ 2006-03-07 3:13 UTC (permalink / raw)
To: Stephen Hemminger
Cc: netdev, Linux Kernel Mailing List, openib-general,
openib-general-bounces, David S. Miller
In-Reply-To: <20060306145041.320d4dd3@localhost.localdomain>
[-- Attachment #1.1: Type: text/plain, Size: 420 bytes --]
> More likely you are getting hit by the fact that TSO prevents the
congestion
window from increasing properly. This was fixed in 2.6.15 (around mid of
Nov 2005).
Yep, I noticed the same problem. After updating to the new kernel, the
performance are much better, but it's still lower than before.
Thank
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
[-- Attachment #1.2: Type: text/html, Size: 570 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply
* Re: [PATCH, RESEND] Add MWI workaround for Tulip DC21143
From: Ralf Baechle @ 2006-03-07 3:58 UTC (permalink / raw)
To: Francois Romieu; +Cc: Martin Michlmayr, netdev, linux-mips
In-Reply-To: <20060306231530.GB16082@electric-eye.fr.zoreil.com>
On Tue, Mar 07, 2006 at 12:15:30AM +0100, Francois Romieu wrote:
> [...]
> > Does anyone have comments regarding this patch? I received
> > confirmation from a number of Debian users that this patch
> > significantly improves the lockup situation on Cobalt, so
> > it would be nice if it could go in.
>
> I'll queue it with the pending de2104x fix(es ?) during my next
> upkeep.
I'm just not convinced of having such a workaround as a build option.
The average person building a a kernel will probably not know if the
option needs to be enabled or not.
Ralf
^ permalink raw reply
* Re: de2104x: interrupts before interrupt handler is registered
From: Martin Michlmayr @ 2006-03-07 5:11 UTC (permalink / raw)
To: Francois Romieu; +Cc: netdev, linux-kernel
In-Reply-To: <20060306211745.GD15728@electric-eye.fr.zoreil.com>
* Francois Romieu <romieu@fr.zoreil.com> [2006-03-06 22:17]:
> Not sure about this one, but...
It seems to help. It's hard to say for sure because I don't have a
foolproof way to reproduce this panic. It _usually_ occurs after
copying a few hundred MB but there's no clear trigger. I've now copied
a few GB around using a kernel with your patch and it hasn't crashed.
--
Martin Michlmayr
http://www.cyrius.com/
^ permalink raw reply
* Re: de2104x: interrupts before interrupt handler is registered
From: Martin Michlmayr @ 2006-03-07 5:16 UTC (permalink / raw)
To: Francois Romieu; +Cc: netdev, linux-kernel
In-Reply-To: <20060306205406.GC15728@electric-eye.fr.zoreil.com>
* Francois Romieu <romieu@fr.zoreil.com> [2006-03-06 21:54]:
> > By the way, I'm getting the following messages in dmesg:
> > eth0: tx err, status 0x7fffb002
> Tx underrun.
> Is there anything which could induce a noticeable load on the PCI bus ?
I was going to say "no" because I was simply copying some data via the
network. However, it seems the situation is a bit more complicated
than this. It seems that I only get these underruns using a specific
hard drive. You see, the reason I'm rsyncing hundred of megabytes of
data across my LAN is because my laptop hard drive is dying, so I put
it in a PC as secondary master using an adapter. Interestingly
enough, I don't get any Tx underruns when using a different disk.
Which is strange because at the moment the disk is working fine (it
sort of started dying but seems to behave right now), so I don't know
why it would change anything. Maybe this makes sense to someone.
By the way, I only get underruns when I rsync from the PC to another
machine - not when I rsync from the other machine to the PC.
--
Martin Michlmayr
http://www.cyrius.com/
^ permalink raw reply
* Re: [PATCH] net: Move destructor from neigh->ops to neigh_params
From: Andrew Morton @ 2006-03-07 5:16 UTC (permalink / raw)
To: Roland Dreier; +Cc: netdev, davem, openib-general
In-Reply-To: <adazmlmmy7v.fsf@cisco.com>
Roland Dreier <rdreier@cisco.com> wrote:
>
> struct neigh_ops currently has a destructor field, which no in-kernel
> drivers outside of infiniband use.
net/atm/clip.c begs to disagree.
I'm also wondering why a combination of the net and infiniband trees does this:
drivers/infiniband/ulp/ipoib/ipoib_multicast.c: In function `ipoib_mcast_free':
drivers/infiniband/ulp/ipoib/ipoib_multicast.c:118: error: structure has no member named `destructor'
^ permalink raw reply
* Re: Re: [PATCH] net: Move destructor from neigh->ops to neigh_params
From: Shirley Ma @ 2006-03-07 5:40 UTC (permalink / raw)
To: Andrew Morton
Cc: netdev, openib-general-bounces, davem, openib-general,
Roland Dreier
In-Reply-To: <20060306211630.0df79464.akpm@osdl.org>
[-- Attachment #1.1: Type: text/plain, Size: 518 bytes --]
>I'm also wondering why a combination of the net and infiniband trees does
this:
>drivers/infiniband/ulp/ipoib/ipoib_multicast.c: In function
`ipoib_mcast_free':
>drivers/infiniband/ulp/ipoib/ipoib_multicast.c:118: error: structure has
no member named `destructor'
I guess this patch needs to be combined with another ipoib-neigh patch,
which hasn't been checked into infiniband/ipoib yet.
Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
[-- Attachment #1.2: Type: text/html, Size: 689 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply
* Re: [PATCH] net: Move destructor from neigh->ops to neigh_params
From: Roland Dreier @ 2006-03-07 5:44 UTC (permalink / raw)
To: Andrew Morton; +Cc: netdev, davem, openib-general
In-Reply-To: <20060306211630.0df79464.akpm@osdl.org>
Roland> struct neigh_ops currently has a destructor field, which
Roland> no in-kernel drivers outside of infiniband use.
Andrew> net/atm/clip.c begs to disagree.
err... my fault for trusting the patch changelog and not
double-checking. I think the fix is as simple as, although, given my
lack of ATM gear, this is untested:
diff --git a/net/atm/clip.c b/net/atm/clip.c
index 73370de..9d72817 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -289,7 +289,6 @@ static void clip_neigh_error(struct neig
static struct neigh_ops clip_neigh_ops = {
.family = AF_INET,
- .destructor = clip_neigh_destroy,
.solicit = clip_neigh_solicit,
.error_report = clip_neigh_error,
.output = dev_queue_xmit,
@@ -346,6 +345,7 @@ static struct neigh_table clip_tbl = {
/* parameters are copied from ARP ... */
.parms = {
+ .destructor = clip_neigh_destroy,
.tbl = &clip_tbl,
.base_reachable_time = 30 * HZ,
.retrans_time = 1 * HZ,
^ permalink raw reply related
* Re: [PATCH] net: Move destructor from neigh->ops to neigh_params
From: Roland Dreier @ 2006-03-07 5:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: netdev, davem, openib-general
In-Reply-To: <20060306211630.0df79464.akpm@osdl.org>
Andrew> I'm also wondering why a combination of the net and
Andrew> infiniband trees does this:
Andrew> drivers/infiniband/ulp/ipoib/ipoib_multicast.c: In
Andrew> function `ipoib_mcast_free':
Andrew> drivers/infiniband/ulp/ipoib/ipoib_multicast.c:118: error:
Andrew> structure has no member named `destructor'
A cock-up on my part. I think I fixed this when testing the patch,
and promptly forgot about doing it. Then when I reposted the patch to
netdev, I decided to be very careful about things, so I went back to
pristine sources from Michael's message, which of course is missing
the following chunk:
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index fde442a..93c462e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -115,7 +115,6 @@ static void ipoib_mcast_free(struct ipoi
if (neigh->ah)
ipoib_put_ah(neigh->ah);
*to_ipoib_neigh(neigh->neighbour) = NULL;
- neigh->neighbour->ops->destructor = NULL;
kfree(neigh);
}
^ permalink raw reply related
* Re: [PATCH] net: Move destructor from neigh->ops to neigh_params
From: David S. Miller @ 2006-03-07 6:11 UTC (permalink / raw)
To: rdreier; +Cc: akpm, netdev, openib-general
In-Reply-To: <ada7j7624h9.fsf@cisco.com>
From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 06 Mar 2006 21:44:50 -0800
> Roland> struct neigh_ops currently has a destructor field, which
> Roland> no in-kernel drivers outside of infiniband use.
>
> Andrew> net/atm/clip.c begs to disagree.
>
> err... my fault for trusting the patch changelog and not
> double-checking. I think the fix is as simple as, although, given my
> lack of ATM gear, this is untested:
Looks good to me, applied.
^ permalink raw reply
* Re: [PATCH 0/8] Intel I/O Acceleration Technology (I/OAT)
From: Evgeniy Polyakov @ 2006-03-07 7:44 UTC (permalink / raw)
To: Ingo Oeser
Cc: David S. Miller, jengelh, christopher.leech, linux-kernel, netdev
In-Reply-To: <200603061844.07439.netdev@axxeo.de>
On Mon, Mar 06, 2006 at 06:44:07PM +0100, Ingo Oeser (netdev@axxeo.de) wrote:
> Evgeniy Polyakov wrote:
> > On Sat, Mar 04, 2006 at 01:41:44PM -0800, David S. Miller (davem@davemloft.net) wrote:
> > > From: Jan Engelhardt <jengelh@linux01.gwdg.de>
> > > Date: Sat, 4 Mar 2006 19:46:22 +0100 (MET)
> > >
> > > > Does this buy the normal standard desktop user anything?
> > >
> > > Absolutely, it optimizes end-node performance.
> >
> > It really depends on how it is used.
> > According to investigation made for kevent based FS AIO reading,
> > get_user_pages() performange graph looks like sqrt() function
>
> Hmm, so I should resurrect my user page table walker abstraction?
>
> There I would hand each page to a "recording" function, which
> can drop the page from the collection or coalesce it in the collector
> if your scatter gather implementation allows it.
It depends on where performance growth is stopped.
>From the first glance it does not look like find_extend_vma(),
probably follow_page() fault and thus __handle_mm_fault().
I can not say actually, but if it is true and performance growth is
stopped due to increased number of faults and it's processing,
your approach will hit this problem too, doesn't it?
> Regards
>
> Ingo Oeser
--
Evgeniy Polyakov
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox