* [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
From: Or Gerlitz @ 2010-05-05 14:31 UTC
To: Roland Dreier; +Cc: linux-rdma, Mike Christie, Alexander Nezhinsky, Yaron Haviv
In-Reply-To: <Pine.LNX.4.64.1005051726110.29957-aDiYczhfhVLdX2U7gxhm1tBPR1lH4CV8@public.gmane.org>
The iser connection teardown flow isn't over until the underlying
Connection Manager (e.g. the IB CM) delivers a disconnected or timeout
event through the RDMA-CM. When the remote (target) side isn't reachable,
e.g. when some HW (port/HCA/switch) isn't functioning or has been taken
down administratively, the CM timeout flow is used and the event may be
generated only after a relatively long time, on the order of tens of seconds.
The current iser code exposes this possibly long delay to higher layers,
specifically to the iscsid daemon and iscsi kernel stack. As a result,
the iscsi stack doesn't respond well, to the extent of this low-level CM
delay being added to the fail-over time under HA schemes such as the one
provided by DM multipath through the multipathd(8) service.
This patch enhances the reference counting scheme on iser's IB
connections such that the disconnect flow initiated by iscsid from
user space (ep_disconnect) doesn't wait for the CM to deliver the
disconnect/timeout event. On the other hand, from iser's viewpoint the
connection teardown isn't done until the event is delivered.
The iser ib (rdma) connection object is destroyed when its reference
count reaches zero. When this happens in the RDMA-CM callback context,
extra care is taken so that the RDMA-CM does the actual destroying
of the associated ID, as doing it in the callback is prohibited.
The reference count of an iser ib connection would normally reach
three, where the <ref, deref> relations are:
1. conn <init, terminate>
2. conn <bind, stop/destroy>
3. cma id <create, disconnect/error/timeout callbacks>
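The three <ref, deref> pairs above can be sketched in user-space C. This is only an illustration under stated assumptions, not the driver code: `struct conn`, `conn_init`, `conn_get`, `conn_put`, `conn_lifecycle_demo` and the `id_destroyed_by_us` field are hypothetical names, and C11 atomics stand in for the kernel's atomic_t. As in the patch, the put routine reports whether it dropped the last reference, and a caller running in CM callback context passes 0 for can_destroy_id so that the release path leaves the ID alone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for iser's IB connection object. */
struct conn {
	atomic_int refcount;
	bool released;            /* set once the last reference is dropped */
	bool id_destroyed_by_us;  /* true if the release path destroyed the CM ID */
};

static void conn_init(struct conn *c)
{
	atomic_init(&c->refcount, 1);   /* ref 1: conn init */
	c->released = false;
	c->id_destroyed_by_us = false;
}

static void conn_get(struct conn *c)
{
	atomic_fetch_add(&c->refcount, 1);
}

/* Returns 1 if this call dropped the last reference, 0 otherwise. */
static int conn_put(struct conn *c, int can_destroy_id)
{
	/* atomic_fetch_sub returns the value held before the decrement */
	if (atomic_fetch_sub(&c->refcount, 1) == 1) {
		c->released = true;
		c->id_destroyed_by_us = (can_destroy_id != 0);
		return 1;
	}
	return 0;
}

/*
 * Walks the three <ref, deref> pairs with the CM event arriving last,
 * and returns the result of the final put.
 */
static int conn_lifecycle_demo(struct conn *c)
{
	int r1, r2;

	conn_init(c);               /* 1. conn init */
	conn_get(c);                /* 3. cma id create */
	conn_get(c);                /* 2. conn bind */
	assert(atomic_load(&c->refcount) == 3);

	r1 = conn_put(c, 1);        /* 2. conn stop/destroy */
	r2 = conn_put(c, 1);        /* 1. conn terminate, no waiting on the CM */
	assert(r1 == 0 && r2 == 0);

	return conn_put(c, 0);      /* 3. late disconnect/timeout callback */
}
```

In this ordering the last reference is dropped from the callback, so the final put returns 1 without touching the ID; a handler can pass that value back to the RDMA-CM, which destroys the ID when its event handler returns non-zero.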
Signed-off-by: Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org>
---
with this patch, multipath fail-over time is about 30 seconds,
as seen here when a dd over the multipath device is run
before/during/after the fail-over
regularly, before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.926 s, 1.0 GB/s
taking a port down, causing fail-over during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 46.6117 s, 369 MB/s
after path-failure, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6474 s, 1.0 GB/s
13:00:09 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
13:00:24 connection8:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
13:00:24 connection8:0: detected conn error (1011)
13:00:24 iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)
13:00:39 cto-1 kernel: device-mapper: multipath: Failing path 8:48.
13:00:39 cto-1 multipathd: 8:48: mark as failed
13:00:39 cto-1 multipathd: mpathd: remaining active paths: 1
--> the disconnected event is delivered after the IB CM timeout expires
--> but fail-over doesn't pend on this
13:01:56 iser: iser_cma_handler:event 10 status 0 conn ffff88022dcb39b0 id ffff88022cf09400
without this patch, multipath fail-over time is about 130 seconds
before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6812 s, 1.0 GB/s
taking a port down during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 145.094 s, 118 MB/s
after fail-over, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.8935 s, 1.0 GB/s
14:24:05 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
14:24:20 connection4:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
14:24:20 kernel: connection4:0: detected conn error (1011)
14:24:21 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
--> the disconnected event is delivered after the IB CM timeout expires
--> fail-over pending on this
14:25:59 iser: iser_cma_handler:event 10 conn ffff88022625a1b0 id ffff880222537c00
14:26:14 session4: session recovery timed out after 15 secs
14:26:14 device-mapper: multipath: Failing path 8:64.
14:26:14 multipathd: mpathd: remaining active paths: 1
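The fail-over times quoted in these notes can be read off the log timestamps, from the port-down async event to multipathd reporting the remaining active path. A small sketch of that arithmetic follows; `hms_to_seconds` and `seconds_between` are hypothetical helper names, and the helpers assume both timestamps fall on the same day.

```c
#include <assert.h>
#include <stdio.h>

/* Converts "HH:MM:SS" to seconds since midnight; returns -1 if malformed. */
static int hms_to_seconds(const char *hms)
{
	int h, m, s;

	if (sscanf(hms, "%d:%d:%d", &h, &m, &s) != 3)
		return -1;
	return h * 3600 + m * 60 + s;
}

/* Elapsed seconds from start to end, both on the same day. */
static int seconds_between(const char *start, const char *end)
{
	int a = hms_to_seconds(start);
	int b = hms_to_seconds(end);

	assert(a >= 0 && b >= 0);   /* both timestamps must parse */
	return b - a;
}
```

With the timestamps above this gives 30 seconds with the patch (13:00:09 to 13:00:39) and 129 seconds without it (14:24:05 to 14:26:14).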
drivers/infiniband/ulp/iser/iscsi_iser.c | 9 ++-
drivers/infiniband/ulp/iser/iscsi_iser.h | 3 -
drivers/infiniband/ulp/iser/iser_verbs.c | 72 +++++++++++++++++--------------
3 files changed, 46 insertions(+), 38 deletions(-)
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -238,7 +238,7 @@ alloc_err:
* releases the FMR pool, QP and CMA ID objects, returns 0 on success,
* -1 on failure
*/
-static int iser_free_ib_conn_res(struct iser_conn *ib_conn)
+static int iser_free_ib_conn_res(struct iser_conn *ib_conn, int can_destroy_id)
{
BUG_ON(ib_conn == NULL);
@@ -253,7 +253,8 @@ static int iser_free_ib_conn_res(struct
if (ib_conn->qp != NULL)
rdma_destroy_qp(ib_conn->cma_id);
- if (ib_conn->cma_id != NULL)
+ /* if cma handler context, the caller acts s.t the cma destroy the id */
+ if (ib_conn->cma_id != NULL && can_destroy_id)
rdma_destroy_id(ib_conn->cma_id);
ib_conn->fmr_pool = NULL;
@@ -331,7 +332,7 @@ static int iser_conn_state_comp_exch(str
/**
* Frees all conn objects and deallocs conn descriptor
*/
-static void iser_conn_release(struct iser_conn *ib_conn)
+static void iser_conn_release(struct iser_conn *ib_conn, int can_destroy_id)
{
struct iser_device *device = ib_conn->device;
@@ -341,7 +342,7 @@ static void iser_conn_release(struct ise
list_del(&ib_conn->conn_list);
mutex_unlock(&ig.connlist_mutex);
iser_free_rx_descriptors(ib_conn);
- iser_free_ib_conn_res(ib_conn);
+ iser_free_ib_conn_res(ib_conn, can_destroy_id);
ib_conn->device = NULL;
/* on EVENT_ADDR_ERROR there's no device yet for this conn */
if (device != NULL)
@@ -354,10 +355,13 @@ void iser_conn_get(struct iser_conn *ib_
atomic_inc(&ib_conn->refcount);
}
-void iser_conn_put(struct iser_conn *ib_conn)
+int iser_conn_put(struct iser_conn *ib_conn, int can_destroy_id)
{
- if (atomic_dec_and_test(&ib_conn->refcount))
- iser_conn_release(ib_conn);
+ if (atomic_dec_and_test(&ib_conn->refcount)) {
+ iser_conn_release(ib_conn, can_destroy_id);
+ return 1;
+ }
+ return 0;
}
/**
@@ -381,19 +385,20 @@ void iser_conn_terminate(struct iser_con
wait_event_interruptible(ib_conn->wait,
ib_conn->state == ISER_CONN_DOWN);
- iser_conn_put(ib_conn);
+ iser_conn_put(ib_conn, 1); /* deref ib conn deallocate */
}
-static void iser_connect_error(struct rdma_cm_id *cma_id)
+static int iser_connect_error(struct rdma_cm_id *cma_id)
{
struct iser_conn *ib_conn;
ib_conn = (struct iser_conn *)cma_id->context;
ib_conn->state = ISER_CONN_DOWN;
wake_up_interruptible(&ib_conn->wait);
+ return iser_conn_put(ib_conn, 0); /* deref ib conn's cma id */
}
-static void iser_addr_handler(struct rdma_cm_id *cma_id)
+static int iser_addr_handler(struct rdma_cm_id *cma_id)
{
struct iser_device *device;
struct iser_conn *ib_conn;
@@ -402,8 +407,7 @@ static void iser_addr_handler(struct rdm
device = iser_device_find_by_ib_device(cma_id);
if (!device) {
iser_err("device lookup/creation failed\n");
- iser_connect_error(cma_id);
- return;
+ return iser_connect_error(cma_id);
}
ib_conn = (struct iser_conn *)cma_id->context;
@@ -412,11 +416,13 @@ static void iser_addr_handler(struct rdm
ret = rdma_resolve_route(cma_id, 1000);
if (ret) {
iser_err("resolve route failed: %d\n", ret);
- iser_connect_error(cma_id);
+ return iser_connect_error(cma_id);
}
+
+ return 0;
}
-static void iser_route_handler(struct rdma_cm_id *cma_id)
+static int iser_route_handler(struct rdma_cm_id *cma_id)
{
struct rdma_conn_param conn_param;
int ret;
@@ -437,9 +443,9 @@ static void iser_route_handler(struct rd
goto failure;
}
- return;
+ return 0;
failure:
- iser_connect_error(cma_id);
+ return iser_connect_error(cma_id);
}
static void iser_connected_handler(struct rdma_cm_id *cma_id)
@@ -451,12 +457,12 @@ static void iser_connected_handler(struc
wake_up_interruptible(&ib_conn->wait);
}
-static void iser_disconnected_handler(struct rdma_cm_id *cma_id)
+static int iser_disconnected_handler(struct rdma_cm_id *cma_id)
{
struct iser_conn *ib_conn;
+ int ret;
ib_conn = (struct iser_conn *)cma_id->context;
- ib_conn->disc_evt_flag = 1;
/* getting here when the state is UP means that the conn is being *
* terminated asynchronously from the iSCSI layer's perspective. */
@@ -471,20 +477,24 @@ static void iser_disconnected_handler(st
ib_conn->state = ISER_CONN_DOWN;
wake_up_interruptible(&ib_conn->wait);
}
+
+ ret = iser_conn_put(ib_conn, 0); /* deref ib conn's cma id */
+ return ret;
}
static int iser_cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
{
int ret = 0;
- iser_err("event %d conn %p id %p\n",event->event,cma_id->context,cma_id);
+ iser_err("event %d status %d conn %p id %p\n",
+ event->event, event->status, cma_id->context, cma_id);
switch (event->event) {
case RDMA_CM_EVENT_ADDR_RESOLVED:
- iser_addr_handler(cma_id);
+ ret = iser_addr_handler(cma_id);
break;
case RDMA_CM_EVENT_ROUTE_RESOLVED:
- iser_route_handler(cma_id);
+ ret = iser_route_handler(cma_id);
break;
case RDMA_CM_EVENT_ESTABLISHED:
iser_connected_handler(cma_id);
@@ -494,13 +504,12 @@ static int iser_cma_handler(struct rdma_
case RDMA_CM_EVENT_CONNECT_ERROR:
case RDMA_CM_EVENT_UNREACHABLE:
case RDMA_CM_EVENT_REJECTED:
- iser_err("event: %d, error: %d\n", event->event, event->status);
- iser_connect_error(cma_id);
+ ret = iser_connect_error(cma_id);
break;
case RDMA_CM_EVENT_DISCONNECTED:
case RDMA_CM_EVENT_DEVICE_REMOVAL:
case RDMA_CM_EVENT_ADDR_CHANGE:
- iser_disconnected_handler(cma_id);
+ ret = iser_disconnected_handler(cma_id);
break;
default:
iser_err("Unexpected RDMA CM event (%d)\n", event->event);
@@ -515,7 +524,7 @@ void iser_conn_init(struct iser_conn *ib
init_waitqueue_head(&ib_conn->wait);
ib_conn->post_recv_buf_count = 0;
atomic_set(&ib_conn->post_send_buf_count, 0);
- atomic_set(&ib_conn->refcount, 1);
+ atomic_set(&ib_conn->refcount, 1); /* ref ib conn allocation */
INIT_LIST_HEAD(&ib_conn->conn_list);
spin_lock_init(&ib_conn->lock);
}
@@ -543,6 +552,7 @@ int iser_connect(struct iser_conn *ib_
ib_conn->state = ISER_CONN_PENDING;
+ iser_conn_get(ib_conn); /* ref ib conn's cma id */
ib_conn->cma_id = rdma_create_id(iser_cma_handler,
(void *)ib_conn,
RDMA_PS_TCP);
@@ -580,7 +590,7 @@ id_failure:
addr_failure:
ib_conn->state = ISER_CONN_DOWN;
connect_failure:
- iser_conn_release(ib_conn);
+ iser_conn_release(ib_conn, 1);
return err;
}
@@ -749,12 +759,10 @@ static void iser_handle_comp_error(struc
iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn,
ISCSI_ERR_CONN_FAILED);
- /* complete the termination process if disconnect event was delivered *
- * note there are no more non completed posts to the QP */
- if (ib_conn->disc_evt_flag) {
- ib_conn->state = ISER_CONN_DOWN;
- wake_up_interruptible(&ib_conn->wait);
- }
+ /* no more non completed posts to the QP, complete the
+ * termination process w.o worrying on disconnect event */
+ ib_conn->state = ISER_CONN_DOWN;
+ wake_up_interruptible(&ib_conn->wait);
}
}
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -325,7 +325,7 @@ iscsi_iser_conn_destroy(struct iscsi_cls
*/
if (ib_conn) {
ib_conn->iser_conn = NULL;
- iser_conn_put(ib_conn);
+ iser_conn_put(ib_conn, 1); /* deref iscsi/ib conn unbinding */
}
}
@@ -357,11 +357,12 @@ iscsi_iser_conn_bind(struct iscsi_cls_se
/* binds the iSER connection retrieved from the previously
* connected ep_handle to the iSCSI layer connection. exchanges
* connection pointers */
- iser_err("binding iscsi conn %p to iser_conn %p\n",conn,ib_conn);
+ iser_err("binding iscsi/iser conn %p %p to ib_conn %p\n",
+ conn, conn->dd_data, ib_conn);
iser_conn = conn->dd_data;
ib_conn->iser_conn = iser_conn;
iser_conn->ib_conn = ib_conn;
- iser_conn_get(ib_conn);
+ iser_conn_get(ib_conn); /* ref iscsi/ib conn binding */
return 0;
}
@@ -382,7 +383,7 @@ iscsi_iser_conn_stop(struct iscsi_cls_co
* There is no unbind event so the stop callback
* must release the ref from the bind.
*/
- iser_conn_put(ib_conn);
+ iser_conn_put(ib_conn, 1); /* deref iscsi/ib conn unbinding */
}
iser_conn->ib_conn = NULL;
}
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.h
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -247,7 +247,6 @@ struct iser_conn {
struct rdma_cm_id *cma_id; /* CMA ID */
struct ib_qp *qp; /* QP */
struct ib_fmr_pool *fmr_pool; /* pool of IB FMRs */
- int disc_evt_flag; /* disconn event delivered */
wait_queue_head_t wait; /* waitq for conn/disconn */
int post_recv_buf_count; /* posted rx count */
atomic_t post_send_buf_count; /* posted tx count */
@@ -321,7 +320,7 @@ void iser_conn_init(struct iser_conn *ib
void iser_conn_get(struct iser_conn *ib_conn);
-void iser_conn_put(struct iser_conn *ib_conn);
+int iser_conn_put(struct iser_conn *ib_conn, int destroy_cma_id_allowed);
void iser_conn_terminate(struct iser_conn *ib_conn);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [PATCH 2/3] ib/iser: remove buggy back-pointer setting
From: Or Gerlitz @ 2010-05-05 14:30 UTC
To: Roland Dreier; +Cc: linux-rdma, Mike Christie
In-Reply-To: <Pine.LNX.4.64.1005051726110.29957-aDiYczhfhVLdX2U7gxhm1tBPR1lH4CV8@public.gmane.org>
The iscsi connection object life cycle includes binding and unbinding
(conn_stop) to/from the iscsi transport connection object. Since
iscsi connection objects are recycled, by the time the transport
connection (e.g. iser's ib connection) is released it is illegal
to touch the iscsi connection through the transport's back-pointer, as
it may already point to a different transport connection.
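The hazard described above can be reproduced in a few lines of user-space C. This is a sketch under assumptions, not the iser code: `upper_conn` plays the role of the recycled iscsi connection, `transport_conn` the role of iser's ib connection, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct transport_conn;

/* Hypothetical recycled upper-layer connection (role of the iscsi conn). */
struct upper_conn {
	struct transport_conn *tconn;
};

/* Hypothetical transport connection (role of iser's ib conn). */
struct transport_conn {
	struct upper_conn *owner;   /* back-pointer, goes stale after stop */
};

static void bind_conn(struct upper_conn *u, struct transport_conn *t)
{
	u->tconn = t;
	t->owner = u;
}

/* Unbinds on the upper side only; t->owner is left behind. */
static void stop_conn(struct upper_conn *u)
{
	u->tconn = NULL;
}

static void release_transport(struct transport_conn *t, bool touch_owner)
{
	if (touch_owner && t->owner)
		t->owner->tconn = NULL;   /* follows a possibly stale back-pointer */
	t->owner = NULL;
}

/* Returns true if the recycled upper conn still points at its new transport. */
static bool rebinding_survives_release(bool touch_owner)
{
	struct upper_conn u = { NULL };
	struct transport_conn t1 = { NULL }, t2 = { NULL };

	bind_conn(&u, &t1);
	stop_conn(&u);
	bind_conn(&u, &t2);                  /* u is recycled for a new transport */
	release_transport(&t1, touch_owner); /* late release of the old transport */
	return u.tconn == &t2;
}
```

When the release path writes through the back-pointer, the new binding to `t2` is wiped out; leaving the back-pointer alone keeps it intact.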
Signed-off-by: Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/ulp/iser/iser_verbs.c | 2 --
1 file changed, 2 deletions(-)
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -346,8 +346,6 @@ static void iser_conn_release(struct ise
/* on EVENT_ADDR_ERROR there's no device yet for this conn */
if (device != NULL)
iser_device_try_release(device);
- if (ib_conn->iser_conn)
- ib_conn->iser_conn->ib_conn = NULL;
iscsi_destroy_endpoint(ib_conn->ep);
}
* [PATCH 1/3] ib/iser: add event handler
From: Or Gerlitz @ 2010-05-05 14:30 UTC
To: Roland Dreier; +Cc: linux-rdma, Mike Christie
In-Reply-To: <Pine.LNX.4.64.1005051726110.29957-aDiYczhfhVLdX2U7gxhm1tBPR1lH4CV8@public.gmane.org>
Add a handler that logs events such as port up and down; this is useful
when testing high-availability schemes such as multi-pathing.
Signed-off-by: Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org>
---
drivers/infiniband/ulp/iser/iscsi_iser.h | 1 +
drivers/infiniband/ulp/iser/iser_verbs.c | 16 +++++++++++++++-
2 files changed, 16 insertions(+), 1 deletion(-)
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.h
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -232,6 +232,7 @@ struct iser_device {
struct ib_cq *tx_cq;
struct ib_mr *mr;
struct tasklet_struct cq_tasklet;
+ struct ib_event_handler event_handler;
struct list_head ig_list; /* entry in ig devices list */
int refcount;
};
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -54,6 +54,13 @@ static void iser_qp_event_callback(struc
iser_err("got qp event %d\n",cause->event);
}
+static void iser_event_handler(struct ib_event_handler *handler,
+ struct ib_event *event)
+{
+ iser_err("async event %d on device %s port %d\n", event->event,
+ event->device->name, event->element.port_num);
+}
+
/**
* iser_create_device_ib_res - creates Protection Domain (PD), Completion
* Queue (CQ), DMA Memory Region (DMA MR) with the device associated with
@@ -96,8 +103,15 @@ static int iser_create_device_ib_res(str
if (IS_ERR(device->mr))
goto dma_mr_err;
+ INIT_IB_EVENT_HANDLER(&device->event_handler, device->ib_device,
+ iser_event_handler);
+ if (ib_register_event_handler(&device->event_handler))
+ goto handler_err;
+
return 0;
+handler_err:
+ ib_dereg_mr(device->mr);
dma_mr_err:
tasklet_kill(&device->cq_tasklet);
cq_arm_err:
@@ -120,7 +134,7 @@ static void iser_free_device_ib_res(stru
BUG_ON(device->mr == NULL);
tasklet_kill(&device->cq_tasklet);
-
+ (void)ib_unregister_event_handler(&device->event_handler);
(void)ib_dereg_mr(device->mr);
(void)ib_destroy_cq(device->tx_cq);
(void)ib_destroy_cq(device->rx_cq);
* [PATCH 0/3 for-2.6.35] ib/iser: fix multipathing over iser, reduce fail-over time
From: Or Gerlitz @ 2010-05-05 14:29 UTC
To: Roland Dreier; +Cc: linux-rdma, Mike Christie
Roland,
This patch series fixes DM multipath over iscsi/iser and reduces its
fail-over time; the core patch is #3.
Or.
* RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
From: Walukiewicz, Miroslaw @ 2010-05-05 13:42 UTC
To: Steve Wise
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <4BE06425.6000104-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Steve,
> ud_post_send and friends implements the transmit path for IMA. Our RAW ETH QP needs access to physical addresses from user space. Due to security reasons we should make a virtual-to-physical address translation in kernel.
>
>
Steve Wise wrote:
> But why couldn't you just use the normal memory registration paths? IE
> the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs
> in the ibv_post_send() which could do kernel bypass.
I see some misunderstanding here. Let me explain better how our transmit path works.
In our implementation we use the normal memory registration path via ibv_reg_mr(), and we use ibv_post_send() with lkey/vaddr/len.
The implementation of ibv_post_send (nes_post_send in libnes) for the RAW QP passes the lkey/virtual_addr/len information to our kernel device driver (ud_post_send) through a shared page. There is no data copy here, and the driver is used only for fast synchronization.
Because our RAW ETH QP must use physical addresses only, ud_post_send() in the kernel performs the virtual-to-physical address translation and accesses the QP HW for packet transmission. The packet buffer memory has previously been registered and pinned by ibv_reg_mr(), which provides the information needed for that translation.
Steve Wise wrote:
> Seems like maybe you could fix the non-bypass post_send/recv paths
> instead of implementing an entirely new user<->kernel interface...
The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is shared with all other user-kernel communication and it is quite complex. It is a perfect path for QP/CQ/PD/mem management, but for me it is too complex for traffic acceleration.
The user<->kernel path through an additional driver, with a shared page for passing lkey/vaddr/len and SW memory translation in the kernel, is much more effective.
Maybe it is a good idea to make that API more official after some kind of standardization. Our tests proved that it works. We achieved a twofold improvement in performance and latency. That could open the way for adding some non-RDMA devices to the set of devices supported by the OFED API.
Regards,
Mirek
-----Original Message-----
From: Steve Wise [mailto:swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org]
Sent: Tuesday, May 04, 2010 8:15 PM
To: Walukiewicz, Miroslaw
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
Walukiewicz, Miroslaw wrote:
> Hello Steve,
>
> Our HW QP is not a UD type QP but an L2 raw QP. In the verbs API there is an assumption that the user provides only a data payload for TX and similarly receives only a payload. The protocol headers (in the case of UD: MAC/IP/UDP) are attached by HW.
>
> Our QP implementation in HW does not provide such a possibility of attaching headers by HW for UD traffic, so for multicast acceleration we chose the L2 raw path. It adds some overhead for the user application but it is still a zero-copy approach.
>
> I thought about simulating the UD path using the L2 raw QP to get the same result as for a true UD QP (the user handles a payload only). Such an approach costs an additional copy of the payload in SW, because the headers and then the payload must be placed into a single tx buffer. The situation is similar for rx: the payload must be copied to the posted buffers, or the data must be provided at some offset.
>
> ud_post_send and friends implements the transmit path for IMA. Our RAW ETH QP needs access to physical addresses from user space. Due to security reasons we should make a virtual-to-physical address translation in kernel.
>
>
But why couldn't you just use the normal memory registration paths? IE
the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs
in the ibv_post_send() which could do kernel bypass.
> Unfortunately the OFED path for ibv_post_send that goes into the kernel is quite slow due to a number of dynamic memory allocations in the path. We chose to create our own private post_send channel, using ud_post_send and friends, to increase tx bandwidth.
Seems like maybe you could fix the non-bypass post_send/recv paths
instead of implementing an entirely new user<->kernel interface...
Steve.
>
>
> Regards,
>
> Mirek
>
> -----Original Message-----
> From: Steve Wise [mailto:swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org]
> Sent: Tuesday, May 04, 2010 7:19 PM
> To: Walukiewicz, Miroslaw
> Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
>
> Hey Mirek,
>
> It looks like this patch adds a new file interface for a UD service.
> Why didn't you extend the existing UD interface as needed?
>
> What IO is supported with these changes? IMA via the raw QP, but what
> are ud_post_send() and friends used for?
>
>
> Steve.
>
>
>
> miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>
>> This patch implements iWarp multicast acceleration (IMA)
>> over IB_QPT_RAW_ETY QP type in nes driver.
>>
>> Application creates a raw eth QP (IBV_QPT_RAW_ETH in user-space) and
>> manages the multicast via ibv_attach_mcast and ibv_detach_mcast calls.
>>
>> Calling ibv_attach_mcast/ibv_detach_mcast has an effect of
>> enabling/disabling L2 MAC address filters in HW.
>>
>> Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>
>>
>>
>>
* Re: [PATCH 4/4] mlx4: implement XRC RCV qp's
From: Jack Morgenstein @ 2010-05-05 11:44 UTC
To: Roland Dreier
Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Tziporet Koren
In-Reply-To: <adar5lyxc01.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Thursday 29 April 2010 23:03, Roland Dreier wrote:
> > @@ -175,7 +175,7 @@ struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev, int entries, int vector
> > if (entries < 1 || entries > dev->dev->caps.max_cqes)
> > return ERR_PTR(-EINVAL);
> >
> > - cq = kmalloc(sizeof *cq, GFP_KERNEL);
> > + cq = kzalloc(sizeof *cq, GFP_KERNEL);
>
> What's the reason for this change?
Because mlx4_ib_create_cq is used in mlx4_ib_alloc_xrcd (to allocate the dummy cq), and I did not
want to worry about uninitialized data being present in the struct ib_cq portion of struct mlx4_ib_cq.
> > @@ -477,23 +483,51 @@ static struct ib_xrcd *mlx4_ib_alloc_xrcd(struct ib_device *ibdev,
> > + pd = mlx4_ib_alloc_pd(ibdev, NULL, NULL);
> > + cq = mlx4_ib_create_cq(ibdev, 1, 0, NULL, NULL);
>
> Why does every xrcd get a PD and a CQ now? Just in case someone wants
> to create a rcv QP?
Yes. This way, we have a dummy PD and CQ available to satisfy the ConnectX requirement
that every QP have a PD and CQ.
> (The spec is unclear on this -- for "create XRC
> target QP" it says "A set of initial QP attributes must be specified by
> the Consumer," but then doesn't mention anything in the input modifiers,
> so it's not clear what PD/CQ is supposed to be used)
From the XRC Annex to the IB Spec:
A12.5.2.3 XRC TARGET QP
A12.5.2.3.1 CREATE XRC TARGET QP
Description:
Creates a XRC Target QP for the specified HCA.
A set of initial QP attributes must be specified by the Consumer.
On success, a handle to the newly created XRC QP and the XRC QP
number are returned.
Input Modifiers:
Same as RC QP except for:
==> no SQ or RQ or SRQ (WQEs, num of s/g elements, CQs, signaling type, etc.)
no need for initiator resources (initiator depth)
==> no PD
add XRC domain to be associated with this QP.
> > + (1ull << IB_USER_VERBS_CMD_DESTROY_XRC_RCV_QP);
> > }
> >
> > -
>
> This seems to be repairing whitespace damage from the previous patch.
I will fix both patches.
> > +int mlx4_ib_reg_xrc_rcv_qp(struct ib_xrcd *xrcd, void *context, u32 qp_num)
>
> > + mutex_lock(&mibqp->mutex);
> > + list_for_each_entry(tmp, &mibqp->xrc_reg_list, list)
> > + if (tmp->context == context) {
> > + mutex_unlock(&mibqp->mutex);
> > + kfree(ctx_entry);
> > + mutex_unlock(&to_mdev(xrcd->device)->xrc_reg_mutex);
> > + return 0;
> > + }
> > +
> > + ctx_entry->context = context;
> > + list_add_tail(&ctx_entry->list, &mibqp->xrc_reg_list);
> > + mutex_unlock(&mibqp->mutex);
>
> This list handling looks completely generic and is what I was saying
> should probably be in the core uverbs module.
This list is used to send async events for this xrc rcv QP to all processes using
the QP (either registered for the qp, or the creating process).
Moving this to the core layer would require significant modifications. I responded to
this in my response to your first mail.
> > +int mlx4_ib_query_xrc_rcv_qp(struct ib_xrcd *ibxrcd, u32 qp_num,
> > + struct ib_qp_attr *qp_attr, int qp_attr_mask,
> > + struct ib_qp_init_attr *qp_init_attr)
>
> Virtually all of this function seems identical to the existing query QP
> operation. We should avoid the mass duplication of code.
>
> Also I'm not clear why this function takes a qp_num instead of a QP
> handle. Why does the consumer have to pass in the XRCD? The IB spec
> XRC annex just shows the QP handle as input to this verb. Is it because
> the reg_xrc_rcv_qp doesn't give a QP handle back?
Yes, I considered the handle to be the pair (xrc domain, qp_number).
>
> Finally, in this implementation, what happens if the consumer passes in
> a QP that isn't an XRC rcv QP?
>
> > + mqp = __mlx4_qp_lookup(dev->dev, qp_num);
> > + if (unlikely(!mqp)) {
> > + printk(KERN_WARNING "mlx4_ib_reg_xrc_rcv_qp: "
> > + "unknown QPN %06x\n", qp_num);
> > + goto err_out;
> > + }
> > +
> > + qp = to_mibqp(mqp);
> > + if (xrcd->xrcdn != to_mxrcd(qp->ibqp.xrcd)->xrcdn)
> > + goto err_out;
>
> In other words is that dereference of ->xrcdn safe if ibqp.xrcd is not set?
> (And the error message talks about reg_xrc_rcv_qp instead of query)
You are correct -- I will fix both of these (error msg and unprotected dereference).
* [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
From: Eli Cohen @ 2010-05-05 11:30 UTC
To: Roland Dreier; +Cc: Linux RDMA list, ewg
The current code requests num_possible_cpus() + 1 MSIX vectors. However,
num_possible_cpus() is the maximum number of CPUs supported by the kernel.
We should use num_online_cpus(), which is the number of CPUs actually
available to the system.
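A user-space analog of this sizing rule can be sketched as follows. It is an illustration only, not the mlx4 code: `vectors_to_request` and `vectors_for_online_cpus` are hypothetical names, `available_eqs` stands for num_eqs minus reserved_eqs, and sysconf(_SC_NPROCESSORS_ONLN) is a widely available (glibc, BSD, macOS) but non-POSIX query that is assumed here as a stand-in for num_online_cpus().

```c
#include <assert.h>
#include <unistd.h>

/* One vector per CPU plus one extra, capped by the event queues available. */
static long vectors_to_request(long available_eqs, long ncpus)
{
	long want = ncpus + 1;

	return want < available_eqs ? want : available_eqs;
}

/* Same rule, sized by the CPUs currently online. */
static long vectors_for_online_cpus(long available_eqs)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);

	if (n < 1)
		n = 1;   /* assume one CPU if the query is unsupported */
	return vectors_to_request(available_eqs, n);
}
```

On a machine with 8 CPUs online and 64 usable event queues, this asks for 9 vectors; sizing by a larger "possible" count would only inflate the request.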
Signed-off-by: Eli Cohen <eli-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
---
drivers/net/mlx4/main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index e3e0d54..0559df4 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -969,7 +969,7 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
if (msi_x) {
nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
- num_possible_cpus() + 1);
+ num_online_cpus() + 1);
entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
if (!entries)
goto no_msi;
--
1.7.1
* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Jack Morgenstein @ 2010-05-05 6:45 UTC
To: Roland Dreier
Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Tziporet Koren, diego-VPRAkNaXOzVS1MOuV/RT9w
In-Reply-To: <adavdbaxcs1.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Thursday 29 April 2010 22:46, Roland Dreier wrote:
> > Note that for users who do not wish to utilize the reg/unreg verbs,
> > a destroy_xrc_rcv_qp verb is also provided. Thus, usage is:
> > Either: create/destroy_xrc_rcv_qp
> > Or: create/reg/unreg_xrc_rcv_qp (the unreg is used in place of destroy)
>
> I don't really understand the semantics here. What is supposed to
> happen if I do create/reg/destroy? What happens if one process does
> destroy while another process is still registered?
Maybe we can simply assert that the unreg IS the destroy method of the
IB_SPEC, and get rid of the destroy method.
The xrc target qp section of the spec was not written with QP persistence
(after the creating process exited) in mind. That requirement surfaced
at the last minute as a result of testing by the MPI community during the
implementation phase (as far as I know). Unfortunately, this created
a semantic problem.
For applications in which the creating process persists until all other
processes which use the XRC RCV QP have finished with it, no reg/unreg is
needed -- and the API that makes the most sense is create/destroy.
For apps which DO need persistence, we also need a reg/unreg for reference
counting. In that situation, destroy_xrc_rcv_qp does not make sense as
a pure destroy -- it functions as unreg_xrc_rcv_qp, since it must function
within the reference counting context (unreg also destroys the QP in the
low-level driver when there are no more references to it). In fact, in this
case the semantics of destroy is identical to the semantics of unreg.
I do not see a clean way out of this mess other than to eliminate the
destroy_xrc_rcv_qp method and claim that the unreg is in fact the destroy
method of the SPEC.
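
The reference-counting semantics described above can be sketched in a few lines of plain C. This is only an illustration, not the ib_core or mlx4 code: the `demo_` names are hypothetical, and it assumes that create leaves the creating process holding the first reference, so that an unreg with no other registrations behaves exactly like a destroy.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an XRC RCV QP; not a real verbs structure. */
struct demo_xrc_rcv_qp {
	unsigned int refcount;	/* creator plus every registered process */
	bool destroyed;		/* set once the low-level QP would be torn down */
};

/* create: assumed to leave the creator holding the first reference. */
static void demo_create(struct demo_xrc_rcv_qp *qp)
{
	qp->refcount = 1;
	qp->destroyed = false;
}

/* reg: another process declares that it is using the QP. */
static int demo_reg(struct demo_xrc_rcv_qp *qp)
{
	if (qp->destroyed)
		return -1;
	qp->refcount++;
	return 0;
}

/*
 * unreg: drop one reference. The QP is destroyed only when the last
 * reference goes away, which is why a separate "pure destroy" has no
 * distinct meaning here. Returns true if this call destroyed the QP.
 */
static bool demo_unreg(struct demo_xrc_rcv_qp *qp)
{
	if (qp->destroyed || qp->refcount == 0)
		return false;
	if (--qp->refcount == 0)
		qp->destroyed = true;
	return qp->destroyed;
}
```

With one extra registration, the creator's unreg leaves the QP alive and the second unreg tears it down, which is the persistence behaviour the MPI use case needs.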
> To make everything
> even more confusing, mlx4 defines unreg_xrc_rcv_qp to be equivalent to
> destroy_xrc_rcv_qp.
I simply noticed that if reg is not used, then the unreg would in fact destroy
the QP. I therefore saw no reason to implement the destroy method separately.
> I'm not even clear why the low-level driver has two
> entry points for these two methods -- shouldn't the uverbs core be
> handling the counting/listing of xrc rcv qps and just ask the low-level
> driver to destroy the QP when it's really done with it?
The uverbs layer DOES handle the counting/listing (see, for example, list_add_tail at the
end of ib_uverbs_create_xrc_rcv_qp).
However, I had an additional problem -- to distribute async events received
for the xrc_rcv QP to all registered processes (so that each could unregister
and allow the QP to be destroyed -- the ref count going to zero).
In my original implementation, the low-level driver was responsible for generating
the events for all the processes. To move this mechanism to the core would require
a fairly extensive re-write. I would need to introduce ib_core methods for create,
reg, unreg, and destroy, since the uverbs layer operates per user process and does
not track across multiple processes. I was concerned that modifications of this
magnitude would introduce instability in XRC, and would require a significant QA cycle.
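
The event distribution problem can be pictured with a small fan-out sketch. It is not the driver code: the `demo_` names, the fixed-size table, and the callback signature are all assumptions made for illustration. The point is only that one async event on the QP has to reach every registered process.

```c
#include <stddef.h>

#define DEMO_MAX_REGS 8	/* arbitrary limit for the sketch */

typedef void (*demo_event_cb)(void *ctx, int event);

/* One entry per registered process (hypothetical layout). */
struct demo_registrant {
	demo_event_cb cb;
	void *ctx;
};

struct demo_qp_events {
	struct demo_registrant regs[DEMO_MAX_REGS];
	size_t nregs;
};

/* Record a registration so later events can reach this process. */
static int demo_events_reg(struct demo_qp_events *q, demo_event_cb cb, void *ctx)
{
	if (q->nregs >= DEMO_MAX_REGS)
		return -1;
	q->regs[q->nregs].cb = cb;
	q->regs[q->nregs].ctx = ctx;
	q->nregs++;
	return 0;
}

/* Deliver one async event to every registrant; returns how many were notified. */
static size_t demo_events_dispatch(const struct demo_qp_events *q, int event)
{
	size_t i;

	for (i = 0; i < q->nregs; i++)
		q->regs[i].cb(q->regs[i].ctx, event);
	return q->nregs;
}

/* Example handler: counts the events seen by one process. */
static void demo_count_event(void *ctx, int event)
{
	(void)event;
	(*(int *)ctx)++;
}
```

Each registrant can then react to the event by unregistering, which lets the reference count drop to zero.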
Finally, I do not believe that it is such a bad thing to have low-level driver
procedures for reg/unreg here. If a given low-level driver has implementation
details that it wishes to record, it should have the opportunity to do so.
>
> (By the way, should we use the name "target QP" instead of "rcv QP" to
> match the actual IB spec?)
I would rather not, since the xrc_rcv QP function names have been in use for 2 years already.
> - R.
* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Jack Morgenstein @ 2010-05-05 5:36 UTC (permalink / raw)
To: Roland Dreier
Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Tziporet Koren
In-Reply-To: <adaljcfmkj9.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Thursday 22 April 2010 21:03, Roland Dreier wrote:
> So I'm looking at merging this, and I'm wondering about one thing.
> Seems like it's just a mistake but I want to make sure I understand
> properly:
>
> > @@ -1078,6 +1079,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
> > goto err_put;
> > }
> >
> > + attr.create_flags = 0;
> > attr.event_handler = ib_uverbs_qp_event_handler;
>
> This looks redundant, because this function already sets create_flags to
> 0 a few lines later. So I think this line is just a remnant from some
> other patch.
You're correct.
>
> But then ib_uverbs_create_xrc_rcv_qp() doesn't set create_flags before
> the call to device->create_xrc_rcv_qp() -- which maybe is OK, since that
> function is not going to look at create_flags right now, but for the
> future we should probably set it to 0, right?
Can't hurt.
> Also it's not 100% clear to me why the low-level driver needs a special
> create_xrc_rcv_qp method, rather than having uverbs just call create_qp
> with the right parameters.
I did not want to have the verbs layer dictate implementation details to the
low-level driver. It is more correct, in my opinion, to have each low-level
driver decide for itself on implementation. Therefore, the separate method.
However, please note that I re-use the qp_create_common() method in
mlx4_ib_create_xrc_rcv_qp.
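
As a rough illustration of that arrangement, here is a sketch of a per-driver method table in which two separate entry points share one common helper. All `demo_` names are hypothetical and the structures are not the real ib_device or mlx4 ones.

```c
/* Hypothetical QP object for the sketch. */
struct demo_qp {
	int is_xrc_rcv;
	int create_flags;
};

/* Shared helper, playing the role that a common create routine would. */
static int demo_create_common(struct demo_qp *qp, int is_xrc_rcv)
{
	qp->is_xrc_rcv = is_xrc_rcv;
	qp->create_flags = 0;	/* always start from a known value */
	return 0;
}

/* Two driver entry points; each decides how to call the helper. */
static int demo_create_qp(struct demo_qp *qp)
{
	return demo_create_common(qp, 0);
}

static int demo_create_xrc_rcv_qp(struct demo_qp *qp)
{
	return demo_create_common(qp, 1);
}

/* Method table that an upper layer would call through. */
struct demo_device_ops {
	int (*create_qp)(struct demo_qp *qp);
	int (*create_xrc_rcv_qp)(struct demo_qp *qp);
};

static const struct demo_device_ops demo_ops = {
	.create_qp = demo_create_qp,
	.create_xrc_rcv_qp = demo_create_xrc_rcv_qp,
};
```

The upper layer only sees the two function pointers; whether they share code is left to the driver.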
> But I haven't looked through carefully to
> see the differences between eg query_xrc_rcv_qp() vs query_qp() methods.
>
Same comment.
> - R.
* ConnectX Vendor_err
From: Pradeep Satyanarayana @ 2010-05-04 19:10 UTC (permalink / raw)
To: linux-rdma
We are seeing some errors like the following:
status 11, op OP_SEND, vendor_err 0x89
Status 11 corresponds to IB_WC_REM_OP_ERR, but what does vendor_err 0x89 imply?
Is there some place where one can get what each of the vendor_err corresponds to?
This is seen with a ConnectX HCA.
Pradeep
* Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
From: Steve Wise @ 2010-05-04 18:15 UTC (permalink / raw)
To: Walukiewicz, Miroslaw
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <BE2BFE91933D1B4089447C64486040801B6784ED-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
Walukiewicz, Miroslaw wrote:
> Hello Steve,
>
> Our HW QP is not a UD type QP but an L2 raw QP. The verbs API assumes that the user provides only a data payload for TX and similarly receives only a payload. The protocol headers (in the UD case, MAC/IP/UDP) are attached by HW.
>
> Our QP implementation in HW cannot attach headers for UD traffic, so for multicast acceleration we chose the L2 raw path. It adds some overhead for the user application, but it is still a zero-copy approach.
>
> I thought about simulating the UD path on top of the L2 raw QP to get the same result as a true UD QP (the user handles only a payload). Such an approach costs an additional copy of the payload in SW, because the headers and then the payload must be placed in a single TX buffer. The situation is similar for RX: the payload must either be copied to the posted buffers or delivered at some offset.
>
> ud_post_send and friends implement the transmit path for IMA. Our RAW ETH QP needs access to physical addresses from user space. For security reasons, the virtual-to-physical address translation should be done in the kernel.
>
>
But why couldn't you just use the normal memory registration paths? IE
the user mode app does ibv_reg_mr() and then uses lkey/addr/len in SGEs
in the ibv_post_send() which could do kernel bypass.
> Unfortunately, the OFED ibv_post_send path that goes through the kernel is quite slow because of a number of dynamic memory allocations along the way. We chose to create our own private post_send channel, using ud_post_send and friends, to increase TX bandwidth.
Seems like maybe you could fix the non-bypass post_send/recv paths
instead of implementing an entirely new user<->kernel interface...
Steve.
>
>
> Regards,
>
> Mirek
>
> -----Original Message-----
> From: Steve Wise [mailto:swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org]
> Sent: Tuesday, May 04, 2010 7:19 PM
> To: Walukiewicz, Miroslaw
> Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
>
> Hey Mirek,
>
> It looks like this patch adds a new file interface for a UD service.
> Why didn't you extend the existing UD interface as needed?
>
> What IO is supported with these changes? IMA via the raw QP, but what
> are ud_post_send() and friends used for?
>
>
> Steve.
>
>
>
> miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>
>> This patch implements iWarp multicast acceleration (IMA)
>> over IB_QPT_RAW_ETY QP type in nes driver.
>>
>> Application creates a raw eth QP (IBV_QPT_RAW_ETH in user-space) and
>> manages the multicast via ibv_attach_mcast and ibv_detach_mcast calls.
>>
>> Calling ibv_attach_mcast/ibv_detach_mcast has an effect of
>> enabling/disabling L2 MAC address filters in HW.
>>
>> Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>>
>>
>>
>>
* RE: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
From: Walukiewicz, Miroslaw @ 2010-05-04 18:09 UTC (permalink / raw)
To: Steve Wise
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <4BE05713.6030101-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Hello Steve,
Our HW QP is not a UD type QP but an L2 raw QP. The verbs API assumes that the user provides only a data payload for TX and similarly receives only a payload. The protocol headers (in the UD case, MAC/IP/UDP) are attached by HW.
Our QP implementation in HW cannot attach headers for UD traffic, so for multicast acceleration we chose the L2 raw path. It adds some overhead for the user application, but it is still a zero-copy approach.
I thought about simulating the UD path on top of the L2 raw QP to get the same result as a true UD QP (the user handles only a payload). Such an approach costs an additional copy of the payload in SW, because the headers and then the payload must be placed in a single TX buffer. The situation is similar for RX: the payload must either be copied to the posted buffers or delivered at some offset.
ud_post_send and friends implement the transmit path for IMA. Our RAW ETH QP needs access to physical addresses from user space. For security reasons, the virtual-to-physical address translation should be done in the kernel.
Unfortunately, the OFED ibv_post_send path that goes through the kernel is quite slow because of a number of dynamic memory allocations along the way. We chose to create our own private post_send channel, using ud_post_send and friends, to increase TX bandwidth.
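
To make the cost of the simulated UD path concrete, here is a minimal sketch of the TX side. It is not taken from the nes driver: the `demo_` names are hypothetical, and the 42-byte header length assumes plain Ethernet (14) + IPv4 without options (20) + UDP (8) with no VLAN tag.

```c
#include <stddef.h>
#include <string.h>

/* Assumed header size: 14 (Ethernet) + 20 (IPv4, no options) + 8 (UDP). */
#define DEMO_HDR_LEN 42

/*
 * Build one frame in a single TX buffer: headers first, then the payload.
 * The second memcpy() is the extra payload copy that a true UD QP, or a
 * raw QP where the application builds the frame itself, would avoid.
 * Returns the frame length, or 0 if the buffer is too small.
 */
static size_t demo_build_frame(unsigned char *tx, size_t tx_len,
			       const unsigned char *hdr,
			       const unsigned char *payload, size_t payload_len)
{
	if (tx_len < DEMO_HDR_LEN + payload_len)
		return 0;
	memcpy(tx, hdr, DEMO_HDR_LEN);
	memcpy(tx + DEMO_HDR_LEN, payload, payload_len);
	return DEMO_HDR_LEN + payload_len;
}
```

The RX side has the mirror-image problem: the payload arrives behind the headers, so it has to be copied out or handed to the user at an offset.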
Regards,
Mirek
-----Original Message-----
From: Steve Wise [mailto:swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org]
Sent: Tuesday, May 04, 2010 7:19 PM
To: Walukiewicz, Miroslaw
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
Hey Mirek,
It looks like this patch adds a new file interface for a UD service.
Why didn't you extend the existing UD interface as needed?
What IO is supported with these changes? IMA via the raw QP, but what
are ud_post_send() and friends used for?
Steve.
miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> This patch implements iWarp multicast acceleration (IMA)
> over IB_QPT_RAW_ETY QP type in nes driver.
>
> Application creates a raw eth QP (IBV_QPT_RAW_ETH in user-space) and
> manages the multicast via ibv_attach_mcast and ibv_detach_mcast calls.
>
> Calling ibv_attach_mcast/ibv_detach_mcast has an effect of
> enabling/disabling L2 MAC address filters in HW.
>
> Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
>
>
* Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
From: Steve Wise @ 2010-05-04 17:19 UTC (permalink / raw)
To: miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100430165434.1386.80375.stgit-dAdtdUp2yJRU7keBU/FxOFDQ4js95KgL@public.gmane.org>
Hey Mirek,
It looks like this patch adds a new file interface for a UD service.
Why didn't you extend the existing UD interface as needed?
What IO is supported with these changes? IMA via the raw QP, but what
are ud_post_send() and friends used for?
Steve.
miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> This patch implements iWarp multicast acceleration (IMA)
> over IB_QPT_RAW_ETY QP type in nes driver.
>
> Application creates a raw eth QP (IBV_QPT_RAW_ETH in user-space) and
> manages the multicast via ibv_attach_mcast and ibv_detach_mcast calls.
>
> Calling ibv_attach_mcast/ibv_detach_mcast has an effect of
> enabling/disabling L2 MAC address filters in HW.
>
> Signed-off-by: Mirek Walukiewicz <miroslaw.walukiewicz-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>
>
>
> ---
>
> drivers/infiniband/hw/nes/Makefile | 2
> drivers/infiniband/hw/nes/nes.c | 4
> drivers/infiniband/hw/nes/nes.h | 2
> drivers/infiniband/hw/nes/nes_nic.c | 11
> drivers/infiniband/hw/nes/nes_ud.c | 2070 +++++++++++++++++++++++++++++++++
> drivers/infiniband/hw/nes/nes_ud.h | 86 +
> drivers/infiniband/hw/nes/nes_verbs.c | 221 +++-
> drivers/infiniband/hw/nes/nes_verbs.h | 7
> 8 files changed, 2388 insertions(+), 15 deletions(-)
> create mode 100644 drivers/infiniband/hw/nes/nes_ud.c
> create mode 100644 drivers/infiniband/hw/nes/nes_ud.h
>
>
> diff --git a/drivers/infiniband/hw/nes/Makefile b/drivers/infiniband/hw/nes/Makefile
> index 3514851..850df8e 100644
> --- a/drivers/infiniband/hw/nes/Makefile
> +++ b/drivers/infiniband/hw/nes/Makefile
> @@ -1,3 +1,3 @@
> obj-$(CONFIG_INFINIBAND_NES) += iw_nes.o
>
> -iw_nes-objs := nes.o nes_hw.o nes_nic.o nes_utils.o nes_verbs.o nes_cm.o
> +iw_nes-objs := nes.o nes_hw.o nes_nic.o nes_utils.o nes_verbs.o nes_cm.o nes_ud.o
> diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
> index de7b9d7..e430804 100644
> --- a/drivers/infiniband/hw/nes/nes.c
> +++ b/drivers/infiniband/hw/nes/nes.c
> @@ -60,6 +60,8 @@
> #include <linux/route.h>
> #include <net/ip_fib.h>
>
> +#include "nes_ud.h"
> +
> MODULE_AUTHOR("NetEffect");
> MODULE_DESCRIPTION("NetEffect RNIC Low-level iWARP Driver");
> MODULE_LICENSE("Dual BSD/GPL");
> @@ -1205,6 +1207,7 @@ static int __init nes_init_module(void)
> if (retval1 < 0)
> printk(KERN_ERR PFX "Unable to create NetEffect sys files.\n");
> }
> + nes_ud_init();
> return retval;
> }
>
> @@ -1214,6 +1217,7 @@ static int __init nes_init_module(void)
> */
> static void __exit nes_exit_module(void)
> {
> + nes_ud_exit();
> nes_cm_stop();
> nes_remove_driver_sysfs(&nes_pci_driver);
>
> diff --git a/drivers/infiniband/hw/nes/nes.h b/drivers/infiniband/hw/nes/nes.h
> index cc78fee..faf420f 100644
> --- a/drivers/infiniband/hw/nes/nes.h
> +++ b/drivers/infiniband/hw/nes/nes.h
> @@ -102,6 +102,7 @@
> #define NES_DRV_OPT_NO_INLINE_DATA 0x00000080
> #define NES_DRV_OPT_DISABLE_INT_MOD 0x00000100
> #define NES_DRV_OPT_DISABLE_VIRT_WQ 0x00000200
> +#define NES_DRV_OPT_MCAST_LOGPORT_MAP 0x00000800
>
> #define NES_AEQ_EVENT_TIMEOUT 2500
> #define NES_DISCONNECT_EVENT_TIMEOUT 2000
> @@ -128,6 +129,7 @@
> #define NES_DBG_IW_RX 0x00020000
> #define NES_DBG_IW_TX 0x00040000
> #define NES_DBG_SHUTDOWN 0x00080000
> +#define NES_DBG_UD 0x00100000
> #define NES_DBG_RSVD1 0x10000000
> #define NES_DBG_RSVD2 0x20000000
> #define NES_DBG_RSVD3 0x40000000
> diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
> index b7c813f..c7bbb83 100644
> --- a/drivers/infiniband/hw/nes/nes_nic.c
> +++ b/drivers/infiniband/hw/nes/nes_nic.c
> @@ -897,7 +897,7 @@ static void nes_netdev_set_multicast_list(struct net_device *netdev)
> ((mc_nic_index = nesvnic->mcrq_mcast_filter(nesvnic,
> get_addr(addrs, i++))) == 0));
> if (mc_nic_index < 0)
> - mc_nic_index = nesvnic->nic_index;
> + mc_nic_index = (1 << nesvnic->nic_index);
> while (nesadapter->pft_mcast_map[mc_index] < 16 &&
> nesadapter->pft_mcast_map[mc_index] !=
> nesvnic->nic_index &&
> @@ -930,7 +930,7 @@ static void nes_netdev_set_multicast_list(struct net_device *netdev)
> nes_write_indexed(nesdev,
> perfect_filter_register_address+4+(mc_index * 8),
> (u32)macaddr_high | NES_MAC_ADDR_VALID |
> - ((((u32)(1<<mc_nic_index)) << 16)));
> + ((((u32)(mc_nic_index)) << 16)));
> nesadapter->pft_mcast_map[mc_index] =
> nesvnic->nic_index;
> } else {
> @@ -1676,8 +1676,11 @@ struct net_device *nes_netdev_init(struct nes_device *nesdev,
> (nesvnic->nesdev->nesadapter->port_count == 1 &&
> nesvnic->nesdev->nesadapter->adapter_fcn_count == 2)) {
> nesvnic->qp_nic_index[0] = nesvnic->nic_index;
> - nesvnic->qp_nic_index[1] = nesvnic->nic_index
> - + 2;
> +
> + if (nes_drv_opt & NES_DRV_OPT_MCAST_LOGPORT_MAP)
> + nesvnic->qp_nic_index[1] = 0xf;
> + else
> + nesvnic->qp_nic_index[1] = nesvnic->nic_index+2;
> nesvnic->qp_nic_index[2] = 0xf;
> nesvnic->qp_nic_index[3] = 0xf;
> } else {
> diff --git a/drivers/infiniband/hw/nes/nes_ud.c b/drivers/infiniband/hw/nes/nes_ud.c
> new file mode 100644
> index 0000000..f004855
> --- /dev/null
> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> @@ -0,0 +1,2070 @@
> +/*
> + * Copyright (c) 2008 - 2010 Intel Corporation. All rights reserved.
> + * Copyright (c) 2006 - 2008 Neteffect, All rights reserved.
> + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +
> +#include <linux/version.h>
> +#include <linux/completion.h>
> +#include <linux/mutex.h>
> +#include <linux/poll.h>
> +#include <linux/idr.h>
> +#include <linux/in.h>
> +#include <linux/in6.h>
> +#include <linux/device.h>
> +#include <linux/netdevice.h>
> +#include <linux/list.h>
> +#include <linux/miscdevice.h>
> +#include <linux/device.h>
> +
> +#include <rdma/ib_umem.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "nes.h"
> +#include "nes_ud.h"
> +
> +#define NES_UD_BASE_XMIT_NIC_QPID 28
> +#define NES_UD_BASE_RECV_NIC_IDX 12
> +#define NES_UD_BASE_XMIT_NIC_IDX 8
> +#define NES_UD_MAX_NIC_CNT 8
> +#define NES_UD_CLEANUP_TIMEOUT (HZ)
> +#define NES_UD_MCAST_TBL_SZ 128
> +#define NES_UD_SINGLE_HDR_SZ 64
> +#define NES_UD_CQE_NUM NES_NIC_WQ_SIZE
> +#define NES_UD_SKSQ_WAIT_TIMEOUT 100000
> +#define NES_UD_MAX_REG_CNT 128
> +
> +#define NES_UD_MAX_ADAPTERS 4 /* number of supported interfaces for RAW ETH */
> +
> +#define NES_UD_MAX_REG_HASH_CNT 256 /* last byte of the STAG is hash key */
> +
> +/*
> + * the same multicast address can be allocated to up to 2 owners, so there
> + * can be two different mcast entries allocated for the same mcast address
> + */
> +struct nes_ud_file;
> +struct nes_ud_mcast {
> + u8 addr[3];
> + u8 in_use;
> + struct nes_ud_file *owner;
> + u8 nic_mask;
> +};
> +
> +struct nes_ud_mem_region {
> + struct list_head list;
> + dma_addr_t *addrs;
> + u64 va;
> + u64 length;
> + u32 pg_cnt;
> + u32 in_use;
> + u32 stag; /* STAG related to this structure */
> +};
> +
> +struct nic_queue_info {
> + u32 qpn;
> + u32 nic_index;
> + u32 logical_port;
> + enum nes_ud_dev_priority prio;
> + enum nes_ud_queue_type queue_type;
> + struct nes_ud_file *file;
> + struct nes_ud_file file_body;
> +};
> +
> +struct nes_ud_resources {
> + int num_logport_confed;
> + int num_allocated_nics;
> + u8 logport_2_map;
> + u8 logport_3_map;
> + u32 original_6000;
> + u32 original_60b8;
> + struct nic_queue_info nics[NES_UD_MAX_NIC_CNT];
> + struct mutex mutex;
> + struct nes_ud_mcast mcast[NES_UD_MCAST_TBL_SZ];
> + u32 adapter_no; /* the allocated adapter no */
> +
> + /* the unique ID of the NE020 adapter */
> + /*- it is allocated once per HW */
> + struct nes_adapter *pAdap;
> +};
> +
> +/* memory hash list entry */
> +struct nes_ud_hash_mem {
> + struct list_head list;
> + int read_stats;
> +};
> +
> +
> +
> +struct nes_ud_mem {
> + /* hash list of registered STAGs */
> + struct nes_ud_hash_mem mrs[NES_UD_MAX_REG_HASH_CNT];
> + struct mutex mutex;
> +};
> +
> +/* the QP in format x.y.z where x is adapter no, */
> +/* y is ud file idx in adapter, z is a qp no */
> +static struct nes_ud_mem ud_mem;
> +
> +struct nes_ud_send_wr {
> + u32 wr_cnt;
> + u32 qpn;
> + u32 flags;
> + u32 resv[1];
> + struct ib_sge sg_list[64];
> +};
> +
> +struct nes_ud_recv_wr {
> + u32 wr_cnt;
> + u32 qpn;
> + u32 resv[2];
> + struct ib_sge sg_list[64];
> +};
> +
> +static struct nes_ud_resources nes_ud_rsc[NES_UD_MAX_ADAPTERS];
> +static struct workqueue_struct *nes_ud_workqueue;
> +
> +/*
> + * locate_ud_adapter
> + *
> + * the function locates the UD adapter
> + * based on the adapter's unique ID (struct nes_adapter)
> + */
> +static inline
> +struct nes_ud_resources *locate_ud_adapter(struct nes_adapter *pAdapt)
> +{
> + int i;
> + struct nes_ud_resources *pRsc;
> +
> + for (i = 0; i < NES_UD_MAX_ADAPTERS; i++) {
> + pRsc = &nes_ud_rsc[i];
> +
> + if (pRsc->pAdap == pAdapt)
> + return pRsc;
> +
> + }
> + return NULL;
> +}
> +
> +/*
> + * allocate_ud_adapter()
> + *
> + * function allocates a new adapter
> + */
> +static inline
> +struct nes_ud_resources *allocate_ud_adapter(struct nes_adapter *pAdapt)
> +{
> + int i;
> + struct nes_ud_resources *pRsc;
> +
> + for (i = 0; i < NES_UD_MAX_ADAPTERS; i++) {
> + pRsc = &nes_ud_rsc[i];
> + if (pRsc->pAdap == NULL) {
> + pRsc->pAdap = pAdapt;
> + nes_debug(NES_DBG_UD, "new UD Adapter allocated %d"
> + " for adapter %p no =%d\n", i, pAdapt, pRsc->adapter_no);
> + return pRsc;
> + }
> + }
> + nes_debug(NES_DBG_UD, "Unable to allocate adapter\n");
> + return NULL;
> +}
> +
> +static inline
> +struct nes_ud_file *allocate_nic_queue(struct nes_vnic *nesvnic,
> + enum nes_ud_queue_type queue_type)
> +{
> + struct nes_ud_file *file = NULL;
> + int i = 0;
> + u8 select_log_port = 0xf;
> + struct nes_device *nesdev = nesvnic->nesdev;
> + int log_port_2_alloced = 0;
> + int log_port_3_alloced = 0;
> + int ret = 0;
> + struct nes_ud_resources *pRsc;
> +
> + /* first determine the adapter; */
> + /* an adapter can have at most 2 interfaces (nic_index 0 or 1) */
> + if (nesvnic->nic_index != 0 && nesvnic->nic_index != 1) {
> + nes_debug(NES_DBG_UD, "nic queue allocation failed"
> + " nesvnic->nic_index = %d\n", nesvnic->nic_index);
> + return NULL;
> + }
> +
> + /* locate device on base of nesvnic */
> + /* - when it is an unknown card a new one is allocated */
> + pRsc = locate_ud_adapter(nesdev->nesadapter);
> + if (pRsc == NULL)
> + return NULL;
> +
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active == 0)
> + continue;
> + if (pRsc->nics[i].logical_port == 2 &&
> + queue_type == pRsc->nics[i].queue_type)
> + log_port_2_alloced++;
> + if (pRsc->nics[i].logical_port == 3 &&
> + queue_type == pRsc->nics[i].queue_type)
> + log_port_3_alloced++;
> + }
> +
> + /* check dual/single card */
> + if (pRsc->logport_2_map != pRsc->logport_3_map) {
> + /* a dual port card */
> + /* allocation is NIC2, NIC2, NIC3, NIC3 */
> + /*- no RX packet replication supported */
> + if (log_port_2_alloced < 2 &&
> + pRsc->logport_2_map == nesvnic->nic_index)
> + select_log_port = 2;
> + else if (log_port_3_alloced < 2 &&
> + pRsc->logport_3_map == nesvnic->nic_index)
> + select_log_port = 3;
> + } else {
> + /* single port card */
> + /* change allocation scheme to NIC2,NIC3,NIC2,NIC3 */
> + switch (log_port_2_alloced + log_port_3_alloced) {
> + case 0: /* no QPs allocated - use NIC2 */
> + if (pRsc->logport_2_map == nesvnic->nic_index)
> + select_log_port = 2;
> +
> + break;
> + case 1: /* NIC2 or NIC3 allocated */
> + if (log_port_2_alloced > 0) {
> + /* if NIC2 allocated use NIC3 */
> + if (pRsc->logport_3_map == nesvnic->nic_index)
> + select_log_port = 3;
> +
> + } else {
> + /* when NIC3 allocated use NIC2 */
> + if (pRsc->logport_2_map == nesvnic->nic_index)
> + select_log_port = 2;
> +
> + }
> + break;
> +
> + case 2:
> + /* NIC2 and NIC3 allocated or both ports on NIC3 - use NIC2 */
> + if ((log_port_2_alloced == 1) ||
> + (log_port_3_alloced == 2)) {
> + if (pRsc->logport_2_map == nesvnic->nic_index)
> + select_log_port = 2;
> +
> + } else {
> + /* both ports allocated on NIC2 - use NIC3 */
> + if (pRsc->logport_3_map == nesvnic->nic_index)
> + select_log_port = 3;
> +
> + }
> + break;
> + case 3:
> + /* when both NIC2 allocated use NIC3 */
> + if (log_port_2_alloced == 2) {
> + if (pRsc->logport_3_map == nesvnic->nic_index)
> + select_log_port = 3;
> +
> + } else {
> + /* when both NIC3 alloced use NIC2 */
> + if (pRsc->logport_2_map == nesvnic->nic_index)
> + select_log_port = 2;
> + }
> + break;
> +
> + default:
> + break;
> + }
> + }
> + if (select_log_port == 0xf) {
> + ret = -1;
> + nes_debug(NES_DBG_UD, "%s(%d) logport allocation failed "
> + "log_port_2_alloced=%d log_port_3_alloced=%d\n",
> + __func__, __LINE__, log_port_2_alloced,
> + log_port_3_alloced);
> + goto out;
> + }
> +
> + nes_debug(NES_DBG_UD, "%s(%d) log_port_2_alloced=%d "
> + "log_port_3_alloced=%d select_log_port=%d\n",
> + __func__, __LINE__, log_port_2_alloced,
> + log_port_3_alloced, select_log_port);
> +
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active == 1)
> + continue;
> + if (pRsc->nics[i].logical_port == select_log_port &&
> + queue_type == pRsc->nics[i].queue_type) {
> +
> + /* file is preallocated during initialization */
> + file = pRsc->nics[i].file;
> + memset(file, 0, sizeof(*file));
> +
> + file->nesvnic = nesvnic;
> + file->queue_type = queue_type;
> +
> + file->prio = pRsc->nics[i].prio;
> + file->qpn = pRsc->nics[i].qpn;
> + file->nes_ud_nic_index = pRsc->nics[i].nic_index;
> + file->rsc_idx = i;
> + file->adapter_no = pRsc->adapter_no;
> + goto out;
> + }
> + }
> +
> +out:
> + return file;
> +}
> +
> +static inline int del_rsc_list(struct nes_ud_file *file)
> +{
> + int logport_2_cnt = 0;
> + int logport_3_cnt = 0;
> + struct nes_device *nesdev;
> + int i = 0;
> + struct nes_ud_resources *pRsc;
> +
> + if (file == NULL) {
> + nes_debug(NES_DBG_UD, "%s(%d) file is NULL\n",
> + __func__, __LINE__);
> + return -EFAULT;
> + }
> + if (file->nesvnic == NULL) {
> + nes_debug(NES_DBG_UD, "%s(%d) file->nesvnic is NULL\n",
> + __func__, __LINE__);
> + return -EFAULT;
> + }
> + /* dereference only after file and file->nesvnic are known to be valid */
> + nesdev = file->nesvnic->nesdev;
> + if (nesdev == NULL) {
> + nes_debug(NES_DBG_UD, "%s(%d) nesdev is NULL\n",
> + __func__, __LINE__);
> + return -EFAULT;
> + }
> +
> + /* locate device on base of nesvnic */
> + /*- when it is an unknown card a new one is allocated */
> + pRsc = locate_ud_adapter(nesdev->nesadapter);
> + if (pRsc == NULL) {
> + nes_debug(NES_DBG_UD, "%s(%d) cannot locate an allocated "
> + "adapter is NULL\n", __func__, __LINE__);
> + return -EFAULT;
> + }
> + if (--pRsc->num_allocated_nics == 0) {
> + nes_write_indexed(nesdev, 0x60b8, pRsc->original_60b8);
> + nes_write_indexed(nesdev, 0x6000, pRsc->original_6000);
> + pRsc->num_logport_confed = 0;
> + }
> + BUG_ON(pRsc->num_allocated_nics < 0);
> + BUG_ON(file->rsc_idx >= NES_UD_MAX_NIC_CNT);
> +
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active &&
> + pRsc->nics[i].logical_port == 2)
> + logport_2_cnt++;
> + if (pRsc->nics[i].file->active &&
> + pRsc->nics[i].logical_port == 3)
> + logport_3_cnt++;
> + }
> +
> + if (pRsc->num_logport_confed != 0x3 && logport_2_cnt == 0)
> + pRsc->logport_2_map = 0xf;
> +
> + if (pRsc->num_logport_confed != 0x3 && logport_3_cnt == 0)
> + pRsc->logport_3_map = 0xf;
> + return 0;
> +}
> +
> +/*
> + * the QPN now contains the number of the RAW ETH
> + * adapter and the QPN number on the adapter
> + * the adapter number is located in the higher
> + * 8 bits so QPN is stored as [adapter:qpn]
> + */
> +static inline
> +struct nes_ud_file *get_file_by_qpn(struct nes_ud_resources *pRsc, int qpn)
> +{
> + int i = 0;
> +
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active &&
> + pRsc->nics[i].qpn == (qpn & 0xff))
> + return pRsc->nics[i].file;
> +
> + }
> + return NULL;
> +}
> +
> +/* function counts all ETH RAW entities that have */
> +/* a specific type and relation to specific vnic */
> +static inline
> +int count_files_by_nic(struct nes_vnic *nesvnic,
> + enum nes_ud_queue_type queue_type)
> +{
> + int count = 0;
> + int i = 0;
> + struct nes_ud_resources *pRsc;
> +
> + pRsc = locate_ud_adapter(nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return 0;
> +
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active &&
> + pRsc->nics[i].file->nesvnic == nesvnic &&
> + pRsc->nics[i].queue_type == queue_type)
> + count++;
> + }
> + return count;
> +}
> +
> +/* function counts all RAW ETH entities that have a specific type */
> +static inline
> +int count_files(struct nes_vnic *nesvnic, enum nes_ud_queue_type queue_type)
> +{
> + int count = 0;
> + int i = 0;
> + struct nes_ud_resources *pRsc;
> +
> + pRsc = locate_ud_adapter(nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return 0;
> +
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active &&
> + pRsc->nics[i].queue_type == queue_type)
> + count++;
> + }
> + return count;
> +}
> +
> +/*
> + * the function locates the entry allocated by IGMP and modifies the
> + * PFT entry with the list of the NICs allowed to receive that multicast
> + * the NIC0/NIC1 are removed due to performance issue so tcpdump-like
> + * tools cannot receive the accelerated multicasts
> + */
> +static void mcast_fix_filter_table_single(struct nes_ud_file *file, u8 *addr)
> +{
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + int i = 0;
> + u32 macaddr_low;
> + u32 orig_low;
> + u32 macaddr_high;
> + u32 prev_high;
> +
> + for (i = 0; i < 48; i++) {
> + macaddr_low = nes_read_indexed(nesdev,
> + NES_IDX_PERFECT_FILTER_LOW + i*8);
> + orig_low = macaddr_low;
> + macaddr_high = nes_read_indexed(nesdev,
> + NES_IDX_PERFECT_FILTER_LOW + 4 + i*8);
> + if (!(macaddr_high & NES_MAC_ADDR_VALID))
> + continue;
> + if ((macaddr_high & 0xffff) != 0x0100)
> + continue;
> + if ((macaddr_low & 0xff) != addr[2])
> + continue;
> + macaddr_low >>= 8;
> + if ((macaddr_low & 0xff) != addr[1])
> + continue;
> + macaddr_low >>= 8;
> + if ((macaddr_low & 0xff) != addr[0])
> + continue;
> + macaddr_low >>= 8;
> + if ((macaddr_low & 0xff) != 0x5e)
> + continue;
> + /* hit - that means Linux or other UD set this bit earlier */
> + prev_high = macaddr_high;
> + nes_write_indexed(nesdev, NES_IDX_PERFECT_FILTER_LOW + 4 + i*8, 0);
> + macaddr_high = (macaddr_high & 0xfffcffff) |
> + ((1<<file->nes_ud_nic_index) << 16);
> +
> + nes_debug(NES_DBG_UD, "%s(%d) found addr to fix, "
> + "i=%d, macaddr_high=0x%x macaddr_low=0x%x "
> + "nic_idx=%d prev_high=0x%x\n",
> + __func__, __LINE__, i, macaddr_high, orig_low,
> + file->nes_ud_nic_index, prev_high);
> + nes_write_indexed(nesdev,
> + NES_IDX_PERFECT_FILTER_LOW + 4 + i*8, macaddr_high);
> + break;
> + }
> +}
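
The byte-by-byte comparison in the loop above is equivalent to comparing the whole low word against 0x5e followed by the three address bytes. A minimal user-space sketch of that check follows. The helper name is hypothetical, and the valid bit is passed in as a parameter because the real value of NES_MAC_ADDR_VALID is not visible in this hunk.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the patch: returns true when a PFT
 * entry (low/high register words) holds 01:00:5e:addr[0]:addr[1]:addr[2].
 * valid_bit stands in for NES_MAC_ADDR_VALID, whose value is an assumption
 * supplied by the caller.
 */
static bool pft_entry_matches(uint32_t low, uint32_t high,
                              uint32_t valid_bit, const uint8_t addr[3])
{
    /* low word layout as read by the patch: 5e | addr[0] | addr[1] | addr[2] */
    uint32_t want_low = ((uint32_t)0x5e << 24) |
                        ((uint32_t)addr[0] << 16) |
                        ((uint32_t)addr[1] << 8) |
                        (uint32_t)addr[2];

    if (!(high & valid_bit))
        return false;
    /* the low 16 bits of the high word carry the 01:00 prefix */
    if ((high & 0xffff) != 0x0100)
        return false;
    return low == want_low;
}
```

Since mcast_fix_filter_table_single() and remove_mcast_from_pft() repeat the same matching code, a shared helper along these lines might be worth considering.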
> +
> +/* this function is implemented this way because the Linux multicast API
> + uses the multicast list approach. When a new multicast address is added,
> + the whole PFT table is reinitialized by Linux and all entries must be
> + fixed by this procedure
> +*/
> +static void mcast_fix_filter_table(struct nes_ud_file *file)
> +{
> + int i;
> + struct nes_ud_resources *pRsc;
> +
> + pRsc = locate_ud_adapter(file->nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return;
> +
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if (pRsc->mcast[i].in_use != 0)
> + mcast_fix_filter_table_single(pRsc->mcast[i].owner,
> + pRsc->mcast[i].addr);
> + }
> +}
> +
> +/* function invalidates the PFT entry */
> +static void remove_mcast_from_pft(struct nes_ud_file *file, u8 *addr)
> +{
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + int i = 0;
> + u32 macaddr_low;
> + u32 orig_low;
> + u32 macaddr_high;
> + u32 prev_high;
> +
> + for (i = 0; i < 48; i++) {
> + macaddr_low = nes_read_indexed(nesdev,
> + NES_IDX_PERFECT_FILTER_LOW + i*8);
> + orig_low = macaddr_low;
> + macaddr_high = nes_read_indexed(nesdev,
> + NES_IDX_PERFECT_FILTER_LOW + 4 + i*8);
> + if (!(macaddr_high & NES_MAC_ADDR_VALID))
> + continue;
> +
> + if ((macaddr_high & 0xffff) != 0x0100)
> + continue;
> + if ((macaddr_low & 0xff) != addr[2])
> + continue;
> + macaddr_low >>= 8;
> + if ((macaddr_low & 0xff) != addr[1])
> + continue;
> + macaddr_low >>= 8;
> + if ((macaddr_low & 0xff) != addr[0])
> + continue;
> + macaddr_low >>= 8;
> + if ((macaddr_low & 0xff) != 0x5e)
> + continue;
> + /* hit - that means Linux or other UD set this bit earlier */
> + /* so remove the NIC from MAC address reception */
> + prev_high = macaddr_high;
> + macaddr_high = (macaddr_high & 0xfffcffff) &
> + ~((1<<file->nes_ud_nic_index) << 16);
> + nes_debug(NES_DBG_UD, "%s(%d) found addr to mcast remove, "
> + "i=%d, macaddr_high=0x%x macaddr_low=0x%x "
> + "nic_idx=%d prev_high=0x%x\n", __func__, __LINE__, i,
> + macaddr_high, orig_low, file->nes_ud_nic_index, prev_high);
> + nes_write_indexed(nesdev, NES_IDX_PERFECT_FILTER_LOW + 4 + i*8,
> + macaddr_high);
> + break;
> + }
> +
> +}
> +
> +/*
> +* the function returns a mask of the NICs
> +* associated with the given multicast address
> +*/
> +static int nes_ud_mcast_filter(struct nes_vnic *nesvnic, __u8 *dmi_addr)
> +{
> + int i = 0;
> + int ret = 0;
> + int mask = 0;
> + struct nes_ud_resources *pRsc;
> +
> + pRsc = locate_ud_adapter(nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return 0;
> +
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if (pRsc->mcast[i].in_use &&
> + pRsc->mcast[i].addr[0] == dmi_addr[3] &&
> + pRsc->mcast[i].addr[1] == dmi_addr[4] &&
> + pRsc->mcast[i].addr[2] == dmi_addr[5]) {
> + mask = (pRsc->mcast[i].owner->mcast_mode ==
> + NES_UD_MCAST_PFT_MODE) ?
> + pRsc->mcast[i].owner->nes_ud_nic_index : 0;
> +
> + ret = ret | (1 << mask);
> + nes_debug(NES_DBG_UD, "mcast filter, "
> + "fpr=%02X%02X%02X ret=%d\n",
> + dmi_addr[3], dmi_addr[4], dmi_addr[5], ret);
> + }
> + }
> + if (ret == 0)
> + return -1;
> + else
> + return ret;
> +
> +}
> +
> +static __u32 mqueue_key[4] = { 0x0, 0x80, 0x0, 0x0 };
> +
> +static inline __u8 nes_ud_calculate_hash(__u8 dest_addr_lsb)
> +{
> + __u8 in[8];
> + __u32 key_arr[4];
> + int i;
> + __u32 result = 0;
> + int j, k;
> + __u8 shift_in, next_shift_in;
> +
> + in[0] = 0;
> + in[1] = 0;
> + in[2] = 0;
> + in[3] = 0;
> +
> + in[4] = 0;
> +
> + in[5] = 0;
> + in[6] = 0;
> + in[7] = dest_addr_lsb;
> +
> +
> +
> + for (i = 0; i < 4; i++)
> + key_arr[3-i] = mqueue_key[i];
> +
> +
> +
> + for (i = 0; i < 8; i++) {
> + for (j = 7; j >= 0; j--) {
> + if (in[i] & (1 << j))
> + result = result ^ key_arr[0];
> +
> + shift_in = 0;
> + for (k = 3; k >= 0; k--) {
> + next_shift_in = key_arr[k] >> 31;
> + key_arr[k] = (key_arr[k] << 1) + shift_in;
> + shift_in = next_shift_in;
> + }
> + }
> + }
> + return result & 0x7f;
> +}
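
As far as I can tell, with the key { 0x0, 0x80, 0x0, 0x0 } the loop above reduces to a bit reversal of the low seven bits of dest_addr_lsb (bit 7 is ignored). The sketch below is a user-space copy of the same shift-and-XOR loop next to a plain bit reversal, so the two can be compared. All names are hypothetical, and it says nothing about other key values.

```c
#include <stdint.h>

/* Same key as mqueue_key[] in the patch; the name is hypothetical. */
static const uint32_t sketch_key[4] = { 0x0, 0x80, 0x0, 0x0 };

/* User-space copy of the shift-and-XOR loop in nes_ud_calculate_hash(). */
static uint8_t hash_sketch(uint8_t dest_addr_lsb)
{
    uint8_t in[8] = { 0, 0, 0, 0, 0, 0, 0, dest_addr_lsb };
    uint32_t key[4];
    uint32_t result = 0;
    int i, j, k;

    /* key[0] becomes the most significant word of the 128-bit key */
    for (i = 0; i < 4; i++)
        key[3 - i] = sketch_key[i];

    for (i = 0; i < 8; i++) {
        for (j = 7; j >= 0; j--) {
            uint32_t carry = 0;

            if (in[i] & (1u << j))
                result ^= key[0];

            /* shift the 128-bit key left by one bit */
            for (k = 3; k >= 0; k--) {
                uint32_t next = key[k] >> 31;

                key[k] = (key[k] << 1) | carry;
                carry = next;
            }
        }
    }
    return result & 0x7f;
}

/* Reverse the low seven bits of v; bit 7 is dropped. */
static uint8_t reverse7(uint8_t v)
{
    uint8_t r = 0;
    int b;

    for (b = 0; b < 7; b++)
        if (v & (1u << b))
            r |= (uint8_t)(1u << (6 - b));
    return r;
}
```

If the key is meant to stay fixed, a short comment stating the resulting mapping would make the indirection table updates in nes_ud_subscribe_mcast() easier to follow.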
> +
> +static inline void nes_ud_enable_mqueue(struct nes_ud_file *file)
> +{
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + int mqueue_config0;
> + int mqueue_config2;
> + int instance = file->nes_ud_nic_index & 0x1;
> +
> + mqueue_config0 = nes_read_indexed(nesdev, 0x6400);
> + mqueue_config0 |= (4 | (instance & 0x3)) << (file->nes_ud_nic_index*3);
> + nes_write_indexed(nesdev, 0x6400, mqueue_config0);
> + mqueue_config0 = nes_read_indexed(nesdev, 0x6400);
> +
> + mqueue_config2 = nes_read_indexed(nesdev, 0x6408);
> + mqueue_config2 |= (2 << (instance*2)) | (6 << (instance*3+8));
> + nes_write_indexed(nesdev, 0x6408, mqueue_config2);
> + mqueue_config2 = nes_read_indexed(nesdev, 0x6408);
> +
> + nes_write_indexed(nesdev, 0x64a0+instance*0x100, mqueue_key[0]);
> + nes_write_indexed(nesdev, 0x64a4+instance*0x100, mqueue_key[1]);
> + nes_write_indexed(nesdev, 0x64a8+instance*0x100, mqueue_key[2]);
> + nes_write_indexed(nesdev, 0x64ac+instance*0x100, mqueue_key[3]);
> +
> + nes_debug(NES_DBG_UD, "mq_config0=0x%x mq_config2=0x%x nic_idx= %d\n",
> + mqueue_config0, mqueue_config2, file->nes_ud_nic_index);
> +
> +}
> +
> +
> +
> +static inline
> +void nes_ud_redirect_from_mqueue(struct nes_ud_file *file, int num_queues)
> +{
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + int instance = file->nes_ud_nic_index & 0x1;
> + unsigned addr = 0x6420+instance*0x100;
> + unsigned value;
> + int i;
> +
> + value = (file->prio == NES_UD_DEV_PRIO_LOW || num_queues == 1) ?
> + 0x0 : 0x11111111;
> + for (i = 0; i < 16; i++)
> + nes_write_indexed(nesdev, addr+i*4, value);
> +}
> +
> +
> +static int nes_ud_create_nic(struct nes_ud_file *file)
> +{
> + struct nes_vnic *nesvnic = file->nesvnic;
> + struct nes_device *nesdev = nesvnic->nesdev;
> + struct nes_hw_nic_qp_context *nic_context;
> + struct nes_hw_cqp_wqe *cqp_wqe;
> + struct nes_cqp_request *cqp_request;
> + unsigned long flags;
> + void *vmem;
> + dma_addr_t pmem;
> + u64 u64temp;
> + int ret = 0;
> +
> + BUG_ON(file->nic_vbase != NULL);
> +
> + file->nic_mem_size = 256 +
> + (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_sq_wqe)) +
> + sizeof(struct nes_hw_nic_qp_context);
> +
> + file->nic_vbase = pci_alloc_consistent(nesdev->pcidev,
> + file->nic_mem_size,
> + &file->nic_pbase);
> + if (!file->nic_vbase) {
> + nes_debug(NES_DBG_UD, "Unable to allocate memory for NIC host "
> + "descriptor rings\n");
> + return -ENOMEM;
> + }
> +
> + memset(file->nic_vbase, 0, file->nic_mem_size);
> +
> + vmem = (void *)(((unsigned long long)file->nic_vbase + (256 - 1)) &
> + ~(unsigned long long)(256 - 1));
> + pmem = (dma_addr_t)(((unsigned long long)file->nic_pbase + (256 - 1)) &
> + ~(unsigned long long)(256 - 1));
> +
> + file->wq_vbase = vmem;
> + file->wq_pbase = pmem;
> + file->head = 0;
> + file->tail = 0;
> +
> + vmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_sq_wqe));
> + pmem += (NES_NIC_WQ_SIZE * sizeof(struct nes_hw_nic_sq_wqe));
> +
> + cqp_request = nesvnic->get_cqp_request(nesdev);
> + if (cqp_request == NULL) {
> + nes_debug(NES_DBG_QP, "Failed to get a cqp_request.\n");
> + goto fail_cqp_req_alloc;
> + }
> + cqp_request->waiting = 1;
> + cqp_wqe = &cqp_request->cqp_wqe;
> +
> + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] =
> + cpu_to_le32(NES_CQP_CREATE_QP | NES_CQP_QP_TYPE_NIC);
> + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(file->qpn);
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] =
> + cpu_to_le32((u32)((u64)(&nesdev->cqp)));
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] =
> + cpu_to_le32((u32)(((u64)(&nesdev->cqp))>>32));
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0;
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0;
> +
> +
> + nic_context = vmem;
> +
> + nic_context->context_words[NES_NIC_CTX_MISC_IDX] =
> + cpu_to_le32((u32)NES_NIC_CTX_SIZE |
> + ((u32)PCI_FUNC(nesdev->pcidev->devfn) << 12) |
> + (1 << 18));
> +
> + nic_context->context_words[NES_NIC_CTX_SQ_LOW_IDX] = 0;
> + nic_context->context_words[NES_NIC_CTX_SQ_HIGH_IDX] = 0;
> + nic_context->context_words[NES_NIC_CTX_RQ_LOW_IDX] = 0;
> + nic_context->context_words[NES_NIC_CTX_RQ_HIGH_IDX] = 0;
> +
> + u64temp = (u64)file->wq_pbase;
> + if (file->queue_type == NES_UD_SEND_QUEUE) {
> + nic_context->context_words[NES_NIC_CTX_SQ_LOW_IDX] =
> + cpu_to_le32((u32)u64temp);
> + nic_context->context_words[NES_NIC_CTX_SQ_HIGH_IDX] =
> + cpu_to_le32((u32)(u64temp >> 32));
> + } else {
> + nic_context->context_words[NES_NIC_CTX_RQ_LOW_IDX] =
> + cpu_to_le32((u32)u64temp);
> + nic_context->context_words[NES_NIC_CTX_RQ_HIGH_IDX] =
> + cpu_to_le32((u32)(u64temp >> 32));
> + }
> +
> + u64temp = (u64)pmem;
> +
> + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_LOW_IDX] =
> + cpu_to_le32((u32)u64temp);
> + cqp_wqe->wqe_words[NES_CQP_QP_WQE_CONTEXT_HIGH_IDX] =
> + cpu_to_le32((u32)(u64temp >> 32));
> +
> + atomic_set(&cqp_request->refcount, 2);
> + nesvnic->post_cqp_request(nesdev, cqp_request);
> +
> + /* Wait for CQP */
> + ret = wait_event_timeout(cqp_request->waitq,
> + (cqp_request->request_done != 0),
> + NES_EVENT_TIMEOUT);
> + if (!ret)
> + nes_debug(NES_DBG_UD, "NES_UD NIC QP%u "
> + "create timeout expired\n", file->qpn);
> +
> +
> + if (atomic_dec_and_test(&cqp_request->refcount)) {
> + if (cqp_request->dynamic) {
> + kfree(cqp_request);
> + } else {
> + spin_lock_irqsave(&nesdev->cqp.lock, flags);
> + list_add_tail(&cqp_request->list,
> + &nesdev->cqp_avail_reqs);
> + spin_unlock_irqrestore(&nesdev->cqp.lock, flags);
> + }
> + }
> + nes_debug(NES_DBG_UD, "Created NIC, qpn=%d, SQ/RQ pa=0x%p va=%p "
> + "virt_to_phys=%p\n", file->qpn,
> + (void *)file->wq_pbase, (void *)file->nic_vbase,
> + (void *)virt_to_phys(file->nic_vbase));
> + /* wait_event_timeout() returns 0 on timeout, else remaining jiffies */
> + return ret ? 0 : -ETIMEDOUT;
> +
> + fail_cqp_req_alloc:
> + pci_free_consistent(nesdev->pcidev, file->nic_mem_size, file->nic_vbase,
> + file->nic_pbase);
> + file->nic_vbase = NULL;
> + return -EFAULT;
> +}
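
The open-coded rounding of nic_vbase and nic_pbase to a 256-byte boundary is the usual power-of-two round-up. A small user-space sketch of the arithmetic is below; the names are hypothetical. In the driver itself the kernel's ALIGN() and PTR_ALIGN() macros should express the same thing without the casts through unsigned long long.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_RING_ALIGN 256u /* alignment used by the patch */

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + (align - 1)) & ~(align - 1);
}

/*
 * Bytes to allocate so that an aligned work queue plus a context still fit
 * after rounding the base address up; mirrors the "256 + wq + context"
 * sizing in nes_ud_create_nic().
 */
static size_t ring_alloc_size(size_t wq_bytes, size_t ctx_bytes)
{
    return SKETCH_RING_ALIGN + wq_bytes + ctx_bytes;
}
```

Rounding up consumes at most 255 bytes, so the extra 256 bytes in nic_mem_size are enough to cover the padding.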
> +
> +
> +static void nes_ud_destroy_nic(struct nes_ud_file *file)
> +{
> + struct nes_vnic *nesvnic = file->nesvnic;
> + struct nes_device *nesdev = nesvnic->nesdev;
> + struct nes_hw_cqp_wqe *cqp_wqe;
> + struct nes_cqp_request *cqp_request;
> + unsigned long flags;
> + int ret = 0;
> +
> + cqp_request = nesvnic->get_cqp_request(nesdev);
> + if (cqp_request == NULL) {
> + nes_debug(NES_DBG_QP, "Failed to get a cqp_request.\n");
> + return;
> + }
> + cqp_request->waiting = 1;
> + cqp_wqe = &cqp_request->cqp_wqe;
> +
> + cqp_wqe->wqe_words[NES_CQP_WQE_OPCODE_IDX] =
> + cpu_to_le32(NES_CQP_DESTROY_QP | NES_CQP_QP_TYPE_NIC);
> + cqp_wqe->wqe_words[NES_CQP_WQE_ID_IDX] = cpu_to_le32(file->qpn);
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_LOW_IDX] =
> + cpu_to_le32((u32)((u64)(&nesdev->cqp)));
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_CTX_HIGH_IDX] =
> + cpu_to_le32((u32)(((u64)(&nesdev->cqp)) >> 32));
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_LOW_IDX] = 0;
> + cqp_wqe->wqe_words[NES_CQP_WQE_COMP_SCRATCH_HIGH_IDX] = 0;
> +
> + atomic_set(&cqp_request->refcount, 2);
> + nesvnic->post_cqp_request(nesdev, cqp_request);
> +
> + /* Wait for CQP */
> + ret = wait_event_timeout(cqp_request->waitq,
> + (cqp_request->request_done != 0),
> + NES_EVENT_TIMEOUT);
> + if (!ret)
> + nes_debug(NES_DBG_UD, "NES_UD NIC QP%u "
> + "destroy timeout expired\n", file->qpn);
> +
> + if (atomic_dec_and_test(&cqp_request->refcount)) {
> + if (cqp_request->dynamic) {
> + kfree(cqp_request);
> + } else {
> + spin_lock_irqsave(&nesdev->cqp.lock, flags);
> + list_add_tail(&cqp_request->list,
> + &nesdev->cqp_avail_reqs);
> + spin_unlock_irqrestore(&nesdev->cqp.lock, flags);
> + }
> + }
> +
> + pci_free_consistent(nesdev->pcidev, file->nic_mem_size, file->nic_vbase,
> + file->nic_pbase);
> + file->nic_vbase = NULL;
> + file->qp_ptr = NULL;
> +
> + return;
> +}
> +
> +static void nes_ud_free_resources(struct nes_ud_file *file)
> +{
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + int nic_active = 0;
> + int mcast_all = 0;
> + int mcast_en = 0;
> + int wqm_config0 = 0;
> + wait_queue_head_t waitq;
> + int num_queues = 0;
> + nes_debug(NES_DBG_UD, " %s(%d) NAME=%s nes_ud_qpid=%d\n",
> + __func__, __LINE__, file->ifrn_name, file->qpn);
> +
> + if (!file->nesvnic || !file->active)
> + return;
> +
> + if (file->queue_type == NES_UD_SEND_QUEUE) {
> + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE);
> + nic_active &= ~(1 << file->nes_ud_nic_index);
> + nes_write_indexed(nesdev, NES_IDX_NIC_ACTIVE, nic_active);
> + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE);
> + } else {
> + num_queues = count_files_by_nic(file->nesvnic,
> + file->queue_type);
> +
> + if (num_queues == 1) {
> +
> + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE);
> + nic_active &= ~(1 << file->nes_ud_nic_index);
> + nes_write_indexed(nesdev, NES_IDX_NIC_ACTIVE, nic_active);
> + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE);
> +
> + mcast_all = nes_read_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL);
> + mcast_all &= ~(1 << file->nes_ud_nic_index);
> + nes_write_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL, mcast_all);
> + mcast_all = nes_read_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL);
> +
> + mcast_en = nes_read_indexed(nesdev,
> + NES_IDX_NIC_MULTICAST_ENABLE);
> + mcast_en &= ~(1 << file->nes_ud_nic_index);
> + nes_write_indexed(nesdev, NES_IDX_NIC_MULTICAST_ENABLE,
> + mcast_en);
> + mcast_en = nes_read_indexed(nesdev,
> + NES_IDX_NIC_MULTICAST_ENABLE);
> +
> + nes_debug(NES_DBG_UD, "nic_active=0x%x, mcast_en=0x%x, "
> + "mcast_all=0x%x nic_index=%d num_queues=%d\n",
> + nic_active, mcast_en, mcast_all,
> + file->nes_ud_nic_index, num_queues);
> + }
> +
> + nes_ud_redirect_from_mqueue(file, num_queues);
> + num_queues = count_files(file->nesvnic, file->queue_type);
> + if (num_queues == 1) {
> + nes_debug(NES_DBG_UD, "Last receive queue, "
> + "restoring MPP debug register\n");
> + nes_write_indexed(nesdev, 0xA00, 0x200);
> + nes_write_indexed(nesdev, 0xA40, 0x200);
> + }
> + }
> +
> +
> +
> + nes_ud_destroy_nic(file);
> +
> + if (file->queue_type == NES_UD_RECV_QUEUE) {
> + wqm_config0 = nes_read_indexed(nesdev, 0x5000);
> + wqm_config0 &= ~0x8000;
> + nes_write_indexed(nesdev, 0x5000, wqm_config0);
> +
> + init_waitqueue_head(&waitq);
> +
> + wait_event_timeout(waitq, 0, NES_UD_CLEANUP_TIMEOUT);
> +
> + nes_debug(NES_DBG_UD, "%s(%d) enabling stall_no_wqes\n",
> + __func__, __LINE__);
> + wqm_config0 = nes_read_indexed(nesdev, 0x5000);
> + wqm_config0 |= 0x8000;
> + nes_write_indexed(nesdev, 0x5000, wqm_config0);
> + }
> +
> + dev_put(file->nesvnic->netdev);
> +
> + file->active = 0;
> +
> + nes_debug(NES_DBG_UD, "%s(%d) done\n", __func__, __LINE__);
> +}
> +
> +
> +static int nes_ud_init_channel(struct nes_ud_file *file)
> +{
> + struct nes_device *nesdev = NULL;
> + int ret = 0;
> + int nic_active = 0;
> + int mcast_all = 0;
> + int mcast_en = 0;
> + int link_ag = 0;
> + int mpp4_dbg = 0;
> +
> + nesdev = file->nesvnic->nesdev;
> +
> + ret = nes_ud_create_nic(file);
> + if (ret != 0)
> + return ret;
> +
> + if (file->queue_type == NES_UD_RECV_QUEUE) {
> +
> + file->nesvnic->mcrq_mcast_filter = nes_ud_mcast_filter;
> +
> + mcast_en = nes_read_indexed(nesdev,
> + NES_IDX_NIC_MULTICAST_ENABLE);
> + mcast_en |= 1 << file->nes_ud_nic_index;
> + nes_write_indexed(nesdev, NES_IDX_NIC_MULTICAST_ENABLE,
> + mcast_en);
> + mcast_en = nes_read_indexed(nesdev,
> + NES_IDX_NIC_MULTICAST_ENABLE);
> +
> + /* the only case when we use PFT is for single port
> + two functions, which probably would be the
> + most common usage model :), but anyway */
> + if (file->mcast_mode == NES_UD_MCAST_ALL_MODE) {
> + mcast_all = nes_read_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL);
> + mcast_all |= 1 << file->nes_ud_nic_index;
> + nes_write_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL, mcast_all);
> + mcast_all = nes_read_indexed(nesdev, NES_IDX_NIC_MULTICAST_ALL);
> + }
> + if (nesdev->nesadapter->port_count <= 2) {
> + link_ag = 0x00;
> + nes_write_indexed(nesdev, 0x6038, link_ag);
> + link_ag = nes_read_indexed(nesdev, 0x6038);
> + }
> + if (nesdev->nesadapter->netdev_count <= 2)
> + nes_ud_enable_mqueue(file);
> +
> + nes_write_indexed(nesdev, 0xA00, 0x245);
> + nes_write_indexed(nesdev, 0xA40, 0x245);
> +
> + }
> + /* NES_UD_SEND_QUEUE */
> + else {
> + mpp4_dbg = nes_read_indexed(nesdev, 0xb00);
> + mpp4_dbg |= 1 << 12;
> + nes_write_indexed(nesdev, 0xb00, mpp4_dbg);
> + mpp4_dbg = nes_read_indexed(nesdev, 0xb00);
> + }
> +
> + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE);
> + nic_active |= 1 << file->nes_ud_nic_index;
> + nes_write_indexed(nesdev, NES_IDX_NIC_ACTIVE, nic_active);
> + nic_active = nes_read_indexed(nesdev, NES_IDX_NIC_ACTIVE);
> +
> + nes_debug(NES_DBG_UD, "nic_active=0x%x, mcast_en=0x%x, "
> + "mcast_all=0x%x nic_index=%d link_ag=0x%x mpp4_dbg=0x%x\n",
> + nic_active, mcast_en, mcast_all, file->nes_ud_nic_index,
> + link_ag, mpp4_dbg);
> +
> + return ret;
> +}
> +
> +static struct nes_ud_file *nes_ud_get_nxt_channel(struct nes_vnic *nesvnic,
> + enum nes_ud_queue_type queue_type)
> +{
> + struct nes_ud_file *file = NULL;
> + struct net_device *netdev = NULL;
> + struct nes_device *nesdev = NULL;
> + struct nes_ud_resources *pRsc;
> +
> + netdev = nesvnic->netdev;
> + nesdev = nesvnic->nesdev;
> +
> + pRsc = locate_ud_adapter(nesdev->nesadapter);
> + if (pRsc == NULL) {
> + pRsc = allocate_ud_adapter(nesdev->nesadapter);
> + if (pRsc == NULL)
> + return NULL;
> +
> + }
> + if (pRsc->num_logport_confed == 0) {
> + pRsc->original_60b8 = nes_read_indexed(nesdev, 0x60b8);
> + pRsc->original_6000 = nes_read_indexed(nesdev, 0x6000);
> + /* everything goes to port 0x0 */
> + if ((nesvnic->nesdev->nesadapter->port_count == 1) ||
> + (nes_drv_opt & NES_DRV_OPT_MCAST_LOGPORT_MAP)) {
> + /* single port card or dual port using single if */
> + pRsc->num_logport_confed = 0x3;
> + pRsc->logport_2_map = 0x0;
> + pRsc->logport_3_map = 0x0;
> + nes_write_indexed(nesdev, 0x60b8, 0x3);
> + nes_write_indexed(nesdev, 0x6000, 0x0);
> + } else {
> + pRsc->num_logport_confed = 0x3;
> + pRsc->logport_2_map = 0x0;
> + pRsc->logport_3_map = 0x1;
> + }
> + nes_debug(NES_DBG_UD, "%s(%d) num_logport_confed=%d "
> + "original_6000=%d logport_3_map = %d nes_drv_opt=%x\n",
> + __func__, __LINE__, pRsc->num_logport_confed,
> + pRsc->original_6000, pRsc->logport_3_map, nes_drv_opt);
> + }
> +
> + nes_debug(NES_DBG_UD, "%s(%d) logport_2_map=%d logport_3_map=%d\n",
> + __func__, __LINE__, pRsc->logport_2_map, pRsc->logport_3_map);
> +
> + file = allocate_nic_queue(nesvnic, queue_type);
> + if (file == NULL) {
> + nes_debug(NES_DBG_UD, "%s(%d) failed to allocate NIC\n",
> + __func__, __LINE__);
> + return NULL;
> + }
> +
> + file->active = 1;
> + memcpy(file->ifrn_name, netdev->name, IFNAMSIZ);
> +
> + /* for now use pft always */
> + file->mcast_mode = NES_UD_MCAST_PFT_MODE;
> +
> + nes_debug(NES_DBG_UD, " %s(%d) NAME=%s qpn=%d nes_ud_nic_index=%d "
> + "nes_ud_nic.qp_id=%d mcast_mode=%d port_count=%d "
> + "netdev_count=%d\n", __func__, __LINE__, file->ifrn_name,
> + file->qpn, file->nes_ud_nic_index, file->nesvnic->mcrq_qp_id,
> + file->mcast_mode, nesdev->nesadapter->port_count,
> + nesdev->nesadapter->netdev_count);
> +
> + file->mss = netdev->mtu-28;
> + pRsc->num_allocated_nics++;
> + BUG_ON(pRsc->num_allocated_nics > 8);
> +
> + return file;
> +
> +}
> +
> +static struct nes_ud_mem_region *nes_ud_allocate_mr(u32 npages)
> +{
> + struct nes_ud_mem_region *mr = NULL;
> +
> + mr = vmalloc(sizeof(*mr));
> + if (mr == NULL)
> + return NULL;
> +
> +
> + mr->addrs = vmalloc(npages * sizeof(dma_addr_t));
> + if (!mr->addrs) {
> + nes_debug(NES_DBG_UD, "%s(%d) Cannot allocate mr struct "
> + "for %d pages\n", __func__, __LINE__, npages);
> + vfree(mr);
> + return NULL;
> + }
> + mr->pg_cnt = npages;
> + mr->in_use = 1;
> +
> + INIT_LIST_HEAD(&mr->list);
> +
> + return mr;
> +}
> +
> +static void nes_ud_free_mr(struct nes_ud_mem_region *mr)
> +{
> + if (mr->addrs != NULL)
> + vfree(mr->addrs);
> +
> + vfree(mr);
> +}
> +
> +/* nes_ud_get_hash_entry()
> + *
> + * function returns a key for hash table
> + */
> +static inline
> +int nes_ud_get_hash_entry(u32 stag)
> +{
> + return stag & 0xff;
> +}
> +
> +
> +/* nes_ud_lookup_mr()
> + *
> + * function returns a pointer to the mr identified by the given STAG
> + */
> +static inline
> +struct nes_ud_mem_region *nes_ud_lookup_mr(u32 stag)
> +{
> + int key;
> + struct nes_ud_mem_region *mr;
> +
> + key = nes_ud_get_hash_entry(stag);
> +
> + list_for_each_entry(mr, &ud_mem.mrs[key].list, list) {
> + ud_mem.mrs[key].read_stats++;
> + if (mr->stag == stag)
> + return mr;
> +
> + }
> + return NULL;
> +}
> +
> +/* nes_ud_add_mr_hash()
> + *
> + * the function inserts the mr entry into the hash list
> + * the stag is a key
> + */
> +static inline
> +int nes_ud_add_mr_hash(struct nes_ud_mem_region *mr)
> +{
> + int key;
> +
> + /* first check if the stag is unique */
> + if (nes_ud_lookup_mr(mr->stag) != NULL) {
> + nes_debug(NES_DBG_UD, "%s(%d) double STAG error stag=%x\n",
> + __func__, __LINE__, mr->stag);
> + return -1;
> + }
> + key = nes_ud_get_hash_entry(mr->stag);
> +
> + /* structure is global so mutexes are necessary */
> + mutex_lock(&ud_mem.mutex);
> +
> + /* add mr to the list at start */
> + list_add(&mr->list, &ud_mem.mrs[key].list);
> +
> + mutex_unlock(&ud_mem.mutex);
> +
> + return 0;
> +
> +}
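
For reference, the lookup/insert pair above is a chained hash table keyed by the low eight bits of the STAG. A stripped-down user-space sketch of that structure follows. The names are hypothetical, the bucket count of 256 is an assumption (NES_UD_MAX_REG_HASH_CNT is not shown in this hunk), and locking is deliberately left out.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_BUCKETS 256 /* assumed bucket count, matches stag & 0xff */

struct mr_sketch {
    uint32_t stag;
    struct mr_sketch *next;
};

static struct mr_sketch *buckets[SKETCH_BUCKETS];

/* Same key function as nes_ud_get_hash_entry(). */
static unsigned bucket_of(uint32_t stag)
{
    return stag & 0xff;
}

/* Walk one bucket and return the entry with a matching stag, or NULL. */
static struct mr_sketch *mr_lookup(uint32_t stag)
{
    struct mr_sketch *mr;

    for (mr = buckets[bucket_of(stag)]; mr != NULL; mr = mr->next)
        if (mr->stag == stag)
            return mr;
    return NULL;
}

/* Insert at the head of the bucket; reject a duplicate stag with -1. */
static int mr_insert(struct mr_sketch *mr)
{
    unsigned b = bucket_of(mr->stag);

    if (mr_lookup(mr->stag) != NULL)
        return -1;
    mr->next = buckets[b];
    buckets[b] = mr;
    return 0;
}
```

Note that in the patch nes_ud_lookup_mr() walks the list without taking ud_mem.mutex, while add and delete do take it, so the duplicate check in nes_ud_add_mr_hash() and the lookups on the post paths may race with a concurrent delete.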
> +
> +/* nes_ud_del_mr()
> + *
> + * the function removes the entry from the hash list
> + * the stag is the key
> + */
> +static inline
> +void nes_ud_del_mr(struct nes_ud_mem_region *mr)
> +{
> + /* structure is global so mutexes are necessary */
> + mutex_lock(&ud_mem.mutex);
> +
> + list_del(&mr->list);
> +
> + /* init entry */
> + INIT_LIST_HEAD(&mr->list);
> +
> + mutex_unlock(&ud_mem.mutex);
> +}
> +
> +/* nes_ud_cleanup_mr()
> + *
> + * function deletes and frees all hash entries
> + */
> +static inline
> +void nes_ud_cleanup_mr(void)
> +{
> + struct nes_ud_mem_region *mr;
> + struct nes_ud_mem_region *next;
> + int i;
> +
> + /* structure is global so mutexes are necessary */
> + mutex_lock(&ud_mem.mutex);
> +
> + for (i = 0; i < NES_UD_MAX_REG_HASH_CNT; i++) {
> + if (list_empty(&ud_mem.mrs[i].list))
> + continue;
> +
> + list_for_each_entry_safe(mr, next, &ud_mem.mrs[i].list, list) {
> + nes_debug(NES_DBG_UD, "%s(%d) non free stag=%x\n",
> + __func__, __LINE__, mr->stag);
> + list_del_init(&mr->list);
> +
> + nes_ud_free_mr(mr);
> + }
> + }
> +
> + mutex_unlock(&ud_mem.mutex);
> +}
> +
> +u32 nes_ud_reg_mr(struct ib_umem *region, u64 length, u64 virt, u32 stag)
> +{
> + unsigned long npages =
> + PAGE_ALIGN(region->length + region->offset) >> PAGE_SHIFT;
> + struct nes_ud_mem_region *mr = nes_ud_allocate_mr(npages);
> + struct ib_umem_chunk *chunk;
> + dma_addr_t page;
> + u32 chunk_pages = 0;
> + int nmap_index;
> + int i = 0;
> + int mr_id = 0;
> + nes_debug(NES_DBG_UD, "%s(%d) mr=%p length=%d virt=%p\n",
> + __func__, __LINE__, mr, (int)length, (void *)virt);
> + if (!mr)
> + return 0;
> +
> +
> + mr->stag = stag;
> +
> + mr->va = virt;
> + mr->length = length;
> + list_for_each_entry(chunk, &region->chunk_list, list) {
> + for (nmap_index = 0; nmap_index < chunk->nmap; ++nmap_index) {
> + page = sg_dma_address(&chunk->page_list[nmap_index]);
> + chunk_pages = sg_dma_len(&chunk->page_list[nmap_index]) >> 12;
> + if (page & ~PAGE_MASK)
> + goto reg_user_mr_err;
> + if (!chunk_pages)
> + goto reg_user_mr_err;
> +
> + for (i = 0; i < chunk_pages; i++) {
> + /* check the bound before the store, not after it */
> + if (mr_id >= npages)
> + goto reg_user_mr_err;
> + mr->addrs[mr_id++] = page;
> + page += PAGE_SIZE;
> + }
> + }
> + }
> + nes_debug(NES_DBG_UD, "%s(%d) stag=0x%x mr_id=%d npages=%d\n",
> + __func__, __LINE__, stag, mr_id, (int)npages);
> + nes_ud_add_mr_hash(mr);
> + return stag;
> +
> +reg_user_mr_err:
> + if (mr)
> + nes_ud_free_mr(mr);
> +
> + return 0;
> +}
> +
> +
> +int nes_ud_dereg_mr(u32 stag)
> +{
> + struct nes_ud_mem_region *mr = NULL;
> +
> + nes_debug(NES_DBG_UD, "%s(%d) stag=0x%x\n", __func__, __LINE__, stag);
> +
> + mr = nes_ud_lookup_mr(stag);
> + if (mr != NULL) {
> + nes_ud_del_mr(mr);
> + nes_ud_free_mr(mr);
> + } else {
> + nes_debug(NES_DBG_UD, "%s(%d) unknown stag=0x%x\n",
> + __func__, __LINE__, stag);
> + }
> +
> + nes_debug(NES_DBG_UD, "%s(%d) done\n", __func__, __LINE__);
> + return 0;
> +}
> +
> +
> +int nes_ud_unsubscribe_mcast(struct nes_ud_file *file, union ib_gid *gid)
> +{
> + int ret = 0;
> + int i;
> + struct nes_ud_resources *pRsc;
> +
> + if (file->queue_type == NES_UD_SEND_QUEUE)
> + return -EFAULT;
> +
> + pRsc = locate_ud_adapter(file->nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return -EFAULT;
> +
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if (pRsc->mcast[i].in_use &&
> + pRsc->mcast[i].owner == file &&
> + pRsc->mcast[i].addr[0] == gid->raw[13] &&
> + pRsc->mcast[i].addr[1] == gid->raw[14] &&
> + pRsc->mcast[i].addr[2] == gid->raw[15]) {
> + pRsc->mcast[i].in_use = 0;
> + goto out;
> + }
> + }
> +
> + ret = -EFAULT;
> +out:
> + nes_debug(NES_DBG_UD, "%s(%d) %2.2X:%2.2X:%2.2X:%2.2X:%2.2X:%2.2X "
> + "ret=%d mcast=%d\n", __func__, __LINE__, gid->raw[10],
> + gid->raw[11], gid->raw[12], gid->raw[13], gid->raw[14],
> + gid->raw[15], ret , i);
> + return ret;
> +
> +}
> +
> +/* function returns the number of multicast entries allocated on the given adapter */
> +static int get_mcast_number_alloced(struct nes_ud_resources *pRsc)
> +{
> + int i;
> + int no = 0;
> +
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if (pRsc->mcast[i].in_use != 0)
> + no++;
> +
> + }
> + return no;
> +}
> +
> +/* function subscribes to a multicast group in the system - PFT modification */
> +int nes_ud_subscribe_mcast(struct nes_ud_file *file, union ib_gid *gid)
> +{
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + int ret = 0;
> + int i;
> + __u8 hash_idx = 0;
> + __u8 instance = file->nes_ud_nic_index & 0x1;
> + unsigned addr = 0;
> + unsigned mqueue_ind_tbl;
> + struct nes_ud_resources *pRsc;
> +
> + struct net_device *netdev = file->nesvnic->netdev;
> + struct dev_mc_list *mc_list;
> + int multicast_address_exist = 0;
> +
> +
> + if (file->queue_type == NES_UD_SEND_QUEUE)
> + return -EFAULT;
> +
> + pRsc = locate_ud_adapter(nesdev->nesadapter);
> + if (pRsc == NULL)
> + return -EFAULT;
> +
> + for (mc_list = netdev->mc_list;
> + mc_list != NULL;
> + mc_list = mc_list->next) {
> + if (mc_list != NULL) {
> + if ((mc_list->dmi_addr[3] == gid->raw[13]) &&
> + (mc_list->dmi_addr[4] == gid->raw[14]) &&
> + (mc_list->dmi_addr[5] == gid->raw[15]) &&
> + (mc_list->dmi_addr[0] == 0x01) &&
> + (mc_list->dmi_addr[1] == 0) &&
> + (mc_list->dmi_addr[2] == 0x5e)) {
> + multicast_address_exist = 1;
> + break;
> + }
> + } else {
> + break;
> + }
> + }
> +
> + if (multicast_address_exist == 0) {
> + nes_debug(NES_DBG_UD, "WARNING: multicast address not exist "
> + "on multicast list\n");
> + return -EFAULT;
> + }
> +
> + /* first check that we have not subscribed to this mcast address yet */
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if ((pRsc->mcast[i].in_use > 0) &&
> + (pRsc->mcast[i].addr[0] == gid->raw[13]) &&
> + (pRsc->mcast[i].addr[1] == gid->raw[14]) &&
> + (pRsc->mcast[i].addr[2] == gid->raw[15])) {
> + if (pRsc->mcast[i].owner == file) {
> + nes_debug(NES_DBG_UD, "WARNING - subscribing "
> + "mcast to the same nes_ud more than once\n");
> + break;
> + } else {
> + /* receiving the same multicast on different NICs is allowed when:
> + 1. two different NICs are used
> + 2. exactly one QP exists on this adapter
> + 3. the existing QP was allocated first
> + or second in the system
> + */
> + if (pRsc->mcast[i].owner->nes_ud_nic_index !=
> + file->nes_ud_nic_index) {
> + if (get_mcast_number_alloced(pRsc) == 1) {
> + if ((i == 0) || (i == 1)) {
> + /* add the mask of other nics
> + that subscribe this address */
> + break;
> + }
> + }
> + }
> + nes_debug(NES_DBG_UD, "ERROR - subscribing same mcast "
> + "to the diff nes_ud's and NIC owner_idx = %d "
> + "file_idx = %d\n",
> + pRsc->mcast[i].owner->nes_ud_nic_index,
> + file->nes_ud_nic_index);
> + ret = -EFAULT;
> + }
> + goto out;
> + }
> + }
> +
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if (!pRsc->mcast[i].in_use) {
> + pRsc->mcast[i].addr[0] = gid->raw[13];
> + pRsc->mcast[i].addr[1] = gid->raw[14];
> + pRsc->mcast[i].addr[2] = gid->raw[15];
> + pRsc->mcast[i].owner = file;
> + pRsc->mcast[i].in_use = 1;
> +
> + hash_idx =
> + nes_ud_calculate_hash(pRsc->mcast[i].addr[2]);
> +
> + addr = 0x6420 + ((hash_idx >> 3) << 2) + instance*0x100;
> + mqueue_ind_tbl = nes_read_indexed(nesdev, addr);
> + if (file->prio == NES_UD_DEV_PRIO_HIGH)
> + mqueue_ind_tbl &= ~(1 << ((hash_idx & 0x7)*4));
> + else
> + mqueue_ind_tbl |= 1 << ((hash_idx & 0x7)*4);
> +
> + nes_write_indexed(nesdev, addr, mqueue_ind_tbl);
> + mqueue_ind_tbl = nes_read_indexed(nesdev, addr);
> +
> + nes_debug(NES_DBG_UD, "%s(%d) addr=0x%x "
> + "mqueue_ind_tbl=0x%x hash=0x%x, mac=0x%x\n",
> + __func__, __LINE__, addr, mqueue_ind_tbl,
> + hash_idx, pRsc->mcast[i].addr[2]);
> + /* take care of the case when linux join_mcast is called
> + before mcast_attach: in that case our PFT is already
> + programmed with that mcast address, just with the wrong
> + NIC. We only need to find the address and fix the NIC;
> + additionally, the mask of the other NICs that
> + subscribed to the address is added */
> +
> + mcast_fix_filter_table(file);
> + goto out;
> + }
> + }
> + ret = -EFAULT;
> +
> +out:
> +
> + nes_debug(NES_DBG_UD, "%s(%d) %2.2X:%2.2X:%2.2X:%2.2X:%2.2X:%2.2X "
> + "ret=%d\n", __func__, __LINE__, gid->raw[10], gid->raw[11],
> + gid->raw[12], gid->raw[13], gid->raw[14], gid->raw[15], ret);
> +
> + return ret;
> +}
> +
> +
> +static inline
> +int nes_ud_post_recv(struct nes_ud_file *file,
> + u32 adap_no,
> + struct nes_ud_recv_wr *nes_ud_wr)
> +{
> + struct nes_hw_nic_rq_wqe *nic_rqe;
> + struct nes_hw_nic_rq_wqe *rq_vbase =
> + (struct nes_hw_nic_rq_wqe *)file->wq_vbase;
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + u16 *wqe_fragment_length = NULL;
> + u32 mr_offset;
> + u32 page_offset;
> + u32 page_id;
> + struct nes_ud_mem_region *mr = NULL;
> + int remaining_length = 0;
> + int wqe_fragment_index = 0;
> + int err = 0;
> + int i = 0;
> + struct nes_ud_resources *pRsc;
> +
> + /* check if qp is activated */
> + if (file->active == 0)
> + return -EFAULT;
> +
> + pRsc = &nes_ud_rsc[adap_no];
> +
> + /* let's assume for now that max sge count is 1 */
> + for (i = 0; i < nes_ud_wr->wr_cnt; i++) {
> + nic_rqe = &rq_vbase[file->head];
> +
> + mr = nes_ud_lookup_mr(nes_ud_wr->sg_list[i].lkey);
> + if (mr == NULL)
> + return -EFAULT;
> +
> +
> + if (mr->va > nes_ud_wr->sg_list[i].addr ||
> + (nes_ud_wr->sg_list[i].addr + nes_ud_wr->sg_list[i].length >
> + mr->va + mr->length)) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + mr_offset = nes_ud_wr->sg_list[i].addr - mr->va;
> + page_offset = nes_ud_wr->sg_list[i].addr & ~PAGE_MASK;
> + page_id = ((mr->va & ~PAGE_MASK) + mr_offset) >> PAGE_SHIFT;
> +
> + wqe_fragment_length =
> + (u16 *)&nic_rqe->wqe_words[NES_NIC_RQ_WQE_LENGTH_1_0_IDX];
> +
> + remaining_length = nes_ud_wr->sg_list[i].length;
> + wqe_fragment_index = 0;
> +
> + while (remaining_length > 0) {
> + if (wqe_fragment_index >= 4) {
> + err = -EFAULT;
> + goto out;
> + }
> +
> + set_wqe_64bit_value(nic_rqe->wqe_words,
> + NES_NIC_RQ_WQE_FRAG0_LOW_IDX + 2*wqe_fragment_index,
> + mr->addrs[page_id]+page_offset);
> +
> + wqe_fragment_length[wqe_fragment_index] =
> + cpu_to_le16(PAGE_SIZE - page_offset);
> +
> + remaining_length -= PAGE_SIZE - page_offset;
> + page_offset = 0;
> + page_id++;
> + wqe_fragment_index++;
> + }
> +
> + nes_write32(nesdev->regs+NES_WQE_ALLOC, (1 << 24) | file->qpn);
> +
> + file->head = (file->head+1) & ~NES_NIC_WQ_SIZE;
> + }
> +out:
> + return err;
> +}
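
The fragment loop in nes_ud_post_recv() above can be illustrated with a
small user-space sketch. It is not the driver code: the 4096-byte page
size and every sketch_* name are assumptions made for illustration, and
unlike the quoted loop it clamps the last fragment to the remaining
length, which is a guess at the intent rather than something the patch
states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size */
#define SKETCH_MAX_FRAGS 4      /* fragment limit checked by the quoted loop */

/*
 * Split the buffer [addr, addr + length) at page boundaries.
 * Fills lens[] with the per-fragment lengths and returns the fragment
 * count, or -1 if more than SKETCH_MAX_FRAGS fragments would be needed.
 */
static int sketch_split_fragments(uint64_t addr, uint32_t length,
                                  uint16_t lens[SKETCH_MAX_FRAGS])
{
    /* Same value as "addr & ~PAGE_MASK" in the quoted code. */
    uint32_t page_offset = (uint32_t)(addr & (SKETCH_PAGE_SIZE - 1));
    uint32_t remaining = length;
    int n = 0;

    assert(lens != NULL);

    while (remaining > 0) {
        uint32_t chunk = SKETCH_PAGE_SIZE - page_offset;

        if (n >= SKETCH_MAX_FRAGS)
            return -1;
        if (chunk > remaining)
            chunk = remaining;  /* clamp: differs from the quoted loop */
        lens[n++] = (uint16_t)chunk;
        remaining -= chunk;
        page_offset = 0;        /* later fragments start on a page boundary */
    }
    return n;
}
```

For a buffer that starts 3840 bytes into a page and is 1000 bytes long,
the sketch produces two fragments of 256 and 744 bytes. A page-aligned
buffer spanning five pages is rejected, which corresponds to the -EFAULT
path in the quoted loop.
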
> +
> +static inline
> +int nes_ud_post_send(struct nes_ud_file *file,
> + u32 adap_no,
> + struct nes_ud_send_wr *nes_ud_wr)
> +{
> + struct nes_hw_nic_sq_wqe *nic_sqe;
> + struct nes_hw_nic_sq_wqe *sq_vbase =
> + (struct nes_hw_nic_sq_wqe *)file->wq_vbase;
> + struct nes_device *nesdev = file->nesvnic->nesdev;
> + u16 *wqe_fragment_length = NULL;
> + u32 mr_offset;
> + u32 page_offset;
> + u32 page_id;
> + struct nes_ud_mem_region *mr = NULL;
> + int remaining_length = 0;
> + int wqe_fragment_index = 0;
> + int err = 0;
> + int misc_flags = NES_NIC_SQ_WQE_COMPLETION;
> + int i = 0;
> + struct nes_ud_resources *pRsc;
> +
> + /* check if qp is activated */
> + if (file->active == 0)
> + return -EFAULT;
> +
> + pRsc = &nes_ud_rsc[adap_no];
> +
> + /* disable checksum offload if IP checksum was not requested */
> + if (!(nes_ud_wr->flags & IB_SEND_IP_CSUM))
> + misc_flags |= NES_NIC_SQ_WQE_DISABLE_CHKSUM;
> +
> + /* let's assume for now that max sge count is 1 */
> + for (i = 0; i < nes_ud_wr->wr_cnt; i++) {
> + nic_sqe = &sq_vbase[file->head];
> +
> + mr = nes_ud_lookup_mr(nes_ud_wr->sg_list[i].lkey);
> + if (mr == NULL)
> + return -EFAULT;
> +
> +
> + if ((mr->va > nes_ud_wr->sg_list[i].addr) ||
> + (nes_ud_wr->sg_list[i].addr+nes_ud_wr->sg_list[i].length >
> + mr->va + mr->length)) {
> +
> + err = -EFAULT;
> + goto out;
> + }
> +
> + mr_offset = nes_ud_wr->sg_list[i].addr - mr->va;
> + page_offset = nes_ud_wr->sg_list[i].addr & ~PAGE_MASK;
> + page_id = ((mr->va & ~PAGE_MASK) + mr_offset) >> PAGE_SHIFT;
> +
> + wqe_fragment_length =
> + (u16 *)&nic_sqe->wqe_words[NES_NIC_SQ_WQE_LENGTH_0_TAG_IDX];
> +
> + wqe_fragment_length++; /* skip vlan tag */
> + remaining_length = nes_ud_wr->sg_list[i].length;
> + wqe_fragment_index = 0;
> +
> + while (remaining_length > 0) {
> + if (wqe_fragment_index >= 4) {
> + err = -EFAULT;
> + goto out;
> + }
> + set_wqe_64bit_value(nic_sqe->wqe_words,
> + NES_NIC_SQ_WQE_FRAG0_LOW_IDX +
> + 2*wqe_fragment_index,
> + mr->addrs[page_id]+page_offset);
> + wqe_fragment_length[wqe_fragment_index] =
> + cpu_to_le16(PAGE_SIZE - page_offset);
> + remaining_length -= PAGE_SIZE - page_offset;
> + page_offset = 0;
> + page_id++;
> + wqe_fragment_index++;
> + }
> + nic_sqe->wqe_words[NES_IWARP_SQ_WQE_TOTAL_PAYLOAD_IDX] =
> + cpu_to_le32(nes_ud_wr->sg_list[i].length);
> + nic_sqe->wqe_words[NES_NIC_SQ_WQE_MISC_IDX] =
> + cpu_to_le32(misc_flags);
> +
> + nes_write32(nesdev->regs+NES_WQE_ALLOC,
> + (1 << 24) | (1 << 23) | file->qpn);
> +
> + file->head = (file->head+1) & ~NES_NIC_WQ_SIZE;
> + }
> +out:
> + return err;
> +}
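
Both post functions advance the ring index with
"(file->head+1) & ~NES_NIC_WQ_SIZE". The sketch below compares that form
with the more common "& (size - 1)" mask. The value of NES_NIC_WQ_SIZE is
not shown in the quoted text, so the 512 used here is an assumption, as
are the sketch_* names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WQ_SIZE 512u  /* assumed power-of-two ring size */

_Static_assert((SKETCH_WQ_SIZE & (SKETCH_WQ_SIZE - 1)) == 0,
               "ring size must be a power of two");

/* Form used in the quoted code: clear the size bit once it is reached. */
static uint32_t sketch_next_clear_bit(uint32_t head)
{
    return (head + 1) & ~SKETCH_WQ_SIZE;
}

/* Conventional form: mask with size - 1. */
static uint32_t sketch_next_mask(uint32_t head)
{
    return (head + 1) & (SKETCH_WQ_SIZE - 1);
}
```

The two forms agree as long as head stays in [0, size) and the size is a
power of two. They diverge once head is outside that range: for head =
1023 the first returns 1024 and the second returns 0. So the first form
relies on head never being set from anywhere else.
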
> +
> +
> +
> +static void nes_ud_mcast_cleanup_work(struct nes_ud_file *file)
> +{
> + int i = 0;
> + int num_queues = count_files_by_nic(file->nesvnic, file->queue_type);
> + struct nes_ud_resources *pRsc;
> +
> + pRsc = locate_ud_adapter(file->nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return;
> +
> +
> + nes_debug(NES_DBG_UD, "%s(%d) file->rsc_idx=%d\n",
> + __func__, __LINE__, file->rsc_idx);
> +
> + mutex_lock(&pRsc->mutex);
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++) {
> + if (pRsc->mcast[i].owner == file) {
> + nes_debug(NES_DBG_UD, "%s(%d) mcast cleared idx=%d "
> + "%2.2X:%2.2X:%2.2X\n", __func__, __LINE__,
> + i, pRsc->mcast[i].addr[0],
> + pRsc->mcast[i].addr[1],
> + pRsc->mcast[i].addr[2]);
> +
> + pRsc->mcast[i].in_use = 0;
> + remove_mcast_from_pft(file, pRsc->mcast[i].addr);
> + }
> + }
> +
> + if (del_rsc_list(file) == 0) {
> + if (num_queues == 1)
> + file->nesvnic->mcrq_mcast_filter = NULL;
> +
> + }
> + mutex_unlock(&pRsc->mutex);
> +}
> +
> +struct nes_ud_file *nes_ud_create_wq(struct nes_vnic *nesvnic, int isrecv)
> +{
> + struct nes_ud_file *file;
> + int ret = 0;
> + file = nes_ud_get_nxt_channel(nesvnic, (isrecv) ?
> + NES_UD_RECV_QUEUE : NES_UD_SEND_QUEUE);
> + if (!file)
> + return NULL;
> +
> +
> + ret = nes_ud_init_channel(file);
> + if (ret != 0) {
> + del_rsc_list(file);
> + return NULL;
> + }
> +
> + dev_hold(file->nesvnic->netdev);
> +
> + nes_debug(NES_DBG_UD, "%s(%d) file=%p\n", __func__, __LINE__, file);
> + return file;
> +}
> +
> +
> +
> +int nes_ud_destroy_wq(struct nes_ud_file *file)
> +{
> + struct nes_ud_resources *pRsc;
> + int count = 0;
> + int i;
> + pRsc = locate_ud_adapter(file->nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return -EFAULT;
> +
> + if (file->active) {
> + nes_ud_mcast_cleanup_work(file);
> + nes_ud_free_resources(file);
> + }
> +
> + /* check if the adapter has any queues */
> + for (i = 0; i < NES_UD_MAX_NIC_CNT; i++) {
> + if (pRsc->nics[i].file->active != 0)
> + count++;
> +
> + }
> + if (count == 0) {
> + nes_debug(NES_DBG_UD, "%s(%d) adapter %d "
> + "is ready to next use\n",
> + __func__, __LINE__, pRsc->adapter_no);
> + pRsc->pAdap = NULL;
> + }
> + nes_debug(NES_DBG_UD, "%s(%d) done\n", __func__, __LINE__);
> + return 0;
> +}
> +
> +
> +struct nes_ud_sksq_file {
> + unsigned long shared_page;
> + struct nes_ud_file *nes_ud_send_file;
> + struct nes_ud_file *nes_ud_recv_file;
> +};
> +
> +static ssize_t nes_ud_sksq_write(struct file *filp, const char __user *buf,
> + size_t len, loff_t *pos)
> +{
> + struct nes_ud_sksq_file *file = filp->private_data;
> + struct nes_ud_send_wr *nes_ud_wr =
> + (struct nes_ud_send_wr *)file->shared_page;
> + u32 adap_no;
> + u32 nic_no;
> +
> + nic_no = ((nes_ud_wr->qpn >> 16) & 0x0f00) >> 8;
> + adap_no = ((nes_ud_wr->qpn >> 16) & 0xf000) >> 12;
> + if (unlikely(!file->nes_ud_send_file)) {
> + struct nes_ud_file *nes_ud_file = NULL;
> +
> + nes_ud_file = nes_ud_rsc[adap_no].nics[nic_no].file;
> + /* the nic must be active and previously activated */
> + if ((nes_ud_file->active == 0) ||
> + (nes_ud_file->qpn != ((nes_ud_wr->qpn >> 16) & 0xff)))
> + return -EAGAIN;
> +
> + file->nes_ud_send_file = nes_ud_file;
> + nes_debug(NES_DBG_UD, "send shared page addr = %p "
> + "adap_no = %d nic_no=%d qpn=%x\n",
> + nes_ud_wr, adap_no, nic_no, nes_ud_wr->qpn);
> + }
> + return nes_ud_post_send(file->nes_ud_send_file, adap_no, nes_ud_wr);
> +
> +}
> +
> +static ssize_t nes_ud_sksq_read(struct file *filp, char __user *buf,
> + size_t len, loff_t *pos)
> +{
> + struct nes_ud_sksq_file *file = filp->private_data;
> + struct nes_ud_recv_wr *nes_ud_recv_wr;
> + u32 adap_no;
> + u32 nic_no;
> +
> + nes_ud_recv_wr = (struct nes_ud_recv_wr *)(file->shared_page+2048);
> + adap_no = (nes_ud_recv_wr->qpn & 0xf000) >> 12;
> + nic_no = (nes_ud_recv_wr->qpn & 0x0f00) >> 8;
> +
> + if (unlikely(!file->nes_ud_recv_file)) {
> + struct nes_ud_file *nes_ud_file = NULL;
> +
> + nes_ud_file = nes_ud_rsc[adap_no].nics[nic_no].file;
> + /* the nic must be active and previously activated */
> + if ((nes_ud_file->active == 0) ||
> + (nes_ud_file->qpn != (nes_ud_recv_wr->qpn & 0xff)))
> + return -EAGAIN;
> +
> + file->nes_ud_recv_file = nes_ud_file;
> + nes_debug(NES_DBG_UD, "recv shared page addr = %p "
> + "adap_no = %d nic_no=%d qpn=%x\n",
> + nes_ud_recv_wr, adap_no, nic_no, nes_ud_recv_wr->qpn);
> + }
> + return nes_ud_post_recv(file->nes_ud_recv_file,
> + adap_no, nes_ud_recv_wr);
> +}
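
The two handlers above unpack the identifier that nes_create_qp() builds
further down in this patch: the receive queue in the low 16 bits, the
send queue in the high 16 bits, each half laid out as
adapter:4 / rsc_idx:4 / qpn:8. A user-space sketch of that layout follows.
The struct and function names are hypothetical, and the masks on the
packing side are an addition of the sketch (the quoted code ORs the raw
fields together).

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit half of the identifier: adapter:4 | rsc_idx:4 | qpn:8. */
struct sketch_ud_half {
    uint32_t adapter_no;
    uint32_t rsc_idx;
    uint32_t qpn;
};

static uint32_t sketch_pack_half(struct sketch_ud_half h)
{
    /* Masking is an assumption of this sketch, not part of the patch. */
    return (h.qpn & 0xffu) |
           ((h.rsc_idx & 0xfu) << 8) |
           ((h.adapter_no & 0xfu) << 12);
}

static struct sketch_ud_half sketch_unpack_half(uint32_t v)
{
    struct sketch_ud_half h;

    h.adapter_no = (v & 0xf000u) >> 12;
    h.rsc_idx    = (v & 0x0f00u) >> 8;
    h.qpn        = v & 0x00ffu;
    return h;
}

/* Receive half in bits 0..15, send half in bits 16..31. */
static uint32_t sketch_pack_qp_id(struct sketch_ud_half rx,
                                  struct sketch_ud_half tx)
{
    return sketch_pack_half(rx) | (sketch_pack_half(tx) << 16);
}
```

With adapter 1, a receive queue at rsc_idx 0 / qpn 20 and a send queue at
rsc_idx 4 / qpn 26 (values taken from the nics[] table in nes_ud_init()),
the packed identifier is 0x141a1014. The send path recovers its half with
sketch_unpack_half(id >> 16), the receive path with
sketch_unpack_half(id).
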
> +
> +static int nes_ud_sksq_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> + struct nes_ud_sksq_file *file = filp->private_data;
> +
> + nes_debug(NES_DBG_UD, "shared mem pgprot_val(prot)=0x%x pa=%p\n",
> + (unsigned int)pgprot_val(vma->vm_page_prot),
> + (void *)virt_to_phys((void *)file->shared_page));
> + if (remap_pfn_range(vma, vma->vm_start,
> + virt_to_phys((void *)file->shared_page) >> PAGE_SHIFT,
> + PAGE_SIZE, vma->vm_page_prot)) {
> + printk(KERN_ERR "remap_pfn_range failed.\n");
> + return -EAGAIN;
> + }
> + return 0;
> +}
> +
> +
> +static int nes_ud_sksq_open(struct inode *inode, struct file *filp)
> +{
> + struct nes_ud_sksq_file *file;
> +
> + file = kmalloc(sizeof *file, GFP_KERNEL);
> + if (!file)
> + return -ENOMEM;
> +
> + memset(file, 0, sizeof *file);
> + nes_debug(NES_DBG_UD, "%s(%d) file=%p\n",
> + __func__, __LINE__, file);
> +
> + filp->private_data = file;
> + file->nes_ud_send_file = NULL;
> + file->nes_ud_recv_file = NULL;
> +
> + file->shared_page = __get_free_page(GFP_USER);
> + return 0;
> +}
> +
> +static int nes_ud_sksq_close(struct inode *inode, struct file *filp)
> +{
> +
> + struct nes_ud_sksq_file *file = filp->private_data;
> +
> + if (file->shared_page) {
> + free_page(file->shared_page);
> + file->shared_page = 0;
> + }
> + kfree(file);
> + return 0;
> +}
> +
> +static const struct file_operations nes_ud_sksq_fops = {
> + .owner = THIS_MODULE,
> + .open = nes_ud_sksq_open,
> + .release = nes_ud_sksq_close,
> + .write = nes_ud_sksq_write,
> + .read = nes_ud_sksq_read,
> + .mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "nes_ud_sksq",
> + .fops = &nes_ud_sksq_fops,
> +};
> +
> +/*
> + * function replaces the CQ pointer in QP stored in the file
> + * the QP must have valid CQ pointers associated with it
> + */
> +int nes_ud_cq_replace(struct nes_vnic *nesvnic, struct nes_cq *cq)
> +{
> + u32 cq_num;
> + struct nes_ud_file *file;
> + struct nes_ud_resources *pRsc;
> +
> + BUG_ON(!cq);
> +
> + pRsc = locate_ud_adapter(nesvnic->nesdev->nesadapter);
> + if (pRsc == NULL)
> + return -EFAULT;
> +
> +
> + /* now create a QP number on base cq and adapter no */
> + cq_num = cq->hw_cq.cq_number;
> +
> + nes_debug(NES_DBG_UD, "%s(%d) cq_number=%d\n",
> + __func__, __LINE__, cq_num);
> +
> + /* the QP number should be the same as the CQ number */
> + file = get_file_by_qpn(pRsc, cq_num);
> + if (!file) {
> + nes_debug(NES_DBG_UD, "%s(%d) file not found\n",
> + __func__, __LINE__);
> + return -EFAULT;
> + }
> + if (file->qp_ptr) {
> + if (file->queue_type == NES_UD_RECV_QUEUE) {
> + nes_debug(NES_DBG_UD, "%s(%d) RECV file found "
> + "old=%p new=%p\n", __func__, __LINE__,
> + file->qp_ptr->ibqp.recv_cq, cq);
> + file->qp_ptr->ibqp.recv_cq = &cq->ibcq;
> + }
> + if (file->queue_type == NES_UD_SEND_QUEUE) {
> + nes_debug(NES_DBG_UD, "%s(%d) SEND file found "
> + "old=%p new=%p\n", __func__, __LINE__,
> + file->qp_ptr->ibqp.send_cq, cq);
> +
> + file->qp_ptr->ibqp.send_cq = &cq->ibcq;
> + }
> + }
> + return 0;
> +}
> +int nes_ud_init(void)
> +{
> + int i = 0;
> + int adap_no;
> + struct nes_ud_resources *pRsc;
> +
> + nes_debug(NES_DBG_UD, "%s(%d)\n", __func__, __LINE__);
> +
> + /* the memory registration is global for all NICS */
> + memset(&ud_mem, 0, sizeof(ud_mem));
> +
> + /* init hash list of memory entries */
> + for (i = 0; i < NES_UD_MAX_REG_HASH_CNT; i++) {
> + INIT_LIST_HEAD(&ud_mem.mrs[i].list);
> + ud_mem.mrs[i].read_stats = 0;
> + }
> + mutex_init(&ud_mem.mutex);
> +
> + /* allocate resources for each adapter */
> + for (adap_no = 0; adap_no < NES_UD_MAX_ADAPTERS; adap_no++) {
> + pRsc = &nes_ud_rsc[adap_no];
> +
> + memset(pRsc, 0, sizeof(*pRsc));
> +
> + mutex_init(&pRsc->mutex);
> +
> + pRsc->adapter_no = adap_no;
> + pRsc->pAdap = NULL;
> +
> + pRsc->num_logport_confed = 0;
> + pRsc->num_allocated_nics = 0;
> + pRsc->logport_2_map = 0xf;
> + pRsc->logport_3_map = 0xf;
> + for (i = 0; i < NES_UD_MCAST_TBL_SZ; i++)
> + pRsc->mcast[i].in_use = 0;
> +
> + pRsc->nics[0].qpn = 20;
> + pRsc->nics[0].nic_index = 2;
> + pRsc->nics[0].logical_port = 2;
> + pRsc->nics[0].prio = NES_UD_DEV_PRIO_HIGH;
> + pRsc->nics[0].queue_type = NES_UD_RECV_QUEUE;
> + pRsc->nics[0].file = &pRsc->nics[0].file_body;
> +
> + pRsc->nics[1].qpn = 22;
> + pRsc->nics[1].nic_index = 3;
> + pRsc->nics[1].logical_port = 3;
> + pRsc->nics[1].prio = NES_UD_DEV_PRIO_HIGH;
> + pRsc->nics[1].queue_type = NES_UD_RECV_QUEUE;
> + pRsc->nics[1].file = &pRsc->nics[1].file_body;
> +
> + pRsc->nics[2].qpn = 21;
> + pRsc->nics[2].nic_index = 2;
> + pRsc->nics[2].logical_port = 2;
> + pRsc->nics[2].prio = NES_UD_DEV_PRIO_LOW;
> + pRsc->nics[2].queue_type = NES_UD_RECV_QUEUE;
> + pRsc->nics[2].file = &pRsc->nics[2].file_body;
> +
> + pRsc->nics[3].qpn = 23;
> + pRsc->nics[3].nic_index = 3;
> + pRsc->nics[3].logical_port = 3;
> + pRsc->nics[3].prio = NES_UD_DEV_PRIO_LOW;
> + pRsc->nics[3].queue_type = NES_UD_RECV_QUEUE;
> + pRsc->nics[3].file = &pRsc->nics[3].file_body;
> +
> + pRsc->nics[4].qpn = 26;
> + pRsc->nics[4].nic_index = 6;
> + pRsc->nics[4].logical_port = 2;
> + pRsc->nics[4].prio = NES_UD_DEV_PRIO_HIGH;
> + pRsc->nics[4].queue_type = NES_UD_SEND_QUEUE;
> + pRsc->nics[4].file = &pRsc->nics[4].file_body;
> +
> + pRsc->nics[5].qpn = 27;
> + pRsc->nics[5].nic_index = 7;
> + pRsc->nics[5].logical_port = 3;
> + pRsc->nics[5].prio = NES_UD_DEV_PRIO_HIGH;
> + pRsc->nics[5].queue_type = NES_UD_SEND_QUEUE;
> + pRsc->nics[5].file = &pRsc->nics[5].file_body;
> +
> + pRsc->nics[6].qpn = 30;
> + pRsc->nics[6].nic_index = 10;
> + pRsc->nics[6].logical_port = 2;
> + pRsc->nics[6].prio = NES_UD_DEV_PRIO_LOW;
> + pRsc->nics[6].queue_type = NES_UD_SEND_QUEUE;
> + pRsc->nics[6].file = &pRsc->nics[6].file_body;
> +
> + pRsc->nics[7].qpn = 31;
> + pRsc->nics[7].nic_index = 11;
> + pRsc->nics[7].logical_port = 3;
> + pRsc->nics[7].prio = NES_UD_DEV_PRIO_LOW;
> + pRsc->nics[7].queue_type = NES_UD_SEND_QUEUE;
> + pRsc->nics[7].file = &pRsc->nics[7].file_body;
> +
> + }
> + nes_ud_workqueue = create_singlethread_workqueue("nes_ud");
> +
> + return misc_register(&nes_ud_sksq_misc);
> +}
> +
> +
> +int nes_ud_exit(void)
> +{
> + /* clean memory hash list */
> + nes_ud_cleanup_mr();
> + misc_deregister(&nes_ud_sksq_misc);
> + return 0;
> +}
> +
> diff --git a/drivers/infiniband/hw/nes/nes_ud.h b/drivers/infiniband/hw/nes/nes_ud.h
> new file mode 100644
> index 0000000..5a03b33
> --- /dev/null
> +++ b/drivers/infiniband/hw/nes/nes_ud.h
> @@ -0,0 +1,86 @@
> +/*
> + * Copyright (c) 2008 - 2010 Intel Corporation. All rights reserved.
> + * Copyright (c) 2006 - 2008 Neteffect, All rights reserved.
> + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
> + *
> + * This software is available to you under a choice of one of two
> + * licenses. You may choose to be licensed under the terms of the GNU
> + * General Public License (GPL) Version 2, available from the file
> + * COPYING in the main directory of this source tree, or the
> + * OpenIB.org BSD license below:
> + *
> + * Redistribution and use in source and binary forms, with or
> + * without modification, are permitted provided that the following
> + * conditions are met:
> + *
> + * - Redistributions of source code must retain the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer.
> + *
> + * - Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following
> + * disclaimer in the documentation and/or other materials
> + * provided with the distribution.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + */
> +#ifndef __NES_UD_H
> +#define __NES_UD_H
> +
> +enum nes_ud_dev_priority {
> + NES_UD_DEV_PRIO_HIGH,
> + NES_UD_DEV_PRIO_LOW,
> +};
> +
> +enum nes_ud_queue_type {
> + NES_UD_RECV_QUEUE,
> + NES_UD_SEND_QUEUE,
> +};
> +
> +enum nes_ud_mcast_mode {
> + NES_UD_MCAST_ALL_MODE,
> + NES_UD_MCAST_PFT_MODE,
> +};
> +
> +
> +struct nes_ud_file {
> + struct nes_vnic *nesvnic;
> + u8 active;
> + char ifrn_name[IFNAMSIZ];
> + int nes_ud_nic_index;
> + int qpn;
> + enum nes_ud_dev_priority prio;
> + enum nes_ud_mcast_mode mcast_mode;
> + enum nes_ud_queue_type queue_type;
> + void *nic_vbase;
> + dma_addr_t nic_pbase;
> + int nic_mem_size;
> + void *wq_vbase;
> + dma_addr_t wq_pbase;
> + int mss;
> + struct delayed_work mcast_cleanup_work;
> + int head;
> + int tail;
> + u32 rsc_idx;
> + struct nes_qp *qp_ptr; /* it is association used for CQ replacement */
> + u32 adapter_no; /* association to allocated adapter */
> +};
> +
> +int nes_ud_init(void);
> +int nes_ud_exit(void);
> +struct nes_ud_file *nes_ud_create_wq(struct nes_vnic *nesvnic, int isrecv);
> +int nes_ud_destroy_wq(struct nes_ud_file *file);
> +u32 nes_ud_reg_mr(struct ib_umem *region, u64 length, u64 virt, u32 stag);
> +int nes_ud_dereg_mr(u32 stag);
> +int nes_ud_subscribe_mcast(struct nes_ud_file *file, union ib_gid *gid);
> +int nes_ud_unsubscribe_mcast(struct nes_ud_file *file, union ib_gid *gid);
> +int nes_ud_cq_replace(struct nes_vnic *nesvnic, struct nes_cq *cq);
> +
> +#endif
> diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
> index e54f312..ff39235 100644
> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c
> @@ -46,6 +46,8 @@
>
> #include <rdma/ib_umem.h>
>
> +#include "nes_ud.h"
> +
> atomic_t mod_qp_timouts;
> atomic_t qps_created;
> atomic_t sw_qps_destroyed;
> @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
> if (init_attr->create_flags)
> return ERR_PTR(-EINVAL);
>
> - atomic_inc(&qps_created);
> switch (init_attr->qp_type) {
> case IB_QPT_RC:
> if (nes_drv_opt & NES_DRV_OPT_NO_INLINE_DATA) {
> @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
> nesqp->hwqp.qp_id, nesqp, (u32)sizeof(*nesqp));
> spin_lock_init(&nesqp->lock);
> nes_add_ref(&nesqp->ibqp);
> + /* moved here to be sure that QP is really created */
> + /* (until now it counted the number of QP creation attempts) */
> + atomic_inc(&qps_created);
> break;
> - default:
> - nes_debug(NES_DBG_QP, "Invalid QP type: %d\n", init_attr->qp_type);
> - return ERR_PTR(-EINVAL);
> +
> + case IB_QPT_RAW_ETY:
> + if (!ibpd->uobject)
> + return ERR_PTR(-EINVAL);
> +
> + /* we are about to destroy those cqs w/o destroying qp
> + now free memory for nespbl that is not used
> + first map nespbl with the qp created */
> + if (ibpd->uobject->context) {
> + nes_ucontext = to_nesucontext(ibpd->uobject->context);
> + if (udata) {
> + if (ib_copy_from_udata(&req,
> + udata,
> + sizeof(struct nes_create_qp_req))) {
> + return ERR_PTR(-EFAULT);
> + }
> + if (req.user_wqe_buffers) {
> + err = 1;
> + list_for_each_entry(nespbl,
> + &nes_ucontext->qp_reg_mem_list,
> + list) {
> + if (nespbl->user_base ==
> + req.user_wqe_buffers) {
> + list_del(&nespbl->list);
> + err = 0;
> + /* done with memory allocated
> + during nes_reg_user_mr() */
> + pci_free_consistent(
> + nesdev->pcidev,
> + nespbl->pbl_size,
> + nespbl->pbl_vbase,
> + nespbl->pbl_pbase);
> + kfree(nespbl);
> + break;
> + }
> + }
> + }
> + }
> + }
> + /* Need 512 (actually now 1024) byte alignment on this structure */
> + mem = kzalloc(sizeof(*nesqp)+NES_SW_CONTEXT_ALIGN-1, GFP_KERNEL);
> + if (!mem) {
> + nes_debug(NES_DBG_UD, "Unable to allocate QP\n");
> + return ERR_PTR(-ENOMEM);
> + }
> + u64nesqp = (unsigned long)mem;
> + u64nesqp += ((u64)NES_SW_CONTEXT_ALIGN) - 1;
> + u64temp = ((u64)NES_SW_CONTEXT_ALIGN) - 1;
> + u64nesqp &= ~u64temp;
> + nesqp = (struct nes_qp *)(unsigned long)u64nesqp;
> + nesqp->allocated_buffer = mem;
> +
> + nesqp->rx_ud_wq = nes_ud_create_wq(nesvnic, 1);
> + nesqp->tx_ud_wq = nes_ud_create_wq(nesvnic, 0);
> + if ((!nesqp->rx_ud_wq) || (!nesqp->tx_ud_wq)) {
> + kfree(nesqp->allocated_buffer);
> + return ERR_PTR(-EFAULT);
> + }
> +
> + /* create association between qp and tx/rx files
> + it is used when CQ is replaced from user space */
> + nesqp->rx_ud_wq->qp_ptr = nesqp;
> + nesqp->tx_ud_wq->qp_ptr = nesqp;
> +
> + sq_size = init_attr->cap.max_send_wr;
> + rq_size = init_attr->cap.max_recv_wr;
> + nes_debug(NES_DBG_UD, "%s(%d) sq_size=%d rq_size=%d\n",
> + __func__,
> + __LINE__, sq_size, rq_size);
> + uresp.actual_sq_size = sq_size;
> + uresp.actual_rq_size = rq_size;
> +
> + /* Init qp size due to ibv_query_qp requirements */
> + nesqp->hwqp.sq_size = sq_size;
> + nesqp->hwqp.rq_size = rq_size;
> +
> + /* enhance the response qp number with adapter number and QP number
> + on this adapter
> + user space will use this identifier when packets will be posted */
> + uresp.qp_id = nesqp->rx_ud_wq->qpn |
> + (nesqp->rx_ud_wq->adapter_no << 12) |
> + (nesqp->rx_ud_wq->rsc_idx << 8);
> + uresp.qp_id = uresp.qp_id |
> + ((nesqp->tx_ud_wq->qpn |
> + (nesqp->tx_ud_wq->adapter_no << 12) |
> + (nesqp->tx_ud_wq->rsc_idx << 8)) << 16);
> +
> + nesqp->hwqp.qp_id = uresp.qp_id;
> + nesqp->ibqp.qp_num = uresp.qp_id;
> +
> + nes_debug(NES_DBG_UD, "%s(%d) qpid=0x%x\n",
> + __func__, __LINE__, uresp.qp_id);
> + if (ib_copy_to_udata(udata, &uresp, sizeof uresp)) {
> + kfree(nesqp->allocated_buffer);
> + return ERR_PTR(-EFAULT);
> + }
> + /* the usecount is decreased because without it
> + the cq re-creation in user space will fail */
> + atomic_dec(&init_attr->send_cq->usecnt);
> + atomic_dec(&init_attr->recv_cq->usecnt);
> + nes_add_ref(&nesqp->ibqp);
> + spin_lock_init(&nesqp->lock);
> +
> + /* moved here to be sure that QP is really created
> + (until now it counted the number of QP creation attempts) */
> + atomic_inc(&qps_created);
> + return &nesqp->ibqp;
> +
> + default:
> + nes_debug(NES_DBG_QP, "Invalid QP type: %d\n",
> + init_attr->qp_type);
> + return ERR_PTR(-EINVAL);
> }
>
> nesqp->sig_all = (init_attr->sq_sig_type == IB_SIGNAL_ALL_WR);
> @@ -1462,6 +1575,8 @@ static void nes_clean_cq(struct nes_qp *nesqp, struct nes_cq *nescq)
> static int nes_destroy_qp(struct ib_qp *ibqp)
> {
> struct nes_qp *nesqp = to_nesqp(ibqp);
> + struct nes_cq *scq;
> + struct nes_cq *rcq;
> struct nes_ucontext *nes_ucontext;
> struct ib_qp_attr attr;
> struct iw_cm_id *cm_id;
> @@ -1471,6 +1586,39 @@ static int nes_destroy_qp(struct ib_qp *ibqp)
> atomic_inc(&sw_qps_destroyed);
> nesqp->destroyed = 1;
>
> + if (nesqp->ibqp.qp_type == IB_QPT_RAW_ETY) {
> + /* check the QP reference count */
> + if (atomic_read(&nesqp->refcount) == 0)
> + BUG();
> + if (atomic_dec_and_test(&nesqp->refcount)) {
> + /* destroy send and rcv QPs */
> + if (nesqp->rx_ud_wq)
> + nes_ud_destroy_wq(nesqp->rx_ud_wq);
> + nesqp->rx_ud_wq = 0;
> +
> + if (nesqp->tx_ud_wq)
> + nes_ud_destroy_wq(nesqp->tx_ud_wq);
> + nesqp->tx_ud_wq = 0;
> + atomic_inc(&qps_destroyed);
> +
> + /* to prevent the destroy of cq before QP
> + destroy the usecount is used */
> + if (ibqp->send_cq) {
> + scq = to_nescq(ibqp->send_cq);
> + atomic_inc(&ibqp->send_cq->usecnt);
> + atomic_dec(&scq->usecnt);
> + }
> + if (ibqp->recv_cq) {
> + rcq = to_nescq(ibqp->recv_cq);
> + atomic_inc(&ibqp->recv_cq->usecnt);
> + atomic_dec(&rcq->usecnt);
> + }
> + /* free memory for the qp */
> + kfree(nesqp->allocated_buffer);
> + }
> + return 0;
> + }
> +
> /* Blow away the connection if it exists. */
> if (nesqp->ibqp_state >= IB_QPS_INIT && nesqp->ibqp_state <= IB_QPS_RTS) {
> /* if (nesqp->ibqp_state == IB_QPS_RTS) { */
> @@ -1567,9 +1715,18 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
> return ERR_PTR(-ENOMEM);
> }
>
> + /* to make sure that a RAW ETH cq will not be destroyed
> + without qp destroy, the internal usecount is used;
> + the ibcq usecount cannot be used because RAW ETH
> + recreates the CQs after QP creation.
> + when this situation occurs (mcrqf != 0) the usecount is increased;
> + the ibcq usecount is cleared after successful CQ creation */
> + atomic_set(&nescq->usecnt, 0);
> +
> nescq->hw_cq.cq_size = max(entries + 1, 5);
> nescq->hw_cq.cq_number = cq_num;
> nescq->ibcq.cqe = nescq->hw_cq.cq_size - 1;
> + nescq->mcrqf = 0;
>
>
> if (context) {
> @@ -1586,8 +1743,23 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
> nescq->hw_cq.cq_number = nesvnic->nic.qp_id + 28 + 2 * ((nes_ucontext->mcrqf & 0xf) - 1);
> else if (nes_ucontext->mcrqf & 0x40000000)
> nescq->hw_cq.cq_number = nes_ucontext->mcrqf & 0xffff;
> + else if (nes_ucontext->mcrqf & 0x20000000) {
> + /* the cq number is coded
> + adapter:4/nic:4/cq_num:8 */
> + nescq->hw_cq.cq_number =
> + nes_ucontext->mcrqf & 0x00ff;
> +
> + /* to prevent the cq destroy before qp destroy
> + the internal usecount is increased
> + in this place it is the RAW ETH specific CQ
> + (after re-creation)
> + only RAW ETH type QP destroy can decrease
> + this usecounter */
> + atomic_inc(&nescq->usecnt);
> + }
> else
> nescq->hw_cq.cq_number = nesvnic->mcrq_qp_id + nes_ucontext->mcrqf-1;
> +
> nescq->mcrqf = nes_ucontext->mcrqf;
> nes_free_resource(nesadapter, nesadapter->allocated_cqs, cq_num);
> }
> @@ -1776,6 +1948,10 @@ static struct ib_cq *nes_create_cq(struct ib_device *ibdev, int entries,
> kfree(nescq);
> return ERR_PTR(-EFAULT);
> }
> + if (nes_ucontext->mcrqf & 0x20000000) {
> + /* change the cq address only for RAW in QP pointer */
> + nes_ud_cq_replace(nesvnic, nescq);
> + }
> }
>
> return &nescq->ibcq;
> @@ -1805,6 +1981,11 @@ static int nes_destroy_cq(struct ib_cq *ib_cq)
> nesdev = nesvnic->nesdev;
> nesadapter = nesdev->nesadapter;
>
> + if (atomic_read(&nescq->usecnt) != 0) {
> + nes_debug(NES_DBG_CQ, "CQ is in use now. %d\n",
> + (int) atomic_read(&nescq->usecnt));
> + return -EBUSY;
> + }
> nes_debug(NES_DBG_CQ, "Destroy CQ%u\n", nescq->hw_cq.cq_number);
>
> /* Send DestroyCQ request to CQP */
> @@ -2540,6 +2721,13 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
> nesmr->ibmr.lkey = stag;
> nesmr->mode = IWNES_MEMREG_TYPE_MEM;
> ibmr = &nesmr->ibmr;
> + /* register the memory in parallel for RAW ETH */
> + if (nes_ud_reg_mr(region, length,
> + virt, stag) == 0) {
> + ib_umem_release(region);
> + kfree(nesmr);
> + ibmr = ERR_PTR(-ENOMEM);
> + }
> } else {
> ib_umem_release(region);
> kfree(nesmr);
> @@ -2733,6 +2921,9 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
> }
> nes_free_resource(nesadapter, nesadapter->allocated_mrs,
> (ib_mr->rkey & 0x0fffff00) >> 8);
> + ret = nes_ud_dereg_mr(ib_mr->rkey);
> + if (ret != 0)
> + return ret;
>
> kfree(nesmr);
>
> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
> nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
> nesqp->iwarp_state, atomic_read(&nesqp->refcount));
>
> + if (ibqp->qp_type == IB_QPT_RAW_ETY)
> + return 0;
> +
> spin_lock_irqsave(&nesqp->lock, qplockflags);
>
> nes_debug(NES_DBG_MOD_QP, "QP%u: hw_iwarp_state=0x%X, hw_tcp_state=0x%X,"
> @@ -3208,8 +3402,10 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
> */
> static int nes_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
> {
> - nes_debug(NES_DBG_INIT, "\n");
> - return -ENOSYS;
> + int ret = -ENOSYS;
> + struct nes_qp *nesqp = to_nesqp(ibqp);
> + ret = nes_ud_subscribe_mcast(nesqp->rx_ud_wq, gid);
> + return ret;
> }
>
>
> @@ -3218,8 +3414,10 @@ static int nes_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
> */
> static int nes_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
> {
> - nes_debug(NES_DBG_INIT, "\n");
> - return -ENOSYS;
> + int ret = -ENOSYS;
> + struct nes_qp *nesqp = to_nesqp(ibqp);
> + ret = nes_ud_unsubscribe_mcast(nesqp->rx_ud_wq, gid);
> + return ret;
> }
>
>
> @@ -3846,6 +4044,7 @@ struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
> return NULL;
> }
> strlcpy(nesibdev->ibdev.name, "nes%d", IB_DEVICE_NAME_MAX);
> + strcpy(nesibdev->ibdev.name, netdev->name);
> nesibdev->ibdev.owner = THIS_MODULE;
>
> nesibdev->ibdev.node_type = RDMA_NODE_RNIC;
> @@ -3868,6 +4067,9 @@ struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
> (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
> (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
> (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
> + (1ull << IB_USER_VERBS_CMD_QUERY_QP) |
> + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) |
> + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) |
> (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
> (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
> (1ull << IB_USER_VERBS_CMD_ALLOC_MW) |
> @@ -3911,8 +4113,9 @@ struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
> nesibdev->ibdev.alloc_fast_reg_page_list = nes_alloc_fast_reg_page_list;
> nesibdev->ibdev.free_fast_reg_page_list = nes_free_fast_reg_page_list;
>
> - nesibdev->ibdev.attach_mcast = nes_multicast_attach;
> nesibdev->ibdev.detach_mcast = nes_multicast_detach;
> + nesibdev->ibdev.attach_mcast = nes_multicast_attach;
> +
> nesibdev->ibdev.process_mad = nes_process_mad;
>
> nesibdev->ibdev.req_notify_cq = nes_req_notify_cq;
> diff --git a/drivers/infiniband/hw/nes/nes_verbs.h b/drivers/infiniband/hw/nes/nes_verbs.h
> index 2df9993..cbb6585 100644
> --- a/drivers/infiniband/hw/nes/nes_verbs.h
> +++ b/drivers/infiniband/hw/nes/nes_verbs.h
> @@ -79,6 +79,7 @@ struct nes_mr {
> u16 pbls_used;
> u8 mode;
> u8 pbl_4k;
> + u32 mcrqf;
> };
>
> struct nes_hw_pb {
> @@ -116,7 +117,8 @@ struct nes_cq {
> spinlock_t lock;
> u8 virtual_cq;
> u8 pad[3];
> - u32 mcrqf;
> + atomic_t usecnt;
> + u32 mcrqf;
> };
>
> struct nes_wq {
> @@ -130,6 +132,7 @@ struct disconn_work {
>
> struct iw_cm_id;
> struct ietf_mpa_frame;
> +struct nes_ud_file;
>
> struct nes_qp {
> struct ib_qp ibqp;
> @@ -176,5 +179,7 @@ struct nes_qp {
> u8 hw_tcp_state;
> u8 term_flags;
> u8 sq_kmapped;
> + struct nes_ud_file *rx_ud_wq;
> + struct nes_ud_file *tx_ud_wq;
> };
> #endif /* NES_VERBS_H */
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: IB / MLX setup issues
From: Todd Strader @ 2010-05-04 16:34 UTC (permalink / raw)
To: Eli Cohen; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100504163116.GA10080-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
Eli Cohen wrote:
> On Mon, May 03, 2010 at 05:05:47PM -0500, Todd Strader wrote:
>
>> Hi,
>>
>> I'm running into some issues when trying to set up a new IB cluster.
>> I've got machine A which has connections to machine B and C via a
>> single two-port ConnectX card. A and B are connected and have no
>> known problems. I just set up machine C (installed OFED-1.5.1) but
>> whenever I run ibv_devinfo I get:
>>
>> libibverbs: Warning: couldn't load driver 'mlx4':
>> /usr/local/lib/libmlx4-rdmav2.so: symbol ibv_cmd_get_eth_l2_addr,
>> version IBVERBS_1.0 not defined in file libibverbs.so.1 with link
>> time reference
>> libibverbs: Warning: no userspace device-specific driver found for
>> /sys/class/infiniband_verbs/uverbs2
>> libibverbs: Warning: no userspace device-specific driver found for
>> /sys/class/infiniband_verbs/uverbs1
>> libibverbs: Warning: no userspace device-specific driver found for
>> /sys/class/infiniband_verbs/uverbs0
>> No IB devices found
>>
>> The machine does have 3 HCAs. For reference OFED 1.5.1 has
>> deposited the following RPMs on the system:
>> libmlx4-1.0-0.7.g2432360
>> libmlx4-devel-1.0-0.7.g2432360
>> libibverbs-1.1.3-0.6.g932f1a2
>> libibverbs-devel-1.1.3-0.6.g932f1a2
>> libibverbs-devel-static-1.1.3-0.6.g932f1a2
>>
>>
>
> So you only have some installation problem on machine C. By default, the OFED
> installation script installs to /usr, and I see that on C the
> prefix is /usr/local - is this what you intended? I think you may have
> old libraries in the library path, so you may need to remove them.
>
>
Ah, that's probably it. I did install my own libmlx4. I guess I
thought it went to the same place that the OFED installer put things, so
when it said it was uninstalling the previous version of OFED it would get
rid of that too. I'll clean it out more thoroughly and try again.
Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: IB / MLX setup issues
From: Eli Cohen @ 2010-05-04 16:31 UTC (permalink / raw)
To: Todd Strader; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BDF48BB.4070209-YJJp1DzrooEAvxtiuMwx3w@public.gmane.org>
On Mon, May 03, 2010 at 05:05:47PM -0500, Todd Strader wrote:
> Hi,
>
> I'm running into some issues when trying to set up a new IB cluster.
> I've got machine A which has connections to machine B and C via a
> single two-port ConnectX card. A and B are connected and have no
> known problems. I just set up machine C (installed OFED-1.5.1) but
> whenever I run ibv_devinfo I get:
>
> libibverbs: Warning: couldn't load driver 'mlx4':
> /usr/local/lib/libmlx4-rdmav2.so: symbol ibv_cmd_get_eth_l2_addr,
> version IBVERBS_1.0 not defined in file libibverbs.so.1 with link
> time reference
> libibverbs: Warning: no userspace device-specific driver found for
> /sys/class/infiniband_verbs/uverbs2
> libibverbs: Warning: no userspace device-specific driver found for
> /sys/class/infiniband_verbs/uverbs1
> libibverbs: Warning: no userspace device-specific driver found for
> /sys/class/infiniband_verbs/uverbs0
> No IB devices found
>
> The machine does have 3 HCAs. For reference OFED 1.5.1 has
> deposited the following RPMs on the system:
> libmlx4-1.0-0.7.g2432360
> libmlx4-devel-1.0-0.7.g2432360
> libibverbs-1.1.3-0.6.g932f1a2
> libibverbs-devel-1.1.3-0.6.g932f1a2
> libibverbs-devel-static-1.1.3-0.6.g932f1a2
>
So you only have an installation problem on machine C. By default, the OFED
installation script installs to /usr, and I see that on C the
prefix is /usr/local - is this what you intended? I think you may have
old libraries in the library path, so you may need to remove them.
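One way to look for stale copies is to list every file matching a library name across the directories the loader might search. The following is only a minimal sketch: the helper name find_library_copies and the directory list are assumptions for illustration, not part of OFED or libibverbs, and the real search path on a given system may differ.

```python
import os

def find_library_copies(name_prefix, search_dirs):
    """Return every file in search_dirs whose name starts with name_prefix."""
    hits = []
    for directory in search_dirs:
        # Skip directories that do not exist on this machine.
        if not os.path.isdir(directory):
            continue
        for entry in sorted(os.listdir(directory)):
            if entry.startswith(name_prefix):
                hits.append(os.path.join(directory, entry))
    return hits

if __name__ == "__main__":
    # Assumed candidate directories; adjust to the local library path.
    dirs = ["/usr/local/lib", "/usr/local/lib64", "/usr/lib", "/usr/lib64"]
    for prefix in ("libibverbs.so", "libmlx4"):
        for path in find_library_copies(prefix, dirs):
            print(path)
```

If the same library shows up under both /usr and /usr/local, the two copies may come from different builds, which would be consistent with the missing-symbol warning quoted above.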
* NFS RDMA using autofs
From: Kumar Vaibhav @ 2010-05-04 11:43 UTC (permalink / raw)
To: nfs-rdma-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,
I am trying to use NFS-RDMA on RHEL 5u4.
I can mount the exported filesystem manually on the client using
mount.rnfs 190.3.18.185:/scratch/exports /mnt -i -o rdma,port=20049
The problem is that I want to use it from autofs, so that it is
mounted on demand.
How do I write the corresponding entry in /etc/auto.master?
Regards,
Vaibhav
* RE: [ANNOUNCE] OFED 1.5.2 beta1 is available
From: Sean Hefty @ 2010-05-03 23:51 UTC (permalink / raw)
To: 'Vladimir Sokolovsky', OpenFabrics EWG; +Cc: linux-rdma
In-Reply-To: <4BDE7CB9.8020801-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
>Main changes from 1.5.1:
>===========================
>1. Updated packages:
> - Management
> Using latest daily builds from
>http://www.openfabrics.org/downloads/management/daily
> - Updated libnes
> libnes-1.0.1-0.1.g89ea0ee.tar.gz
> - Updated libsdp
> libsdp-1.1.101-0.3.gc767eee.tar.gz
> - Updated perftest
> perftest-1.2.4-0.15.g82b7e29.tar.gz
> - Updated mpitests
> mpitests-3.2-923.src.rpm
>2. Added RHEL5.5 support
>3. Use files under /etc/modprobe.d/ instead of /etc/modprobe.conf
>4. Bug fixes
I would like to add support for IB ACM path resolution (i.e. user space path
record caching) into the 1.5.2 release. Vlad pulled in the necessary kernel
patch from 2.6.33 a few days ago. All other changes are in user space -- an
updated librdmacm release and a release of ibacm -- and must be enabled to be
used.
I hope to complete testing this week.
- Sean
* RE: IB / MLX setup issues
From: Sean Hefty @ 2010-05-03 22:24 UTC (permalink / raw)
To: 'Todd Strader', linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BDF48BB.4070209-YJJp1DzrooEAvxtiuMwx3w@public.gmane.org>
>Also, and I'm unsure if this is related, the link between machines C and
>A is in the initializing state when I bring up the machine. Machine A
>has the subnet manager running, so I would think that should be enough.
>If I run another subnet manager on machine C, I get an active link, but
>I wouldn't think I need to be running two subnet managers.
Assuming directly connected links between machines A <-> C and A <-> B, then
these are separate subnets and each requires a subnet manager. The SM is bound
to a specific port when run, so one instance of the SM is not enough.
- Sean
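The point about separate subnets can be illustrated by counting connected groups of ports. This is a toy model under stated assumptions: the port names and the exact cabling (A.port1 to B, A.port2 to C) are hypothetical, and count_subnets is an illustrative helper, not part of any IB tool.

```python
def count_subnets(ports, links):
    """Count connected groups of ports; each group needs its own SM."""
    parent = {p: p for p in ports}

    def find(p):
        # Follow parent pointers to the group's root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in links:
        # A cable merges the two groups its endpoints belong to.
        parent[find(a)] = find(b)

    return len({find(p) for p in ports})

# Assumed cabling: two back-to-back links from the two ports on machine A.
ports = ["A.port1", "A.port2", "B.port1", "C.port1"]
links = [("A.port1", "B.port1"), ("A.port2", "C.port1")]
print(count_subnets(ports, links))
```

The two ports on machine A are not joined to each other in this model, so the two cables form two independent subnets. Putting all machines behind one switch would collapse them into a single subnet.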
* IB / MLX setup issues
From: Todd Strader @ 2010-05-03 22:05 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,
I'm running into some issues when trying to set up a new IB cluster.
I've got machine A which has connections to machine B and C via a single
two-port ConnectX card. A and B are connected and have no known
problems. I just set up machine C (installed OFED-1.5.1) but whenever I
run ibv_devinfo I get:
libibverbs: Warning: couldn't load driver 'mlx4':
/usr/local/lib/libmlx4-rdmav2.so: symbol ibv_cmd_get_eth_l2_addr,
version IBVERBS_1.0 not defined in file libibverbs.so.1 with link time
reference
libibverbs: Warning: no userspace device-specific driver found for
/sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for
/sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for
/sys/class/infiniband_verbs/uverbs0
No IB devices found
The machine does have 3 HCAs. For reference OFED 1.5.1 has deposited
the following RPMs on the system:
libmlx4-1.0-0.7.g2432360
libmlx4-devel-1.0-0.7.g2432360
libibverbs-1.1.3-0.6.g932f1a2
libibverbs-devel-1.1.3-0.6.g932f1a2
libibverbs-devel-static-1.1.3-0.6.g932f1a2
Also, and I'm unsure if this is related, the link between machines C and
A is in the initializing state when I bring up the machine. Machine A
has the subnet manager running, so I would think that should be enough.
If I run another subnet manager on machine C, I get an active link, but
I wouldn't think I need to be running two subnet managers.
Any advice on these issues would be appreciated.
Thanks.
Todd
* RE: Hang in ib_umad when attempting to unregister.
From: Mike Heinz @ 2010-05-03 20:24 UTC (permalink / raw)
To: Sean Hefty, Roland Dreier; +Cc: LINUX-RDMA
In-Reply-To: <BC9A3E31E4234E599F7FC99094789B2F-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
Sorry - I'm not trying to force you to drag the info out of me.
Yeah; I think it always uses GETTABLE rather than GET.
-----Original Message-----
From: Sean Hefty [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
Sent: Monday, May 03, 2010 4:17 PM
To: Mike Heinz; Roland Dreier
Cc: LINUX-RDMA
Subject: RE: Hang in ib_umad when attempting to unregister.
>In the recent hangs, the process that is triggering the hang is using the umad
>interface to query path records. Since we usually discover this problem long
>after the onset, I'm not sure if there are actual queries outstanding when the
>problem occurs.
Is it using GETTABLE to retrieve multiple paths?
* RE: Hang in ib_umad when attempting to unregister.
From: Sean Hefty @ 2010-05-03 20:17 UTC (permalink / raw)
To: 'Mike Heinz', Roland Dreier; +Cc: LINUX-RDMA
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB49A4740A29-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
>In the recent hangs, the process that is triggering the hang is using the umad
>interface to query path records. Since we usually discover this problem long
>after the onset, I'm not sure if there are actual queries outstanding when the
>problem occurs.
Is it using GETTABLE to retrieve multiple paths?
* RE: Hang in ib_umad when attempting to unregister.
From: Mike Heinz @ 2010-05-03 19:10 UTC (permalink / raw)
To: Sean Hefty, Roland Dreier; +Cc: LINUX-RDMA
In-Reply-To: <B552CE79A4134D83A6F5122C6A021FFF-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
In the recent hangs, the process that is triggering the hang is using the umad interface to query path records. Since we usually discover this problem long after the onset, I'm not sure if there are actual queries outstanding when the problem occurs.
-----Original Message-----
From: Sean Hefty [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
Sent: Monday, May 03, 2010 2:40 PM
To: Mike Heinz; Roland Dreier
Cc: LINUX-RDMA
Subject: RE: Hang in ib_umad when attempting to unregister.
>I should be more clear - there are a couple of reasons why I don't think
>Roland's patch is the cause, or a fix, for this problem. First, because when I
>dug through QLogic's bug database I found incidents like this going back to
>2007. Second, when I first began looking at this I noticed the patch and built
>a version that moved the cancel_delayed_work() calls in ib_cancel_rmpp_recvs()
>back inside the locked area and the problem still occurred.
>
>Finally, I should note that this isn't a spinlock type hang; what's happening
>is that destroy_rmpp_recv() appears to be sleeping, waiting for a completion
>that never arrives. I'm guessing that what is going on is that the reference
>count in an rmpp_recv is wrong, but what is causing the problem is unknown.
What RMPP messages were being sent/received?
* RE: Hang in ib_umad when attempting to unregister.
From: Sean Hefty @ 2010-05-03 18:40 UTC (permalink / raw)
To: 'Mike Heinz', Roland Dreier; +Cc: LINUX-RDMA
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB49A4740A1C-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
>I should be more clear - there are a couple of reasons why I don't think
>Roland's patch is the cause, or a fix, for this problem. First, because when I
>dug through QLogic's bug database I found incidents like this going back to
>2007. Second, when I first began looking at this I noticed the patch and built
>a version that moved the cancel_delayed_work() calls in ib_cancel_rmpp_recvs()
>back inside the locked area and the problem still occurred.
>
>Finally, I should note that this isn't a spinlock type hang; what's happening
>is that destroy_rmpp_recv() appears to be sleeping, waiting for a completion
>that never arrives. I'm guessing that what is going on is that the reference
>count in an rmpp_recv is wrong, but what is causing the problem is unknown.
What RMPP messages were being sent/received?
* RE: Hang in ib_umad when attempting to unregister.
From: Mike Heinz @ 2010-05-03 18:25 UTC (permalink / raw)
To: Roland Dreier, Hefty, Sean; +Cc: LINUX-RDMA
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB49A4740A0C-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
I should be more clear - there are a couple of reasons why I don't think Roland's patch is the cause, or a fix, for this problem. First, because when I dug through QLogic's bug database I found incidents like this going back to 2007. Second, when I first began looking at this I noticed the patch and built a version that moved the cancel_delayed_work() calls in ib_cancel_rmpp_recvs() back inside the locked area and the problem still occurred.
Finally, I should note that this isn't a spinlock type hang; what's happening is that destroy_rmpp_recv() appears to be sleeping, waiting for a completion that never arrives. I'm guessing that what is going on is that the reference count in an rmpp_recv is wrong, but what is causing the problem is unknown.
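The suspected failure mode (a waiter sleeping on a completion that a wrong reference count never signals) can be sketched in user space. This is an analogy only, not the kernel code: RmppRecvModel and its methods are hypothetical names, and a timeout is used so the example returns instead of hanging.

```python
import threading

class RmppRecvModel:
    """Toy object whose teardown waits until the last reference is dropped."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refcount = 1               # reference held by the owner
        self._done = threading.Event()   # plays the role of the completion

    def get(self):
        with self._lock:
            self._refcount += 1

    def put(self):
        with self._lock:
            self._refcount -= 1
            if self._refcount == 0:
                self._done.set()         # last reference gone: signal waiter

    def destroy(self, timeout):
        """Drop the owner's reference, then wait for the other holders.

        Returns True if the completion fired, False if the wait timed out.
        """
        self.put()
        return self._done.wait(timeout)

balanced = RmppRecvModel()
balanced.get()
balanced.put()
print(balanced.destroy(timeout=0.1))   # every get matched by a put

leaked = RmppRecvModel()
leaked.get()                           # this reference is never dropped
print(leaked.destroy(timeout=0.1))     # the wait cannot complete
```

With an untimed wait, the second case would sleep forever, which matches the shape of a hang that is not a spinlock deadlock.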
-----Original Message-----
From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Mike Heinz
Sent: Monday, May 03, 2010 1:07 PM
To: Hefty, Sean
Cc: LINUX-RDMA
Subject: RE: Hang in ib_umad when attempting to unregister.
Ah. Got it. Thanks.
They do seem to be related. 0e442afd92fcdde2cc63b6f25556b8934e42b7d2 seems to be directly related - but I think that fix is already in OFED 1.5:
core_0310-IB-mad-Fix-lock-lock-timer-deadlock-in-RMPP-code.patch
seems to be the same patch as 0e442afd92fcdde2cc63b6f25556b8934e42b7d2.
-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
Sent: Monday, May 03, 2010 12:40 PM
To: Mike Heinz
Subject: RE: Hang in ib_umad when attempting to unregister.
>Where did you get those commit #s? I looked in my local copy of
>
>git://git.openfabrics.org/ofed_1_5/linux-2.6
>
>and they don't seem to be valid objects for that repo. Am I pulling from the
>wrong place?
These are from the upstream kernel.
>commit 6b2eef8fd78ff909c3396b8671d57c42559cc51d
>commit 0e442afd92fcdde2cc63b6f25556b8934e42b7d2
* RE: Hang in ib_umad when attempting to unregister.
From: Mike Heinz @ 2010-05-03 17:06 UTC (permalink / raw)
To: Hefty, Sean; +Cc: LINUX-RDMA
In-Reply-To: <CF9C39F99A89134C9CF9C4CCB68B8DDF254C321EFA-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
Ah. Got it. Thanks.
They do seem to be related. 0e442afd92fcdde2cc63b6f25556b8934e42b7d2 seems to be directly related - but I think that fix is already in OFED 1.5:
core_0310-IB-mad-Fix-lock-lock-timer-deadlock-in-RMPP-code.patch
seems to be the same patch as 0e442afd92fcdde2cc63b6f25556b8934e42b7d2.
-----Original Message-----
From: Hefty, Sean [mailto:sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org]
Sent: Monday, May 03, 2010 12:40 PM
To: Mike Heinz
Subject: RE: Hang in ib_umad when attempting to unregister.
>Where did you get those commit #s? I looked in my local copy of
>
>git://git.openfabrics.org/ofed_1_5/linux-2.6
>
>and they don't seem to be valid objects for that repo. Am I pulling from the
>wrong place?
These are from the upstream kernel.
>commit 6b2eef8fd78ff909c3396b8671d57c42559cc51d
>commit 0e442afd92fcdde2cc63b6f25556b8934e42b7d2