* Re: [ewg] ibcheckerrors "Port All FAILED" reported
From: Ira Weiny @ 2010-05-06 1:09 UTC (permalink / raw)
To: Woodruff, Robert J,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: EWG, tziporet-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org
In-Reply-To: <382A478CAD40FA4FB46605CF81FE39F45685DEAD-osO9UTpF0URzLByeVOV5+bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
Interesting...
I have a switch which does this as well. Tracing through the scripts shows
that the perfquery command is failing like this.
14:29:03 > ./perfquery 40 255
./perfquery: iberror: failed: AllPortSelect not supported
It seems there is an issue with the CapabilityMask value...
14:43:32 > ./perfquery 40 255
cap_mask 0x400 <=== my debug output
./perfquery: iberror: failed: AllPortSelect not supported
14:43:38 > ./saquery CPI 40
SA ClassPortInfo:
...
Capability mask..........0x2602
...
Those don't match because... perfquery has a bug...
perfquery is issuing a PMA query when it should be issuing a SA query. It
just so happens that on some switches the result of that PMA query indicates
AllPortSelect is available. Patch to follow.
Ira
On Wed, 5 May 2010 13:47:54 -0700
"Woodruff, Robert J" <robert.j.woodruff-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>
> Hi guys,
>
> When I run ibcheckerrors on my Mellanox switch,
> it is reporting that Port all FAILED.
>
> From what I can tell, the switch is working fine and
> I think that this is a bogus error from the program.
>
> If this is indeed not a real problem, can the diagnostic
> be fixed to not report this as an error ?
>
>
> ibcheckerrors -nocolor -v -t 100
>
> # Checking Switch: nodeguid 0x0002c902004046a0
> Node check lid 7: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port all: FAILED <------------
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 2: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 3: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 7: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 8: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 9: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 10: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 17: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 18: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 20: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 25: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 26: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 27: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 28: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 34: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 35: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 36: OK
>
> Checking Ca: nodeguid 0x0002c9030002628a
> Node check lid 14: OK
> Error check on lid 14 (cstnh-2 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c90300025e0a
> Node check lid 12: OK
> Error check on lid 12 (cstnh-3 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030002615e
> Node check lid 15: OK
> Error check on lid 15 (cstnh-4 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e442
> Node check lid 11: OK
> Error check on lid 11 (cstnh-8 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e44e
> Node check lid 8: OK
> Error check on lid 8 (cstnh-11 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e3e6
> Node check lid 2: OK
> Error check on lid 2 (cstnh-13 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e44a
> Node check lid 18: OK
> Error check on lid 18 (cstnh-9 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c90300044fb4
> Node check lid 13: OK
> Error check on lid 13 (cstnh-7 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c90300044fbc
> Node check lid 10: OK
> Error check on lid 10 (cstnh-1 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e3ee
> Node check lid 9: OK
> Error check on lid 9 (cstnh-10 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e446
> Node check lid 4: OK
> Error check on lid 4 (cstnh-12 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e22e
> Node check lid 1: OK
> Error check on lid 1 (cstnh-14 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c9030008e43e
> Node check lid 19: OK
> Error check on lid 19 (cstnh-15 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0090270002000345
> Node check lid 6: OK
> Error check on lid 6 (cstnh-5 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0090270002000335
> Node check lid 5: OK
> Error check on lid 5 (cstnh-6 HCA-1) port 1: OK
>
> # Checking Ca: nodeguid 0x0002c90300028238
> Node check lid 3: OK
> Error check on lid 3 (cst-linux HCA-1) port 1: OK
>
> ## Summary: 17 nodes checked, 0 bad nodes found
> ## 32 ports checked, 0 ports have errors beyond threshold
> _______________________________________________
> ewg mailing list
> ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org
> http://*lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
>
--
Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH v4] IB/ipoib: fix dangling pointer references to ipoib_neigh and ipoib_path
From: Ralph Campbell @ 2010-05-05 23:30 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma
Basically, a struct sk_buff has a pointer to a struct dst_entry
which has a pointer to a struct neighbour. IPoIB uses
struct neighbour.ha to store the IB "hardware address" and a pointer
to a struct ipoib_neigh. When using connected mode, struct ipoib_neigh
has a pointer to struct ipoib_cm_tx which contains pointers back to struct
ipoib_neigh and ipoib_path. The core network code will call
ipoib_neigh_cleanup() when it is destroying struct neighbour and IPoIB
should guarantee that the struct ipoib_neigh and all the memory it points
to are destroyed. Also, the connected mode RC QP contains a pointer back
to the struct ipoib_cm_tx which can be dereferenced in the CQ completion handler.
The problem is that there are several places where the struct ipoib_cm_tx
can be used after it has been freed. The easiest way to reproduce this
bug is to run a UDP bandwidth test while loading/unloading the IPoIB module
or ifup/ifdown the interface.
The fix is rather complex since the RC QP connection setup, tear down,
and CQ completion draining are asynchronous processes. The struct ipoib_cm_tx
goes through four "states":
1) created, flags==0, and linked on the priv->cm.start_list.
2) not on the start_list, flags==0, and in the process of creating the RC QP.
3) not on the start_list, flags & IPOIB_FLAG_INITIALIZED, and RC QP created.
4) on the priv->cm.reap_task list or being destroyed by ipoib_cm_tx_reap().
Signed-off-by: Ralph Campbell <ralph.campbell-h88ZbnxC6KDQT0dZR+AlfA@public.gmane.org>
---
I updated the description to be more detailed.
I stripped out as much non-functional code changes as I could.
drivers/infiniband/ulp/ipoib/ipoib.h | 14 +
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 108 +++++-----
drivers/infiniband/ulp/ipoib/ipoib_main.c | 259 +++++++++++-------------
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 76 ++-----
4 files changed, 215 insertions(+), 242 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 753a983..5a842d7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -415,9 +415,7 @@ static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
INFINIBAND_ALEN, sizeof(void *));
}
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh,
- struct net_device *dev);
-void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh);
+void ipoib_neigh_flush(struct ipoib_neigh *neigh);
extern struct workqueue_struct *ipoib_workqueue;
@@ -464,7 +462,8 @@ void ipoib_dev_cleanup(struct net_device *dev);
void ipoib_mcast_join_task(struct work_struct *work);
void ipoib_mcast_carrier_on_task(struct work_struct *work);
-void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb);
+void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb,
+ struct ipoib_neigh *neigh);
void ipoib_mcast_restart_task(struct work_struct *work);
int ipoib_mcast_start_thread(struct net_device *dev);
@@ -570,6 +569,7 @@ void ipoib_cm_dev_cleanup(struct net_device *dev);
struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
struct ipoib_neigh *neigh);
void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path);
void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
unsigned int mtu);
void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
@@ -659,6 +659,12 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
}
static inline
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path)
+{
+ return;
+}
+
+static inline
int ipoib_cm_add_mode_attr(struct net_device *dev)
{
return 0;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index bc65837..28c222b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -798,31 +798,14 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
if (wc->status != IB_WC_SUCCESS &&
wc->status != IB_WC_WR_FLUSH_ERR) {
- struct ipoib_neigh *neigh;
-
ipoib_dbg(priv, "failed cm send event "
"(status=%d, wrid=%d vend_err %x)\n",
wc->status, wr_id, wc->vendor_err);
spin_lock_irqsave(&priv->lock, flags);
- neigh = tx->neigh;
-
- if (neigh) {
- neigh->cm = NULL;
- list_del(&neigh->list);
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- ipoib_neigh_free(dev, neigh);
-
- tx->neigh = NULL;
- }
-
- if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
- list_move(&tx->list, &priv->cm.reap_list);
- queue_work(ipoib_workqueue, &priv->cm.reap_task);
- }
clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
+ ipoib_cm_destroy_tx(tx);
spin_unlock_irqrestore(&priv->lock, flags);
}
@@ -1203,7 +1186,6 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
struct ipoib_cm_tx *tx = cm_id->context;
struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
struct net_device *dev = priv->dev;
- struct ipoib_neigh *neigh;
unsigned long flags;
int ret;
@@ -1225,22 +1207,8 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
ipoib_dbg(priv, "CM error %d.\n", event->event);
netif_tx_lock_bh(dev);
spin_lock_irqsave(&priv->lock, flags);
- neigh = tx->neigh;
-
- if (neigh) {
- neigh->cm = NULL;
- list_del(&neigh->list);
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- ipoib_neigh_free(dev, neigh);
-
- tx->neigh = NULL;
- }
- if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
- list_move(&tx->list, &priv->cm.reap_list);
- queue_work(ipoib_workqueue, &priv->cm.reap_task);
- }
+ ipoib_cm_destroy_tx(tx);
spin_unlock_irqrestore(&priv->lock, flags);
netif_tx_unlock_bh(dev);
@@ -1267,20 +1235,62 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
tx->path = path;
tx->dev = dev;
list_add(&tx->list, &priv->cm.start_list);
- set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
queue_work(ipoib_workqueue, &priv->cm.start_task);
return tx;
}
+/*
+ * Note: this is called with netif_tx_lock_bh() and priv->lock held.
+ */
void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
{
struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
- if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+ struct ipoib_neigh *neigh = tx->neigh;
+
+ if (!neigh)
+ return;
+
+ ipoib_dbg(priv, "Reap connection for gid %pI6\n", neigh->dgid.raw);
+ neigh->cm = NULL;
+ tx->neigh = NULL;
+
+ /* Force ipoib_start_xmit() to start a new connection if called */
+ if (neigh->ah) {
+ ipoib_put_ah(neigh->ah);
+ neigh->ah = NULL;
+ memset(&neigh->dgid.raw, 0, sizeof(union ib_gid));
+ }
+
+ /*
+ * If ipoib_cm_tx_start() is actively using this tx,
+ * don't delete it by putting it on the reap_list.
+ * Instead, ipoib_cm_tx_start() will handle the destruction.
+ */
+ if (!list_empty(&tx->list) ||
+ test_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
list_move(&tx->list, &priv->cm.reap_list);
queue_work(ipoib_workqueue, &priv->cm.reap_task);
- ipoib_dbg(priv, "Reap connection for gid %pI6\n",
- tx->neigh->dgid.raw);
- tx->neigh = NULL;
+ }
+}
+
+/*
+ * Search the list of connections to be started and remove any entries
+ * which match the path being destroyed.
+ *
+ * This should be called with netif_tx_lock_bh() and priv->lock held.
+ */
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ struct ipoib_cm_tx *tx, *nx;
+
+ list_for_each_entry_safe(tx, nx, &priv->cm.start_list, list) {
+ tx = list_entry(priv->cm.start_list.next, typeof(*tx), list);
+ if (tx->path == path) {
+ list_del(&tx->list);
+ kfree(tx);
+ break;
+ }
}
}
@@ -1306,6 +1316,7 @@ static void ipoib_cm_tx_start(struct work_struct *work)
neigh = p->neigh;
qpn = IPOIB_QPN(neigh->neighbour->ha);
memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);
+ p->path = NULL;
spin_unlock_irqrestore(&priv->lock, flags);
netif_tx_unlock_bh(dev);
@@ -1315,18 +1326,17 @@ static void ipoib_cm_tx_start(struct work_struct *work)
netif_tx_lock_bh(dev);
spin_lock_irqsave(&priv->lock, flags);
- if (ret) {
- neigh = p->neigh;
- if (neigh) {
+ /*
+ * If ipoib_cm_destroy_tx() was called or there was an error,
+ * we need to destroy tx.
+ */
+ neigh = p->neigh;
+ if (!neigh || ret) {
+ if (neigh)
neigh->cm = NULL;
- list_del(&neigh->list);
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- ipoib_neigh_free(dev, neigh);
- }
- list_del(&p->list);
kfree(p);
- }
+ } else
+ set_bit(IPOIB_FLAG_INITIALIZED, &p->flags);
}
spin_unlock_irqrestore(&priv->lock, flags);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index df3eb8c..6fe905a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -91,6 +91,9 @@ struct workqueue_struct *ipoib_workqueue;
struct ib_sa_client ipoib_sa_client;
+static struct ipoib_neigh *neigh_alloc(struct neighbour *neighbour,
+ struct net_device *dev);
+
static void ipoib_add_one(struct ib_device *device);
static void ipoib_remove_one(struct ib_device *device);
@@ -261,32 +264,16 @@ static int __path_add(struct net_device *dev, struct ipoib_path *path)
return 0;
}
-static void path_free(struct net_device *dev, struct ipoib_path *path)
+static void path_free(struct ipoib_path *path)
{
- struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_neigh *neigh, *tn;
struct sk_buff *skb;
- unsigned long flags;
while ((skb = __skb_dequeue(&path->queue)))
dev_kfree_skb_irq(skb);
- spin_lock_irqsave(&priv->lock, flags);
-
- list_for_each_entry_safe(neigh, tn, &path->neigh_list, list) {
- /*
- * It's safe to call ipoib_put_ah() inside priv->lock
- * here, because we know that path->ah will always
- * hold one more reference, so ipoib_put_ah() will
- * never do more than decrement the ref count.
- */
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
-
- ipoib_neigh_free(dev, neigh);
- }
-
- spin_unlock_irqrestore(&priv->lock, flags);
+ list_for_each_entry_safe(neigh, tn, &path->neigh_list, list)
+ ipoib_neigh_flush(neigh);
if (path->ah)
ipoib_put_ah(path->ah);
@@ -387,10 +374,11 @@ void ipoib_flush_paths(struct net_device *dev)
list_for_each_entry_safe(path, tp, &remove_list, list) {
if (path->query)
ib_sa_cancel_query(path->query_id, path->query);
+ ipoib_cm_flush_path(dev, path);
spin_unlock_irqrestore(&priv->lock, flags);
netif_tx_unlock_bh(dev);
wait_for_completion(&path->done);
- path_free(dev, path);
+ path_free(path);
netif_tx_lock_bh(dev);
spin_lock_irqsave(&priv->lock, flags);
}
@@ -460,19 +448,15 @@ static void path_rec_completion(int status,
memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
sizeof(union ib_gid));
- if (ipoib_cm_enabled(dev, neigh->neighbour)) {
- if (!ipoib_cm_get(neigh))
- ipoib_cm_set(neigh, ipoib_cm_create_tx(dev,
- path,
- neigh));
- if (!ipoib_cm_get(neigh)) {
- list_del(&neigh->list);
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- ipoib_neigh_free(dev, neigh);
- continue;
- }
- }
+ /*
+ * If connected mode is enabled but not started,
+ * start getting a connection.
+ */
+ if (ipoib_cm_enabled(dev, neigh->neighbour) &&
+ !ipoib_cm_get(neigh))
+ ipoib_cm_set(neigh, ipoib_cm_create_tx(dev,
+ path,
+ neigh));
while ((skb = __skb_dequeue(&neigh->queue)))
__skb_queue_tail(&skqueue, skb);
@@ -520,6 +504,8 @@ static struct ipoib_path *path_rec_create(struct net_device *dev, void *gid)
path->pathrec.numb_path = 1;
path->pathrec.traffic_class = priv->broadcast->mcmember.traffic_class;
+ __path_add(dev, path);
+
return path;
}
@@ -554,32 +540,27 @@ static int path_rec_start(struct net_device *dev,
return 0;
}
-static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
+static void neigh_add_path(struct sk_buff *skb, struct net_device *dev,
+ struct ipoib_neigh *neigh)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_path *path;
- struct ipoib_neigh *neigh;
+ struct neighbour *n;
unsigned long flags;
- neigh = ipoib_neigh_alloc(skb_dst(skb)->neighbour, skb->dev);
- if (!neigh) {
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- return;
- }
+ n = skb_dst(skb)->neighbour;
spin_lock_irqsave(&priv->lock, flags);
- path = __path_find(dev, skb_dst(skb)->neighbour->ha + 4);
+ path = __path_find(dev, n->ha + 4);
if (!path) {
- path = path_rec_create(dev, skb_dst(skb)->neighbour->ha + 4);
+ path = path_rec_create(dev, n->ha + 4);
if (!path)
- goto err_path;
-
- __path_add(dev, path);
+ goto err_drop;
}
- list_add_tail(&neigh->list, &path->neigh_list);
+ if (list_empty(&neigh->list))
+ list_add_tail(&neigh->list, &path->neigh_list);
if (path->ah) {
kref_get(&path->ah->ref);
@@ -587,16 +568,11 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
sizeof(union ib_gid));
- if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+ if (ipoib_cm_enabled(dev, n)) {
if (!ipoib_cm_get(neigh))
ipoib_cm_set(neigh, ipoib_cm_create_tx(dev, path, neigh));
- if (!ipoib_cm_get(neigh)) {
- list_del(&neigh->list);
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- ipoib_neigh_free(dev, neigh);
+ if (!ipoib_cm_get(neigh))
goto err_drop;
- }
if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
__skb_queue_tail(&neigh->queue, skb);
else {
@@ -606,12 +582,10 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
}
} else {
spin_unlock_irqrestore(&priv->lock, flags);
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb_dst(skb)->neighbour->ha));
+ ipoib_send(dev, skb, path->ah, IPOIB_QPN(n->ha));
return;
}
} else {
- neigh->ah = NULL;
-
if (!path->query && path_rec_start(dev, path))
goto err_list;
@@ -622,10 +596,7 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
return;
err_list:
- list_del(&neigh->list);
-
-err_path:
- ipoib_neigh_free(dev, neigh);
+ list_del_init(&neigh->list);
err_drop:
++dev->stats.tx_dropped;
dev_kfree_skb_any(skb);
@@ -633,20 +604,22 @@ err_drop:
spin_unlock_irqrestore(&priv->lock, flags);
}
-static void ipoib_path_lookup(struct sk_buff *skb, struct net_device *dev)
+static void path_lookup(struct sk_buff *skb, struct net_device *dev,
+ struct ipoib_neigh *neigh)
{
- struct ipoib_dev_priv *priv = netdev_priv(skb->dev);
+ struct ipoib_dev_priv *priv;
/* Look up path record for unicasts */
if (skb_dst(skb)->neighbour->ha[4] != 0xff) {
- neigh_add_path(skb, dev);
+ neigh_add_path(skb, dev, neigh);
return;
}
/* Add in the P_Key for multicasts */
+ priv = netdev_priv(dev);
skb_dst(skb)->neighbour->ha[8] = (priv->pkey >> 8) & 0xff;
skb_dst(skb)->neighbour->ha[9] = priv->pkey & 0xff;
- ipoib_mcast_send(dev, skb_dst(skb)->neighbour->ha + 4, skb);
+ ipoib_mcast_send(dev, skb_dst(skb)->neighbour->ha + 4, skb, neigh);
}
static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
@@ -659,35 +632,13 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
spin_lock_irqsave(&priv->lock, flags);
path = __path_find(dev, phdr->hwaddr + 4);
- if (!path || !path->valid) {
- int new_path = 0;
-
- if (!path) {
- path = path_rec_create(dev, phdr->hwaddr + 4);
- new_path = 1;
- }
- if (path) {
- /* put pseudoheader back on for next time */
- skb_push(skb, sizeof *phdr);
- __skb_queue_tail(&path->queue, skb);
-
- if (!path->query && path_rec_start(dev, path)) {
- spin_unlock_irqrestore(&priv->lock, flags);
- if (new_path)
- path_free(dev, path);
- return;
- } else
- __path_add(dev, path);
- } else {
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- }
-
- spin_unlock_irqrestore(&priv->lock, flags);
- return;
+ if (!path) {
+ path = path_rec_create(dev, phdr->hwaddr + 4);
+ if (!path)
+ goto drop;
}
- if (path->ah) {
+ if (path->valid && path->ah) {
ipoib_dbg(priv, "Send unicast ARP to %04x\n",
be16_to_cpu(path->pathrec.dlid));
@@ -699,12 +650,16 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
/* put pseudoheader back on for next time */
skb_push(skb, sizeof *phdr);
__skb_queue_tail(&path->queue, skb);
- } else {
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- }
+ } else
+ goto drop;
spin_unlock_irqrestore(&priv->lock, flags);
+ return;
+
+drop:
+ ++dev->stats.tx_dropped;
+ dev_kfree_skb_any(skb);
+ spin_unlock_irqrestore(&priv->lock, flags);
}
static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -714,31 +669,23 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
unsigned long flags;
if (likely(skb_dst(skb) && skb_dst(skb)->neighbour)) {
- if (unlikely(!*to_ipoib_neigh(skb_dst(skb)->neighbour))) {
- ipoib_path_lookup(skb, dev);
+ neigh = *to_ipoib_neigh(skb_dst(skb)->neighbour);
+
+ if (unlikely(!neigh)) {
+ neigh = neigh_alloc(skb_dst(skb)->neighbour, skb->dev);
+ if (!neigh) {
+ ++dev->stats.tx_dropped;
+ dev_kfree_skb_any(skb);
+ } else
+ path_lookup(skb, dev, neigh);
return NETDEV_TX_OK;
}
- neigh = *to_ipoib_neigh(skb_dst(skb)->neighbour);
-
if (unlikely((memcmp(&neigh->dgid.raw,
skb_dst(skb)->neighbour->ha + 4,
sizeof(union ib_gid))) ||
(neigh->dev != dev))) {
- spin_lock_irqsave(&priv->lock, flags);
- /*
- * It's safe to call ipoib_put_ah() inside
- * priv->lock here, because we know that
- * path->ah will always hold one more reference,
- * so ipoib_put_ah() will never do more than
- * decrement the ref count.
- */
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- list_del(&neigh->list);
- ipoib_neigh_free(dev, neigh);
- spin_unlock_irqrestore(&priv->lock, flags);
- ipoib_path_lookup(skb, dev);
+ path_lookup(skb, dev, neigh);
return NETDEV_TX_OK;
}
@@ -770,7 +717,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
phdr->hwaddr[8] = (priv->pkey >> 8) & 0xff;
phdr->hwaddr[9] = priv->pkey & 0xff;
- ipoib_mcast_send(dev, phdr->hwaddr + 4, skb);
+ ipoib_mcast_send(dev, phdr->hwaddr + 4, skb, NULL);
} else {
/* unicast GID -- should be ARP or RARP reply */
@@ -847,34 +794,45 @@ static void ipoib_neigh_cleanup(struct neighbour *n)
{
struct ipoib_neigh *neigh;
struct ipoib_dev_priv *priv = netdev_priv(n->dev);
+ struct sk_buff *skb;
unsigned long flags;
- struct ipoib_ah *ah = NULL;
+
+ spin_lock_irqsave(&priv->lock, flags);
neigh = *to_ipoib_neigh(n);
- if (neigh)
- priv = netdev_priv(neigh->dev);
- else
+ if (!neigh) {
+ spin_unlock_irqrestore(&priv->lock, flags);
return;
+ }
+ *to_ipoib_neigh(n) = NULL;
+ neigh->neighbour = NULL;
+
ipoib_dbg(priv,
"neigh_cleanup for %06x %pI6\n",
IPOIB_QPN(n->ha),
n->ha + 4);
- spin_lock_irqsave(&priv->lock, flags);
+ if (ipoib_cm_get(neigh))
+ ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
- if (neigh->ah)
- ah = neigh->ah;
- list_del(&neigh->list);
- ipoib_neigh_free(n->dev, neigh);
+ if (!list_empty(&neigh->list))
+ list_del_init(&neigh->list);
spin_unlock_irqrestore(&priv->lock, flags);
- if (ah)
- ipoib_put_ah(ah);
+ while ((skb = __skb_dequeue(&neigh->queue))) {
+ ++n->dev->stats.tx_dropped;
+ dev_kfree_skb_any(skb);
+ }
+
+ if (neigh->ah)
+ ipoib_put_ah(neigh->ah);
+
+ kfree(neigh);
}
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
- struct net_device *dev)
+static struct ipoib_neigh *neigh_alloc(struct neighbour *neighbour,
+ struct net_device *dev)
{
struct ipoib_neigh *neigh;
@@ -882,27 +840,58 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour,
if (!neigh)
return NULL;
+ neigh->ah = NULL;
neigh->neighbour = neighbour;
neigh->dev = dev;
memset(&neigh->dgid.raw, 0, sizeof (union ib_gid));
*to_ipoib_neigh(neighbour) = neigh;
skb_queue_head_init(&neigh->queue);
+ INIT_LIST_HEAD(&neigh->list);
ipoib_cm_set(neigh, NULL);
return neigh;
}
-void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh)
+/*
+ * This is called when flushing the path or multicast GID from the
+ * struct ipoib_neigh. ipoib_start_xmit() will then try to reinitialize
+ * the address the next time it is called.
+ * Note that the "neigh" pointer passed should not be used after calling this.
+ */
+void ipoib_neigh_flush(struct ipoib_neigh *neigh)
{
+ struct net_device *dev = neigh->dev;
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ struct sk_buff_head skqueue;
struct sk_buff *skb;
- *to_ipoib_neigh(neigh->neighbour) = NULL;
- while ((skb = __skb_dequeue(&neigh->queue))) {
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- }
+ struct ipoib_ah *ah;
+ unsigned long flags;
+
+ skb_queue_head_init(&skqueue);
+
+ netif_tx_lock_bh(dev);
+ spin_lock_irqsave(&priv->lock, flags);
+
+ while ((skb = __skb_dequeue(&neigh->queue)))
+ __skb_queue_tail(&skqueue, skb);
+
+ ah = neigh->ah;
+ neigh->ah = NULL;
+ memset(&neigh->dgid.raw, 0, sizeof(union ib_gid));
+ list_del_init(&neigh->list);
if (ipoib_cm_get(neigh))
ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
- kfree(neigh);
+
+ spin_unlock_irqrestore(&priv->lock, flags);
+ netif_tx_unlock_bh(dev);
+
+ while ((skb = __skb_dequeue(&skqueue))) {
+ ++dev->stats.tx_dropped;
+ dev_kfree_skb_irq(skb);
+ }
+
+ if (ah)
+ ipoib_put_ah(ah);
}
static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index d41ea27..3863302 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -67,28 +67,14 @@ struct ipoib_mcast_iter {
static void ipoib_mcast_free(struct ipoib_mcast *mcast)
{
struct net_device *dev = mcast->dev;
- struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_neigh *neigh, *tmp;
int tx_dropped = 0;
ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group %pI6\n",
mcast->mcmember.mgid.raw);
- spin_lock_irq(&priv->lock);
-
- list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list) {
- /*
- * It's safe to call ipoib_put_ah() inside priv->lock
- * here, because we know that mcast->ah will always
- * hold one more reference, so ipoib_put_ah() will
- * never do more than decrement the ref count.
- */
- if (neigh->ah)
- ipoib_put_ah(neigh->ah);
- ipoib_neigh_free(dev, neigh);
- }
-
- spin_unlock_irq(&priv->lock);
+ list_for_each_entry_safe(neigh, tmp, &mcast->neigh_list, list)
+ ipoib_neigh_flush(neigh);
if (mcast->ah)
ipoib_put_ah(mcast->ah);
@@ -654,7 +640,8 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
return 0;
}
-void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
+void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb,
+ struct ipoib_neigh *neigh)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_mcast *mcast;
@@ -664,11 +651,8 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags) ||
!priv->broadcast ||
- !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- goto unlock;
- }
+ !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags))
+ goto drop;
mcast = __ipoib_mcast_find(dev, mgid);
if (!mcast) {
@@ -680,9 +664,7 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
if (!mcast) {
ipoib_warn(priv, "unable to allocate memory for "
"multicast structure\n");
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- goto out;
+ goto drop;
}
set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
@@ -692,39 +674,20 @@ void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb)
}
if (!mcast->ah) {
- if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
- skb_queue_tail(&mcast->pkt_queue, skb);
- else {
- ++dev->stats.tx_dropped;
- dev_kfree_skb_any(skb);
- }
-
+ if (skb_queue_len(&mcast->pkt_queue) >= IPOIB_MAX_MCAST_QUEUE)
+ goto drop;
+ skb_queue_tail(&mcast->pkt_queue, skb);
if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
ipoib_dbg_mcast(priv, "no address vector, "
"but multicast join already started\n");
else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
ipoib_mcast_sendonly_join(mcast);
-
- /*
- * If lookup completes between here and out:, don't
- * want to send packet twice.
- */
- mcast = NULL;
- }
-
-out:
- if (mcast && mcast->ah) {
- if (skb_dst(skb) &&
- skb_dst(skb)->neighbour &&
- !*to_ipoib_neigh(skb_dst(skb)->neighbour)) {
- struct ipoib_neigh *neigh = ipoib_neigh_alloc(skb_dst(skb)->neighbour,
- skb->dev);
-
- if (neigh) {
- kref_get(&mcast->ah->ref);
- neigh->ah = mcast->ah;
- list_add_tail(&neigh->list, &mcast->neigh_list);
- }
+ } else {
+ if (neigh && list_empty(&neigh->list)) {
+ kref_get(&mcast->ah->ref);
+ neigh->ah = mcast->ah;
+ memcpy(neigh->dgid.raw, mgid, sizeof(union ib_gid));
+ list_add_tail(&neigh->list, &mcast->neigh_list);
}
spin_unlock_irqrestore(&priv->lock, flags);
@@ -732,8 +695,13 @@ out:
return;
}
-unlock:
spin_unlock_irqrestore(&priv->lock, flags);
+ return;
+
+drop:
+ spin_unlock_irqrestore(&priv->lock, flags);
+ ++dev->stats.tx_dropped;
+ dev_kfree_skb_any(skb);
}
void ipoib_mcast_dev_flush(struct net_device *dev)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
From: Roland Dreier @ 2010-05-05 23:12 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172344.GC12286@mtls03>
> @@ -795,11 +799,12 @@ static void mcast_add_one(struct ib_device *device)
> struct mcast_device *dev;
> struct mcast_port *port;
> int i;
> + int count = 0;
>
> if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
> return;
>
> - dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
> + dev = kzalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
> @@ -1007,7 +1010,7 @@ static void ib_sa_add_one(struct ib_device *device)
> e = device->phys_port_cnt;
> }
>
> - sa_dev = kmalloc(sizeof *sa_dev +
> + sa_dev = kzalloc(sizeof *sa_dev +
Do you happen to remember why you needed these kmalloc -> kzalloc conversions?
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: rnfs: rq_respages pointer is bad
From: Roland Dreier @ 2010-05-05 22:58 UTC (permalink / raw)
To: David J. Wilder
Cc: Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
pradeep-r/Jw6+rmf7HQT0dZR+AlfA
In-Reply-To: <1267489621.9774.41.camel-XfwDJb4SXxnMbYB6QlFGEg@public.gmane.org>
> The source of the problem is in rdma_read_complete(), I am finding that
> rqstp->rq_respages is set to point past the end of the rqstp->rq_pages
> page list. This results in a NULL reference in svc_process() when
> passing rq_respages[0] to page_address().
did this ever end up getting fixed upstream?
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Roland Dreier @ 2010-05-05 22:56 UTC (permalink / raw)
To: Jack Morgenstein
Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Tziporet Koren, diego-VPRAkNaXOzVS1MOuV/RT9w
In-Reply-To: <201005050945.05729.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
> The uverbs layer DOES handle the counting/listing (see, for example,
> list_add_tail at the end of ib_uverbs_create_xrc_rcv_qp.
But that is the per-process list of rcv QPs... not the reference
counting for each rcv QP. Look at how trivial
ib_uverbs_destroy_xrc_rcv_qp is -- there is no reference counting of the
real object there.
> However, I had an additional problem -- to distribute async events received
> for the xrc_rcv QP to all registered processes (so that each could unregister
> and allow the QP to be destroyed -- the ref count going to zero).
> In my original implementation, the low-level driver was responsible for generating
> the events for all the processes. To move this mechanism to the core would require
> a fairly extensive re-write. I would need to introduce ib_core methods for create,
> reg, unreg, and destroy, since the uverbs layer operates per user process and does
> not track across multiple processes. I was concerned that modifications of this
> magnitude would introduce instability in XRC, and would require a significant QA cycle
It seems to me that it should, if anything, be easier to track all these
lists at the uverbs layer. We already have a global xrcd structure for
each xrcd that we could easily stick a list of rcv QPs in (and keep the
list of rcv QP registrations per rcv QP).
I do see your point that the changes might be fairly large (although the
total XRC code is not so big), but we shouldn't stick ourselves with a
wrong implementation just because that implementation was already tested.
> Finally, I do not believe that it is such a bad thing to have low-level driver
> procedures for reg/unreg here. If a given low-level driver has implementation
> details that it wishes to record, it should have the opportunity to do so.
Hmm. I'm having a hard time imagining what those details are -- I
really do feel that this tracking of per-process reference counting etc
is something that we should do once in the core, rather than hiding it
in low-level drivers.
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/4] ib_core: implement XRC RCV qp's
From: Roland Dreier @ 2010-05-05 22:40 UTC (permalink / raw)
To: Jack Morgenstein
Cc: rolandd-FYB4Gu1CFyUAvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Tziporet Koren, diego-VPRAkNaXOzVS1MOuV/RT9w
In-Reply-To: <201005050945.05729.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
> > I don't really understand the semantics here. What is supposed to
> > happen if I do create/reg/destroy?> What happens if one process does
> > destroy while another process is still registered?
> Maybe we can simply assert that the unreg IS the destroy method of the
> IB_SPEC, and get rid of the destroy method.
>
> The xrc target qp section of the spec was not written with QP persistence
> (after the creating process exited) in mind. That requirement surfaced
> at the last minute as a result of testing by the MPI community during the
> implementation phase (as far as I know). Unfortunately, this created
> a semantic problem.
Yes, I think we should try to simplify things here.
It's very unfortunate to diverge from the API that's been shipped for a
while now, but I really think we don't want all these different ways of
saying the same thing, with little difference between create and reg,
and between destroy and unreg.
In fact the smallest possible API would be just
register_xrc_rcv_qp(xrcd, *qp_num)
where the user can pass in an invalid qp_num (say, -1 aka ffffffff) and
have a new QP created, or a valid one to take a new ref on the existing
rcv QP, and
unregister_xrc_rcv_qp(xrcd, qp_num).
(along these lines, the structure in these patches:
+struct ib_uverbs_create_xrc_rcv_qp {
+ __u64 response;
+ __u64 user_handle;
+ __u32 xrcd_handle;
+ __u32 max_send_wr;
+ __u32 max_recv_wr;
+ __u32 max_send_sge;
+ __u32 max_recv_sge;
+ __u32 max_inline_data;
+ __u8 sq_sig_all;
+ __u8 qp_type;
+ __u8 reserved[6];
+ __u64 driver_data[0];
+};
has many fields we don't need. Pretty much all the fields after
xrcd_handle are ignored, except sq_sig_all is used -- and that is highly
dubious since the rcv QP has no SQ! So I would propose something like
just having:
+struct ib_uverbs_reg_xrc_rcv_qp {
+ __u64 response;
+ __u32 xrcd_handle;
+ __u32 qp_num;
+ __u64 driver_data[0];
+};
where response is used to pass back the qp_num in the create case.
And then we just have unreg_xrc_rcv_qp and no destroy method (since they
are synonymous anyway).
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v2 01/51] IB/qib: Add Kconfig
From: Roland Dreier @ 2010-05-05 22:29 UTC (permalink / raw)
To: Ralph Campbell; +Cc: linux-rdma@vger.kernel.org
In-Reply-To: <1273098480.28097.280.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
> I haven't switched all the debug printks to event tracing.
> I guess I will need to strip those out to get something
> ready anytime soon.
Sounds like a good plan. I think we're better off adding something as
small as possible so that we can take a good look at each part as it
goes by.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v2 01/51] IB/qib: Add Kconfig
From: Ralph Campbell @ 2010-05-05 22:28 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <adar5lqat2q.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, 2010-05-05 at 15:21 -0700, Roland Dreier wrote:
> Ralph, I'd like to look at merging this for 2.6.35. Could you send your
> latest code?
>
> - R.
I haven't switched all the debug printks to event tracing.
I guess I will need to strip those out to get something
ready anytime soon.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCHv8 06/11] ipoib: avoid ipoib over IBoE
From: Roland Dreier @ 2010-05-05 22:27 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172420.GG12286@mtls03>
> @@ -1383,6 +1385,9 @@ static void ipoib_remove_one(struct ib_device *device)
> dev_list = ib_get_client_data(device, &ipoib_client);
>
> list_for_each_entry_safe(priv, tmp, dev_list, list) {
> + if (rdma_port_link_layer(device, priv->port) != IB_LINK_LAYER_INFINIBAND)
> + continue;
Why do we need this chunk here? How could a netdev get on our list if
we never create IPoIB interfaces for IBoE ports?
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v2 01/51] IB/qib: Add Kconfig
From: Roland Dreier @ 2010-05-05 22:21 UTC (permalink / raw)
To: Ralph Campbell; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20091203190311.29507.83034.stgit-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
Ralph, I'd like to look at merging this for 2.6.35. Could you send your
latest code?
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCHv8 01/11] ib core: Add link layer property to ports
From: Roland Dreier @ 2010-05-05 22:19 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172335.GB12286@mtls03>
Hi Eli,
I'm hoping to get this IBoE stuff in for 2.6.35. I started an "iboe"
branch in my tree (similar to the xrc branch I've been carrying for a
while), and I added this patch in, except I renamed
rdma_port_link_layer() to rdma_port_get_link_layer(). This seems to
match rdma_node_get_transport() better.
In any case as I add patches to my branch, you can stop worrying about
them, which should make keeping this series updated easier.
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [ewg] [PATCHv8 03/11] IB/umad: Enable support only for IB ports
From: Roland Dreier @ 2010-05-05 22:11 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100218172354.GD12286@mtls03>
Why do we not allow umad for IBoE ports? I understand there's no QP0
but why can't userspace use QP1 just like for IB link layer ports?
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Divy Le Ray @ 2010-05-05 21:40 UTC (permalink / raw)
To: Steve Wise
Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA, dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <4BE1CE39.1050707-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
On 05/05/2010 12:59 PM, Steve Wise wrote:
> Looks good to me.
It looks right to me too.
Cheers,
Divy
>
> Steve.
>
> Roland Dreier wrote:
>> Hi guys, does this info for cxgb4/iw_cxgb4 (pretty much copied from
>> cxgb3, except with Dimitris instead of Divy) look right? If so I'll add
>> it to my tree.
>>
>> Thanks,
>> Roland
>> ---
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 7a9ccda..a00231b 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1719,6 +1719,20 @@ W: http://www.openfabrics.org
>> S: Supported
>> F: drivers/infiniband/hw/cxgb3/
>>
>> +CXGB4 ETHERNET DRIVER (CXGB4)
>> +M: Dimitris Michailidis <dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>> +L: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> +W: http://www.chelsio.com
>> +S: Supported
>> +F: drivers/net/cxgb4/
>> +
>> +CXGB4 IWARP RNIC DRIVER (IW_CXGB4)
>> +M: Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
>> +L: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> +W: http://www.openfabrics.org
>> +S: Supported
>> +F: drivers/infiniband/hw/cxgb4/
>> +
>> CYBERPRO FB DRIVER
>> M: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
>> L: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for
>> non-subscribers)
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/3] ib/iser: remove buggy back-pointer setting
From: Or Gerlitz @ 2010-05-05 21:14 UTC (permalink / raw)
To: Mike Christie; +Cc: Or Gerlitz, Roland Dreier, linux-rdma
In-Reply-To: <4BE1AFF4.30600-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org>
Mike Christie <michaelc-hcNo3dDEHLuVc3sceRu5cw@public.gmane.org> wrote:
> Or Gerlitz wrote:
>> since iscsi connection objects are recycled, on the time the transport connection
>> (e.g iser's ib connection) is released it is illegal to touch the iscsi connection tied to the
>> transport back-pointer, as it may already point to a different transport connection
> I agree on it being a bug
Mike,
So we both agree its a bug, good. Now, with these patches
ep_disconnected isn't blocking any more till the disconnect event is
delivered but rather only till all posted sends and receives are
flushed. Once I started to do that, I was hitting the bug! e.g if I
take a path/port down and then after few seconds UP, before the
session recovery timeout, the session's iscsi connection was
successfully recycled and bounded to an newly allocated ib connection
/ ep. When the disconnect event was finally arriving, the old ib
connection was NULL-ifying the ib_conn pointer
from iscsi_conn->iser_conn and the when iscsid was attempting to send
a PDU (e.g NOP) through the new connection, I got a crash.
> but do you remember why that was added to iscsi_iser_conn_destroy originally?
> I later moved it to iser_conn_release in commit
> b40977d95fb3a1898ace6a7d97e4ed1a33a440a4 but I think Erez had added that null
> and some checks for it being null for a specific bug. I am not 100% sure. Look in the git logs
> to make sure. I will check them too when I get some more time.
okay, I'll look on the git logs, but I think there's a good chance it
was just sitting there, not added at some point, we'll see.
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Steve Wise @ 2010-05-05 19:59 UTC (permalink / raw)
To: Roland Dreier
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <adawrvijhpq.fsf_-_-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
Looks good to me.
Steve.
Roland Dreier wrote:
> Hi guys, does this info for cxgb4/iw_cxgb4 (pretty much copied from
> cxgb3, except with Dimitris instead of Divy) look right? If so I'll add
> it to my tree.
>
> Thanks,
> Roland
> ---
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7a9ccda..a00231b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1719,6 +1719,20 @@ W: http://www.openfabrics.org
> S: Supported
> F: drivers/infiniband/hw/cxgb3/
>
> +CXGB4 ETHERNET DRIVER (CXGB4)
> +M: Dimitris Michailidis <dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> +L: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> +W: http://www.chelsio.com
> +S: Supported
> +F: drivers/net/cxgb4/
> +
> +CXGB4 IWARP RNIC DRIVER (IW_CXGB4)
> +M: Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
> +L: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> +W: http://www.openfabrics.org
> +S: Supported
> +F: drivers/infiniband/hw/cxgb4/
> +
> CYBERPRO FB DRIVER
> M: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
> L: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
From: Roland Dreier @ 2010-05-05 19:55 UTC (permalink / raw)
To: tziporet-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb
Cc: Jason Gunthorpe, Linux RDMA list, ewg, Eli Cohen
In-Reply-To: <adaaasejfew.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
> We found it in performance work of our EN (10G) driver
By the way, it would certainly make sense for the ethernet driver to use
a number of queues that matches num_online_cpus() at the time the
interface is brought up. Since we can't change the # of MSI-X vectors
very easily I think we need to allow for the possible CPUs, but bouncing
a net interface seems lighter weight to me.
Although perhaps reloading a driver on CPU hotplug is OK too?
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
^ permalink raw reply
* Re: [ewg] [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
From: Roland Dreier @ 2010-05-05 19:51 UTC (permalink / raw)
To: tziporet-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb
Cc: Jason Gunthorpe, Linux RDMA list, Eli Cohen, ewg
In-Reply-To: <4BE1C82F.4000009-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
> We found it in performance work of our EN (10G) driver
What does this have to do with performance?
Do you have a system where num_possible_cpus() differs significantly
from num_online_cpus()? What kernel is that with? As far as I can
tell, for x86 they should only be different if there genuinely are CPUs
that are available for hotplug.
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [ewg] [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
From: Tziporet Koren @ 2010-05-05 19:34 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Roland Dreier, Linux RDMA list, Eli Cohen, ewg
In-Reply-To: <20100505163252.GG15969-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
On 5/5/2010 7:32 PM, Jason Gunthorpe wrote:
> On Wed, May 05, 2010 at 07:54:54AM -0700, Roland Dreier wrote:
>
>> > The current code requires num_possible_cpus() + 1 MSIX vectors. However,
>> > num_possible_cpus() stands for the max number of supported CPUs by the kernel.
>>
>> I believe this is wrong -- num_possible_cpus() is the maximum number of
>> CPUs for the running system, including hotplug.
>>
> FWIW, I've seen some systems running RH kernels that blow up here
> because they run out of interrupt numbers if you have two IB cards in
> them.
>
> Mainline kernels have a reworked IRQ number allocator and don't have a
> problem, so maybe this is an ofed only patch?
>
This is not related to OFED
We found it in performance work of our EN (10G) driver
Tziporet
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH/RFC] cxgb4: Add MAINTAINERS info
From: Roland Dreier @ 2010-05-05 19:01 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, dm-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <20100428103356.931ba48c.randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Hi guys, does this info for cxgb4/iw_cxgb4 (pretty much copied from
cxgb3, except with Dimitris instead of Divy) look right? If so I'll add
it to my tree.
Thanks,
Roland
---
diff --git a/MAINTAINERS b/MAINTAINERS
index 7a9ccda..a00231b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1719,6 +1719,20 @@ W: http://www.openfabrics.org
S: Supported
F: drivers/infiniband/hw/cxgb3/
+CXGB4 ETHERNET DRIVER (CXGB4)
+M: Dimitris Michailidis <dm-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
+L: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+W: http://www.chelsio.com
+S: Supported
+F: drivers/net/cxgb4/
+
+CXGB4 IWARP RNIC DRIVER (IW_CXGB4)
+M: Steve Wise <swise-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
+L: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
+W: http://www.openfabrics.org
+S: Supported
+F: drivers/infiniband/hw/cxgb4/
+
CYBERPRO FB DRIVER
M: Russell King <linux-lFZ/pmaqli7XmaaqVzeoHQ@public.gmane.org>
L: linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org (moderated for non-subscribers)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [PATCH 2/3] ib/iser: remove buggy back-pointer setting
From: Mike Christie @ 2010-05-05 17:50 UTC (permalink / raw)
To: Or Gerlitz; +Cc: Roland Dreier, linux-rdma
In-Reply-To: <Pine.LNX.4.64.1005051730170.29957-aDiYczhfhVLdX2U7gxhm1tBPR1lH4CV8@public.gmane.org>
On 05/05/2010 09:30 AM, Or Gerlitz wrote:
> iscsi connection object life cycle includes binding and unbinding
> (conn_stop) to/from the iscsi transport connection object. Since
> iscsi connection objects are recycled, on the time the transport
> connection (e.g iser's ib connection) is released it is illegal
> to touch the iscsi connection tied to the transport back-pointer, as
> it may already point to a different transport connection.
>
> Signed-off-by: Or Gerlitz<ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org>
>
> ---
> drivers/infiniband/ulp/iser/iser_verbs.c | 2 --
> 1 file changed, 2 deletions(-)
>
> Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
> ===================================================================
> --- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
> +++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
> @@ -346,8 +346,6 @@ static void iser_conn_release(struct ise
> /* on EVENT_ADDR_ERROR there's no device yet for this conn */
> if (device != NULL)
> iser_device_try_release(device);
> - if (ib_conn->iser_conn)
> - ib_conn->iser_conn->ib_conn = NULL;
> iscsi_destroy_endpoint(ib_conn->ep);
> }
>
I agree on it being a bug, but do you remember why that was added to
iscsi_iser_conn_destroy originally? I later moved it to
iser_conn_release in
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=b40977d95fb3a1898ace6a7d97e4ed1a33a440a4)
but I think Erez had added that null and some checks for it being null
for a specific bug.
I am not 100% sure. Look in the git logs to make sure. I will check them
too when I get some more time.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] mlx4: Fix chunk sg list overflow in mlx4_alloc_icm()
From: Roland Dreier @ 2010-05-05 17:42 UTC (permalink / raw)
To: sebastien dugue; +Cc: Eli Cohen, linux-rdma
In-Reply-To: <20100430132131.0d829b3c@frecb007965>
good catch, applied. Did this hit you in practice? I guess it would
take a big coherent ICM allocation, were you getting those?
Also what do you think of this independent cleanup on top of your patch?
It handles the error case for allocation and then avoids having the
common case inside a deeper nested block:
drivers/net/mlx4/icm.c | 37 +++++++++++++++++++------------------
1 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/drivers/net/mlx4/icm.c b/drivers/net/mlx4/icm.c
index ef62f17..b07e4de 100644
--- a/drivers/net/mlx4/icm.c
+++ b/drivers/net/mlx4/icm.c
@@ -163,29 +163,30 @@ struct mlx4_icm *mlx4_alloc_icm(struct mlx4_dev *dev, int npages,
ret = mlx4_alloc_icm_pages(&chunk->mem[chunk->npages],
cur_order, gfp_mask);
- if (!ret) {
- ++chunk->npages;
-
- if (coherent)
- ++chunk->nsg;
- else if (chunk->npages == MLX4_ICM_CHUNK_LEN) {
- chunk->nsg = pci_map_sg(dev->pdev, chunk->mem,
- chunk->npages,
- PCI_DMA_BIDIRECTIONAL);
-
- if (chunk->nsg <= 0)
- goto fail;
- }
+ if (ret) {
+ if (--cur_order < 0)
+ goto fail;
+ else
+ continue;
+ }
- if (chunk->npages == MLX4_ICM_CHUNK_LEN)
- chunk = NULL;
+ ++chunk->npages;
- npages -= 1 << cur_order;
- } else {
- --cur_order;
- if (cur_order < 0)
+ if (coherent)
+ ++chunk->nsg;
+ else if (chunk->npages == MLX4_ICM_CHUNK_LEN) {
+ chunk->nsg = pci_map_sg(dev->pdev, chunk->mem,
+ chunk->npages,
+ PCI_DMA_BIDIRECTIONAL);
+
+ if (chunk->nsg <= 0)
goto fail;
}
+
+ if (chunk->npages == MLX4_ICM_CHUNK_LEN)
+ chunk = NULL;
+
+ npages -= 1 << cur_order;
}
if (!coherent && chunk) {
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related
* Re: [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
From: Jason Gunthorpe @ 2010-05-05 16:32 UTC (permalink / raw)
To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg
In-Reply-To: <adahbmmpfep.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
On Wed, May 05, 2010 at 07:54:54AM -0700, Roland Dreier wrote:
> > The current code requires num_possible_cpus() + 1 MSIX vectors. However,
> > num_possible_cpus() stands for the max number of supported CPUs by the kernel.
>
> I believe this is wrong -- num_possible_cpus() is the maximum number of
> CPUs for the running system, including hotplug.
FWIW, I've seen some systems running RH kernels that blow up here
because they run out of interrupt numbers if you have two IB cards in
them.
Mainline kernels have a reworked IRQ number allocator and don't have a
problem, so maybe this is an ofed only patch?
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH 2/2] RDMA/nes: add support of iWARP multicast acceleration over IB_QPT_RAW_ETY QP type
From: Steve Wise @ 2010-05-05 14:56 UTC (permalink / raw)
To: Walukiewicz, Miroslaw
Cc: rdreier-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <BE2BFE91933D1B4089447C64486040801D00D549-IGOiFh9zz4wLt2AQoY/u9bfspsVTdybXVpNB7YpNyf8@public.gmane.org>
> I see here some misunderstanding. Let me explain better how our tramsmit path works.
>
> In our implementation we use normal memory registration path using ibv_reg_mr and we use ibv_post_send() with lkey/vaddr/len.
>
> The implementation of ibv_post_send (nes_post_send in libnes) for RAW QP passes lkey/virtual_addr/len information to kernel using shared page to our device driver (ud_post_send). There is no data copy here and the driver is used only for fast synchronization.
>
> Because our RAW ETH QP must use physical addresses only, ud_post_send() in kernel makes a virtual to physical memory translation and accesses the QP HW for packet transmission. Previously a packet buffer memory was registered and pinned by ibv_reg_mr to provide necessary information for making such translation.
>
>
I see. Thanks!
> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is shared with all other user-kernel communication and it is quite complex. It is a perfect path for QP/CQ/PD/mem management but for me it is too complex for traffic acceleration.
>
> The user<->kernel path through additional driver, shared page for lkey/vaddr/len passing and SW memory translation in kernel is much more effective.
>
> Maybe it is a good idea to make that API more official after some kind of standarization. Our tests proved that it works. We achieved twice better performance and latency. That way could open the way for adding some non-RDMA devices to devices supported OFED API.
>
>
Sounds good.
Do you have specific perf numbers to share? Is this all just optimizing
mcast packets?
Also:
Does this raw qp service share the mac address with the ports being used
by the host stack? Or does each raw qp get its own mac address?
Do you all have a user mode UDP/IP running on this raw qp? If so, does
it use its own IP address separate from the host stack or does it share
the host's IP address.
Thanks,
Steve.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH] mlx4_core: request MSIX vectors as much as there CPU cores
From: Roland Dreier @ 2010-05-05 14:54 UTC (permalink / raw)
To: Eli Cohen; +Cc: Linux RDMA list, ewg
In-Reply-To: <20100505113047.GA12242-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
> The current code requires num_possible_cpus() + 1 MSIX vectors. However,
> num_possible_cpus() stands for the max number of supported CPUs by the kernel.
I believe this is wrong -- num_possible_cpus() is the maximum number of
CPUs for the running system, including hotplug.
> We should use num_online_cpus() which is the number of available CPUs for the
> system.
But then we don't have vectors for hotplugged CPUs.
- R.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
From: Or Gerlitz @ 2010-05-05 14:31 UTC (permalink / raw)
To: Roland Dreier; +Cc: linux-rdma, Mike Christie, Alexander Nezhinsky, Yaron Haviv
In-Reply-To: <Pine.LNX.4.64.1005051726110.29957-aDiYczhfhVLdX2U7gxhm1tBPR1lH4CV8@public.gmane.org>
The iser connection teardown flow isn't over till the underlying
Connection Manager (e.g the IB CM) delivers a disconnected or timeout
event through the RDMA-CM. When the remote (target) side isn't reachable,
e.g when some HW e.g port/hca/switch isn't functioning or taken down
administratively, the CM timeout flow is used and the event may be
generated only after relatively long time, in the order of tens of seconds.
The current iser code exposes this possibly long delay to higher layers,
specifically to the iscsid daemon and iscsi kernel stack. As a result,
the iscsi stack doesn't respond well, to the extent of this low-level CM
delay being added to the fail-over time under HA schemes such as the one
provided by DM multipath through the multipathd(8) service.
This patch enhances the reference counting scheme on iser's IB
connections such that the disconnect flow initiated by iscsid from
user space (ep_disconnect) isn't waiting for the CM to deliver the
disconnect/timeout event. On the other hand, the connection teardown
isn't done from iser's view point till the event is delivered.
The iser ib (rdma) connection object is destroyed when its reference
count reaches zero. When this happens on the RDMA-CM callback context,
extra care is taken such that the RDMA-CM does the actual destroying
of the associated ID as doing it in the callback is prohibited.
The reference count of iser ib connection would normally reach
three, where the <ref, deref> relations are
1. conn <init, terminate>
2. conn <bind, stop/destroy>
3. cma id <create, disconnect/error/timeout callbacks>
Signed-off-by: Or Gerlitz <ogerlitz-smomgflXvOZWk0Htik3J/w@public.gmane.org>
---
with this patch, multipath fail-over time is about 30 seconds,
which is seen here, when a DD over the multi-path device is done
before/during/after the fail-over
regulary, before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.926 s, 1.0 GB/s
taking a port down, causing fail-over during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 46.6117 s, 369 MB/s
after path-failure, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6474 s, 1.0 GB/s
13:00:09 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
13:00:24 connection8:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
13:00:24 connection8:0: detected conn error (1011)
13:00:24 iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)
13:00:39 cto-1 kernel: device-mapper: multipath: Failing path 8:48.
13:00:39 cto-1 multipathd: 8:48: mark as failed
13:00:39 cto-1 multipathd: mpathd: remaining active paths: 1
--> the disconnected event is delivered after the IB CM timeout expires
--> but fail-over doesn't pend on this
13:01:56 iser: iser_cma_handler:event 10 status 0 conn ffff88022dcb39b0 id ffff88022cf09400
without this patch, multipath fail-over time is about 130 seconds
before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6812 s, 1.0 GB/s
taking a port down during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 145.094 s, 118 MB/s
after fail-over, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.8935 s, 1.0 GB/s
14:24:05 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
14:24:20 connection4:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
14:24:20 kernel: connection4:0: detected conn error (1011)
14:24:21 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
--> the disconnected event is delivered after the IB CM timeout expires
--> fail-over pending on this
14:25:59 iser: iser_cma_handler:event 10 conn ffff88022625a1b0 id ffff880222537c00
14:26:14 session4: session recovery timed out after 15 secs
14:26:14 device-mapper: multipath: Failing path 8:64.
14:26:14 multipathd: mpathd: remaining active paths: 1
drivers/infiniband/ulp/iser/iscsi_iser.c | 9 ++-
drivers/infiniband/ulp/iser/iscsi_iser.h | 3 -
drivers/infiniband/ulp/iser/iser_verbs.c | 72 +++++++++++++++++--------------
3 files changed, 46 insertions(+), 38 deletions(-)
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -238,7 +238,7 @@ alloc_err:
* releases the FMR pool, QP and CMA ID objects, returns 0 on success,
* -1 on failure
*/
-static int iser_free_ib_conn_res(struct iser_conn *ib_conn)
+static int iser_free_ib_conn_res(struct iser_conn *ib_conn, int can_destroy_id)
{
BUG_ON(ib_conn == NULL);
@@ -253,7 +253,8 @@ static int iser_free_ib_conn_res(struct
if (ib_conn->qp != NULL)
rdma_destroy_qp(ib_conn->cma_id);
- if (ib_conn->cma_id != NULL)
+ /* if cma handler context, the caller acts s.t the cma destroy the id */
+ if (ib_conn->cma_id != NULL && can_destroy_id)
rdma_destroy_id(ib_conn->cma_id);
ib_conn->fmr_pool = NULL;
@@ -331,7 +332,7 @@ static int iser_conn_state_comp_exch(str
/**
* Frees all conn objects and deallocs conn descriptor
*/
-static void iser_conn_release(struct iser_conn *ib_conn)
+static void iser_conn_release(struct iser_conn *ib_conn, int can_destroy_id)
{
struct iser_device *device = ib_conn->device;
@@ -341,7 +342,7 @@ static void iser_conn_release(struct ise
list_del(&ib_conn->conn_list);
mutex_unlock(&ig.connlist_mutex);
iser_free_rx_descriptors(ib_conn);
- iser_free_ib_conn_res(ib_conn);
+ iser_free_ib_conn_res(ib_conn, can_destroy_id);
ib_conn->device = NULL;
/* on EVENT_ADDR_ERROR there's no device yet for this conn */
if (device != NULL)
@@ -354,10 +355,13 @@ void iser_conn_get(struct iser_conn *ib_
atomic_inc(&ib_conn->refcount);
}
-void iser_conn_put(struct iser_conn *ib_conn)
+int iser_conn_put(struct iser_conn *ib_conn, int can_destroy_id)
{
- if (atomic_dec_and_test(&ib_conn->refcount))
- iser_conn_release(ib_conn);
+ if (atomic_dec_and_test(&ib_conn->refcount)) {
+ iser_conn_release(ib_conn, can_destroy_id);
+ return 1;
+ }
+ return 0;
}
/**
@@ -381,19 +385,20 @@ void iser_conn_terminate(struct iser_con
wait_event_interruptible(ib_conn->wait,
ib_conn->state == ISER_CONN_DOWN);
- iser_conn_put(ib_conn);
+ iser_conn_put(ib_conn, 1); /* deref ib conn deallocate */
}
-static void iser_connect_error(struct rdma_cm_id *cma_id)
+static int iser_connect_error(struct rdma_cm_id *cma_id)
{
struct iser_conn *ib_conn;
ib_conn = (struct iser_conn *)cma_id->context;
ib_conn->state = ISER_CONN_DOWN;
wake_up_interruptible(&ib_conn->wait);
+ return iser_conn_put(ib_conn, 0); /* deref ib conn's cma id */
}
-static void iser_addr_handler(struct rdma_cm_id *cma_id)
+static int iser_addr_handler(struct rdma_cm_id *cma_id)
{
struct iser_device *device;
struct iser_conn *ib_conn;
@@ -402,8 +407,7 @@ static void iser_addr_handler(struct rdm
device = iser_device_find_by_ib_device(cma_id);
if (!device) {
iser_err("device lookup/creation failed\n");
- iser_connect_error(cma_id);
- return;
+ return iser_connect_error(cma_id);
}
ib_conn = (struct iser_conn *)cma_id->context;
@@ -412,11 +416,13 @@ static void iser_addr_handler(struct rdm
ret = rdma_resolve_route(cma_id, 1000);
if (ret) {
iser_err("resolve route failed: %d\n", ret);
- iser_connect_error(cma_id);
+ return iser_connect_error(cma_id);
}
+
+ return 0;
}
-static void iser_route_handler(struct rdma_cm_id *cma_id)
+static int iser_route_handler(struct rdma_cm_id *cma_id)
{
struct rdma_conn_param conn_param;
int ret;
@@ -437,9 +443,9 @@ static void iser_route_handler(struct rd
goto failure;
}
- return;
+ return 0;
failure:
- iser_connect_error(cma_id);
+ return iser_connect_error(cma_id);
}
static void iser_connected_handler(struct rdma_cm_id *cma_id)
@@ -451,12 +457,12 @@ static void iser_connected_handler(struc
wake_up_interruptible(&ib_conn->wait);
}
-static void iser_disconnected_handler(struct rdma_cm_id *cma_id)
+static int iser_disconnected_handler(struct rdma_cm_id *cma_id)
{
struct iser_conn *ib_conn;
+ int ret;
ib_conn = (struct iser_conn *)cma_id->context;
- ib_conn->disc_evt_flag = 1;
/* getting here when the state is UP means that the conn is being *
* terminated asynchronously from the iSCSI layer's perspective. */
@@ -471,20 +477,24 @@ static void iser_disconnected_handler(st
ib_conn->state = ISER_CONN_DOWN;
wake_up_interruptible(&ib_conn->wait);
}
+
+ ret = iser_conn_put(ib_conn, 0); /* deref ib conn's cma id */
+ return ret;
}
static int iser_cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
{
int ret = 0;
- iser_err("event %d conn %p id %p\n",event->event,cma_id->context,cma_id);
+ iser_err("event %d status %d conn %p id %p\n",
+ event->event, event->status, cma_id->context, cma_id);
switch (event->event) {
case RDMA_CM_EVENT_ADDR_RESOLVED:
- iser_addr_handler(cma_id);
+ ret = iser_addr_handler(cma_id);
break;
case RDMA_CM_EVENT_ROUTE_RESOLVED:
- iser_route_handler(cma_id);
+ ret = iser_route_handler(cma_id);
break;
case RDMA_CM_EVENT_ESTABLISHED:
iser_connected_handler(cma_id);
@@ -494,13 +504,12 @@ static int iser_cma_handler(struct rdma_
case RDMA_CM_EVENT_CONNECT_ERROR:
case RDMA_CM_EVENT_UNREACHABLE:
case RDMA_CM_EVENT_REJECTED:
- iser_err("event: %d, error: %d\n", event->event, event->status);
- iser_connect_error(cma_id);
+ ret = iser_connect_error(cma_id);
break;
case RDMA_CM_EVENT_DISCONNECTED:
case RDMA_CM_EVENT_DEVICE_REMOVAL:
case RDMA_CM_EVENT_ADDR_CHANGE:
- iser_disconnected_handler(cma_id);
+ ret = iser_disconnected_handler(cma_id);
break;
default:
iser_err("Unexpected RDMA CM event (%d)\n", event->event);
@@ -515,7 +524,7 @@ void iser_conn_init(struct iser_conn *ib
init_waitqueue_head(&ib_conn->wait);
ib_conn->post_recv_buf_count = 0;
atomic_set(&ib_conn->post_send_buf_count, 0);
- atomic_set(&ib_conn->refcount, 1);
+ atomic_set(&ib_conn->refcount, 1); /* ref ib conn allocation */
INIT_LIST_HEAD(&ib_conn->conn_list);
spin_lock_init(&ib_conn->lock);
}
@@ -543,6 +552,7 @@ int iser_connect(struct iser_conn *ib_
ib_conn->state = ISER_CONN_PENDING;
+ iser_conn_get(ib_conn); /* ref ib conn's cma id */
ib_conn->cma_id = rdma_create_id(iser_cma_handler,
(void *)ib_conn,
RDMA_PS_TCP);
@@ -580,7 +590,7 @@ id_failure:
addr_failure:
ib_conn->state = ISER_CONN_DOWN;
connect_failure:
- iser_conn_release(ib_conn);
+ iser_conn_release(ib_conn, 1);
return err;
}
@@ -749,12 +759,10 @@ static void iser_handle_comp_error(struc
iscsi_conn_failure(ib_conn->iser_conn->iscsi_conn,
ISCSI_ERR_CONN_FAILED);
- /* complete the termination process if disconnect event was delivered *
- * note there are no more non completed posts to the QP */
- if (ib_conn->disc_evt_flag) {
- ib_conn->state = ISER_CONN_DOWN;
- wake_up_interruptible(&ib_conn->wait);
- }
+ /* no more non completed posts to the QP, complete the
+ * termination process w.o worrying on disconnect event */
+ ib_conn->state = ISER_CONN_DOWN;
+ wake_up_interruptible(&ib_conn->wait);
}
}
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -325,7 +325,7 @@ iscsi_iser_conn_destroy(struct iscsi_cls
*/
if (ib_conn) {
ib_conn->iser_conn = NULL;
- iser_conn_put(ib_conn);
+ iser_conn_put(ib_conn, 1); /* deref iscsi/ib conn unbinding */
}
}
@@ -357,11 +357,12 @@ iscsi_iser_conn_bind(struct iscsi_cls_se
/* binds the iSER connection retrieved from the previously
* connected ep_handle to the iSCSI layer connection. exchanges
* connection pointers */
- iser_err("binding iscsi conn %p to iser_conn %p\n",conn,ib_conn);
+ iser_err("binding iscsi/iser conn %p %p to ib_conn %p\n",
+ conn, conn->dd_data, ib_conn);
iser_conn = conn->dd_data;
ib_conn->iser_conn = iser_conn;
iser_conn->ib_conn = ib_conn;
- iser_conn_get(ib_conn);
+ iser_conn_get(ib_conn); /* ref iscsi/ib conn binding */
return 0;
}
@@ -382,7 +383,7 @@ iscsi_iser_conn_stop(struct iscsi_cls_co
* There is no unbind event so the stop callback
* must release the ref from the bind.
*/
- iser_conn_put(ib_conn);
+ iser_conn_put(ib_conn, 1); /* deref iscsi/ib conn unbinding */
}
iser_conn->ib_conn = NULL;
}
Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.h
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -247,7 +247,6 @@ struct iser_conn {
struct rdma_cm_id *cma_id; /* CMA ID */
struct ib_qp *qp; /* QP */
struct ib_fmr_pool *fmr_pool; /* pool of IB FMRs */
- int disc_evt_flag; /* disconn event delivered */
wait_queue_head_t wait; /* waitq for conn/disconn */
int post_recv_buf_count; /* posted rx count */
atomic_t post_send_buf_count; /* posted tx count */
@@ -321,7 +320,7 @@ void iser_conn_init(struct iser_conn *ib
void iser_conn_get(struct iser_conn *ib_conn);
-void iser_conn_put(struct iser_conn *ib_conn);
+int iser_conn_put(struct iser_conn *ib_conn, int destroy_cma_id_allowed);
void iser_conn_terminate(struct iser_conn *ib_conn);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox