* Re: [PATCH] IB/core: Control number of retries for SA to leave an MCG
[not found] ` <4D4973B2.9070300-hKgKHo2Ms0F+cjeuK/JdrQ@public.gmane.org>
@ 2011-02-02 15:38 ` Mike Heinz
[not found] ` <4C2744E8AD2982428C5BFE523DF8CDCB4A20A5E98F-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Mike Heinz @ 2011-02-02 15:38 UTC (permalink / raw)
To: Moni Shoua, Vlad
Cc: nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ewg
Wouldn't the BUSY patch I proposed last year deal with this situation?
-----Original Message-----
From: ewg-bounces-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org [mailto:ewg-bounces-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org] On Behalf Of Moni Shoua
Sent: Wednesday, February 02, 2011 10:10 AM
To: Vlad
Cc: nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org; ewg
Subject: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG
This patch helps when the SM is busy and, as a result, an MC group is left joined
while the host believes that it has left.
Note: the patch below does not modify drivers/infiniband/core directly; it adds
a patch under kernel_patches/fixes.
Index: ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch 2011-02-02 16:52:02.000000000 +0200
@@ -0,0 +1,46 @@
+Add a multicast leave maximum retry setting in /sys/module/ib_sa/parameters/mcast_leave_retries.
+Add a debug print when the maximum retry count is reached.
+
+Signed-off-by: Nir Muchtar <nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org>
+Reviewed-by: Moni Shoua <monis-smomgflXvOZWk0Htik3J/w@public.gmane.org>
+--
+
+Index: ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c
+===================================================================
+--- ofa_kernel-1.5.2.orig/drivers/infiniband/core/multicast.c 2010-08-17 12:56:06.000000000 +0300
++++ ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c 2010-08-17 13:15:38.000000000 +0300
+@@ -40,6 +40,12 @@
+ #include <rdma/ib_cache.h>
+ #include "sa.h"
+
++static int mcast_leave_retries = 3;
++
++module_param_call(mcast_leave_retries, param_set_int, param_get_int,
++ &mcast_leave_retries, 0644);
++MODULE_PARM_DESC(mcast_leave_retries, "Number of retries for multicast leave requests before giving up");
++
+ static void mcast_add_one(struct ib_device *device);
+ static void mcast_remove_one(struct ib_device *device);
+
+@@ -520,8 +526,11 @@
+ if (status && (group->retries > 0) &&
+ !send_leave(group, group->leave_state))
+ group->retries--;
+- else
++ else {
++ if (status && group->retries <= 0)
++ printk(KERN_WARNING "reached max retry count. status=%d. Giving up\n", status);
+ mcast_work_handler(&group->work);
++ }
+ }
+
+ static struct mcast_group *acquire_group(struct mcast_port *port,
+@@ -544,7 +553,7 @@
+ if (!group)
+ return NULL;
+
+- group->retries = 3;
++ group->retries = mcast_leave_retries;
+ group->port = port;
+ group->rec.mgid = *mgid;
+ group->pkey_index = MCAST_INVALID_PKEY_INDEX;
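
For anyone who wants to check the setting from userspace, here is a minimal
sketch of reading an integer module parameter file. It assumes the patch is
applied, ib_sa is loaded, and the parameter shows up at
/sys/module/ib_sa/parameters/mcast_leave_retries as the patch description
says. The helper name read_int_param is hypothetical and is not part of the
patch or of any library.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one integer from a sysfs-style parameter file.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * file cannot be opened or does not start with an integer.
 */
int read_int_param(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_int_param("/sys/module/ib_sa/parameters/mcast_leave_retries", &v)
would be expected to return -1 on a system where the module or the patch is
not present, since the file would not exist.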
_______________________________________________
ewg mailing list
ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
This message and any attached documents contain information from QLogic Corporation or its wholly-owned subsidiaries that may be confidential. If you are not the intended recipient, you may not read, copy, distribute, or use this information. If you have received this transmission in error, please notify the sender immediately by reply e-mail and then delete this message.
* Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG
[not found] ` <4C2744E8AD2982428C5BFE523DF8CDCB4A20A5E98F-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
@ 2011-02-02 15:41 ` Moni Shoua
[not found] ` <4D497B35.2010901-hKgKHo2Ms0F+cjeuK/JdrQ@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Moni Shoua @ 2011-02-02 15:41 UTC (permalink / raw)
To: Mike Heinz
Cc: Vlad, nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ewg
Mike Heinz wrote:
> Wouldn't the BUSY patch I proposed last year deal with this situation?
Can you please send a link to this patch?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* RE: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG
[not found] ` <4D497B35.2010901-hKgKHo2Ms0F+cjeuK/JdrQ@public.gmane.org>
@ 2011-02-02 15:45 ` Mike Heinz
[not found] ` <4C2744E8AD2982428C5BFE523DF8CDCB4A20A5E993-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Mike Heinz @ 2011-02-02 15:45 UTC (permalink / raw)
To: Moni Shoua
Cc: Vlad, nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ewg
It was discussed in the Linux-RDMA list for many months. You can find a list of the archived messages here:
http://www.mail-archive.com/search?q=SA+Busy&l=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
The most recent version of the patch is here:
http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg06644.html
Basically, the spec permits an SM to reply "busy" instead of simply tossing packets on the floor, but OFED does not handle this case right now.
-----Original Message-----
From: Moni Shoua [mailto:monis-hKgKHo2Ms0F+cjeuK/JdrQ@public.gmane.org]
Sent: Wednesday, February 02, 2011 10:42 AM
To: Mike Heinz
Cc: Vlad; nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; ewg
Subject: Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG
Mike Heinz wrote:
> Wouldn't the BUSY patch I proposed last year deal with this situation?
Can you please send a link to this patch?
* Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG
[not found] ` <4C2744E8AD2982428C5BFE523DF8CDCB4A20A5E993-amwN6d8PyQWXx9kJd3VG2h2eb7JE58TQ@public.gmane.org>
@ 2011-02-02 16:21 ` Moni Shoua
[not found] ` <4D49848B.6090206-hKgKHo2Ms0F+cjeuK/JdrQ@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Moni Shoua @ 2011-02-02 16:21 UTC (permalink / raw)
To: Mike Heinz
Cc: Vlad, nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ewg
Mike Heinz wrote:
> It was discussed in the Linux-RDMA list for many months. You can find a list of the archived messages here:
>
> http://www.mail-archive.com/search?q=SA+Busy&l=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>
> The most recent version of the patch is here:
>
> http://www.mail-archive.com/linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg06644.html
>
> Basically, the spec permits an SM to reply "busy" instead of simply tossing packets on the floor, but OFED does not handle this case right now.
>
I took a look, and I'm not sure that your patch solves the issue I was trying to handle.
Nothing prevents the number of retries from reaching zero, and nothing reports to the upper layer (e.g. IPoIB)
that leaving the multicast group has failed.
The worst thing is that there is no indication to the user about this state (a host stays joined with no one ever trying to make it leave).
The patch I sent also puts a message in the kernel log so users can read it and react.
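
The retry exhaustion described above can be sketched as a small userspace
model. This is a simplified illustration, not the kernel code: the struct and
function names (leave_sim, start_leave, leave_done) are hypothetical, the
retry budget is reset when the leave starts rather than when the group is
acquired, and fprintf stands in for the kernel log message.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the mcast_leave_retries module parameter from the patch. */
int mcast_leave_retries = 3;

/* Hypothetical model of the per-group leave state. */
struct leave_sim {
    int retries;   /* remaining retries, like group->retries */
    int sends;     /* leave requests sent so far */
    bool gave_up;  /* set when a failure arrives with no retries left */
};

/* Models sending the first leave request. */
void start_leave(struct leave_sim *g)
{
    g->retries = mcast_leave_retries;
    g->sends = 1;
    g->gave_up = false;
}

/*
 * Models the leave completion handler. A nonzero status means the leave
 * failed (for example, the SA never answered). Returns true if another
 * leave request was sent.
 */
bool leave_done(struct leave_sim *g, int status)
{
    if (status && g->retries > 0) {
        g->sends++;    /* resend the leave request */
        g->retries--;
        return true;
    }
    if (status && g->retries <= 0) {
        /* Without this message the failure is silent. */
        g->gave_up = true;
        fprintf(stderr, "reached max retry count. status=%d. Giving up\n",
                status);
    }
    return false;
}
```

With the default of 3, a leave that keeps failing is sent four times in total
(one initial request plus three retries) before the model gives up. After
that the host stops trying, while the SA may still consider it a member.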
* RE: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG
[not found] ` <4D49848B.6090206-hKgKHo2Ms0F+cjeuK/JdrQ@public.gmane.org>
@ 2011-02-02 17:26 ` Hefty, Sean
0 siblings, 0 replies; 5+ messages in thread
From: Hefty, Sean @ 2011-02-02 17:26 UTC (permalink / raw)
To: Moni Shoua, Mike Heinz
Cc: Vlad, nirm-smomgflXvOZWk0Htik3J/w@public.gmane.org,
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ewg
> The worst thing is that there is no indication to the user about this state
There can be multiple users of the same group. Only the last one leaving causes the leave request to be sent. Notifying that one user of the group doesn't help. All they can do is ask for the leave request to be retried anyway, which has to be coordinated with potential new users.
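
To illustrate the point about multiple users, here is a minimal userspace
sketch of reference-counted membership. It is a model under stated
assumptions, not the ib_sa implementation: mc_group_sim, group_join and
group_leave are hypothetical names, and "sending" a leave request is reduced
to incrementing a counter.

```c
#include <stdbool.h>

/* Hypothetical model of one multicast group shared by local users. */
struct mc_group_sim {
    int users;        /* local users currently joined */
    int leaves_sent;  /* leave requests sent to the SA */
};

/* A local user joins the group. */
void group_join(struct mc_group_sim *g)
{
    g->users++;
}

/* Returns true if this call caused a leave request to be sent to the SA. */
bool group_leave(struct mc_group_sim *g)
{
    if (g->users == 0)
        return false;   /* nobody is joined; nothing to do */
    if (--g->users > 0)
        return false;   /* other users remain; no request is sent */
    g->leaves_sent++;   /* last user left: one leave request goes out */
    return true;
}
```

In this model only the caller that drops the count to zero triggers the
request, so a failure report would reach just that one user, and a new
group_join could arrive while the leave is still being retried.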
> (a host stays joined with no one ever trying to make it leave).
> The patch I sent also puts a message in the kernel log so users can read it
> and react.
IMO, this is always a possibility and something that the SA must be able to handle. If a node, switch, link, etc. goes down, there's no guarantee that any leave request will be generated, let alone make it to the SA.
Architecturally, I think the only currently available option is for the SA to ask a client to reregister.
- Sean