* [Patch 0/2] IB/mlx5: Add PCI error handler support for mlx5
@ 2014-03-12 3:42 clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
2014-03-12 3:42 ` [Patch 1/2] IB/mlx5: Implementation of PCI error handler clsoto
2014-03-12 3:42 ` [Patch 2/2] IB/mlx5: Free resources during PCI error clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
0 siblings, 2 replies; 11+ messages in thread
From: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 @ 2014-03-12 3:42 UTC (permalink / raw)
To: eli-VPRAkNaXOzVWk0Htik3J/w, roland-DgEjT+Ai2ygdnm+yROfE0A,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
Cc: brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
The following series adds PCI error handler support to the mlx5 driver
and improves the mlx5 remove_one function so that it can run every
teardown step and release all resources.
IB/mlx5: Implementation of PCI error handler
IB/mlx5: Free resources during PCI error
drivers/infiniband/hw/mlx5/main.c | 32 ++++++++++++++++++++++++-
drivers/infiniband/hw/mlx5/mr.c | 9 ++++---
drivers/infiniband/hw/mlx5/qp.c | 6 +++-
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 8 ++----
drivers/net/ethernet/mellanox/mlx5/core/main.c | 1
include/linux/mlx5/driver.h | 6 ++--
7 files changed, 55 insertions(+), 14 deletions(-)
--
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [Patch 1/2] IB/mlx5: Implementation of PCI error handler
2014-03-12 3:42 [Patch 0/2] IB/mlx5: Add PCI error handler support for mlx5 clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
@ 2014-03-12 3:42 ` clsoto
[not found] ` <20140312034512.065218504-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2014-03-12 3:42 ` [Patch 2/2] IB/mlx5: Free resources during PCI error clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
1 sibling, 1 reply; 11+ messages in thread
From: clsoto @ 2014-03-12 3:42 UTC (permalink / raw)
To: eli, roland, sean.hefty, hal.rosenstock, linux-rdma, netdev
Cc: brking, Carol Soto
[-- Attachment #1: ib_mlx5_add_pci_error.patch --]
[-- Type: text/plain, Size: 3396 bytes --]
This patch adds PCI error handler support for mlx5. It implements the
error_detected and slot_reset callbacks and sends a port down event to
users when the driver's error_detected function is invoked. This
prevents a hang seen in mcast_remove_one when ib_unregister_device is
called for the ib_sa module. Hardware commands fail immediately while
the driver is handling a PCI error. The hardware command timeout is
reduced to 10 msecs so that the driver does not hang waiting for the
completion interrupt of a hardware command.
Signed-off-by: Carol Soto <clsoto@linux.vnet.ibm.com>
---
drivers/infiniband/hw/mlx5/main.c | 32 +++++++++++++++++++++++++-
drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++
include/linux/mlx5/driver.h | 4 +--
3 files changed, 40 insertions(+), 3 deletions(-)
Index: b/drivers/infiniband/hw/mlx5/main.c
===================================================================
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1508,11 +1508,41 @@ static DEFINE_PCI_DEVICE_TABLE(mlx5_ib_p
MODULE_DEVICE_TABLE(pci, mlx5_ib_pci_table);
+static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
+{
+ struct mlx5_ib_dev *dev = mlx5_pci2ibdev(pdev);
+ struct mlx5_core_dev *mdev = &dev->mdev;
+ u8 port;
+
+ /* To avoid the mcast hang with ipoib up */
+ for (port = 1; port <= dev->mdev.caps.num_ports; port++)
+ mlx5_ib_event(mdev, MLX5_DEV_EVENT_PORT_DOWN, &port);
+
+ remove_one(pdev);
+
+ return state == pci_channel_io_perm_failure ?
+ PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
+}
+
+static pci_ers_result_t mlx5_pci_slot_reset(struct pci_dev *pdev)
+{
+ int ret = init_one(pdev, 0);
+
+ return ret ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
+}
+
+static const struct pci_error_handlers mlx5_err_handler = {
+ .error_detected = mlx5_pci_err_detected,
+ .slot_reset = mlx5_pci_slot_reset,
+};
+
static struct pci_driver mlx5_ib_driver = {
.name = DRIVER_NAME,
.id_table = mlx5_ib_pci_table,
.probe = init_one,
- .remove = remove_one
+ .remove = remove_one,
+ .err_handler = &mlx5_err_handler,
};
static int __init mlx5_ib_init(void)
Index: b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
===================================================================
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -646,6 +646,13 @@ static int mlx5_cmd_invoke(struct mlx5_c
if (callback && page_queue)
return -EINVAL;
+ if (pci_channel_offline(dev->pdev)) {
+ /* Device is going through error recovery
+ * and cannot accept commands.
+ */
+ return -EIO;
+ }
+
ent = alloc_cmd(cmd, in, out, uout, uout_size, callback, context,
page_queue);
if (IS_ERR(ent))
Index: b/include/linux/mlx5/driver.h
===================================================================
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -51,10 +51,10 @@ enum {
};
enum {
- /* one minute for the sake of bringup. Generally, commands must always
+ /* 10 msecs for the sake of bringup. Generally, commands must always
* complete and we may need to increase this timeout value
*/
- MLX5_CMD_TIMEOUT_MSEC = 7200 * 1000,
+ MLX5_CMD_TIMEOUT_MSEC = 10 * 1000,
MLX5_CMD_WQ_MAX_NAME = 32,
};
--
* [Patch 2/2] IB/mlx5: Free resources during PCI error
2014-03-12 3:42 [Patch 0/2] IB/mlx5: Add PCI error handler support for mlx5 clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
2014-03-12 3:42 ` [Patch 1/2] IB/mlx5: Implementation of PCI error handler clsoto
@ 2014-03-12 3:42 ` clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
1 sibling, 0 replies; 11+ messages in thread
From: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8 @ 2014-03-12 3:42 UTC (permalink / raw)
To: eli-VPRAkNaXOzVWk0Htik3J/w, roland-DgEjT+Ai2ygdnm+yROfE0A,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
Cc: brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Carol Soto
[-- Attachment #1: ib_mlx5_free_resources_during_pci_error.patch --]
[-- Type: text/plain, Size: 4224 bytes --]
This patch makes sure that, during a PCI error, the remove_one
function frees resources even when a hardware command fails, to avoid
memory leaks when the adapter recovers. It also makes sure that
remove_one runs through all of its teardown steps, such as disabling
all IRQs and disabling the PCI device, so that once remove_one is done
the eehd daemon can continue the recovery process.
Signed-off-by: Carol Soto <clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
drivers/infiniband/hw/mlx5/mr.c | 9 ++++++---
drivers/infiniband/hw/mlx5/qp.c | 6 +++++-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 8 +++-----
drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 -
include/linux/mlx5/driver.h | 2 +-
5 files changed, 15 insertions(+), 11 deletions(-)
Index: b/drivers/infiniband/hw/mlx5/mr.c
===================================================================
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -471,9 +471,12 @@ static void clean_keys(struct mlx5_ib_de
ent->size--;
spin_unlock_irq(&ent->lock);
err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
- if (err)
- mlx5_ib_warn(dev, "failed destroy mkey\n");
- else
+ if (err) {
+ if (pci_channel_offline(dev->mdev.pdev))
+ kfree(mr);
+ else
+ mlx5_ib_warn(dev, "failed destroy mkey\n");
+ } else
kfree(mr);
}
}
Index: b/drivers/infiniband/hw/mlx5/qp.c
===================================================================
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -2564,7 +2564,11 @@ int mlx5_ib_dealloc_xrcd(struct ib_xrcd
err = mlx5_core_xrcd_dealloc(&dev->mdev, xrcdn);
if (err) {
- mlx5_ib_warn(dev, "failed to dealloc xrcdn 0x%x\n", xrcdn);
+ if (pci_channel_offline(dev->mdev.pdev))
+ kfree(xrcd);
+ else
+ mlx5_ib_warn(dev, "failed to dealloc xrcdn 0x%x\n",
+ xrcdn);
return err;
}
Index: b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
===================================================================
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -482,14 +482,12 @@ err1:
return err;
}
-int mlx5_stop_eqs(struct mlx5_core_dev *dev)
+void mlx5_stop_eqs(struct mlx5_core_dev *dev)
{
struct mlx5_eq_table *table = &dev->priv.eq_table;
int err;
- err = mlx5_destroy_unmap_eq(dev, &table->pages_eq);
- if (err)
- return err;
+ mlx5_destroy_unmap_eq(dev, &table->pages_eq);
mlx5_destroy_unmap_eq(dev, &table->async_eq);
mlx5_cmd_use_polling(dev);
@@ -498,7 +496,7 @@ int mlx5_stop_eqs(struct mlx5_core_dev *
if (err)
mlx5_cmd_use_events(dev);
- return err;
+ return;
}
int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
Index: b/drivers/net/ethernet/mellanox/mlx5/core/main.c
===================================================================
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -508,7 +508,6 @@ void mlx5_dev_cleanup(struct mlx5_core_d
mlx5_stop_health_poll(dev);
if (mlx5_cmd_teardown_hca(dev)) {
dev_err(&dev->pdev->dev, "tear_down_hca failed, skip cleanup\n");
- return;
}
mlx5_pagealloc_stop(dev);
mlx5_reclaim_startup_pages(dev);
Index: b/include/linux/mlx5/driver.h
===================================================================
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -721,7 +721,7 @@ int mlx5_create_map_eq(struct mlx5_core_
int nent, u64 mask, const char *name, struct mlx5_uar *uar);
int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq);
int mlx5_start_eqs(struct mlx5_core_dev *dev);
-int mlx5_stop_eqs(struct mlx5_core_dev *dev);
+void mlx5_stop_eqs(struct mlx5_core_dev *dev);
int mlx5_core_attach_mcg(struct mlx5_core_dev *dev, union ib_gid *mgid, u32 qpn);
int mlx5_core_detach_mcg(struct mlx5_core_dev *dev, union ib_gid *mgid, u32 qpn);
--
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
[not found] ` <20140312034512.065218504-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2014-03-12 18:34 ` Ben Hutchings
[not found] ` <1394649252.23624.36.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: Ben Hutchings @ 2014-03-12 18:34 UTC (permalink / raw)
To: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
Cc: eli-VPRAkNaXOzVWk0Htik3J/w, roland-DgEjT+Ai2ygdnm+yROfE0A,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
[-- Attachment #1: Type: text/plain, Size: 951 bytes --]
On Tue, 2014-03-11 at 22:42 -0500, clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org wrote:
[...]
> Index: b/include/linux/mlx5/driver.h
> ===================================================================
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@ -51,10 +51,10 @@ enum {
> };
>
> enum {
> - /* one minute for the sake of bringup. Generally, commands must always
> + /* 10 msecs for the sake of bringup. Generally, commands must always
> * complete and we may need to increase this timeout value
> */
> - MLX5_CMD_TIMEOUT_MSEC = 7200 * 1000,
> + MLX5_CMD_TIMEOUT_MSEC = 10 * 1000,
You seem to be changing the timeout from 2 hours (not one minute) to 10
seconds (not milliseconds).
Ben.
> MLX5_CMD_WQ_MAX_NAME = 32,
> };
>
>
--
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.
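The unit mismatch pointed out in this reply can be checked with a short C sketch. Only the two constants come from the quoted hunk; the helper names msec_to_sec and msec_to_hours are hypothetical and exist purely for illustration.

```c
#include <assert.h>

/* Values of MLX5_CMD_TIMEOUT_MSEC before and after the quoted hunk. */
enum {
	OLD_TIMEOUT_MSEC = 7200 * 1000,
	NEW_TIMEOUT_MSEC = 10 * 1000,
};

/* Hypothetical helper: convert whole milliseconds to whole seconds. */
static long msec_to_sec(long msec)
{
	assert(msec >= 0);
	return msec / 1000;
}

/* Hypothetical helper: convert whole milliseconds to whole hours. */
static long msec_to_hours(long msec)
{
	return msec_to_sec(msec) / 3600;
}
```

Passing OLD_TIMEOUT_MSEC to msec_to_hours and NEW_TIMEOUT_MSEC to msec_to_sec gives the two figures named in the reply, so neither the old comment ("one minute") nor the new one ("10 msecs") matches its value.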
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
[not found] ` <1394649252.23624.36.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
@ 2014-03-12 22:00 ` Carol Soto
2014-03-13 6:45 ` Eli Cohen
1 sibling, 0 replies; 11+ messages in thread
From: Carol Soto @ 2014-03-12 22:00 UTC (permalink / raw)
To: Ben Hutchings
Cc: eli-VPRAkNaXOzVWk0Htik3J/w, roland-DgEjT+Ai2ygdnm+yROfE0A,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
On 3/12/2014 1:34 PM, Ben Hutchings wrote:
> On Tue, 2014-03-11 at 22:42 -0500, clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org wrote:
> [...]
>> Index: b/include/linux/mlx5/driver.h
>> ===================================================================
>> --- a/include/linux/mlx5/driver.h
>> +++ b/include/linux/mlx5/driver.h
>> @@ -51,10 +51,10 @@ enum {
>> };
>>
>> enum {
>> - /* one minute for the sake of bringup. Generally, commands must always
>> + /* 10 msecs for the sake of bringup. Generally, commands must always
>> * complete and we may need to increase this timeout value
>> */
>> - MLX5_CMD_TIMEOUT_MSEC = 7200 * 1000,
>> + MLX5_CMD_TIMEOUT_MSEC = 10 * 1000,
> You seem to be changing the timeout from 2 hours (not one minute) to 10
> seconds (not milliseconds).
>
> Ben.
Yes, you are right; the comment should say 10 seconds instead of 10 msecs.
Carol
>
>> MLX5_CMD_WQ_MAX_NAME = 32,
>> };
>>
>>
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
[not found] ` <1394649252.23624.36.camel-nDn/Rdv9kqW9Jme8/bJn5UCKIB8iOfG2tUK59QYPAWc@public.gmane.org>
2014-03-12 22:00 ` Carol Soto
@ 2014-03-13 6:45 ` Eli Cohen
2014-03-13 15:12 ` Carol Soto
1 sibling, 1 reply; 11+ messages in thread
From: Eli Cohen @ 2014-03-13 6:45 UTC (permalink / raw)
To: Ben Hutchings
Cc: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
eli-VPRAkNaXOzVWk0Htik3J/w, roland-DgEjT+Ai2ygdnm+yROfE0A,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
On Wed, Mar 12, 2014 at 06:34:12PM +0000, Ben Hutchings wrote:
> >
> > enum {
> > - /* one minute for the sake of bringup. Generally, commands must always
> > + /* 10 msecs for the sake of bringup. Generally, commands must always
> > * complete and we may need to increase this timeout value
> > */
> > - MLX5_CMD_TIMEOUT_MSEC = 7200 * 1000,
> > + MLX5_CMD_TIMEOUT_MSEC = 10 * 1000,
>
> You seem to be changing the timeout from 2 hours (not one minute) to 10
> seconds (not milliseconds).
>
Thanks for noting this. Actually, the time should remain 2 hours and
the comment should be fixed. Also note that a long-running or missing
completion of a command is generally not indicative of a PCI error.
If that happens, we would want to have enough time to do diagnostics
before timing out.
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
2014-03-13 6:45 ` Eli Cohen
@ 2014-03-13 15:12 ` Carol Soto
2014-03-13 15:40 ` Eli Cohen
0 siblings, 1 reply; 11+ messages in thread
From: Carol Soto @ 2014-03-13 15:12 UTC (permalink / raw)
To: Eli Cohen, Ben Hutchings
Cc: eli-VPRAkNaXOzVWk0Htik3J/w, roland-DgEjT+Ai2ygdnm+yROfE0A,
sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
On 3/13/2014 1:45 AM, Eli Cohen wrote:
> On Wed, Mar 12, 2014 at 06:34:12PM +0000, Ben Hutchings wrote:
>>>
>>> enum {
>>> - /* one minute for the sake of bringup. Generally, commands must always
>>> + /* 10 msecs for the sake of bringup. Generally, commands must always
>>> * complete and we may need to increase this timeout value
>>> */
>>> - MLX5_CMD_TIMEOUT_MSEC = 7200 * 1000,
>>> + MLX5_CMD_TIMEOUT_MSEC = 10 * 1000,
>> You seem to be changing the timeout from 2 hours (not one minute) to 10
>> seconds (not milliseconds).
>>
> Thanks for noting this. Actually, the time should remain 2 hours and
> the comment should be fixed. Also note that a long-running or missing
> completion of a command is generally not indicative of a PCI error.
> If that happens, we would want to have enough time to do diagnostics
> before timing out.
>
Hi Eli
In the mlx4 code, I do not recall a command timeout this big. So the
reason it is 2 hrs in mlx5 is just for debugging purposes? If a
command hangs for any reason, then the user cannot remove this module
for the next 2 hrs?
Carol
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
2014-03-13 15:12 ` Carol Soto
@ 2014-03-13 15:40 ` Eli Cohen
2014-03-13 15:51 ` Carol Soto
0 siblings, 1 reply; 11+ messages in thread
From: Eli Cohen @ 2014-03-13 15:40 UTC (permalink / raw)
To: Carol Soto
Cc: Ben Hutchings, eli, roland, sean.hefty, hal.rosenstock,
linux-rdma, netdev, brking
On Thu, Mar 13, 2014 at 10:12:19AM -0500, Carol Soto wrote:
>
> In the mlx4 code, I do not recall a command timeout this big. So the
> reason it is 2 hrs in mlx5 is just for debugging purposes? If a
> command hangs for any reason, then the user cannot remove this module
> for the next 2 hrs?
>
Hi Carol,
Well, I haven't seen any such case with the latest firmware releases.
Anyway, 10 msec is really too short a timeout value, since there are
commands that can take more than that (e.g. memory registration of
regions larger than 512 MB, though this will be changed soon). I
wonder what the original motivation was, and whether you have been
able to simulate PCI errors and see this in action.
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
2014-03-13 15:40 ` Eli Cohen
@ 2014-03-13 15:51 ` Carol Soto
2014-03-13 16:03 ` Eli Cohen
0 siblings, 1 reply; 11+ messages in thread
From: Carol Soto @ 2014-03-13 15:51 UTC (permalink / raw)
To: Eli Cohen
Cc: Ben Hutchings, eli-VPRAkNaXOzVWk0Htik3J/w,
roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
On 3/13/2014 10:40 AM, Eli Cohen wrote:
> On Thu, Mar 13, 2014 at 10:12:19AM -0500, Carol Soto wrote:
>> In the mlx4 code, I do not recall a command timeout this big. So the
>> reason it is 2 hrs in mlx5 is just for debugging purposes? If a
>> command hangs for any reason, then the user cannot remove this module
>> for the next 2 hrs?
>>
> Hi Carol,
> Well, I haven't seen any such case with the latest firmware releases.
> Anyway, 10 msec is really too short a timeout value, since there are
> commands that can take more than that (e.g. memory registration of
> regions larger than 512 MB, though this will be changed soon). I
> wonder what the original motivation was, and whether you have been
> able to simulate PCI errors and see this in action.
Hi Eli,
The motivation for reducing that timeout is that if a process is in
the middle of a HW command when the PCI error occurs, I did not want
to wait 2 hrs, since the command will never complete because the card
is dead. You are right, though: I forgot the case of big memory
registrations, where commands can take longer than that. Do you have
an idea of the longest time that a command can take in mlx5?
Carol
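One way to picture the trade-off raised here is a wait loop that keeps the long timeout but gives up early once the PCI channel is reported offline. The following is a minimal user-space sketch under that assumption, not the mlx5 implementation: struct fake_dev, wait_for_cmd and the simulated clock are all hypothetical, and the offline_at_ms field merely stands in for what pci_channel_offline() would report in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for device state; not the mlx5 structures. */
struct fake_dev {
	long completes_at_ms; /* simulated completion time, -1 for never */
	long offline_at_ms;   /* simulated channel-offline time, -1 for never */
};

enum wait_result { WAIT_DONE, WAIT_OFFLINE, WAIT_TIMEOUT };

/*
 * Poll a simulated clock every poll_ms until the command completes,
 * the channel goes offline, or timeout_ms elapses. The elapsed
 * simulated time is stored in *waited_ms.
 */
static enum wait_result wait_for_cmd(const struct fake_dev *dev,
				     long timeout_ms, long poll_ms,
				     long *waited_ms)
{
	long t;

	assert(dev != NULL && waited_ms != NULL && poll_ms > 0);

	for (t = 0; t <= timeout_ms; t += poll_ms) {
		*waited_ms = t;
		/* A completed command wins over a later channel failure. */
		if (dev->completes_at_ms >= 0 && t >= dev->completes_at_ms)
			return WAIT_DONE;
		/* Dead card: stop waiting instead of sitting out the timeout. */
		if (dev->offline_at_ms >= 0 && t >= dev->offline_at_ms)
			return WAIT_OFFLINE;
	}
	*waited_ms = timeout_ms;
	return WAIT_TIMEOUT;
}
```

In this model a command on a dead card returns as soon as the offline state is observed, while a slow but healthy command (such as a large memory registration) still gets the full timeout. Whether the real command path can be structured this way is not established in this thread.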
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
2014-03-13 15:51 ` Carol Soto
@ 2014-03-13 16:03 ` Eli Cohen
2014-03-13 16:26 ` Carol Soto
0 siblings, 1 reply; 11+ messages in thread
From: Eli Cohen @ 2014-03-13 16:03 UTC (permalink / raw)
To: Carol Soto
Cc: Ben Hutchings, eli, roland, sean.hefty, hal.rosenstock,
linux-rdma, netdev, brking
On Thu, Mar 13, 2014 at 10:51:46AM -0500, Carol Soto wrote:
>
> The motivation for reducing that timeout is that if a process is in
> the middle of a HW command when the PCI error occurs, I did not want
> to wait 2 hrs, since the command will never complete because the card
> is dead. You are right, though: I forgot the case of big memory
> registrations, where commands can take longer than that. Do you have
> an idea of the longest time that a command can take in mlx5?
>
There is no guaranteed time for command completions. With the current
driver/firmware, registration of 4.5 GB can take around 2 minutes; for
larger regions it can take even more time. As I mentioned earlier, I
will soon send a patch that reduces registration times for large
regions.
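To get a rough feel for what the figure above implies for a timeout, here is a back-of-the-envelope C sketch. It assumes registration time scales linearly with region size, which is an assumption made only for illustration and is not stated anywhere in this thread (completion times are explicitly not guaranteed). The names est_registration_ms and fits_in_timeout are hypothetical.

```c
#include <assert.h>

/* Data point quoted above: about 4.5 GB registered in about 2 minutes. */
#define REF_REGION_MB   4608LL   /* 4.5 GB expressed in MB (4.5 * 1024) */
#define REF_DURATION_MS 120000LL /* 2 minutes in milliseconds */

/* Hypothetical estimate, assuming time scales linearly with size. */
static long long est_registration_ms(long long region_mb)
{
	assert(region_mb >= 0);
	return region_mb * REF_DURATION_MS / REF_REGION_MB;
}

/* Returns 1 if the estimated time fits within timeout_ms, 0 otherwise. */
static int fits_in_timeout(long long region_mb, long long timeout_ms)
{
	return est_registration_ms(region_mb) <= timeout_ms;
}
```

Under this linear assumption, a 512 MB region would be estimated at roughly 13 seconds, which would not fit in a 10 * 1000 msec timeout, while the 4.5 GB case fits comfortably in 7200 * 1000 msec. Real firmware behavior may differ considerably.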
* Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler
2014-03-13 16:03 ` Eli Cohen
@ 2014-03-13 16:26 ` Carol Soto
0 siblings, 0 replies; 11+ messages in thread
From: Carol Soto @ 2014-03-13 16:26 UTC (permalink / raw)
To: Eli Cohen
Cc: Ben Hutchings, eli-VPRAkNaXOzVWk0Htik3J/w,
roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w,
hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
On 3/13/2014 11:03 AM, Eli Cohen wrote:
> On Thu, Mar 13, 2014 at 10:51:46AM -0500, Carol Soto wrote:
>> The motivation for reducing that timeout is that if a process is in
>> the middle of a HW command when the PCI error occurs, I did not want
>> to wait 2 hrs, since the command will never complete because the card
>> is dead. You are right, though: I forgot the case of big memory
>> registrations, where commands can take longer than that. Do you have
>> an idea of the longest time that a command can take in mlx5?
>>
> There is no guaranteed time for command completions. With the current
> driver/firmware, registration of 4.5 GB can take around 2 minutes; for
> larger regions it can take even more time. As I mentioned earlier, I
> will soon send a patch that reduces registration times for large
> regions.
Hi Eli,
I can wait for your patch to be available and try this again. For now,
I can remove my changes for the timeout and resubmit the patch.
Thanks for the feedback about the commands.
Carol