* [Patch 0/3] Hangs with IPoIB when doing PCI error injection
@ 2014-03-27 14:28 clsoto
2014-03-27 14:28 ` [Patch 1/3] IB/mlx4: send a IB_EVENT_DEVICE_FATAL to users during " clsoto
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: clsoto @ 2014-03-27 14:28 UTC (permalink / raw)
To: clsoto, roland, sean.hefty, hal.rosenstock, linux-rdma, netdev; +Cc: brking
This series resolves some hangs we are seeing when doing PCI error injection
on Mellanox InfiniBand cards. With these patches the mlx4 driver sends an
IB_EVENT_DEVICE_FATAL to its users, and the event is added to the existing
event handlers to avoid these hangs.
If IPoIB is in connected mode, an event handler is also added to cm so that,
once it sees the fatal event, it does not try to send any more packets, because
it will not receive any more completions or interrupts.
infiniband/core/cm.c | 43 +++++++++++++++++++++++++++++++++++++
infiniband/core/multicast.c | 1
infiniband/core/sa_query.c | 1
infiniband/ulp/ipoib/ipoib_verbs.c | 1
net/ethernet/mellanox/mlx4/main.c | 6 +++++
Carol Soto
--
^ permalink raw reply [flat|nested] 9+ messages in thread
* [Patch 1/3] IB/mlx4: send a IB_EVENT_DEVICE_FATAL to users during PCI error injection
2014-03-27 14:28 [Patch 0/3] Hangs with IPoIB when doing PCI error injection clsoto
@ 2014-03-27 14:28 ` clsoto
2014-03-27 14:28 ` [Patch 2/3] IB: hang in mcast_remove_one " clsoto
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: clsoto @ 2014-03-27 14:28 UTC (permalink / raw)
To: clsoto, roland, sean.hefty, hal.rosenstock, linux-rdma, netdev; +Cc: brking
[-- Attachment #1: mlx4_send_device_fatal_event.patch --]
[-- Type: text/plain, Size: 922 bytes --]
If the card hits a PCI error, then notify users that the device is not in
good shape. Send IB_EVENT_DEVICE_FATAL to IB users.
Signed-off-by: Carol Soto <clsoto@linux.vnet.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 6 ++++++
 1 file changed, 6 insertions(+)
Index: b/drivers/net/ethernet/mellanox/mlx4/main.c
===================================================================
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2673,6 +2673,12 @@ MODULE_DEVICE_TABLE(pci, mlx4_pci_table)
 static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
 					      pci_channel_state_t state)
 {
+	struct mlx4_dev *dev = pci_get_drvdata(pdev);
+	int i;
+
+	for (i = 1; i <= dev->caps.num_ports; ++i)
+		mlx4_dispatch_event(dev, MLX4_DEV_EVENT_CATASTROPHIC_ERROR, i);
+
 	mlx4_remove_one(pdev);
 
 	return state == pci_channel_io_perm_failure ?
--
^ permalink raw reply [flat|nested] 9+ messages in thread
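For readers following the thread, the ordering in this patch (notify every port's listeners first, then tear the device down) can be sketched in user space. This is only an illustrative model, not the mlx4 code: every `model_*` name and `MODEL_EVENT_CATASTROPHIC_ERROR` is a hypothetical stand-in, and the single-listener design is an assumption made to keep the sketch short.

```c
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum model_dev_event { MODEL_EVENT_CATASTROPHIC_ERROR };

typedef void (*model_listener)(enum model_dev_event ev, int port, void *ctx);

struct model_dev {
	int num_ports;           /* ports are numbered 1..num_ports */
	model_listener listener; /* assumed: a single registered listener */
	void *listener_ctx;
};

/* Deliver one event for one port to the listener, if any. */
static void model_dispatch_event(struct model_dev *dev,
				 enum model_dev_event ev, int port)
{
	if (dev->listener)
		dev->listener(ev, port, dev->listener_ctx);
}

/*
 * Models the error-detected path: notify once per port, using 1-based
 * port numbers as the patch does, before any teardown would start.
 * Returns the number of notifications sent.
 */
static int model_err_detected(struct model_dev *dev)
{
	int notified = 0;

	for (int port = 1; port <= dev->num_ports; ++port) {
		model_dispatch_event(dev, MODEL_EVENT_CATASTROPHIC_ERROR, port);
		notified++;
	}
	/* Device teardown would follow here in the real driver. */
	return notified;
}

/* Example listener: records each notified port as a bit in *ctx. */
static void model_record_port(enum model_dev_event ev, int port, void *ctx)
{
	(void)ev;
	*(unsigned int *)ctx |= 1u << port;
}
```

With a two-port device and `model_record_port` as the listener, bits 1 and 2 end up set, which mirrors the per-port loop in the hunk above.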
* [Patch 2/3] IB: hang in mcast_remove_one during PCI error injection
2014-03-27 14:28 [Patch 0/3] Hangs with IPoIB when doing PCI error injection clsoto
2014-03-27 14:28 ` [Patch 1/3] IB/mlx4: send a IB_EVENT_DEVICE_FATAL to users during " clsoto
@ 2014-03-27 14:28 ` clsoto
2014-03-27 14:28 ` [Patch 3/3] IB/ib_cm: hang in cm_destroy_id " clsoto
[not found] ` <20140327142813.535289178-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
3 siblings, 0 replies; 9+ messages in thread
From: clsoto @ 2014-03-27 14:28 UTC (permalink / raw)
To: clsoto, roland, sean.hefty, hal.rosenstock, linux-rdma, netdev; +Cc: brking
[-- Attachment #1: mcast_remove_one_hang.patch --]
[-- Type: text/plain, Size: 3482 bytes --]
This patch is to avoid this hang:
kernel: Call Trace:
kernel: [C0000000FF9E34D0] [C0000000FF9E3560] 0xc0000000ff9e3560 (unreliable)
kernel: [C0000000FF9E36A0] [C00000000001070C] .__switch_to+0x124/0x148
kernel: [C0000000FF9E3730] [C0000000003E6D30] .schedule+0xc10/0xdc4
kernel: [C0000000FF9E3840] [C0000000003E7024] .wait_for_completion+0xcc/0x150
kernel: [C0000000FF9E3900] [D000000000882288] .mcast_remove_one+0x8c/0xe8 [ib_sa]
kernel: [C0000000FF9E39A0] [D0000000004E404C] .ib_unregister_device+0x64/0x15c [ib_core]
kernel: [C0000000FF9E3A40] [D000000000542A4C] .mlx4_ib_remove+0x50/0x148 [mlx4_ib]
kernel: [C0000000FF9E3AD0] [D0000000004A6EBC] .mlx4_remove_device+0xa0/0xf0 [mlx4_core]
kernel: [C0000000FF9E3B60] [D0000000004A73F0] .mlx4_unregister_device+0x44/0xa8 [mlx4_core]
kernel: [C0000000FF9E3BF0] [D0000000004AA0A8] .mlx4_remove_one+0x40/0x1bc [mlx4_core]
kernel: [C0000000FF9E3C80] [D0000000004AA240] .mlx4_pci_err_detected+0x1c/0x48 [mlx4_core]
kernel: [C0000000FF9E3D10] [C000000000053E84] .eeh_report_error+0x70/0xb4
kernel: [C0000000FF9E3DA0] [C0000000001DCB18] .pci_walk_bus+0xf8/0x168
kernel: [C0000000FF9E3E50] [C000000000054254] .handle_eeh_events+0x1a8/0x3d0
kernel: [C0000000FF9E3F00] [C000000000054580] .eeh_event_handler+0xc0/0x160
kernel: [C0000000FF9E3F90] [C000000000027A3C] .kernel_thread+0x4c/0x68
Add the IB_EVENT_DEVICE_FATAL event to the ib_sa, multicast and ipoib event
handlers so that multicast groups in the joined state are moved out of that
state, which decrements the counter that causes this hang.
Signed-off-by: Carol Soto <clsoto@linux.vnet.ibm.com>
---
 drivers/infiniband/core/multicast.c | 1 +
 drivers/infiniband/core/sa_query.c | 1 +
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 1 +
 3 files changed, 3 insertions(+)
Index: b/drivers/infiniband/core/multicast.c
===================================================================
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -785,6 +785,7 @@ static void mcast_event_handler(struct i
 	case IB_EVENT_PORT_ERR:
 	case IB_EVENT_LID_CHANGE:
 	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_DEVICE_FATAL:
 	case IB_EVENT_CLIENT_REREGISTER:
 		mcast_groups_event(&dev->port[index], MCAST_GROUP_ERROR);
 		break;
Index: b/drivers/infiniband/core/sa_query.c
===================================================================
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -443,6 +443,7 @@ static void ib_sa_event(struct ib_event_
 	    event->event == IB_EVENT_LID_CHANGE ||
 	    event->event == IB_EVENT_PKEY_CHANGE ||
 	    event->event == IB_EVENT_SM_CHANGE ||
+	    event->event == IB_EVENT_DEVICE_FATAL ||
 	    event->event == IB_EVENT_CLIENT_REREGISTER) {
 		unsigned long flags;
 		struct ib_sa_device *sa_dev =
Index: b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
===================================================================
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -289,6 +289,7 @@ void ipoib_event(struct ib_event_handler
 		queue_work(ipoib_workqueue, &priv->flush_light);
 	} else if (record->event == IB_EVENT_PORT_ERR ||
 		   record->event == IB_EVENT_PORT_ACTIVE ||
+		   record->event == IB_EVENT_DEVICE_FATAL ||
 		   record->event == IB_EVENT_LID_CHANGE) {
 		queue_work(ipoib_workqueue, &priv->flush_normal);
 	} else if (record->event == IB_EVENT_PKEY_CHANGE) {
--
^ permalink raw reply [flat|nested] 9+ messages in thread
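The hang described in this patch follows a common pattern: the remove path waits until a reference count drops to zero, and the references are held by groups that never leave the joined state once the hardware stops responding. The following is a minimal user-space sketch of that pattern under stated assumptions; it does not reproduce the ib_sa data structures, all `model_*` names are hypothetical, and blocking is modeled as a boolean instead of a real `wait_for_completion()`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
enum model_group_state { MODEL_GROUP_JOINED, MODEL_GROUP_ERROR };

struct model_group {
	enum model_group_state state;
};

struct model_port {
	int refcount;               /* assumed: one reference per joined group */
	struct model_group *groups;
	size_t ngroups;
};

/* Put every group into the joined state, each holding one reference. */
static void model_port_init(struct model_port *port,
			    struct model_group *groups, size_t ngroups)
{
	port->groups = groups;
	port->ngroups = ngroups;
	port->refcount = 0;
	for (size_t i = 0; i < ngroups; i++) {
		groups[i].state = MODEL_GROUP_JOINED;
		port->refcount++;
	}
}

/*
 * Models the effect of handling the fatal event: each joined group is
 * moved to the error state and drops the reference it was holding.
 */
static void model_fatal_event(struct model_port *port)
{
	for (size_t i = 0; i < port->ngroups; i++) {
		if (port->groups[i].state == MODEL_GROUP_JOINED) {
			port->groups[i].state = MODEL_GROUP_ERROR;
			port->refcount--;
		}
	}
}

/*
 * Models the remove path: the real code would sleep until the count
 * reaches zero; here we only report whether it would still be waiting.
 */
static bool model_remove_would_block(const struct model_port *port)
{
	return port->refcount != 0;
}
```

In this model, removal blocks for as long as any group stays joined and proceeds once `model_fatal_event()` has run, which is the behaviour the patch description is aiming for.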
* [Patch 3/3] IB/ib_cm: hang in cm_destroy_id during PCI error injection
2014-03-27 14:28 [Patch 0/3] Hangs with IPoIB when doing PCI error injection clsoto
2014-03-27 14:28 ` [Patch 1/3] IB/mlx4: send a IB_EVENT_DEVICE_FATAL to users during " clsoto
2014-03-27 14:28 ` [Patch 2/3] IB: hang in mcast_remove_one " clsoto
@ 2014-03-27 14:28 ` clsoto
[not found] ` <20140327142939.460692817-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
[not found] ` <20140327142813.535289178-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
3 siblings, 1 reply; 9+ messages in thread
From: clsoto @ 2014-03-27 14:28 UTC (permalink / raw)
To: clsoto, roland, sean.hefty, hal.rosenstock, linux-rdma, netdev; +Cc: brking
[-- Attachment #1: cm_destroy_id_hangs.patch --]
[-- Type: text/plain, Size: 5562 bytes --]
This patch is to avoid this hang:
kernel: Call Trace:
kernel: [c0000003ea9faa70] [c0000000000144f0] .__switch_to+0x1c0/0x390
kernel: [c0000003ea9fab20] [c0000000006d5528] .__schedule+0x328/0x920
kernel: [c0000003ea9fada0] [c0000000006d2da4] .schedule_timeout+0x244/0x2e0
kernel: [c0000003ea9faea0] [c0000000006d61dc] .wait_for_common+0x18c/0x210
kernel: [c0000003ea9faf70] [d00000000cee45a0] .cm_destroy_id+0x190/0x590 [ib_cm]
kernel: [c0000003ea9fb040] [d00000000d6ac3d8] .ipoib_cm_free_rx_reap_list+0xc8/0x1b0 [ib_ipoib]
kernel: [c0000003ea9fb100] [d00000000d6af870] .ipoib_cm_dev_stop+0x210/0x340 [ib_ipoib]
kernel: [c0000003ea9fb1c0] [d00000000d6a67b0] .ipoib_ib_dev_stop+0x100/0x500 [ib_ipoib]
kernel: [c0000003ea9fb320] [d00000000d6a0ce4] .ipoib_stop+0x94/0x1a0 [ib_ipoib]
kernel: [c0000003ea9fb3b0] [c00000000059db58] .__dev_close_many+0xc8/0x140
kernel: [c0000003ea9fb440] [c00000000059dc24] .__dev_close+0x54/0x90
kernel: [c0000003ea9fb4e0] [c0000000005a6940] .__dev_change_flags+0x170/0x1f0
kernel: [c0000003ea9fb580] [c0000000005a6aa4] .dev_change_flags+0x24/0x90
kernel: [c0000003ea9fb610] [d00000000d6a1604] .ipoib_remove_one+0xc4/0x160 [ib_ipoib]
kernel: [c0000003ea9fb6b0] [d00000000c285174] .ib_unregister_device+0x74/0x150 [ib_core]
kernel: [c0000003ea9fb750] [d00000000c857af4] .mlx4_ib_remove+0x44/0x220 [mlx4_ib]
kernel: [c0000003ea9fb7e0] [d000000003b6d07c] .mlx4_remove_device+0xdc/0x120 [mlx4_core]
kernel: [c0000003ea9fb870] [d000000003b6d6ec] .mlx4_unregister_device+0x7c/0xf0 [mlx4_core]
kernel: [c0000003ea9fb900] [d000000003b6ec20] .mlx4_remove_one+0x60/0x3e0 [mlx4_core]
kernel: [c0000003ea9fb9a0] [d000000003b6efb8] .mlx4_pci_err_detected+0x18/0x40 [mlx4_core]
kernel: [c0000003ea9fba20] [c000000000035630] .eeh_report_error+0xa0/0x120
kernel: [c0000003ea9fbab0] [c00000000003431c] .eeh_pe_dev_traverse+0x9c/0x190
kernel: [c0000003ea9fbb60] [c000000000035c4c] .eeh_handle_normal_event+0x11c/0x3c0
kernel: [c0000003ea9fbbf0] [c000000000035f20] .eeh_handle_event+0x30/0x2b0
kernel: [c0000003ea9fbc90] [c0000000000362f4] .eeh_event_handler+0x154/0x170
kernel: [c0000003ea9fbd30] [c0000000000c0018] .kthread+0xe8/0xf0
kernel: [c0000003ea9fbe30] [c00000000000a168] .ret_from_kernel_thread+0x5c/0x74
The approach in this patch is to send the IB_EVENT_DEVICE_FATAL to cm. After
that, cm will not try to send a message, so it will not increase the counter
that is causing this hang. If cm tries to send a packet after receiving the
fatal event, the adapter may not create a completion event, which causes
this hang.
Signed-off-by: Carol Soto <clsoto@linux.vnet.ibm.com>
---
 drivers/infiniband/core/cm.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
Index: b/drivers/infiniband/core/cm.c
===================================================================
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -161,6 +161,7 @@ struct cm_port {
 	struct ib_mad_agent *mad_agent;
 	struct kobject port_obj;
 	u8 port_num;
+	u8 device_fatal;
 	struct cm_counter_group counter_group[CM_COUNTER_GROUPS];
 };
 
@@ -168,6 +169,7 @@ struct cm_device {
 	struct list_head list;
 	struct ib_device *ib_device;
 	struct device *device;
+	struct ib_event_handler event_handler;
 	u8 ack_delay;
 	struct cm_port *port[0];
 };
@@ -258,6 +260,10 @@ static int cm_alloc_msg(struct cm_id_pri
 	struct ib_mad_agent *mad_agent;
 	struct ib_mad_send_buf *m;
 	struct ib_ah *ah;
+	struct cm_port *port = cm_id_priv->av.port;
+
+	if (port->device_fatal)
+		return -EIO;
 
 	mad_agent = cm_id_priv->av.port->mad_agent;
 	ah = ib_create_ah(mad_agent->qp->pd, &cm_id_priv->av.ah_attr);
@@ -290,6 +296,9 @@ static int cm_alloc_response_msg(struct
 	struct ib_mad_send_buf *m;
 	struct ib_ah *ah;
 
+	if (port->device_fatal)
+		return -EIO;
+
 	ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc,
 				  mad_recv_wc->recv_buf.grh, port->port_num);
 	if (IS_ERR(ah))
@@ -3764,6 +3773,33 @@ static void cm_remove_port_fs(struct cm_
 	kobject_put(&port->port_obj);
 }
 
+static void ib_cm_event_handler(struct ib_event_handler *handler,
+				struct ib_event *event)
+{
+	struct cm_device *cm_dev;
+	struct cm_port *port = NULL;
+
+	cm_dev = container_of(handler, struct cm_device, event_handler);
+	switch (event->event) {
+	case IB_EVENT_PORT_ACTIVE:
+		port = cm_dev->port[event->element.port_num - 1];
+		if (port == NULL)
+			return;
+		if (port->port_num == event->element.port_num)
+			port->device_fatal = 0;
+		break;
+	case IB_EVENT_DEVICE_FATAL:
+		port = cm_dev->port[event->element.port_num - 1];
+		if (port == NULL)
+			return;
+		if (port->port_num == event->element.port_num)
+			port->device_fatal = 1;
+		break;
+	default:
+		break;
+	}
+}
+
 static void cm_add_one(struct ib_device *ib_device)
 {
 	struct cm_device *cm_dev;
@@ -3828,6 +3864,12 @@ static void cm_add_one(struct ib_device
 	}
 	ib_set_client_data(ib_device, &cm_client, cm_dev);
 
+	INIT_IB_EVENT_HANDLER(&cm_dev->event_handler,
+			      ib_device, ib_cm_event_handler);
+	ret = ib_register_event_handler(&cm_dev->event_handler);
+	if (ret)
+		goto error3;
+
 	write_lock_irqsave(&cm.device_lock, flags);
 	list_add_tail(&cm_dev->list, &cm.device_list);
 	write_unlock_irqrestore(&cm.device_lock, flags);
@@ -3864,6 +3906,7 @@ static void cm_remove_one(struct ib_devi
 	if (!cm_dev)
 		return;
 
+	ib_unregister_event_handler(&cm_dev->event_handler);
 	write_lock_irqsave(&cm.device_lock, flags);
 	list_del(&cm_dev->list);
 	write_unlock_irqrestore(&cm.device_lock, flags);
--
^ permalink raw reply [flat|nested] 9+ messages in thread
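The gating idea in this patch can be modeled outside the kernel: once a port is marked fatal, message allocation fails early with -EIO, so no new send is counted as outstanding and nothing waits for a completion that will never arrive. The sketch below is a user-space illustration only. All `model_*` names are hypothetical, the fixed two-port array is an assumption, and the range check on `port_num` is an addition of this sketch that the patch itself does not contain.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_MAX_PORTS 2

enum model_event { MODEL_EVENT_PORT_ACTIVE, MODEL_EVENT_DEVICE_FATAL };

struct model_cm_port {
	unsigned char port_num;     /* 1-based, as in the patch */
	unsigned char device_fatal; /* set once the fatal event is seen */
	int outstanding;            /* sends that still expect a completion */
};

struct model_cm_device {
	struct model_cm_port port[MODEL_MAX_PORTS];
};

static void model_cm_init(struct model_cm_device *dev)
{
	for (size_t i = 0; i < MODEL_MAX_PORTS; i++) {
		dev->port[i].port_num = (unsigned char)(i + 1);
		dev->port[i].device_fatal = 0;
		dev->port[i].outstanding = 0;
	}
}

/* Refuse to start a new send once the port has been marked fatal. */
static int model_alloc_msg(struct model_cm_port *port)
{
	if (port->device_fatal)
		return -EIO;
	port->outstanding++;
	return 0;
}

/* Per-port handler: set the flag on a fatal event, clear it on port active. */
static void model_event_handler(struct model_cm_device *dev,
				enum model_event ev, unsigned int port_num)
{
	/* Range check added in this sketch; the patch indexes directly. */
	if (port_num < 1 || port_num > MODEL_MAX_PORTS)
		return;

	switch (ev) {
	case MODEL_EVENT_DEVICE_FATAL:
		dev->port[port_num - 1].device_fatal = 1;
		break;
	case MODEL_EVENT_PORT_ACTIVE:
		dev->port[port_num - 1].device_fatal = 0;
		break;
	}
}
```

In this model a fatal event on port 1 leaves port 2 able to send, which is exactly the per-port behaviour that depends on the event carrying a meaningful port number.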
[parent not found: <20140327142939.460692817-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* RE: [Patch 3/3] IB/ib_cm: hang in cm_destroy_id during PCI error injection
[not found] ` <20140327142939.460692817-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2014-04-23 18:15 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373992F353B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Hefty, Sean @ 2014-04-23 18:15 UTC (permalink / raw)
To: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org, roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
> Index: b/drivers/infiniband/core/cm.c
> ===================================================================
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -161,6 +161,7 @@ struct cm_port {
>  	struct ib_mad_agent *mad_agent;
>  	struct kobject port_obj;
>  	u8 port_num;
> +	u8 device_fatal;
>  	struct cm_counter_group counter_group[CM_COUNTER_GROUPS];
>  };
>
> @@ -168,6 +169,7 @@ struct cm_device {
>  	struct list_head list;
>  	struct ib_device *ib_device;
>  	struct device *device;
> +	struct ib_event_handler event_handler;
>  	u8 ack_delay;
>  	struct cm_port *port[0];
>  };
> @@ -258,6 +260,10 @@ static int cm_alloc_msg(struct cm_id_pri
>  	struct ib_mad_agent *mad_agent;
>  	struct ib_mad_send_buf *m;
>  	struct ib_ah *ah;
> +	struct cm_port *port = cm_id_priv->av.port;
> +
> +	if (port->device_fatal)
> +		return -EIO;
>
>  	mad_agent = cm_id_priv->av.port->mad_agent;
>  	ah = ib_create_ah(mad_agent->qp->pd, &cm_id_priv->av.ah_attr);
> @@ -290,6 +296,9 @@ static int cm_alloc_response_msg(struct
>  	struct ib_mad_send_buf *m;
>  	struct ib_ah *ah;
>
> +	if (port->device_fatal)
> +		return -EIO;
> +
>  	ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc,
>  				  mad_recv_wc->recv_buf.grh, port->port_num);
>  	if (IS_ERR(ah))
> @@ -3764,6 +3773,33 @@ static void cm_remove_port_fs(struct cm_
>  	kobject_put(&port->port_obj);
>  }
>
> +static void ib_cm_event_handler(struct ib_event_handler *handler,
> +				struct ib_event *event)
> +{
> +	struct cm_device *cm_dev;
> +	struct cm_port *port = NULL;
> +
> +	cm_dev = container_of(handler, struct cm_device, event_handler);
> +	switch (event->event) {
> +	case IB_EVENT_PORT_ACTIVE:
> +		port = cm_dev->port[event->element.port_num - 1];
> +		if (port == NULL)
> +			return;
> +		if (port->port_num == event->element.port_num)
> +			port->device_fatal = 0;
> +		break;
> +	case IB_EVENT_DEVICE_FATAL:
> +		port = cm_dev->port[event->element.port_num - 1];
> +		if (port == NULL)
> +			return;
> +		if (port->port_num == event->element.port_num)
> +			port->device_fatal = 1;
> +		break;
This is a device level event, not a port event. The port_num value may not be valid.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <1828884A29C6694DAF28B7E6B8A82373992F353B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>]
* Re: [Patch 3/3] IB/ib_cm: hang in cm_destroy_id during PCI error injection
[not found] ` <1828884A29C6694DAF28B7E6B8A82373992F353B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2014-04-23 18:58 ` Carol Soto
[not found] ` <53580D42.1060201-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Carol Soto @ 2014-04-23 18:58 UTC (permalink / raw)
To: Hefty, Sean, roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
On 4/23/2014 1:15 PM, Hefty, Sean wrote:
>> Index: b/drivers/infiniband/core/cm.c
>> ===================================================================
>> --- a/drivers/infiniband/core/cm.c
>> +++ b/drivers/infiniband/core/cm.c
>> @@ -161,6 +161,7 @@ struct cm_port {
>>  	struct ib_mad_agent *mad_agent;
>>  	struct kobject port_obj;
>>  	u8 port_num;
>> +	u8 device_fatal;
>>  	struct cm_counter_group counter_group[CM_COUNTER_GROUPS];
>>  };
>>
>> @@ -168,6 +169,7 @@ struct cm_device {
>>  	struct list_head list;
>>  	struct ib_device *ib_device;
>>  	struct device *device;
>> +	struct ib_event_handler event_handler;
>>  	u8 ack_delay;
>>  	struct cm_port *port[0];
>>  };
>> @@ -258,6 +260,10 @@ static int cm_alloc_msg(struct cm_id_pri
>>  	struct ib_mad_agent *mad_agent;
>>  	struct ib_mad_send_buf *m;
>>  	struct ib_ah *ah;
>> +	struct cm_port *port = cm_id_priv->av.port;
>> +
>> +	if (port->device_fatal)
>> +		return -EIO;
>>
>>  	mad_agent = cm_id_priv->av.port->mad_agent;
>>  	ah = ib_create_ah(mad_agent->qp->pd, &cm_id_priv->av.ah_attr);
>> @@ -290,6 +296,9 @@ static int cm_alloc_response_msg(struct
>>  	struct ib_mad_send_buf *m;
>>  	struct ib_ah *ah;
>>
>> +	if (port->device_fatal)
>> +		return -EIO;
>> +
>>  	ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc,
>>  				  mad_recv_wc->recv_buf.grh, port->port_num);
>>  	if (IS_ERR(ah))
>> @@ -3764,6 +3773,33 @@ static void cm_remove_port_fs(struct cm_
>>  	kobject_put(&port->port_obj);
>>  }
>>
>> +static void ib_cm_event_handler(struct ib_event_handler *handler,
>> +				struct ib_event *event)
>> +{
>> +	struct cm_device *cm_dev;
>> +	struct cm_port *port = NULL;
>> +
>> +	cm_dev = container_of(handler, struct cm_device, event_handler);
>> +	switch (event->event) {
>> +	case IB_EVENT_PORT_ACTIVE:
>> +		port = cm_dev->port[event->element.port_num - 1];
>> +		if (port == NULL)
>> +			return;
>> +		if (port->port_num == event->element.port_num)
>> +			port->device_fatal = 0;
>> +		break;
>> +	case IB_EVENT_DEVICE_FATAL:
>> +		port = cm_dev->port[event->element.port_num - 1];
>> +		if (port == NULL)
>> +			return;
>> +		if (port->port_num == event->element.port_num)
>> +			port->device_fatal = 1;
>> +		break;
> This is a device level event, not a port event. The port_num value may not be valid.
In the first patch of the series (IB/mlx4: send a IB_EVENT_DEVICE_FATAL to
users during PCI error injection), when mlx4 sends the IB_EVENT_DEVICE_FATAL
event I pass the port number in that field; that is how I was able to make
it work. Any other suggestions are welcome.
Thanks
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <53580D42.1060201-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* RE: [Patch 3/3] IB/ib_cm: hang in cm_destroy_id during PCI error injection
[not found] ` <53580D42.1060201-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2014-04-23 21:34 ` Hefty, Sean
0 siblings, 0 replies; 9+ messages in thread
From: Hefty, Sean @ 2014-04-23 21:34 UTC (permalink / raw)
To: Carol Soto, roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
> > This is a device level event, not a port event. The port_num value may
> not be valid.
> The first patch of the series(IB/mlx4: send a IB_EVENT_DEVICE_FATAL to
> users during PCI error injection), when mlx4 send the IB_EVENT_DEVICE_FATAL
> event I am passing the port number to that field so in that way I was
> able to make it work. Any other suggestions are welcome.
This would need to be done for all devices. Converting this to a per port
event requires multiple events per device, which would impact the user space
ABI. I think the patch itself needs to change to remove the per port
association and operate at the device level, since it is really a device
event.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
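One way to read the suggestion in this reply is to keep a single fatal flag on the device and ignore the event's port number entirely. The sketch below is a user-space illustration of that reading under stated assumptions; it is not code from the thread or from any later kernel version. All `sketch_*` names are hypothetical, and the flag is never cleared here because the thread does not say when, if ever, it should be.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define SKETCH_NUM_PORTS 2

struct sketch_cm_device {
	bool device_fatal;           /* one flag for the whole device */
	int sends[SKETCH_NUM_PORTS]; /* per-port count of accepted sends */
};

/* Device-level handler: the event carries no port number to trust. */
static void sketch_device_fatal(struct sketch_cm_device *dev)
{
	dev->device_fatal = true;
}

/* Every port consults the same device flag before starting a send. */
static int sketch_alloc_msg(struct sketch_cm_device *dev, int port_num)
{
	if (port_num < 1 || port_num > SKETCH_NUM_PORTS)
		return -EINVAL;
	if (dev->device_fatal)
		return -EIO;
	dev->sends[port_num - 1]++;
	return 0;
}
```

With this layout a single event stops sends on every port, so the sender of the event does not need to emit one event per port or fill in a port number.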
[parent not found: <20140327142813.535289178-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: [Patch 0/3] Hangs with IPoIB when doing PCI error injection
[not found] ` <20140327142813.535289178-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2014-03-28 20:47 ` David Miller
2014-03-28 20:47 ` Roland Dreier
0 siblings, 1 reply; 9+ messages in thread
From: David Miller @ 2014-03-28 20:47 UTC (permalink / raw)
To: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w, hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w, linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
From: clsoto-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
Date: Thu, 27 Mar 2014 09:28:13 -0500
> This patch is to resolve some hangs we are seeing when doing PCI error injection
> to Mellanox Infiniband cards. With this patch we make mlx4 driver send an IB_EVENT_DEVICE_FATAL
> to the users and added this event to event handlers to avoid these hangs.
> If IPoIB is in connected mode, then added to cm an event handler and tried to
> make sure that when it sees the fatal event it does not try to send anymore packet because
> it will not receive any more completions or interrupts.
I'm assuming Roland will take this in via his tree.
Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Patch 0/3] Hangs with IPoIB when doing PCI error injection
2014-03-28 20:47 ` [Patch 0/3] Hangs with IPoIB when doing " David Miller
@ 2014-03-28 20:47 ` Roland Dreier
0 siblings, 0 replies; 9+ messages in thread
From: Roland Dreier @ 2014-03-28 20:47 UTC (permalink / raw)
To: David Miller
Cc: clsoto, Hefty, Sean, Hal Rosenstock, linux-rdma@vger.kernel.org, netdev@vger.kernel.org, brking
On Fri, Mar 28, 2014 at 1:47 PM, David Miller <davem@davemloft.net> wrote:
> I'm assuming Roland will take this in via his tree.
Yes, hoping for some feedback from Mellanox people.
- R.
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-04-23 21:34 UTC | newest]
Thread overview: 9+ messages
2014-03-27 14:28 [Patch 0/3] Hangs with IPoIB when doing PCI error injection clsoto
2014-03-27 14:28 ` [Patch 1/3] IB/mlx4: send a IB_EVENT_DEVICE_FATAL to users during " clsoto
2014-03-27 14:28 ` [Patch 2/3] IB: hang in mcast_remove_one " clsoto
2014-03-27 14:28 ` [Patch 3/3] IB/ib_cm: hang in cm_destroy_id " clsoto
[not found] ` <20140327142939.460692817-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2014-04-23 18:15 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A82373992F353B-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2014-04-23 18:58 ` Carol Soto
[not found] ` <53580D42.1060201-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2014-04-23 21:34 ` Hefty, Sean
[not found] ` <20140327142813.535289178-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2014-03-28 20:47 ` [Patch 0/3] Hangs with IPoIB when doing " David Miller
2014-03-28 20:47 ` Roland Dreier