From: Kleber Sacilotto de Souza <klebers-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Cc: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	cascardo-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org,
	brking-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org
Subject: Re: [PATCH] IB: fix task hanging on error recovery
Date: Thu, 01 Nov 2012 15:53:55 -0200
Message-ID: <5092B733.9090901@linux.vnet.ibm.com>
In-Reply-To: <509295B1.6050209-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>

On 11/01/2012 01:30 PM, Hal Rosenstock wrote:
> On 11/1/2012 11:12 AM, Or Gerlitz wrote:
>> On 19/10/2012 23:58, Kleber Sacilotto de Souza wrote:
>>> During PCI error recovery, the calls to wait_for_completion() in the
>>> infiniband core path can hang waiting for some tasks that will never
>>> complete, since the hardware is nonfunctional.
>>>
>>> INFO: task eehd:16029 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> eehd            D 0000000000000000  9664 16029   3093 0x00000080
>>> Call Trace:
>>> [c0000000af8430e0] [c0000000be05e200] 0xc0000000be05e200 (unreliable)
>>> [c0000000af8432b0] [c000000000014070] .__switch_to+0x130/0x250
>>> [c0000000af843360] [c000000000777c5c] .__schedule+0x40c/0x920
>>> [c0000000af843600] [c000000000775a10] .schedule_timeout+0x210/0x260
>>> [c0000000af8436e0] [c000000000777658] .wait_for_common+0xf8/0x210
>>> [c0000000af8437c0] [d000000004113538] .ib_unregister_mad_agent+0x498/0x690 [ib_mad]
>>> [c0000000af8438c0] [d0000000042c01f4] .ib_sa_remove_one+0xe4/0x180 [ib_sa]
>>> [c0000000af843970] [d0000000040952e8] .ib_unregister_device+0x78/0x170 [ib_core]
>>> [c0000000af843a10] [d000000004183290] .mlx4_ib_remove+0x40/0x1f0 [mlx4_ib]
>>> [c0000000af843ab0] [d000000003f6ab54] .mlx4_remove_device+0xd4/0x110 [mlx4_core]
>>> [c0000000af843b40] [d000000003f6abfc] .mlx4_unregister_device+0x6c/0xf0 [mlx4_core]
>>> [c0000000af843be0] [d000000003f6f19c] .mlx4_remove_one+0x10c/0x3a0 [mlx4_core]
>>> [c0000000af843c80] [d000000003f6f448] .mlx4_pci_err_detected+0x18/0x40 [mlx4_core]
>>> [c0000000af843d00] [c000000000058a50] .eeh_report_error+0x70/0xe0
>>> [c0000000af843d90] [c0000000003e4d94] .pci_walk_bus+0xa4/0x140
>>> [c0000000af843e50] [c000000000058628] .handle_eeh_events+0x1f8/0x480
>>> [c0000000af843f00] [c000000000058dfc] .eeh_event_handler+0x13c/0x1e0
>>> [c0000000af843f90] [c00000000002031c] .kernel_thread+0x54/0x70
>>>
>>> This patch fixes the issue by replacing the calls to
>>> wait_for_completion() with wait_for_completion_timeout(), which waits
>>> at most 5 seconds for the normal completion of the task.
>>>
>>> Signed-off-by: Kleber Sacilotto de Souza <klebers-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
>>> ---
>>>   drivers/infiniband/core/addr.c        |    2 +-
>>>   drivers/infiniband/core/cm.c          |    2 +-
>>>   drivers/infiniband/core/cma.c         |    4 ++--
>>>   drivers/infiniband/core/iwcm.c        |    2 +-
>>>   drivers/infiniband/core/mad.c         |    4 ++--
>>>   drivers/infiniband/core/mad_rmpp.c    |    2 +-
>>>   drivers/infiniband/core/multicast.c   |    4 ++--
>>>   drivers/infiniband/core/sa_query.c    |    2 +-
>> For all the above files, your patch makes sense, since the
>> completion/refcount logic
>> relates to waiting for an event that needs to come from the device.
> Is this a fix or a workaround for some other underlying issue which
> needs fixing?
>
> What happens when modules are reloaded after this? Do things still
> work? Are resources lost?

After the error is recovered, the IB devices resume their normal operations,
and the modules can be reloaded successfully.
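
To make the behaviour of the change easier to picture, here is a rough
userspace sketch of a bounded wait. It is not the patch itself and not
kernel code: struct completion_model and the completion_model_*() helpers
are hypothetical names, built on POSIX threads, that only mimic the idea
of wait_for_completion_timeout(), which gives up after a timeout instead
of blocking forever as wait_for_completion() does. The timeout is passed
in milliseconds as an assumption of this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion_model {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void completion_model_init(struct completion_model *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Counterpart of complete(): mark the work done and wake any waiter. */
static void completion_model_complete(struct completion_model *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Bounded wait: returns true if the completion was signalled before the
 * timeout expired, false if the wait timed out. An unbounded wait would
 * simply loop on pthread_cond_wait() and never return if nobody signals.
 */
static bool completion_model_wait_timeout(struct completion_model *c,
					  long timeout_ms)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	assert(timeout_ms >= 0);

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec += 1;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	/* Loop to tolerate spurious wakeups; stop once the deadline passes. */
	while (!c->done && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	done = c->done;
	pthread_mutex_unlock(&c->lock);

	return done;
}
```

In this model, a caller that sees a false return knows that the other side
never signalled and can continue with teardown, which corresponds to the
case where the hardware is nonfunctional during error recovery.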

>
> -- Hal
>
>>>   drivers/infiniband/core/ucm.c         |    2 +-
>>>   drivers/infiniband/core/ucma.c        |    2 +-
>> On these files, as far as I understand this code from a quick look, I'm
>> not sure what
>> exactly the completion objects protect. Sean?
>>
>> Or.
>>
>>>   drivers/infiniband/core/uverbs_main.c |    4 ++--
>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


-- 
Kleber Sacilotto de Souza
IBM Linux Technology Center
