netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Thadeu Lima de Souza Cascardo <cascardo-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
To: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: yevgenyp-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org,
	jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org
Subject: Re: [PATCH] mlx4: prevent the device from being removed concurrently
Date: Tue, 28 Feb 2012 17:34:38 -0300	[thread overview]
Message-ID: <20120228203438.GA12028@oc1711230544.ibm.com> (raw)
In-Reply-To: <20120228.143051.352474620462899753.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Tue, Feb 28, 2012 at 02:30:51PM -0500, David Miller wrote:
> From: Thadeu Lima de Souza Cascardo <cascardo-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Date: Tue, 28 Feb 2012 15:36:16 -0300
> 
> > When a EEH happens, the catas poll code will try to restart the device,
> > removing it and adding it back again. The EEH code will try to do the
> > same. One of the threads ends up accessing memory that was freed by the
> > other thread and we get a crash.
> 
> Stop adding bandaids to the locking.
> 
> If the EEH infrastructure doesn't synchronize parallel operations
> on the same device, that is the real bug, and that's where the real
> fix belongs.
> 
> I refuse to apply this patch.
> 

It's not EEH that does not synchronize removal. The problem is that the
driver itself calls the driver remove function through mlx4_restart_one.

>From catas.c:

 88 static void catas_reset(struct work_struct *work)
...
103                 ret = mlx4_restart_one(priv->dev.pdev);


>From main.c:

2013 int mlx4_restart_one(struct pci_dev *pdev)
...
2015         mlx4_remove_one(pdev);

2067 static struct pci_driver mlx4_driver = {
...
2071         .remove         = __devexit_p(mlx4_remove_one)


Real EEH support in this driver will have to do something similar to
reset the device. And this either requires that we remove or rewrite the
catas code, or that we implement some kind of locking.

Probably, what we should do here is rewrite catas code so that is
resilient to races with any code removing the device, be it EEH or
anything else. It's just that EEH will trigger the catas code and makes
it a lot easier to test this problem.

Regards.
Cascardo.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2012-02-28 20:34 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-28 18:36 [PATCH] mlx4: prevent the device from being removed concurrently Thadeu Lima de Souza Cascardo
     [not found] ` <1330454176-17768-1-git-send-email-cascardo-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2012-02-28 19:30   ` David Miller
     [not found]     ` <20120228.143051.352474620462899753.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2012-02-28 20:34       ` Thadeu Lima de Souza Cascardo [this message]
     [not found]         ` <20120228203438.GA12028-/9mL1TZGaJOu3CHPIDa7bVaTQe2KTcn/@public.gmane.org>
2012-02-28 20:46           ` David Miller
     [not found]             ` <20120228.154657.1817512578346429850.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2012-02-29 14:47               ` Jack Morgenstein
     [not found]                 ` <201202291647.53161.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-02-29 15:19                   ` Jack Morgenstein
     [not found]                     ` <201202291719.50764.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-03-01  7:51                       ` Jack Morgenstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120228203438.GA12028@oc1711230544.ibm.com \
    --to=cascardo-23vcf4htsmix0ybbhkvfkdbpr1lh4cv8@public.gmane.org \
    --cc=davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org \
    --cc=jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org \
    --cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org \
    --cc=yevgenyp-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).