All of lore.kernel.org
 help / color / mirror / Atom feed
From: Daniel Axtens <dja@axtens.net>
To: "Matthew R. Ochs" <mrochs@linux.vnet.ibm.com>,
	linux-scsi@vger.kernel.org,
	James Bottomley <James.Bottomley@HansenPartnership.com>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>,
	Brian King <brking@linux.vnet.ibm.com>,
	Ian Munsie <imunsie@au1.ibm.com>,
	Andrew Donnellan <andrew.donnellan@au1.ibm.com>,
	Tomas Henzl <thenzl@redhat.com>,
	David Laight <David.Laight@ACULAB.COM>
Cc: Michael Neuling <mikey@neuling.org>,
	linuxppc-dev@lists.ozlabs.org,
	"Manoj N. Kumar" <manoj@linux.vnet.ibm.com>
Subject: Re: [PATCH v4 25/32] cxlflash: Fix to prevent EEH recovery failure
Date: Tue, 29 Sep 2015 11:25:50 +1000	[thread overview]
Message-ID: <87612uuich.fsf@gamma.ozlabs.ibm.com> (raw)
In-Reply-To: <1443223134-9886-1-git-send-email-mrochs@linux.vnet.ibm.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

"Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes:


> The process_sense() routine can perform a read capacity which
> can take some time to complete. If an EEH occurs while waiting
> on the read capacity, the EEH handler is unable to obtain the
> context's mutex in order to put the context in an error state.
> The EEH handler will sit and wait until the context is free,
> but this wait can last longer than the EEH handler tolerates,
> leading to a failed recovery.

I'm not quite clear on what you mean by the EEH handler timing
out. AFAIK there's nothing in eehd and the EEH core that times out if a
driver doesn't respond - indeed, it's pretty easy to hang eehd with a
misbehaving driver.

Are you referring to your own internal timeouts?
cxlflash_wait_for_pci_err_recovery and anything else that uses
CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?

Regards,
Daniel

>
> To address this issue, make the context unavailable to new,
> non-system owned threads and release the context while calling
> into process_sense(). After returning from process_sense() the
> context mutex is reacquired and the context is made available
> again. The context can be safely moved to the error state if
> needed during the unavailable window as no other threads will
> hold its reference.
>
> Signed-off-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>
> Signed-off-by: Manoj N. Kumar <manoj@linux.vnet.ibm.com>
> Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
> ---
>  drivers/scsi/cxlflash/superpipe.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/superpipe.c
> index a6316f5..7283e83 100644
> --- a/drivers/scsi/cxlflash/superpipe.c
> +++ b/drivers/scsi/cxlflash/superpipe.c
> @@ -1787,12 +1787,21 @@ static int cxlflash_disk_verify(struct scsi_device *sdev,
>  	 * inquiry (i.e. the Unit attention is due to the WWN changing).
>  	 */
>  	if (verify->hint & DK_CXLFLASH_VERIFY_HINT_SENSE) {
> +		/* Can't hold mutex across process_sense/read_cap16,
> +		 * since we could have an intervening EEH event.
> +		 */
> +		ctxi->unavail = true;
> +		mutex_unlock(&ctxi->mutex);
>  		rc = process_sense(sdev, verify);
>  		if (unlikely(rc)) {
>  			dev_err(dev, "%s: Failed to validate sense data (%d)\n",
>  				__func__, rc);
> +			mutex_lock(&ctxi->mutex);
> +			ctxi->unavail = false;
>  			goto out;
>  		}
> +		mutex_lock(&ctxi->mutex);
> +		ctxi->unavail = false;
>  	}
>  
>  	switch (gli->mode) {
> -- 
> 2.1.0
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: GPGTools - https://gpgtools.org

iQIcBAEBCgAGBQJWCeieAAoJEPC3R3P2I92F+hMP/1OdLQCin+kKbOb9qxf952bH
DUAkmEhc0oD7xZFQI8HgDmHRxpes5HHxXtwXFsLgsr8QYG+aOIV568GXIZtTbrl0
aCFMqtKXZ6jVqv5L60r1tgzcWxmWdshMLd1op6t3BwA67nUc5Edcr94ePUyDDLj1
at335wCnxuGxn0kdB0Ud/lbPzTsgDPcuV6tCLy0o4J15KFOyFt9hCjO4nmL/wcIt
kmjyn5SHbdgje+73uaRQnXkli4wDA9x7x6/8wFgLspnOxgMEJgnHmm+HYbOXnHyX
nFFHw9+X2ETUcucVWuKNaFzW1vH+WJDteEZbjS7t7liJIkmIiZSFHyUTtVGdBkl1
FsWswA0pkzuGq94Wsb0nGtNHbsMw+WeWTcTlNN46DMG/wqz75iO3yMGK9MZuddSX
9jUokiM0kQvvfwAoujmvpMCVB4b2oseRRG4/yJ0lKSCcC8kETQTXgVHbT8oLmCdk
rUA0hxbbKzVQsDzw8s5HqYZjqHdLp3sDPeyukPeJl2CNhysrmnyHXpq8XgcLi3op
kbuuiR3z8UH3MW4BDpplnjhZ+5Wyw9cSI57vRF2Kr80NnU+5hBvftNh4rBneeny2
0gCDlPHDvB7Ks9HkcxkK9MW78FTgj50ePofS/dUUod4M9ohDd4MSwRKjpwQ+H3By
jmxnzfvWO/oTlL1D9+2W
=3BcU
-----END PGP SIGNATURE-----

WARNING: multiple messages have this Message-ID (diff)
From: Daniel Axtens <dja@axtens.net>
To: "Matthew R. Ochs" <mrochs@linux.vnet.ibm.com>,
	linux-scsi@vger.kernel.org,
	James Bottomley <James.Bottomley@HansenPartnership.com>,
	"Nicholas A. Bellinger" <nab@linux-iscsi.org>,
	Brian King <brking@linux.vnet.ibm.com>,
	Ian Munsie <imunsie@au1.ibm.com>,
	Andrew Donnellan <andrew.donnellan@au1.ibm.com>,
	Tomas Henzl <thenzl@redhat.com>,
	David Laight <David.Laight@ACULAB.COM>
Cc: Michael Neuling <mikey@neuling.org>,
	linuxppc-dev@lists.ozlabs.org,
	"Manoj N. Kumar" <manoj@linux.vnet.ibm.com>
Subject: Re: [PATCH v4 25/32] cxlflash: Fix to prevent EEH recovery failure
Date: Tue, 29 Sep 2015 11:25:50 +1000	[thread overview]
Message-ID: <87612uuich.fsf@gamma.ozlabs.ibm.com> (raw)
In-Reply-To: <1443223134-9886-1-git-send-email-mrochs@linux.vnet.ibm.com>

=2D----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

"Matthew R. Ochs" <mrochs@linux.vnet.ibm.com> writes:


> The process_sense() routine can perform a read capacity which
> can take some time to complete. If an EEH occurs while waiting
> on the read capacity, the EEH handler is unable to obtain the
> context's mutex in order to put the context in an error state.
> The EEH handler will sit and wait until the context is free,
> but this wait can last longer than the EEH handler tolerates,
> leading to a failed recovery.

I'm not quite clear on what you mean by the EEH handler timing
out. AFAIK there's nothing in eehd and the EEH core that times out if a
driver doesn't respond - indeed, it's pretty easy to hang eehd with a
misbehaving driver.

Are you referring to your own internal timeouts?
cxlflash_wait_for_pci_err_recovery and anything else that uses
CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?

Regards,
Daniel

>
> To address this issue, make the context unavailable to new,
> non-system owned threads and release the context while calling
> into process_sense(). After returning from process_sense() the
> context mutex is reacquired and the context is made available
> again. The context can be safely moved to the error state if
> needed during the unavailable window as no other threads will
> hold its reference.
>
> Signed-off-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>
> Signed-off-by: Manoj N. Kumar <manoj@linux.vnet.ibm.com>
> Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
> ---
>  drivers/scsi/cxlflash/superpipe.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/scsi/cxlflash/superpipe.c b/drivers/scsi/cxlflash/su=
perpipe.c
> index a6316f5..7283e83 100644
> --- a/drivers/scsi/cxlflash/superpipe.c
> +++ b/drivers/scsi/cxlflash/superpipe.c
> @@ -1787,12 +1787,21 @@ static int cxlflash_disk_verify(struct scsi_devic=
e *sdev,
>  	 * inquiry (i.e. the Unit attention is due to the WWN changing).
>  	 */
>  	if (verify->hint & DK_CXLFLASH_VERIFY_HINT_SENSE) {
> +		/* Can't hold mutex across process_sense/read_cap16,
> +		 * since we could have an intervening EEH event.
> +		 */
> +		ctxi->unavail =3D true;
> +		mutex_unlock(&ctxi->mutex);
>  		rc =3D process_sense(sdev, verify);
>  		if (unlikely(rc)) {
>  			dev_err(dev, "%s: Failed to validate sense data (%d)\n",
>  				__func__, rc);
> +			mutex_lock(&ctxi->mutex);
> +			ctxi->unavail =3D false;
>  			goto out;
>  		}
> +		mutex_lock(&ctxi->mutex);
> +		ctxi->unavail =3D false;
>  	}
>=20=20
>  	switch (gli->mode) {
> --=20
> 2.1.0
=2D----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: GPGTools - https://gpgtools.org

iQIcBAEBCgAGBQJWCeieAAoJEPC3R3P2I92F+hMP/1OdLQCin+kKbOb9qxf952bH
DUAkmEhc0oD7xZFQI8HgDmHRxpes5HHxXtwXFsLgsr8QYG+aOIV568GXIZtTbrl0
aCFMqtKXZ6jVqv5L60r1tgzcWxmWdshMLd1op6t3BwA67nUc5Edcr94ePUyDDLj1
at335wCnxuGxn0kdB0Ud/lbPzTsgDPcuV6tCLy0o4J15KFOyFt9hCjO4nmL/wcIt
kmjyn5SHbdgje+73uaRQnXkli4wDA9x7x6/8wFgLspnOxgMEJgnHmm+HYbOXnHyX
nFFHw9+X2ETUcucVWuKNaFzW1vH+WJDteEZbjS7t7liJIkmIiZSFHyUTtVGdBkl1
FsWswA0pkzuGq94Wsb0nGtNHbsMw+WeWTcTlNN46DMG/wqz75iO3yMGK9MZuddSX
9jUokiM0kQvvfwAoujmvpMCVB4b2oseRRG4/yJ0lKSCcC8kETQTXgVHbT8oLmCdk
rUA0hxbbKzVQsDzw8s5HqYZjqHdLp3sDPeyukPeJl2CNhysrmnyHXpq8XgcLi3op
kbuuiR3z8UH3MW4BDpplnjhZ+5Wyw9cSI57vRF2Kr80NnU+5hBvftNh4rBneeny2
0gCDlPHDvB7Ks9HkcxkK9MW78FTgj50ePofS/dUUod4M9ohDd4MSwRKjpwQ+H3By
jmxnzfvWO/oTlL1D9+2W
=3D3BcU
=2D----END PGP SIGNATURE-----

  reply	other threads:[~2015-09-29  1:26 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-09-25 23:09 [PATCH v4 00/32] cxlflash: Miscellaneous bug fixes and corrections Matthew R. Ochs
2015-09-25 23:12 ` [PATCH v4 01/32] cxlflash: Fix to avoid invalid port_sel value Matthew R. Ochs
2015-09-25 23:12 ` [PATCH v4 02/32] cxlflash: Replace magic numbers with literals Matthew R. Ochs
2015-09-29  5:40   ` Andrew Donnellan
2015-09-25 23:12 ` [PATCH v4 03/32] cxlflash: Fix read capacity timeout Matthew R. Ochs
2015-09-25 23:13 ` [PATCH v4 04/32] cxlflash: Fix potential oops following LUN removal Matthew R. Ochs
2015-09-25 23:13 ` [PATCH v4 05/32] cxlflash: Fix data corruption when vLUN used over multiple cards Matthew R. Ochs
2015-09-25 23:14 ` [PATCH v4 06/32] cxlflash: Fix to avoid sizeof(bool) Matthew R. Ochs
2015-09-28 22:35   ` Daniel Axtens
2015-09-28 22:35     ` Daniel Axtens
2015-09-25 23:14 ` [PATCH v4 07/32] cxlflash: Fix context encode mask width Matthew R. Ochs
2015-09-28 22:39   ` Daniel Axtens
2015-09-28 22:39     ` Daniel Axtens
2015-09-25 23:14 ` [PATCH v4 08/32] cxlflash: Fix to avoid CXL services during EEH Matthew R. Ochs
2015-09-28 22:07   ` Brian King
2015-09-28 23:05   ` Daniel Axtens
2015-09-29 19:28     ` Matthew R. Ochs
2015-09-29 19:28       ` Matthew R. Ochs
2015-09-25 23:14 ` [PATCH v4 09/32] cxlflash: Correct naming of limbo state and waitq Matthew R. Ochs
2015-09-28 23:09   ` Daniel Axtens
2015-09-28 23:09     ` Daniel Axtens
2015-09-25 23:14 ` [PATCH v4 10/32] cxlflash: Make functions static Matthew R. Ochs
2015-09-25 23:14 ` [PATCH v4 11/32] cxlflash: Refine host/device attributes Matthew R. Ochs
2015-09-29  4:29   ` Andrew Donnellan
2015-09-25 23:15 ` [PATCH v4 12/32] cxlflash: Fix to avoid spamming the kernel log Matthew R. Ochs
2015-09-29  5:05   ` Andrew Donnellan
2015-09-29 20:37     ` Matthew R. Ochs
2015-09-29 20:37       ` Matthew R. Ochs
2015-09-25 23:16 ` [PATCH v4 13/32] cxlflash: Fix to avoid stall while waiting on TMF Matthew R. Ochs
2015-09-25 23:16 ` [PATCH v4 14/32] cxlflash: Fix location of setting resid Matthew R. Ochs
2015-09-25 23:16 ` [PATCH v4 15/32] cxlflash: Fix host link up event handling Matthew R. Ochs
2015-09-25 23:16 ` [PATCH v4 16/32] cxlflash: Fix async interrupt bypass logic Matthew R. Ochs
2015-09-25 23:16 ` [PATCH v4 17/32] cxlflash: Remove dual port online dependency Matthew R. Ochs
2015-09-28 23:37   ` Daniel Axtens
2015-09-28 23:37     ` Daniel Axtens
2015-09-29 19:38     ` Matthew R. Ochs
2015-09-29 19:38       ` Matthew R. Ochs
2015-09-30 23:50       ` Daniel Axtens
2015-10-01 15:00         ` Matthew R. Ochs
2015-10-01 15:00           ` Matthew R. Ochs
2015-09-25 23:17 ` [PATCH v4 18/32] cxlflash: Fix AFU version access/storage and add check Matthew R. Ochs
2015-09-25 23:17 ` [PATCH v4 19/32] cxlflash: Correct usage of scsi_host_put() Matthew R. Ochs
2015-09-25 23:17 ` [PATCH v4 20/32] cxlflash: Fix to prevent workq from accessing freed memory Matthew R. Ochs
2015-09-25 23:17 ` [PATCH v4 21/32] cxlflash: Correct behavior in device reset handler following EEH Matthew R. Ochs
2015-09-25 23:17 ` [PATCH v4 22/32] cxlflash: Remove unnecessary scsi_block_requests Matthew R. Ochs
2015-09-25 23:18 ` [PATCH v4 23/32] cxlflash: Fix function prolog parameters and return codes Matthew R. Ochs
2015-09-29  4:36   ` Andrew Donnellan
2015-09-29 20:31     ` Matthew R. Ochs
2015-09-29 20:31       ` Matthew R. Ochs
2015-09-25 23:18 ` [PATCH v4 24/32] cxlflash: Fix MMIO and endianness errors Matthew R. Ochs
2015-09-29  1:52   ` Andrew Donnellan
2015-09-25 23:18 ` [PATCH v4 25/32] cxlflash: Fix to prevent EEH recovery failure Matthew R. Ochs
2015-09-29  1:25   ` Daniel Axtens [this message]
2015-09-29  1:25     ` Daniel Axtens
2015-09-29 20:11     ` Matthew R. Ochs
2015-09-30 23:53       ` Daniel Axtens
2015-09-25 23:18 ` [PATCH v4 26/32] cxlflash: Correct spelling, grammar, and alignment mistakes Matthew R. Ochs
2015-09-29  1:18   ` Andrew Donnellan
2015-09-25 23:19 ` [PATCH v4 27/32] cxlflash: Fix to prevent stale AFU RRQ Matthew R. Ochs
2015-09-29  1:36   ` Daniel Axtens
2015-09-29  1:36     ` Daniel Axtens
2015-09-29 20:22     ` Matthew R. Ochs
2015-09-29 20:22       ` Matthew R. Ochs
2015-09-30 23:51       ` Daniel Axtens
2015-09-25 23:19 ` [PATCH v4 28/32] MAINTAINERS: Add cxlflash driver Matthew R. Ochs
2015-09-25 23:19 ` [PATCH v4 29/32] cxlflash: Fix to double the delay each time Matthew R. Ochs
2015-09-29  1:19   ` Andrew Donnellan
2015-09-29  1:40   ` Daniel Axtens
2015-09-29  1:40     ` Daniel Axtens
2015-09-29 20:28     ` Matthew R. Ochs
2015-09-30  0:08       ` Daniel Axtens
2015-09-25 23:19 ` [PATCH v4 30/32] cxlflash: Fix to avoid corrupting adapter fops Matthew R. Ochs
2015-09-28 22:13   ` Brian King
2015-09-29  0:54   ` Andrew Donnellan
2015-09-30  0:18   ` Daniel Axtens
2015-09-25 23:19 ` [PATCH v4 31/32] cxlflash: Correct trace string Matthew R. Ochs
2015-09-29  1:20   ` Andrew Donnellan
2015-09-25 23:19 ` [PATCH v4 32/32] cxlflash: Fix to avoid potential deadlock on EEH Matthew R. Ochs
2015-09-28 23:41   ` Brian King
2015-09-29 19:40     ` Matthew R. Ochs
2015-09-29 19:40       ` Matthew R. Ochs
2015-09-30  0:33   ` Daniel Axtens

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87612uuich.fsf@gamma.ozlabs.ibm.com \
    --to=dja@axtens.net \
    --cc=David.Laight@ACULAB.COM \
    --cc=James.Bottomley@HansenPartnership.com \
    --cc=andrew.donnellan@au1.ibm.com \
    --cc=brking@linux.vnet.ibm.com \
    --cc=imunsie@au1.ibm.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=manoj@linux.vnet.ibm.com \
    --cc=mikey@neuling.org \
    --cc=mrochs@linux.vnet.ibm.com \
    --cc=nab@linux-iscsi.org \
    --cc=thenzl@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.