* linux-next scsi-mq hang in suspend-resume
@ 2017-07-12 14:51 Tomi Sarvela
2017-07-12 16:50 ` Jens Axboe
0 siblings, 1 reply; 10+ messages in thread
From: Tomi Sarvela @ 2017-07-12 14:51 UTC (permalink / raw)
To: linux-block; +Cc: axboe
Hello there,
I've been running Intel GFX CI testing for the Linux DRM-Tip i915 driver,
and a couple of weeks ago we took linux-next for a ride to see what kind
of integration problems might pop up when pulling 4.13-rc1.
Latest results can be seen at
https://intel-gfx-ci.01.org/CI/next-issues.html
https://intel-gfx-ci.01.org/CI/next-all.html
The purple blocks are hangs, starting from 20170628 (20170627 was
untestable due to locking changes which were reverted). Traces were
pointing to ext4 but bisecting between good 20170626 and bad 20170628
pointed to:
commit 5c279bd9e40624f4ab6e688671026d6005b066fa
Date: Fri Jun 16 10:27:55 2017 +0200
scsi: default to scsi-mq
Reproduction is 100% or close to it when running two i-g-t tests as a
testlist. I'm assuming that it creates the correct amount or pattern
of actions to the device. The testlist consists of the following
lines:
igt@gem_exec_gttfill@basic
igt@gem_exec_suspend@basic-s3
The kernel option scsi_mod.use_blk_mq=0 hides the issue on the testhosts.
The option was copied over to the testhosts and 20170712 was retested,
which is why today looks so much greener.
More information including traces and reproduction instructions at
https://bugzilla.kernel.org/show_bug.cgi?id=196223
I can run patchsets through the farm, if needed. In addition, daily
linux-next tags are automatically tested and results published.
Best regards,
Tomi Sarvela
--
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-12 14:51 linux-next scsi-mq hang in suspend-resume Tomi Sarvela
@ 2017-07-12 16:50 ` Jens Axboe
2017-07-13 7:12 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2017-07-12 16:50 UTC (permalink / raw)
To: Tomi Sarvela, linux-block; +Cc: Christoph Hellwig
On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> Hello there,
>
> I've been running Intel GFX CI testing for the Linux DRM-Tip i915 driver,
> and a couple of weeks ago we took linux-next for a ride to see what kind
> of integration problems might pop up when pulling 4.13-rc1.
> Latest results can be seen at
>
> https://intel-gfx-ci.01.org/CI/next-issues.html
> https://intel-gfx-ci.01.org/CI/next-all.html
>
> The purple blocks are hangs, starting from 20170628 (20170627 was
> untestable due to locking changes which were reverted). Traces were
> pointing to ext4 but bisecting between good 20170626 and bad 20170628
> pointed to:
>
> commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> Date: Fri Jun 16 10:27:55 2017 +0200
>
> scsi: default to scsi-mq
>
> Reproduction is 100% or close to it when running two i-g-t tests as a
> testlist. I'm assuming that it creates the correct amount or pattern
> of actions to the device. The testlist consists of the following
> lines:
>
> igt@gem_exec_gttfill@basic
> igt@gem_exec_suspend@basic-s3
>
> The kernel option scsi_mod.use_blk_mq=0 hides the issue on the testhosts.
> The option was copied over to the testhosts and 20170712 was retested,
> which is why today looks so much greener.
>
> More information including traces and reproduction instructions at
> https://bugzilla.kernel.org/show_bug.cgi?id=196223
>
> I can run patchsets through the farm, if needed. In addition, daily
> linux-next tags are automatically tested and results published.
Christoph, any ideas? Smells like something in SCSI, my notebook
with nvme/blk-mq suspend/resumes just fine.
--
Jens Axboe
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-12 16:50 ` Jens Axboe
@ 2017-07-13 7:12 ` Christoph Hellwig
2017-07-14 12:44 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2017-07-13 7:12 UTC (permalink / raw)
To: Jens Axboe; +Cc: Tomi Sarvela, linux-block, Christoph Hellwig
On Wed, Jul 12, 2017 at 10:50:19AM -0600, Jens Axboe wrote:
> On 07/12/2017 08:51 AM, Tomi Sarvela wrote:
> > Hello there,
> >
> > I've been running Intel GFX CI testing for the Linux DRM-Tip i915 driver,
> > and a couple of weeks ago we took linux-next for a ride to see what kind
> > of integration problems might pop up when pulling 4.13-rc1.
> > Latest results can be seen at
> >
> > https://intel-gfx-ci.01.org/CI/next-issues.html
> > https://intel-gfx-ci.01.org/CI/next-all.html
> >
> > The purple blocks are hangs, starting from 20170628 (20170627 was
> > untestable due to locking changes which were reverted). Traces were
> > pointing to ext4 but bisecting between good 20170626 and bad 20170628
> > pointed to:
> >
> > commit 5c279bd9e40624f4ab6e688671026d6005b066fa
> > Date: Fri Jun 16 10:27:55 2017 +0200
> >
> > scsi: default to scsi-mq
> >
> > Reproduction is 100% or close to it when running two i-g-t tests as a
> > testlist. I'm assuming that it creates the correct amount or pattern
> > of actions to the device. The testlist consists of the following
> > lines:
> >
> > igt@gem_exec_gttfill@basic
> > igt@gem_exec_suspend@basic-s3
> >
> > The kernel option scsi_mod.use_blk_mq=0 hides the issue on the testhosts.
> > The option was copied over to the testhosts and 20170712 was retested,
> > which is why today looks so much greener.
> >
> > More information including traces and reproduction instructions at
> > https://bugzilla.kernel.org/show_bug.cgi?id=196223
> >
> > I can run patchsets through the farm, if needed. In addition, daily
> > linux-next tags are automatically tested and results published.
>
> Christoph, any ideas? Smells like something in SCSI, my notebook
> with nvme/blk-mq suspend/resumes just fine.
There isn't much mq-specific scsi code, so it's probably an interaction
of both. I'll see if the bugzilla has enough data to reproduce it
locally.
Although I really wish people wouldn't use #TY^$Y^$ bugzilla and just
post the important data to the list :(
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-13 7:12 ` Christoph Hellwig
@ 2017-07-14 12:44 ` Christoph Hellwig
2017-07-14 13:33 ` Tomi Sarvela
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2017-07-14 12:44 UTC (permalink / raw)
To: Jens Axboe; +Cc: Tomi Sarvela, linux-block, Christoph Hellwig
Tomi,
can you please report what hardware this is on (e.g. libata or
real scsi, which driver), a kernel config and the actual command
used to suspend the system (to ram, to disk?) so that I can try to
reproduce it?
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-14 12:44 ` Christoph Hellwig
@ 2017-07-14 13:33 ` Tomi Sarvela
2017-07-17 7:53 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Tomi Sarvela @ 2017-07-14 13:33 UTC (permalink / raw)
To: Christoph Hellwig, Jens Axboe; +Cc: linux-block
On 14/07/17 15:44, Christoph Hellwig wrote:
> can you please report what hardware this is on (e.g. libata or
> real scsi, which driver), a kernel config and the actual command
> used to suspend the system (to ram, to disk?) so that I can try to
> reproduce it?
The hardware I used to bisect the problem is Broxton: an Asrock
ITX-J3455 motherboard with an Intel J3455 SoC (about Skylake Gen). The
disk is an Intel SATA SSD. The issue also happens with a Samsung SSD on
another testhost.
Note that half a dozen other hosts show the same problem, and traces are
available from ILK to Skylake. None of the Kaby Lakes triggers the issue
(the KBL issue is probably NVMe-related instead). The usual setup is one
SATA SSD on port 0 of the motherboard.
Kernel config is available at:
https://intel-gfx-ci.01.org/CI/next-20170711/kernel.config.bz2
Kernel options:
BOOT_IMAGE=/boot/drm_intel root=/dev/sda2 console=ttyS0,115200n8
console=tty0 intel_iommu=igfx_off drm.debug=0xe nmi_watchdog=panic,auto
panic=1 softdog.soft_panic=1 rootwait ro 3
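
A minimal Python sketch for checking whether a given command line sets
scsi_mod.use_blk_mq at all. The helper name blk_mq_setting is
hypothetical, and treating the last occurrence as the effective one is an
assumption about how repeated parameters are handled. On a running system
the same text can be read from /proc/cmdline.

```python
from typing import Optional


def blk_mq_setting(cmdline: str) -> Optional[str]:
    """Return the value given for scsi_mod.use_blk_mq, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "scsi_mod.use_blk_mq" and sep:
            # Assumption: a later occurrence overrides an earlier one.
            value = val
    return value


# The command line quoted in this message; the parameter is not set in it,
# so the kernel's built-in default applies.
CMDLINE = (
    "BOOT_IMAGE=/boot/drm_intel root=/dev/sda2 console=ttyS0,115200n8 "
    "console=tty0 intel_iommu=igfx_off drm.debug=0xe nmi_watchdog=panic,auto "
    "panic=1 softdog.soft_panic=1 rootwait ro 3"
)

if __name__ == "__main__":
    print(blk_mq_setting(CMDLINE))
    print(blk_mq_setting(CMDLINE + " scsi_mod.use_blk_mq=0"))
```

A result of None means the option is not on the command line, which is
the configuration that hangs on linux-next in this report.
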
To reproduce the problem on Broxton, i-g-t was used:
https://cgit.freedesktop.org/xorg/app/intel-gpu-tools/
From i-g-t, the binaries could be run with:
tests/gem_exec_gttfill --r basic
tests/gem_exec_suspend --r basic-s3
but, in my experience, the issue shows up much more easily when the
piglit framework is capturing logs to disk:
https://cgit.freedesktop.org/piglit
With IGT/piglit, the testlist file would be (e.g. scsi-mq.testlist):
#
igt@gem_exec_gttfill@basic
igt@gem_exec_suspend@basic-s3
#
and command to run i-g-t through piglit is
/opt/igt/scripts/run-tests.sh -vT scsi-mq.testlist
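
The two steps above can be sketched in Python as follows. This is only a
convenience wrapper under stated assumptions: the helper names
write_testlist and run_command are hypothetical, the runner path and the
-vT option are taken from this message as-is, and the "#" lines shown
around the testlist are left out on the assumption that they are not
required.

```python
import shlex
from pathlib import Path

# The two tests from the testlist quoted above.
TESTS = ["igt@gem_exec_gttfill@basic", "igt@gem_exec_suspend@basic-s3"]


def write_testlist(path: Path, tests=TESTS) -> Path:
    """Write one test name per line and return the path that was written."""
    path.write_text("\n".join(tests) + "\n")
    return path


def run_command(testlist: Path, runner: str = "/opt/igt/scripts/run-tests.sh"):
    """Build the argv for the runner; nothing is executed here."""
    return [runner, "-vT", str(testlist)]


if __name__ == "__main__":
    # Only print what would be written and run; call write_testlist() and
    # subprocess.run() yourself on a machine that is allowed to suspend.
    print("\n".join(TESTS))
    print(shlex.join(run_command(Path("scsi-mq.testlist"))))
```
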
I can try to reproduce the issue without i-g-t/piglit, but it might take
some trying. Both suspend-to-RAM and writes to disk are definitely needed
to trigger this; gem_exec_suspend/basic-s3 on its own can loop quite well
without panicking.
Tomi
--
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-14 13:33 ` Tomi Sarvela
@ 2017-07-17 7:53 ` Christoph Hellwig
2017-07-17 10:30 ` Tomi Sarvela
0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2017-07-17 7:53 UTC (permalink / raw)
To: Tomi Sarvela; +Cc: Christoph Hellwig, Jens Axboe, linux-block
[-- Attachment #1: Type: text/plain, Size: 546 bytes --]
I still haven't gotten hold of an i915 machine where I could
run the actual test suite.
But I did some audit of the code, and it seems blk-mq is lacking
support for the RQF_PM flag. While I can't directly see how
this would cause the hang you're seeing, it's at least easy to test.
Can you apply the patch below and test with the use_blk_mq=0 parameter?
Note that implementing RQF_PM for blk-mq shouldn't be too hard either,
but if we don't get rid of the nr_pending counter somehow it would
be a severe performance penalty for all scsi devices.
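
For readers unfamiliar with RQF_PM, here is a toy Python model of the
idea behind the flag. It is a simplification under stated assumptions,
not kernel code: while the device is not active, only requests marked as
power-management requests are let through and ordinary I/O stays queued.
ToyQueue, Request and the active flag are hypothetical names, and the
sketch makes no claim about why the hang occurs.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    name: str
    pm: bool = False  # stands in for RQF_PM; not the kernel's struct request


@dataclass
class ToyQueue:
    active: bool = True  # hypothetical stand-in for the runtime-PM state
    pending: deque = field(default_factory=deque)

    def submit(self, rq: Request) -> None:
        self.pending.append(rq)

    def dispatch(self) -> list:
        """Send what is allowed right now; keep the rest queued in order."""
        sent, held = [], deque()
        for rq in self.pending:
            if self.active or rq.pm:
                sent.append(rq.name)
            else:
                held.append(rq)
        self.pending = held
        return sent


if __name__ == "__main__":
    q = ToyQueue(active=False)
    q.submit(Request("filesystem write"))
    q.submit(Request("resume command", pm=True))
    print(q.dispatch())  # only the PM request passes while not active
    q.active = True
    print(q.dispatch())  # the held write goes out once the device is active
```
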
[-- Attachment #2: sd.diff --]
[-- Type: text/plain, Size: 698 bytes --]
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bea36adeee17..5c3818ebee9c 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -554,7 +554,7 @@ static struct scsi_driver sd_template = {
 		.probe		= sd_probe,
 		.remove		= sd_remove,
 		.shutdown	= sd_shutdown,
-		.pm		= &sd_pm_ops,
+//		.pm		= &sd_pm_ops,
 	},
 	.rescan			= sd_rescan,
 	.init_command		= sd_init_command,
@@ -3249,7 +3249,7 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 		gd->events |= DISK_EVENT_MEDIA_CHANGE;
 	}
 
-	blk_pm_runtime_init(sdp->request_queue, dev);
+//	blk_pm_runtime_init(sdp->request_queue, dev);
 	device_add_disk(dev, gd);
 	if (sdkp->capacity)
 		sd_dif_config_host(sdkp);
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-17 7:53 ` Christoph Hellwig
@ 2017-07-17 10:30 ` Tomi Sarvela
2017-07-17 10:35 ` Christoph Hellwig
0 siblings, 1 reply; 10+ messages in thread
From: Tomi Sarvela @ 2017-07-17 10:30 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Jens Axboe, linux-block
On 17/07/17 10:53, Christoph Hellwig wrote:
> I still haven't gotten hold of an i915 machine where I could
> run the actual test suite.
>
> But I did some audit of the code, and it seems blk-mq is lacking
> support for the RQF_PM flag. While I can't directly see how
> this would cause the hang you're seeing, it's at least easy to test.
>
> Can you apply the patch below and test with the use_blk_mq=0 parameter?
>
> Note that implementing RQF_PM for blk-mq shouldn't be too hard either,
> but if we don't get rid of the nr_pending counter somehow it would
> be a severe performance penalty for all scsi devices.
First, I verified that next-20170717 still triggers the problem when no
extra options are given. Adding scsi_mod.use_blk_mq=0 makes the tests work.
Then I tried next-20170717 with sd.diff applied. It (still) works with
use_blk_mq=0. It also works when no options are given, so this patch avoids
the hang when using the new blk-mq.
These tests were run on a generic Haswell 4790K desktop machine.
Best regards,
Tomi
--
Intel Finland Oy - BIC 0357606-4 - Westendinkatu 7, 02160 Espoo
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-17 10:30 ` Tomi Sarvela
@ 2017-07-17 10:35 ` Christoph Hellwig
0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2017-07-17 10:35 UTC (permalink / raw)
To: Tomi Sarvela; +Cc: Christoph Hellwig, Jens Axboe, linux-block
On Mon, Jul 17, 2017 at 01:30:00PM +0300, Tomi Sarvela wrote:
> First, I verified that next-20170717 still triggers the problem when no
> extra options are given. Adding scsi_mod.use_blk_mq=0 makes the tests work.
>
> Then I tried next-20170717 with sd.diff applied. It (still) works with
> use_blk_mq=0. It also works when no options are given, so this patch avoids
> the hang when using the new blk-mq.
>
> These tests were run on a generic Haswell 4790K desktop machine.
Thanks Tomi,
this seems to confirm it's runtime PM related, although I don't
really understand why that's an issue. Let me spin up an implementation
of RQF_PM for blk-mq and give it to you for testing.
* Re: linux-next scsi-mq hang in suspend-resume
@ 2017-07-17 15:18 Evangelos Foutras
2017-07-17 17:17 ` Evangelos Foutras
0 siblings, 1 reply; 10+ messages in thread
From: Evangelos Foutras @ 2017-07-17 15:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block
(Hopefully I got the In-Reply-To header right and won't mess up the thread.)
On 17/07/17 10:53, Christoph Hellwig wrote:
> I still haven't gotten hold of an i915 machine where I could
> run the actual test suite.
At the risk of posting an unproductive "me too" reply, I also got bit by
the dead disk on resume from S3 when Arch Linux enabled MQ by default in
the 4.12 kernel (CONFIG_SCSI_MQ_DEFAULT=y). The configuration change was
later reverted due to this issue.
For me the hang occurs pretty reliably (tested about 5-6 times) on an
Intel laptop and an AMD desktop, both with HDDs and ext4 on top of LUKS.
It feels as if the disk stops responding to commands. The machine itself
wakes up from sleep but even a simple `ls` will hang and do nothing.
> But I did some audit of the code, and it seems blk-mq is lacking
> support for the RQF_PM flag. While I can't directly see how
> this would cause the hang you're seeing, it's at least easy to test.
>
> Can you apply the patch below and test with the use_blk_mq=0 parameter?
I think the patch needs to be tested with scsi_mod.use_blk_mq=1 (which I
will try to do and report back).
* Re: linux-next scsi-mq hang in suspend-resume
2017-07-17 15:18 Evangelos Foutras
@ 2017-07-17 17:17 ` Evangelos Foutras
0 siblings, 0 replies; 10+ messages in thread
From: Evangelos Foutras @ 2017-07-17 17:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-block
On 17 July 2017 at 18:18, Evangelos Foutras <evangelos@foutrelis.com> wrote:
> On 17/07/17 10:53, Christoph Hellwig wrote:
>> But I did some audit of the code, and it seems blk-mq is lacking
>> support for the RQF_PM flag. While I can't directly see how
>> this would cause the hang you're seeing, it's at least easy to test.
>>
>> Can you apply the patch below and test with the use_blk_mq=0 parameter?
>
> I think the patch needs to be tested with scsi_mod.use_blk_mq=1 (which I
> will try to do and report back).
I briefly tested the patch (on top of Linux 4.12.2) and it appears to
successfully work around the issue; my laptop happily resumes from S3
and can access the HDD.