"blocked for more than 120 secs" --> a valid situation, how to prevent?


* "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Mark Lord @ 2010-09-23 23:41 UTC (permalink / raw)
  To: Linux Kernel, IDE/ATA development list, linux-scsi

What's the purpose of this stack dump,
and how can it be prevented in this NORMAL situation??

The command was "hdparm --security-erase NULL /dev/sdb",
which takes about 66 minutes to complete on this particular drive.

I don't see any obvious way for the task to mark itself
as needing longer than 120 secs to complete the operation.

Thanks.

[ 1800.373281] INFO: task hdparm:1979 blocked for more than 120 seconds.
[ 1800.373288] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.373294] hdparm        D f64a4c00     0  1979   1718 0x00000000
[ 1800.373303]  f3065c38 00200086 c11be0e3 f64a4c00 c1439ac0 c1439ac0 c1439ac0 c1439ac0
[ 1800.373317]  f571bdb8 c1439ac0 c1439ac0 dbdc5ec7 00000163 dbdb0cfb 00000163 f571bb60
[ 1800.373329]  00000001 f3065d18 7fffffff f571bb60 f3065c64 c1270527 f684ca50 f684ca50
[ 1800.373341] Call Trace:
[ 1800.373355]  [<c11be0e3>] ? ata_scsi_queuecmd+0x6a/0x73
[ 1800.373366]  [<c1270527>] schedule_timeout+0x16/0xa5
[ 1800.373376]  [<c111a4a2>] ? __blk_run_queue+0x3d/0x5e
[ 1800.373383]  [<c1118626>] ? elv_insert+0x67/0x18f
[ 1800.373389]  [<c126fd6a>] wait_for_common+0x8a/0xd9
[ 1800.373399]  [<c1028b75>] ? default_wake_function+0x0/0xd
[ 1800.373406]  [<c126fe3a>] wait_for_completion+0x12/0x14
[ 1800.373413]  [<c111d02f>] blk_execute_rq+0x76/0x8f
[ 1800.373420]  [<c111cf1c>] ? blk_end_sync_rq+0x0/0x28
[ 1800.373428]  [<c111cc3f>] ? blk_rq_append_bio+0x14/0x3b
[ 1800.373434]  [<c111ce95>] ? blk_rq_map_user+0x12e/0x1b5
[ 1800.373442]  [<c11201f8>] sg_io+0x269/0x343
[ 1800.373450]  [<c11204be>] scsi_cmd_ioctl+0x1ec/0x396
[ 1800.373457]  [<c11993cf>] ? get_device+0x13/0x18
[ 1800.373464]  [<c11aeabd>] ? sd_open+0x45/0x104
[ 1800.373472]  [<c11aea0e>] sd_ioctl+0x6b/0x8c
[ 1800.373479]  [<c111e2fe>] __blkdev_driver_ioctl+0x66/0x87
[ 1800.373486]  [<c111ebb2>] blkdev_ioctl+0x5fe/0x62c
[ 1800.373495]  [<c1061e3f>] ? filemap_fault+0xb5/0x2fc
[ 1800.373503]  [<c10a7bde>] block_ioctl+0x2a/0x32
[ 1800.373509]  [<c10a7bde>] ? block_ioctl+0x2a/0x32
[ 1800.373518]  [<c109345a>] vfs_ioctl+0x27/0x91
[ 1800.373524]  [<c10a7bb4>] ? block_ioctl+0x0/0x32
[ 1800.373531]  [<c109398d>] do_vfs_ioctl+0x42a/0x45b
[ 1800.373538]  [<c10727b7>] ? handle_mm_fault+0x3d8/0x7c2
[ 1800.373547]  [<c1088e98>] ? fsnotify_modify+0x4f/0x5a
[ 1800.373555]  [<c101b8f9>] ? do_page_fault+0x1e4/0x243
[ 1800.373562]  [<c10939ec>] sys_ioctl+0x2e/0x48
[ 1800.373570]  [<c1002750>] sysenter_do_call+0x12/0x26



* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Douglas Gilbert @ 2010-09-24  0:05 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, IDE/ATA development list, linux-scsi

Mark,
If you issued the SG_IO ioctl with a timeout of at
least 66 minutes (expressed in milliseconds) then
it looks like ata_scsi_queuecmd() has a problem.

The SCSI FORMAT UNIT command can take a long time.
It has an IMMED bit; when that bit is set, the command
returns as soon as the command and its parameters have
been received. The progress of the format can then be
polled by other commands. With the IMMED bit clear
the FORMAT UNIT returns when the format completes
(i.e. after an hour or so). Either way, setting the
timeout on the SG_IO ioctl works as expected.
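A minimal user-space sketch of the timeout Doug describes, assuming the Linux `<scsi/sg.h>` header: the `timeout` field of `sg_io_hdr_t` is in milliseconds, so a 66-minute operation needs a value of about 3,960,000. The helper name `prepare_long_sgio` is hypothetical; it only fills in the header and does not issue the ioctl itself.

```c
#include <limits.h>
#include <string.h>
#include <scsi/sg.h>

/* Hypothetical helper: fill an sg_io_hdr_t for a non-data SCSI command
 * (for example a 6-byte FORMAT UNIT CDB, opcode 0x04) that may take
 * `minutes` minutes to finish. Returns 0 on success, or -1 if the
 * timeout would not fit in the header's millisecond field. */
static inline int prepare_long_sgio(sg_io_hdr_t *hdr, unsigned char *cdb,
                                    unsigned char cdb_len,
                                    unsigned char *sense,
                                    unsigned char sense_len,
                                    unsigned int minutes)
{
    if (minutes > UINT_MAX / 60000u)
        return -1;                         /* would overflow unsigned int ms */

    memset(hdr, 0, sizeof(*hdr));
    hdr->interface_id    = 'S';            /* required marker for SG_IO */
    hdr->dxfer_direction = SG_DXFER_NONE;  /* no data phase */
    hdr->cmdp            = cdb;
    hdr->cmd_len         = cdb_len;
    hdr->sbp             = sense;          /* sense data is copied here */
    hdr->mx_sb_len       = sense_len;
    hdr->timeout         = minutes * 60000u; /* SG_IO timeout is in ms */
    return 0;
}
```

The caller would then pass the header to `ioctl(fd, SG_IO, &hdr)` on an open device node. If the timeout is left at a typical default of a minute or so, the request would time out and be handed to error handling long before a 66-minute operation completes.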

Doug Gilbert


On 10-09-23 07:41 PM, Mark Lord wrote:
> What's the purpose of this stack dump,
> and how can it be prevented in this NORMAL situation??
>
> The command was "hdparm --security-erase NULL /dev/sdb",
> which takes about 66 minutes to complete on this particular drive.
>
> I don't see any obvious way for the task to mark itself
> as needing longer than 120 secs to complete the operation.
>
> Thanks.
>



* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Stan Hoeppner @ 2010-09-24  0:51 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, IDE/ATA development list, linux-scsi

Mark Lord put forth on 9/23/2010 6:41 PM:
> What's the purpose of this stack dump,
> and how can it be prevented in this NORMAL situation??
> 
> The command was "hdparm --security-erase NULL /dev/sdb",
> which takes about 66 minutes to complete on this particular drive.
> 
> I don't see any obvious way for the task to mark itself
> as needing longer than 120 secs to complete the operation.

~$ man hdparm

ATA Security Feature Set

These switches are DANGEROUS to experiment with,
and might not work with every kernel. USE AT YOUR OWN RISK.

--security-erase PWD

THIS FEATURE IS EXPERIMENTAL AND NOT WELL TESTED.  USE  AT  YOUR
              OWN RISK.

-- 
Stan


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Kyle McMartin @ 2010-09-24  1:37 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Mark Lord, Linux Kernel, IDE/ATA development list, linux-scsi

On Thu, Sep 23, 2010 at 07:51:48PM -0500, Stan Hoeppner wrote:
> ~$ man hdparm
> 

kyle@dreadnought ~ $ hdparm
hdparm - get/set hard disk parameters - version v9.27, by Mark Lord.
                                                       ^^^^^^^^^^^^
[...]


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Maxim Levitsky @ 2010-09-24  1:58 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, IDE/ATA development list, linux-scsi

On Thu, 2010-09-23 at 19:41 -0400, Mark Lord wrote:
> What's the purpose of this stack dump,
> and how can it be prevented in this NORMAL situation??
> 
> The command was "hdparm --security-erase NULL /dev/sdb",
> which takes about 66 minutes to complete on this particular drive.
> 
> I don't see any obvious way for the task to mark itself
> as needing longer than 120 secs to complete the operation.

There is another valid use case.
If an application calls sys_sync while another application is writing
to the filesystem, the sync can easily take more than 2 minutes.
The same happens if the block device is slow (as it was with some flash cards).

Indeed, it would be good to have something like touch_softlockup_watchdog()
for the hung task detector.

Best regards,
	Maxim Levitsky



* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Kyle McMartin @ 2010-09-24  2:08 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, IDE/ATA development list, linux-scsi

On Thu, Sep 23, 2010 at 07:41:28PM -0400, Mark Lord wrote:
> What's the purpose of this stack dump,
> and how can it be prevented in this NORMAL situation??
> 
> The command was "hdparm --security-erase NULL /dev/sdb",
> which takes about 66 minutes to complete on this particular drive.
> 
> I don't see any obvious way for the task to mark itself
> as needing longer than 120 secs to complete the operation.
> 

It's an excellent question. A large number of bug reports I see against
Fedora are the result of this firing on some long-running task (like a
huge 'sync'), with the user getting scared by the message, assuming the
universe was imploding, and reporting a bug.

I'd suggest we add a new task_struct flag for it, but we appear to be
out of bits! Perhaps we can abuse one of the other ones (PF_FROZEN seems
the likely choice since the watchdog bails out if it sees it, but that
likely has other ramifications.)
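To see why PF_FROZEN, or any periodic wakeup, silences the warning, here is a deliberately simplified user-space model of the detector's per-task test. It assumes the check works roughly like `check_hung_task()` in kernel/hung_task.c of this era: a task in uninterruptible sleep is reported only if its context-switch count has not changed since the previous scan. The struct and function names are hypothetical stand-ins, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of one task as seen by the detector. */
struct model_task {
    bool uninterruptible;            /* task is in D (TASK_UNINTERRUPTIBLE) */
    bool frozen;                     /* stands in for the PF_FROZEN flag */
    unsigned long nvcsw, nivcsw;     /* voluntary/involuntary switch counts */
    unsigned long last_switch_count; /* count recorded at the previous scan */
};

/* One scan of one task (scans run once per timeout period). Returns true
 * if a "blocked for more than N seconds" report would be printed. */
static inline bool model_check_hung_task(struct model_task *t)
{
    unsigned long switch_count = t->nvcsw + t->nivcsw;

    if (!t->uninterruptible || t->frozen)
        return false;                        /* detector skips these tasks */
    if (switch_count != t->last_switch_count) {
        t->last_switch_count = switch_count; /* task ran since last scan */
        return false;
    }
    return true;                             /* slept in D for a full period */
}
```

In this model, each time a timed wait expires and the task goes back to sleep it makes a voluntary context switch, so `nvcsw` advances and the second branch resets the check. A single untimed wait never switches, so the third scan branch eventually fires.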

--Kyle


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Mark Lord @ 2010-09-24  2:53 UTC (permalink / raw)
  To: dgilbert; +Cc: Linux Kernel, IDE/ATA development list, linux-scsi

On 10-09-23 08:05 PM, Douglas Gilbert wrote:
> Mark,
> If you issued the SG_IO ioctl with a timeout of at
> least 66 minutes (expressed in milliseconds) then
> it looks like ata_scsi_queuecmd() has a problem.
..

Mmm.. more like blk_execute_rq() perhaps.
That's where the wait_for_completion(&wait) call is at.

Perhaps I should change it to wait in smaller increments,
so that the lockup detection doesn't trigger on it..

Doing that seems rather wasteful, though.

Note that this is the ATA "SECURITY ERASE" command,
which doesn't have an "immed" bit to toggle.
So one must wait for it to complete.
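One rough way to gauge how wasteful waiting "in smaller increments" is: count the extra wakeups. This sketch assumes a wait interval of half the 120-second default and uses the 66-minute erase time from the original report; the function name is hypothetical.

```c
/* Hypothetical helper: approximate number of timed-out waits a thread
 * makes while blocked for `op_secs` seconds if it wakes every
 * `interval_secs` seconds. */
static inline unsigned long extra_wakeups(unsigned long op_secs,
                                          unsigned long interval_secs)
{
    if (interval_secs == 0)
        return 0;                 /* no periodic wakeup configured */
    /* Each full interval that elapses before completion is one timeout. */
    return op_secs / interval_secs;
}
```

For a 66-minute (3960-second) erase with a 60-second interval this gives about 66 wakeups, roughly one per minute, which is small next to an operation of that length.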

cheers


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Stan Hoeppner @ 2010-09-24  3:48 UTC (permalink / raw)
  To: Kyle McMartin
  Cc: Mark Lord, Linux Kernel, IDE/ATA development list, linux-scsi

Kyle McMartin put forth on 9/23/2010 8:37 PM:
> On Thu, Sep 23, 2010 at 07:51:48PM -0500, Stan Hoeppner wrote:
>> ~$ man hdparm
>>
> 
> kyle@dreadnought ~ $ hdparm
> hdparm - get/set hard disk parameters - version v9.27, by Mark Lord.
>                                                        ^^^^^^^^^^^^
> [...]

Please pardon me while I wipe this egg off my face. :)

My apologies Mark.  Please feel free to publicly flog me if you wish.

-- 
Stan


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Mark Lord @ 2010-09-24  3:51 UTC (permalink / raw)
  To: dgilbert
  Cc: Linux Kernel, IDE/ATA development list, linux-scsi, Jens Axboe,
	Joel Becker

On 10-09-23 10:53 PM, Mark Lord wrote:
> On 10-09-23 08:05 PM, Douglas Gilbert wrote:
>> Mark,
>> If you issued the SG_IO ioctl with a timeout of at
>> least 66 minutes (expressed in milliseconds) then
>> it looks like ata_scsi_queuecmd() has a problem.
> ..
>
> Mmm.. more like blk_execute_rq() perhaps.
> That's where the wait_for_completion(&wait) call is at.
>
> Perhaps I should change it to wait in smaller increments,
> so that the lockup detection doesn't trigger on it..
..

This patch (below) seems to work.

Does this look kosher enough for me to roll it up
as a proper patch submission?   Jens?  Joel?

The problem, again, is that the hangcheck timer fires
inappropriately during very long SG_IO commands,
such as --security-erase operations which take minutes/hours to complete.

Thanks

--- old/block/blk-exec.c	2010-08-26 19:47:12.000000000 -0400
+++ linux/block/blk-exec.c	2010-09-23 23:41:47.478826002 -0400
@@ -95,7 +95,8 @@
  
  	rq->end_io_data = &wait;
  	blk_execute_rq_nowait(q, bd_disk, rq, at_head, blk_end_sync_rq);
-	wait_for_completion(&wait);
+	while (!wait_for_completion_timeout(&wait, (sysctl_hung_task_timeout_secs >> 1) * HZ))
+		; /* periodic wakeup prevents "hung_task" warnings */
  
  	if (rq->errors)
  		err = -EIO;


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Douglas Gilbert @ 2010-09-24  4:41 UTC (permalink / raw)
  To: Mark Lord; +Cc: Linux Kernel, IDE/ATA development list, linux-scsi

On 10-09-23 10:53 PM, Mark Lord wrote:
> On 10-09-23 08:05 PM, Douglas Gilbert wrote:
>> Mark,
>> If you issued the SG_IO ioctl with a timeout of at
>> least 66 minutes (expressed in milliseconds) then
>> it looks like ata_scsi_queuecmd() has a problem.
> ..
>
> Mmm.. more like blk_execute_rq() perhaps.
> That's where the wait_for_completion(&wait) call is at.
>
> Perhaps I should change it to wait in smaller increments,
> so that the lockup detection doesn't trigger on it..
>
> Doing that seems rather wasteful, though.
>
> Note that this is the ATA "SECURITY ERASE" command,
> which doesn't have an "immed" bit to toggle.
> So one must wait for it to complete.

And I have seen another issue with long (SCSI) commands.
During a FORMAT UNIT another pesky program might
have nothing better to do than periodically send out
things like TEST UNIT READY (check a disk is ready
for IO) which will have a normal timeout on it (e.g.
60 seconds). With a format underway, the HBA or the device
may not accept the TEST UNIT READY so its timeout expires
and the error handling code thinks the device is unwell
and decides to reset it.

There is a useful flag in the scsi_device structure called
no_uld_attach which hides a device from the sd driver
(assuming it is a disk). Then the disk can only be accessed
via the bsg or sg driver. And those other pesky programs
can't find the disk in question. I'm not aware of a way
to control that flag from user space.

Doug Gilbert


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Douglas Gilbert @ 2010-09-24  5:02 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Kyle McMartin, Mark Lord, Linux Kernel, IDE/ATA development list,
	linux-scsi

On 10-09-23 11:48 PM, Stan Hoeppner wrote:
> Kyle McMartin put forth on 9/23/2010 8:37 PM:
>> On Thu, Sep 23, 2010 at 07:51:48PM -0500, Stan Hoeppner wrote:
>>> ~$ man hdparm
>>>
>>
>> kyle@dreadnought ~ $ hdparm
>> hdparm - get/set hard disk parameters - version v9.27, by Mark Lord.
>>                                                         ^^^^^^^^^^^^
>> [...]
>
> Please pardon me while I wipe this egg off my face. :)
>
> My apologies Mark.  Please feel free to publicly flog me if you wish.

I wonder if Mark ever watched a program called
Sledge Hammer! whose signature line was "Trust me.
I know what I'm doing." Recently I needed to plead
with hdparm to corrupt a sector as follows:

# hdparm --make-bad-sector 976770000 --yes-i-know-what-i-am-doing /dev/sdb

Doug Gilbert


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Mark Lord @ 2010-09-24  5:31 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Kyle McMartin, Linux Kernel, IDE/ATA development list, linux-scsi

On 10-09-23 11:48 PM, Stan Hoeppner wrote:
> Kyle McMartin put forth on 9/23/2010 8:37 PM:
>> On Thu, Sep 23, 2010 at 07:51:48PM -0500, Stan Hoeppner wrote:
>>> ~$ man hdparm
>>>
>>
>> kyle@dreadnought ~ $ hdparm
>> hdparm - get/set hard disk parameters - version v9.27, by Mark Lord.
>>                                                         ^^^^^^^^^^^^
>> [...]
>
> Please pardon me while I wipe this egg off my face. :)
>
> My apologies Mark.  Please feel free to publicly flog me if you wish.

Chuckle.  :)

But you did manage to prompt me to remove that obsolete warning
from the --security-* commands in hdparm.  They are rather well tested
at this point in the game.

hdparm-9.31 is now released, with some fixes to --security,
and with the nasty warnings mostly removed.

Thanks!



* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Stan Hoeppner @ 2010-09-24  6:22 UTC (permalink / raw)
  To: Mark Lord
  Cc: Kyle McMartin, Linux Kernel, IDE/ATA development list, linux-scsi

Mark Lord put forth on 9/24/2010 12:31 AM:
> On 10-09-23 11:48 PM, Stan Hoeppner wrote:
>> Kyle McMartin put forth on 9/23/2010 8:37 PM:
>>> On Thu, Sep 23, 2010 at 07:51:48PM -0500, Stan Hoeppner wrote:
>>>> ~$ man hdparm
>>>>
>>>
>>> kyle@dreadnought ~ $ hdparm
>>> hdparm - get/set hard disk parameters - version v9.27, by Mark Lord.
>>>                                                         ^^^^^^^^^^^^
>>> [...]
>>
>> Please pardon me while I wipe this egg off my face. :)
>>
>> My apologies Mark.  Please feel free to publicly flog me if you wish.
> 
> Chuckle.  :)
> 
> But you did manage to prompt me to remove that obsolete warning
> from the --security-* commands in hdparm.  They are rather well tested
> at this point in the game.
> 
> hdparm-9.31 is now released, with some fixes to --security,
> and with the nasty warnings mostly removed.
> 
> Thanks!

I guess it's a good thing when one can commit such a public blunder and
still manage to be somewhat helpful?  If so I don't feel 'quite' so
sheepish now.  :)

NOTE to $self:  when you subscribe to Linux dev lists, the odds are
_much_ greater that people who actually write the software you use _are_
the people posting messages.  Perform SENDER_IDENTITY_CHECK and
SANITY_CHECK routines in the future before referring an author to his
own documentation. ;)

-- 
Stan


* hdparm-9.32 released: recommended upgrade
From: Mark Lord @ 2010-09-24  6:30 UTC (permalink / raw)
  To: IDE/ATA development list; +Cc: Linux Kernel, linux-scsi

On 10-09-24 01:31 AM, Mark Lord wrote:

> hdparm-9.31 is now released, with some fixes to --security,
> and with the nasty warnings mostly removed.

And quickly followed by hdparm-9.32.  Upgrade is RECOMMENDED.

Apparently a number of commands in hdparm have been b0rked since 9.27+,
but nobody bothered to email me about them.  Doh!

Fixed.

(but the libata/libahci AHCI result_tf problem is still an issue).


* Re: "blocked for more than 120 secs" --> a valid situation, how to prevent?
From: Jens Axboe @ 2010-09-24  9:12 UTC (permalink / raw)
  To: Mark Lord
  Cc: dgilbert, Linux Kernel, IDE/ATA development list, linux-scsi,
	Joel Becker

On 2010-09-24 05:51, Mark Lord wrote:
> On 10-09-23 10:53 PM, Mark Lord wrote:
>> On 10-09-23 08:05 PM, Douglas Gilbert wrote:
>>> Mark,
>>> If you issued the SG_IO ioctl with a timeout of at
>>> least 66 minutes (expressed in milliseconds) then
>>> it looks like ata_scsi_queuecmd() has a problem.
>> ..
>>
>> Mmm.. more like blk_execute_rq() perhaps.
>> That's where the wait_for_completion(&wait) call is at.
>>
>> Perhaps I should change it to wait in smaller increments,
>> so that the lockup detection doesn't trigger on it..
> ..
> 
> This patch (below) seems to work.
> 
> Does this look kosher enough for me to roll it up
> as a proper patch submission?   Jens?  Joel?

Ideally it would be nice to just pass the info down that it should not
complain, since waiting > 120 seconds (or whatever the timeout is set
to) is expected by the caller in some cases.

But your patch is simple enough and it gets the job done. I will queue
it up for .37 if you send a properly formatted and signed-off-by
version.

-- 
Jens Axboe


* [PATCH] block: Prevent hang_check firing during long I/O
From: Mark Lord @ 2010-09-24 13:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dgilbert, Linux Kernel, IDE/ATA development list, linux-scsi,
	Joel Becker

During long I/O operations, the hang_check timer may fire,
triggering stack dumps that unnecessarily alarm the user.

E.g.  hdparm --security-erase NULL /dev/sdb  ## can take *hours* to complete

So, if hang_check is armed, we should wake up periodically
to prevent it from triggering.  This patch uses a wake-up interval
equal to half the hang_check timer period, which keeps overhead low enough.

Signed-off-by: Mark Lord <mlord@pobox.com>

--- old/block/blk-exec.c	2010-09-20 19:56:53.000000000 -0400
+++ linux/block/blk-exec.c	2010-09-24 09:43:32.342604574 -0400
@@ -80,6 +80,7 @@
 	DECLARE_COMPLETION_ONSTACK(wait);
 	char sense[SCSI_SENSE_BUFFERSIZE];
 	int err = 0;
+	unsigned long hang_check;
 
 	/*
 	 * we need an extra reference to the request, so we can look at
@@ -95,7 +96,13 @@
 
 	rq->end_io_data = &wait;
 	blk_execute_rq_nowait(q, bd_disk, rq, at_head, blk_end_sync_rq);
-	wait_for_completion(&wait);
+
+	/* Prevent hang_check timer from firing at us during very long I/O */
+	hang_check = sysctl_hung_task_timeout_secs;
+	if (hang_check)
+		while (!wait_for_completion_timeout(&wait, hang_check * (HZ/2)));
+	else
+		wait_for_completion(&wait);
 
 	if (rq->errors)
 		err = -EIO;
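The draft patch earlier in the thread and this final version compute the wait interval differently: the draft halves the seconds value before multiplying by HZ, while the final version multiplies by HZ/2 and skips the timed loop entirely when the sysctl is 0. A small sketch of just that arithmetic, with hypothetical function names and HZ passed as a parameter rather than taken from kernel headers:

```c
/* Draft version: (sysctl_hung_task_timeout_secs >> 1) * HZ */
static inline unsigned long draft_interval(unsigned long timeout_secs,
                                           unsigned long hz)
{
    return (timeout_secs >> 1) * hz;   /* seconds halved first, then scaled */
}

/* Final version: hang_check * (HZ/2), used only when hang_check != 0 */
static inline unsigned long final_interval(unsigned long timeout_secs,
                                           unsigned long hz)
{
    return timeout_secs * (hz / 2);    /* scaled by half a second's jiffies */
}
```

With the default 120 seconds and HZ=250, both give 15000 jiffies (60 seconds). With the sysctl set to 1 (or 0), the draft yields a 0-jiffy timeout, so each `wait_for_completion_timeout()` call would presumably return after at most about a tick and the loop would degrade into near-polling; the final version yields 125 jiffies for 1 and falls back to a plain `wait_for_completion()` for 0.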




* Re: [PATCH] block: Prevent hang_check firing during long I/O
From: Jens Axboe @ 2010-09-24 13:52 UTC (permalink / raw)
  To: Mark Lord
  Cc: dgilbert, Linux Kernel, IDE/ATA development list, linux-scsi,
	Joel Becker

On 2010-09-24 15:51, Mark Lord wrote:
> During long I/O operations, the hang_check timer may fire,
> trigger stack dumps that unnecessarily alarm the user.
> 
> Eg.  hdparm --security-erase NULL /dev/sdb  ## can take *hours* to complete
> 
> So, if hang_check is armed, we should wake up periodically
> to prevent it from triggering.  This patch uses a wake-up interval
> equal to half the hang_check timer period, which keeps overhead low enough.

Applied, thanks.

-- 
Jens Axboe

