В сообщении от Wednesday 13 June 2018 07:03:47 jianchao.wang написал(а):
> On 06/12/2018 11:28 PM, Bart Van Assche wrote:
> > On Sat, 2018-06-09 at 08:02 -0700, Randy Dunlap wrote:
> >> On 06/09/2018 06:06 AM, Andrew Randrianasulu wrote:
> >>> Hello, everyone!
> >>>
> >>> I was trying to test not-yet-completed 4.18-rc1, and run into this bug,
> >>> probably due to HAL (yes, I still use this) trying to poll empty DVD-RW
> >>> drive:
> >>>
> >>> [   35.040167] ------------[ cut here ]------------
> >>> [   35.040171] kernel BUG at drivers/scsi/scsi_error.c:197!
> >>> [   35.040196] invalid opcode: 0000 [#1] SMP NOPTI
> >>> [   35.040210] CPU: 2 PID: 117 Comm: kworker/2:1H Not tainted
> >>> 4.17.0-x64-08428-g7d3bf613e99a-dirty #18
> >>> [   35.040225] Hardware name: MSI MS-7693/970A-G46 (MS-7693), BIOS V2.8
> >>> 01/08/2016
> >>> [   35.040244] Workqueue: kblockd blk_mq_timeout_work
> >>> [   35.040283] RIP: 0010:scsi_times_out+0xe9/0x1a6 [scsi_mod]
> >>> [   35.040287] Code: 03 76 18 48 c7 c2 f9 c2 01 a0 48 89 ee 48 c7 c7 14
> >>> be 01 a0 31 c0 e8 76 ad 00 00 48 8b 83 98 01 00 00 a8 01 0f 84 a5 00 00
> >>> 00 <0f> 0b 49 8b 7c 24 38 e8 8f ba 4b e1 41 83 bc 24 2c 01
> >>>  00 00 ff 74
> >>> [   35.040312] RSP: 0018:ffffc90000293d98 EFLAGS: 00010202
> >>> [   35.040318] RAX: ffff88031f240605 RBX: ffff88031eed8500 RCX:
> >>> 00000000fffefd40 [   35.040323] RDX: ffffc90000293e80 RSI:
> >>> 0000000000002007 RDI: ffff88031eed8638 [   35.040327] RBP:
> >>> ffff88031eed8638 R08: ffff88031eed85d8 R09: 0000000000000000 [  
> >>> 35.040332] R10: 0000000000000040 R11: fefefefefefefeff R12:
> >>> ffff88031f2a8800 [   35.040336] R13: ffff88031f2a8800 R14:
> >>> ffff88031f29e910 R15: 0000000000000001 [   35.040341] FS: 
> >>> 0000000000000000(0000) GS:ffff88032ed00000(0000) knlGS:0000000000000000
> >>> [   35.040346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [   35.040350] CR2: 00000000f7ea83c0 CR3: 000000031e194000 CR4:
> >>> 00000000000406e0 [   35.040354] Call Trace:
> >>> [   35.040360]  blk_mq_check_expired+0xdc/0x10c
> >>> [   35.040365]  bt_iter+0x42/0x45
> >>> [   35.040369]  __sbitmap_for_each_set.constprop.12+0x83/0xac
> >>> [   35.040374]  ? blk_mq_update_nr_requests+0xad/0xad
> >>> [   35.040595]  ? blk_mq_exit_hctx+0xda/0xda
> >>> [   35.041595]  blk_mq_queue_tag_busy_iter+0xa3/0xb4
> >>> [   35.042578]  ? blk_mq_exit_hctx+0xda/0xda
> >>> [   35.043550]  blk_mq_timeout_work+0x6b/0xca
> >>> [   35.044505]  process_one_work+0x17c/0x2a6
> >>> [   35.045445]  worker_thread+0x19f/0x243
> >>> [   35.046393]  ? rescuer_thread+0x262/0x262
> >>> [   35.047346]  kthread+0x100/0x108
> >>> [   35.048290]  ? kthread_destroy_worker+0x3e/0x3e
> >>> [   35.049242]  ret_from_fork+0x27/0x50
> >>> [   35.050192] Modules linked in: bridge stp llc bnep rfcomm hidp
> >>> snd_aloop ipv6
>
> .....
>
> > Is this something that happens systematically or sporadically?
> > Additionally, have you verified whether or not this can be triggered with
> > kernel v4.17? I'm wondering whether or not this is a regression that was
> > introduced in the merge window.
>
> I suspect this is due to we could expire a same request twice or even more.
> For scsi mid-layer, it return BLK_EH_DONE from .timeout, in fact, the
> request is not completed there, but just queue a delayed abort_work
> (HZ/100). If the blk_mq_timeout_work runs again before the abort_work, the
> request will be timed out again, because there is not any mark on it to
> identify this request has been timed out.
>
> Would please try the patch attached on to see whether this issue could be
> fixed ? (this patch only works for scsi device currently)

with this patch on top of 4.17.0-x64-11730-gf5b7769eb040-dirty I can 
suspend/resume my machine again, no errors in dmesg (attached)

Thanks!

>
> Thanks
> Jianchao