В сообщении от Wednesday 13 June 2018 07:03:47 jianchao.wang написал(а): > On 06/12/2018 11:28 PM, Bart Van Assche wrote: > > On Sat, 2018-06-09 at 08:02 -0700, Randy Dunlap wrote: > >> On 06/09/2018 06:06 AM, Andrew Randrianasulu wrote: > >>> Hello, everyone! > >>> > >>> I was trying to test not-yet-completed 4.18-rc1, and run into this bug, > >>> probably due to HAL (yes, I still use this) trying to poll empty DVD-RW > >>> drive: > >>> > >>> [ 35.040167] ------------[ cut here ]------------ > >>> [ 35.040171] kernel BUG at drivers/scsi/scsi_error.c:197! > >>> [ 35.040196] invalid opcode: 0000 [#1] SMP NOPTI > >>> [ 35.040210] CPU: 2 PID: 117 Comm: kworker/2:1H Not tainted > >>> 4.17.0-x64-08428-g7d3bf613e99a-dirty #18 > >>> [ 35.040225] Hardware name: MSI MS-7693/970A-G46 (MS-7693), BIOS V2.8 > >>> 01/08/2016 > >>> [ 35.040244] Workqueue: kblockd blk_mq_timeout_work > >>> [ 35.040283] RIP: 0010:scsi_times_out+0xe9/0x1a6 [scsi_mod] > >>> [ 35.040287] Code: 03 76 18 48 c7 c2 f9 c2 01 a0 48 89 ee 48 c7 c7 14 > >>> be 01 a0 31 c0 e8 76 ad 00 00 48 8b 83 98 01 00 00 a8 01 0f 84 a5 00 00 > >>> 00 <0f> 0b 49 8b 7c 24 38 e8 8f ba 4b e1 41 83 bc 24 2c 01 > >>> 00 00 ff 74 > >>> [ 35.040312] RSP: 0018:ffffc90000293d98 EFLAGS: 00010202 > >>> [ 35.040318] RAX: ffff88031f240605 RBX: ffff88031eed8500 RCX: > >>> 00000000fffefd40 [ 35.040323] RDX: ffffc90000293e80 RSI: > >>> 0000000000002007 RDI: ffff88031eed8638 [ 35.040327] RBP: > >>> ffff88031eed8638 R08: ffff88031eed85d8 R09: 0000000000000000 [ > >>> 35.040332] R10: 0000000000000040 R11: fefefefefefefeff R12: > >>> ffff88031f2a8800 [ 35.040336] R13: ffff88031f2a8800 R14: > >>> ffff88031f29e910 R15: 0000000000000001 [ 35.040341] FS: > >>> 0000000000000000(0000) GS:ffff88032ed00000(0000) knlGS:0000000000000000 > >>> [ 35.040346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [ 35.040350] CR2: 00000000f7ea83c0 CR3: 000000031e194000 CR4: > >>> 00000000000406e0 [ 35.040354] Call Trace: > >>> [ 35.040360] blk_mq_check_expired+0xdc/0x10c > >>> [ 35.040365] bt_iter+0x42/0x45 > >>> [ 35.040369] __sbitmap_for_each_set.constprop.12+0x83/0xac > >>> [ 35.040374] ? blk_mq_update_nr_requests+0xad/0xad > >>> [ 35.040595] ? blk_mq_exit_hctx+0xda/0xda > >>> [ 35.041595] blk_mq_queue_tag_busy_iter+0xa3/0xb4 > >>> [ 35.042578] ? blk_mq_exit_hctx+0xda/0xda > >>> [ 35.043550] blk_mq_timeout_work+0x6b/0xca > >>> [ 35.044505] process_one_work+0x17c/0x2a6 > >>> [ 35.045445] worker_thread+0x19f/0x243 > >>> [ 35.046393] ? rescuer_thread+0x262/0x262 > >>> [ 35.047346] kthread+0x100/0x108 > >>> [ 35.048290] ? kthread_destroy_worker+0x3e/0x3e > >>> [ 35.049242] ret_from_fork+0x27/0x50 > >>> [ 35.050192] Modules linked in: bridge stp llc bnep rfcomm hidp > >>> snd_aloop ipv6 > > ..... > > > Is this something that happens systematically or sporadically? > > Additionally, have you verified whether or not this can be triggered with > > kernel v4.17? I'm wondering whether or not this is a regression that was > > introduced in the merge window. > > I suspect this is due to we could expire a same request twice or even more. > For scsi mid-layer, it return BLK_EH_DONE from .timeout, in fact, the > request is not completed there, but just queue a delayed abort_work > (HZ/100). If the blk_mq_timeout_work runs again before the abort_work, the > request will be timed out again, because there is not any mark on it to > identify this request has been timed out. > > Would please try the patch attached on to see whether this issue could be > fixed ? (this patch only works for scsi device currently) with this patch on top of 4.17.0-x64-11730-gf5b7769eb040-dirty I can suspend/resume my machine again, no errors in dmesg (attached) Thanks! > > Thanks > Jianchao