* General protection fault with use_blk_mq=1.
From: Zephaniah E. Loss-Cutler-Hull @ 2018-03-28 23:03 UTC
To: linux-kernel, linux-block, linux-scsi

I am not subscribed to any of the lists on the To list here, please CC
me on any replies.

I am encountering a fairly consistent crash anywhere from 15 minutes to
12 hours after boot with
scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1

The crash looks like:

[ 5466.075993] general protection fault: 0000 [#1] PREEMPT SMP PTI
[ 5466.075997] Modules linked in: esp4 xfrm4_mode_tunnel fuse usblp uvcvideo pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ip6table_filter ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables intel_rapl joydev serio_raw wmi_bmof iwldvm iwlwifi shpchp kvm_intel kvm irqbypass autofs4 algif_skcipher nls_iso8859_1 nls_cp437 crc32_pclmul ghash_clmulni_intel
[ 5466.076022] CPU: 3 PID: 10573 Comm: pool Tainted: G O 4.15.13-f1-dirty #148
[ 5466.076024] Hardware name: Hewlett-Packard HP EliteBook Folio 9470m/18DF, BIOS 68IBD Ver. F.44 05/22/2013
[ 5466.076029] RIP: 0010:percpu_counter_add_batch+0x2b/0xb0
[ 5466.076031] RSP: 0018:ffffa556c47afb58 EFLAGS: 00010002
[ 5466.076033] RAX: ffff95cda87ce018 RBX: ffff95cda87cdb68 RCX: 0000000000000000
[ 5466.076034] RDX: 000000003fffffff RSI: ffffffff896495c4 RDI: ffffffff895b2bed
[ 5466.076036] RBP: 000000003fffffff R08: 0000000000000000 R09: ffff95cb7d5f8148
[ 5466.076037] R10: 0000000000000200 R11: 0000000000000000 R12: 0000000000000001
[ 5466.076038] R13: ffff95cda87ce088 R14: ffff95cda6ebd100 R15: ffffa556c47afc58
[ 5466.076040] FS: 00007f25f5305700(0000) GS:ffff95cdbeac0000(0000) knlGS:0000000000000000
[ 5466.076042] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5466.076043] CR2: 00007f25e807e0a8 CR3: 00000003ed5a6001 CR4: 00000000001606e0
[ 5466.076044] Call Trace:
[ 5466.076050] bfqg_stats_update_io_add+0x58/0x100
[ 5466.076055] bfq_insert_requests+0xec/0xd80
[ 5466.076059] ? blk_rq_append_bio+0x8f/0xa0
[ 5466.076061] ? blk_rq_map_user_iov+0xc3/0x1d0
[ 5466.076065] blk_mq_sched_insert_request+0xa3/0x130
[ 5466.076068] blk_execute_rq+0x3a/0x50
[ 5466.076070] sg_io+0x197/0x3e0
[ 5466.076073] ? dput+0xca/0x210
[ 5466.076077] ? mntput_no_expire+0x11/0x1a0
[ 5466.076079] scsi_cmd_ioctl+0x289/0x400
[ 5466.076082] ? filename_lookup+0xe1/0x170
[ 5466.076085] sd_ioctl+0xc7/0x1a0
[ 5466.076088] blkdev_ioctl+0x4d4/0x8c0
[ 5466.076091] block_ioctl+0x39/0x40
[ 5466.076094] do_vfs_ioctl+0x92/0x5e0
[ 5466.076097] ? __fget+0x73/0xc0
[ 5466.076099] SyS_ioctl+0x74/0x80
[ 5466.076102] do_syscall_64+0x60/0x110
[ 5466.076106] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 5466.076109] RIP: 0033:0x7f25f75fef47
[ 5466.076110] RSP: 002b:00007f25f53049a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 5466.076112] RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f25f75fef47
[ 5466.076114] RDX: 00007f25f53049b0 RSI: 0000000000002285 RDI: 000000000000000c
[ 5466.076115] RBP: 0000000000000010 R08: 00007f25e8007818 R09: 0000000000000200
[ 5466.076116] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 5466.076118] R13: 0000000000000000 R14: 00007f25f8a6b5e0 R15: 00007f25e80173e0
[ 5466.076120] Code: 41 55 49 89 fd bf 01 00 00 00 41 54 49 89 f4 55 89 d5 53 e8 18 e1 bb ff 48 c7 c7 c4 95 64 89 e8 dc e9 fb ff 49 8b 45 20 48 63 d5 <65> 8b 18 48 63 db 4c 01 e3 48 39 d3 7d 0a f7 dd 48 63 ed 48 39
[ 5466.076147] RIP: percpu_counter_add_batch+0x2b/0xb0 RSP: ffffa556c47afb58
[ 5466.076149] ---[ end trace 8d7eb80aafef4494 ]---
[ 5466.670153] note: pool[10573] exited with preempt_count 2

(I only have the one instance right this minute as a result of not
having remote syslog setup before now.)

This is clearly deep in the blk_mq code, and it goes away when I remove
the use_blk_mq kernel command line parameters.

My next obvious step is to try and disable the load of the vbox modules.

I can include the full dmesg output if it would be helpful.

The system is an older HP Ultrabook, and the root partition is, sda1 (a
SSD) -> a LUKS encrypted partition -> LVM -> BTRFS.

The kernel is a stock 4.15.11, however I only recently added the blk_mq
options, so while I can state that I have seen this on multiple kernels
in the 4.15.x series, I have not tested earlier kernels in this
configuration.

Looking through the code, I'd guess that this is dying inside
blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
is pointing at.

Regards,
Zephaniah E. Loss-Cutler-Hull.
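To make the reasoning at the end of the report easier to follow, here is a minimal Python sketch that pulls the faulting symbol and the reliable call-trace frames out of an oops like the one above. The helper name `parse_oops` and both regular expressions are illustrative assumptions, not an existing tool; the sample text is an excerpt of the trace in the report. Frames prefixed with `?` are ones the kernel marks as unreliable, so the sketch skips them.

```python
import re

# Excerpt of the oops from the report above (not the full trace).
OOPS = """\
[ 5466.076029] RIP: 0010:percpu_counter_add_batch+0x2b/0xb0
[ 5466.076044] Call Trace:
[ 5466.076050] bfqg_stats_update_io_add+0x58/0x100
[ 5466.076055] bfq_insert_requests+0xec/0xd80
[ 5466.076059] ? blk_rq_append_bio+0x8f/0xa0
[ 5466.076065] blk_mq_sched_insert_request+0xa3/0x130
[ 5466.076068] blk_execute_rq+0x3a/0x50
[ 5466.076070] sg_io+0x197/0x3e0
[ 5466.076109] RIP: 0033:0x7f25f75fef47
"""

# "[ timestamp] [? ]symbol+0xoff/0xsize" as printed in a call trace.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)
# Kernel-mode RIP line (code segment 0010) carrying a symbol name.
RIP = re.compile(r"RIP: 0010:([A-Za-z_][\w.]*)\+0x")


def parse_oops(text):
    """Return (faulting_symbol, [reliable frames, innermost first])."""
    rip = None
    frames = []
    in_trace = False
    for line in text.splitlines():
        m = RIP.search(line)
        if m and rip is None:
            rip = m.group(1)
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME.match(line)
            if not m:
                in_trace = False  # first non-frame line ends the trace
                continue
            if m.group(1) is None:  # skip "?" (unreliable) frames
                frames.append(m.group(2))
    return rip, frames


rip, frames = parse_oops(OOPS)
print("faulting function:", rip)
print("called from:", " <- ".join(frames))
```

Read this way, the fault is in `percpu_counter_add_batch`, reached from `bfqg_stats_update_io_add` on the BFQ insertion path of an `sg_io` request, which matches the guess in the report.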
* Re: General protection fault with use_blk_mq=1.
From: Jens Axboe @ 2018-03-29 1:02 UTC
To: Zephaniah E. Loss-Cutler-Hull, linux-kernel, linux-block, linux-scsi

On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
> I am not subscribed to any of the lists on the To list here, please CC
> me on any replies.
>
> I am encountering a fairly consistent crash anywhere from 15 minutes to
> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>
> The crash looks like:
>
> [ 5466.075993] general protection fault: 0000 [#1] PREEMPT SMP PTI
> [...]
> [ 5466.670153] note: pool[10573] exited with preempt_count 2
>
> (I only have the one instance right this minute as a result of not
> having remote syslog setup before now.)
>
> This is clearly deep in the blk_mq code, and it goes away when I remove
> the use_blk_mq kernel command line parameters.
>
> My next obvious step is to try and disable the load of the vbox modules.
>
> I can include the full dmesg output if it would be helpful.
>
> The system is an older HP Ultrabook, and the root partition is, sda1 (a
> SSD) -> a LUKS encrypted partition -> LVM -> BTRFS.
>
> The kernel is a stock 4.15.11, however I only recently added the blk_mq
> options, so while I can state that I have seen this on multiple kernels
> in the 4.15.x series, I have not tested earlier kernels in this
> configuration.
>
> Looking through the code, I'd guess that this is dying inside
> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
> is pointing at.

Leaving the whole thing here for Paolo - it's crashing off insertion of
a request coming out of SG_IO. Don't think we've seen this BFQ failure
case before.

You can mitigate this by switching the scsi-mq devices to mq-deadline
instead.

--
Jens Axboe
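As an illustration of the mq-deadline mitigation mentioned above, here is a minimal Python sketch that reads and switches a block device's I/O scheduler through sysfs. It assumes the standard `/sys/block/<dev>/queue/scheduler` file, where the active scheduler is shown in square brackets. The helper names and the sample listing are illustrative assumptions, and the set of schedulers actually offered depends on the kernel configuration.

```python
from pathlib import Path


def active_scheduler(listing):
    """Return the bracketed (active) entry of a queue/scheduler listing."""
    for name in listing.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None


def available_schedulers(listing):
    """Return every scheduler named in the listing, brackets stripped."""
    return [name.strip("[]") for name in listing.split()]


def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Select scheduler `name` for `device` (e.g. "sda"); requires root."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    if name not in available_schedulers(path.read_text()):
        raise ValueError(f"{name!r} is not offered for {device}")
    path.write_text(name)  # the kernel switches schedulers on write
    return active_scheduler(path.read_text())


# Illustrative listing only; real contents vary from system to system.
SAMPLE = "mq-deadline kyber [bfq] none"
print(active_scheduler(SAMPLE))
print(available_schedulers(SAMPLE))
```

On a live system the call would look like `set_scheduler("sda", "mq-deadline")`, run as root. The change does not persist across reboots unless it is reapplied, for example from a udev rule.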
* Re: General protection fault with use_blk_mq=1.
From: Zephaniah E. Loss-Cutler-Hull @ 2018-03-29 3:13 UTC
To: Jens Axboe, Zephaniah E. Loss-Cutler-Hull, linux-kernel, linux-block, linux-scsi

On 03/28/2018 06:02 PM, Jens Axboe wrote:
> On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>> I am not subscribed to any of the lists on the To list here, please CC
>> me on any replies.
>>
>> I am encountering a fairly consistent crash anywhere from 15 minutes to
>> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>>
>> The crash looks like:
>>
>>
>> Looking through the code, I'd guess that this is dying inside
>> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
>> is pointing at.
>
> Leaving the whole thing here for Paolo - it's crashing off insertion of
> a request coming out of SG_IO. Don't think we've seen this BFQ failure
> case before.
>
> You can mitigate this by switching the scsi-mq devices to mq-deadline
> instead.
>

I'm thinking that I should also be able to mitigate it by disabling
CONFIG_DEBUG_BLK_CGROUP.

That should remove that entire chunk of code.

Of course, that won't help if this is actually a symptom of a bigger
problem.

Regards,
Zephaniah E. Loss-Cutler-Hull.
* Re: General protection fault with use_blk_mq=1.
From: Jens Axboe @ 2018-03-29 3:22 UTC
To: Zephaniah E. Loss-Cutler-Hull, Zephaniah E. Loss-Cutler-Hull, linux-kernel, linux-block, linux-scsi
Cc: Paolo Valente

On 3/28/18 9:13 PM, Zephaniah E. Loss-Cutler-Hull wrote:
> On 03/28/2018 06:02 PM, Jens Axboe wrote:
>> On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>>> I am not subscribed to any of the lists on the To list here, please CC
>>> me on any replies.
>>>
>>> I am encountering a fairly consistent crash anywhere from 15 minutes to
>>> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>>>
>>> The crash looks like:
>>>
>>>
>>> Looking through the code, I'd guess that this is dying inside
>>> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
>>> is pointing at.
>>
>> Leaving the whole thing here for Paolo - it's crashing off insertion of
>> a request coming out of SG_IO. Don't think we've seen this BFQ failure
>> case before.
>>
>> You can mitigate this by switching the scsi-mq devices to mq-deadline
>> instead.
>>
>
> I'm thinking that I should also be able to mitigate it by disabling
> CONFIG_DEBUG_BLK_CGROUP.
>
> That should remove that entire chunk of code.
>
> Of course, that won't help if this is actually a symptom of a bigger
> problem.

Yes, it's not a given that it will fully mask the issue at hand. But
turning off BFQ has a much higher chance of working for you.

This time actually CC'ing Paolo.

--
Jens Axboe
* Re: General protection fault with use_blk_mq=1.
From: Paolo Valente @ 2018-03-29 5:13 UTC
To: Jens Axboe
Cc: Zephaniah E. Loss-Cutler-Hull, Zephaniah E. Loss-Cutler-Hull, Linux Kernel Mailing List, linux-block, linux-scsi

> On 29 Mar 2018, at 05:22, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/28/18 9:13 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>> On 03/28/2018 06:02 PM, Jens Axboe wrote:
>>> On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>>>> I am not subscribed to any of the lists on the To list here, please CC
>>>> me on any replies.
>>>>
>>>> I am encountering a fairly consistent crash anywhere from 15 minutes to
>>>> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>>>>
>>>> The crash looks like:
>>>>
>>>>
>>>> Looking through the code, I'd guess that this is dying inside
>>>> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
>>>> is pointing at.
>>>
>>> Leaving the whole thing here for Paolo - it's crashing off insertion of
>>> a request coming out of SG_IO. Don't think we've seen this BFQ failure
>>> case before.
>>>
>>> You can mitigate this by switching the scsi-mq devices to mq-deadline
>>> instead.
>>>
>>
>> I'm thinking that I should also be able to mitigate it by disabling
>> CONFIG_DEBUG_BLK_CGROUP.
>>
>> That should remove that entire chunk of code.
>>
>> Of course, that won't help if this is actually a symptom of a bigger
>> problem.
>
> Yes, it's not a given that it will fully mask the issue at hand. But
> turning off BFQ has a much higher chance of working for you.
>
> This time actually CC'ing Paolo.
>

Hi Zephaniah,
if you are actually interested in the benefits of BFQ (low latency,
high responsiveness, fairness, ...) then it may be worth trying what
you yourself suggest: disabling CONFIG_DEBUG_BLK_CGROUP. Another reason
to do so is that this option enables the costly computation of debug
cgroup statistics, which you probably don't use.

In addition, the outcome of your attempt without
CONFIG_DEBUG_BLK_CGROUP would give us useful bisection information:
- if no failure occurs, then the issue is likely to be confined to
  that debugging code (which, on the bright side, is likely to be of
  occasional interest, for only a handful of developers)
- if the issue still shows up, then we may have new hints on this odd
  failure

Finally, consider that this issue has been reported to disappear as of
4.16 [1], and, as a plus, that the service quality of BFQ got a
further boost in 4.16 as well.

Looking forward to your feedback, in case you try BFQ without
CONFIG_DEBUG_BLK_CGROUP,
Paolo

[1] https://www.spinics.net/lists/linux-block/msg21422.html

>
> --
> Jens Axboe
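Before rebuilding, it can help to confirm whether the running kernel was built with the option at all. The following is a minimal Python sketch, assuming the usual kernel `.config` syntax (`CONFIG_FOO=y` when set, `# CONFIG_FOO is not set` when disabled). The helper name `config_option_state` and the sample config text are illustrative assumptions.

```python
import re


def config_option_state(config_text, option):
    """Return the option's value ("y", "m", ...), "n" if explicitly
    disabled, or None if the option is not mentioned at all."""
    set_line = re.search(
        rf"^{re.escape(option)}=(.+)$", config_text, re.MULTILINE
    )
    if set_line:
        return set_line.group(1)
    unset_line = re.search(
        rf"^# {re.escape(option)} is not set$", config_text, re.MULTILINE
    )
    return "n" if unset_line else None


# Illustrative fragments, not taken from the reporter's kernel config.
BEFORE = "CONFIG_BLK_CGROUP=y\nCONFIG_DEBUG_BLK_CGROUP=y\nCONFIG_IOSCHED_BFQ=y\n"
AFTER = "CONFIG_BLK_CGROUP=y\n# CONFIG_DEBUG_BLK_CGROUP is not set\nCONFIG_IOSCHED_BFQ=y\n"

print(config_option_state(BEFORE, "CONFIG_DEBUG_BLK_CGROUP"))
print(config_option_state(AFTER, "CONFIG_DEBUG_BLK_CGROUP"))
```

On many distributions the text to feed in can be read from `/boot/config-<release>`, or from `/proc/config.gz` when the kernel was built with `CONFIG_IKCONFIG_PROC`; a self-built kernel like the one in this thread would use the `.config` in its build tree.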
* Re: General protection fault with use_blk_mq=1.
From: Zephaniah E. Loss-Cutler-Hull @ 2018-03-29 9:12 UTC
To: Paolo Valente, Jens Axboe
Cc: Zephaniah E. Loss-Cutler-Hull, Linux Kernel Mailing List, linux-block, linux-scsi

On 03/28/2018 10:13 PM, Paolo Valente wrote:
>
>
>> On 29 Mar 2018, at 05:22, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/28/18 9:13 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>>> On 03/28/2018 06:02 PM, Jens Axboe wrote:
>>>> On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>>>>> I am not subscribed to any of the lists on the To list here, please CC
>>>>> me on any replies.
>>>>>
>>>>> I am encountering a fairly consistent crash anywhere from 15 minutes to
>>>>> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>>>>>
>>>>> The crash looks like:
>>>>>
>>>>>
>>>>> Looking through the code, I'd guess that this is dying inside
>>>>> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
>>>>> is pointing at.
>>>>
>>>> Leaving the whole thing here for Paolo - it's crashing off insertion of
>>>> a request coming out of SG_IO. Don't think we've seen this BFQ failure
>>>> case before.
>>>>
>>>> You can mitigate this by switching the scsi-mq devices to mq-deadline
>>>> instead.
>>>>
>>>
>>> I'm thinking that I should also be able to mitigate it by disabling
>>> CONFIG_DEBUG_BLK_CGROUP.
>>>
>>> That should remove that entire chunk of code.
>>>
>>> Of course, that won't help if this is actually a symptom of a bigger
>>> problem.
>>
>> Yes, it's not a given that it will fully mask the issue at hand. But
>> turning off BFQ has a much higher chance of working for you.
>>
>> This time actually CC'ing Paolo.
>>
>
> Hi Zephaniah,
> if you are actually interested in the benefits of BFQ (low latency,
> high responsiveness, fairness, ...) then it may be worth to try what
> you yourself suggest: disabling CONFIG_DEBUG_BLK_CGROUP. Also because
> this option activates the heavy computation of debug cgroup statistics,
> which probably you don't use.

I definitely am.

>
> In addition, the outcome of your attempt without
> CONFIG_DEBUG_BLK_CGROUP would give us useful bisection information:
> - if no failure occurs, then the issue is likely to be confined in
> that debugging code (which, on the bright side, is likely to be of
> occasional interest, for only a handful of developers)
> - if the issue still shows up, then we may have new hints on this odd
> failure
>
> Finally, consider that this issue has been reported to disappear from
> 4.16 [1], and, as a plus, that the service quality of BFQ had a
> further boost exactly from 4.16.

I look forward to that either way then.

>
> Looking forward to your feedback, in case you try BFQ without
> CONFIG_DEBUG_BLK_CGROUP,

I'm running that now, judging from the past if it survives until
tomorrow evening then we're good, so I should hopefully know in the next
day.

Thank you,
Zephaniah E. Loss-Cutler-Hull.

> Paolo
>
> [1] https://www.spinics.net/lists/linux-block/msg21422.html
>
>>
>> --
>> Jens Axboe
>
* Re: General protection fault with use_blk_mq=1.
From: Zephaniah E. Loss-Cutler-Hull @ 2018-03-30 5:43 UTC
To: Paolo Valente, Jens Axboe
Cc: Zephaniah E. Loss-Cutler-Hull, Linux Kernel Mailing List, linux-block, linux-scsi

On 03/29/2018 02:12 AM, Zephaniah E. Loss-Cutler-Hull wrote:
> On 03/28/2018 10:13 PM, Paolo Valente wrote:
>> In addition, the outcome of your attempt without
>> CONFIG_DEBUG_BLK_CGROUP would give us useful bisection information:
>> - if no failure occurs, then the issue is likely to be confined in
>> that debugging code (which, on the bright side, is likely to be of
>> occasional interest, for only a handful of developers)
>> - if the issue still shows up, then we may have new hints on this odd
>> failure
>>
>> Finally, consider that this issue has been reported to disappear from
>> 4.16 [1], and, as a plus, that the service quality of BFQ had a
>> further boost exactly from 4.16.
>
> I look forward to that either way then.
>>
>> Looking forward to your feedback, in case you try BFQ without
>> CONFIG_DEBUG_BLK_CGROUP,
>
> I'm running that now, judging from the past if it survives until
> tomorrow evening then we're good, so I should hopefully know in the next
> day.

Alright, I now have an uptime of over 20 hours, with
scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1

I did upgrade from 4.15.13 to 4.15.14 in the process, but a quick look
at the changes doesn't have anything jump out at me as impacting this.

So I'm reasonably comfortable stating that disabling
CONFIG_DEBUG_BLK_CGROUP was sufficient to render this stable.

Regards,
Zephaniah E. Loss-Cutler-Hull.
* Re: General protection fault with use_blk_mq=1.
From: Paolo Valente @ 2018-03-29 4:56 UTC
To: Jens Axboe
Cc: Zephaniah E. Loss-Cutler-Hull, linux-kernel, linux-block, linux-scsi

> On 29 Mar 2018, at 03:02, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/28/18 5:03 PM, Zephaniah E. Loss-Cutler-Hull wrote:
>> I am not subscribed to any of the lists on the To list here, please CC
>> me on any replies.
>>
>> I am encountering a fairly consistent crash anywhere from 15 minutes to
>> 12 hours after boot with scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=1
>>
>> The crash looks like:
>>
>> [ 5466.075993] general protection fault: 0000 [#1] PREEMPT SMP PTI
>> [...]
>> [ 5466.670153] note: pool[10573] exited with preempt_count 2
>>
>> (I only have the one instance right this minute as a result of not
>> having remote syslog setup before now.)
>>
>> This is clearly deep in the blk_mq code, and it goes away when I remove
>> the use_blk_mq kernel command line parameters.
>>
>> My next obvious step is to try and disable the load of the vbox modules.
>>
>> I can include the full dmesg output if it would be helpful.
>>
>> The system is an older HP Ultrabook, and the root partition is, sda1 (a
>> SSD) -> a LUKS encrypted partition -> LVM -> BTRFS.
>>
>> The kernel is a stock 4.15.11, however I only recently added the blk_mq
>> options, so while I can state that I have seen this on multiple kernels
>> in the 4.15.x series, I have not tested earlier kernels in this
>> configuration.
>>
>> Looking through the code, I'd guess that this is dying inside
>> blkg_rwstat_add, which calls percpu_counter_add_batch, which is what RIP
>> is pointing at.
>
> Leaving the whole thing here for Paolo - it's crashing off insertion of
> a request coming out of SG_IO. Don't think we've seen this BFQ failure
> case before.
>

Actually, we have. It was found and reported by Ming about two and a
half months ago:
https://www.spinics.net/lists/linux-block/msg21422.html

Then it just disappeared with 4.16, and Ming moved on. This forced me
to abandon the problem, as I never succeeded in reproducing it.

Thanks,
Paolo

> You can mitigate this by switching the scsi-mq devices to mq-deadline
> instead.
>
> --
> Jens Axboe