* Deadlock with ide_issue_trim and draining
From: Fiona Ebner @ 2023-03-07 12:22 UTC
To: QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, John Snow,
    Paolo Bonzini, Hanna Czenczek

Hi,
I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight
counter for TRIM BH") introduced an issue in combination with draining.

From a debug session on a customer's machine I gathered the following
information:
* The QEMU process hangs in aio_poll called during draining and doesn't
  progress.
* The in_flight counter for the BlockDriverState is 0 and for the
  BlockBackend it is 1.
* There is a blk_aio_pdiscard_entry request in the BlockBackend's
  queued_requests.
* The drive is attached via ahci.

I suspect that something like the following happened:

1. ide_issue_trim is called, and increments the in_flight counter.
2. ide_issue_trim_cb calls blk_aio_pdiscard.
3. Somebody else starts draining.
4. ide_issue_trim_cb is called as the completion callback for
   blk_aio_pdiscard.
5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
6. The request is added to the wait queue via blk_wait_while_drained,
   because draining has been started.
7. Nobody ever decrements the in_flight counter and draining can't
   finish.

The issue occurs very rarely and is difficult to reproduce, but with
the help of GDB, I'm able to do it rather reliably:
1. Use GDB to break on blk_aio_pdiscard.
2. Run mkfs.ext4 on a huge disk in the guest.
3. Issue a drive-backup QMP command after landing on the breakpoint.
4. Continue a few times in GDB.
5. After that I can observe the same situation as described above.

I'd be happy about suggestions for how to fix it. Unfortunately, I
don't see a clear-cut way at the moment. The only idea I have right
now is to change the code to issue all discard requests at the same
time, but I fear there might be pitfalls with that?

Best Regards,
Fiona
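Concretely, the reproduction steps translate roughly into the following
GDB/QMP session. This is a hedged sketch: the disk image path, QMP
socket path, device IDs and guest device name are assumptions to be
adapted to the actual setup.

    $ gdb --args qemu-system-x86_64 \
        -drive if=none,id=drive0,file=/path/to/huge-disk.qcow2 \
        -device ahci,id=ahci0 -device ide-hd,drive=drive0,bus=ahci0.0 \
        -qmp unix:/tmp/qmp.sock,server=on,wait=off
    (gdb) break blk_aio_pdiscard
    (gdb) run
    # In the guest: mkfs.ext4 /dev/sda   (emits a burst of TRIM requests)
    # Once the breakpoint hits, queue the drain via QMP, e.g. from
    # qmp-shell (the command is only processed after continuing, since
    # all of QEMU is stopped at the breakpoint):
    #   (QEMU) drive-backup device=drive0 target=/tmp/backup.raw sync=full
    (gdb) continue    # repeat a few times; QEMU then hangs in aio_poll()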
* Re: Deadlock with ide_issue_trim and draining
From: Hanna Czenczek @ 2023-03-07 13:44 UTC
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, John Snow,
    Paolo Bonzini

On 07.03.23 13:22, Fiona Ebner wrote:
> Hi,
> I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight
> counter for TRIM BH") introduced an issue in combination with
> draining.
>
> From a debug session on a customer's machine I gathered the following
> information:
> * The QEMU process hangs in aio_poll called during draining and
>   doesn't progress.
> * The in_flight counter for the BlockDriverState is 0 and for the
>   BlockBackend it is 1.
> * There is a blk_aio_pdiscard_entry request in the BlockBackend's
>   queued_requests.
> * The drive is attached via ahci.
>
> I suspect that something like the following happened:
>
> 1. ide_issue_trim is called, and increments the in_flight counter.
> 2. ide_issue_trim_cb calls blk_aio_pdiscard.
> 3. Somebody else starts draining.
> 4. ide_issue_trim_cb is called as the completion callback for
>    blk_aio_pdiscard.
> 5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
> 6. The request is added to the wait queue via blk_wait_while_drained,
>    because draining has been started.
> 7. Nobody ever decrements the in_flight counter and draining can't
>    finish.

Sounds about right.

> The issue occurs very rarely and is difficult to reproduce, but with
> the help of GDB, I'm able to do it rather reliably:
> 1. Use GDB to break on blk_aio_pdiscard.
> 2. Run mkfs.ext4 on a huge disk in the guest.
> 3. Issue a drive-backup QMP command after landing on the breakpoint.
> 4. Continue a few times in GDB.
> 5. After that I can observe the same situation as described above.
>
> I'd be happy about suggestions for how to fix it. Unfortunately, I
> don't see a clear-cut way at the moment. The only idea I have right
> now is to change the code to issue all discard requests at the same
> time, but I fear there might be pitfalls with that?

The point of 7e5cdb345f was that we need some in-flight count to
accompany a set s->bus->dma->aiocb. While blk_aio_pdiscard() is
happening, we don’t necessarily need another count. But we do need it
while there is no blk_aio_pdiscard().

ide_issue_trim_cb() returns in two cases (and, recursively through its
callers, leaves s->bus->dma->aiocb set):
1. After calling blk_aio_pdiscard(), which will keep an in-flight
   count,
2. After calling replay_bh_schedule_event() (i.e. qemu_bh_schedule()),
   which does not keep an in-flight count.

Perhaps we just need to move the blk_inc_in_flight() above the
replay_bh_schedule_event() call?

Hanna
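To make the two exit paths concrete, here is a condensed sketch of
ide_issue_trim_cb() from hw/ide/core.c around commit 7e5cdb345f. It is
not the verbatim upstream function: the DSM-range parsing is folded
into a hypothetical next_trim_range() helper, field names are
approximate, and the blk_inc_in_flight() placement in the comment is
only the suggestion from above, not committed code.

    static void ide_issue_trim_cb(void *opaque, int ret)
    {
        TrimAIOCB *iocb = opaque;
        IDEState *s = iocb->s;
        uint64_t sector;
        uint32_t count;

        if (ret >= 0) {
            /* next_trim_range() stands in for the loop that walks the
             * guest's DSM buffer and yields the next non-empty range. */
            if (next_trim_range(iocb, &sector, &count)) {
                /* Exit path 1: the queued discard itself holds a BB
                 * in-flight count, so draining waits for it. */
                iocb->aiocb = blk_aio_pdiscard(s->blk,
                                               sector << BDRV_SECTOR_BITS,
                                               count << BDRV_SECTOR_BITS,
                                               ide_issue_trim_cb, opaque);
                return;
            }
        } else {
            iocb->ret = ret;
        }

        iocb->aiocb = NULL;
        if (iocb->bh) {
            /* Exit path 2: only the completion BH remains scheduled,
             * and nothing holds an in-flight count for it.  The
             * suggestion is to call blk_inc_in_flight(s->blk) right
             * here, paired with a blk_dec_in_flight() in the BH,
             * instead of taking the count in ide_issue_trim(). */
            replay_bh_schedule_event(iocb->bh);
        }
    }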
* Re: Deadlock with ide_issue_trim and draining
From: Hanna Czenczek @ 2023-03-07 14:27 UTC
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, John Snow,
    Paolo Bonzini

On 07.03.23 14:44, Hanna Czenczek wrote:
> On 07.03.23 13:22, Fiona Ebner wrote:
>> Hi,
>> I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight
>> counter for TRIM BH") introduced an issue in combination with
>> draining.
>>
>> From a debug session on a customer's machine I gathered the following
>> information:
>> * The QEMU process hangs in aio_poll called during draining and
>>   doesn't progress.
>> * The in_flight counter for the BlockDriverState is 0 and for the
>>   BlockBackend it is 1.
>> * There is a blk_aio_pdiscard_entry request in the BlockBackend's
>>   queued_requests.
>> * The drive is attached via ahci.
>>
>> I suspect that something like the following happened:
>>
>> 1. ide_issue_trim is called, and increments the in_flight counter.
>> 2. ide_issue_trim_cb calls blk_aio_pdiscard.
>> 3. Somebody else starts draining.
>> 4. ide_issue_trim_cb is called as the completion callback for
>>    blk_aio_pdiscard.
>> 5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
>> 6. The request is added to the wait queue via blk_wait_while_drained,
>>    because draining has been started.
>> 7. Nobody ever decrements the in_flight counter and draining can't
>>    finish.
>
> Sounds about right.
>
>> The issue occurs very rarely and is difficult to reproduce, but with
>> the help of GDB, I'm able to do it rather reliably:
>> 1. Use GDB to break on blk_aio_pdiscard.
>> 2. Run mkfs.ext4 on a huge disk in the guest.
>> 3. Issue a drive-backup QMP command after landing on the breakpoint.
>> 4. Continue a few times in GDB.
>> 5. After that I can observe the same situation as described above.
>>
>> I'd be happy about suggestions for how to fix it. Unfortunately, I
>> don't see a clear-cut way at the moment. The only idea I have right
>> now is to change the code to issue all discard requests at the same
>> time, but I fear there might be pitfalls with that?
>
> The point of 7e5cdb345f was that we need some in-flight count to
> accompany a set s->bus->dma->aiocb. While blk_aio_pdiscard() is
> happening, we don’t necessarily need another count. But we do need it
> while there is no blk_aio_pdiscard().
>
> ide_issue_trim_cb() returns in two cases (and, recursively through its
> callers, leaves s->bus->dma->aiocb set):
> 1. After calling blk_aio_pdiscard(), which will keep an in-flight
>    count,
> 2. After calling replay_bh_schedule_event() (i.e. qemu_bh_schedule()),
>    which does not keep an in-flight count.
>
> Perhaps we just need to move the blk_inc_in_flight() above the
> replay_bh_schedule_event() call?

FWIW, doing so at least keeps the reproducer from
https://bugzilla.redhat.com/show_bug.cgi?id=2029980 working.

Hanna
* Re: Deadlock with ide_issue_trim and draining
From: Fiona Ebner @ 2023-03-08 10:35 UTC
To: Hanna Czenczek, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, John Snow,
    Paolo Bonzini

On 07.03.23 at 15:27, Hanna Czenczek wrote:
> On 07.03.23 14:44, Hanna Czenczek wrote:
>> On 07.03.23 13:22, Fiona Ebner wrote:
>>> Hi,
>>> I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight
>>> counter for TRIM BH") introduced an issue in combination with
>>> draining.
>>>
>>> From a debug session on a customer's machine I gathered the
>>> following information:
>>> * The QEMU process hangs in aio_poll called during draining and
>>>   doesn't progress.
>>> * The in_flight counter for the BlockDriverState is 0 and for the
>>>   BlockBackend it is 1.
>>> * There is a blk_aio_pdiscard_entry request in the BlockBackend's
>>>   queued_requests.
>>> * The drive is attached via ahci.
>>>
>>> I suspect that something like the following happened:
>>>
>>> 1. ide_issue_trim is called, and increments the in_flight counter.
>>> 2. ide_issue_trim_cb calls blk_aio_pdiscard.
>>> 3. Somebody else starts draining.
>>> 4. ide_issue_trim_cb is called as the completion callback for
>>>    blk_aio_pdiscard.
>>> 5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
>>> 6. The request is added to the wait queue via
>>>    blk_wait_while_drained, because draining has been started.
>>> 7. Nobody ever decrements the in_flight counter and draining can't
>>>    finish.
>>
>> Sounds about right.
>>
>>> The issue occurs very rarely and is difficult to reproduce, but with
>>> the help of GDB, I'm able to do it rather reliably:
>>> 1. Use GDB to break on blk_aio_pdiscard.
>>> 2. Run mkfs.ext4 on a huge disk in the guest.
>>> 3. Issue a drive-backup QMP command after landing on the breakpoint.
>>> 4. Continue a few times in GDB.
>>> 5. After that I can observe the same situation as described above.
>>>
>>> I'd be happy about suggestions for how to fix it. Unfortunately, I
>>> don't see a clear-cut way at the moment. The only idea I have right
>>> now is to change the code to issue all discard requests at the same
>>> time, but I fear there might be pitfalls with that?
>>
>> The point of 7e5cdb345f was that we need some in-flight count to
>> accompany a set s->bus->dma->aiocb. While blk_aio_pdiscard() is
>> happening, we don’t necessarily need another count. But we do need it
>> while there is no blk_aio_pdiscard().
>>
>> ide_issue_trim_cb() returns in two cases (and, recursively through
>> its callers, leaves s->bus->dma->aiocb set):
>> 1. After calling blk_aio_pdiscard(), which will keep an in-flight
>>    count,
>> 2. After calling replay_bh_schedule_event() (i.e.
>>    qemu_bh_schedule()), which does not keep an in-flight count.
>>
>> Perhaps we just need to move the blk_inc_in_flight() above the
>> replay_bh_schedule_event() call?
>
> FWIW, doing so at least keeps the reproducer from
> https://bugzilla.redhat.com/show_bug.cgi?id=2029980 working.

And I'm not able to run into my current issue anymore :) Thank you!

FWIW, the suggested change and explanation sound good to me. Are you
going to send a patch for it?

Best Regards,
Fiona
* Re: Deadlock with ide_issue_trim and draining
From: Hanna Czenczek @ 2023-03-08 15:02 UTC
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, John Snow,
    Paolo Bonzini

On 08.03.23 11:35, Fiona Ebner wrote:
> On 07.03.23 at 15:27, Hanna Czenczek wrote:
>> On 07.03.23 14:44, Hanna Czenczek wrote:
>>> On 07.03.23 13:22, Fiona Ebner wrote:
>>>> Hi,
>>>> I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight
>>>> counter for TRIM BH") introduced an issue in combination with
>>>> draining.
>>>>
>>>> From a debug session on a customer's machine I gathered the
>>>> following information:
>>>> * The QEMU process hangs in aio_poll called during draining and
>>>>   doesn't progress.
>>>> * The in_flight counter for the BlockDriverState is 0 and for the
>>>>   BlockBackend it is 1.
>>>> * There is a blk_aio_pdiscard_entry request in the BlockBackend's
>>>>   queued_requests.
>>>> * The drive is attached via ahci.
>>>>
>>>> I suspect that something like the following happened:
>>>>
>>>> 1. ide_issue_trim is called, and increments the in_flight counter.
>>>> 2. ide_issue_trim_cb calls blk_aio_pdiscard.
>>>> 3. Somebody else starts draining.
>>>> 4. ide_issue_trim_cb is called as the completion callback for
>>>>    blk_aio_pdiscard.
>>>> 5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
>>>> 6. The request is added to the wait queue via
>>>>    blk_wait_while_drained, because draining has been started.
>>>> 7. Nobody ever decrements the in_flight counter and draining can't
>>>>    finish.
>>>
>>> Sounds about right.
>>>
>>>> The issue occurs very rarely and is difficult to reproduce, but
>>>> with the help of GDB, I'm able to do it rather reliably:
>>>> 1. Use GDB to break on blk_aio_pdiscard.
>>>> 2. Run mkfs.ext4 on a huge disk in the guest.
>>>> 3. Issue a drive-backup QMP command after landing on the
>>>>    breakpoint.
>>>> 4. Continue a few times in GDB.
>>>> 5. After that I can observe the same situation as described above.
>>>>
>>>> I'd be happy about suggestions for how to fix it. Unfortunately, I
>>>> don't see a clear-cut way at the moment. The only idea I have right
>>>> now is to change the code to issue all discard requests at the same
>>>> time, but I fear there might be pitfalls with that?
>>>
>>> The point of 7e5cdb345f was that we need some in-flight count to
>>> accompany a set s->bus->dma->aiocb. While blk_aio_pdiscard() is
>>> happening, we don’t necessarily need another count. But we do need
>>> it while there is no blk_aio_pdiscard().
>>>
>>> ide_issue_trim_cb() returns in two cases (and, recursively through
>>> its callers, leaves s->bus->dma->aiocb set):
>>> 1. After calling blk_aio_pdiscard(), which will keep an in-flight
>>>    count,
>>> 2. After calling replay_bh_schedule_event() (i.e.
>>>    qemu_bh_schedule()), which does not keep an in-flight count.
>>>
>>> Perhaps we just need to move the blk_inc_in_flight() above the
>>> replay_bh_schedule_event() call?
>>
>> FWIW, doing so at least keeps the reproducer from
>> https://bugzilla.redhat.com/show_bug.cgi?id=2029980 working.
>
> And I'm not able to run into my current issue anymore :) Thank you!

Great! :)

> FWIW, the suggested change and explanation sound good to me. Are you
> going to send a patch for it?

Sure, can do.

Hanna
* Re: Deadlock with ide_issue_trim and draining
From: Hanna Czenczek @ 2023-03-08 18:03 UTC
To: Fiona Ebner, QEMU Developers
Cc: open list:Network Block Dev..., Thomas Lamprecht, John Snow,
    Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 4157 bytes --]

On 07.03.23 14:44, Hanna Czenczek wrote:
> On 07.03.23 13:22, Fiona Ebner wrote:
>> Hi,
>> I suspect that commit 7e5cdb345f ("ide: Increment BB in-flight
>> counter for TRIM BH") introduced an issue in combination with
>> draining.
>>
>> From a debug session on a customer's machine I gathered the following
>> information:
>> * The QEMU process hangs in aio_poll called during draining and
>>   doesn't progress.
>> * The in_flight counter for the BlockDriverState is 0 and for the
>>   BlockBackend it is 1.
>> * There is a blk_aio_pdiscard_entry request in the BlockBackend's
>>   queued_requests.
>> * The drive is attached via ahci.
>>
>> I suspect that something like the following happened:
>>
>> 1. ide_issue_trim is called, and increments the in_flight counter.
>> 2. ide_issue_trim_cb calls blk_aio_pdiscard.
>> 3. Somebody else starts draining.
>> 4. ide_issue_trim_cb is called as the completion callback for
>>    blk_aio_pdiscard.
>> 5. ide_issue_trim_cb issues yet another blk_aio_pdiscard request.
>> 6. The request is added to the wait queue via blk_wait_while_drained,
>>    because draining has been started.
>> 7. Nobody ever decrements the in_flight counter and draining can't
>>    finish.
>
> Sounds about right.
>
>> The issue occurs very rarely and is difficult to reproduce, but with
>> the help of GDB, I'm able to do it rather reliably:
>> 1. Use GDB to break on blk_aio_pdiscard.
>> 2. Run mkfs.ext4 on a huge disk in the guest.
>> 3. Issue a drive-backup QMP command after landing on the breakpoint.
>> 4. Continue a few times in GDB.
>> 5. After that I can observe the same situation as described above.
>>
>> I'd be happy about suggestions for how to fix it. Unfortunately, I
>> don't see a clear-cut way at the moment. The only idea I have right
>> now is to change the code to issue all discard requests at the same
>> time, but I fear there might be pitfalls with that?
>
> The point of 7e5cdb345f was that we need some in-flight count to
> accompany a set s->bus->dma->aiocb. While blk_aio_pdiscard() is
> happening, we don’t necessarily need another count. But we do need it
> while there is no blk_aio_pdiscard().
>
> ide_issue_trim_cb() returns in two cases (and, recursively through its
> callers, leaves s->bus->dma->aiocb set):
> 1. After calling blk_aio_pdiscard(), which will keep an in-flight
>    count,
> 2. After calling replay_bh_schedule_event() (i.e. qemu_bh_schedule()),
>    which does not keep an in-flight count.
>
> Perhaps we just need to move the blk_inc_in_flight() above the
> replay_bh_schedule_event() call?

While writing the commit message for this, I noticed it isn’t quite
right: ide_cancel_dma_sync() drains s->blk only once, so as soon as the
in-flight counter goes to 0, s->blk is considered drained and
ide_cancel_dma_sync() will go on to assert that s->bus->dma->aiocb is
now NULL.  However, if we do have a blk_aio_pdiscard() in flight, the
drain will wait only for that one request to complete, not for the
whole trim operation: the next discard or ide_trim_bh_cb() will be
scheduled afterwards, but neither will necessarily have run by the time
blk_drain() returns.
I’ve attached a reproducer that issues two trim requests.  Before
7e5cdb345f, it makes qemu crash because the assertion fails (one or two
of the blk_aio_pdiscard()s are drained, but the trim isn’t settled
yet).  After 7e5cdb345f, qemu hangs because of what you describe (the
second blk_aio_pdiscard() is enqueued, so the drain can’t make
progress, resulting in a deadlock).  With my proposed fix, qemu crashes
again.

(The reproducer is run like this:
$ qemu-system-x86_64 -drive if=ide,file=/tmp/test.bin,format=raw )

What comes to my mind is either what you’ve proposed initially (issuing
all discards simultaneously), or to still use my proposed fix, but also
have ide_cancel_dma_sync() run blk_drain() in a loop until
s->bus->dma->aiocb becomes NULL.  (Kind of like my original patch
(https://lists.nongnu.org/archive/html/qemu-block/2022-01/msg00024.html),
only that we can still use blk_drain() instead of aio_poll(), because
we increment the in-flight counter while the completion BH is
scheduled.)

Hanna

[-- Attachment #2: test.bin --]
[-- Type: application/octet-stream, Size: 1024 bytes --]

[-- Attachment #3: test.asm --]
[-- Type: text/plain, Size: 1637 bytes --]

format binary
use16
org 0x7c00

DMA_BUF = 0x7e00

cld
xor ax, ax
mov ds, ax
mov es, ax

; clear DMA buffer
mov di, DMA_BUF
xor ax, ax
mov cx, 256
repnz stosw

; two TRIM requests (both are the same: one sector, starting at sector
; index 1)
mov di, DMA_BUF
mov dword [di+ 0], 0x00000001
mov dword [di+ 4], 0x00010000
mov dword [di+ 8], 0x00000001
mov dword [di+12], 0x00010000

; find IDE PCI device
mov ax, 0xb102
mov dx, 0x8086
mov cx, 0x7010
xor si, si
int 0x1a

; bx has PCI address
push bx

; enable BM+MEM+IO
mov di, 0x04 ; command/status
mov ax, 0xb10a ; read config dword
int 0x1a

pop bx
push bx

or cl, 0x7 ; BM+MEM+IO
mov di, 0x04
mov ax, 0xb10d ; write config dword
int 0x1a

pop bx
push bx

; read BAR4 (DMA I/O space)
mov di, 0x20 ; bar4
mov ax, 0xb10a
int 0x1a

and cx, 0xfffc ; DMA I/O base
push cx

mov dx, cx

; set up DMA
xor al, al ; status: 0
out dx, al

mov al, 0x04 ; clear pending interrupts
add dx, 2
out dx, al

mov eax, prdt
add dx, 2
out dx, eax

; send TRIM command
mov dx, 0x1f0
mov si, dsm_trim_cmd
mov cx, 8
out_loop:
lodsb
out dx, al
inc dx
loop out_loop

; start DMA transfer
pop dx
mov al, 0x01
out dx, al

; immediately reset device, cancelling ongoing TRIM
mov dx, 0x3f6
mov al, 0x04
out dx, al

cli
hlt

dsm_trim_cmd:
db 0x00, 0x01, 0x01, 0x00, 0x00, 0x00, 0xc0, 0x06

pciaddr:
dw ?

align(8)
prdt:
dd DMA_BUF
dd 512 or 0x80000000

times 510-($-$$) db 0
dw 0xaa55

times 512 db 0
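To illustrate the second option above: a hedged sketch of what the
looped drain in ide_cancel_dma_sync() (hw/ide/core.c) could look like.
Tracing and other work done by the real function are elided, and the
loop is only sound on top of the proposed in-flight fix, since without
it blk_drain() can return while the completion BH is still pending.

    void ide_cancel_dma_sync(IDEState *s)
    {
        /* A single blk_drain() only waits for the currently in-flight
         * blk_aio_pdiscard(); its completion callback may then schedule
         * the next discard or the trim BH.  Looping until aiocb becomes
         * NULL waits for the whole TRIM operation to settle. */
        while (s->bus->dma->aiocb) {
            blk_drain(s->blk);
        }
        assert(s->bus->dma->aiocb == NULL);
    }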