From: "Michael S. Tsirkin" <mst@redhat.com>
To: Suwan Kim <suwan.kim027@gmail.com>
Cc: "Roberts, Martin" <martin.roberts@intel.com>,
virtualization <virtualization@lists.linux-foundation.org>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
Jens Axboe <axboe@kernel.dk>
Subject: Re: virtio-blk: support completion batching for the IRQ path - failure
Date: Thu, 8 Jun 2023 11:15:30 -0400
Message-ID: <20230608111505-mutt-send-email-mst@kernel.org>
In-Reply-To: <CAFNWusZZbFD+RLeJdno3vT6BAguq3jDB2EX8H8z5vPBE5sp54g@mail.gmail.com>
On Fri, Jun 09, 2023 at 12:12:16AM +0900, Suwan Kim wrote:
> On Thu, Jun 8, 2023 at 11:46 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Jun 08, 2023 at 11:07:21PM +0900, Suwan Kim wrote:
> > > On Thu, Jun 8, 2023 at 7:16 PM Roberts, Martin <martin.roberts@intel.com> wrote:
> > > >
> > > > The rq_affinity change does not resolve the issue; just reduces its occurrence rate; I am still seeing hangs with it set to 2.
> > > >
> > > > Martin
> > > >
> > > > From: Roberts, Martin
> > > > Sent: Wednesday, June 7, 2023 3:46 PM
> > > > To: Suwan Kim <suwan.kim027@gmail.com>
> > > > Cc: mst@redhat.com; virtualization <virtualization@lists.linux-foundation.org>; linux-block@vger.kernel.org
> > > > Subject: RE: virtio-blk: support completion batching for the IRQ path - failure
> > > >
> > > > It is the change indicated that breaks it - changing the IRQ handling to batching.
> > > >
> > > > From reports such as,
> > > >
> > > > [PATCH 1/1] blk-mq: added case for cpu offline during send_ipi in rq_complete (kernel.org)
> > > https://lore.kernel.org/lkml/20220929033428.25948-1-mj0123.lee@samsung.com/T/
> > > >
> > > > [RFC] blk-mq: Don't IPI requests on PREEMPT_RT - Patchwork (linaro.org)
> > > https://patches.linaro.org/project/linux-rt-users/patch/20201023110400.bx3uzsb7xy5jtsea@linutronix.de/
> > > >
> > > > I’m thinking the issue has something to do with which CPU the IRQ is running on.
> > > >
> > > > So, I set,
> > > >
> > > > # echo 2 > /sys/block/vda/queue/rq_affinity
> > > > # echo 2 > /sys/block/vdb/queue/rq_affinity
> > > > …
> > > > # echo 2 > /sys/block/vdp/queue/rq_affinity
> > > >
> > > > and the system (running 16 disks, 4 queues/disk) has not yet hung (running OK for several hours)…
> > > >
> > > > Martin
> > > >
> > >
> > > Hi Martin,
> > >
> > > Both versions (the original code and your simple path) execute
> > > blk_mq_complete_send_ipi()
> > > in blk_mq_complete_request_remote(). So maybe a missing request completion
> > > on another vCPU is not the cause...
> > >
> > > The difference between the original code and your simple path is that
> > > the original code calls blk_mq_end_request_batch() from virtblk_done()
> > > to process requests at the block layer,
> > > and your code calls blk_mq_end_request() from virtblk_done() to do the same thing.
> > >
> > > The original code:
> > > virtblk_handle_req() first collects all requests from the virtqueue in a while loop
> > > and passes them to blk_mq_end_request_batch() at once.
> > >
> > > Your simple path:
> > > virtblk_handle_req() gets a single request from the virtqueue, passes it to
> > > blk_mq_end_request(), and repeats this in a while loop until there is no request
> > > left in the virtqueue.
> > >
> > > I think we need to focus on the difference between blk_mq_end_request()
> > > and blk_mq_end_request_batch().
> > >
> > > Regards,
> > > Suwan Kim
> > >
> >
> > Yes but linux release is imminent and regressions are bad.
> > What do you suggest for now? If there's no better idea
> > I'll send a revert patch and we'll see in the next linux version.
> >
> >
>
> It is better to revert this commit. I have no good idea to debug it for now.
> I will try to reproduce it in my machine.
>
> Regards,
> Suwan Kim
ok so reverting
[PATCH v3 2/2] virtio-blk: support completion batching for the IRQ path
for now
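
For reference, the two completion styles described in the quoted message above can be sketched as a small user-space toy model. This is only an illustration under loud assumptions: it is not kernel code, every `toy_` name is hypothetical, and nothing here is the virtio-blk or blk-mq API. It only shows "end each request as it is pulled" next to "collect everything first, then end the whole list".

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a block request; NOT the kernel's struct request. */
struct toy_req {
    int id;
    int ended;              /* 1 once the request has been ended */
    struct toy_req *next;   /* link used only while queued in a batch */
};

/* Toy "virtqueue": completed requests plus a read cursor. */
struct toy_vq {
    struct toy_req **done;
    size_t count;
    size_t pos;
};

/* Pull the next completed request, or NULL when the queue is drained. */
struct toy_req *toy_get_buf(struct toy_vq *vq)
{
    return vq->pos < vq->count ? vq->done[vq->pos++] : NULL;
}

/* End one request; ending the same request twice is treated as a bug. */
void toy_end_request(struct toy_req *req)
{
    assert(!req->ended);
    req->ended = 1;
}

/* Per-request style: end each request as soon as it is pulled. */
int toy_complete_each(struct toy_vq *vq)
{
    struct toy_req *req;
    int n = 0;

    while ((req = toy_get_buf(vq)) != NULL) {
        toy_end_request(req);
        n++;
    }
    return n;
}

/* Batched style: collect everything first, then end the list in one pass. */
int toy_complete_batch(struct toy_vq *vq)
{
    struct toy_req *req, *batch = NULL;
    int n = 0;

    while ((req = toy_get_buf(vq)) != NULL) {
        req->next = batch;      /* push onto the batch list */
        batch = req;
    }
    while (batch != NULL) {
        req = batch;
        batch = req->next;
        req->next = NULL;
        toy_end_request(req);
        n++;
    }
    return n;
}
```

In this model both functions return the number of requests ended and leave every pulled request marked as ended, so the collection loop by itself does not lose completions. Whatever goes wrong in the real driver would have to come from behaviour this toy does not model.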
> > >
> > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Suwan Kim <suwan.kim027@gmail.com>
> > > > Sent: Wednesday, June 7, 2023 3:21 PM
> > > > To: Roberts, Martin <martin.roberts@intel.com>
> > > > Cc: mst@redhat.com; virtualization <virtualization@lists.linux-foundation.org>; linux-block@vger.kernel.org
> > > > Subject: Re: virtio-blk: support completion batching for the IRQ path - failure
> > > >
> > > >
> > > >
> > > > On Wed, Jun 7, 2023 at 6:14 PM Roberts, Martin <martin.roberts@intel.com> wrote:
> > > > >
> > > > > Re: virtio-blk: support completion batching for the IRQ path · torvalds/linux@07b679f · GitHub
> > > > >
> > > > > Signed-off-by: Suwan Kim suwan.kim027@gmail.com
> > > > > Signed-off-by: Michael S. Tsirkin mst@redhat.com
> > > > >
> > > > > This change appears to have broken things…
> > > > > We now see applications hanging during disk accesses.
> > > > >
> > > > > e.g.
> > > > > multi-port virtio-blk device running in h/w (FPGA)
> > > > > Host running a simple ‘fio‘ test.
> > > > >
> > > > > [global]
> > > > > thread=1
> > > > > direct=1
> > > > > ioengine=libaio
> > > > > norandommap=1
> > > > > group_reporting=1
> > > > > bs=4K
> > > > > rw=read
> > > > > iodepth=128
> > > > > runtime=1
> > > > > numjobs=4
> > > > > time_based
> > > > > [job0]
> > > > > filename=/dev/vda
> > > > > [job1]
> > > > > filename=/dev/vdb
> > > > > [job2]
> > > > > filename=/dev/vdc
> > > > > ...
> > > > > [job15]
> > > > > filename=/dev/vdp
> > > > >
> > > > > i.e. 16 disks; 4 queues per disk; simple burst of 4KB reads
> > > > > This is repeatedly run in a loop.
> > > > >
> > > > > After a few, normally <10 seconds, fio hangs.
> > > > > With 64 queues (16 disks), failure occurs within a few seconds; with 8 queues (2 disks) it may take ~hour before hanging.
> > > > >
> > > > > Last message:
> > > > > fio-3.19
> > > > > Starting 8 threads
> > > > > Jobs: 1 (f=1): [_(7),R(1)][68.3%][eta 03h:11m:06s]
> > > > >
> > > > > I think this means at the end of the run 1 queue was left incomplete.
> > > > >
> > > > > ‘diskstats’ (run while fio is hung) shows no outstanding transactions.
> > > > > e.g.
> > > > > $ cat /proc/diskstats
> > > > > ...
> > > > > 252 0 vda 1843140071 0 14745120568 712568645 0 0 0 0 0 3117947 712568645 0 0 0 0 0 0
> > > > > 252 16 vdb 1816291511 0 14530332088 704905623 0 0 0 0 0 3117711 704905623 0 0 0 0 0 0
> > > > > ...
> > > > >
> > > > > Other stats (in the h/w, and added to the virtio-blk driver ([a]virtio_queue_rq(), [b]virtblk_handle_req(), [c]virtblk_request_done())) all agree, and show every request had a completion, and that virtblk_request_done() never gets called.
> > > > > e.g.
> > > > > PF= 0                vq=0         1         2         3
> > > > > [a]request_count      - 839416590 813148916 105586179 84988123
> > > > > [b]completion1_count  - 839416590 813148916 105586179 84988123
> > > > > [c]completion2_count  - 0         0         0         0
> > > > >
> > > > > PF= 1                vq=0         1         2         3
> > > > > [a]request_count      - 823335887 812516140 104582672 75856549
> > > > > [b]completion1_count  - 823335887 812516140 104582672 75856549
> > > > > [c]completion2_count  - 0         0         0         0
> > > > >
> > > > > i.e. the issue is after the virtio-blk driver.
> > > > >
> > > > > This change was introduced in kernel 6.3.0.
> > > > > I am seeing this using 6.3.3.
> > > > > If I run with an earlier kernel (5.15), it does not occur.
> > > > > If I make a simple patch to the 6.3.3 virtio-blk driver, to skip the blk_mq_add_to_batch() call, it does not fail.
> > > > >
> > > > > e.g.
> > > > > kernel 5.15 – this is OK
> > > > > virtio_blk.c,virtblk_done() [irq handler]
> > > > >     if (likely(!blk_should_fake_timeout(req->q))) {
> > > > >         blk_mq_complete_request(req);
> > > > >     }
> > > > >
> > > > > kernel 6.3.3 – this fails
> > > > > virtio_blk.c,virtblk_handle_req() [irq handler]
> > > > >     if (likely(!blk_should_fake_timeout(req->q))) {
> > > > >         if (!blk_mq_complete_request_remote(req)) {
> > > > >             if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
> > > > >                 virtblk_request_done(req); //this never gets called... so blk_mq_add_to_batch() must always succeed
> > > > >             }
> > > > >         }
> > > > >     }
> > > > >
> > > > > If I do, kernel 6.3.3 – this is OK
> > > > > virtio_blk.c,virtblk_handle_req() [irq handler]
> > > > >     if (likely(!blk_should_fake_timeout(req->q))) {
> > > > >         if (!blk_mq_complete_request_remote(req)) {
> > > > >             virtblk_request_done(req); //force this here...
> > > > >             if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
> > > > >                 virtblk_request_done(req); //this never gets called... so blk_mq_add_to_batch() must always succeed
> > > > >             }
> > > > >         }
> > > > >     }
> > > > >
> > > > > Perhaps you might like to fix/test/revert this change…
> > > > >
> > > > > Martin
> > > > >
> > > >
> > > > Hi Martin,
> > > >
> > > > There are many changes between 6.3.0 and 6.3.3.
> > > > Could you try to find a commit which triggers the io hang?
> > > > Is it ok with 6.3.0 kernel or with reverting
> > > > "virtio-blk: support completion batching for the IRQ path" commit?
> > > >
> > > > We need to confirm which commit is causing the error.
> > > >
> > > > Regards,
> > > > Suwan Kim
> >