public inbox for linux-block@vger.kernel.org
From: Klaus Jensen <its@irrelevant.dk>
To: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
	Pankaj Raghav <pankydev8@gmail.com>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Adam Manzanares <a.manzanares@samsung.com>
Subject: Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall
Date: Wed, 27 Apr 2022 09:41:32 +0200
Message-ID: <YmjzrLo0/zW3Ou03@apples>
In-Reply-To: <20220427050825.rkn633nevijh3ux5@shindev>

On Apr 27 05:08, Shinichiro Kawasaki wrote:
> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
> > On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
> > > On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
> > > > Hey folks,
> > > > 
> > > > While enhancing kdevops [0] to embrace automation of testing with
> > > > blktests for ZNS I ended up spotting a possible false positive RCU stall
> > > > when running zbd/006 after zbd/005. The curious thing though is that
> > > > this possible RCU stall only shows up when using the QEMU
> > > > ZNS drive, not when using nbd. As far as kdevops is concerned,
> > > > it creates ZNS drives for you when you enable the config option
> > > > CONFIG_QEMU_ENABLE_NVME_ZNS=y, so picking any of the ZNS drives
> > > > suffices. When configuring blktests you can just enable the zbd
> > > > guest, so only a pair of guests is created: the zbd guest and its
> > > > development counterpart, the zbd-dev guest. When using
> > > > CONFIG_KDEVOPS_HOSTS_PREFIX="linux517", this means you end up with
> > > > just two guests:
> > > > 
> > > >   * linux517-blktests-zbd
> > > >   * linux517-blktests-zbd-dev
> > > > 
> > > > The RCU stall can be triggered easily as follows:
> > > > 
> > > > make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
> > > > make
> > > > make bringup # bring up guests
> > > > make linux # build and boot into v5.17-rc7
> > > > make blktests # build and install blktests
> > > > 
> > > > Now let's ssh to the guest while leaving a console attached
> > > > with `sudo virsh console vagrant_linux517-blktests-zbd` in a window:
> > > > 
> > > > ssh linux517-blktests-zbd
> > > > sudo su -
> > > > cd /usr/local/blktests
> > > > export TEST_DEVS=/dev/nvme9n1
> > > > i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
> > > > 
> > > > The above should never fail, but you should eventually see an RCU
> > > > stall candidate on the console. The full details can be observed on the
> > > > gist [1] but for completeness I list some of it below. It may be a false
> > > > positive at this point, not sure.
> > > > 
> > > > [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
> > > > [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
> > > > [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
> > > > [493336.981666] nvme nvme9: Abort status: 0x0
> > > > [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
> > > > [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > > 
> > > Hello Luis,
> > > 
> > > I run the blktests zbd group on several QEMU ZNS emulation devices for every
> > > rcX kernel release, but I have never observed the symptom above. Now I'm
> > > repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and have
> > > not observed the symptom so far, after 400 repetitions.
> > 
> > Did you try v5.17-rc7 ?
> 
> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
> 
> > 
> > > I would like to run the test using the same ZNS setup as yours. Can you share
> > > how your ZNS device is set up? I would like to know the device size and the
> > > QEMU -device options, such as zoned.zone_size or zoned.max_active.
> > 
> > It is as easy as running the make commands above, followed by the login commands.
> 
> I managed to run kdevops on my machine, and saw the I/O timeout and abort
> messages. Using a QEMU ZNS setup similar to the kdevops one, I could recreate
> the messages in my own test environment as well (the reset controller message
> and the RCU related error were not observed).
> 

Can you extract the relevant part of the QEMU parameters? I tried to
reproduce this with a 10G drive but could not, with either discard=on or
discard=off, and with either a qcow2 or a raw image.
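
For reference, the parts that matter are the backing -drive options and the nvme-ns properties. Below is a hedged sketch of a comparable 10G setup in bash; the image path, IDs, serial and 128M zone size are placeholders, not the kdevops values, and the rest of the QEMU command line is left out.

```shell
# Hypothetical 10G raw backing image for the zoned namespace.
qemu-img create -f raw /tmp/zns.raw 10G

# Options to append to an existing qemu-system-x86_64 command line.
# discard=on lets discards reach the image file; use discard=off to compare.
# The serial, IDs and zone size below are placeholders.
zns_opts=(
    -drive file=/tmp/zns.raw,id=zns0,format=raw,if=none,discard=on
    -device nvme,id=nvme0,serial=zns0001
    -device nvme-ns,drive=zns0,bus=nvme0,nsid=1,zoned=true,zoned.zone_size=128M
)
# qemu-system-x86_64 <usual machine options> "${zns_opts[@]}"
```

zoned.max_active and zoned.max_open can be appended to the nvme-ns options in the same way.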

> [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
> [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
> [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
> [  276.788391][    C0] nvme nvme6: Abort status: 0x0
> 
> The conditions to recreate the I/O timeout error are as follows:
> 
> - A larger QEMU ZNS drive (10GB)
>     - I use QEMU ZNS drives of 1GB for my test runs. With this smaller
>       size, the I/O timeout is not observed.
>
> - Issuing a zone reset for all zones of the drive (with the 'blkzone reset'
>   command) right after zbd/005 completes.
>     - The test case zbd/006 calls the zone reset command. Repeating zbd/005
>       followed by the zone reset command is enough to recreate the I/O timeout.
>     - When a 10 second sleep is added between the zbd/005 run and the zone
>       reset command, the I/O timeout was not observed.
>     - The data write pattern of zbd/005 looks important. A simple dd command to
>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
> 
> I dug into the QEMU code and found that it takes a long time to complete a zone
> reset command with the all-zones flag. It takes more than 30 seconds, which
> appears to trigger the I/O timeout in the block layer. QEMU calls fallocate
> punch hole on the backend file for each zone, so that the data of each zone is
> zero-cleared. Each fallocate call is quick, but a delay of 0.7 seconds between
> calls was often observed. I guess some fsync or fdatasync operation is running
> and causing the delay.
> 
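
The reproduction conditions quoted above can be scripted roughly as follows. This is a bash sketch, not part of blktests: it assumes it is sourced as root from the blktests directory, and ZNS_DEV, SETTLE_SECS, repro_once and repro_loop are hypothetical names. The timeout itself is reported by the kernel (e.g. "nvme nvme6: I/O 494 QID 4 timeout, aborting"), so watch dmesg or the console while it runs.

```shell
# Sketch only: repeat zbd/005 followed immediately by a reset of all zones.
# ZNS_DEV and SETTLE_SECS are hypothetical knobs, not blktests variables.
ZNS_DEV=${ZNS_DEV:-/dev/nvme0n1}   # placeholder; point at the ZNS namespace
SETTLE_SECS=${SETTLE_SECS:-0}      # a 10 s pause reportedly hides the timeout

# One iteration: leave zbd/005's write pattern on the drive, optionally
# pause, then reset every zone. Without --offset/--length, blkzone covers
# the whole device.
repro_once() {
    TEST_DEVS="$ZNS_DEV" ./check zbd/005 || return 1
    sleep "$SETTLE_SECS"
    blkzone reset "$ZNS_DEV"
}

# Run N iterations (default 100) and summarize. An iteration does not have
# to fail for the problem to show; the timeout is logged by the nvme driver.
repro_loop() {
    local n=${1:-100} i failures=0
    for ((i = 0; i < n; i++)); do
        if repro_once; then
            echo "iteration $i: ok"
        else
            echo "iteration $i: FAILED"
            failures=$((failures + 1))
        fi
    done
    echo "$failures of $n iterations failed"
    [ "$failures" -eq 0 ]
}
```

Usage would be along the lines of `SETTLE_SECS=10 repro_loop 20` to check whether the pause still hides the timeout.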

QEMU uses write zeroes for zone reset. This is because of the
requirement that blocks in empty zones must be considered deallocated.

When the drive is configured with `discard=on`, these write zeroes
*should* turn into discards. However, I also tested with discard=off and
I could not reproduce it.

It might make sense to force QEMU to use discard for zone reset in all
cases and then change the reported DLFEAT accordingly, since we cannot
guarantee zeroes in that case.
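
A back-of-the-envelope check of the reported timing, written in bash. Assumptions: 128 MiB zones (a placeholder; the zone size of the kdevops drive is not stated in this thread), one call per zone with every call followed by the ~0.7 s stall Shinichiro observed, and a 30 s NVMe I/O timeout, which is the nvme_core io_timeout default and matches the roughly 30 s gap between the zbd/006 start and the timeout message in the log. Since the stall was seen often rather than on every call, the result is a worst case, not a prediction.

```shell
# reset_estimate_ms DRIVE_MIB ZONE_MIB STALL_MS
# Worst case: every zone gets its own call and every call is followed by a
# stall of STALL_MS milliseconds. Bash only does integer math, so 0.7 s is
# expressed as 700 ms.
reset_estimate_ms() {
    local drive_mib=$1 zone_mib=$2 stall_ms=$3
    local zones=$(( (drive_mib + zone_mib - 1) / zone_mib ))  # round up
    echo $(( zones * stall_ms ))
}

timeout_ms=30000   # assumed NVMe I/O timeout (nvme_core io_timeout default)
for drive_mib in 1024 10240; do
    est_ms=$(reset_estimate_ms "$drive_mib" 128 700)
    if (( est_ms > timeout_ms )); then verdict="exceeds"; else verdict="fits within"; fi
    echo "${drive_mib} MiB drive: ~${est_ms} ms, ${verdict} the ${timeout_ms} ms timeout"
done
```

Under these assumptions the 1 GiB drive needs about 5.6 s (8 zones) and the 10 GiB drive about 56 s (80 zones), which is consistent with the timeout appearing only on the larger drive.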

> In other words, QEMU ZNS zone reset for all zones can be very slow, depending
> on the ZNS drive's size and state. A performance improvement of zone reset is
> desired in QEMU. I will look for a chance to work on it.
> 

Currently, each zone is a separate discard/write zeroes call. It would be
reasonable to special-case the all-zones reset and do it in much larger
chunks.
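
A hypothetical sketch of the chunking idea, in bash for illustration rather than in QEMU's C NVMe emulation: given the sorted indices of the zones to reset, merge runs of adjacent zones into single byte ranges, capped at a maximum chunk size, so each range needs one discard or write zeroes call instead of one per zone. The function name and the cap are assumptions.

```shell
# coalesce_zones ZONE_BYTES MAX_CHUNK_BYTES ZONE_INDEX...
# Prints one "offset length" pair (in bytes) per chunk. Zone indices must be
# sorted ascending and unique. Adjacent zones are merged until adding one
# more would exceed MAX_CHUNK_BYTES.
coalesce_zones() {
    local zone_bytes=$1 max_chunk=$2
    shift 2
    local start=-1 prev=-1 z
    for z in "$@"; do
        # Extend the current run if z is adjacent and the cap still holds.
        if (( start >= 0 && z == prev + 1 && (z - start + 1) * zone_bytes <= max_chunk )); then
            prev=$z
            continue
        fi
        # Otherwise emit the finished run (if any) and start a new one at z.
        if (( start >= 0 )); then
            echo "$(( start * zone_bytes )) $(( (prev - start + 1) * zone_bytes ))"
        fi
        start=$z
        prev=$z
    done
    if (( start >= 0 )); then
        echo "$(( start * zone_bytes )) $(( (prev - start + 1) * zone_bytes ))"
    fi
}

# Example use against a scratch copy of a backing image (hypothetical path):
#   coalesce_zones $((128 << 20)) $((1 << 30)) 0 1 2 5 6 |
#     while read -r off len; do
#       fallocate --punch-hole --offset "$off" --length "$len" /tmp/zns-copy.raw
#     done
```

The cap keeps a single call from covering the entire device at once; whether that matters for QEMU's block layer is an open question, and a cap equal to the device size reduces the sketch to one call per run of adjacent zones.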
