From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Daniel Henrique Barboza <danielhb@linux.ibm.com>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
qemu-block@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <famz@redhat.com>
Subject: Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date: Wed, 16 May 2018 10:47:31 +0100 [thread overview]
Message-ID: <20180516094731.GA2741@work-vm> (raw)
In-Reply-To: <42ec7519-653d-09e2-abc6-78f04733ca47@linux.ibm.com>
* Daniel Henrique Barboza (danielhb@linux.ibm.com) wrote:
> Hi,
>
> I've been working in the last two months in a miscompare issue that happens
> when using a raid device and a SATA as scsi-hd (emulated SCSI) with
> cache=none and io=threads during a hardware stress test. I'll summarize it
> here as best as I can without creating a great wall of text - Red Hat folks
> can check [1] for all the details.
>
> Using the following setup:
>
> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
> qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
>
> - Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
> disks (a 1.8 Tb raid and a 446Gb SATA drive) as follows:
>
> <disk type='block' device='disk'>
> <driver name='qemu' type='raw' cache='none'/>
> <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
> <target dev='sdc' bus='scsi'/>
> <alias name='scsi0-0-0-2'/>
> <address type='drive' controller='0' bus='0' target='0' unit='2'/>
> </disk>
>
> Both block devices have WCE off in the host.
>
> With this env, we found problems when running a stress test called HTX [2].
> At a given time (usually after 24+ hours of test) HTX finds a data
> miscompare in one of the devices. This is an example:
>
> -------
>
> Device name: /dev/sdb
> Total blocks: 0x74706daf, Block size: 0x200
> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
> Number of Rulefile passes (cycle) completed: 0
> Stanza running: rule_6, Thread no.: 8
> Oper performed: wrc, Current seek type: SEQ
> LBA no. where IO started: 0x94fa
> Transfer size: 0x8400
>
> Miscompare Summary:
> ===================
> LBA no. where miscomapre started: 0x94fa
> LBA no. where miscomapre ended: 0x94ff
> Miscompare start offset (in bytes): 0x8
> Miscomapre end offset (in bytes): 0xbff
> Miscompare size (in bytes): 0xbf8
>
> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
> Actual data (at miscomapre offset): 889aea5a736462000000000000007275
Are all the miscompares single bit errors like that one?
Is the test doing single bit manipulation or is that coming out of the
blue?
Dave
> -----
>
>
> This means that the test executed a write at LBA 0x94fa and, after
> confirming that the write was completed, issue 2 reads in the same LBA to
> assert the written contents and found out a mismatch.
>
>
> I've tested all sort of configurations between disk vs LUN, cache modes and
> AIO. My findings are:
>
> - using device='lun' instead of device='disk', I can't reproduce the issue
> doesn't matter what other configurations are;
> - using device='disk' but with cache='writethrough', issue doesn't happen
> (haven't checked other cache modes);
> - using device='disk', cache='none' and io='native', issue doesn't happen.
>
>
> The issue seems to be tied with the combination device=disk + cache=none +
> io=threads. I've started digging into the SCSI layer all the way down to the
> block backend. With a shameful amount of logs I've discovered that, in the
> write that the test finds a miscompare, in block/file-posix.c:
>
> - when doing the write, handle_aiocb_rw_vector() returns success, pwritev()
> reports that all bytes were written
> - in both reads after the write, handle_aiocb_rw_vector returns success, all
> bytes read by preadv(). In both reads, the data read is different from the
> data written by the pwritev() that happened before
>
> In the discussions at [1], Fam Zheng suggested a test in which we would take
> down the number of threads created in the POSIX thread pool from 64 to 1.
> The idea is to ensure that we're using the same thread to write and read.
> There was a suspicion that the kernel can't guarantee data coherency between
> different threads, even if using the same fd, when using pwritev() and
> preadv(). This would explain why the following reads in the same fd would
> fail to retrieve the same data that was written before. After doing this
> modification, the miscompare didn't reproduce.
>
> After reverting the thread pool number change, I've made a couple of
> attempts trying to flush before read() and flushing after write(). Both
> attempts failed - the miscompare appears in both scenarios. This enforces
> the suspicion we have above - if data coherency can't be granted between
> different threads, flushing in different threads wouldn't make a difference
> too. I've also tested a suggestion from Fam where I started the disks with
> "cache.direct=on,cache.no-flush=off" - bug still reproduces.
>
>
> This is the current status of this investigation. I decided to start a
> discussion here, see if someone can point me something that I overlooked or
> got it wrong, before I started changing the POSIX thread pool behavior to
> see if I can enforce one specific POSIX thread to do a read() if we had a
> write() done in the same fd. Any suggestions?
>
>
>
> ps: it is worth mentioning that I was able to reproduce this same bug in a
> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
> doesn't have any arch-specific behavior I wouldn't be surprised if this bug
> is also reproducible in other archs like x86.
>
>
> Thanks,
>
> Daniel
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
> [2] https://github.com/open-power/HTX
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
next prev parent reply other threads:[~2018-05-16 9:47 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-05-15 21:25 [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads Daniel Henrique Barboza
2018-05-16 7:47 ` Paolo Bonzini
2018-05-16 21:35 ` Daniel Henrique Barboza
2018-05-16 22:12 ` Daniel Henrique Barboza
2018-05-16 9:47 ` Dr. David Alan Gilbert [this message]
2018-05-16 21:40 ` Daniel Henrique Barboza
2018-05-24 14:04 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2018-05-24 21:30 ` Daniel Henrique Barboza
2018-06-01 11:49 ` Stefan Hajnoczi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180516094731.GA2741@work-vm \
--to=dgilbert@redhat.com \
--cc=danielhb@linux.ibm.com \
--cc=famz@redhat.com \
--cc=kwolf@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).