From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Daniel Henrique Barboza <danielhb@linux.ibm.com>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
qemu-block@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <famz@redhat.com>
Subject: Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date: Wed, 16 May 2018 10:47:31 +0100
Message-ID: <20180516094731.GA2741@work-vm>
In-Reply-To: <42ec7519-653d-09e2-abc6-78f04733ca47@linux.ibm.com>

* Daniel Henrique Barboza (danielhb@linux.ibm.com) wrote:
> Hi,
>
> I've been working for the last two months on a miscompare issue that happens
> when using a RAID device and a SATA drive as scsi-hd (emulated SCSI) with
> cache=none and io=threads during a hardware stress test. I'll summarize it
> here as best I can without creating a great wall of text - Red Hat folks
> can check [1] for all the details.
>
> Using the following setup:
>
> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
> qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
>
> - Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
> disks (a 1.8 TB RAID and a 446 GB SATA drive) as follows:
>
> <disk type='block' device='disk'>
> <driver name='qemu' type='raw' cache='none'/>
> <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
> <target dev='sdc' bus='scsi'/>
> <alias name='scsi0-0-0-2'/>
> <address type='drive' controller='0' bus='0' target='0' unit='2'/>
> </disk>
>
> Both block devices have WCE off in the host.
>
> With this environment, we found problems when running a stress test called
> HTX [2]. At some point (usually after 24+ hours of testing) HTX finds a data
> miscompare in one of the devices. This is an example:
>
> -------
>
> Device name: /dev/sdb
> Total blocks: 0x74706daf, Block size: 0x200
> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
> Number of Rulefile passes (cycle) completed: 0
> Stanza running: rule_6, Thread no.: 8
> Oper performed: wrc, Current seek type: SEQ
> LBA no. where IO started: 0x94fa
> Transfer size: 0x8400
>
> Miscompare Summary:
> ===================
> LBA no. where miscomapre started: 0x94fa
> LBA no. where miscomapre ended: 0x94ff
> Miscompare start offset (in bytes): 0x8
> Miscomapre end offset (in bytes): 0xbff
> Miscompare size (in bytes): 0xbf8
>
> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
> Actual data (at miscomapre offset): 889aea5a736462000000000000007275

Are all the miscompares single bit errors like that one?
Is the test doing single bit manipulation or is that coming out of the
blue?

Dave

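To make the single-bit observation concrete, here is a minimal sketch that XORs the two 16-byte samples from the report and lists which bytes differ. The helper name `differing_bits` is hypothetical and not part of HTX or QEMU. Note that the report only prints 16 bytes at the miscompare offset, while the miscompare itself spans 0xbf8 bytes, so this says nothing about the rest of the region.

```python
def differing_bits(expected_hex: str, actual_hex: str):
    """Return (byte_offset, xor_mask) for every byte that differs."""
    expected = bytes.fromhex(expected_hex)
    actual = bytes.fromhex(actual_hex)
    if len(expected) != len(actual):
        raise ValueError("buffers must be the same length")
    return [
        (i, e ^ a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


# Samples copied from the HTX report quoted above.
expected = "8c9aea5a736462000000000000007275"
actual = "889aea5a736462000000000000007275"

diffs = differing_bits(expected, actual)
# Count the set bits across all XOR masks to get the number of flipped bits.
flipped = sum(bin(mask).count("1") for _, mask in diffs)
print(diffs, flipped)  # prints "[(0, 4)] 1"
```

For the sample shown, only byte 0 differs (0x8c vs 0x88) and the XOR mask is 0x04, i.e. one flipped bit in the displayed window.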
> -----
>
>
> This means that the test executed a write at LBA 0x94fa and, after
> confirming that the write was completed, issued two reads at the same LBA to
> verify the written contents and found a mismatch.
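As a sanity check on the numbers in the report, the following sketch redoes the arithmetic. It assumes (this is not stated in the HTX output) that the start and end offsets are byte offsets relative to the start of the I/O at LBA 0x94fa; all variable names are illustrative.

```python
BLOCK_SIZE = 0x200                  # "Block size" from the report
start_lba, end_lba = 0x94FA, 0x94FF  # first and last miscompared LBA
start_off, end_off = 0x8, 0xBFF      # assumed relative to the start of the I/O
transfer_size = 0x8400               # "Transfer size" from the report

# Inclusive byte range of the miscompare.
miscompare_size = end_off - start_off + 1
# Number of blocks covered by the miscompared LBA range.
blocks_touched = end_lba - start_lba + 1
# Number of blocks in the whole transfer.
blocks_in_transfer = transfer_size // BLOCK_SIZE

print(hex(miscompare_size), blocks_touched, blocks_in_transfer)
```

Under that assumption the reported size of 0xbf8 bytes matches the offsets, the miscompare covers 6 of the 66 blocks in the transfer, and it ends exactly at the end of the sixth block (6 * 0x200 = 0xc00).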
>
>
> I've tested all sorts of configurations of disk vs. LUN, cache modes and
> AIO. My findings are:
>
> - using device='lun' instead of device='disk', I can't reproduce the issue,
> no matter what the other settings are;
> - using device='disk' but with cache='writethrough', issue doesn't happen
> (haven't checked other cache modes);
> - using device='disk', cache='none' and io='native', issue doesn't happen.
>
>
> The issue seems to be tied to the combination device=disk + cache=none +
> io=threads. I've started digging into the SCSI layer all the way down to the
> block backend. With a shameful amount of logging I've discovered that, for the
> write on which the test finds a miscompare, in block/file-posix.c:
>
> - when doing the write, handle_aiocb_rw_vector() returns success and pwritev()
> reports that all bytes were written
> - in both reads after the write, handle_aiocb_rw_vector() returns success and
> preadv() reads all bytes. In both reads, the data read is different from the
> data written by the pwritev() that happened before
>
> In the discussions at [1], Fam Zheng suggested a test in which we would take
> down the number of threads created in the POSIX thread pool from 64 to 1.
> The idea is to ensure that we're using the same thread to write and read.
> There was a suspicion that the kernel can't guarantee data coherency between
> different threads, even if using the same fd, when using pwritev() and
> preadv(). This would explain why the following reads in the same fd would
> fail to retrieve the same data that was written before. After doing this
> modification, the miscompare didn't reproduce.
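The access pattern being described (one pwritev(), then two preadv() calls on the same fd, each issued from a different thread) can be sketched outside QEMU roughly as follows. This is only an illustration under several assumptions: it runs every operation on a freshly created thread as a stand-in for QEMU's thread pool, it uses a scratch file from `tempfile` rather than a block device, and it does not open the file with O_DIRECT (which cache=none implies and which needs aligned buffers). The helper names are hypothetical, and the sketch is not expected to reproduce the miscompare.

```python
import os
import tempfile
import threading


def in_new_thread(fn, *args):
    """Run fn(*args) on a freshly created thread and return its result."""
    box = {}

    def target():
        try:
            box["value"] = fn(*args)
        except BaseException as exc:  # re-raised in the caller below
            box["error"] = exc

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    if "error" in box:
        raise box["error"]
    return box["value"]


def write_then_verify(fd, offset, payload, reads=2):
    """Write payload at offset, read it back `reads` times, count mismatches."""
    written = in_new_thread(os.pwritev, fd, [payload], offset)
    if written != len(payload):
        raise OSError("short write")
    mismatches = 0
    for _ in range(reads):
        buf = bytearray(len(payload))
        got = in_new_thread(os.preadv, fd, [buf], offset)
        if got != len(payload) or bytes(buf) != payload:
            mismatches += 1
    return mismatches


def run(iterations=50, size=0x8400):
    """Repeat the write/read/read cycle on a scratch file; return mismatches."""
    with tempfile.TemporaryFile() as scratch:
        fd = scratch.fileno()
        return sum(
            write_then_verify(fd, i * size, os.urandom(size))
            for i in range(iterations)
        )


if __name__ == "__main__":
    print("mismatches:", run())
```

Replacing `in_new_thread(fn, *args)` with a direct `fn(*args)` call gives the single-thread variant, which corresponds loosely to the pool-size-1 experiment.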
>
> After reverting the thread pool number change, I made a couple of
> attempts at flushing before read() and flushing after write(). Both
> attempts failed - the miscompare appears in both scenarios. This reinforces
> the suspicion we have above - if data coherency can't be guaranteed between
> different threads, flushing in different threads wouldn't make a difference
> either. I've also tested a suggestion from Fam where I started the disks with
> "cache.direct=on,cache.no-flush=off" - bug still reproduces.
>
>
> This is the current status of this investigation. I decided to start a
> discussion here to see if someone can point out something that I overlooked or
> got wrong, before I start changing the POSIX thread pool behavior to
> see if I can force one specific POSIX thread to do a read() if we had a
> write() done on the same fd. Any suggestions?
>
>
>
> ps: it is worth mentioning that I was able to reproduce this same bug on a
> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
> doesn't have any arch-specific behavior I wouldn't be surprised if this bug
> is also reproducible on other architectures like x86.
>
>
> Thanks,
>
> Daniel
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
> [2] https://github.com/open-power/HTX
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK