Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Daniel Henrique Barboza <danielhb@linux.ibm.com>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	qemu-block@nongnu.org, Kevin Wolf <kwolf@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>, Fam Zheng <famz@redhat.com>
Subject: Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date: Wed, 16 May 2018 10:47:31 +0100	[thread overview]
Message-ID: <20180516094731.GA2741@work-vm> (raw)
In-Reply-To: <42ec7519-653d-09e2-abc6-78f04733ca47@linux.ibm.com>

* Daniel Henrique Barboza (danielhb@linux.ibm.com) wrote:
> Hi,
> 
> I've been working in the last two months in a miscompare issue that happens
> when using a raid device and a SATA as scsi-hd (emulated SCSI) with
> cache=none and io=threads during a hardware stress test. I'll summarize it
> here as best as I can without creating a great wall of text - Red Hat folks
> can check [1] for all the details.
> 
> Using the following setup:
> 
> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
> qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
> 
> - Guest is RHEL 7.5-alt using the same kernel as the host, using two storage
> disks (a 1.8 Tb raid and a 446Gb SATA drive) as follows:
> 
>     <disk type='block' device='disk'>
>       <driver name='qemu' type='raw' cache='none'/>
>       <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
>       <target dev='sdc' bus='scsi'/>
>       <alias name='scsi0-0-0-2'/>
>       <address type='drive' controller='0' bus='0' target='0' unit='2'/>
>     </disk>
> 
> Both block devices have WCE off in the host.
> 
> With this env, we found problems when running a stress test called HTX [2].
> At a given time (usually after 24+ hours of test) HTX finds a data
> miscompare in one of the devices. This is an example:
> 
> -------
> 
> Device name: /dev/sdb
> Total blocks: 0x74706daf, Block size: 0x200
> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
> Number of Rulefile passes (cycle) completed: 0
> Stanza running: rule_6, Thread no.: 8
> Oper performed: wrc, Current seek type: SEQ
> LBA no. where IO started: 0x94fa
> Transfer size: 0x8400
> 
> Miscompare Summary:
> ===================
> LBA no. where miscomapre started:     0x94fa
> LBA no. where miscomapre ended:       0x94ff
> Miscompare start offset (in bytes):   0x8
> Miscomapre end offset (in bytes):     0xbff
> Miscompare size (in bytes):           0xbf8
> 
> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
> Actual data (at miscomapre offset): 889aea5a736462000000000000007275

Are all the miscompares single bit errors like that one?
Is the test doing single bit manipulation or is that coming out of the
blue?

Dave

> -----
> 
> 
> This means that the test executed a write at  LBA 0x94fa and, after
> confirming that the write was completed, issue 2 reads in the same LBA to
> assert the written contents and found out a mismatch.
> 
> 
> I've tested all sort of configurations between disk vs LUN, cache modes and
> AIO. My findings are:
> 
> - using device='lun' instead of device='disk', I can't reproduce the issue
> doesn't matter what other configurations are;
> - using device='disk' but with cache='writethrough', issue doesn't happen
> (haven't checked other cache modes);
> - using device='disk', cache='none' and io='native', issue doesn't happen.
> 
> 
> The issue seems to be tied with the combination device=disk + cache=none +
> io=threads. I've started digging into the SCSI layer all the way down to the
> block backend. With a shameful amount of logs I've discovered that, in the
> write that the test finds a miscompare, in block/file-posix.c:
> 
> - when doing the write, handle_aiocb_rw_vector() returns success, pwritev()
> reports that all bytes were written
> - in both reads after the write, handle_aiocb_rw_vector returns success, all
> bytes read by preadv(). In both reads, the data read is different from the
> data written by  the pwritev() that happened before
> 
> In the discussions at [1], Fam Zheng suggested a test in which we would take
> down the number of threads created in the POSIX thread pool from 64 to 1.
> The idea is to ensure that we're using the same thread to write and read.
> There was a suspicion that the kernel can't guarantee data coherency between
> different threads, even if using the same fd, when using pwritev() and
> preadv(). This would explain why the following reads in the same fd would
> fail to retrieve the same data that was written before. After doing this
> modification, the miscompare didn't reproduce.
> 
> After reverting the thread pool number change, I've made a couple of
> attempts trying to flush before read() and flushing after write(). Both
> attempts failed - the miscompare appears in both scenarios. This enforces
> the suspicion we have above - if data coherency can't be granted between
> different threads, flushing in different threads wouldn't make a difference
> too. I've also tested a suggestion from Fam where I started the disks with
> "cache.direct=on,cache.no-flush=off" - bug still reproduces.
> 
> 
> This is the current status of this investigation. I decided to start a
> discussion here, see if someone can point me something that I overlooked or
> got it wrong, before I started changing the POSIX thread pool behavior to
> see if I can enforce one specific POSIX thread to do a read() if we had a
> write() done in the same fd. Any suggestions?
> 
> 
> 
> ps: it is worth mentioning that I was able to reproduce this same bug in a
> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
> doesn't have any arch-specific behavior I wouldn't be surprised if this bug
> is also reproducible in other archs like x86.
> 
> 
> Thanks,
> 
> Daniel
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
> [2] https://github.com/open-power/HTX
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

next prev parent reply	other threads:[~2018-05-16  9:47 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-15 21:25 [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads Daniel Henrique Barboza
2018-05-16  7:47 ` Paolo Bonzini
2018-05-16 21:35   ` Daniel Henrique Barboza
2018-05-16 22:12     ` Daniel Henrique Barboza
2018-05-16  9:47 ` Dr. David Alan Gilbert [this message]
2018-05-16 21:40   ` Daniel Henrique Barboza
2018-05-24 14:04 ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2018-05-24 21:30   ` Daniel Henrique Barboza
2018-06-01 11:49     ` Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180516094731.GA2741@work-vm \
    --to=dgilbert@redhat.com \
    --cc=danielhb@linux.ibm.com \
    --cc=famz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).