From: Daniel Henrique Barboza <danielhb@linux.ibm.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	qemu-block@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>, Fam Zheng <famz@redhat.com>
Subject: Re: [Qemu-devel] Problem with data miscompare using scsi-hd, cache=none and io=threads
Date: Wed, 16 May 2018 19:12:21 -0300
Message-ID: <fcb0c9ca-10bc-1f4d-71e9-402023be5f06@linux.ibm.com>
In-Reply-To: <34187a39-5d59-62d4-6dd2-15f013738026@linux.ibm.com>



On 05/16/2018 06:35 PM, Daniel Henrique Barboza wrote:
>
>
> On 05/16/2018 04:47 AM, Paolo Bonzini wrote:
>> On 15/05/2018 23:25, Daniel Henrique Barboza wrote:
>>> This is the current status of this investigation. I decided to start a
>>> discussion here, to see if someone can point out something that I
>>> overlooked or got wrong, before I start changing the POSIX thread pool
>>> behavior to see if I can force the same POSIX thread to do a read()
>>> when a write() was done on the same fd. Any suggestions?
>> Copying from the bug:
>>
>>> Unless we learn something new, my understanding is that we're dealing
>>> with a host-side limitation/bug: when pwritev() is called in a
>>> different thread than a following preadv(), using the same file
>>> descriptor opened with O_DIRECT and no WCE on the host side, the
>>> kernel can't guarantee data coherency, e.g.:
>>>
>>> - thread A executes a pwritev() writing dataA in the disk
>>>
>>> - thread B executes a preadv() call to read the data, but this
>>> preadv() call isn't aware of the previous pwritev() call done in
>>> thread A, so there is no guarantee that the preadv() call reads dataA
>>> (as opposed to what is described in man 3 write)
>>>
>>> - the physical disk, due to the heavy load of the stress test, hasn't
>>> finished writing dataA. Since the disk itself doesn't have any
>>> internal cache to rely on, the preadv() call goes in and reads stale
>>> data that differs from dataA.
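
For reference, the access pattern described above boils down to something
like the following standalone sketch (not QEMU code; the device path, block
size and offset are placeholders, and the write is destructive, so it must
only be pointed at a disposable test device). The preadv() is only issued
after the pwritev() has returned:

/* Minimal two-thread reproducer sketch of the pattern above (not QEMU
 * code). Device path, block size and offset are arbitrary assumptions.
 * WARNING: destructive, only run against a disposable test device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLKSZ 4096

static int fd;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int write_done;

static void *thread_a(void *arg)            /* writes dataA */
{
    void *buf;
    (void)arg;
    if (posix_memalign(&buf, BLKSZ, BLKSZ)) {  /* O_DIRECT needs alignment */
        return NULL;
    }
    memset(buf, 0xAA, BLKSZ);
    struct iovec iov = { buf, BLKSZ };

    if (pwritev(fd, &iov, 1, 0) != BLKSZ) {
        perror("pwritev");
    }

    pthread_mutex_lock(&lock);
    write_done = 1;                         /* the read is only submitted */
    pthread_cond_signal(&cond);             /* after pwritev() returned   */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *thread_b(void *arg)            /* reads the same block back */
{
    void *buf;
    (void)arg;
    if (posix_memalign(&buf, BLKSZ, BLKSZ)) {
        return NULL;
    }
    struct iovec iov = { buf, BLKSZ };

    pthread_mutex_lock(&lock);
    while (!write_done) {
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);

    if (preadv(fd, &iov, 1, 0) != BLKSZ) {
        perror("preadv");
    }
    /* If the O_DIRECT write really completed only once the disk had the
     * data, every byte read back here must be 0xAA. */
    printf("first byte read back: 0x%02x\n",
           (unsigned)((unsigned char *)buf)[0]);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    fd = open("/dev/sdX", O_RDWR | O_DIRECT);   /* hypothetical test device */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    close(fd);
    return 0;
}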
>> There is a problem in the reasoning of the third point: if the physical
>> disk hasn't yet finished writing up dataA, pwritev() shouldn't have
>> returned.  This could be a bug in the kernel, or even in the disk.  I
>> suspect the kernel because SCSI passthrough doesn't show the bug; SCSI
>> passthrough uses ioctl() which completes exactly when the disk tells
>> QEMU that the command is done---it cannot report completion too early.
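
(For context, SCSI passthrough submits the command through an SG_IO
ioctl(), roughly as in the sketch below. This is only an illustration of
the completion semantics, not the code path QEMU uses; the device path,
LBA and transfer size are placeholders.)

/* Rough sketch of a SCSI READ(10) via SG_IO (not QEMU code; device path,
 * LBA and transfer size are arbitrary assumptions). The ioctl() returns
 * only after the device reports completion of the command. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* READ(10): LBA 0, 1 block */
    unsigned char cdb[10] = { 0x28, 0, 0, 0, 0, 0, 0, 0, 1, 0 };
    unsigned char data[512];
    unsigned char sense[32];
    struct sg_io_hdr hdr;

    int fd = open("/dev/sg0", O_RDWR);          /* hypothetical sg node */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmd_len = sizeof(cdb);
    hdr.cmdp = cdb;
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxferp = data;
    hdr.dxfer_len = sizeof(data);
    hdr.sbp = sense;
    hdr.mx_sb_len = sizeof(sense);
    hdr.timeout = 20000;                        /* milliseconds */

    /* Blocks until the disk itself signals that the READ is done. */
    if (ioctl(fd, SG_IO, &hdr) < 0) {
        perror("SG_IO");
    } else {
        printf("SCSI status: 0x%02x\n", hdr.status);
    }
    close(fd);
    return 0;
}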
>>
>> (Another small problem in the third point is that the disk actually does
>> have a cache.  But the cache should be transparent; if it weren't, the
>> bug would be in the disk firmware.)
>>
>> It has to be debugged and fixed in the kernel.  The thread pool is
>> just... a thread pool, and shouldn't be working around bugs, especially
>> ones as serious as this.
>
> Fixing it in the thread pool would only make sense if we were sure that
> the kernel was working as intended. I think the next step is to look at
> it at the kernel level and see what is not working there.
>
>>
>> A more likely possibility: maybe the disk has 4K sectors and QEMU is
>> doing read-modify-write cycles to emulate 512 byte sectors?  In this
>> case, mismatches are not expected, since QEMU serializes RMW cycles, but
>> at least we would know that the bug would be in QEMU, and where.
>
> I hadn't considered this possibility. I'll check whether the disk has 4k
> sectors and whether QEMU is emulating 512-byte sectors.
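
If the disk does turn out to have 4K physical sectors, the read-modify-write
emulation mentioned above amounts to something like the sketch below. This
is only an illustration of the general RMW idea, not QEMU's block-layer
code; the function name and sizes are assumptions, and overlapping RMW
cycles must be serialized by the caller (which is what QEMU's serialization
is expected to guarantee):

/* Hedged illustration of read-modify-write for emulating 512-byte writes
 * on a device that only accepts 4096-byte aligned I/O (not QEMU code). */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

#define PHYS_SECTOR 4096

/* Write 'len' bytes at byte offset 'off' (both multiples of 512) to an
 * O_DIRECT fd backed by a 4096-byte-sector device. */
int rmw_write_512(int fd, const void *buf, size_t len, off_t off)
{
    off_t start = off & ~(off_t)(PHYS_SECTOR - 1);         /* align down */
    off_t end = (off + len + PHYS_SECTOR - 1) & ~(off_t)(PHYS_SECTOR - 1);
    size_t span = end - start;
    void *bounce;

    if (posix_memalign(&bounce, PHYS_SECTOR, span)) {
        return -1;
    }

    /* 1. Read the full physical sectors covering the request. */
    struct iovec iov = { bounce, span };
    if (preadv(fd, &iov, 1, start) != (ssize_t)span) {
        free(bounce);
        return -1;
    }

    /* 2. Modify only the sub-sector region the guest asked for. */
    memcpy((uint8_t *)bounce + (off - start), buf, len);

    /* 3. Write the full physical sectors back. */
    int ret = pwritev(fd, &iov, 1, start) == (ssize_t)span ? 0 : -1;
    free(bounce);
    return ret;
}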

There are several differences between the guest and the host device
regarding the kernel queue parameters. This is how the guest configured
the SATA disk:


# grep . /sys/block/sdb/queue/*
/sys/block/sdb/queue/add_random:1
/sys/block/sdb/queue/chunk_sectors:0
/sys/block/sdb/queue/dax:0
/sys/block/sdb/queue/discard_granularity:4096
/sys/block/sdb/queue/discard_max_bytes:1073741824
/sys/block/sdb/queue/discard_max_hw_bytes:1073741824
/sys/block/sdb/queue/discard_zeroes_data:0
/sys/block/sdb/queue/hw_sector_size:512
/sys/block/sdb/queue/io_poll:0
/sys/block/sdb/queue/io_poll_delay:0
grep: /sys/block/sdb/queue/iosched: Is a directory
/sys/block/sdb/queue/iostats:1
/sys/block/sdb/queue/logical_block_size:512
/sys/block/sdb/queue/max_discard_segments:1
/sys/block/sdb/queue/max_hw_sectors_kb:32767
/sys/block/sdb/queue/max_integrity_segments:0
/sys/block/sdb/queue/max_sectors_kb:256
/sys/block/sdb/queue/max_segments:126
/sys/block/sdb/queue/max_segment_size:65536
/sys/block/sdb/queue/minimum_io_size:262144
/sys/block/sdb/queue/nomerges:0
/sys/block/sdb/queue/nr_requests:128
/sys/block/sdb/queue/optimal_io_size:262144
/sys/block/sdb/queue/physical_block_size:512
/sys/block/sdb/queue/read_ahead_kb:4096
/sys/block/sdb/queue/rotational:1
/sys/block/sdb/queue/rq_affinity:1
/sys/block/sdb/queue/scheduler:noop [deadline] cfq
/sys/block/sdb/queue/unpriv_sgio:0
grep: /sys/block/sdb/queue/wbt_lat_usec: Invalid argument
/sys/block/sdb/queue/write_cache:write back
/sys/block/sdb/queue/write_same_max_bytes:262144
/sys/block/sdb/queue/write_zeroes_max_bytes:262144
/sys/block/sdb/queue/zoned:none


The same device in the host:

$ grep . /sys/block/sdc/queue/*
/sys/block/sdc/queue/add_random:1
/sys/block/sdc/queue/chunk_sectors:0
/sys/block/sdc/queue/dax:0
/sys/block/sdc/queue/discard_granularity:0
/sys/block/sdc/queue/discard_max_bytes:0
/sys/block/sdc/queue/discard_max_hw_bytes:0
/sys/block/sdc/queue/discard_zeroes_data:0
/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/io_poll:0
/sys/block/sdc/queue/io_poll_delay:0
grep: /sys/block/sdc/queue/iosched: Is a directory
/sys/block/sdc/queue/iostats:1
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/max_discard_segments:1
/sys/block/sdc/queue/max_hw_sectors_kb:256
/sys/block/sdc/queue/max_integrity_segments:0
/sys/block/sdc/queue/max_sectors_kb:256
/sys/block/sdc/queue/max_segments:64
/sys/block/sdc/queue/max_segment_size:65536
/sys/block/sdc/queue/minimum_io_size:512
/sys/block/sdc/queue/nomerges:0
/sys/block/sdc/queue/nr_requests:128
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/physical_block_size:512
/sys/block/sdc/queue/read_ahead_kb:4096
/sys/block/sdc/queue/rotational:1
/sys/block/sdc/queue/rq_affinity:1
/sys/block/sdc/queue/scheduler:noop [deadline] cfq
/sys/block/sdc/queue/unpriv_sgio:0
grep: /sys/block/sdc/queue/wbt_lat_usec: Invalid argument
/sys/block/sdc/queue/write_cache:write through
/sys/block/sdc/queue/write_same_max_bytes:0
/sys/block/sdc/queue/write_zeroes_max_bytes:0
/sys/block/sdc/queue/zoned:none



The physical block size is 512 in both the guest and the host, but there
are a lot of differences in how the guest sees the device. I'm not sure
whether any of these differences is suspicious enough to shed some light
on the problem, though.
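
One way to cross-check the sysfs numbers above is to query the block device
directly for its logical and physical block sizes; a minimal sketch (the
default device path is a placeholder; it would be run against /dev/sdc on
the host and /dev/sdb in the guest):

/* Minimal sketch: query logical/physical block sizes straight from the
 * block device, to cross-check the sysfs values above. The default device
 * path is an assumption. */
#include <fcntl.h>
#include <linux/fs.h>       /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdc";
    int logical = 0;
    unsigned int physical = 0;

    int fd = open(dev, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, BLKSSZGET, &logical) < 0 ||
        ioctl(fd, BLKPBSZGET, &physical) < 0) {
        perror("ioctl");
        close(fd);
        return 1;
    }
    printf("%s: logical %d bytes, physical %u bytes\n", dev, logical, physical);
    close(fd);
    return 0;
}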


Daniel


>
>
>>
>> Paolo
>>
>
