[Qemu-devel] Overlapping buffers in I/O requests
From: Stefan Hajnoczi @ 2010-09-30 12:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Christoph Hellwig

There is a block I/O corner case that I don't fully understand.  I'd
appreciate thoughts on the expected behavior.

At one point during a Windows Server 2008 install to an IDE disk, the
guest sends a read request with overlapping scatter-gather list
(sglist) buffers.  It looks like this:
[0] addr=A len=4k
[1] addr=B len=4k
[2] addr=C len=4k
[3] addr=B len=4k

Buffers 1 and 3 are the same guest memory; their addresses match.

If I understand correctly, IDE will perform each operation in turn and
DMA the result back to the buffers in order.  Therefore the disk
contents at +12k should be written to address B.
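
Spelled out, that expectation maps sequential disk offsets onto the
sglist entries in order:
[0] addr=A <- disk contents at +0k
[1] addr=B <- disk contents at +4k
[2] addr=C <- disk contents at +8k
[3] addr=B <- disk contents at +12k
so the last DMA into address B would carry the +12k data.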

Unfortunately QEMU does not guarantee this today.  Sometimes address B
ends up with the disk contents at +4k (buffer 1) and other times with
the disk contents at +12k (buffer 3).

QEMU can be taken out of the picture and replaced by a simple test
program that calls preadv(2) directly with the same overlapping buffer
pattern.  Even then there doesn't appear to be a guarantee that the
disk contents at +12k (buffer 3), rather than +4k (buffer 1), end up
in the shared buffer.
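
Something along these lines reproduces the pattern (4k buffers,
illustrative path and offset, minimal error handling; a sketch, not
the exact program):

/* Sketch: preadv(2) with an overlapping iovec pattern (A, B, C, B).
 * The path, 4k buffer size, and offset 0 are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BUF_SIZE 4096

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";
    void *a, *b, *c;
    struct iovec iov[4];
    int fd;
    ssize_t n;

    /* O_DIRECT requires aligned buffers */
    if (posix_memalign(&a, BUF_SIZE, BUF_SIZE) ||
        posix_memalign(&b, BUF_SIZE, BUF_SIZE) ||
        posix_memalign(&c, BUF_SIZE, BUF_SIZE)) {
        perror("posix_memalign");
        return 1;
    }

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Buffers 1 and 3 share the same memory, just like the guest sglist */
    iov[0].iov_base = a; iov[0].iov_len = BUF_SIZE;
    iov[1].iov_base = b; iov[1].iov_len = BUF_SIZE;
    iov[2].iov_base = c; iov[2].iov_len = BUF_SIZE;
    iov[3].iov_base = b; iov[3].iov_len = BUF_SIZE;

    n = preadv(fd, iov, 4, 0);
    if (n < 0) {
        perror("preadv");
        return 1;
    }

    /* Compare buffer b against the file contents at +4k and +12k
     * (not shown) to see which read "won". */
    printf("read %zd bytes, b[0]=0x%02x\n", n, ((unsigned char *)b)[0]);

    close(fd);
    return 0;
}

Comparing b against the file data at +4k and +12k across runs is
enough to see the behavior described below.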

When the page cache is active, preadv(2) produces consistent results.

When the page cache is bypassed (O_DIRECT), preadv(2) produces
consistent results against a physical disk:
               a-22904 [001]  3042.186790: block_bio_queue: 8,0 R 2048 + 32 [a]
               a-22904 [001]  3042.186807: block_getrq: 8,0 R 2048 + 32 [a]
               a-22904 [001]  3042.186812: block_plug: [a]
               a-22904 [001]  3042.186816: block_rq_insert: 8,0 R 0 () 2048 + 32 [a]
               a-22904 [001]  3042.186822: block_unplug_io: [a] 1
               a-22904 [001]  3042.186829: block_rq_issue: 8,0 R 0 () 2048 + 32 [a]
 pam-foreground--22912 [001]  3042.187066: block_rq_complete: 8,0 R () 2048 + 32 [0]

Notice that a single 32-sector read is issued on /dev/sda (8,0).  This
makes sense under the assumption that the disk honors DMA buffer
ordering within a request.

However, when the page cache is bypassed, preadv(2) produces
inconsistent results against a file on ext3 -> LVM -> dm-crypt ->
/dev/sda:
               a-22834 [001]  3038.425802: block_bio_queue: 254,3 R 32616672 + 8 [a]
               a-22834 [001]  3038.425812: block_remap: 254,0 R 58544736 + 8 <- (254,3) 32616672
               a-22834 [001]  3038.425813: block_bio_queue: 254,0 R 58544736 + 8 [a]
      kcryptd_io-379   [001]  3038.425832: block_remap: 8,0 R 59044807 + 8 <- (8,2) 58546792
      kcryptd_io-379   [001]  3038.425833: block_bio_queue: 8,0 R 59044807 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425841: block_getrq: 8,0 R 59044807 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425845: block_plug: [kcryptd_io]
      kcryptd_io-379   [001]  3038.425848: block_rq_insert: 8,0 R 0 () 59044807 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425859: block_rq_issue: 8,0 R 0 () 59044807 + 8 [kcryptd_io]
               a-22834 [001]  3038.425894: block_bio_queue: 254,3 R 32616792 + 16 [a]
               a-22834 [001]  3038.425898: block_remap: 254,0 R 58544856 + 16 <- (254,3) 32616792
               a-22834 [001]  3038.425899: block_bio_queue: 254,0 R 58544856 + 16 [a]
      kcryptd_io-379   [001]  3038.425908: block_remap: 8,0 R 59044927 + 16 <- (8,2) 58546912
      kcryptd_io-379   [001]  3038.425909: block_bio_queue: 8,0 R 59044927 + 16 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425911: block_getrq: 8,0 R 59044927 + 16 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425913: block_plug: [kcryptd_io]
      kcryptd_io-379   [001]  3038.425914: block_rq_insert: 8,0 R 0 () 59044927 + 16 [kcryptd_io]
               a-22834 [001]  3038.425920: block_bio_queue: 254,3 R 32616992 + 8 [a]
               a-22834 [001]  3038.425922: block_remap: 254,0 R 58545056 + 8 <- (254,3) 32616992
               a-22834 [001]  3038.425923: block_bio_queue: 254,0 R 58545056 + 8 [a]
               a-22834 [001]  3038.425929: block_unplug_io: [a] 0
               a-22834 [001]  3038.425930: block_unplug_io: [a] 0
               a-22834 [001]  3038.425931: block_unplug_io: [a] 2
               a-22834 [001]  3038.425934: block_rq_issue: 8,0 R 0 () 59044927 + 16 [a]
      kcryptd_io-379   [001]  3038.425948: block_remap: 8,0 R 59045127 + 8 <- (8,2) 58547112
      kcryptd_io-379   [001]  3038.425949: block_bio_queue: 8,0 R 59045127 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425951: block_getrq: 8,0 R 59045127 + 8 [kcryptd_io]
      kcryptd_io-379   [001]  3038.425953: block_plug: [kcryptd_io]
      kcryptd_io-379   [001]  3038.425954: block_rq_insert: 8,0 R 0 () 59045127 + 8 [kcryptd_io]
          <idle>-0     [001]  3038.427414: block_unplug_timer: [swapper] 3
       kblockd/1-21    [001]  3038.427437: block_unplug_io: [kblockd/1] 3
       kblockd/1-21    [001]  3038.427440: block_rq_issue: 8,0 R 0 () 59045127 + 8 [kblockd/1]
          <idle>-0     [000]  3038.436786: block_rq_complete: 8,0 R () 59044807 + 8 [0]
         kcryptd-380   [001]  3038.436960: block_bio_complete: 254,0 R 58544736 + 8 [0]
         kcryptd-380   [001]  3038.436963: block_bio_complete: 254,3 R 32616672 + 8 [0]
          <idle>-0     [001]  3038.437070: block_rq_complete: 8,0 R () 59044927 + 16 [0]
         kcryptd-380   [000]  3038.437343: block_bio_complete: 254,0 R 58544856 + 16 [611733513]
         kcryptd-380   [000]  3038.437346: block_bio_complete: 254,3 R 32616792 + 16 [-815025730]
          <idle>-0     [000]  3038.437428: block_rq_complete: 8,0 R () 59045127 + 8 [0]
         kcryptd-380   [000]  3038.437569: block_bio_complete: 254,0 R 58545056 + 8 [-2107963545]
         kcryptd-380   [000]  3038.437571: block_bio_complete: 254,3 R 32616992 + 8 [176593183]

The 32 sectors are broken up into 8, 16, and 8 sector requests.  I
believe the filesystem is doing this before LVM is reached.  This
makes sense, since a file may not be contiguous on disk and several
extents may need to be read.

These three independent requests can complete in any order.  The
completion order determines which contents are visible at address B
when the read completes.
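
For concreteness, the overlap in question is just two iovec entries
covering the same memory.  A check along these lines (illustrative
only, not existing QEMU code) would flag the A, B, C, B pattern above:

/* Sketch: detect overlapping entries in an iovec array.
 * Illustrative only, not existing QEMU code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

static int iov_has_overlap(const struct iovec *iov, int cnt)
{
    int i, j;

    for (i = 0; i < cnt; i++) {
        uintptr_t a0 = (uintptr_t)iov[i].iov_base;
        uintptr_t a1 = a0 + iov[i].iov_len;

        for (j = i + 1; j < cnt; j++) {
            uintptr_t b0 = (uintptr_t)iov[j].iov_base;
            uintptr_t b1 = b0 + iov[j].iov_len;

            /* two half-open ranges [a0,a1) and [b0,b1) overlap */
            if (a0 < b1 && b0 < a1) {
                return 1;
            }
        }
    }
    return 0;
}

int main(void)
{
    char *a = malloc(4096);
    char *b = malloc(4096);
    char *c = malloc(4096);
    struct iovec iov[4];

    /* same A, B, C, B pattern as the guest request */
    iov[0].iov_base = a; iov[0].iov_len = 4096;
    iov[1].iov_base = b; iov[1].iov_len = 4096;
    iov[2].iov_base = c; iov[2].iov_len = 4096;
    iov[3].iov_base = b; iov[3].iov_len = 4096;

    printf("overlap: %d\n", iov_has_overlap(iov, 4));

    free(a);
    free(b);
    free(c);
    return 0;
}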

So now my question:

Is QEMU risking data corruption when buffers overlap?  If IDE
guarantees that the buffers are filled in order, then we are doing it
wrong (at least when O_DIRECT is used).

Perhaps there is no ordering guarantee in IDE, Windows is doing
something crazy, and QEMU is within its rights to use preadv(2) like
this.

Stefan
