From: "Pavel Dovgalyuk" <dovgaluk@ispras.ru>
To: "'Vladimir Sementsov-Ogievskiy'" <vsementsov@virtuozzo.com>
Cc: kwolf@redhat.com, qemu-devel@nongnu.org, mreitz@redhat.com
Subject: RE: Race condition in overlayed qcow2?
Date: Thu, 20 Feb 2020 13:00:21 +0300
Message-ID: <003701d5e7d4$90bcc130$b2364390$@ru>
In-Reply-To: <af246719-910b-1394-2f18-b88e3daa9c81@virtuozzo.com>

> From: Vladimir Sementsov-Ogievskiy [mailto:vsementsov@virtuozzo.com]
> 20.02.2020 11:31, dovgaluk wrote:
> > Vladimir Sementsov-Ogievskiy wrote on 2020-02-19 19:07:
> >> 19.02.2020 17:32, dovgaluk wrote:
> >>> I encountered a problem with record/replay of QEMU execution and figured out the following, when
> >>> QEMU is started with one virtual disk connected to the qcow2 image with the 'snapshot' option applied.
> >>>
> >>> The patch d710cf575ad5fb3ab329204620de45bfe50caa53 "block/qcow2: introduce parallel subrequest handling in read and write"
> >>> introduces some kind of race condition, which causes differences in the data read from the disk.
> >>>
> >>> I detected this by adding the following code, which logs an IO operation checksum. This checksum
> >>> may differ between runs of the same recorded execution.
> >>>
> >>> logging in blk_aio_complete function:
> >>>          qemu_log("%"PRId64": blk_aio_complete\n", replay_get_current_icount());
> >>>          QEMUIOVector *qiov = acb->rwco.iobuf;
> >>>          if (qiov && qiov->iov) {
> >>>              size_t i, j;
> >>>              uint64_t sum = 0;
> >>>              int count = 0;
> >>>              for (i = 0 ; i < qiov->niov ; ++i) {
> >>>                  for (j = 0 ; j < qiov->iov[i].iov_len ; ++j) {
> >>>                      sum += ((uint8_t*)qiov->iov[i].iov_base)[j];
> >>>                      ++count;
> >>>                  }
> >>>              }
> >>>              qemu_log("--- iobuf offset %"PRIx64" len %x sum: %"PRIx64"\n", acb->rwco.offset, count, sum);
> >>>          }
> >>>
> >>> I tried to get rid of the aio task by patching qcow2_co_preadv_part:
> >>> ret = qcow2_co_preadv_task(bs, ret, cluster_offset, offset, cur_bytes, qiov, qiov_offset);
> >>>
> >>> That change fixed the bug, but I have no idea what to debug next to figure out the exact
> >>> reason for the failure.
> >>>
> >>> Do you have any ideas or hints?
> >>>
> >>
> >> Hi!
> >>
> >> Hmm, do you mean that a read from the disk may return wrong data? That would
> >> be very bad, of course :(
> >> Could you provide a reproducer, so that I can look at it and debug?
> >
> > It is just a winxp-32 image. I record the execution and replay it with the following command lines:
> >
> > qemu-system-i386 -icount shift=7,rr=record,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
> >
> > qemu-system-i386 -icount shift=7,rr=replay,rrfile=replay.bin -m 512M -drive file=xp.qcow2,if=none,id=device-34-file,snapshot -drive driver=blkreplay,if=none,image=device-34-file,id=device-34-driver -device ide-hd,drive=device-34-driver,bus=ide.0,id=device-34 -net none
> >
> > Replay stalls at some moment due to the non-determinism of the execution (probably caused by
> > the wrong data being read).
> 
> Hmm.. I tried it (with x86_64 qemu and a centos image). I let the first command run for some
> time, then pressed Ctrl+C. After that, replay.bin was 4M. Then I started the second command.
> It keeps running, neither failing nor finishing. Is that bad? What is the expected behavior and what is wrong?

The second command should finish. There is no replay introspection yet (in master), but you can
stop qemu with gdb and inspect the replay_state.current_icount field. It should increase with every
virtual CPU instruction executed. If that counter has stopped increasing, it means that replay has hung.
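
For example, a rough check with gdb (a hypothetical session; the process lookup and the printed
values are only illustrative):

    # hypothetical session, values are made up
    $ gdb -p $(pidof qemu-system-i386)
    (gdb) print replay_state.current_icount
    $1 = 123456789
    (gdb) continue
    ^C
    (gdb) print replay_state.current_icount
    $2 = 123456789

If the second value is unchanged after letting the guest run for a while, replay has hung.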

> >> What exactly is the case? Maybe you have other parallel aio
> >> operations to the same region?
> >
> > As far as I understand, all aio operations initiated by the IDE controller are performed
> > one-by-one.
> > I don't see anything else in the logs.
> >
> >> Ideas to experiment:
> >>
> >> 1. change QCOW2_MAX_WORKERS to 1 or to 2, will it help?
> >
> > 1 or 2 are ok, and 4 or 8 lead to the failures.
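
(For reference, the experiment from item 1 is just a change of one constant; a sketch, assuming
the layout where it is defined in block/qcow2.h with a default of 8:)

    /* block/qcow2.h -- sketch of the experiment, not a proposed patch */
    #define QCOW2_MAX_WORKERS 2   /* with 1 or 2 the corruption disappears, with 4 or 8 it shows up */
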
> >
> >> 2. understand what the case is in the code: is the read from one cluster or several,
> >> is it aligned, what is the type of the clusters, is encryption in use, or compression?
> >
> > There is no encryption and I think compression is not enabled either.
> > Clusters are read from the temporary overlay:
> >
> > blk_aio_prwv
> > blk_aio_read_entry
> > bdrv_co_preadv_part complete offset: 26300000 qiov_offset: 1c200 len: 1e00
> > bdrv_co_preadv_part complete offset: 24723e00 qiov_offset: 0 len: 1c200
> > bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> > bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> > bdrv_co_preadv_part complete offset: c0393e00 qiov_offset: 0 len: 1e000
> >
> >
> >> 3. understand what kind of data corruption it is. What do we read instead of the
> >> correct data? Just garbage, or maybe zeroes, or what..
> >
> > Most bytes are the same, but some are different:
> >
> > < 00 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00
> > < 46 49 4c 45 30 00 03 00 18 d1 33 02 00 00 00 00
> > < 01 00 01 00 38 00 01 00 68 01 00 00 00 04 00 00
> > < 00 00 00 00 00 00 00 00 04 00 00 00 9d 0e 00 00
> > < 02 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
> > ---
> >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 00
> >> 46 49 4c 45 30 00 03 00 86 78 35 03 00 00 00 00
> >> 01 00 01 00 38 00 01 00 60 01 00 00 00 04 00 00
> >> 00 00 00 00 00 00 00 00 04 00 00 00 a1 0e 00 00
> >> 04 00 00 00 00 00 00 00 10 00 00 00 60 00 00 00
> >
> > That is strange. I could think that it was caused by bugs in
> > deterministic CPU execution, but the first difference in the logs
> > occurs in a READ operation (I dump read/write buffers in blk_aio_complete).
> >
> 
> Aha, yes, looks strange.
> 
> Then next steps:
> 
> 1. Does the problem hit the same offset every time?

Yes, almost the same offset, almost the same phase of the execution.

> 2. Do we write to this region before this strange read?

No.

> 2.1. If yes, we need to check that we read what we write.. You say you dump buffers
> in blk_aio_complete... I think it would be more reliable to dump at the start of
> bdrv_co_pwritev and at the end of bdrv_co_preadv. Also, the guest may modify its buffers
> during the operation, which would be strange but possible.

I dumped every write in file-posix.c handle_aiocb_rw_linear and qemu_pwritev
and found no difference between executions.
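
The dump is essentially the same checksum loop as in blk_aio_complete above; roughly this kind
of helper (hypothetical name and signature, not code that exists in the tree), called with the
buffer and offset of each request:

    /* Hypothetical helper used for the experiment -- not upstream code.
     * It sums the bytes of a linear buffer and logs the result together with
     * the offset, so that logs from two runs of the same recording can be
     * diffed. It relies on qemu_log() being available in the file. */
    static void dump_buf_checksum(const char *tag, uint64_t offset,
                                  const uint8_t *buf, size_t len)
    {
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i < len; i++) {
            sum += buf[i];
        }
        qemu_log("%s offset %" PRIx64 " len %zx sum %" PRIx64 "\n",
                 tag, offset, len, sum);
    }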

> 2.2 If not, hmm...

Exactly.

Pavel Dovgalyuk


