All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Peter Maydell <peter.maydell@linaro.org>
Cc: Juan Quintela <quintela@redhat.com>,
	QEMU Developers <qemu-devel@nongnu.org>,
	Laurent Vivier <lvivier@redhat.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [Qemu-devel] [PULL 16/40] migration: allow dst vm pause on postcopy
Date: Tue, 5 Jun 2018 15:48:09 +0800	[thread overview]
Message-ID: <20180605074809.GE9216@xz-mi> (raw)
In-Reply-To: <CAFEAcA_cpAT27=DQj6qyvoP8u=zUgzG2vidcgN=vP1KLDn8LMA@mail.gmail.com>

On Mon, Jun 04, 2018 at 02:49:58PM +0100, Peter Maydell wrote:
> On 16 May 2018 at 00:39, Juan Quintela <quintela@redhat.com> wrote:
> > From: Peter Xu <peterx@redhat.com>
> >
> > When there is IO error on the incoming channel (e.g., network down),
> > instead of bailing out immediately, we allow the dst vm to switch to the
> > new POSTCOPY_PAUSE state. Currently it is still simple - it waits the
> > new semaphore, until someone poke it for another attempt.
> >
> > One note is that here on ram loading thread we cannot detect the
> > POSTCOPY_ACTIVE state, but we need to detect the more specific
> > POSTCOPY_INCOMING_RUNNING state, to make sure we have already loaded all
> > the device states.
> >
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > Message-Id: <20180502104740.12123-5-peterx@redhat.com>
> > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > ---
> 
> Hi; Coverity (CID 1391289) points out what it thinks is an issue in
> this commit. I think it's wrong, but it does leave me uncertain
> whether we have the locking correct here...

Hi, Peter,

Yeah actually this confused me a bit too when I wanted to fix this...

> 
> > +/* Return true if we should continue the migration, or false. */
> > +static bool postcopy_pause_incoming(MigrationIncomingState *mis)
> > +{
> > +    trace_postcopy_pause_incoming();
> > +
> > +    migrate_set_state(&mis->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > +                      MIGRATION_STATUS_POSTCOPY_PAUSED);
> > +
> > +    assert(mis->from_src_file);
> > +    qemu_file_shutdown(mis->from_src_file);
> > +    qemu_fclose(mis->from_src_file);
> > +    mis->from_src_file = NULL;
> 
> In postcopy_pause_incoming() we always set mis->from_src_file to NULL...
> 
> > +
> > +    assert(mis->to_src_file);
> > +    qemu_file_shutdown(mis->to_src_file);
> > +    qemu_mutex_lock(&mis->rp_mutex);
> > +    qemu_fclose(mis->to_src_file);
> > +    mis->to_src_file = NULL;
> > +    qemu_mutex_unlock(&mis->rp_mutex);
> > +
> > +    error_report("Detected IO failure for postcopy. "
> > +                 "Migration paused.");
> > +
> > +    while (mis->state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
> > +        qemu_sem_wait(&mis->postcopy_pause_sem_dst);
> > +    }
> > +
> > +    trace_postcopy_pause_incoming_continued();
> > +
> > +    return true;
> > +}
> > +
> >  static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> >  {
> >      uint8_t section_type;
> >      int ret = 0;
> >
> > +retry:
> >      while (true) {
> >          section_type = qemu_get_byte(f);
> >
> > @@ -2104,6 +2145,24 @@ static int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
> >  out:
> >      if (ret < 0) {
> >          qemu_file_set_error(f, ret);
> > +
> > +        /*
> > +         * Detect whether it is:
> > +         *
> > +         * 1. postcopy running (after receiving all device data, which
> > +         *    must be in POSTCOPY_INCOMING_RUNNING state.  Note that
> > +         *    POSTCOPY_INCOMING_LISTENING is still not enough, it's
> > +         *    still receiving device states).
> > +         * 2. network failure (-EIO)
> > +         *
> > +         * If so, we try to wait for a recovery.
> > +         */
> > +        if (postcopy_state_get() == POSTCOPY_INCOMING_RUNNING &&
> > +            ret == -EIO && postcopy_pause_incoming(mis)) {
> > +            /* Reset f to point to the newly created channel */
> > +            f = mis->from_src_file;
> 
> ...but here we set f to mis->from_src_file, which Coverity
> thinks must be NULL...
> 
> > +            goto retry;
> 
> ...and then we goto the 'retry' label, which will always
> dereference the NULL pointer and crash in qemu_get_byte().
> 
> > +        }
> >      }
> >      return ret;
> >  }
> 
> Looking at the wider code, I think what Coverity has not spotted is
> that postcopy_pause_incoming() blocks on the semaphore, and the
> code that wakes it up in migration_fd_process_incoming() will
> set mis->from_src_file to something non-NULL before it posts that
> semaphore.
> 
> However, I'm not sure about the locking being used here. There's
> no lock held while postcopy_pause_incoming() does the "set state
> to paused and then clear mis->from_src_file", so what prevents
> this interleaving of execution of the two threads?
> 
>   postcopy_pause_incoming()        migration_fd_process_incoming()
>     + set state to PAUSED
>                                      + find that state is PAUSED
>                                      + mis->from_src_file = f
>     + mis->from_src_file = NULL
>     + wait on semaphore
>                                      + post semaphare
> 
> ?

Thanks for raising this question up.  I suspect here I should postpone
the status update in postcopy_pause_incoming() to be after clearing
the from_src_file var.  Then we'll make sure the
migration_fd_process_incoming() will always update the correct new
file handle after it was cleared in the other thread.

> 
> There are also a couple of other things Coverity thinks might
> be data race conditions (CID 1391295 and CID 1391288) that you
> might want to look at, though I suspect they are false-positives
> (access occurs before thread create of the thread the mutex
> is providing protection against).

Could you kindly let me know if there is any way for me to lookup
these CIDs?  E.g., I searched "CID 1391295 QEMU Coverity" on Google
but I can't find any link.  Any preferred way to do this?

(I think a stupid way is to lookup the CID in every Coverity reports,
 but I guess there is a faster way)

Thanks,

-- 
Peter Xu

  reply	other threads:[~2018-06-05  7:48 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-15 23:39 [Qemu-devel] [PULL 00/40] Migration PULL requset (take 2) Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 01/40] migration: fix saving normal page even if it's been compressed Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 02/40] tests: Add migration precopy test Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 03/40] tests: Migration ppc now inlines its program Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 04/40] migration: Set error state in case of error Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 05/40] migration: Introduce multifd_recv_new_channel() Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 06/40] migration: terminate_* can be called for other threads Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 07/40] migration: Be sure all recv channels are created Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 08/40] migration: Export functions to create send channels Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 09/40] migration: Create multifd channels Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 10/40] migration: Delay start of migration main routines Juan Quintela
2018-05-18  8:59   ` Kevin Wolf
2018-05-18 10:34     ` Dr. David Alan Gilbert
2018-05-18 12:14       ` Kevin Wolf
2018-05-22 16:20         ` Kevin Wolf
2018-05-23  6:29           ` Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 11/40] migration: Transmit initial package through the multifd channels Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 12/40] migration: Define MultifdRecvParams sooner Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 13/40] migration: let incoming side use thread context Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 14/40] migration: new postcopy-pause state Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 15/40] migration: implement "postcopy-pause" src logic Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 16/40] migration: allow dst vm pause on postcopy Juan Quintela
2018-06-04 13:49   ` Peter Maydell
2018-06-05  7:48     ` Peter Xu [this message]
2018-06-05 10:45       ` Peter Xu
2018-05-15 23:39 ` [Qemu-devel] [PULL 17/40] migration: allow src return path to pause Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 18/40] migration: allow fault thread " Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 19/40] qmp: hmp: add migrate "resume" option Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 20/40] migration: rebuild channel on source Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 21/40] migration: new state "postcopy-recover" Juan Quintela
2018-05-15 23:39 ` [Qemu-devel] [PULL 22/40] migration: wakeup dst ram-load-thread for recover Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 23/40] migration: new cmd MIG_CMD_RECV_BITMAP Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 24/40] migration: new message MIG_RP_MSG_RECV_BITMAP Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 25/40] migration: new cmd MIG_CMD_POSTCOPY_RESUME Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 26/40] migration: new message MIG_RP_MSG_RESUME_ACK Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 27/40] migration: introduce SaveVMHandlers.resume_prepare Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 28/40] migration: synchronize dirty bitmap for resume Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 29/40] migration: setup ramstate " Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 30/40] migration: final handshake for the resume Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 31/40] migration: init dst in migration_object_init too Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 32/40] qmp/migration: new command migrate-recover Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 33/40] hmp/migration: add migrate_recover command Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 34/40] migration: introduce lock for to_dst_file Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 35/40] migration/qmp: add command migrate-pause Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 36/40] migration/hmp: add migrate_pause command Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 37/40] migration: update docs Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 38/40] migration: update index field when delete or qsort RDMALocalBlock Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 39/40] migration: Textual fixups for blocktime Juan Quintela
2018-05-15 23:40 ` [Qemu-devel] [PULL 40/40] Migration+TLS: Fix crash due to double cleanup Juan Quintela
2018-05-17 10:59 ` [Qemu-devel] [PULL 00/40] Migration PULL requset (take 2) Peter Maydell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180605074809.GE9216@xz-mi \
    --to=peterx@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.