qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Juan Quintela <quintela@redhat.com>
Cc: qemu-devel@nongnu.org, Fabiano Rosas <farosas@suse.de>
Subject: Re: [PATCH v3 03/10] migration: Refactor error handling in source return path
Date: Thu, 5 Oct 2023 15:35:05 -0400	[thread overview]
Message-ID: <ZR8P6Qul8UD0JeBT@x1n> (raw)
In-Reply-To: <87wmw1o4v7.fsf@secure.mitica>

On Thu, Oct 05, 2023 at 10:22:52AM +0200, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> > rp_state.error was a boolean used to show error happened in return path
> > thread.  That's not only duplicating error reporting (migrate_set_error),
> > but also not good enough in that we only do error_report() and set it to
> > true, we never can keep a history of the exact error and show it in
> > query-migrate.
> >
> > To make this better, a few things done:
> >
> >   - Use error_setg() rather than error_report() across the whole lifecycle
> >     of return path thread, keeping the error in an Error*.
> 
> Good.
> 
> >   - Use migrate_set_error() to apply that captured error to the global
> >     migration object when error occured in this thread.
> 
> Good.
> 
> >   - With above, no need to have mark_source_rp_bad(), remove it, alongside
> >     with rp_state.error itself.
> 
> Good.
> 
> >  uint64_t ram_pagesize_summary(void);
> > -int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
> > +int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
> > +                         Error **errp);
> 
> 
> good.
> 
> > @@ -1793,37 +1782,36 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
> >       */
> >      if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
> >          !QEMU_IS_ALIGNED(len, our_host_ps)) {
> > -        error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
> > -                     " len: %zd", __func__, start, len);
> > -        mark_source_rp_bad(ms);
> > +        error_setg(errp, "MIG_RP_MSG_REQ_PAGES: Misaligned page request, start:"
> > +                   RAM_ADDR_FMT " len: %zd", start, len);
> >          return;
> >      }
> >  
> > -    if (ram_save_queue_pages(rbname, start, len)) {
> > -        mark_source_rp_bad(ms);
> > -    }
> > +    ram_save_queue_pages(rbname, start, len, errp);
> 
> ram_save_queue_pages() returns an int.
> I think this function should return an int.

Phil suggested something similar for the other patch, instead of also let
this function return int, I'll add one more patch to let it return boolean
to show whether there's an error, keeping the real error in errp.

> 
> Next is independent of this patch:
> 
> > -static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
> > +static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
> > +                                         Error **errp)
> >  {
> >      RAMBlock *block = qemu_ram_block_by_name(block_name);
> >  
> >      if (!block) {
> > -        error_report("%s: invalid block name '%s'", __func__, block_name);
> > +        error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
> > +                   block_name);
> >          return -EINVAL;
> 
> We sent -EINVAL.
> 
> >      }
> >  
> >      /* Fetch the received bitmap and refresh the dirty bitmap */
> > -    return ram_dirty_bitmap_reload(s, block);
> > +    return ram_dirty_bitmap_reload(s, block, errp);
> >  }
> >  
> > -static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> > +static int migrate_handle_rp_resume_ack(MigrationState *s,
> > +                                        uint32_t value, Error **errp)
> >  {
> >      trace_source_return_path_thread_resume_ack(value);
> >  
> >      if (value != MIGRATION_RESUME_ACK_VALUE) {
> > -        error_report("%s: illegal resume_ack value %"PRIu32,
> > -                     __func__, value);
> > +        error_setg(errp, "illegal resume_ack value %"PRIu32, value);
> >          return -1;
> 
> And here -1.
> 
> On both callers we just check if it is different from zero.  We never
> use the return value as errno, so I think we should move to -1, if there
> is an error, that is what errp is for.

Right.  I'll switch all rp-return thread paths to use boolean as return, as
long as there's errp.

> 
> 
> > -/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
> > -static int await_return_path_close_on_source(MigrationState *ms)
> > +static void await_return_path_close_on_source(MigrationState *ms)
> >  {
> > -    int ret;
> > -
> >      if (!ms->rp_state.rp_thread_created) {
> > -        return 0;
> > +        return;
> >      }
> >  
> >      trace_migration_return_path_end_before();
> > @@ -2060,18 +2050,10 @@ static int await_return_path_close_on_source(MigrationState *ms)
> >          }
> >      }
> >  
> > -    trace_await_return_path_close_on_source_joining();
> >      qemu_thread_join(&ms->rp_state.rp_thread);
> >      ms->rp_state.rp_thread_created = false;
> > -    trace_await_return_path_close_on_source_close();
> > -
> > -    ret = ms->rp_state.error;
> > -    ms->rp_state.error = false;
> > -
> >      migration_release_dst_files(ms);
> > -
> > -    trace_migration_return_path_end_after(ret);
> > -    return ret;
> > +    trace_migration_return_path_end_after();
> >  }
> >  
> >  static inline void
> > @@ -2367,7 +2349,10 @@ static void migration_completion(MigrationState *s)
> >          goto fail;
> >      }
> >  
> > -    if (await_return_path_close_on_source(s)) {
> > +    await_return_path_close_on_source(s);
> > +
> > +    /* If return path has error, should have been set here */
> > +    if (migrate_has_error(s)) {
> >          goto fail;
> >      }
> 
> In general, I think this is bad.  We are moving for
> 
> int foo(..)
> {
> 
> }
> 
> ....
> 
> if (foo()) {
>      goto fail;
> }
> 
> to:
> 
> void foo(..)
> {
> 
> }
> 
> ....
> 
> foo();
> 
> if (bar()) {
>      goto fail;
> }
> 
> I would preffer to move the other way around.  Move the error
> synchrconously. My plan is that at some point in time
> qemu_file_get_error() dissapears, i.e. we return the error when we
> receive it and we handle it synchronously.
> 
> And yes, that is a something will take a lot of time, but I will hope we
> move on that direction, not in trusting more setting internal errors,
> use void functions and then check with yet another functions.

IIUC "synchronous" here means we can have the Error* returned from
pthread_join(), but I worry that might be too late, that the real return
path Error* doesn't get its chance to set into MigrationState.error because
there can already be some error set.

I can at least move that check into await_return_path_close_on_source()
again, so it keeps returning something.  Does that sound like okay for now?

> 
> 
> On top of your changes:
> 
> > -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> > +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
> >  {
> >      int ret = -EINVAL;
> >      /* from_dst_file is always valid because we're within rp_thread */
> > @@ -4163,8 +4165,8 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> >      trace_ram_dirty_bitmap_reload_begin(block->idstr);
> >  
> >      if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
> > -        error_report("%s: incorrect state %s", __func__,
> > -                     MigrationStatus_str(s->state));
> > +        error_setg(errp, "Reload bitmap in incorrect state %s",
> > +                   MigrationStatus_str(s->state));
> >          return -EINVAL;
> 
> return -1
> 
> same for the rest of the cases. Callers only check for != 0, and if you
> want details, you need to look at errp.

I'll let them return boolean here too to be consistent.

Thanks,

-- 
Peter Xu



  reply	other threads:[~2023-10-05 19:37 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-04 22:02 [PATCH v3 00/10] migration: Better error handling in rp thread, allow failures in recover Peter Xu
2023-10-04 22:02 ` [PATCH v3 01/10] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-10-05  7:28   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 02/10] migration: Introduce migrate_has_error() Peter Xu
2023-10-05  7:30   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 03/10] migration: Refactor error handling in source return path Peter Xu
2023-10-05  6:11   ` Philippe Mathieu-Daudé
2023-10-05 16:05     ` Peter Xu
2023-10-08 11:39       ` Philippe Mathieu-Daudé
2023-10-05  8:22   ` Juan Quintela
2023-10-05 19:35     ` Peter Xu [this message]
2023-10-05 12:57   ` Fabiano Rosas
2023-10-05 19:35     ` Peter Xu
2023-10-04 22:02 ` [PATCH v3 04/10] migration: Deliver return path file error to migrate state too Peter Xu
2023-10-05  7:32   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 05/10] qemufile: Always return a verbose error Peter Xu
2023-10-05  7:42   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 06/10] migration: Remember num of ramblocks to sync during recovery Peter Xu
2023-10-05  7:43   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 07/10] migration: Add migration_rp_wait|kick() Peter Xu
2023-10-05  7:49   ` Juan Quintela
2023-10-05 20:47     ` Peter Xu
2023-10-04 22:02 ` [PATCH v3 08/10] migration: Allow network to fail even during recovery Peter Xu
2023-10-05 13:25   ` Fabiano Rosas
2023-10-04 22:02 ` [PATCH v3 09/10] migration: Allow RECOVER->PAUSED convertion for dest qemu Peter Xu
2023-10-05  8:24   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 10/10] tests/migration-test: Add a test for postcopy hangs during RECOVER Peter Xu
2023-10-05 13:24   ` Fabiano Rosas
2023-10-05 13:37     ` Fabiano Rosas
2023-10-05 20:55       ` Peter Xu
2023-10-05 21:10         ` Fabiano Rosas
2023-10-05 21:44           ` Peter Xu
2023-10-05 22:01             ` Fabiano Rosas
2023-10-09 16:50               ` Fabiano Rosas
2023-10-10 16:00                 ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZR8P6Qul8UD0JeBT@x1n \
    --to=peterx@redhat.com \
    --cc=farosas@suse.de \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).