qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Juan Quintela <quintela@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: qemu-devel@nongnu.org,  Fabiano Rosas <farosas@suse.de>
Subject: Re: [PATCH v3 03/10] migration: Refactor error handling in source return path
Date: Thu, 05 Oct 2023 10:22:52 +0200	[thread overview]
Message-ID: <87wmw1o4v7.fsf@secure.mitica> (raw)
In-Reply-To: <20231004220240.167175-4-peterx@redhat.com> (Peter Xu's message of "Wed, 4 Oct 2023 18:02:33 -0400")

Peter Xu <peterx@redhat.com> wrote:
> rp_state.error was a boolean used to show error happened in return path
> thread.  That's not only duplicating error reporting (migrate_set_error),
> but also not good enough in that we only do error_report() and set it to
> true, we never can keep a history of the exact error and show it in
> query-migrate.
>
> To make this better, a few things done:
>
>   - Use error_setg() rather than error_report() across the whole lifecycle
>     of return path thread, keeping the error in an Error*.

Good.

>   - Use migrate_set_error() to apply that captured error to the global
>     migration object when error occured in this thread.

Good.

>   - With above, no need to have mark_source_rp_bad(), remove it, alongside
>     with rp_state.error itself.

Good.

>  uint64_t ram_pagesize_summary(void);
> -int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len);
> +int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
> +                         Error **errp);


good.

> @@ -1793,37 +1782,36 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
>       */
>      if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
>          !QEMU_IS_ALIGNED(len, our_host_ps)) {
> -        error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
> -                     " len: %zd", __func__, start, len);
> -        mark_source_rp_bad(ms);
> +        error_setg(errp, "MIG_RP_MSG_REQ_PAGES: Misaligned page request, start:"
> +                   RAM_ADDR_FMT " len: %zd", start, len);
>          return;
>      }
>  
> -    if (ram_save_queue_pages(rbname, start, len)) {
> -        mark_source_rp_bad(ms);
> -    }
> +    ram_save_queue_pages(rbname, start, len, errp);

ram_save_queue_pages() returns an int.
I think this function should return an int.

Next is independent of this patch:

> -static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name)
> +static int migrate_handle_rp_recv_bitmap(MigrationState *s, char *block_name,
> +                                         Error **errp)
>  {
>      RAMBlock *block = qemu_ram_block_by_name(block_name);
>  
>      if (!block) {
> -        error_report("%s: invalid block name '%s'", __func__, block_name);
> +        error_setg(errp, "MIG_RP_MSG_RECV_BITMAP has invalid block name '%s'",
> +                   block_name);
>          return -EINVAL;

We sent -EINVAL.

>      }
>  
>      /* Fetch the received bitmap and refresh the dirty bitmap */
> -    return ram_dirty_bitmap_reload(s, block);
> +    return ram_dirty_bitmap_reload(s, block, errp);
>  }
>  
> -static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
> +static int migrate_handle_rp_resume_ack(MigrationState *s,
> +                                        uint32_t value, Error **errp)
>  {
>      trace_source_return_path_thread_resume_ack(value);
>  
>      if (value != MIGRATION_RESUME_ACK_VALUE) {
> -        error_report("%s: illegal resume_ack value %"PRIu32,
> -                     __func__, value);
> +        error_setg(errp, "illegal resume_ack value %"PRIu32, value);
>          return -1;

And here -1.

On both callers we just check if it is different from zero.  We never
use the return value as errno, so I think we should move to -1, if there
is an error, that is what errp is for.


> -/* Returns 0 if the RP was ok, otherwise there was an error on the RP */
> -static int await_return_path_close_on_source(MigrationState *ms)
> +static void await_return_path_close_on_source(MigrationState *ms)
>  {
> -    int ret;
> -
>      if (!ms->rp_state.rp_thread_created) {
> -        return 0;
> +        return;
>      }
>  
>      trace_migration_return_path_end_before();
> @@ -2060,18 +2050,10 @@ static int await_return_path_close_on_source(MigrationState *ms)
>          }
>      }
>  
> -    trace_await_return_path_close_on_source_joining();
>      qemu_thread_join(&ms->rp_state.rp_thread);
>      ms->rp_state.rp_thread_created = false;
> -    trace_await_return_path_close_on_source_close();
> -
> -    ret = ms->rp_state.error;
> -    ms->rp_state.error = false;
> -
>      migration_release_dst_files(ms);
> -
> -    trace_migration_return_path_end_after(ret);
> -    return ret;
> +    trace_migration_return_path_end_after();
>  }
>  
>  static inline void
> @@ -2367,7 +2349,10 @@ static void migration_completion(MigrationState *s)
>          goto fail;
>      }
>  
> -    if (await_return_path_close_on_source(s)) {
> +    await_return_path_close_on_source(s);
> +
> +    /* If return path has error, should have been set here */
> +    if (migrate_has_error(s)) {
>          goto fail;
>      }

In general, I think this is bad.  We are moving for

int foo(..)
{

}

....

if (foo()) {
     goto fail;
}

to:

void foo(..)
{

}

....

foo();

if (bar()) {
     goto fail;
}

I would preffer to move the other way around.  Move the error
synchrconously. My plan is that at some point in time
qemu_file_get_error() dissapears, i.e. we return the error when we
receive it and we handle it synchronously.

And yes, that is a something will take a lot of time, but I will hope we
move on that direction, not in trusting more setting internal errors,
use void functions and then check with yet another functions.


On top of your changes:

> -int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
> +int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block, Error **errp)
>  {
>      int ret = -EINVAL;
>      /* from_dst_file is always valid because we're within rp_thread */
> @@ -4163,8 +4165,8 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
>      trace_ram_dirty_bitmap_reload_begin(block->idstr);
>  
>      if (s->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
> -        error_report("%s: incorrect state %s", __func__,
> -                     MigrationStatus_str(s->state));
> +        error_setg(errp, "Reload bitmap in incorrect state %s",
> +                   MigrationStatus_str(s->state));
>          return -EINVAL;

return -1

same for the rest of the cases. Callers only check for != 0, and if you
want details, you need to look at errp.

See the nice series for migration/rdma.c for why this is better (and
more consistent).

Rest of the patch is very nice.

Thanks, Juan.



  parent reply	other threads:[~2023-10-05  8:24 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-04 22:02 [PATCH v3 00/10] migration: Better error handling in rp thread, allow failures in recover Peter Xu
2023-10-04 22:02 ` [PATCH v3 01/10] migration: Display error in query-migrate irrelevant of status Peter Xu
2023-10-05  7:28   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 02/10] migration: Introduce migrate_has_error() Peter Xu
2023-10-05  7:30   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 03/10] migration: Refactor error handling in source return path Peter Xu
2023-10-05  6:11   ` Philippe Mathieu-Daudé
2023-10-05 16:05     ` Peter Xu
2023-10-08 11:39       ` Philippe Mathieu-Daudé
2023-10-05  8:22   ` Juan Quintela [this message]
2023-10-05 19:35     ` Peter Xu
2023-10-05 12:57   ` Fabiano Rosas
2023-10-05 19:35     ` Peter Xu
2023-10-04 22:02 ` [PATCH v3 04/10] migration: Deliver return path file error to migrate state too Peter Xu
2023-10-05  7:32   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 05/10] qemufile: Always return a verbose error Peter Xu
2023-10-05  7:42   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 06/10] migration: Remember num of ramblocks to sync during recovery Peter Xu
2023-10-05  7:43   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 07/10] migration: Add migration_rp_wait|kick() Peter Xu
2023-10-05  7:49   ` Juan Quintela
2023-10-05 20:47     ` Peter Xu
2023-10-04 22:02 ` [PATCH v3 08/10] migration: Allow network to fail even during recovery Peter Xu
2023-10-05 13:25   ` Fabiano Rosas
2023-10-04 22:02 ` [PATCH v3 09/10] migration: Allow RECOVER->PAUSED convertion for dest qemu Peter Xu
2023-10-05  8:24   ` Juan Quintela
2023-10-04 22:02 ` [PATCH v3 10/10] tests/migration-test: Add a test for postcopy hangs during RECOVER Peter Xu
2023-10-05 13:24   ` Fabiano Rosas
2023-10-05 13:37     ` Fabiano Rosas
2023-10-05 20:55       ` Peter Xu
2023-10-05 21:10         ` Fabiano Rosas
2023-10-05 21:44           ` Peter Xu
2023-10-05 22:01             ` Fabiano Rosas
2023-10-09 16:50               ` Fabiano Rosas
2023-10-10 16:00                 ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wmw1o4v7.fsf@secure.mitica \
    --to=quintela@redhat.com \
    --cc=farosas@suse.de \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).