Re: [PATCH 0/3] migration: Error fixes and improvements

From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, Markus Armbruster <armbru@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [PATCH 0/3] migration: Error fixes and improvements
Date: Fri, 21 Nov 2025 09:38:02 -0300
Message-ID: <87bjkvftn9.fsf@suse.de>
In-Reply-To: <aR4vdRcORY4em3yB@x1.local>

Peter Xu <peterx@redhat.com> writes:

> On Wed, Nov 19, 2025 at 08:45:39AM +0100, Markus Armbruster wrote:
>
> [...]
>
>> The hairy part is the background task.
>> 
>> I believe it used to simply do its job, reporting errors to stderr along
>> the way, until it either succeeded or failed.  The errors reported made
>> success / failure "obvious" for users.
>> 
>> This can report multiple errors, which can be confusing.
>> 
>> Worse, it was no good for management applications.  These need to
>> observe migration as a state machine, with final success and error
>> states, where the error state comes with an indication of what went
>> wrong.  So we made migration store the first of certain errors in the
>> migration state in addition to reporting to stderr.
>> 
>> "First", because we store only when the state doesn't already have an
>> error.  "Certain", because I doubt we do it for all errors we report.
>> 
>> Compare this to how jobs solve this problem.  These are a much, much
>> later invention, and designed for management applications from the
>> start[*].  A job is a state machine.  Management applications can
>> observe and control the state.  Errors are not supposed to be reported,
>> they should be fed to the state machine, which goes into an error state
>> then.  The job is not supposed to do actual work in an error state.
>> Therefore, no further errors should be possible.  When something goes
>> wrong, we get a single error, stored in the job state, where the
>> management application can find it.
>> 
>> Migration is also a state machine, and we long ago retrofitted the means
>> for management applications to observe and control the state.  What we
>> haven't done is the disciplined feeding of errors to the state machine.
>> We can still get multiple errors.  We store the first of certain errors
>> where the management application can find it, but whether that error
>> suffices to explain what went wrong is a crap shot.  As long as that's
>> the case, we need to spew the other errors to stderr, where a human can
>> find them.
>
> Since the above once more raises the possibility of reusing the Jobs idea,
> I tried to list things explicitly this time: why I think doing so would be
> challenging and maybe not worthwhile (?), though I might be wrong.  I
> attached the list at the end of this email, mostly for my own future
> reference; please feel free to comment, or to ignore all of it!
> IMHO it's not directly relevant to the error reporting issues.
>
> IMHO rewriting migration with Jobs will not help much with error reporting,
> because the challenge of refactoring on the migration side is not the "Jobs"
> interface, but migration's internals.  Say, even if migration provided
> a "job", it would be the "job" impl that reports errors badly, not the Jobs
> interface.. the "job" impl would need to manage quite a few threads on its
> own, making sure errors are properly reported at least to the "job"
> interface.
>
> That said, I totally agree we should try to improve error reporting in
> migration.. with / without Jobs.
>
> [...]
>
>> > Maybe I should ping Vladimir on his recent work here?
>> >
>> > https://lore.kernel.org/r/20251028231347.194844-1-vsementsov@yandex-team.ru
>> >
>> > That'll be part of such cleanup effort (and yes unfortunately many
>> > migration related cleanups will need a lot of code churns...).
>> 
>> I know...
>> 
>> Can we afford modest efforts to reduce the mess one step at a time?
>
> Yes, I'll try to follow up on that.
>
> [...]
>
>> [*] If the job abstraction had been available in time, migration would
>> totally be a job.  There's no *design* reason for it being not a job.
>> Plenty of implementation and backward compatibility reasons, though.
>
> There might be something in common between the Jobs that the block layer
> uses and a migration process.  If so, we can provide CommonJob and make
> MigrationJob and BlockJobs depend on it.
>
> However, I sincerely don't know how much common functionality there would
> be.  IOW, I doubt, even in an imaginary world where we could go back to
> when Jobs was designed, whether we would make migration a Job too (note!
> snapshots are definitely too simple a migration scenario..).  Is it possible
> that after evaluation we still wouldn't?  I don't know, but I think it's
> possible.
>
> Thanks!
> Peter
>
>
>
>
> Possible challenges of adopting Jobs in migration flow
> ======================================================
>
> - Many properties defined by Jobs don't directly suit migration
>
>   - JobStatus is not directly suitable for migration purposes.  There are
>     some JobStatus values I can't think of any use for
>     (e.g. JOB_STATUS_WAITING, JOB_STATUS_PENDING, which is fine, because we
>     can simply not use them), but there are other statuses that migration
>     needs that aren't available.  Introducing them seems overkill for the
>     block layer's use case.
>
>   - Similarly for JobVerb.  E.g. JOB_VERB_CHANGE doesn't seem to apply to
>     any concept in migration, while quite a few others are missing
>     (e.g. JOB_VERB_SET_DOWNTIME, JOB_VERB_POSTCOPY_START, and more).
>
>   - Similarly, JobInfo reports current-progress (which is required, not
>     optional), which may make perfect sense for block jobs.  Migration,
>     OTOH, is a convergence-triggered process, or user-triggered (in the
>     case of postcopy).  It doesn't have quantified progress, only
>     "COMPLETED" / "IN_PROGRESS".
>
>   - Another very major example that I have discussed a few times
>     previously: Jobs are closely tied to AioContext, which migration
>     doesn't have; meanwhile migration is moving even further away from an
>     event-driven model..  See:
>
>     https://lore.kernel.org/all/20251022192612.2737648-1-peterx@redhat.com/#t
>
>   There are just too many examples showing that Jobs are defined almost only
>   for the block layer.. e.g. job-finalize (which may not make much sense in
>   a migration context anyway..) mentions finalizing of graph changes, which
>   also doesn't exist in the migration process.
>
>   So if we somehow rewrite migration with Jobs, or design Jobs with
>   migration in mind, Jobs may become very bloated, containing both
>   migration and block layer requirements.
>
> - Migration involves "two" QEMU instances instead of one
>
>   I'm guessing existing Jobs operations are not like this, and providing
>   such mechanisms in "Jobs" only for migration may introduce unnecessary
>   code that the block layer will never use.
>
>   E.g. postcopy migration ties the two QEMU instances together to
>   represent one VM instance.  I do not have a clear picture in mind yet of
>   how we can manage that if we see it as two separate Jobs, one on each
>   side, what happens if each side operates on its own Job with different
>   purposes, and how we should connect the two Jobs to say they're related
>   (or maybe we don't need to?).
>
> - More challenges on dest QEMU (VM loader) than src QEMU
>
>   Unlike on the src side, the dest QEMU, when in an incoming state, is not
>   a VM at all yet, but waiting to receive the migration data to become a
>   working VM. It's not a generic long term process, but a pure listening
>   port of QEMU where QEMU can do nothing without this "job" being
>   completed..
>
>   If we think about CPR it's even more complicated, because we essentially
>   require part of the incoming process to happen before almost everything
>   else.. it may even include monitors being initialized.
>
> - Deep integration with other subsystems
>
>   Migration is deeply integrated into many other subsystems (auto-converge
>   being able to throttle vCPUs, RAM being able to ignore empty pages
>   reported from balloons, per-module dirty tracking, etc.), so we're not
>   sure whether there'll be limitations in Jobs (as designed with the block
>   layer in mind) that will make such a transition harder.
>
>   For example, we at least want to make sure Jobs won't hold simple locks
>   while running migration, since migration code may invoke something else
>   that tries to re-take the Jobs lock and so deadlock.
>
>   Or, since migration nowadays runs with quite a few threads concurrently,
>   there is the question of whether the main migration Job can always
>   properly synchronize all of them with no problem (maybe yes, but I just
>   don't know Jobs well enough to say).  This also relates to how big a role
>   AioContext plays in the core Jobs idea, and whether it can work well in a
>   complicated threaded environment.

Thanks for looking into this, Peter! I'm saving it for future reference
as well! It was on my todo list to make such an analysis.

I hope Markus can comment on some of those and maybe we can still find a
way to converge, but I think I agree that migration is (at this point) a
little too particular to be retrofitted (which I'd be very much in favor
of, if it were at all feasible).

(wondering what happened in QEMU historically that we devised so many
well designed interfaces, but chose to leave migration aside altogether)

(maybe this right here is what happened)
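
To make the comparison Markus draws above concrete, here is a minimal
sketch of the two disciplines he describes: storing only the first error
in the state, and feeding an error to the state machine so that no further
work (and so no further error) happens afterwards.  All names in it
(MigSketch, sketch_set_error, sketch_step) are hypothetical and chosen for
illustration; this is not QEMU's migration or Job code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical types and names; not QEMU's MigrationState or Job API. */
typedef enum { SKETCH_RUNNING, SKETCH_COMPLETED, SKETCH_FAILED } SketchStatus;

typedef struct {
    SketchStatus status;
    bool has_error;
    char error[128];
} MigSketch;

static void sketch_init(MigSketch *s)
{
    s->status = SKETCH_RUNNING;
    s->has_error = false;
    s->error[0] = '\0';
}

/* "First error wins": store msg only if nothing is stored yet. */
static bool sketch_set_error(MigSketch *s, const char *msg)
{
    assert(s && msg);
    if (s->has_error) {
        return false;           /* a later error is dropped here */
    }
    snprintf(s->error, sizeof(s->error), "%s", msg);
    s->has_error = true;
    return true;
}

/*
 * Job-style step: a failure is fed to the state machine, which enters
 * the failed state.  No work happens in a final state, so this path
 * cannot produce a second error.
 */
static bool sketch_step(MigSketch *s, const char *failure)
{
    if (s->status != SKETCH_RUNNING) {
        return false;
    }
    if (failure) {
        sketch_set_error(s, failure);
        s->status = SKETCH_FAILED;
        return false;
    }
    return true;
}
```

With this sketch, a failing sketch_step() followed by another failing
sketch_step() leaves the first message stored; the second call is refused
because the state is already SKETCH_FAILED.  The hard part Peter points
out (many threads all needing to report into that one state) is not
modelled here.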

