Re: [PATCH] migration: savevm_state_insert_handler: constant-time element insertion

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Juan Quintela <quintela@redhat.com>
To: Scott Cheloha <cheloha@linux.vnet.ibm.com>
Cc: qemu-devel@nongnu.org, "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH] migration: savevm_state_insert_handler: constant-time element insertion
Date: Sat, 19 Oct 2019 00:15:07 +0200	[thread overview]
Message-ID: <87pnit7las.fsf@trasno.org> (raw)
In-Reply-To: <20191017192508.nwa2a34hzen3xgnr@rascal.austin.ibm.com> (Scott Cheloha's message of "Thu, 17 Oct 2019 14:25:08 -0500")

Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
> On Thu, Oct 17, 2019 at 10:43:08AM +0200, Juan Quintela wrote:
>> Scott Cheloha <cheloha@linux.vnet.ibm.com> wrote:
>> 
>> > Registering a SaveStateEntry object via savevm_state_insert_handler()
>> > is an O(n) operation because the list is a priority queue maintained by
>> > walking the list from head to tail to find a suitable insertion point.
>> >
>> > This adds considerable overhead for VMs with many such objects.  For
>> > instance, ppc64 machines with large maxmem (8T+) spend ~10% or more of
>> > their CPU time in savevm_state_insert_handler() before attempting to
>> > boot a kernel.

> I was trying to avoid churning the file more than absolutely
> necessary.  There are 18 QTAILQ_FOREACH() loops in savevm.c right now.
> Making ~15 of them double-loops doesn't make the code easier to read.

Change thecode to be something different, I agree that is more churn,
but ...

>
> I think incurring slight complexity on insertion/removal to make
> insertion fast is well worth the conceptual simplicity of addressing
> one big list of elements for every other operation.
>
>> savevm_state_handler_insert() for instance becomes even easier, just a
>> QTALIQ_INSERT_TAIL() in the proper queue, right?
>
> Yes, insertion becomes extremely obvious: you just append the element
> to the tail of its priority queue, which must already exist.
>
> But see above for the cost.
>
>> I agree with the idea of the patch.  Especially when you told us how bad
>> the performance of the current code is.
>> 
>> Out of curiosity, how many objects are we talking about?
>
> At maxmem=8T I'm seeing about 40000 elements in that list.  At
> maxmem=64T I'm seeing around 262000.  The vast majority of these
> elements are "spapr_drc" objects, each of which (IIRC) corresponds to
> a 256MB chunk of address space.

We are having trouble because we have too many objects.  So, the right
approach IMHO is just to add the list of queueue.  Looking into the
functions:

static int calculate_new_instance_id(const char *idstr)
static int calculate_compat_instance_id(const char *idstr)
   * We can call QTAILQ_FOREACH in the propper subqueue

static void savevm_state_handler_insert(SaveStateEntry *nse)
   * We don't need the call if we do propper subqueues array

void unregister_savevm(DeviceState *dev, const char *idstr, void
   *opaque)
   * We can use the propper subqueue

vmstate_unregister
   * We can use the propper subqueue

bool qemu_savevm_state_blocked(Error **errp)
  * We need to loop over all queues

void qemu_savevm_state_setup(QEMUFile *f)
int qemu_savevm_state_resume_prepare(MigrationState *s)
int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
void qemu_savevm_state_complete_postcopy(QEMUFile *f)
int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool
  in_postcopy)
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
void qemu_savevm_state_cleanup(void)
int qemu_save_device_state(QEMUFile *f)
static int qemu_loadvm_state_setup(QEMUFile *f)
void qemu_loadvm_state_cleanup(void)
 * Loop over all queues

static SaveStateEntry *find_se(const char *idstr, int instance_id)
qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
* we know the propper queue


But basically all the ones that we need to loop over all queues don't
have local state, so we can create a

loop_over_all_handlers() function that takes a callback and does all the
work.  They don't share states between iterations.

What do you think?
My problem with your appreach is that it makes insertion/removal more
expensive, that is where you are showing performance problems.  In the
places where we need to loop over all queues, we need to do it over all
elements anyways, so the performance difference is going to be
negigible.

Once told that, having 40000 elements on that queue, it will make
"downtime" for migration "interesting", to say the least, no?  How much
size are we talking about?  Should we consider moving it to a live
section?


Later, Juan.

     prev parent reply	other threads:[~2019-10-18 22:16 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-16 16:41 [PATCH] migration: savevm_state_insert_handler: constant-time element insertion Scott Cheloha
2019-10-17  8:43 ` Juan Quintela
2019-10-17 19:25   ` Scott Cheloha
2019-10-18 22:15     ` Juan Quintela [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87pnit7las.fsf@trasno.org \
    --to=quintela@redhat.com \
    --cc=cheloha@linux.vnet.ibm.com \
    --cc=dgilbert@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.