qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Aaron Lindsay via <qemu-devel@nongnu.org>
To: "Alex Bennée" <alex.bennee@linaro.org>
Cc: qemu-devel@nongnu.org, "Emilio G. Cota" <cota@braap.org>,
	Richard Henderson <richard.henderson@linaro.org>
Subject: Re: Plugin Memory Callback Debugging
Date: Fri, 18 Nov 2022 16:58:56 -0500	[thread overview]
Message-ID: <Y3gAIEythPY+z0q0@strawberry.localdomain> (raw)
In-Reply-To: <878rkbalba.fsf@linaro.org>

On Nov 15 22:36, Alex Bennée wrote:
> 
> Aaron Lindsay <aaron@os.amperecomputing.com> writes:
> 
> > Hello,
> >
> > I have been wrestling with what might be a bug in the plugin memory
> > callbacks. The immediate error is that I hit the
> > `g_assert_not_reached()` in the 'default:' case in
> > qemu_plugin_vcpu_mem_cb, indicating the callback type was invalid. When
> > breaking on this assertion in gdb, the contents of cpu->plugin_mem_cbs
> > are obviously bogus (`len` was absurdly high, for example).  After doing
> > some further digging/instrumenting, I eventually found that
> > `free_dyn_cb_arr(void *p, ...)` is being called shortly before the
> > assertion is hit with `p` pointing to the same address as
> > `cpu->plugin_mem_cbs` will later hold at assertion-time. We are freeing
> > the memory still pointed to by `cpu->plugin_mem_cbs`.
> >
> > I believe the code *should* always reset `cpu->plugin_mem_cbs` to NULL at the
> > end of an instruction/TB's execution, so its not exactly clear to me how this
> > is occurring. However, I suspect it may be relevant that we are calling
> > `free_dyn_cb_arr()` because my plugin called `qemu_plugin_reset()`.
> 
> Hmm I'm going to have to remind myself about how this bit works.
> 
> >
> > I have additionally found that the below addition allows me to run successfully
> > without hitting the assert:
> >
> > diff --git a/plugins/core.c b/plugins/core.c
> > --- a/plugins/core.c
> > +++ b/plugins/core.c
> > @@ -427,9 +427,14 @@ static bool free_dyn_cb_arr(void *p, uint32_t h, void *userp)
> >
> >  void qemu_plugin_flush_cb(void)
> >  {
> > +    CPUState *cpu;
> >      qht_iter_remove(&plugin.dyn_cb_arr_ht, free_dyn_cb_arr, NULL);
> >      qht_reset(&plugin.dyn_cb_arr_ht);
> >
> > +    CPU_FOREACH(cpu) {
> > +        cpu->plugin_mem_cbs = NULL;
> > +    }
> > +
> 
> This is essentially qemu_plugin_disable_mem_helpers() but for all CPUs.
> I think we should be able to treat the CPUs separately.

I agree it's similar to qemu_plugin_disable_mem_helpers(), but for all
CPUs. Though a perhaps important distinction is that its occurring
unconditionally in conjunction with the event which flushes the TBs and
frees the callback arrays.

Isn't the code calling into qemu_plugin_flush_cb() already clearing TBs
for all CPUs? Can you help me understand why treating the CPUs
separately would be better?

> >      plugin_cb__simple(QEMU_PLUGIN_EV_FLUSH);
> >  }
> >
> > Unfortunately, the workload/setup I have encountered this bug with are
> > difficult to reproduce in a way suitable for sharing upstream (admittedly
> > potentially because I do not fully understand the conditions necessary to
> > trigger it). It is also deep into a run
> 
> How many full TB flushes have there been? You only see
> qemu_plugin_flush_cb when we flush whole translation buffer (which is
> something we do more often when plugins exit).

There have been maybe hundreds of TB flushes at this point (I, erm, use
qemu_plugin_reset() somewhat liberally in this plugin). I believe it is
the most recent such flush that is problematic - I observe the call to
free_dyn_cb_arr() mentioned above as a result of it.

> Does lowering tb-size make it easier to hit the failure mode?

Hrm, interesting, I have not tried that. I'll poke at that if the rr
debug doesn't pan out.

> > , and I haven't found a good way
> > to break in gdb immediately prior to it happening in order to inspect
> > it, without perturbing it enough such that it doesn't happen...
> 
> This is exactly the sort of thing rr is great for. Can you trigger it in
> that?
> 
>   https://rr-project.org/

I had not used rr before, thanks for the push to do so!

I have, so far, discovered the following timeline:
1. My plugin receives a instruction execution callback for a load
   instruction. At this time, cpu->plugin_mem_cbs points to the same
   memory which will later be freed
2. During the handling of this callback, my plugin calls qemu_plugin_reset()
3. Ostensibly something goes wrong here with the cleanup of
   cpu->plugin_mem_cbs??? 
4. Step 2 triggers the TBs to be flushed, which frees the memory pointed
   to by cpu->plugin_mem_cbs 

Since I have this nicely recorded now with rr, it ought to be easier to
poke at, though I admit I'm not entirely sure how to poke at the
generated code to see what's going wrong (i.e. why wouldn't the tb exit
stuff be clearing this pointer like normal?).

> > I welcome any feedback or insights on how to further nail down the
> > failure case and/or help in working towards an appropriate solution.

-Aaron


  reply	other threads:[~2022-11-18 22:04 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-15 22:05 Plugin Memory Callback Debugging Aaron Lindsay
2022-11-15 22:36 ` Alex Bennée
2022-11-18 21:58   ` Aaron Lindsay via [this message]
2022-11-18 22:02     ` Aaron Lindsay
2022-11-21 22:02       ` Alex Bennée
2022-11-22 17:05         ` Aaron Lindsay via
2022-11-21 20:18   ` Aaron Lindsay via
2022-11-21 21:51     ` Alex Bennée
2022-11-22  2:22       ` Richard Henderson
2022-11-22 15:57         ` Aaron Lindsay via
2022-11-29 20:37           ` Aaron Lindsay via
2022-12-01 19:32             ` Alex Bennée
2022-12-18  5:24             ` Emilio Cota
2022-12-19 20:11               ` Aaron Lindsay
2023-01-06 10:30                 ` Alex Bennée
2023-01-07  3:07                   ` Emilio Cota
2022-11-16  6:19 ` Emilio Cota

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y3gAIEythPY+z0q0@strawberry.localdomain \
    --to=qemu-devel@nongnu.org \
    --cc=aaron@os.amperecomputing.com \
    --cc=alex.bennee@linaro.org \
    --cc=cota@braap.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).