qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* stable-8.1.1: which bug do we keep?
@ 2023-09-20  4:46 Michael Tokarev
  2023-09-20  9:17 ` Daniel P. Berrangé
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Tokarev @ 2023-09-20  4:46 UTC (permalink / raw)
  To: QEMU Developers, Richard Henderson, Alex Bennée
  Cc: Paul Menzel, Philippe Mathieu-Daudé

Hi!

I'm in somewhat doubt what to do with 8.1.1 release.

There are 2 compelling issues, fixing one discovers the other.

https://gitlab.com/qemu-project/qemu/-/issues/1864
"x86 VM with TCG and SMP fails to start on 8.1.0"
is fixed by 0d58c660689f "softmmu: Use async_run_on_cpu in tcg_commit"

But this brings up

https://gitlab.com/qemu-project/qemu/-/issues/1866
"mips/mip64 virtio broken on master (and 8.1.0 with tcg fix)"
(which is actually more than mips, as I've shown down the line,
https://gitlab.com/qemu-project/qemu/-/issues/1866#note_1558221926 )

Also, one commit alone,
86e4f93d827 "softmmu: Assert data in bounds in iotlb_to_section",
when not followed with "async_run_on_cpu in tcg_commit", causes
assertion failures, eg
https://www.mail-archive.com/qemu-devel@nongnu.org/msg989846.html
I don't know if "async_run_on_cpu in tcg_commit" was supposed to
fix this assertion or not, or maybe some additional fix is needed, -
but I haven't see this is triggered with 0d58c660689f applied.

There were at least two attempts by Richard to fix issues after
0d58c660689f, one "accel/tcg: Always require can_do_io", which fixes
both reproducers for #1866 but at a high cost, and another,
"softmmu: Introduce cpu_address_space_sync", which addresses the
mips regression but does not fix my reproducer with ovmf
and none of the 2 landed on master so far.

Right now I have a "which bug to keep?" situation for 8.1.1, and
I'd love to have at least *some* comments about all this.  I've got
no replies to my earlier emails in this area.

To mee, it *feels* like 0d58c660689f should be there.
Note: the scheduled deadline for staging-8.1.1 is gone yesterday.
But this stuff seems to be important enough to delay 8.1.1 further.

Just some comments please? :)

Thank you!

/mjt


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stable-8.1.1: which bug do we keep?
  2023-09-20  4:46 stable-8.1.1: which bug do we keep? Michael Tokarev
@ 2023-09-20  9:17 ` Daniel P. Berrangé
  2023-09-20 20:45   ` Michael Tokarev
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel P. Berrangé @ 2023-09-20  9:17 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: QEMU Developers, Richard Henderson, Alex Bennée, Paul Menzel,
	Philippe Mathieu-Daudé

On Wed, Sep 20, 2023 at 07:46:36AM +0300, Michael Tokarev wrote:
> Hi!
> 
> I'm in somewhat doubt what to do with 8.1.1 release.
> 
> There are 2 compelling issues, fixing one discovers the other.
> 
> https://gitlab.com/qemu-project/qemu/-/issues/1864
> "x86 VM with TCG and SMP fails to start on 8.1.0"
> is fixed by 0d58c660689f "softmmu: Use async_run_on_cpu in tcg_commit"
> 
> But this brings up
> 
> https://gitlab.com/qemu-project/qemu/-/issues/1866
> "mips/mip64 virtio broken on master (and 8.1.0 with tcg fix)"
> (which is actually more than mips, as I've shown down the line,
> https://gitlab.com/qemu-project/qemu/-/issues/1866#note_1558221926 )
> 
> Also, one commit alone,
> 86e4f93d827 "softmmu: Assert data in bounds in iotlb_to_section",
> when not followed with "async_run_on_cpu in tcg_commit", causes
> assertion failures, eg
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg989846.html
> I don't know if "async_run_on_cpu in tcg_commit" was supposed to
> fix this assertion or not, or maybe some additional fix is needed, -
> but I haven't see this is triggered with 0d58c660689f applied.
> 
> There were at least two attempts by Richard to fix issues after
> 0d58c660689f, one "accel/tcg: Always require can_do_io", which fixes
> both reproducers for #1866 but at a high cost, and another,
> "softmmu: Introduce cpu_address_space_sync", which addresses the
> mips regression but does not fix my reproducer with ovmf
> and none of the 2 landed on master so far.

In the cover letter for the 2nd proposed series Richard says

[quote]
I've done a tiny bit of performance testing between the two
solutions and it seems to be a wash.  So now it's simply a
matter of cleanliness.
[/quote]

Since the 2nd series is shown to still be broken in some cases
and 1st is thought to solve them all, IMHO it feels like we
should just press ahead with applying the the 1st series to
git master, and then stable.

If we still want a cleaner solution, it can be reverted/replaced
later once someone figures out an option that addresses all the
problems. We shouldn't leave such a big regression in TCG unfixed
for so long while we figure out a cleaner option.

> Right now I have a "which bug to keep?" situation for 8.1.1, and
> I'd love to have at least *some* comments about all this.  I've got
> no replies to my earlier emails in this area.
> 
> To mee, it *feels* like 0d58c660689f should be there.
> Note: the scheduled deadline for staging-8.1.1 is gone yesterday.
> But this stuff seems to be important enough to delay 8.1.1 further.

On the one hand breaking x86 is a big deal because it is a mainstream
architecture, on the other hand people have real x86 hardware, so
using TCG emulation for x86 is less compelling. I agree we need to
fully address this in 8.1.1.

I guess the other unmentioned option is to revert whatever TCG changes
went into 8.1 that caused the regression in the first place. I've no
idea if that is at all practical though.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stable-8.1.1: which bug do we keep?
  2023-09-20  9:17 ` Daniel P. Berrangé
@ 2023-09-20 20:45   ` Michael Tokarev
  2023-09-28 17:24     ` Richard Henderson
  0 siblings, 1 reply; 5+ messages in thread
From: Michael Tokarev @ 2023-09-20 20:45 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: QEMU Developers, Richard Henderson, Alex Bennée, Paul Menzel,
	Philippe Mathieu-Daudé

20.09.2023 12:17, Daniel P. Berrangé wrote:
> On Wed, Sep 20, 2023 at 07:46:36AM +0300, Michael Tokarev wrote:
>> Hi!
>>
>> I'm in somewhat doubt what to do with 8.1.1 release.
>>
>> There are 2 compelling issues, fixing one discovers the other.
>>
>> https://gitlab.com/qemu-project/qemu/-/issues/1864
>> "x86 VM with TCG and SMP fails to start on 8.1.0"
>> is fixed by 0d58c660689f "softmmu: Use async_run_on_cpu in tcg_commit"
>>
>> But this brings up
>>
>> https://gitlab.com/qemu-project/qemu/-/issues/1866
>> "mips/mip64 virtio broken on master (and 8.1.0 with tcg fix)"
>> (which is actually more than mips, as I've shown down the line,
>> https://gitlab.com/qemu-project/qemu/-/issues/1866#note_1558221926 )
...

> In the cover letter for the 2nd proposed series Richard says
> 
> [quote]
> I've done a tiny bit of performance testing between the two
> solutions and it seems to be a wash.  So now it's simply a
> matter of cleanliness.
> [/quote]
> 
> Since the 2nd series is shown to still be broken in some cases
> and 1st is thought to solve them all, IMHO it feels like we
> should just press ahead with applying the the 1st series to
> git master, and then stable.
> 
> If we still want a cleaner solution, it can be reverted/replaced
> later once someone figures out an option that addresses all the
> problems. We shouldn't leave such a big regression in TCG unfixed
> for so long while we figure out a cleaner option.

Daniel, you have a very good point here.

I just collected the first version of Richard's fixes (with Phil's
changes and tags), added them to qemu debian package and pushed that
one out, - debian has much wider CI than qemu has, hopefully it will
clear things out.

Also I pushed them to staging-8.1 branch for qemu ci run.  This obviously
should not go to current stable-8.1 since these fixes aren't in master.

The only thing I regret is that his simple thing didn't occur to me
much earlier (and actually didn't occur to me at all).

Let's see..

>> To mee, it *feels* like 0d58c660689f should be there.
>> Note: the scheduled deadline for staging-8.1.1 is gone yesterday.
>> But this stuff seems to be important enough to delay 8.1.1 further.
> 
> On the one hand breaking x86 is a big deal because it is a mainstream
> architecture, on the other hand people have real x86 hardware, so
> using TCG emulation for x86 is less compelling. I agree we need to
> fully address this in 8.1.1.

As it turns out, quite a lot of various CI stuff uses qemu in tcg
mode behind the scenes.

> I guess the other unmentioned option is to revert whatever TCG changes
> went into 8.1 that caused the regression in the first place. I've no
> idea if that is at all practical though.

This does not seem to be practical.  I did find commit which broke (some)
things, but it isn't easy to revert it now.  IIRC anyway.

Thank you for the excellent hint!

/mjt


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stable-8.1.1: which bug do we keep?
  2023-09-20 20:45   ` Michael Tokarev
@ 2023-09-28 17:24     ` Richard Henderson
  2023-09-28 17:31       ` Michael Tokarev
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Henderson @ 2023-09-28 17:24 UTC (permalink / raw)
  To: Michael Tokarev, Daniel P. Berrangé
  Cc: QEMU Developers, Alex Bennée, Paul Menzel,
	Philippe Mathieu-Daudé

On 9/20/23 16:45, Michael Tokarev wrote:
> 20.09.2023 12:17, Daniel P. Berrangé wrote:
>> Since the 2nd series is shown to still be broken in some cases
>> and 1st is thought to solve them all, IMHO it feels like we
>> should just press ahead with applying the the 1st series to
>> git master, and then stable.
>>
>> If we still want a cleaner solution, it can be reverted/replaced
>> later once someone figures out an option that addresses all the
>> problems. We shouldn't leave such a big regression in TCG unfixed
>> for so long while we figure out a cleaner option.
> 
> Daniel, you have a very good point here.
> 
> I just collected the first version of Richard's fixes (with Phil's
> changes and tags), added them to qemu debian package and pushed that
> one out, - debian has much wider CI than qemu has, hopefully it will
> clear things out.

Thanks for that.  Any surprises?

The timing of all this has been poor, coming just before I've been away for a few weeks. 
I plan to get the can_do_io patch set upstreamed soon.


r~


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: stable-8.1.1: which bug do we keep?
  2023-09-28 17:24     ` Richard Henderson
@ 2023-09-28 17:31       ` Michael Tokarev
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Tokarev @ 2023-09-28 17:31 UTC (permalink / raw)
  To: Richard Henderson, Daniel P. Berrangé
  Cc: QEMU Developers, Alex Bennée, Paul Menzel,
	Philippe Mathieu-Daudé

28.09.2023 20:24, Richard Henderson:
> On 9/20/23 16:45, Michael Tokarev wrote:
>> I just collected the first version of Richard's fixes (with Phil's
>> changes and tags), added them to qemu debian package and pushed that
>> one out, - debian has much wider CI than qemu has, hopefully it will
>> clear things out.
> 
> Thanks for that.  Any surprises?

Hi Richard, I'm very glad to hear from you, I really am! :)

Nope, there was no single surprise so far, it is more, it all
works now quite well, all long-standing issues which we had
are now gone.

We have new discoveries, but that's just that - new discoveries,
like stuff which was broken the same way in 6.0 as in current
master.  Or new tests which no one run before, discovering new
issues in new code.  But overall, this part works fairy well.

> The timing of all this has been poor, coming just before I've been away for a few weeks. I plan to get the can_do_io patch set upstreamed soon.

Yeah, I guess everyone noticed your absence :)
You did great job at fixing this stuff, it waited for the
good refresh for quite some time.

Thank you!

/mjt


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-09-28 17:32 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-20  4:46 stable-8.1.1: which bug do we keep? Michael Tokarev
2023-09-20  9:17 ` Daniel P. Berrangé
2023-09-20 20:45   ` Michael Tokarev
2023-09-28 17:24     ` Richard Henderson
2023-09-28 17:31       ` Michael Tokarev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).