Re: Display update issue on M1 Macs

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Akihiko Odaki <akihiko.odaki@gmail.com>
To: BALATON Zoltan <balaton@eik.bme.hu>
Cc: Peter Maydell <peter.maydell@linaro.org>,
	qemu-devel@nongnu.org, Gerd Hoffmann <kraxel@redhat.com>,
	Joelle van Dyne <j@getutm.app>
Subject: Re: Display update issue on M1 Macs
Date: Sat, 4 Feb 2023 14:19:52 +0900	[thread overview]
Message-ID: <4550ebac-9ff5-038d-8675-bee9bde1a96f@gmail.com> (raw)
In-Reply-To: <7b61f595-964b-5113-dbf8-1e4a167c4954@eik.bme.hu>

On 2023/02/03 22:45, BALATON Zoltan wrote:
> On Fri, 3 Feb 2023, Akihiko Odaki wrote:
>> On 2023/02/02 19:51, BALATON Zoltan wrote:
>>> On Tue, 31 Jan 2023, BALATON Zoltan wrote:
>>>> On Tue, 31 Jan 2023, Akihiko Odaki wrote:
>>> [...]
>>> To summarise previous discussion:
>>>
>>> - There's a problem on Apple M1 Macs with sm501 and ati-vga 2d accel 
>>> functions drawing from device model into the video memory of the 
>>> emulated card which is not shown on screen when the display update 
>>> callback is called from another thread. This works on x86_64 host so 
>>> I suspect it may be related to missing memory synchronisation that 
>>> ARM may need.
>>>
>>> - This can be reproduced running AmigaOS4 on sam460ex or MorphOS 
>>> (demo iso downliadable from their web site) on sam460ex, pegasos2 or 
>>> mac99,via=pmu with -device ati-vga,romfile="" as described here: 
>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/
>>>
>>> - I can't test it myself lacking hardware so I have to rely on 
>>> reports from people who have this hardware so there may be some 
>>> uncertainity in the info I get.
>>>
>>> - We have confirmed it's not related to a known race condition as 
>>> disabling dirty tracking and always doing full updates of whole 
>>> screen did not fix it:
>>>
>>>>>>>>>>> But there is an exception: 
>>>>>>>>>>> memory_region_snapshot_and_clear_dirty() releases iothread 
>>>>>>>>>>> lock, and that broke raspi3b display device:
>>>>>>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>>>>>>
>>>>>>>>>>> It is unexpected that gfx_update() callback releases iothread 
>>>>>>>>>>> lock so it may break things in peculiar ways.
>>>>>>>>>>>
>>>>>>>>>>> Peter, is there any change in the situation regarding the 
>>>>>>>>>>> race introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>>>>>>
>>>>>>>>>>> For now, to workaround the issue, I think you can create 
>>>>>>>>>>> another mutex and make the entire sm501_2d_engine_write() and 
>>>>>>>>>>> sm501_update_display() critical sections.
>>>>>>>>>>
>>>>>>>>>> Interesting thread but not sure it's the same problem so this 
>>>>>>>>>> workaround may not be enough to fix my issue. Here's a video 
>>>>>>>>>> posted by one of the people who reported it showing the 
>>>>>>>>>> problem on M1 Mac:
>>>>>>>>>>
>>>>>>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>>>>>>
>>>>>>>>>> and here's how it looks like on other machines:
>>>>>>>>>>
>>>>>>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>>>>>>
>>>>>>>>>> There are also videos showing it running on RPi 4 and G5 Mac 
>>>>>>>>>> without this issue so it seems to only happen on Apple Silicon 
>>>>>>>>>> M1 Macs. What's strange is that graphics elements are not just 
>>>>>>>>>> delayed which I think should happen with missing thread 
>>>>>>>>>> synchronisation where the update callback would miss some 
>>>>>>>>>> pixels rendered during it's running but subsequent update 
>>>>>>>>>> callbacks would eventually draw those, woudn't they? Also 
>>>>>>>>>> setting full_update to 1 in sm501_update_display() callback to 
>>>>>>>>>> disable dirty tracking does not fix the problem. So it looks 
>>>>>>>>>> like as if sm501_2d_operation() running on one CPU core only 
>>>>>>>>>> writes data to the local cache of that core which 
>>>>>>>>>> sm501_update_display() running on other core can't see, so 
>>>>>>>>>> maybe some cache synchronisation is needed in 
>>>>>>>>>> memory_region_set_dirty() or if that's already there maybe I 
>>>>>>>>>> should call it for all changes not only those in the visible 
>>>>>>>>>> display area? I'm still not sure I understand the problem and 
>>>>>>>>>> don't know what could be a fix for it so anything to test to 
>>>>>>>>>> identify the issue better might also bring us closer to a 
>>>>>>>>>> solution.
>>>>>>>>>
>>>>>>>>> If you set full_update to 1, you may also comment out 
>>>>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex 
>>>>>>>>> being unlocked. The iothread mutex should ensure cache 
>>>>>>>>> coherency as well.
>>>>>>>>>
>>>>>>>>> But as you say, it's weird that the rendered result is not just 
>>>>>>>>> delayed but missed. That may imply other possibilities (e.g., 
>>>>>>>>> the results are overwritten by someone else). If the problem 
>>>>>>>>> persists after commenting out 
>>>>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>>>>>>> inter-thread coherency between sm501_2d_operation() and 
>>>>>>>>> sm501_update_display() is not causing the problem.
>>>>>>>>
>>>>>>>> I've asked people who reported and can reproduce it to test this 
>>>>>>>> but it did not change anything so confirmed it's not that race 
>>>>>>>> condition but looks more like some cache inconsistency maybe. 
>>>>>>>> Any other ideas?
>>>>>>>
>>>>>>> I can come up with two important differences between x86 and Arm 
>>>>>>> which can affect the execution of QEMU:
>>>>>>> 1. Memory model. Arm uses a memory model more relaxed than x86 so 
>>>>>>> it is more sensitive for synchronization failures among threads.
>>>>>>> 2. Different instructions. TCG uses JIT so differences in 
>>>>>>> instructions matter.
>>>>>>>
>>>>>>> We should be able to exclude 1) as a potential cause of the 
>>>>>>> problem. iothread mutex should take care of race condition and 
>>>>>>> even cache coherency problem; mutex includes memory barrier 
>>>>>>> functionality.
>>> [...]
>>>>>>> For difference 2), you may try to use TCI. You can find details 
>>>>>>> of TCI in tcg/tci/README.
>>>>>>
>>>>>> This was tested and also with TCI got the same results just much 
>>>>>> slower.
>>>>>>
>>>>>>> The common sense tells, however, the memory model is usually the 
>>>>>>> cause of the problem when you see behavioral differences between 
>>>>>>> x86 and Arm, and TCG should work fine with both of x86 and Arm as 
>>>>>>> they should have been tested well.
>>> [...]
>>>>> Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, 
>>>>> which makes it possible to compare x86 and Arm without concerning 
>>>>> the difference of the microarchitecture.
>>>>
>>>> We've tried that before and even running x86 QEMU on M1 with Rosetta 
>>>> 2 it was the same so it's probably not something about the code 
>>>> itself but how it's
>>>
>>> As this was odd I've asked to re-test this and now I'm told at least 
>>> QEMU 5.1 x86_64 build from emaculation.com is working with Rosetta on 
>>> M1 Mac so this suggests it may be a problem with memory sync but 
>>> still don't know where and what to try. We're now try newer X86_64 
>>> builds to see if it broke somewhere along the way.
>>>
>>> Anybody else with an M1 Mac wants to help testing? Can you reproduce 
>>> the same with UTM with MorphOS and ati-vga? Here's what I've got 
>>> showing the problem: https://www.youtube.com/watch?v=j5Ag5_Yq-Mk
>>>
>>> Regards,
>>> BALATON Zoltan
>>
>> Hi,
>>
>> I finally reproduced the issue with MorphOS and ati-vga and figured 
>> out its cause.
> 
> Great, thanks a lot. After establishing it works with x86 version we 
> were about to test with aarch64 QEMU 5.0 where sm501 did not yet use 
> pixman but ati-vga did so we could check if it's related to pixman as 
> previous test results with old version were all wrong it seems. But you 
> were faster.
> 
>> The problem is that pixman_blt() is disabled because its backend is 
>> written in GNU assembly, and GNU assembler is not available on macOS. 
>> There is no fallback written in C, unfortunately. The issue is tracked 
>> by the upstream at:
>> https://gitlab.freedesktop.org/pixman/pixman/-/issues/59
> 
> Hm, OK but that ticket is just about compile error and suggests to 
> disable it and does not say it won't work then. Are they aware this is a 
> problem? Maybe we should write to their mailing list after we're sure 
> what's happening.

That's a good idea. They may prioritize the issue if they realize that 
disables pixman_blt().

> 
>> I hit the same problem on Asahi Linux, which is based on Arch Linux 
>> ARM. It is because Arch Linux copied PKGBUILD from x86 Arch Linux, 
>> which disables Arm backends. It is easy to enable the backend for the 
>> platform so I proposed a change at:
>> https://github.com/archlinuxarm/PKGBUILDs/pull/1985
> 
> On macOS one source of pixman most people use is brew.sh where this 
> seems to be disabled:
> 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/pixman.rb
> 
> another source is macports which has an older version and no such options:
> 
> https://github.com/macports/macports-ports/blob/master/graphics/libpixman-devel/Portfile
> 
> I wonder if it compiles from macports on aarch64 then.

It's more likely that it is just outdated. It does not carry a patch to 
fix the issue.

> 
> I wait if I can get some more test results and try to check pixman but 
> its source is not too clear to me and there are no docs either so maybe 
> the best way is to ask on their list. If this is a pixman issue I hope 
> it can be fixed there and we don't need to implement a fallback in QEMU.

This is certainly a pixman issue.

If you read the source, you can see pixman_blt() calls 
_pixman_implementation_blt(). _pixman_implementation_blt() calls blt 
member of pixman_implementation_t in turn. Grepping for "blt =" tells it 
is only assigned in:
pixman/pixman-arm-neon.c
pixman/pixman-arm-simd.c
pixman/pixman-mips-dspr2.c
pixman/pixman-mmx.c
pixman/pixman-sse2.c

For AArch64, only pixman/pixman-arm-neon.c is relevant, and it needs to 
be disabled to build the library on macOS.

Regards,
Akihiko Odaki

> 
> Regards,
> BALATON Zoltan

     prev parent reply	other threads:[~2023-02-04  5:21 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-04 23:24 Display update issue on M1 Macs BALATON Zoltan
2023-01-13 13:43 ` BALATON Zoltan
2023-01-14  2:41   ` Akihiko Odaki
2023-01-14 18:11     ` BALATON Zoltan
2023-01-19 13:10       ` Akihiko Odaki
2023-01-22 23:28         ` BALATON Zoltan
2023-01-28  4:01           ` Akihiko Odaki
2023-01-30 23:58             ` BALATON Zoltan
2023-01-31  7:37               ` Akihiko Odaki
2023-01-31 14:15                 ` BALATON Zoltan
2023-02-02 10:51                   ` BALATON Zoltan
2023-02-03 10:16                     ` Akihiko Odaki
2023-02-03 13:45                       ` BALATON Zoltan
2023-02-04  5:19                         ` Akihiko Odaki [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4550ebac-9ff5-038d-8675-bee9bde1a96f@gmail.com \
    --to=akihiko.odaki@gmail.com \
    --cc=balaton@eik.bme.hu \
    --cc=j@getutm.app \
    --cc=kraxel@redhat.com \
    --cc=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).