Re: Display update issue on M1 Macs


From: Akihiko Odaki <akihiko.odaki@gmail.com>
To: BALATON Zoltan <balaton@eik.bme.hu>
Cc: Peter Maydell <peter.maydell@linaro.org>,
	qemu-devel@nongnu.org, Gerd Hoffmann <kraxel@redhat.com>,
	Joelle van Dyne <j@getutm.app>
Subject: Re: Display update issue on M1 Macs
Date: Tue, 31 Jan 2023 16:37:59 +0900
Message-ID: <08551d7d-c17e-7a35-3908-e2b8b3465366@gmail.com>
In-Reply-To: <b8403b65-7c55-20fb-1ee5-730e4eb9833c@eik.bme.hu>

On 2023/01/31 8:58, BALATON Zoltan wrote:
> On Sat, 28 Jan 2023, Akihiko Odaki wrote:
>> On 2023/01/23 8:28, BALATON Zoltan wrote:
>>> On Thu, 19 Jan 2023, Akihiko Odaki wrote:
>>>> On 2023/01/15 3:11, BALATON Zoltan wrote:
>>>>> On Sat, 14 Jan 2023, Akihiko Odaki wrote:
>>>>>> On 2023/01/13 22:43, BALATON Zoltan wrote:
>>>>>>> On Thu, 5 Jan 2023, BALATON Zoltan wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I got reports from several users trying to run AmigaOS4 on 
>>>>>>>> sam460ex on Apple silicon Macs that they get missing graphics 
>>>>>>>> that I can't reproduce on x86_64. With help from the users who 
>>>>>>>> get the problem we've narrowed it down to the following:
>>>>>>>>
>>>>>>>> It looks like data written to the sm501's ram in 
>>>>>>>> qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen 
>>>>>>>> from sm501_update_display() in the same file. The 
>>>>>>>> sm501_2d_operation() function is called when the guest accesses 
>>>>>>>> the emulated card so it may run in a different thread than 
>>>>>>>> sm501_update_display() which is called by the ui backend but I'm 
>>>>>>>> not sure how QEMU calls these. Is device code running in 
>>>>>>>> iothread and display update in main thread? The problem is also 
>>>>>>>> independent of the display backend and was reproduced with both 
>>>>>>>> -display cocoa and -display sdl.
>>>>>>>>
>>>>>>>> We have confirmed it's not the pixman routines that 
>>>>>>>> sm501_2d_operation() uses as the same issue is seen also with 
>>>>>>>> QEMU 4.x where pixman wasn't used and with all versions up to 
>>>>>>>> 7.2 so it's also not some bisectable change in QEMU. It also 
>>>>>>>> happens with --enable-debug so it doesn't seem to be related to 
>>>>>>>> optimisation either and I don't get it on x86_64 but even x86_64 
>>>>>>>> QEMU builds run on Apple M1 with Rosetta 2 show the problem. It 
>>>>>>>> also only seems to affect graphics written from 
>>>>>>>> sm501_2d_operation() which AmigaOS4 uses extensively but other 
>>>>>>>> OSes don't and just render graphics with the vcpu which work 
>>>>>>>> without problem also on the M1 Macs that show this problem with 
>>>>>>>> AmigaOS4. Theoretically this could be some missing 
>>>>>>>> synchronisation which is something ARM and PPC may need while x86 
>>>>>>>> doesn't but I don't know if this is really the reason and if so 
>>>>>>>> where and how to fix it. Any idea what may cause this and what 
>>>>>>>> could be a fix to try?
>>>>>>>
>>>>>>> Any idea anyone? At least some explanation if the above is 
>>>>>>> plausible or if there's an option to disable the iothread and run 
>>>>>>> everything in a single thread to verify the theory could help. 
>>>>>>> I've got reports from at least 3 people getting this problem but 
>>>>>>> I can't do much to fix it without some help.
>>>>>>>
>>>>>>>> (Info on how to run it is here:
>>>>>>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
>>>>>>>> but AmigaOS4 is not freely distributable so it's a bit hard to 
>>>>>>>> reproduce. Some Linux X servers that support sm501/sm502 may 
>>>>>>>> also use the card's 2d engine but I don't know about any live 
>>>>>>>> CDs that readily run on sam460ex.)
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> BALATON Zoltan
>>>>>>
>>>>>> Sorry, I missed the email.
>>>>>>
>>>>>> Indeed the ui backend should call sm501_update_display() in the 
>>>>>> main thread, which should be different from the thread calling 
>>>>>> sm501_2d_operation(). However, if I understand it correctly, both 
>>>>>> of the functions should be called with iothread lock held so there 
>>>>>> should be no race condition in theory.
>>>>>>
>>>>>> But there is an exception: 
>>>>>> memory_region_snapshot_and_clear_dirty() releases iothread lock, 
>>>>>> and that broke raspi3b display device:
>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>
>>>>>> It is unexpected that gfx_update() callback releases iothread lock 
>>>>>> so it may break things in peculiar ways.
>>>>>>
>>>>>> Peter, is there any change in the situation regarding the race 
>>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>
>>>>>> For now, to work around the issue, I think you can create another 
>>>>>> mutex and make the entire sm501_2d_engine_write() and 
>>>>>> sm501_update_display() critical sections.
>>>>>
>>>>> Interesting thread but not sure it's the same problem so this 
>>>>> workaround may not be enough to fix my issue. Here's a video posted 
>>>>> by one of the people who reported it showing the problem on M1 Mac:
>>>>>
>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>
>>>>> and here's how it looks on other machines:
>>>>>
>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>
>>>>> There are also videos showing it running on RPi 4 and G5 Mac 
>>>>> without this issue so it seems to only happen on Apple Silicon M1 
>>>>> Macs. What's strange is that graphics elements are not just delayed 
>>>>> which I think should happen with missing thread synchronisation 
>>>>> where the update callback would miss some pixels rendered during 
>>>>> its run but subsequent update callbacks would eventually draw 
>>>>> those, wouldn't they? Also setting full_update to 1 in 
>>>>> sm501_update_display() callback to disable dirty tracking does not 
>>>>> fix the problem. So it looks as if sm501_2d_operation() 
>>>>> running on one CPU core only writes data to the local cache of that 
>>>>> core which sm501_update_display() running on other core can't see, 
>>>>> so maybe some cache synchronisation is needed in 
>>>>> memory_region_set_dirty() or if that's already there maybe I should 
>>>>> call it for all changes not only those in the visible display area? 
>>>>> I'm still not sure I understand the problem and don't know what 
>>>>> could be a fix for it so anything to test to identify the issue 
>>>>> better might also bring us closer to a solution.
>>>>>
>>>>> Regards,
>>>>> BALATON Zoltan
>>>>
>>>> If you set full_update to 1, you may also comment out 
>>>> memory_region_snapshot_and_clear_dirty() and 
>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>>>> unlocked. The iothread mutex should ensure cache coherency as well.
>>>>
>>>> But as you say, it's weird that the rendered result is not just 
>>>> delayed but missed. That may imply other possibilities (e.g., the 
>>>> results are overwritten by someone else). If the problem persists 
>>>> after commenting out memory_region_snapshot_and_clear_dirty() and 
>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>> inter-thread coherency between sm501_2d_operation() and 
>>>> sm501_update_display() is not causing the problem.
>>>
>>> I've asked people who reported and can reproduce it to test this but 
>>> it did not change anything so confirmed it's not that race condition 
>>> but looks more like some cache inconsistency maybe. Any other ideas?
>>>
>>> Regards,
>>> BALATON Zoltan
>>
>> I can come up with two important differences between x86 and Arm which 
>> can affect the execution of QEMU:
>> 1. Memory model. Arm uses a memory model more relaxed than x86 so it 
>> is more sensitive to synchronization failures among threads.
>> 2. Different instructions. TCG uses JIT so differences in instructions 
>> matter.
>>
>> We should be able to exclude 1) as a potential cause of the problem. 
>> iothread mutex should take care of race condition and even cache 
>> coherency problem; mutex includes memory barrier functionality.
> 
> Where is this barrier in QEMU code? Does this also ensure cache 
> coherency between different cores or only memory sync in one core? From 
> the testing I suspect it's probably not because of the weak ordering of 
> ARM but something to do with different threads writing and reading the 
> memory area. Is there a way to disable separate vcpu thread and run 
> everything in a single thread to verify this theory? (We only have one 
> vcpu so it's not an MTTCG issue but something between the vcpu and main 
> thread maybe.)

QEMU uses pthread_mutex on macOS, and pthread_mutex (or any sane mutex 
implementation for SMP systems) should also ensure memory 
synchronization across different cores.
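
As a minimal sketch of that guarantee (not QEMU code): the names fb, 
big_lock, writer_thread() and reader_count() below are hypothetical 
stand-ins for the sm501 video RAM, the iothread lock, 
sm501_2d_operation() and sm501_update_display(). POSIX requires 
pthread_mutex_lock() and pthread_mutex_unlock() to synchronize memory, 
so stores made before the unlock are visible to a thread that later 
takes the same lock, whichever core it runs on.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define FB_SIZE 64

/* Hypothetical stand-ins for the device's video RAM and the big lock. */
static uint8_t fb[FB_SIZE];
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Plays the role of the 2D engine: fill the buffer while holding the lock. */
static void *writer_thread(void *arg)
{
    uint8_t value = *(uint8_t *)arg;

    pthread_mutex_lock(&big_lock);
    memset(fb, value, sizeof(fb));
    pthread_mutex_unlock(&big_lock);   /* unlock publishes the stores above */
    return NULL;
}

/* Plays the role of the display update: count bytes equal to value. */
static int reader_count(uint8_t value)
{
    int n = 0;

    pthread_mutex_lock(&big_lock);     /* lock makes earlier stores visible */
    for (int i = 0; i < FB_SIZE; i++) {
        n += (fb[i] == value);
    }
    pthread_mutex_unlock(&big_lock);
    return n;
}

/*
 * Write from a second thread, then read from the calling thread.
 * pthread_join() is used only to make the ordering deterministic here;
 * it is itself a synchronization point.
 */
static int run_demo(uint8_t value)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, writer_thread, &value);

    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return reader_count(value);
}
```

Calling run_demo(0xAB) should return FB_SIZE on any conforming 
implementation, Arm or x86. If the real code paths behave differently, 
some access is presumably happening outside the lock.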

That said, it is still possible that we miss something that prevents 
memory synchronization. Ideally the theory should be confirmed by 
experiments, but it is not easy with Mac.

The easiest option is to run QEMU/sam460ex on Linux on QEMU/hvf. Running 
the entire Linux system without the -smp option may be too slow, so you 
may use the taskset command on Linux to pin the QEMU/sam460ex process to 
a particular vCPU. This is somewhat incomplete, as virtualization 
interferes with caches and may hide problems or trigger other bugs. The 
difference between the operating systems is also a concern.

Another option is to use the taskset command on Asahi Linux. Installing 
Asahi Linux is easy, but uninstalling it is a bit complicated.

The m1n1 hypervisor from the Asahi Linux project allows restricting 
which CPUs are used, and I think it also allows changing the memory 
model to x86 TSO. Unlike QEMU/hvf on macOS, it is very minimalistic, so 
its interference with e.g. caches is limited. It is very useful for 
debugging XNU or Linux, but hard to set up and requires another 
computer to control it.

Finally, you can patch the XNU kernel, but this is obviously not easy.

> 
>> For difference 2), you may try to use TCI. You can find details of TCI 
>> in tcg/tci/README.
> 
> This was tested and also with TCI got the same results just much slower.
> 
>> Common sense says, however, the memory model is usually the cause 
>> of the problem when you see behavioral differences between x86 and 
>> Arm, and TCG should work fine with both of x86 and Arm as they should 
>> have been tested well.
> 
> It's not only between x86 and ARM but also between different ARM CPUs it 
> seems as there are videos of this test case running on Raspberry Pi 4 
> but all QEMU versions failed on Apple M1 so maybe it's something 
> specific to that CPU.

It is likely that the combination of Apple's microarchitecture and the 
Arm instruction set causes the problem. For example, even though the 
memory model in Arm is weaker than x86, such a difference may not 
surface depending on the design of the load/store unit or the size of 
the load/store buffers.
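
To illustrate what a weaker memory model asks of software, here is a 
small message-passing sketch in C11 (illustrative only, not taken from 
QEMU; payload, ready, producer() and consume() are made-up names). On 
x86 the stores to payload and ready become visible in program order 
even without the release/acquire annotations, as long as the compiler 
does not reorder them. On Arm the hardware may reorder them, and 
whether it does so in practice depends on the microarchitecture.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;          /* plain data, like pixels in video RAM */
static atomic_int ready;     /* flag that publishes the payload */

static void *producer(void *arg)
{
    payload = *(int *)arg;
    /* Release: the payload store cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static int consume(void)
{
    /* Acquire: the payload load cannot be reordered before this load. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {
        /* spin until the producer publishes */
    }
    return payload;
}

/* Publish value from a second thread and read it back on this one. */
static int message_passing_demo(int value)
{
    pthread_t t;
    int rc;
    int got;

    atomic_store(&ready, 0);
    rc = pthread_create(&t, NULL, producer, &value);
    assert(rc == 0);
    (void)rc;
    got = consume();
    pthread_join(t, NULL);
    return got;
}
```

With the release/acquire pair in place, message_passing_demo(42) 
returns 42 on both architectures. Code that omits such ordering can 
work by accident on one CPU design and fail on another.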

Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, 
which makes it possible to compare x86 and Arm on the same 
microarchitecture.

Regards,
Akihiko Odaki

> 
> Regards,
> BALATON Zoltan
