From: "Alex Bennée" <alex.bennee@linaro.org>
To: qemu-devel@nongnu.org
Cc: "John Snow" <jsnow@redhat.com>,
"Eduardo Habkost" <eduardo@habkost.net>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Wainer dos Santos Moschetta" <wainersm@redhat.com>,
"Cleber Rosa" <crosa@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Beraldo Leal" <bleal@redhat.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
"Pavel Dovgalyuk" <pavel.dovgaluk@ispras.ru>
Subject: Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green
Date: Wed, 13 Dec 2023 10:57:39 +0000 [thread overview]
Message-ID: <87msue9xp8.fsf@draig.linaro.org> (raw)
In-Reply-To: <20231211091346.14616-1-alex.bennee@linaro.org> ("Alex Bennée"'s message of "Mon, 11 Dec 2023 09:13:29 +0000")
Alex Bennée <alex.bennee@linaro.org> writes:
> As I'm a glutton for punishment I thought I'd have a go at fixing the
> slowly growing number of record/replay bugs. The two fixes are:
>
> replay: stop us hanging in rr_wait_io_event
> chardev: force write all when recording replay logs
>
> I think we are beyond 8.2 material but it would be nice to get this
> functionality stable again. We have a growing number of bugs under the
> icount label on gitlab:
>
> https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount
>
> Changes
> -------
>
> v2
>
> Apart from addressing tidy ups and tags I've been investigating the
> failures in replay_linux.py which are the more exhaustive tests which
> boot the kernel and user-space. The "fix":
>
> replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
>
> triggers around the time of the hang in the logs and despite the
> rather hairy EXCP->INT transitions around cpu_exec_loop() I think
> points to a genuine problem. I added the tracing to cputlb to verify
> the page tables are the same and started detecting divergence between
> record and replay a lot earlier on that when the replay_sync_error()
> catches things. I see patterns like this:
>
> 1878 tlb_fill 0x4770c000/1 1 2 tlb_fill 0x4770c000/1 1 2
> 1879 tlb_fill 0x4770d000/1 1 2 tlb_fill 0x4770d000/1 1 2
> 1880 tlb_fill 0x59000/1 0 2 tlb_fill 0x59000/1 0 2
> 1881 > tlb_fill 0x476dd116/1 0 2
> 1882 tlb_fill 0x4770e000/1 1 2 tlb_fill 0x4770e000/1 1 2
> 1883 tlb_fill 0x476dd527/1 0 2 | tlb_fill 0x476dfb17/1 0 2
> 1884 > tlb_fill 0x476de0fd/1 0 2
> 1885 > tlb_fill 0x476dce2e/1 0 2
> 1886 tlb_fill 0x4770f000/1 1 2 tlb_fill 0x4770f000/1 1 2
> 1887 tlb_fill 0x476df939/1 0 2 <
> 1888 tlb_fill 0x47710000/1 1 2 tlb_fill 0x47710000/1 1 2
> 1889 tlb_fill 0x47711000/1 1 2 tlb_fill 0x47711000/1 1 2
>
> These don't seem to affect the overall program flow but are concerning
> because the memory access patterns should be the same. My
> investigations with rr seem to indicate the difference is due to
> behaviour of the victim_tlb_cache which again AFAICT should be
> deterministic.
>
> Anyway I can't spend any time debugging it this week so I thought I'd
> post the current state in case anyone is curious enough to want to go
> diving into record/replay.
Just for the actual failure report, on replay:
[ 144.918551] audit: type=1130 audit(1702415469.960:45): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm="
systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 145.579135] audit: type=1305 audit(1702415470.620:46): op=set audit_enabled=1 old=1 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
qemu-system-aarch64: Exception not in log (insn total 30592471098/1463 left, event 2848747 is EVENT_INSTRUCTION)
fish: Job 1, './qemu-system-aarch64 -display …' terminated by signal SIGABRT (Abort)
I've made some more fixes to the replay_dump script and the next event
expected is a CLOCK_WARP followed by some more instructions before an
exception. However there are a lot of exceptions/interrupts firing off
in the stream:
2848736:EVENT_EXCEPTION(2) no data
2848737:EVENT_INSTRUCTION(0) + 835 -> 30592463170
2848738:EVENT_EXCEPTION(2) no data
2848739:EVENT_INSTRUCTION(0) + 1133 -> 30592464303
2848740:EVENT_EXCEPTION(2) no data
2848741:EVENT_INSTRUCTION(0) + 4739 -> 30592469042
2848742:EVENT_EXCEPTION(2) no data
2848743:EVENT_INSTRUCTION(0) + 865 -> 30592469907
2848744:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848745:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848746:EVENT_INTERRUPT(1)
2848747:EVENT_INSTRUCTION(0) + 2654 -> 30592472561
we are here...
2848748:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848749:EVENT_INSTRUCTION(0) + 184 -> 30592472745
2848750:EVENT_EXCEPTION(2) no data
2848751:EVENT_INSTRUCTION(0) + 832 -> 30592473577
2848752:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848753:EVENT_INSTRUCTION(0) + 6 -> 30592473583
2848754:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
2848755:EVENT_INSTRUCTION(0) + 2960 -> 30592476543
2848756:EVENT_EXCEPTION(2) no data
2848757:EVENT_INSTRUCTION(0) + 1565 -> 30592478108
2848758:EVENT_EXCEPTION(2) no data
2848759:EVENT_INSTRUCTION(0) + 2996 -> 30592481104
2848760:EVENT_EXCEPTION(2) no data
2848761:EVENT_INSTRUCTION(0) + 832 -> 30592481936
2848762:EVENT_ASYNC(9)
REPLAY_ASYNC_EVENT_NET(6) id 0 flags 0 305 bytes
That said given the complexity of the exception/interrupt processing
back and forth I can see there is potential for things getting confused.
Of course I still haven't explained why there is a divergence in the tlb
processing so early on despite seemingly having no effect on the replay
until this very late stage.
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
prev parent reply other threads:[~2023-12-13 10:58 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-11 9:13 [PATCH v2 00/16] record/replay fixes: attempting to get avocado green Alex Bennée
2023-12-11 9:13 ` [PATCH v2 01/16] tests/avocado: add a simple i386 replay kernel test Alex Bennée
2023-12-11 9:13 ` [PATCH v2 02/16] tests/avocado: fix typo in replay_linux Alex Bennée
2023-12-11 9:13 ` [PATCH v2 03/16] tests/avocado: modernise the drive args for replay_linux Alex Bennée
2023-12-11 9:13 ` [PATCH v2 04/16] scripts/replay-dump: update to latest format Alex Bennée
2023-12-11 9:13 ` [PATCH v2 05/16] scripts/replay_dump: track total number of instructions Alex Bennée
2023-12-11 9:13 ` [PATCH v2 06/16] replay: remove host_clock_last Alex Bennée
2023-12-11 9:13 ` [PATCH v2 07/16] replay: add proper kdoc for ReplayState Alex Bennée
2023-12-11 9:13 ` [PATCH v2 08/16] replay: make has_unread_data a bool Alex Bennée
2023-12-11 9:13 ` [PATCH v2 09/16] replay: introduce a central report point for sync errors Alex Bennée
2023-12-11 9:13 ` [PATCH v2 10/16] replay/replay-char: use report_sync_error Alex Bennée
2023-12-11 17:38 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 11/16] replay: stop us hanging in rr_wait_io_event Alex Bennée
2023-12-11 17:39 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 12/16] chardev: force write all when recording replay logs Alex Bennée
2023-12-11 17:39 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 13/16] tests/avocado: remove skips from replay_kernel Alex Bennée
2023-12-11 9:13 ` [PATCH v2 14/16] accel/tcg: define tlb_fill as a trace point Alex Bennée
2023-12-11 13:04 ` Philippe Mathieu-Daudé
2023-12-11 17:46 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 15/16] accel/tcg: add trace_tlb_resize " Alex Bennée
2023-12-11 13:04 ` Philippe Mathieu-Daudé
2023-12-11 17:50 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 16/16] replay: report sync error when no exception in log (!DEBUG INVESTIGATION) Alex Bennée
2023-12-13 10:57 ` Alex Bennée [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87msue9xp8.fsf@draig.linaro.org \
--to=alex.bennee@linaro.org \
--cc=bleal@redhat.com \
--cc=crosa@redhat.com \
--cc=eduardo@habkost.net \
--cc=jsnow@redhat.com \
--cc=marcandre.lureau@redhat.com \
--cc=pavel.dovgaluk@ispras.ru \
--cc=pbonzini@redhat.com \
--cc=philmd@linaro.org \
--cc=qemu-devel@nongnu.org \
--cc=richard.henderson@linaro.org \
--cc=wainersm@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).