From: "Alex Bennée" <alex.bennee@linaro.org>
To: qemu-devel@nongnu.org
Cc: "John Snow" <jsnow@redhat.com>,
"Eduardo Habkost" <eduardo@habkost.net>,
"Philippe Mathieu-Daudé" <philmd@linaro.org>,
"Paolo Bonzini" <pbonzini@redhat.com>,
"Wainer dos Santos Moschetta" <wainersm@redhat.com>,
"Cleber Rosa" <crosa@redhat.com>,
"Marc-André Lureau" <marcandre.lureau@redhat.com>,
"Beraldo Leal" <bleal@redhat.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
"Pavel Dovgalyuk" <pavel.dovgaluk@ispras.ru>,
"Alex Bennée" <alex.bennee@linaro.org>
Subject: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green
Date: Mon, 11 Dec 2023 09:13:29 +0000
Message-ID: <20231211091346.14616-1-alex.bennee@linaro.org>
As I'm a glutton for punishment I thought I'd have a go at fixing the
slowly growing number of record/replay bugs. The two fixes are:
replay: stop us hanging in rr_wait_io_event
chardev: force write all when recording replay logs
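
For anyone unfamiliar with the "write all" idea named in the second fix, here is a minimal sketch in Python. It is not the QEMU change (that lives in C in chardev/char.c); write_all is a hypothetical helper name and the retry policy is my assumption. It only illustrates looping until every byte has been accepted rather than giving up after a short write.

```python
import os
import select


def write_all(fd, data):
    """Write every byte of data to fd, retrying after short writes.

    Hypothetical helper for illustration only, not QEMU code.
    """
    view = memoryview(data)
    offset = 0
    while offset < len(view):
        try:
            # os.write() may accept fewer bytes than requested.
            offset += os.write(fd, view[offset:])
        except BlockingIOError:
            # A non-blocking fd is full: wait until it is writable again.
            select.select([], [fd], [])
    return offset
```

The caller only sees success once the whole buffer has gone out, so nothing is left half-written.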
I think we are beyond 8.2 material but it would be nice to get this
functionality stable again. We have a growing number of bugs under the
icount label on gitlab:
https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount
Changes
-------
v2
Apart from addressing tidy-ups and tags, I've been investigating the
failures in replay_linux.py, the more exhaustive tests which boot the
kernel and user-space. The "fix":
replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
triggers around the time of the hang in the logs and, despite the
rather hairy EXCP->INT transitions around cpu_exec_loop(), I think it
points to a genuine problem. I added tracing to cputlb to verify
the page tables are the same and started detecting divergence between
record and replay a lot earlier than when replay_sync_error()
catches things. I see patterns like this:
1878 tlb_fill 0x4770c000/1 1 2   tlb_fill 0x4770c000/1 1 2
1879 tlb_fill 0x4770d000/1 1 2   tlb_fill 0x4770d000/1 1 2
1880 tlb_fill 0x59000/1 0 2      tlb_fill 0x59000/1 0 2
1881                           > tlb_fill 0x476dd116/1 0 2
1882 tlb_fill 0x4770e000/1 1 2   tlb_fill 0x4770e000/1 1 2
1883 tlb_fill 0x476dd527/1 0 2 | tlb_fill 0x476dfb17/1 0 2
1884                           > tlb_fill 0x476de0fd/1 0 2
1885                           > tlb_fill 0x476dce2e/1 0 2
1886 tlb_fill 0x4770f000/1 1 2   tlb_fill 0x4770f000/1 1 2
1887 tlb_fill 0x476df939/1 0 2 <
1888 tlb_fill 0x47710000/1 1 2   tlb_fill 0x47710000/1 1 2
1889 tlb_fill 0x47711000/1 1 2   tlb_fill 0x47711000/1 1 2
These don't seem to affect the overall program flow but are concerning
because the memory access patterns should be the same. My
investigations with rr seem to indicate the difference is due to the
behaviour of the victim_tlb_cache, which again AFAICT should be
deterministic.
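
For anyone wanting to repeat this kind of comparison, below is a rough sketch of locating the first point where two tlb_fill traces diverge. It uses Python's difflib rather than whatever produced the side-by-side view above. The names RUN_A, RUN_B and first_divergence are made up for illustration, and which column is record and which is replay is left open.

```python
import difflib

# Left and right columns of the excerpt above, one trace line per entry.
RUN_A = [
    "tlb_fill 0x4770c000/1 1 2",
    "tlb_fill 0x4770d000/1 1 2",
    "tlb_fill 0x59000/1 0 2",
    "tlb_fill 0x4770e000/1 1 2",
    "tlb_fill 0x476dd527/1 0 2",
    "tlb_fill 0x4770f000/1 1 2",
    "tlb_fill 0x476df939/1 0 2",
    "tlb_fill 0x47710000/1 1 2",
    "tlb_fill 0x47711000/1 1 2",
]
RUN_B = [
    "tlb_fill 0x4770c000/1 1 2",
    "tlb_fill 0x4770d000/1 1 2",
    "tlb_fill 0x59000/1 0 2",
    "tlb_fill 0x476dd116/1 0 2",
    "tlb_fill 0x4770e000/1 1 2",
    "tlb_fill 0x476dfb17/1 0 2",
    "tlb_fill 0x476de0fd/1 0 2",
    "tlb_fill 0x476dce2e/1 0 2",
    "tlb_fill 0x4770f000/1 1 2",
    "tlb_fill 0x47710000/1 1 2",
    "tlb_fill 0x47711000/1 1 2",
]


def first_divergence(a, b):
    """Return (tag, a_index, b_index, a_lines, b_lines) for the first
    non-matching region between two traces, or None if they are identical."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            return tag, i1, j1, a[i1:i2], b[j1:j2]
    return None


if __name__ == "__main__":
    print(first_divergence(RUN_A, RUN_B))
```

On real logs the two lists would be read from the trace output of each run; the indices returned are positions in those lists, not the line numbers shown in the excerpt.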
Anyway I can't spend any time debugging it this week so I thought I'd
post the current state in case anyone is curious enough to want to go
diving into record/replay.
The following need review:
replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
accel/tcg: add trace_tlb_resize trace point
accel/tcg: define tlb_fill as a trace point
tests/avocado: remove skips from replay_kernel (1 acks, 1 sobs, 0 tbs)
replay: stop us hanging in rr_wait_io_event
replay/replay-char: use report_sync_error
tests/avocado: modernise the drive args for replay_linux
tests/avocado: add a simple i386 replay kernel test (2 acks, 1 sobs, 0 tbs)
Alex Bennée (16):
  tests/avocado: add a simple i386 replay kernel test
  tests/avocado: fix typo in replay_linux
  tests/avocado: modernise the drive args for replay_linux
  scripts/replay-dump: update to latest format
  scripts/replay_dump: track total number of instructions
  replay: remove host_clock_last
  replay: add proper kdoc for ReplayState
  replay: make has_unread_data a bool
  replay: introduce a central report point for sync errors
  replay/replay-char: use report_sync_error
  replay: stop us hanging in rr_wait_io_event
  chardev: force write all when recording replay logs
  tests/avocado: remove skips from replay_kernel
  accel/tcg: define tlb_fill as a trace point
  accel/tcg: add trace_tlb_resize trace point
  replay: report sync error when no exception in log (!DEBUG
    INVESTIGATION)
 include/sysemu/replay.h        |   5 ++
 replay/replay-internal.h       |  50 ++++++++----
 accel/tcg/cputlb.c             |   4 +
 accel/tcg/tcg-accel-ops-rr.c   |   2 +-
 chardev/char.c                 |  12 +++
 replay/replay-char.c           |   6 +-
 replay/replay-internal.c       |   5 +-
 replay/replay-snapshot.c       |   7 +-
 replay/replay.c                | 141 ++++++++++++++++++++++++++++++++-
 accel/tcg/trace-events         |   2 +
 scripts/replay-dump.py         |  95 +++++++++++++++++++---
 tests/avocado/replay_kernel.py |  27 ++++---
 tests/avocado/replay_linux.py  |   9 ++-
 13 files changed, 314 insertions(+), 51 deletions(-)
--
2.39.2
Thread overview: 25+ messages
2023-12-11 9:13 Alex Bennée [this message]
2023-12-11 9:13 ` [PATCH v2 01/16] tests/avocado: add a simple i386 replay kernel test Alex Bennée
2023-12-11 9:13 ` [PATCH v2 02/16] tests/avocado: fix typo in replay_linux Alex Bennée
2023-12-11 9:13 ` [PATCH v2 03/16] tests/avocado: modernise the drive args for replay_linux Alex Bennée
2023-12-11 9:13 ` [PATCH v2 04/16] scripts/replay-dump: update to latest format Alex Bennée
2023-12-11 9:13 ` [PATCH v2 05/16] scripts/replay_dump: track total number of instructions Alex Bennée
2023-12-11 9:13 ` [PATCH v2 06/16] replay: remove host_clock_last Alex Bennée
2023-12-11 9:13 ` [PATCH v2 07/16] replay: add proper kdoc for ReplayState Alex Bennée
2023-12-11 9:13 ` [PATCH v2 08/16] replay: make has_unread_data a bool Alex Bennée
2023-12-11 9:13 ` [PATCH v2 09/16] replay: introduce a central report point for sync errors Alex Bennée
2023-12-11 9:13 ` [PATCH v2 10/16] replay/replay-char: use report_sync_error Alex Bennée
2023-12-11 17:38 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 11/16] replay: stop us hanging in rr_wait_io_event Alex Bennée
2023-12-11 17:39 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 12/16] chardev: force write all when recording replay logs Alex Bennée
2023-12-11 17:39 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 13/16] tests/avocado: remove skips from replay_kernel Alex Bennée
2023-12-11 9:13 ` [PATCH v2 14/16] accel/tcg: define tlb_fill as a trace point Alex Bennée
2023-12-11 13:04 ` Philippe Mathieu-Daudé
2023-12-11 17:46 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 15/16] accel/tcg: add trace_tlb_resize " Alex Bennée
2023-12-11 13:04 ` Philippe Mathieu-Daudé
2023-12-11 17:50 ` Richard Henderson
2023-12-11 9:13 ` [PATCH v2 16/16] replay: report sync error when no exception in log (!DEBUG INVESTIGATION) Alex Bennée
2023-12-13 10:57 ` [PATCH v2 00/16] record/replay fixes: attempting to get avocado green Alex Bennée