qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: "Alex Bennée" <alex.bennee@linaro.org>
To: qemu-devel@nongnu.org
Cc: "John Snow" <jsnow@redhat.com>,
	"Eduardo Habkost" <eduardo@habkost.net>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Wainer dos Santos Moschetta" <wainersm@redhat.com>,
	"Cleber Rosa" <crosa@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Beraldo Leal" <bleal@redhat.com>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	"Pavel Dovgalyuk" <pavel.dovgaluk@ispras.ru>
Subject: Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green
Date: Wed, 13 Dec 2023 10:57:39 +0000	[thread overview]
Message-ID: <87msue9xp8.fsf@draig.linaro.org> (raw)
In-Reply-To: <20231211091346.14616-1-alex.bennee@linaro.org> ("Alex Bennée"'s message of "Mon, 11 Dec 2023 09:13:29 +0000")

Alex Bennée <alex.bennee@linaro.org> writes:

> As I'm a glutton for punishment I thought I'd have a go at fixing the
> slowly growing number of record/replay bugs. The two fixes are:
>
>  replay: stop us hanging in rr_wait_io_event
>  chardev: force write all when recording replay logs
>
> I think we are beyond 8.2 material but it would be nice to get this
> functionality stable again. We have a growing number of bugs under the
> icount label on gitlab:
>
>   https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount
>
> Changes
> -------
>
> v2
>
> Apart from addressing tidy ups and tags I've been investigating the
> failures in replay_linux.py which are the more exhaustive tests which
> boot the kernel and user-space. The "fix":
>
>   replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
>
> triggers around the time of the hang in the logs and despite the
> rather hairy EXCP->INT transitions around cpu_exec_loop() I think
> points to a genuine problem. I added the tracing to cputlb to verify
> the page tables are the same and started detecting divergence between
> record and replay a lot earlier on that when the replay_sync_error()
> catches things. I see patterns like this:
>
>    1878 tlb_fill 0x4770c000/1 1 2                                       tlb_fill 0x4770c000/1 1 2
>    1879 tlb_fill 0x4770d000/1 1 2                                       tlb_fill 0x4770d000/1 1 2
>    1880 tlb_fill 0x59000/1 0 2                                          tlb_fill 0x59000/1 0 2
>    1881                                                               > tlb_fill 0x476dd116/1 0 2
>    1882 tlb_fill 0x4770e000/1 1 2                                       tlb_fill 0x4770e000/1 1 2
>    1883 tlb_fill 0x476dd527/1 0 2                                     | tlb_fill 0x476dfb17/1 0 2
>    1884                                                               > tlb_fill 0x476de0fd/1 0 2
>    1885                                                               > tlb_fill 0x476dce2e/1 0 2
>    1886 tlb_fill 0x4770f000/1 1 2                                       tlb_fill 0x4770f000/1 1 2
>    1887 tlb_fill 0x476df939/1 0 2                                     <
>    1888 tlb_fill 0x47710000/1 1 2                                       tlb_fill 0x47710000/1 1 2
>    1889 tlb_fill 0x47711000/1 1 2                                       tlb_fill 0x47711000/1 1 2
>
> These don't seem to affect the overall program flow but are concerning
> because the memory access patterns should be the same. My
> investigations with rr seem to indicate the difference is due to
> behaviour of the victim_tlb_cache which again AFAICT should be
> deterministic.
>
> Anyway I can't spend any time debugging it this week so I thought I'd
> post the current state in case anyone is curious enough to want to go
> diving into record/replay.

Just for the actual failure report, on replay:

  [  144.918551] audit: type=1130 audit(1702415469.960:45): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm="
  systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
  [  145.579135] audit: type=1305 audit(1702415470.620:46): op=set audit_enabled=1 old=1 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
  qemu-system-aarch64: Exception not in log (insn total 30592471098/1463 left, event 2848747 is EVENT_INSTRUCTION)
  fish: Job 1, './qemu-system-aarch64 -display …' terminated by signal SIGABRT (Abort)

I've made some more fixes to the replay_dump script and the next event
expected is a CLOCK_WARP followed by some more instructions before an
exception. However there are a lot of exceptions/interrupts firing off
in the stream:

  2848736:EVENT_EXCEPTION(2) no data
  2848737:EVENT_INSTRUCTION(0) + 835 -> 30592463170
  2848738:EVENT_EXCEPTION(2) no data
  2848739:EVENT_INSTRUCTION(0) + 1133 -> 30592464303
  2848740:EVENT_EXCEPTION(2) no data
  2848741:EVENT_INSTRUCTION(0) + 4739 -> 30592469042
  2848742:EVENT_EXCEPTION(2) no data
  2848743:EVENT_INSTRUCTION(0) + 865 -> 30592469907
  2848744:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848745:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848746:EVENT_INTERRUPT(1)
  2848747:EVENT_INSTRUCTION(0) + 2654 -> 30592472561

  we are here...

  2848748:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848749:EVENT_INSTRUCTION(0) + 184 -> 30592472745
  2848750:EVENT_EXCEPTION(2) no data
  2848751:EVENT_INSTRUCTION(0) + 832 -> 30592473577
  2848752:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848753:EVENT_INSTRUCTION(0) + 6 -> 30592473583
  2848754:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848755:EVENT_INSTRUCTION(0) + 2960 -> 30592476543
  2848756:EVENT_EXCEPTION(2) no data
  2848757:EVENT_INSTRUCTION(0) + 1565 -> 30592478108
  2848758:EVENT_EXCEPTION(2) no data
  2848759:EVENT_INSTRUCTION(0) + 2996 -> 30592481104
  2848760:EVENT_EXCEPTION(2) no data
  2848761:EVENT_INSTRUCTION(0) + 832 -> 30592481936
  2848762:EVENT_ASYNC(9)
    REPLAY_ASYNC_EVENT_NET(6) id 0 flags 0 305 bytes

That said given the complexity of the exception/interrupt processing
back and forth I can see there is potential for things getting confused.
Of course I still haven't explained why there is a divergence in the tlb
processing so early on despite seemingly having no effect on the replay
until this very late stage.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


      parent reply	other threads:[~2023-12-13 10:58 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-11  9:13 [PATCH v2 00/16] record/replay fixes: attempting to get avocado green Alex Bennée
2023-12-11  9:13 ` [PATCH v2 01/16] tests/avocado: add a simple i386 replay kernel test Alex Bennée
2023-12-11  9:13 ` [PATCH v2 02/16] tests/avocado: fix typo in replay_linux Alex Bennée
2023-12-11  9:13 ` [PATCH v2 03/16] tests/avocado: modernise the drive args for replay_linux Alex Bennée
2023-12-11  9:13 ` [PATCH v2 04/16] scripts/replay-dump: update to latest format Alex Bennée
2023-12-11  9:13 ` [PATCH v2 05/16] scripts/replay_dump: track total number of instructions Alex Bennée
2023-12-11  9:13 ` [PATCH v2 06/16] replay: remove host_clock_last Alex Bennée
2023-12-11  9:13 ` [PATCH v2 07/16] replay: add proper kdoc for ReplayState Alex Bennée
2023-12-11  9:13 ` [PATCH v2 08/16] replay: make has_unread_data a bool Alex Bennée
2023-12-11  9:13 ` [PATCH v2 09/16] replay: introduce a central report point for sync errors Alex Bennée
2023-12-11  9:13 ` [PATCH v2 10/16] replay/replay-char: use report_sync_error Alex Bennée
2023-12-11 17:38   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 11/16] replay: stop us hanging in rr_wait_io_event Alex Bennée
2023-12-11 17:39   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 12/16] chardev: force write all when recording replay logs Alex Bennée
2023-12-11 17:39   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 13/16] tests/avocado: remove skips from replay_kernel Alex Bennée
2023-12-11  9:13 ` [PATCH v2 14/16] accel/tcg: define tlb_fill as a trace point Alex Bennée
2023-12-11 13:04   ` Philippe Mathieu-Daudé
2023-12-11 17:46   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 15/16] accel/tcg: add trace_tlb_resize " Alex Bennée
2023-12-11 13:04   ` Philippe Mathieu-Daudé
2023-12-11 17:50   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 16/16] replay: report sync error when no exception in log (!DEBUG INVESTIGATION) Alex Bennée
2023-12-13 10:57 ` Alex Bennée [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87msue9xp8.fsf@draig.linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=bleal@redhat.com \
    --cc=crosa@redhat.com \
    --cc=eduardo@habkost.net \
    --cc=jsnow@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=pavel.dovgaluk@ispras.ru \
    --cc=pbonzini@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=wainersm@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).