All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alex Bennée" <alex.bennee@linaro.org>
To: qemu-devel@nongnu.org
Cc: "John Snow" <jsnow@redhat.com>,
	"Eduardo Habkost" <eduardo@habkost.net>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Wainer dos Santos Moschetta" <wainersm@redhat.com>,
	"Cleber Rosa" <crosa@redhat.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Beraldo Leal" <bleal@redhat.com>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	"Pavel Dovgalyuk" <pavel.dovgaluk@ispras.ru>
Subject: Re: [PATCH v2 00/16] record/replay fixes: attempting to get avocado green
Date: Wed, 13 Dec 2023 10:57:39 +0000	[thread overview]
Message-ID: <87msue9xp8.fsf@draig.linaro.org> (raw)
In-Reply-To: <20231211091346.14616-1-alex.bennee@linaro.org> ("Alex Bennée"'s message of "Mon, 11 Dec 2023 09:13:29 +0000")

Alex Bennée <alex.bennee@linaro.org> writes:

> As I'm a glutton for punishment I thought I'd have a go at fixing the
> slowly growing number of record/replay bugs. The two fixes are:
>
>  replay: stop us hanging in rr_wait_io_event
>  chardev: force write all when recording replay logs
>
> I think we are beyond 8.2 material but it would be nice to get this
> functionality stable again. We have a growing number of bugs under the
> icount label on gitlab:
>
>   https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount
>
> Changes
> -------
>
> v2
>
> Apart from addressing tidy ups and tags I've been investigating the
> failures in replay_linux.py which are the more exhaustive tests which
> boot the kernel and user-space. The "fix":
>
>   replay: report sync error when no exception in log (!DEBUG INVESTIGATION)
>
> triggers around the time of the hang in the logs and despite the
> rather hairy EXCP->INT transitions around cpu_exec_loop() I think
> points to a genuine problem. I added the tracing to cputlb to verify
> the page tables are the same and started detecting divergence between
> record and replay a lot earlier on that when the replay_sync_error()
> catches things. I see patterns like this:
>
>    1878 tlb_fill 0x4770c000/1 1 2                                       tlb_fill 0x4770c000/1 1 2
>    1879 tlb_fill 0x4770d000/1 1 2                                       tlb_fill 0x4770d000/1 1 2
>    1880 tlb_fill 0x59000/1 0 2                                          tlb_fill 0x59000/1 0 2
>    1881                                                               > tlb_fill 0x476dd116/1 0 2
>    1882 tlb_fill 0x4770e000/1 1 2                                       tlb_fill 0x4770e000/1 1 2
>    1883 tlb_fill 0x476dd527/1 0 2                                     | tlb_fill 0x476dfb17/1 0 2
>    1884                                                               > tlb_fill 0x476de0fd/1 0 2
>    1885                                                               > tlb_fill 0x476dce2e/1 0 2
>    1886 tlb_fill 0x4770f000/1 1 2                                       tlb_fill 0x4770f000/1 1 2
>    1887 tlb_fill 0x476df939/1 0 2                                     <
>    1888 tlb_fill 0x47710000/1 1 2                                       tlb_fill 0x47710000/1 1 2
>    1889 tlb_fill 0x47711000/1 1 2                                       tlb_fill 0x47711000/1 1 2
>
> These don't seem to affect the overall program flow but are concerning
> because the memory access patterns should be the same. My
> investigations with rr seem to indicate the difference is due to
> behaviour of the victim_tlb_cache which again AFAICT should be
> deterministic.
>
> Anyway I can't spend any time debugging it this week so I thought I'd
> post the current state in case anyone is curious enough to want to go
> diving into record/replay.

Just for the actual failure report, on replay:

  [  144.918551] audit: type=1130 audit(1702415469.960:45): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm="
  systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
  [  145.579135] audit: type=1305 audit(1702415470.620:46): op=set audit_enabled=1 old=1 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
  qemu-system-aarch64: Exception not in log (insn total 30592471098/1463 left, event 2848747 is EVENT_INSTRUCTION)
  fish: Job 1, './qemu-system-aarch64 -display …' terminated by signal SIGABRT (Abort)

I've made some more fixes to the replay_dump script and the next event
expected is a CLOCK_WARP followed by some more instructions before an
exception. However there are a lot of exceptions/interrupts firing off
in the stream:

  2848736:EVENT_EXCEPTION(2) no data
  2848737:EVENT_INSTRUCTION(0) + 835 -> 30592463170
  2848738:EVENT_EXCEPTION(2) no data
  2848739:EVENT_INSTRUCTION(0) + 1133 -> 30592464303
  2848740:EVENT_EXCEPTION(2) no data
  2848741:EVENT_INSTRUCTION(0) + 4739 -> 30592469042
  2848742:EVENT_EXCEPTION(2) no data
  2848743:EVENT_INSTRUCTION(0) + 865 -> 30592469907
  2848744:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848745:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848746:EVENT_INTERRUPT(1)
  2848747:EVENT_INSTRUCTION(0) + 2654 -> 30592472561

  we are here...

  2848748:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848749:EVENT_INSTRUCTION(0) + 184 -> 30592472745
  2848750:EVENT_EXCEPTION(2) no data
  2848751:EVENT_INSTRUCTION(0) + 832 -> 30592473577
  2848752:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848753:EVENT_INSTRUCTION(0) + 6 -> 30592473583
  2848754:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data
  2848755:EVENT_INSTRUCTION(0) + 2960 -> 30592476543
  2848756:EVENT_EXCEPTION(2) no data
  2848757:EVENT_INSTRUCTION(0) + 1565 -> 30592478108
  2848758:EVENT_EXCEPTION(2) no data
  2848759:EVENT_INSTRUCTION(0) + 2996 -> 30592481104
  2848760:EVENT_EXCEPTION(2) no data
  2848761:EVENT_INSTRUCTION(0) + 832 -> 30592481936
  2848762:EVENT_ASYNC(9)
    REPLAY_ASYNC_EVENT_NET(6) id 0 flags 0 305 bytes

That said given the complexity of the exception/interrupt processing
back and forth I can see there is potential for things getting confused.
Of course I still haven't explained why there is a divergence in the tlb
processing so early on despite seemingly having no effect on the replay
until this very late stage.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


      parent reply	other threads:[~2023-12-13 10:58 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-11  9:13 [PATCH v2 00/16] record/replay fixes: attempting to get avocado green Alex Bennée
2023-12-11  9:13 ` [PATCH v2 01/16] tests/avocado: add a simple i386 replay kernel test Alex Bennée
2023-12-11  9:13 ` [PATCH v2 02/16] tests/avocado: fix typo in replay_linux Alex Bennée
2023-12-11  9:13 ` [PATCH v2 03/16] tests/avocado: modernise the drive args for replay_linux Alex Bennée
2023-12-11  9:13 ` [PATCH v2 04/16] scripts/replay-dump: update to latest format Alex Bennée
2023-12-11  9:13 ` [PATCH v2 05/16] scripts/replay_dump: track total number of instructions Alex Bennée
2023-12-11  9:13 ` [PATCH v2 06/16] replay: remove host_clock_last Alex Bennée
2023-12-11  9:13 ` [PATCH v2 07/16] replay: add proper kdoc for ReplayState Alex Bennée
2023-12-11  9:13 ` [PATCH v2 08/16] replay: make has_unread_data a bool Alex Bennée
2023-12-11  9:13 ` [PATCH v2 09/16] replay: introduce a central report point for sync errors Alex Bennée
2023-12-11  9:13 ` [PATCH v2 10/16] replay/replay-char: use report_sync_error Alex Bennée
2023-12-11 17:38   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 11/16] replay: stop us hanging in rr_wait_io_event Alex Bennée
2023-12-11 17:39   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 12/16] chardev: force write all when recording replay logs Alex Bennée
2023-12-11 17:39   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 13/16] tests/avocado: remove skips from replay_kernel Alex Bennée
2023-12-11  9:13 ` [PATCH v2 14/16] accel/tcg: define tlb_fill as a trace point Alex Bennée
2023-12-11 13:04   ` Philippe Mathieu-Daudé
2023-12-11 17:46   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 15/16] accel/tcg: add trace_tlb_resize " Alex Bennée
2023-12-11 13:04   ` Philippe Mathieu-Daudé
2023-12-11 17:50   ` Richard Henderson
2023-12-11  9:13 ` [PATCH v2 16/16] replay: report sync error when no exception in log (!DEBUG INVESTIGATION) Alex Bennée
2023-12-13 10:57 ` Alex Bennée [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87msue9xp8.fsf@draig.linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=bleal@redhat.com \
    --cc=crosa@redhat.com \
    --cc=eduardo@habkost.net \
    --cc=jsnow@redhat.com \
    --cc=marcandre.lureau@redhat.com \
    --cc=pavel.dovgaluk@ispras.ru \
    --cc=pbonzini@redhat.com \
    --cc=philmd@linaro.org \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    --cc=wainersm@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.