[PATCH RFC] util/error.c: Print backtrace on error

qemu-devel.nongnu.org archive mirror

* [PATCH RFC] util/error.c: Print backtrace on error
@ 2025-08-05  9:19 Manos Pitsidianakis
  2025-08-05 15:59 ` Daniel P. Berrangé
  0 siblings, 1 reply; 13+ messages in thread
From: Manos Pitsidianakis @ 2025-08-05  9:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Markus Armbruster, Alex Bennée,
	Gustavo Romero, Pierrick Bouvier, Manos Pitsidianakis

Add a backtrace_on_error meson feature (enabled with
--enable-backtrace-on-error) that links system binaries with the
-rdynamic option and prints a function backtrace to stderr on error.

Example output, produced by adding an unconditional error_setg() call
with &error_abort in hw/arm/boot.c:

  ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
  ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
  ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
  ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
  ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
  ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
  ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
  ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
  ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
  ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
  ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
  ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
  /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
  ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]

  Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:

For qemu-system-aarch64, this adds 2MB to the binary size, so it is
not enabled by default. I have only tested this on Linux; it might not
be portable.

Signed-off-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
---
 meson.build                   |  3 +++
 meson_options.txt             |  2 ++
 scripts/meson-buildoptions.sh |  4 ++++
 util/error.c                  | 17 +++++++++++++++++
 4 files changed, 26 insertions(+)

diff --git a/meson.build b/meson.build
index e53cd5b413847f33c972540e1f67a092164a200d..5d6ec5a5d5cb4c176bb32450f2adf85fc4963105 100644
--- a/meson.build
+++ b/meson.build
@@ -2649,6 +2649,7 @@ config_host_data.set('CONFIG_DEBUG_STACK_USAGE', get_option('debug_stack_usage')
 config_host_data.set('CONFIG_DEBUG_TCG', get_option('debug_tcg'))
 config_host_data.set('CONFIG_DEBUG_REMAP', get_option('debug_remap'))
 config_host_data.set('CONFIG_QOM_CAST_DEBUG', get_option('qom_cast_debug'))
+config_host_data.set('CONFIG_BACKTRACE', get_option('backtrace_on_error'))
 config_host_data.set('CONFIG_REPLICATION', get_option('replication').allowed())
 config_host_data.set('CONFIG_FSFREEZE', qga_fsfreeze)
 config_host_data.set('CONFIG_FSTRIM', qga_fstrim)
@@ -4454,6 +4455,7 @@ foreach target : target_dirs
     endif
 
     emulator = executable(exe_name, exe['sources'],
+               export_dynamic: get_option('backtrace_on_error'),
                install: true,
                c_args: c_args,
                dependencies: arch_deps + exe['dependencies'],
@@ -4714,6 +4716,7 @@ if 'simple' in get_option('trace_backends')
   summary_info += {'Trace output file': get_option('trace_file') + '-<pid>'}
 endif
 summary_info += {'QOM debugging':     get_option('qom_cast_debug')}
+summary_info += {'Backtrace support':     get_option('backtrace_on_error')}
 summary_info += {'Relocatable install': get_option('relocatable')}
 summary_info += {'vhost-kernel support': have_vhost_kernel}
 summary_info += {'vhost-net support': have_vhost_net}
diff --git a/meson_options.txt b/meson_options.txt
index dd335307505fb52c5be469aa4fab8883c93ec5c9..450c17349263429e59413b0c483c2d52db13325e 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -368,6 +368,8 @@ option('debug_stack_usage', type: 'boolean', value: false,
        description: 'measure coroutine stack usage')
 option('qom_cast_debug', type: 'boolean', value: true,
        description: 'cast debugging support')
+option('backtrace_on_error', type: 'boolean', value: false,
+       description: 'print backtrace on error')
 option('slirp_smbd', type : 'feature', value : 'auto',
        description: 'use smbd (at path --smbd=*) in slirp networking')
 
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index d559e260ed17e50a423aa25b27a86777489565d7..0dd28c93e4d8bd24f7343745c6462aa5d9a603ee 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -22,6 +22,8 @@ meson_options_help() {
   printf "%s\n" '  --docdir=VALUE           Base directory for documentation installation'
   printf "%s\n" '                           (can be empty) [share/doc]'
   printf "%s\n" '  --enable-asan            enable address sanitizer'
+  printf "%s\n" '  --enable-backtrace-on-error'
+  printf "%s\n" '                           print backtrace on error'
   printf "%s\n" '  --enable-block-drv-whitelist-in-tools'
   printf "%s\n" '                           use block whitelist also in tools instead of only'
   printf "%s\n" '                           QEMU'
@@ -251,6 +253,8 @@ _meson_option_parse() {
     --disable-gcov) printf "%s" -Db_coverage=false ;;
     --enable-lto) printf "%s" -Db_lto=true ;;
     --disable-lto) printf "%s" -Db_lto=false ;;
+    --enable-backtrace-on-error) printf "%s" -Dbacktrace_on_error=true ;;
+    --disable-backtrace-on-error) printf "%s" -Dbacktrace_on_error=false ;;
     --bindir=*) quote_sh "-Dbindir=$2" ;;
     --enable-blkio) printf "%s" -Dblkio=enabled ;;
     --disable-blkio) printf "%s" -Dblkio=disabled ;;
diff --git a/util/error.c b/util/error.c
index daea2142f30121abc46e5342526f5eec40ea246e..d834756abd563f4108b7599bb4da1e5dd85ec46c 100644
--- a/util/error.c
+++ b/util/error.c
@@ -17,13 +17,28 @@
 #include "qemu/error-report.h"
 #include "qapi/error-internal.h"
 
+#ifdef CONFIG_BACKTRACE
+#include <execinfo.h>
+#endif /* CONFIG_BACKTRACE */
+
 Error *error_abort;
 Error *error_fatal;
 Error *error_warn;
 
+static inline void dump_backtrace(void)
+{
+#ifdef CONFIG_BACKTRACE
+    void *buffer[255] = { 0 };
+    const int calls =
+        backtrace(buffer, sizeof(buffer) / sizeof(void *));
+    backtrace_symbols_fd(buffer, calls, 2);
+#endif /* CONFIG_BACKTRACE */
+}
+
 static void error_handle(Error **errp, Error *err)
 {
     if (errp == &error_abort) {
+        dump_backtrace();
         if (err->func) {
             fprintf(stderr, "Unexpected error in %s() at %.*s:%d:\n",
                     err->func, err->src_len, err->src, err->line);
@@ -38,10 +53,12 @@ static void error_handle(Error **errp, Error *err)
         abort();
     }
     if (errp == &error_fatal) {
+        dump_backtrace();
         error_report_err(err);
         exit(1);
     }
     if (errp == &error_warn) {
+        dump_backtrace();
         warn_report_err(err);
     } else if (errp && !*errp) {
         *errp = err;

---
base-commit: a41280fd5b94c49089f7631c6fa8bb9c308b7962
change-id: 20250805-backtrace-e8e4c2490b46

--
γαῖα πυρί μιχθήτω




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05  9:19 [PATCH RFC] util/error.c: Print backtrace on error Manos Pitsidianakis
@ 2025-08-05 15:59 ` Daniel P. Berrangé
  2025-08-05 16:22   ` Manos Pitsidianakis
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel P. Berrangé @ 2025-08-05 15:59 UTC (permalink / raw)
  To: Manos Pitsidianakis
  Cc: qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Markus Armbruster, Alex Bennée,
	Gustavo Romero, Pierrick Bouvier

On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
> Add a backtrace_on_error meson feature (enabled with
> --enable-backtrace-on-error) that compiles system binaries with
> -rdynamic option and prints a function backtrace on error to stderr.
> 
> Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
> 
>   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
>   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
>   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
>   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
>   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
>   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
>   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
>   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
>   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
>   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
>   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
>   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
>   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
>   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
>   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
> 
>   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:

From an end-user POV, IMHO the error messages need to be good enough
that such backtraces aren't needed to understand the problem. For
developers, GDB can give much better backtraces (file+line numbers,
plus parameters plus local variables) in the ideally rare cases that
the error message alone has insufficient info. So I'm not really
convinced that programs (in general, not just QEMU) should try to
create backtraces themselves.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05 15:59 ` Daniel P. Berrangé
@ 2025-08-05 16:22   ` Manos Pitsidianakis
  2025-08-05 16:48     ` Daniel P. Berrangé
  2025-08-07  5:23     ` Markus Armbruster
  0 siblings, 2 replies; 13+ messages in thread
From: Manos Pitsidianakis @ 2025-08-05 16:22 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Markus Armbruster, Alex Bennée,
	Gustavo Romero, Pierrick Bouvier

On Tue, Aug 5, 2025 at 7:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
> > Add a backtrace_on_error meson feature (enabled with
> > --enable-backtrace-on-error) that compiles system binaries with
> > -rdynamic option and prints a function backtrace on error to stderr.
> >
> > Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
> >
> >   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
> >   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
> >   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
> >   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
> >   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
> >   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
> >   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
> >   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
> >   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
> >   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
> >   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
> >   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
> >   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
> >   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
> >   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
> >
> >   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:
>
> From an end-user POV, IMHO the error messages need to be good enough
> that such backtraces aren't needed to understand the problem. For
> developers, GDB can give much better backtraces (file+line numbers,
> plus parameters plus local variables) in the ideally rare cases that
> the error message alone has insufficient info. So I'm not really
> convinced that programs (in general, not just QEMU) should try to
> create backtraces themselves.

Hi Daniel,

I agree, I don't think there's value in replacing gdb debugging with
this. I think it has value for "fire and forget" uses, when errors
happen unexpectedly, are hard to replicate, and you end up with only
log entries and no easy way to debug them.

-- 
Manos Pitsidianakis
Emulation and Virtualization Engineer at Linaro Ltd



* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05 16:22   ` Manos Pitsidianakis
@ 2025-08-05 16:48     ` Daniel P. Berrangé
  2025-08-05 16:57       ` Manos Pitsidianakis
  2025-08-07  5:23     ` Markus Armbruster
  1 sibling, 1 reply; 13+ messages in thread
From: Daniel P. Berrangé @ 2025-08-05 16:48 UTC (permalink / raw)
  To: Manos Pitsidianakis
  Cc: qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Markus Armbruster, Alex Bennée,
	Gustavo Romero, Pierrick Bouvier

On Tue, Aug 05, 2025 at 07:22:14PM +0300, Manos Pitsidianakis wrote:
> On Tue, Aug 5, 2025 at 7:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
> > > Add a backtrace_on_error meson feature (enabled with
> > > --enable-backtrace-on-error) that compiles system binaries with
> > > -rdynamic option and prints a function backtrace on error to stderr.
> > >
> > > Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
> > >
> > >   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
> > >   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
> > >   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
> > >   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
> > >   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
> > >   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
> > >   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
> > >   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
> > >   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
> > >   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
> > >   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
> > >   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
> > >   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
> > >   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
> > >   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
> > >
> > >   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:
> >
> > From an end-user POV, IMHO the error messages need to be good enough
> > that such backtraces aren't needed to understand the problem. For
> > developers, GDB can give much better backtraces (file+line numbers,
> > plus parameters plus local variables) in the ideally rare cases that
> > the error message alone has insufficient info. So I'm not really
> > convinced that programs (in general, not just QEMU) should try to
> > create backtraces themselves.
> 
> I don't think there's value in replacing gdb debugging with this, I
> agree. I think it has value for "fire and forget" uses, when errors
> happen unexpectedly and are hard to replicate and you only end up with
> log entries and no easy way to debug it.

If the log entry with the error message is useless for devs, then it
is even worse for end users... who will be copying that message into
bug reports anyway. This patch doesn't feel like something we could
enable in formal builds in the distro, so we still need better error
reporting without it, such that user bug reports are actionable.

Was there a specific place where you found things hard to debug
from the error message alone ?  I'm sure we have plenty of examples
of errors that can be improved, but wondering if there are some
general patterns we're doing badly that would be a good win
to improve ?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05 16:48     ` Daniel P. Berrangé
@ 2025-08-05 16:57       ` Manos Pitsidianakis
  2025-08-05 18:00         ` Daniel P. Berrangé
  0 siblings, 1 reply; 13+ messages in thread
From: Manos Pitsidianakis @ 2025-08-05 16:57 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Markus Armbruster, Alex Bennée,
	Gustavo Romero, Pierrick Bouvier

On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 05, 2025 at 07:22:14PM +0300, Manos Pitsidianakis wrote:
> > On Tue, Aug 5, 2025 at 7:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
> > > > Add a backtrace_on_error meson feature (enabled with
> > > > --enable-backtrace-on-error) that compiles system binaries with
> > > > -rdynamic option and prints a function backtrace on error to stderr.
> > > >
> > > > Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
> > > >
> > > >   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
> > > >   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
> > > >   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
> > > >   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
> > > >   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
> > > >   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
> > > >   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
> > > >   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
> > > >   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
> > > >   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
> > > >   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
> > > >   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
> > > >   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
> > > >   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
> > > >   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
> > > >
> > > >   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:
> > >
> > > From an end-user POV, IMHO the error messages need to be good enough
> > > that such backtraces aren't needed to understand the problem. For
> > > developers, GDB can give much better backtraces (file+line numbers,
> > > plus parameters plus local variables) in the ideally rare cases that
> > > the error message alone has insufficient info. So I'm not really
> > > convinced that programs (in general, not just QEMU) should try to
> > > create backtraces themselves.
> >
> > I don't think there's value in replacing gdb debugging with this, I
> > agree. I think it has value for "fire and forget" uses, when errors
> > happen unexpectedly and are hard to replicate and you only end up with
> > log entries and no easy way to debug it.
>
> If the log entry with the error message is useless for devs, then it
> is even worse for end users... who will be copying that message into
> bug reports anyway. This patch doesn't feel like something we could
> enable in formal builds in the distro, so we still need better error
> reporting without it, such that user bug reports are actionable.
>
> Was there a specific place where you found things hard to debug
> from the error message alone ?  I'm sure we have plenty of examples
> of errors that can be improved, but wondering if there are some
> general patterns we're doing badly that would be a good win
> to improve ?

Some months ago I was debugging a MemoryRegion use-after-free and used
this code to figure out that the free was called from RCU context
instead of the main thread.

For problems where the error can happen from multiple contexts and
places in the code-base, a backtrace can provide additional insight
that might be helpful in a few cases. Again, the intended use case is
not developers with gdb who can reproduce a bug. People ask on IRC
about bugs that happen rarely over the timespan of a few
months, and they only have logs to go on. Considering that this
feature can be off by default (as it is in this RFC) I don't think
there's potential for distro end users to be confused.
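
For illustration, here is a minimal standalone sketch of the glibc calls
involved, independent of the QEMU patch. The helper name
dump_backtrace_to_fd() and the MAX_FRAMES limit are hypothetical;
backtrace() and backtrace_symbols_fd() come from <execinfo.h>, which is a
glibc extension and may be missing on other C libraries:

```c
#include <assert.h>
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 64 /* arbitrary cap chosen for this sketch */

/*
 * Hypothetical helper: write the caller's stack to fd, one line per
 * frame, and return the number of frames captured.
 */
static int dump_backtrace_to_fd(int fd)
{
    void *frames[MAX_FRAMES];
    int n;

    assert(fd >= 0);
    /* Collect up to MAX_FRAMES return addresses from the current stack. */
    n = backtrace(frames, MAX_FRAMES);
    /* Symbolise and write directly to fd; this variant avoids malloc(). */
    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

Calling dump_backtrace_to_fd(STDERR_FILENO) from an error path prints the
stack of whichever thread and call chain hit the error. Function names
only show up for symbols present in the dynamic symbol table, which is
why the patch links with -rdynamic; other frames appear as raw offsets.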

Thanks,

-- 
Manos Pitsidianakis
Emulation and Virtualization Engineer at Linaro Ltd



* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05 16:57       ` Manos Pitsidianakis
@ 2025-08-05 18:00         ` Daniel P. Berrangé
  2025-08-06 11:11           ` Alex Bennée
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel P. Berrangé @ 2025-08-05 18:00 UTC (permalink / raw)
  To: Manos Pitsidianakis
  Cc: qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Markus Armbruster, Alex Bennée,
	Gustavo Romero, Pierrick Bouvier

On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Aug 05, 2025 at 07:22:14PM +0300, Manos Pitsidianakis wrote:
> > > On Tue, Aug 5, 2025 at 7:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
> > > > > Add a backtrace_on_error meson feature (enabled with
> > > > > --enable-backtrace-on-error) that compiles system binaries with
> > > > > -rdynamic option and prints a function backtrace on error to stderr.
> > > > >
> > > > > Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
> > > > >
> > > > >   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
> > > > >   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
> > > > >   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
> > > > >   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
> > > > >   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
> > > > >   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
> > > > >   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
> > > > >   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
> > > > >   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
> > > > >   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
> > > > >   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
> > > > >   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
> > > > >   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
> > > > >   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
> > > > >   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
> > > > >
> > > > >   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:
> > > >
> > > > From an end-user POV, IMHO the error messages need to be good enough
> > > > that such backtraces aren't needed to understand the problem. For
> > > > developers, GDB can give much better backtraces (file+line numbers,
> > > > plus parameters plus local variables) in the ideally rare cases that
> > > > the error message alone has insufficient info. So I'm not really
> > > > convinced that programs (in general, not just QEMU) should try to
> > > > create backtraces themselves.
> > >
> > > I don't think there's value in replacing gdb debugging with this, I
> > > agree. I think it has value for "fire and forget" uses, when errors
> > > happen unexpectedly and are hard to replicate and you only end up with
> > > log entries and no easy way to debug it.
> >
> > If the log entry with the error message is useless for devs, then it
> > is even worse for end users... who will be copying that message into
> > bug reports anyway. This patch doesn't feel like something we could
> > enable in formal builds in the distro, so we still need better error
> > reporting without it, such that user bug reports are actionable.
> >
> > Was there a specific place where you found things hard to debug
> > from the error message alone ?  I'm sure we have plenty of examples
> > of errors that can be improved, but wondering if there are some
> > general patterns we're doing badly that would be a good win
> > to improve ?
> 
> Some months ago I was debugging a MemoryRegion use-after-free and used
> this code to figure out that the free was called from RCU context
> instead of the main thread.

We give useful names to many (but not necessarily all) threads that we
spawn. Perhaps we should call pthread_getname_np() to fetch the current
thread name, and use that as a prefix on the error message we print
out, as a bit of extra context ?

Obviously not as much info as a full stack trace, but that is something
we could likely enable unconditionally without any overheads to worry
about, so a likely incremental win.
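
A rough sketch of what such a prefix could look like, assuming glibc on
Linux. The helper name format_with_thread_name() is made up for
illustration and is not an existing QEMU function; only
pthread_getname_np() and pthread_self() are real library calls:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for pthread_getname_np() on glibc */
#endif
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define THREAD_NAME_MAX 16 /* Linux limit, including the trailing NUL */

/*
 * Hypothetical helper: format "<thread-name>: <msg>" into out.
 * Returns what snprintf() returns, i.e. the untruncated length.
 */
static int format_with_thread_name(char *out, size_t outlen, const char *msg)
{
    char name[THREAD_NAME_MAX] = "";

    assert(out != NULL && msg != NULL);
    /* Fall back to a placeholder if the name cannot be fetched. */
    if (pthread_getname_np(pthread_self(), name, sizeof(name)) != 0) {
        snprintf(name, sizeof(name), "?");
    }
    return snprintf(out, outlen, "%s: %s", name, msg);
}
```

The lookup only happens when an error is actually being reported, so
the cost on the normal path is nil. How useful the prefix is depends on
the threads having been given names in the first place.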

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05 18:00         ` Daniel P. Berrangé
@ 2025-08-06 11:11           ` Alex Bennée
  2025-08-06 11:34             ` Daniel P. Berrangé
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Bennée @ 2025-08-06 11:11 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Manos Pitsidianakis, qemu-devel, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Markus Armbruster, Gustavo Romero, Pierrick Bouvier

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
>> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>> >
>> > On Tue, Aug 05, 2025 at 07:22:14PM +0300, Manos Pitsidianakis wrote:
>> > > On Tue, Aug 5, 2025 at 7:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>> > > >
>> > > > On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
>> > > > > Add a backtrace_on_error meson feature (enabled with
>> > > > > --enable-backtrace-on-error) that compiles system binaries with
>> > > > > -rdynamic option and prints a function backtrace on error to stderr.
>> > > > >
>> > > > > Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
>> > > > >
>> > > > >   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
>> > > > >   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
>> > > > >   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
>> > > > >   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
>> > > > >   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
>> > > > >   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
>> > > > >   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
>> > > > >   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
>> > > > >   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
>> > > > >   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
>> > > > >   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
>> > > > >   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
>> > > > >   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
>> > > > >   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
>> > > > >   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
>> > > > >
>> > > > >   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:
>> > > >
>> > > > From an end-user POV, IMHO the error messages need to be good enough
>> > > > that such backtraces aren't needed to understand the problem. For
>> > > > developers, GDB can give much better backtraces (file+line numbers,
>> > > > plus parameters plus local variables) in the ideally rare cases that
>> > > > the error message alone has insufficient info. So I'm not really
>> > > > convinced that programs (in general, not just QEMU) should try to
>> > > > create backtraces themselves.
>> > >
>> > > I don't think there's value in replacing gdb debugging with this, I
>> > > agree. I think it has value for "fire and forget" uses, when errors
>> > > happen unexpectedly and are hard to replicate and you only end up with
>> > > log entries and no easy way to debug it.
>> >
>> > If the log entry with the error message is useless for devs, then it
>> > is even worse for end users... who will be copying that message into
>> > bug reports anyway. This patch doesn't feel like something we could
>> > enable in formal builds in the distro, so we still need better error
>> > reporting without it, such that user bug reports are actionable.
>> >
>> > Was there a specific place where you found things hard to debug
>> > from the error message alone ?  I'm sure we have plenty of examples
>> > of errors that can be improved, but wondering if there are some
>> > general patterns we're doing badly that would be a good win
>> > to improve ?
>> 
>> Some months ago I was debugging a MemoryRegion use-after-free and used
>> this code to figure out that the free was called from RCU context
>> instead of the main thread.
>
> We give useful names to many (but not necessarily all) threads that we
> spawn. Perhaps we should call pthread_getname_np() to fetch the current
> thread name, and use that as a prefix on the error message we print
> out, as a bit of extra context ?

Do we always have sensible names for threads or only if we enable the
option?

> Obviously not as much info as a full stack trace, but that is something
> we could likely enable unconditionally without any overheads to worry
> about, so a likely incremental win.

The place where it comes in useful is when we get bug reports from users
who have crashed QEMU in an embedded Docker container and can't give us a
reasonable reproducer. If we can encourage such users to enable this
option (or maybe make it part of --enable-debug-info) then we could get
a slightly more useful backtrace for those bugs.

I agree most sane configurations (i.e. a distro) would just attach gdb
and use whatever symbol resolution the distro provides.

>
> With regards,
> Daniel

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-06 11:11           ` Alex Bennée
@ 2025-08-06 11:34             ` Daniel P. Berrangé
  2025-08-06 20:26               ` Pierrick Bouvier
  2025-08-18 16:52               ` Daniel P. Berrangé
  0 siblings, 2 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2025-08-06 11:34 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Manos Pitsidianakis, qemu-devel, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Markus Armbruster, Gustavo Romero, Pierrick Bouvier

On Wed, Aug 06, 2025 at 12:11:38PM +0100, Alex Bennée wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
> >> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >> >
> >> >
> >> > Was there a specific place where you found things hard to debug
> >> > from the error message alone ?  I'm sure we have plenty of examples
> >> > of errors that can be improved, but wondering if there are some
> >> > general patterns we're doing badly that would be a good win
> >> > to improve ?
> >> 
> >> Some months ago I was debugging a MemoryRegion use-after-free and used
> >> this code to figure out that the free was called from RCU context
> >> instead of the main thread.
> >
> > We give useful names to many (but not necessarily all) threads that we
> > spawn. Perhaps we should call pthread_getname_np() to fetch the current
> > thread name, and use that as a prefix on the error message we print
> > out, as a bit of extra context ?
> 
> Do we always have sensible names for threads or only if we enable the
> option?

I was surprised to discover we don't name threads by default, only if we
add '-name debug-threads=yes'.  I'm struggling to understand why we would
ever want thread naming disabled, if an OS supports it ?

I'm inclined to deprecate 'debug-threads' and always set the names when
available.
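
A minimal sketch of the thread-name prefix idea, assuming glibc on Linux.
The helper name format_with_thread_name() is hypothetical and is not part
of QEMU's error API; it only illustrates how pthread_getname_np() could
supply the extra context.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* pthread_getname_np() is a GNU extension */
#endif
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not QEMU code): write "<thread-name>: <msg>" into
 * buf and return what snprintf() returns.
 */
static int format_with_thread_name(char *buf, size_t len, const char *msg)
{
    /* Linux limits thread names to 16 bytes including the trailing NUL. */
    char name[16] = "";

    /* Returns 0 on success and an errno-style value on failure. */
    if (pthread_getname_np(pthread_self(), name, sizeof(name)) != 0 ||
        name[0] == '\0') {
        strcpy(name, "unnamed");
    }
    return snprintf(buf, len, "%s: %s", name, msg);
}
```

The prefix is only as informative as the names the threads were given,
which is why the default for thread naming matters here.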

> > Obviously not as much info as a full stack trace, but that is something
> > we could likely enable unconditionally without any overheads to worry
> > about, so a likely incremental win.
> 
> The place where it comes in useful is when we get bug reports from users
> who have crashed QEMU in an embedded docker container and can't give us a
> reasonable reproducer. If we can encourage such users to enable this
> option (or maybe make it part of --enable-debug-info) then we could get
> a slightly more useful backtrace for those bugs.

The challenge is whether this build option would be enabled widely
enough to make a significant difference ?

I don't think we could really enable this in any distro builds, as
this is way too noisy to have turned on unconditionally at build
time for all users. Most containers are going to be consuming
distro builds, with relatively few building custom QEMU themselves
IME.  We might have better luck if this was a runtime option to
the -msg arg.
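
A rough sketch of what a runtime switch could look like, assuming glibc's
<execinfo.h>. The flag and both function names are hypothetical; the
thread only proposes such an option, and nothing here reflects an existing
-msg sub-option.

```c
#include <execinfo.h>
#include <stdbool.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Hypothetical runtime switch, e.g. set while parsing the command line. */
static bool backtrace_enabled;

static void set_backtrace_enabled(bool on)
{
    backtrace_enabled = on;
}

/*
 * Write the current call stack to fd when the switch is on.
 * Returns the number of frames written, or 0 when disabled.
 */
static int maybe_print_backtrace(int fd)
{
    void *frames[MAX_FRAMES];
    int n;

    if (!backtrace_enabled) {
        return 0;
    }
    n = backtrace(frames, MAX_FRAMES);
    /* Writes one line per frame straight to fd, without calling malloc(). */
    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

Function names only appear in the output if the binary exports its symbols
(for example when linked with -rdynamic); otherwise the frames show up as
module+offset pairs.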

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-06 11:34             ` Daniel P. Berrangé
@ 2025-08-06 20:26               ` Pierrick Bouvier
  2025-08-07  5:41                 ` Markus Armbruster
  2025-08-18 16:52               ` Daniel P. Berrangé
  1 sibling, 1 reply; 13+ messages in thread
From: Pierrick Bouvier @ 2025-08-06 20:26 UTC (permalink / raw)
  To: Daniel P. Berrangé, Alex Bennée
  Cc: Manos Pitsidianakis, qemu-devel, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Markus Armbruster, Gustavo Romero

On 8/6/25 4:34 AM, Daniel P. Berrangé wrote:
> On Wed, Aug 06, 2025 at 12:11:38PM +0100, Alex Bennée wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>>> On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
>>>> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>>
>>>>>
>>>>> Was there a specific place where you found things hard to debug
>>>>> from the error message alone ?  I'm sure we have plenty of examples
>>>>> of errors that can be improved, but wondering if there are some
>>>>> general patterns we're doing badly that would be a good win
>>>>> to improve ?
>>>>
>>>> Some months ago I was debugging a MemoryRegion use-after-free and used
>>>> this code to figure out that the free was called from RCU context
>>>> instead of the main thread.
>>>

I ran into something similar recently [1], and it was a pain to 
reproduce it. Luckily, I caught it using rr and could debug it, but it 
would have been much easier to just get a backtrace of the crash.

In this case, it was a segmentation fault, which is not covered by the
current patch, which brings me to the thought I share at the end of this email.

[1] 
https://lore.kernel.org/qemu-devel/173c1c78-1432-48a4-8251-65c65568c112@linaro.org/T/#

>>> We give useful names to many (but not necessarily all) threads that we
>>> spawn. Perhaps we should call pthread_getname_np() to fetch the current
>>> thread name, and use that as a prefix on the error message we print
>>> out, as a bit of extra context ?
>>
>> Do we always have sensible names for threads or only if we enable the
>> option?
> 
> I was surprised to discover we don't name threads by default, only if we
> add '-name debug-threads=yes'.  I'm struggling to understand why we would
> ever want thread naming disabled, if an OS supports it ?
> 
> I'm inclined to deprecate 'debug-threads' and always set the names when
> available.
> 
>>> Obviously not as much info as a full stack trace, but that is something
>>> we could likely enable unconditionally without any overheads to worry
>>> about, so a likely incremental win.
>>
>> The place where it comes in useful is when we get bug reports from users
>> who have crashed QEMU in an embedded docker container and can't give us a
>> reasonable reproducer. If we can encourage such users to enable this
>> option (or maybe make it part of --enable-debug-info) then we could get
>> a slightly more useful backtrace for those bugs.
> 
> The challenge is whether this build option would be enabled widely
> enough to make a significant difference ?
>

For developers working on crashes/bug fix, it's definitely a good 
addition (could come with --enable-debug for sure). It's something we 
could enable in CI by default too. Usually, with sanitizers, the 
reported stacktrace is enough to get a rough idea of what the problem 
is, without having to use any debugger.

> I don't think we could really enable this in any distro builds, as
> this is way too noisy to have turned on unconditionally at build
> time for all users. Most containers are going to be consuming
> distro builds, with relatively few building custom QEMU themselves
> IME.  We might have better luck if this was a runtime option to
> the -msg arg.
>

Regarding the outside world and users, I share Daniel's opinion that it 
would be too verbose if a backtrace is emitted with every fatal error 
message.

However, I think it could have *incredible* value if we reported this 
backtrace when QEMU segfaults, which is always something exceptional.
In this case, we could always enable this.
It's not covered by the current patch, maybe it could be a great addition?

Regarding the binary size increase due to -rdynamic, I already know some
people won't like it, so I'm not sure how we can ensure that distributed
binaries carry useful symbols, which is a harder debate than whether to
enable backtraces on segfaults.

> With regards,
> Daniel




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-06 20:26               ` Pierrick Bouvier
@ 2025-08-07  5:41                 ` Markus Armbruster
  2025-08-18 16:49                   ` Daniel P. Berrangé
  0 siblings, 1 reply; 13+ messages in thread
From: Markus Armbruster @ 2025-08-07  5:41 UTC (permalink / raw)
  To: Pierrick Bouvier
  Cc: Daniel P. Berrangé, Alex Bennée, Manos Pitsidianakis,
	qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Gustavo Romero

Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:

> On 8/6/25 4:34 AM, Daniel P. Berrangé wrote:
>> On Wed, Aug 06, 2025 at 12:11:38PM +0100, Alex Bennée wrote:
>>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>>
>>>> On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
>>>>> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>>>
>>>>>>
>>>>>> Was there a specific place where you found things hard to debug
>>>>>> from the error message alone ?  I'm sure we have plenty of examples
>>>>>> of errors that can be improved, but wondering if there are some
>>>>>> general patterns we're doing badly that would be a good win
>>>>>> to improve ?
>>>>>
>>>>> Some months ago I was debugging a MemoryRegion use-after-free and used
>>>>> this code to figure out that the free was called from RCU context
>>>>> instead of the main thread.
>>>>
>
> I ran into something similar recently [1], and it was a pain to reproduce it. Luckily, I caught it using rr and could debug it, but it would have been much easier to just get a backtrace of the crash.
>
> In this case, it was a segmentation fault, which is not covered by the current patch, which brings me to the thought I share at the end of this email.
>
> [1] https://lore.kernel.org/qemu-devel/173c1c78-1432-48a4-8251-65c65568c112@linaro.org/T/#
>
>>>> We give useful names to many (but not necessarily all) threads that we
>>>> spawn. Perhaps we should call pthread_getname_np() to fetch the current
>>>> thread name, and use that as a prefix on the error message we print
>>>> out, as a bit of extra context ?
>>>
>>> Do we always have sensible names for threads or only if we enable the
>>> option?
>>
>> I was surprised to discover we don't name threads by default, only if we
>> add '-name debug-threads=yes'.  I'm struggling to understand why we would
>> ever want thread naming disabled, if an OS supports it ?
>> I'm inclined to deprecate 'debug-threads' and always set the names when
>> available.

On POSIX, thread naming uses pthread_setname_np(), which is a GNU
extension.  Can't see drawbacks; just use it when available.

On Windows, thread naming appears to use a dynamically loaded
SetThreadDescription().  Any drawbacks?  I'm a Windows ignoramus...
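
For the POSIX side, a minimal sketch of naming the calling thread,
assuming glibc on Linux. The helper name_current_thread() is hypothetical;
a real build would gate the call on a configure-time check for
pthread_setname_np().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* pthread_setname_np() is a GNU extension */
#endif
#include <pthread.h>
#include <string.h>

/*
 * Hypothetical helper (not QEMU code): name the calling thread.
 * Returns 0 on success, an errno-style value on failure.
 */
static int name_current_thread(const char *name)
{
    /*
     * Linux limits thread names to 16 bytes including the trailing NUL;
     * longer names are rejected with ERANGE, so truncate up front.
     */
    char buf[16];

    strncpy(buf, name, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return pthread_setname_np(pthread_self(), buf);
}
```

The name then shows up in tools such as gdb and top, and can be read back
with pthread_getname_np().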

>>>> Obviously not as much info as a full stack trace, but that is something
>>>> we could likely enable unconditionally without any overheads to worry
>>>> about, so a likely incremental win.
>>>
>>> The place where it comes in useful is when we get bug reports from users
>>> who have crashed QEMU in an embedded docker container and can't give us a
>>> reasonable reproducer. If we can encourage such users to enable this
>>> option (or maybe make it part of --enable-debug-info) then we could get
>>> a slightly more useful backtrace for those bugs.
>> The challenge is whether this build option would be enabled widely
>> enough to make a significant difference ?
>>
>
> For developers working on crashes/bug fix, it's definitely a good addition (could come with --enable-debug for sure). It's something we could enable in CI by default too. Usually, with sanitizers, the reported stacktrace is enough to get a rough idea of what the problem is, without having to use any debugger.
>
>> I don't think we could really enable this in any distro builds, as
>> this is way too noisy to have turned on unconditionally at build
>> time for all users. Most containers are going to be consuming
>> distro builds, with relatively few building custom QEMU themselves
>> IME.  We might have better luck if this was a runtime option to
>> the -msg arg.
>>
>
> Regarding the outside world and users, I share Daniel's opinion that it would be too verbose if a backtrace is emitted with every fatal error message.

Yes, that's out of the question.  We can debate backtrace on internal
errors, such as hitting &error_abort, or more generally abort().  Need
to demonstrate it adds value over simply dumping core, which we get for
free.

> However, I think it could have *incredible* value if we reported this backtrace when QEMU segfaults, which is always something exceptional.

This would be a best effort.  The program is already out of order, and
printing may or may not work.  Avoiding printf() and memory allocation
would improve the odds.

> In this case, we could always enable this.
> It's not covered by the current patch, maybe it could be a great addition?
>
> Regarding the binary size increase due to -rdynamic, I already know some people won't like it, so I'm not sure how we can ensure that distributed binaries carry useful symbols, which is a harder debate than whether to enable backtraces on segfaults.

1. Core dumps may take disk space!  Let's disable them.

2. My programs crash!  I need to know why.

3. I know!  Let's make all the programs bigger!

SCNR ;)




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-07  5:41                 ` Markus Armbruster
@ 2025-08-18 16:49                   ` Daniel P. Berrangé
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2025-08-18 16:49 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Pierrick Bouvier, Alex Bennée, Manos Pitsidianakis,
	qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé, Gustavo Romero

On Thu, Aug 07, 2025 at 07:41:24AM +0200, Markus Armbruster wrote:
> Pierrick Bouvier <pierrick.bouvier@linaro.org> writes:
> 
> > On 8/6/25 4:34 AM, Daniel P. Berrangé wrote:
> >> On Wed, Aug 06, 2025 at 12:11:38PM +0100, Alex Bennée wrote:
> >>> Daniel P. Berrangé <berrange@redhat.com> writes:
> >>>
> >>>> On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
> >>>>> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Was there a specific place where you found things hard to debug
> >>>>>> from the error message alone ?  I'm sure we have plenty of examples
> >>>>>> of errors that can be improved, but wondering if there are some
> >>>>>> general patterns we're doing badly that would be a good win
> >>>>>> to improve ?
> >>>>>
> >>>>> Some months ago I was debugging a MemoryRegion use-after-free and used
> >>>>> this code to figure out that the free was called from RCU context
> >>>>> instead of the main thread.
> >>>>
> >
> > I ran into something similar recently [1], and it was a pain to reproduce it. Luckily, I caught it using rr and could debug it, but it would have been much easier to just get a backtrace of the crash.
> >
> > In this case, it was a segmentation fault, which is not covered by the current patch, which brings me to the thought I share at the end of this email.
> >
> > [1] https://lore.kernel.org/qemu-devel/173c1c78-1432-48a4-8251-65c65568c112@linaro.org/T/#
> >
> >>>> We give useful names to many (but not necessarily all) threads that we
> >>>> spawn. Perhaps we should call pthread_getname_np() to fetch the current
> >>>> thread name, and use that as a prefix on the error message we print
> >>>> out, as a bit of extra context ?
> >>>
> >>> Do we always have sensible names for threads or only if we enable the
> >>> option?
> >>
> >> I was surprised to discover we don't name threads by default, only if we
> >> add '-name debug-threads=yes'.  I'm struggling to understand why we would
> >> ever want thread naming disabled, if an OS supports it ?
> >> I'm inclined to deprecate 'debug-threads' and always set the names when
> >> available.
> 
> On POSIX, thread naming uses pthread_setname_np(), which is a GNU
> extension.  Can't see drawbacks; just use it when available.
> 
> On Windows, thread naming appears to use a dynamically loaded
> SetThreadDescription().  Any drawbacks?  I'm a Windows ignoramus...
> 
> >>>> Obviously not as much info as a full stack trace, but that is something
> >>>> we could likely enable unconditionally without any overheads to worry
> >>>> about, so a likely incremental win.
> >>>
> >>> The place where it comes in useful is when we get bug reports from users
> >>> who have crashed QEMU in an embedded docker container and can't give us a
> >>> reasonable reproducer. If we can encourage such users to enable this
> >>> option (or maybe make it part of --enable-debug-info) then we could get
> >>> a slightly more useful backtrace for those bugs.
> >> The challenge is whether this build option would be enabled widely
> >> enough to make a significant difference ?
> >>
> >
> > For developers working on crashes/bug fix, it's definitely a good addition (could come with --enable-debug for sure). It's something we could enable in CI by default too. Usually, with sanitizers, the reported stacktrace is enough to get a rough idea of what the problem is, without having to use any debugger.
> >
> >> I don't think we could really enable this in any distro builds, as
> >> this is way too noisy to have turned on unconditionally at build
> >> time for all users. Most containers are going to be consuming
> >> distro builds, with relatively few building custom QEMU themselves
> >> IME.  We might have better luck if this was a runtime option to
> >> the -msg arg.
> >>
> >
> > Regarding the outside world and users, I share Daniel's opinion that it would be too verbose if a backtrace is emitted with every fatal error message.
> 
> Yes, that's out of the question.  We can debate backtrace on internal
> errors, such as hitting &error_abort, or more generally abort().  Need
> to demonstrate it adds value over simply dumping core, which we get for
> free.
> 
> > However, I think it could have *incredible* value if we reported this backtrace when QEMU segfaults, which is always something exceptional.
> 
> This would be a best effort.  The program is already out of order, and
> printing may or may not work.  Avoiding printf() and memory allocation
> would improve the odds.

The risk of doing this on segvs in particular is that the act of
generating the stack trace corrupts further state, which then
makes debugging the original problem harder. 

> > In this case, we could always enable this.
> > It's not covered by the current patch, maybe it could be a great addition?
> >
> > Regarding the binary size increase due to -rdynamic, I already know some people won't like it, so I'm not sure how we can ensure that distributed binaries carry useful symbols, which is a harder debate than whether to enable backtraces on segfaults.
> 
> 1. Core dumps may take disk space!  Let's disable them.
> 
> 2. My programs crash!  I need to know why.
> 
> 3. I know!  Let's make all the programs bigger!
> 
> SCNR ;)

FWIW, in systemd-enabled distros, core dump stack traces can be captured
for everything without needing to store the full dumps. This works even
for processes running inside containers. Fedora at least has this
enabled out of the box; I'm not sure about the defaults of other
systemd-based distros.

Also, if you are capturing full core dumps, QEMU can (and generally should)
be told to ask the kernel to omit guest RAM from dumps with

 -machine dump-guest-core=off

which maps to madvise(MADV_DONTDUMP).

So in many cases, disk space concerns shouldn't be a reason for losing
important debugging information from QEMU crashes.

For example:

$ qemu-system-x86_64 &
$ killall -SEGV qemu-system-x86_64
$ coredumpctl  | tail -1
Mon 2025-08-18 17:35:45 BST  174809 1000 1000 SIGSEGV present  /usr/bin/qemu-system-x86_64 4.1M
$ coredumpctl info 174809
           PID: 174809 (qemu-system-x86)
           UID: 1000 (berrange)
           GID: 1000 (berrange)
        Signal: 11 (SEGV)
     Timestamp: Mon 2025-08-18 17:35:44 BST (2min 5s ago)
  Command Line: qemu-system-x86_64
    Executable: /usr/bin/qemu-system-x86_64
 Control Group: /user.slice/user-1000.slice/user@1000.service/user.slice/libpod-f697dc69d8bf78044d0a3964d6e9b7cc4644a66f07045796bfe5320780ffb0f0.scope/co>
          Unit: user@1000.service
     User Unit: libpod-f697dc69d8bf78044d0a3964d6e9b7cc4644a66f07045796bfe5320780ffb0f0.scope
         Slice: user-1000.slice
     Owner UID: 1000 (berrange)
       Boot ID: dab4d69ed89d444ba265e1b3a2ba3ccd
    Machine ID: 6fd7abbbb7b447e3968dd84ca07ab101
      Hostname: toolbx
       Storage: /var/lib/systemd/coredump/core.qemu-system-x86.1000.dab4d69ed89d444ba265e1b3a2ba3ccd.174809.1755534944000000.zst (present)
  Size on Disk: 4.1M
       Message: Process 174809 (qemu-system-x86) of user 1000 dumped core.
                
                Module /usr/lib64/libfdt.so.1.7.2 without build-id.
                Module /usr/lib64/libfdt.so.1.7.2
                Module /usr/lib64/libcapstone.so.5 without build-id.
                Module /usr/lib64/libcapstone.so.5
                Module libltdl.so.7 from rpm libtool-2.5.4-4.fc42.x86_64
		....snip...
                Module libpixman-1.so.0 from rpm pixman-0.46.2-1.fc42.x86_64
                Module libgnutls.so.30 from rpm gnutls-3.8.10-1.fc42.x86_64
                Stack trace of thread 174809:
                #0  0x00007f6900045642 n/a (/usr/lib64/libc.so.6 + 0x79642)
                #1  0x00007f69000399a4 n/a (/usr/lib64/libc.so.6 + 0x6d9a4)
                #2  0x00007f69000b3136 n/a (/usr/lib64/libc.so.6 + 0xe7136)
                #3  0x000055f0d2b85e15 n/a (/usr/bin/qemu-system-x86_64 + 0x5eee15)
                #4  0x000055f0d2b836e5 n/a (/usr/bin/qemu-system-x86_64 + 0x5ec6e5)
                #5  0x000055f0d27d156e n/a (/usr/bin/qemu-system-x86_64 + 0x23a56e)
                #6  0x000055f0d2ae6301 n/a (/usr/bin/qemu-system-x86_64 + 0x54f301)
                #7  0x00007f68fffcf575 n/a (/usr/lib64/libc.so.6 + 0x3575)
                #8  0x00007f68fffcf628 n/a (/usr/lib64/libc.so.6 + 0x3628)
                #9  0x000055f0d25d4395 n/a (/usr/bin/qemu-system-x86_64 + 0x3d395)
                ELF object binary architecture: AMD x86-64


NB, it can't resolve symbols in this example, as my dev env is inside
a toolbox container. If running outside a container, the stack trace
would have shown all symbols too.

With regards,
Daniel




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-06 11:34             ` Daniel P. Berrangé
  2025-08-06 20:26               ` Pierrick Bouvier
@ 2025-08-18 16:52               ` Daniel P. Berrangé
  1 sibling, 0 replies; 13+ messages in thread
From: Daniel P. Berrangé @ 2025-08-18 16:52 UTC (permalink / raw)
  To: Alex Bennée, Manos Pitsidianakis, qemu-devel, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Markus Armbruster, Gustavo Romero, Pierrick Bouvier

On Wed, Aug 06, 2025 at 12:34:23PM +0100, Daniel P. Berrangé wrote:
> On Wed, Aug 06, 2025 at 12:11:38PM +0100, Alex Bennée wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> > 
> > > On Tue, Aug 05, 2025 at 07:57:38PM +0300, Manos Pitsidianakis wrote:
> > >> On Tue, Aug 5, 2025 at 7:49 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >> >
> > >> >
> > >> > Was there a specific place where you found things hard to debug
> > >> > from the error message alone ?  I'm sure we have plenty of examples
> > >> > of errors that can be improved, but wondering if there are some
> > >> > general patterns we're doing badly that would be a good win
> > >> > to improve ?
> > >> 
> > >> Some months ago I was debugging a MemoryRegion use-after-free and used
> > >> this code to figure out that the free was called from RCU context
> > >> instead of the main thread.
> > >
> > > We give useful names to many (but not necessarily all) threads that we
> > > spawn. Perhaps we should call pthread_getname_np() to fetch the current
> > > thread name, and use that as a prefix on the error message we print
> > > out, as a bit of extra context ?
> > 
> > Do we always have sensible names for threads or only if we enable the
> > option?
> 
> I was surprised to discover we don't name threads by default, only if we
> add '-name debug-threads=yes'.  I'm struggling to understand why we would
> ever want thread naming disabled, if an OS supports it ?
> 
> I'm inclined to deprecate 'debug-threads' and always set the names when
> available.

FYI, I'm working on a small series that will enable thread names and
IDs to be printed by default with errors, and should post it sometime
this week.

With regards,
Daniel




* Re: [PATCH RFC] util/error.c: Print backtrace on error
  2025-08-05 16:22   ` Manos Pitsidianakis
  2025-08-05 16:48     ` Daniel P. Berrangé
@ 2025-08-07  5:23     ` Markus Armbruster
  1 sibling, 0 replies; 13+ messages in thread
From: Markus Armbruster @ 2025-08-07  5:23 UTC (permalink / raw)
  To: Manos Pitsidianakis
  Cc: Daniel P. Berrangé, qemu-devel, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Alex Bennée, Gustavo Romero, Pierrick Bouvier

Manos Pitsidianakis <manos.pitsidianakis@linaro.org> writes:

> On Tue, Aug 5, 2025 at 7:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Tue, Aug 05, 2025 at 12:19:26PM +0300, Manos Pitsidianakis wrote:
>> > Add a backtrace_on_error meson feature (enabled with
>> > --enable-backtrace-on-error) that compiles system binaries with
>> > -rdynamic option and prints a function backtrace on error to stderr.
>> >
>> > Example output by adding an unconditional error_setg on error_abort in hw/arm/boot.c:
>> >
>> >   ./qemu-system-aarch64(+0x13b4a2c) [0x55d015406a2c]
>> >   ./qemu-system-aarch64(+0x13b4abd) [0x55d015406abd]
>> >   ./qemu-system-aarch64(+0x13b4d49) [0x55d015406d49]
>> >   ./qemu-system-aarch64(error_setg_internal+0xe7) [0x55d015406f62]
>> >   ./qemu-system-aarch64(arm_load_dtb+0xbf) [0x55d014d7686f]
>> >   ./qemu-system-aarch64(+0xd2f1d8) [0x55d014d811d8]
>> >   ./qemu-system-aarch64(notifier_list_notify+0x44) [0x55d01540a282]
>> >   ./qemu-system-aarch64(qdev_machine_creation_done+0xa0) [0x55d01476ae17]
>> >   ./qemu-system-aarch64(+0xaa691e) [0x55d014af891e]
>> >   ./qemu-system-aarch64(qmp_x_exit_preconfig+0x72) [0x55d014af8a5d]
>> >   ./qemu-system-aarch64(qemu_init+0x2a89) [0x55d014afb657]
>> >   ./qemu-system-aarch64(main+0x2f) [0x55d01521e836]
>> >   /lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7f3033d67ca8]
>> >   /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f3033d67d65]
>> >   ./qemu-system-aarch64(_start+0x21) [0x55d0146814f1]
>> >
>> >   Unexpected error in arm_load_dtb() at ../hw/arm/boot.c:529:
>>
>> From an end-user POV, IMHO the error messages need to be good enough
>> that such backtraces aren't needed to understand the problem. For
>> developers, GDB can give much better backtraces (file+line numbers,
>> plus parameters plus local variables) in the ideally rare cases that
>> the error message alone has insufficient info. So I'm not really
>> convinced that programs (in general, not just QEMU) should try to
>> create backtraces themselves.
>
> Hi Daniel,
>
> I don't think there's value in replacing gdb debugging with this, I
> agree. I think it has value for "fire and forget" uses, when errors
> happen unexpectedly and are hard to replicate and you only end up with
> log entries and no easy way to debug it.

Enable core dumps.  I doubt that's harder than recompiling and
redeploying QEMU with backtraces enabled.
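
Core dumps are normally enabled from the shell or the service manager,
but for illustration the same limit can be raised from inside a process.
A minimal sketch, assuming a POSIX system; enable_core_dumps() is a
hypothetical helper and not something QEMU provides.

```c
#include <sys/resource.h>

/*
 * Hypothetical helper: raise the soft core-size limit to the hard limit
 * for this process and its children. Returns 0 on success, -1 on failure.
 */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        return -1;
    }
    /* The soft limit may be raised up to the hard limit without privilege. */
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Where the dump ends up still depends on system configuration, such as the
kernel's core_pattern setting, and a hard limit of zero leaves dumps
disabled.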




end of thread, other threads:[~2025-08-18 16:52 UTC | newest]

Thread overview: 13+ messages
2025-08-05  9:19 [PATCH RFC] util/error.c: Print backtrace on error Manos Pitsidianakis
2025-08-05 15:59 ` Daniel P. Berrangé
2025-08-05 16:22   ` Manos Pitsidianakis
2025-08-05 16:48     ` Daniel P. Berrangé
2025-08-05 16:57       ` Manos Pitsidianakis
2025-08-05 18:00         ` Daniel P. Berrangé
2025-08-06 11:11           ` Alex Bennée
2025-08-06 11:34             ` Daniel P. Berrangé
2025-08-06 20:26               ` Pierrick Bouvier
2025-08-07  5:41                 ` Markus Armbruster
2025-08-18 16:49                   ` Daniel P. Berrangé
2025-08-18 16:52               ` Daniel P. Berrangé
2025-08-07  5:23     ` Markus Armbruster
