All of lore.kernel.org
 help / color / mirror / Atom feed
From: Markus Armbruster <armbru@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Steven Sistare <steven.sistare@oracle.com>,
	 qemu-devel@nongnu.org, Michael Roth <michael.roth@amd.com>,
	 "Dr. David Alan Gilbert" <dave@treblig.org>
Subject: Re: [PATCH] monitor: flush messages on abort
Date: Wed, 15 Nov 2023 09:41:39 +0100	[thread overview]
Message-ID: <875y23s918.fsf@pond.sub.org> (raw)
In-Reply-To: <ZUi7izJoVpU+iiuC@redhat.com> ("Daniel P. Berrangé"'s message of "Mon, 6 Nov 2023 10:10:19 +0000")

Daniel P. Berrangé <berrange@redhat.com> writes:

> On Fri, Nov 03, 2023 at 03:51:00PM -0400, Steven Sistare wrote:
>> On 11/3/2023 1:33 PM, Daniel P. Berrangé wrote:
>> > On Fri, Nov 03, 2023 at 09:01:29AM -0700, Steve Sistare wrote:
>> >> Buffered monitor output is lost when abort() is called.  The pattern
>> >> error_report() followed by abort() occurs about 60 times, so valuable
>> >> information is being lost when the abort is called in the context of a
>> >> monitor command.
>> > 
>> > I'm curious, was there a particular abort() scenario that you hit ?
>> 
>> Yes, while tweaking the suspended state, and forgetting to add transitions:
>> 
>>         error_report("invalid runstate transition: '%s' -> '%s'",
>>         abort();
>> 
>> But I have previously hit this for other errors.

Can you provide a reproducer?

>> > For some crude statistics:
>> > 
>> >   $ for i in abort return exit goto ; do echo -n "$i: " ; git grep --after 1 error_report | grep $i | wc -l ; done
>> >   abort: 47
>> >   return: 512
>> >   exit: 458
>> >   goto: 177
>> > 
>> > to me those numbers say that calling "abort()" after error_report
>> > should be considered a bug, and we can blanket replace all the
>> > abort() calls with exit(EXIT_FAILURE), and thus avoid the need to
>> > special case flushing the monitor.
>> 
>> And presumably add an atexit handler to flush the monitor ala monitor_abort.
>> AFAICT currently no destructor is called for the monitor at exit time.
>
> The HMP monitor flushes at each newline,  and exit() will take care of
> flushing stdout, so I don't think there's anything else needed.

Correct.

>> > Also I think there's a decent case to be made for error_report()
>> > to call monitor_flush().

No, because the messages printed by error_report() all end in newline,
and printing a newline to a monitor triggers monitor_flush_locked().

>> A good start, but that would not help for monitors with skip_flush=true, which 
>> need to format the buffered string in a json response, which is the case I 
>> tripped over.
>
> 'skip_flush' is only set to 'true' when using a QMP monitor and invoking
> "hmp-monitor-command".

Correct.

> In such a case, the error message needs to be built into a JSON error
> reply and sent over the socket. Your patch doesn't help this case
> since you've just printed to stderr.  I don't think it is reasonable
> to expect QMP monitors to send replies on SIG_ABRT anyway. So I don't
> think the skip_flush=true scenario is a problem to be concerned with.
>
>> >> To fix, install a SIGABRT handler to flush the monitor buffer to stderr.
>> >>
>> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> >> ---
>> >>  monitor/monitor.c | 38 ++++++++++++++++++++++++++++++++++++++
>> >>  1 file changed, 38 insertions(+)
>> >>
>> >> diff --git a/monitor/monitor.c b/monitor/monitor.c
>> >> index dc352f9..65dace0 100644
>> >> --- a/monitor/monitor.c
>> >> +++ b/monitor/monitor.c
>> >> @@ -701,6 +701,43 @@ void monitor_cleanup(void)
>> >>      }
>> >>  }
>> >>  
>> >> +#ifdef CONFIG_LINUX
>> >> +
>> >> +static void monitor_abort(int signal, siginfo_t *info, void *c)
>> >> +{
>> >> +    Monitor *mon = monitor_cur();
>> >> +
>> >> +    if (!mon || qemu_mutex_trylock(&mon->mon_lock)) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    if (mon->outbuf && mon->outbuf->len) {
>> >> +        fputs("SIGABRT received: ", stderr);
>> >> +        fputs(mon->outbuf->str, stderr);
>> >> +        if (mon->outbuf->str[mon->outbuf->len - 1] != '\n') {
>> >> +            fputc('\n', stderr);
>> >> +        }
>> >> +    }
>> >> +
>> >> +    qemu_mutex_unlock(&mon->mon_lock);
>> > 
>> > The SIGABRT handling does not only fire in response to abort()
>> > calls, but also in response to bad memory scenarios, so we have
>> > to be careful what we do in signal handlers.
>> > 
>> > In particular using mutexes in signal handlers is a big red
>> > flag generally. Mutex APIs are not declare async signal
>> > safe, so this code is technically a POSIX compliance
>> > violation.

"Technically a POSIX compliance violation" sounds like something only
pedants would care about.  It's actually a recipe for deadlocks and
crashes.

>> Righto.  I would need to mask all signals in the sigaction to be on the safe(r) side.
>
> This is still doomed, because SIGABRT could fire while 'mon_lock' is
> already held, and so this code would deadlock trying to acquire the
> lock.

Yup.

There is no way to make async signal unsafe code safe.

>> > So I think we'd be safer just eliminating the explicit abort()
>> > calls and adding monitor_flush call to error_report.
>> 
>> I like adding a handler because it is future proof.  No need to play whack-a-mole when
>> developers re-introduce abort() calls in the future.  A minor benefit is I would not
>> need ack's from 50 maintainers to change 50 call sites from abort to exit.
>
> That's a bit of a crazy exaggeration. THe aborts() don't cover 50 different
> subsystems, and we don't require explicit acks from every subsystem maintainer
> for trivial cleanups like this.
>
>> A slight risk of the exit solution is that something bad happened at the call site, so 
>> qemu state can no longer be trusted.  Calling abort immediately may be safer than calling 
>> exit which will call the existing atexit handlers and could have side effects.
>
> If that was a real problem, then we already face it because we have
> ~500 places already calling exit() and only 50 calling abort().
>
>> A third option is to define qemu_abort() which flushes the monitor, and replaces all abort
>> calls.  That avoids async-signal-mutex hand wringing, but is still subject to whack-a-mole.
>> 
>> So: atexit, signal handler, or qemu_abort?  I will go with your preference.
>
> Just replace abort -> exit.

I disagree.

Use exit(1) for fatal errors.

Use abort() for programming errors (a.k.a. bugs).

> I'm not seeing a need for an atexit handler on top.

I'm not yet seeing a need for anything.  A reproducer may change that.



  reply	other threads:[~2023-11-15  8:42 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-03 16:01 [PATCH] monitor: flush messages on abort Steve Sistare
2023-11-03 17:33 ` Daniel P. Berrangé
2023-11-03 19:51   ` Steven Sistare
2023-11-03 20:28     ` Steven Sistare
2023-11-06 10:10     ` Daniel P. Berrangé
2023-11-15  8:41       ` Markus Armbruster [this message]
2023-11-15 15:52         ` Steven Sistare
2023-11-15 16:15           ` Markus Armbruster
2023-11-15 17:41             ` Steven Sistare
2023-11-16 15:14               ` Markus Armbruster
2023-11-16 17:03                 ` Steven Sistare
2023-11-15 15:30       ` Steven Sistare
2023-11-15 16:05         ` Markus Armbruster
2023-11-16 11:21         ` Daniel P. Berrangé

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875y23s918.fsf@pond.sub.org \
    --to=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=dave@treblig.org \
    --cc=michael.roth@amd.com \
    --cc=qemu-devel@nongnu.org \
    --cc=steven.sistare@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.