From: Markus Armbruster <armbru@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org
Subject: Re: [PATCH 2/5] json-parser: replace with a push parser
Date: Tue, 17 Feb 2026 08:39:16 +0100 [thread overview]
Message-ID: <875x7vrf8b.fsf@pond.sub.org> (raw)
In-Reply-To: <682de27d-4ebf-4c08-ad0c-45626a5a9dea@redhat.com> (Paolo Bonzini's message of "Mon, 16 Feb 2026 17:41:18 +0100")
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 2/12/26 14:12, Markus Armbruster wrote:
>>>> We know the stack is at most 1024 deep (MAX_NESTING). Have you
>>>> considered using an array instead of GQueue?
>>>
>>> It'd be only 8K, but I preferred to keep MAX_NESTING hidden within
>>> json-streamer.c. Not having bounds checking makes me a bit nervous
>>> about having fixed-size arrays.
>>
>> Looks mostly harmless to me: there's exactly one spot where we push
>> something onto the stack, one where we pop, and one where we pop
>> everything. We only ever look at the top of the stack.
>
> Yes, code-wise it's not hard. But it's ugly/brittle to have two pieces
> of code that have to sync on MAX_NESTING.
>
>>>> The parser's interesting part follows. The parser is a pushdown
>>>> automaton. The pushdown automaton is coded in C. On the one hand,
>>>> d'oh! Of course it is. On the other hand, it's hard to see the actual
>>>> automaton in C. May I have a comment explaining state and state
>>>> transitions? Perhaps we better start with an informal description,
>>>> discuss it, then distill the discussion into a comment.
>>>
>>> That was the purpose of the mysterious comment above. If a mix between
>>> BNF and automaton is okay for you, it could be something like
>>>
>>> input := value -> END_OF_VALUE
>>> END_OF_INPUT -> check stack is empty, return parsed value
>>>
>>> // entered on BEFORE_VALUE; after any of these rules are processed, the
>>> // parser has completed a QObject and is in the END_OF_VALUE state.
>>
>> What do you mean by "after any of these rules are processed, ..."?
>
> That "literal", "'[' after_lsquare" and "'{' after_lcurly" both end up
> with the state machine in the END_OF_VALUE state.
So this is a "how to read the annotated BNF" comment?
>>> //
>>> // When the parser reaches the END_OF_VALUE state, it examines the
>>> // top of the stack to see if it's coming from "input" (stack empty),
>>> // "array_items" (TOS is a QList) or "dict_pairs" (TOS is a QString; the
>>> // item below will be a QDict). It then proceeds with the corresponding
>>> // actions, which will be one of:
>>> // - return parsed value
>>> // - add value to QList
>>> // - pop QString with the key, add key/value to the QDict
>>
>> Forward references to array_items and dict_pairs. Suggest to add a
>> suitable "below", or indicate the forward reference some other way.
>
> I can put instead after_lsquare and after_lcurly.
BNF usually reads easiest when written top down, i.e. start symbol
production first, other non-terminals' productions after their use.
I don't mind the forward references in the comment here, it just took me
a second to recognize them. No real problem, but perhaps adding "above"
/ "below" could help the reader.
>>> value := literal -> END_OF_VALUE
>>> | '[' -> push empty QList -> AFTER_LSQUARE
>>> after_lsquare -> END_OF_VALUE
>>> | '{' -> push empty QDict -> AFTER_LCURLY
>>> after_lcurly -> END_OF_VALUE
>>>
>>> // non-recursive values, entered on BEFORE_VALUE
>>> literal := INTEGER -> END_OF_VALUE
>>> | FLOAT -> END_OF_VALUE
>>> | KEYWORD -> END_OF_VALUE
>>> | STRING -> END_OF_VALUE
>>> | INTERP -> END_OF_VALUE
>>>
>>> // entered on AFTER_LSQUARE
>>> after_lsquare := ']' -> pop completed QList -> END_OF_VALUE
>>> | ϵ -> BEFORE_VALUE
>>> array_items -> END_OF_VALUE
>>>
>>> // entered on BEFORE_VALUE, with TOS being a QList
>>> array_items := value -> add value to QList -> END_OF_VALUE
>>> (']' -> pop completed QList -> END_OF_VALUE
>>> | ',' -> BEFORE_VALUE
>>> array_items) -> END_OF_VALUE
>>>
>>> // entered on AFTER_LCURLY
>>> after_lcurly := '}' -> pop completed QDict -> END_OF_VALUE
>>> | ϵ -> BEFORE_KEY
>>> dict_pairs -> END_OF_VALUE
>>>
>>> // entered on BEFORE_KEY, with TOS being a QDict
>>> dict_pairs := STRING -> push QString -> END_OF_KEY
>>> ':' -> BEFORE_VALUE
>>> value -> pop QString + add pair to QDict -> END_OF_VALUE
>>> ('}' -> pop completed QDict -> END_OF_VALUE
>>> | ',' -> BEFORE_KEY
>>> dict_pairs) -> END_OF_VALUE
>>
>> The BNF part is everything but the -> ... Easy enough. However, I'm
>> not quite certain how to interpret the --> ...
>>
>> The "// entered on STATE ..." I can read as assertion.
>
> Yes, that's a comment (with C++ syntax to have something familiar).
>
>> The text after "->" seems to be one ore more state / stack transitions.
>> Let me describe them in my own words to make sure I understand you.
>> Note that I had to also look at the source code to produce my
>> description.
>>
>> -> STATE
>>
>> Set top stack entry's state to STATE.
>>
>> -> push OBJ -> STATE
>>
>> Push (STATE, OBJ) onto the stack.
>
> Intended more like push (OBJ, undefined), set top stack entry state to
> STATE, but yes.
>
>> -> pop completed OBJ -> END_OF_VALUE
>>
>> Pop (_, OBJ) from the stack, set new top stack entry's state to
>> END_OF_VALUE. OBJ is the result of parsing non-terminal @value.
>>
>> -> add value to QList -> END_OF_VALUE
>>
>> Append the result of parsing non-terminal @value to the top stack
>> entry's QList. Set the top stack entry's state to END_OF_VALUE.
>>
>> -> pop QString + add pair to QDict -> END_OF_VALUE
>>
>> Pop (_, QString) from the stack, it's the key. Its value is the
>> result of parsing non-terminal value. Store this key: value pair in
>> the top stack entry's QDict. Set the top stack entry's state to
>> END_OF_VALUE.
>
> Correct. "-> STATE" generally means "go to state STATE", where teh state
> is stored in the top stack entry.
So I was able to work out how to read the annotations. How can we make
that easier for the next guy? Could well be me in six months...
>> Stack entry members @partial from bottom to top:
>>
>> * a QList element for an unclosed '[',
>>
>> * a QDict and a QString element for an unclosed '{', except the
>> innermost one need not have a QString. This is because the QString is
>> on the stack only while we're parsing ':' value.
>>
>> So, stack contents invariant:
>>
>> ( QList | QDict QString )* QDict?
>
> Makes sense.
Let's write it down where we describe the structure of the stack.
>> I found the QString business somewhat confusing, probably because I
>> expected one stack entry per nesting level, and tossing that invalid
>> assumption took mental effort.
>
> ... while I wrote the parser as one per pending nonterminal. Coming
> from the recursive descent probably suggested that approach, somewhat
> subconsciously.
>
>> I figure we could have one stack entry per: add a const char *key member
>> to JSONParserStackEntry, non-null exactly when we have a QString sitting
>> on a QDict now. We'd trade representable invalid state "QString
>> directly above non-QDict in the stack" for "QList stack entry has member
>> @key set". Would be slightly more local. Matter of taste, I guess.
>
> I can check what the code looks like, sure.
>
>> More representable invalid state, as far as I can tell (observation, not
>> objection):
>>
>> * a QList stack entry can only have state AFTER_LSQUARE, BEFORE_VALUE,
>> END_OF_VALUE, never AFTER_LCURLY, BEFORE_KEY, END_OF_KEY
>>
>> * a QDict stack entry can only have state AFTER_LCURLY, BEFORE_KEY,
>> END_OF_VALUE, never AFTER_LSQUARE, BEFORE_VALUE, END_OF_KEY
>>
>> * a QString stack entry can only have state BEFORE_VALUE and END_OF_KEY,
>> never AFTER_LCURLY, AFTER_LSQUARE, BEFORE_VALUE, END_OF_VALUE.
>>
>> Stack entries by @state instead of by @partial:
>>
>> * AFTER_LCURLY: @partial is QDict
>>
>> * AFTER_LSQUARE: @partial is QList
>>
>> * BEFORE_KEY: @partial is QDict
>>
>> * BEFORE_VALUE: @partial is QList or QString
>
> ... or empty:
>
> state = entry ? entry->state : BEFORE_VALUE;
Yes, forgot that one.
>> * END_OF_KEY: @partial is QString
>>
>> * END_OF_VALUE: @partial is QList or QString
>>
>> Invariants worth documenting, I think.
>
> Ok---as assertions in the code?
Making invalid states unrepresentable is best, but that's often not
practical in C.
I generally prefer assertions to comments as long as the assertions are
easy enough to read.
>> Here's my attempt at explaining the automaton and how states and stack
>> fit together: annotated JSON syntax diagrams. The diagrams are from
>> www.json.org with whitespace nonterminal omitted for brevity.
>>
>> Semantic actions are connected with dashed lines. State names are
>> abbreviated to ALC, ALS, BK, BV, EOK, EOV. (STATE, QTYPE) describes the
>> top stack entry. "assert (STATE, QTYPE)" asserts the top stack entry's
>> @state is STATE, and its @partial is of QTYPE. "set STATE" means the
>> top stack entry's state is set to STATE.
>>
>> set EOV
>> result is QObject
>> value ┆
>> >────────╮─── string ──╭───>
>> ┆ │ │
>> assert (BV, _) ╰─── number ──╯
>> │ │
>> ╰─── object ──╯
>> │ │
>> ╰─── array ───╯
>> │ │
>> ╰─── true ────╯
>> │ │
>> ╰─── false ───╯
>> │ │
>> ╰─── null ────╯
>>
>>
>> push (ALS, QList) pop (ALS, QList)
>> ┆ assert (BV, _)
>> ┆ set EOV
>> ┆ result is popped QList
>> array ┆ ┆
>> >─────── [ ─╮────────────╭─ ] ──>
>> ┆ │ │
>> assert (BV, _) │ ┄ set BV │
>> │ │
>> set BV │ │
>> ┆ │ │
>> ╭─ , ────╰── value ──╮╯
>> │ │ ┄ assert (EOV, QList)
>> ╰────────────────────╯ put value into QList
>>
>>
>>
>> push (ALC, QDict) pop (ALC, QDict)
>> ┆ assert (BV, _)
>> ┆ set EOV
>> ┆ result is popped QDict
>> object ┆ ┆
>> >─────── { ─╮───────────────────────────╭─ } ──>
>> ┆ │ │
>> assert (BV, _) │ ┄ set BK │
>> │ │
>> set BK │ push (EOK, QString) │
>> ┆ │ ┆ │
>> ╭─ , ────╰─ string ─── : ─── value ─╮╯
>> │ ┆ │ pop (BV, QString)
>> │ set BV │ ┄ assert (EOV, QDict)
>> │ │ put string: value
>> ╰───────────────────────────────────╯ into QDict
>>
>> Might be your turn to be confused ;)
>
> I'm not confused but maybe it is a bit too detailed.
We can elide detail. For instance, if we document the stack structure
elsewhere, assertions on stack contents are low value here.
>>>>> +static QObject *json_parser_parse_token(JSONParserContext *ctxt, const JSONToken *token)
>>>>
>>>> Rename to parse_token()? Or inline into json_parser_feed()?
>>>
>>> Renaming is good.
>>>
>>>>> + if (!key) {
>>>>> + return NULL;
>>>>> + }
>>>>
>>>> Key to understand the switch is the meaning of "break" and "return
>>>> NULL".
>>>>
>>>> We do the latter when we're done for this token.
>>>>
>>>> We do the former when this token completed a (sub-)value. @value holds
>>>> it. If the stack is empty, @value is the result of the parse, and the
>>>> code after the switch returns it. Else, @value is a sub-value, and the
>>>> code after the switch stores it in the object being constructed.
>>>
>>> Correct. Want me to include it somehow or is the grammar above fine?
>>
>> Whether the grammar makes the meaning of "break" and "return NULL"
>> obvious enough is hard to tell now: I know too much. Should fix itself
>> within a few days ;)
>
> Probably not. :) I'll see if I can move the post-pr
I can offer a pair of fresher eyes on v2.
next prev parent reply other threads:[~2026-02-17 7:40 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-07 8:48 [PATCH 0/5] qobject: switch JSON parser to push Paolo Bonzini
2026-01-07 8:48 ` [PATCH 1/5] json-parser: pass around lookahead token, constify Paolo Bonzini
2026-02-06 10:45 ` Markus Armbruster
2026-02-06 10:54 ` Paolo Bonzini
2026-01-07 8:48 ` [PATCH 2/5] json-parser: replace with a push parser Paolo Bonzini
2026-02-09 9:36 ` Markus Armbruster
2026-02-09 10:53 ` Paolo Bonzini
2026-02-12 13:12 ` Markus Armbruster
2026-02-16 16:41 ` Paolo Bonzini
2026-02-17 7:39 ` Markus Armbruster [this message]
2026-01-07 8:48 ` [PATCH 3/5] json-streamer: remove token queue Paolo Bonzini
2026-02-10 7:58 ` Markus Armbruster
2026-02-10 8:22 ` Paolo Bonzini
2026-02-11 7:13 ` Markus Armbruster
2026-01-07 8:48 ` [PATCH 4/5] json-streamer: do not heap-allocate JSONToken Paolo Bonzini
2026-02-10 8:30 ` Markus Armbruster
2026-01-07 8:48 ` [PATCH 5/5] json-parser: add location to JSON parsing errors Paolo Bonzini
2026-02-10 9:21 ` Markus Armbruster
2026-02-10 9:44 ` Paolo Bonzini
2026-02-11 7:18 ` Markus Armbruster
2026-01-21 5:57 ` [PATCH 0/5] qobject: switch JSON parser to push Paolo Bonzini
2026-01-30 13:00 ` Markus Armbruster
2026-01-30 13:36 ` Paolo Bonzini
2026-02-10 13:06 ` Markus Armbruster
2026-02-10 13:12 ` Paolo Bonzini
2026-02-10 15:52 ` Markus Armbruster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=875x7vrf8b.fsf@pond.sub.org \
--to=armbru@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.