From: Anthony Liguori
To: Michael Roth
Cc: aliguori@linux.vnet.ibm.com, agl@linux.vnet.ibm.com, qemu-devel@nongnu.org, Jes.Sorensen@redhat.com
Subject: [Qemu-devel] Re: [RFC][PATCH v1 01/12] json-lexer: make lexer error-recovery more deterministic
Date: Fri, 25 Mar 2011 16:18:11 -0500
Message-ID: <4D8D0693.5090409@us.ibm.com>
In-Reply-To: <1301082479-4058-2-git-send-email-mdroth@linux.vnet.ibm.com>

On 03/25/2011 02:47 PM, Michael Roth wrote:
> Currently when we reach an error state we effectively flush everything
> fed to the lexer, which can put us in a state where we keep feeding
> tokens into the parser at arbitrary offsets in the stream. This makes it
> difficult for the lexer/tokenizer/parser to get back in sync when bad
> input is provided by the client.
>
> With these changes we emit an error state/token up to the tokenizer as
> soon as we reach an error state, and continue processing any data passed
> in rather than bailing out. The reset token will be used to reset the
> tokenizer and parser, such that they'll recover state as soon as the
> lexer begins generating valid token sequences again.
>
> We also map chr(0xFF) to an error state here, since it's an invalid
> UTF-8 character. The QMP guest proxy/agent uses this to force a
> flush/reset of previous input for reliable delivery of certain events,
> so we also document that thoroughly here.
>
> Signed-off-by: Michael Roth
> ---
>  json-lexer.c |   22 ++++++++++++++++++----
>  json-lexer.h |    1 +
>  2 files changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/json-lexer.c b/json-lexer.c
> index 3462c89..21aa03a 100644
> --- a/json-lexer.c
> +++ b/json-lexer.c
> @@ -105,7 +105,7 @@ static const uint8_t json_lexer[][256] = {
>          ['u'] = IN_DQ_UCODE0,
>      },
>      [IN_DQ_STRING] = {
> -        [1 ... 0xFF] = IN_DQ_STRING,
> +        [1 ... 0xFE] = IN_DQ_STRING,

We also need to exclude 192, 193, 245-254, as these are all invalid bytes
in a UTF-8 sequence.  See http://en.wikipedia.org/wiki/UTF-8#Codepage_layout

We probably ought to actually handle UTF-8 extended byte sequences in the
lexer, but we can keep that as a future exercise.
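Something along these lines, just an untested standalone sketch to spell
out the byte ranges from that table (utf8_byte_possible is made up for the
example, this is not a patch against the lexer); it prints 0xC0, 0xC1 and
0xF5-0xFF, i.e. the set above plus the 0xFF byte the hunk already rejects:

/* Illustration only: enumerate the byte values that can never appear
 * anywhere in a well-formed UTF-8 stream, i.e. what the string states
 * would need to reject in addition to 0xFF. */
#include <stdio.h>
#include <stdbool.h>

static bool utf8_byte_possible(unsigned char b)
{
    if (b == 0xC0 || b == 0xC1) {
        return false;        /* could only start overlong encodings */
    }
    if (b >= 0xF5) {
        return false;        /* lead bytes past U+10FFFF, plus 0xFE/0xFF */
    }
    return true;             /* ASCII, continuation, or valid lead byte */
}

int main(void)
{
    int b;

    for (b = 0; b < 256; b++) {
        if (!utf8_byte_possible((unsigned char)b)) {
            printf("0x%02X (%d) is never valid in UTF-8\n", b, b);
        }
    }
    return 0;
}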
>          ['\\'] = IN_DQ_STRING_ESCAPE,
>          ['"'] = JSON_STRING,
>      },
> @@ -144,7 +144,7 @@ static const uint8_t json_lexer[][256] = {
>          ['u'] = IN_SQ_UCODE0,
>      },
>      [IN_SQ_STRING] = {
> -        [1 ... 0xFF] = IN_SQ_STRING,
> +        [1 ... 0xFE] = IN_SQ_STRING,
>          ['\\'] = IN_SQ_STRING_ESCAPE,
>          ['\''] = JSON_STRING,
>      },
> @@ -305,10 +305,25 @@ static int json_lexer_feed_char(JSONLexer *lexer, char ch)
>          new_state = IN_START;
>          break;
>      case ERROR:
> +        /* XXX: To avoid having previous bad input leaving the parser in an
> +         * unresponsive state where we consume unpredictable amounts of
> +         * subsequent "good" input, percolate this error state up to the
> +         * tokenizer/parser by forcing a NULL object to be emitted, then
> +         * reset state.
> +         *
> +         * Also note that this handling is required for reliable channel
> +         * negotiation between QMP and the guest agent, since chr(0xFF)
> +         * is placed at the beginning of certain events to ensure proper
> +         * delivery when the channel is in an unknown state. chr(0xFF) is
> +         * never a valid ASCII/UTF-8 sequence, so this should reliably
> +         * induce an error/flush state.
> +         */
> +        lexer->emit(lexer, lexer->token, JSON_ERROR, lexer->x, lexer->y);
>          QDECREF(lexer->token);
>          lexer->token = qstring_new();
>          new_state = IN_START;
> -        return -EINVAL;
> +        lexer->state = new_state;
> +        return 0;
>      default:
>          break;
>      }
> @@ -334,7 +349,6 @@ int json_lexer_feed(JSONLexer *lexer, const char *buffer, size_t size)
>
>      for (i = 0; i < size; i++) {
>          int err;
> -

This whitespace change slipped in FWIW.

Regards,

Anthony Liguori

>          err = json_lexer_feed_char(lexer, buffer[i]);
>          if (err < 0) {
>              return err;
> diff --git a/json-lexer.h b/json-lexer.h
> index 3b50c46..10bc0a7 100644
> --- a/json-lexer.h
> +++ b/json-lexer.h
> @@ -25,6 +25,7 @@ typedef enum json_token_type {
>      JSON_STRING,
>      JSON_ESCAPE,
>      JSON_SKIP,
> +    JSON_ERROR,
> } JSONTokenType;
>
> typedef struct JSONLexer JSONLexer;
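As an aside, the flush/reset behaviour the XXX comment above describes boils
down to the sending side prepending a 0xFF byte so any stale lexer/parser
state is discarded before the real payload is parsed. A purely hypothetical
client-side illustration (send_with_flush and the fd argument are made up
for the example, this is not qemu-ga code):

/* Hypothetical sender sketch: write a 0xFF byte, which is never valid
 * ASCII/UTF-8, so the receiving lexer hits its error state, emits
 * JSON_ERROR, and resets before parsing the JSON request that follows. */
#include <string.h>
#include <unistd.h>

static ssize_t send_with_flush(int fd, const char *json_req)
{
    const unsigned char flush = 0xFF;

    if (write(fd, &flush, 1) != 1) {
        return -1;
    }
    return write(fd, json_req, strlen(json_req));
}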