From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 14 Mar 2011 17:33:11 -0300
From: Luiz Capitulino
Subject: Re: [Qemu-devel] Re: [PATCH 09/11] json-lexer: limit the maximum size of a given token
Message-ID: <20110314173311.23dfa5f7@doriath>
In-Reply-To: <4D7E7831.1060306@codemonkey.ws>
References: <1299877249-13433-1-git-send-email-aliguori@us.ibm.com>
 <1299877249-13433-10-git-send-email-aliguori@us.ibm.com>
 <20110314162502.12a7deab@doriath>
 <4D7E7831.1060306@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
To: Anthony Liguori
Cc: Paolo Bonzini, Anthony Liguori, qemu-devel@nongnu.org, Michael Roth,
 Markus Armbruster

On Mon, 14 Mar 2011 15:18:57 -0500
Anthony Liguori wrote:

> On 03/14/2011 02:25 PM, Luiz Capitulino wrote:
> > On Fri, 11 Mar 2011 15:00:47 -0600
> > Anthony Liguori wrote:
> >
> >> This is a security consideration. We don't want a client to cause an
> >> arbitrary amount of memory to be allocated in QEMU. For now, we use a
> >> limit of 64MB which should be large enough for any reasonably sized
> >> token.
> >>
> >> This is important for parsing JSON from untrusted sources.
> >>
> >> Signed-off-by: Anthony Liguori
> >>
> >> diff --git a/json-lexer.c b/json-lexer.c
> >> index 834d7af..3462c89 100644
> >> --- a/json-lexer.c
> >> +++ b/json-lexer.c
> >> @@ -18,6 +18,8 @@
> >>  #include "qemu-common.h"
> >>  #include "json-lexer.h"
> >>
> >> +#define MAX_TOKEN_SIZE (64ULL << 20)
> >> +
> >>  /*
> >>   * \"([^\\\"]|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*\"
> >>   * '([^\\']|(\\\"\\'\\\\\\/\\b\\f\\n\\r\\t\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]))*'
> >> @@ -312,6 +314,17 @@ static int json_lexer_feed_char(JSONLexer *lexer, char ch)
> >>          }
> >>          lexer->state = new_state;
> >>      } while (!char_consumed);
> >> +
> >> +    /* Do not let a single token grow to an arbitrarily large size,
> >> +     * this is a security consideration.
> >> +     */
> >> +    if (lexer->token->length > MAX_TOKEN_SIZE) {
> >> +        lexer->emit(lexer, lexer->token, lexer->state, lexer->x, lexer->y);
> >> +        QDECREF(lexer->token);
> >> +        lexer->token = qstring_new();
> >> +        lexer->state = IN_START;
> >> +    }
> > Entering an invalid token is an error, we should fail here.
>
> It's not so clear to me.
>
> I think of it like GCC. GCC doesn't bail out on the first invalid
> character the lexer encounters. Instead, it records that an error has
> occurred and tries its best to recover.
>
> The result is that instead of getting an error message about the first
> error in your code, you'll get a long listing of all the mistakes you
> made (usually).
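(To make the behaviour under discussion concrete, here is a toy,
self-contained sketch of the reset-and-continue logic the hunk above adds.
It is not QEMU's json-lexer code; the names ToyLexer, toy_feed_char and
TOY_MAX_TOKEN_SIZE are made up for illustration. Once the accumulated
token passes the cap, the partial token is handed to the emit callback and
the state is reset, so whatever bytes follow are lexed as a fresh token
rather than rejected.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cap, far smaller than the 64MB in the patch, so the reset
 * is easy to trigger by hand. */
#define TOY_MAX_TOKEN_SIZE 8

typedef struct ToyLexer {
    char *token;                                  /* bytes of the current token */
    size_t len;
    void (*emit)(const char *token, size_t len);  /* consumer callback */
} ToyLexer;

static void toy_reset(ToyLexer *lexer)
{
    free(lexer->token);
    lexer->token = calloc(1, TOY_MAX_TOKEN_SIZE + 2);
    lexer->len = 0;
}

/* Mirrors the shape of the hunk above: append the character, then, if the
 * token grew past the cap, emit whatever accumulated and start over. */
static void toy_feed_char(ToyLexer *lexer, char ch)
{
    lexer->token[lexer->len++] = ch;

    if (lexer->len > TOY_MAX_TOKEN_SIZE) {
        lexer->emit(lexer->token, lexer->len);    /* oversized partial token */
        toy_reset(lexer);                         /* back to the start state */
    }
}

static void print_token(const char *token, size_t len)
{
    printf("emitted %zu bytes: %.*s\n", len, (int)len, token);
}

int main(void)
{
    ToyLexer lexer = { .emit = print_token };
    const char *input = "\"averyveryverylongstring\"";

    toy_reset(&lexer);
    for (size_t i = 0; i < strlen(input); i++) {
        toy_feed_char(&lexer, input[i]);
    }
    /* The long string is split at the cap and handed on in pieces instead
     * of being rejected, which is the behaviour being debated here. */
    return 0;
}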
I'm not a compiler expert, but I think compilers do this because it would
be very problematic from a usability perspective to report one error at a
time: by reporting as many errors per run as possible, the compiler lets
the user fix most programming mistakes without having to re-run it. Our
target is not humans, though.

> One thing that makes this more difficult in our case is that when you're
> testing, we don't have a clear EOI to flush things out. So a bad
> sequence of inputs might make the message parser wait for a character
> that you're not necessarily inputting, which makes the session appear
> hung. Usually, if you throw a couple extra brackets around, you'll get
> back to valid input.

Not sure if this is easy to do, but shouldn't we fail if we don't get the
expected character?

> > Which brings two features:
> >
> > 1. A test code could trigger this condition and check for the specific
> >    error code
> >
> > 2. Developers will know when they hit the limit. Although I don't
> >    expect this to happen, there was talk about adding base64 support to
> >    transfer something (I can't remember what, but we never know how the
> >    protocol will evolve).
> >
> > Also, by testing this I found that the parser seems to get confused
> > when the limit is reached: it stops responding.
>
> Actually, it does respond. The lexer just takes an incredibly long time
> to process a large token because qstring_append_ch is incredibly slow
> :-) If you drop the token size down to 64k instead of 64mb, it'll seem
> a lot more reasonable.

Actually, I dropped it down to 20 :) This is supposed to work like your
64k suggestion, right?

>
> Regards,
>
> Anthony Liguori
>
> >> +
> >>      return 0;
> >>  }
> >>
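(For illustration, the kind of test point 1 above describes could look
roughly like the sketch below. It uses made-up names, fake_lexer_feed and
FAKE_ERR_TOKEN_TOO_LARGE, rather than the real lexer or the check-qjson
test suite, and it assumes the fail-fast behaviour argued for here instead
of what the patch currently does. The point is only that, given a specific
error code, a test can assert that the limit kicked in instead of inferring
it from a hung session.)

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A tiny cap so the test is cheap, in the spirit of dropping the limit to
 * 20 while experimenting. */
#define FAKE_MAX_TOKEN_SIZE 20

enum {
    FAKE_OK = 0,
    FAKE_ERR_TOKEN_TOO_LARGE = -1   /* the specific error code a test checks */
};

/* Fail-fast variant of the size check: instead of emitting the partial
 * token and resetting, refuse the input outright. */
static int fake_lexer_feed(size_t *token_len, const char *buf, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        if (buf[i] == '\0') {           /* defensive: stop at a stray NUL */
            break;
        }
        (*token_len)++;                 /* stand-in for appending buf[i] */
        if (*token_len > FAKE_MAX_TOKEN_SIZE) {
            return FAKE_ERR_TOKEN_TOO_LARGE;
        }
    }
    return FAKE_OK;
}

int main(void)
{
    /* A single "token" longer than the limit: 30 bytes with the quotes. */
    const char *big = "\"0123456789012345678901234567\"";
    size_t token_len = 0;

    int ret = fake_lexer_feed(&token_len, big, strlen(big));
    assert(ret == FAKE_ERR_TOKEN_TOO_LARGE);
    printf("oversized token rejected with error %d\n", ret);
    return 0;
}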