[v1] json: More fixes, error reporting improvements, cleanups

[Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F'

Posted by Markus Armbruster 7 years, 5 months ago

The lexer fails to end a valid token when the lookahead character is
beyond '\x7F'.  For instance, input

    true\xC2\xA2

produces the tokens

    JSON_ERROR     true\xC2
    JSON_ERROR     \xA2

The first token should be

    JSON_KEYWORD   true

instead.

The culprit is

    #define TERMINAL(state) [0 ... 0x7F] = (state)

It leaves [0x80..0xFF] zero, i.e. IN_ERROR.  Has always been broken.
Fix it to initialize the complete array.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/json-lexer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index e1745a3d95..4867839f66 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -123,7 +123,7 @@ enum json_lexer_state {
 QEMU_BUILD_BUG_ON((int)JSON_MIN <= (int)IN_START_INTERP);
 QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
 
-#define TERMINAL(state) [0 ... 0x7F] = (state)
+#define TERMINAL(state) [0 ... 0xFF] = (state)
 
 /* Return whether TERMINAL is a terminal state and the transition to it
    from OLD_STATE required lookahead.  This happens whenever the table
-- 
2.17.1

Re: [Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F'

Posted by Eric Blake 7 years, 5 months ago

On 08/27/2018 02:00 AM, Markus Armbruster wrote:
> The lexer fails to end a valid token when the lookahead character is
> beyond '\x7F'.  For instance, input
> 
>      true\xC2\xA2
> 
> produces the tokens
> 
>      JSON_ERROR     true\xC2
>      JSON_ERROR     \xA2
> 
> The first token should be
> 
>      JSON_KEYWORD   true
> 
> instead.

As long as we still get a JSON_ERROR in the end.

> 
> The culprit is
> 
>      #define TERMINAL(state) [0 ... 0x7F] = (state)
> 
> It leaves [0x80..0xFF] zero, i.e. IN_ERROR.  Has always been broken.

I wonder if that was done because it was assuming that valid input is 
only ASCII, and that any byte larger than 0x7f is invalid except within 
the context of a string.  But whatever the reason for the original bug, 
your fix makes sense.

> Fix it to initialize the complete array.

Worth testsuite coverage?

> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qobject/json-lexer.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
> index e1745a3d95..4867839f66 100644
> --- a/qobject/json-lexer.c
> +++ b/qobject/json-lexer.c
> @@ -123,7 +123,7 @@ enum json_lexer_state {
>   QEMU_BUILD_BUG_ON((int)JSON_MIN <= (int)IN_START_INTERP);
>   QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
>   
> -#define TERMINAL(state) [0 ... 0x7F] = (state)
> +#define TERMINAL(state) [0 ... 0xFF] = (state)
>   
>   /* Return whether TERMINAL is a terminal state and the transition to it
>      from OLD_STATE required lookahead.  This happens whenever the table
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F'

Posted by Markus Armbruster 7 years, 5 months ago

Eric Blake <eblake@redhat.com> writes:

> On 08/27/2018 02:00 AM, Markus Armbruster wrote:
>> The lexer fails to end a valid token when the lookahead character is
>> beyond '\x7F'.  For instance, input
>>
>>      true\xC2\xA2
>>
>> produces the tokens
>>
>>      JSON_ERROR     true\xC2
>>      JSON_ERROR     \xA2
>>
>> The first token should be
>>
>>      JSON_KEYWORD   true
>>
>> instead.
>
> As long as we still get a JSON_ERROR in the end.

We do: one for \xC2, and one for \xA2.  PATCH 4 will lose the second one.

>> The culprit is
>>
>>      #define TERMINAL(state) [0 ... 0x7F] = (state)
>>
>> It leaves [0x80..0xFF] zero, i.e. IN_ERROR.  Has always been broken.
>
> I wonder if that was done because it was assuming that valid input is
> only ASCII, and that any byte larger than 0x7f is invalid except
> within the context of a string.

Plausible thinko.

>                                  But whatever the reason for the
> original bug, your fix makes sense.
>
>> Fix it to initialize the complete array.
>
> Worth testsuite coverage?

Since lookahead bytes > 0x7F are always a parse error, all the bug can
do is swallow a TERMINAL() token right before a parse error.  The
TERMINAL() tokens are JSON_INTEGER, JSON_FLOAT, JSON_KEYWORD, JSON_SKIP,
JSON_INTERP.  Fairly harmless.  In particular, JSON objects get through
even when followed by a byte > 0x7F.

Of course, test coverage wouldn't hurt regardless.

>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qobject/json-lexer.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

[Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F'
[Qemu-devel] [PATCH 2/6] json: Clean up how lexer consumes "end of input"
[Qemu-devel] [PATCH 3/6] json: Make lexer's "character consumed" logic less confusing
[Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors
[Qemu-devel] [PATCH 5/6] json: Eliminate lexer state IN_ERROR
[Qemu-devel] [PATCH 6/6] json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP