[v1] json: More fixes, error reporting improvements, cleanups

[Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors

Posted by Markus Armbruster 7 years, 5 months ago

When the lexer chokes on an input character, it consumes the
character, emits a JSON error token, and enters its start state.  This
can lead to suboptimal error recovery.  For instance, input

    0123 ,

produces the tokens

    JSON_ERROR    01
    JSON_INTEGER  23
    JSON_COMMA    ,

Make the lexer skip characters after a lexical error until a
structural character ('[', ']', '{', '}', ':', ','), an ASCII control
character, or '\xFE', or '\xFF'.

Note that we must not skip ASCII control characters, '\xFE', '\xFF',
because those are documented to force the JSON parser into known-good
state, by docs/interop/qmp-spec.txt.

The lexer now produces

    JSON_ERROR    01
    JSON_COMMA    ,

Update qmp-test for the nicer error recovery: QMP now report just one
error for input %p instead of two.  Also drop the newline after %p; it
was needed to tease out the second error.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 qobject/json-lexer.c | 43 +++++++++++++++++++++++++++++--------------
 tests/qmp-test.c     |  6 +-----
 2 files changed, 30 insertions(+), 19 deletions(-)

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index 28582e17d9..39c7ce7adc 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -101,6 +101,7 @@
 
 enum json_lexer_state {
     IN_ERROR = 0,               /* must really be 0, see json_lexer[] */
+    IN_RECOVERY,
     IN_DQ_STRING_ESCAPE,
     IN_DQ_STRING,
     IN_SQ_STRING_ESCAPE,
@@ -130,6 +131,28 @@ QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
 static const uint8_t json_lexer[][256] =  {
     /* Relies on default initialization to IN_ERROR! */
 
+    /* error recovery */
+    [IN_RECOVERY] = {
+        /*
+         * Skip characters until a structural character, an ASCII
+         * control character other than '\t', or impossible UTF-8
+         * bytes '\xFE', '\xFF'.  Structural characters and line
+         * endings are promising resynchronization points.  Clients
+         * may use the others to force the JSON parser into known-good
+         * state; see docs/interop/qmp-spec.txt.
+         */
+        [0 ... 0x1F] = IN_START | LOOKAHEAD,
+        [0x20 ... 0xFD] = IN_RECOVERY,
+        [0xFE ... 0xFF] = IN_START | LOOKAHEAD,
+        ['\t'] = IN_RECOVERY,
+        ['['] = IN_START | LOOKAHEAD,
+        [']'] = IN_START | LOOKAHEAD,
+        ['{'] = IN_START | LOOKAHEAD,
+        ['}'] = IN_START | LOOKAHEAD,
+        [':'] = IN_START | LOOKAHEAD,
+        [','] = IN_START | LOOKAHEAD,
+    },
+
     /* double quote string */
     [IN_DQ_STRING_ESCAPE] = {
         [0x20 ... 0xFD] = IN_DQ_STRING,
@@ -301,26 +324,18 @@ static void json_lexer_feed_char(JSONLexer *lexer, char ch, bool flush)
             /* fall through */
         case JSON_SKIP:
             g_string_truncate(lexer->token, 0);
+            /* fall through */
+        case IN_START:
             new_state = lexer->start_state;
             break;
         case IN_ERROR:
-            /* XXX: To avoid having previous bad input leaving the parser in an
-             * unresponsive state where we consume unpredictable amounts of
-             * subsequent "good" input, percolate this error state up to the
-             * parser by emitting a JSON_ERROR token, then reset lexer state.
-             *
-             * Also note that this handling is required for reliable channel
-             * negotiation between QMP and the guest agent, since chr(0xFF)
-             * is placed at the beginning of certain events to ensure proper
-             * delivery when the channel is in an unknown state. chr(0xFF) is
-             * never a valid ASCII/UTF-8 sequence, so this should reliably
-             * induce an error/flush state.
-             */
             json_message_process_token(lexer, lexer->token, JSON_ERROR,
                                        lexer->x, lexer->y);
+            new_state = IN_RECOVERY;
+            /* fall through */
+        case IN_RECOVERY:
             g_string_truncate(lexer->token, 0);
-            lexer->state = lexer->start_state;
-            return;
+            break;
         default:
             break;
         }
diff --git a/tests/qmp-test.c b/tests/qmp-test.c
index 4ae2245484..04ad7648d2 100644
--- a/tests/qmp-test.c
+++ b/tests/qmp-test.c
@@ -93,11 +93,7 @@ static void test_malformed(QTestState *qts)
     g_assert(recovered(qts));
 
     /* lexical error: interpolation */
-    qtest_qmp_send_raw(qts, "%%p\n");
-    /* two errors, one for "%", one for "p" */
-    resp = qtest_qmp_receive(qts);
-    g_assert_cmpstr(get_error_class(resp), ==, "GenericError");
-    qobject_unref(resp);
+    qtest_qmp_send_raw(qts, "%%p");
     resp = qtest_qmp_receive(qts);
     g_assert_cmpstr(get_error_class(resp), ==, "GenericError");
     qobject_unref(resp);
-- 
2.17.1

Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors

Posted by Eric Blake 7 years, 5 months ago

On 08/27/2018 02:00 AM, Markus Armbruster wrote:
> When the lexer chokes on an input character, it consumes the
> character, emits a JSON error token, and enters its start state.  This
> can lead to suboptimal error recovery.  For instance, input
> 
>      0123 ,
> 
> produces the tokens
> 
>      JSON_ERROR    01
>      JSON_INTEGER  23
>      JSON_COMMA    ,
> 
> Make the lexer skip characters after a lexical error until a
> structural character ('[', ']', '{', '}', ':', ','), an ASCII control
> character, or '\xFE', or '\xFF'.
> 
> Note that we must not skip ASCII control characters, '\xFE', '\xFF',
> because those are documented to force the JSON parser into known-good
> state, by docs/interop/qmp-spec.txt.
> 
> The lexer now produces
> 
>      JSON_ERROR    01
>      JSON_COMMA    ,

So the lexer has now completely skipped the intermediate input, but the 
resulting error message need only point at the start of where input went 
wrong, and skipping to a sane point results in fewer error tokens to be 
reported.  Makes sense.

> 
> Update qmp-test for the nicer error recovery: QMP now report just one

s/report/reports/

> error for input %p instead of two.  Also drop the newline after %p; it
> was needed to tease out the second error.

That's because pre-patch, 'p' is one of the input characters that 
requires lookahead to determine if it forms a complete token (and the 
newline provides the transition needed to consume it); now post-patch, 
the 'p' is consumed as part of the junk after the error is first 
detected at the '%'.

And to my earlier complaint about 0x1a resulting in JSON_ERROR then 
JSON_INTEGER then JSON_KEYWORD, that sequence is likewise now identified 
as a single JSON_ERROR at the 'x', with the rest of the attempted hex 
number (invalid in JSON) silently skipped.  Nice.

> 
> Signed-off-by: Markus Armbruster <armbru@redhat.com>
> ---
>   qobject/json-lexer.c | 43 +++++++++++++++++++++++++++++--------------
>   tests/qmp-test.c     |  6 +-----
>   2 files changed, 30 insertions(+), 19 deletions(-)
> 
> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
> index 28582e17d9..39c7ce7adc 100644
> --- a/qobject/json-lexer.c
> +++ b/qobject/json-lexer.c
> @@ -101,6 +101,7 @@
>   
>   enum json_lexer_state {
>       IN_ERROR = 0,               /* must really be 0, see json_lexer[] */
> +    IN_RECOVERY,
>       IN_DQ_STRING_ESCAPE,
>       IN_DQ_STRING,
>       IN_SQ_STRING_ESCAPE,
> @@ -130,6 +131,28 @@ QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
>   static const uint8_t json_lexer[][256] =  {
>       /* Relies on default initialization to IN_ERROR! */
>   
> +    /* error recovery */
> +    [IN_RECOVERY] = {
> +        /*
> +         * Skip characters until a structural character, an ASCII
> +         * control character other than '\t', or impossible UTF-8
> +         * bytes '\xFE', '\xFF'.  Structural characters and line
> +         * endings are promising resynchronization points.  Clients
> +         * may use the others to force the JSON parser into known-good
> +         * state; see docs/interop/qmp-spec.txt.
> +         */
> +        [0 ... 0x1F] = IN_START | LOOKAHEAD,

And here's where the LOOKAHEAD bit has to be separate, because you are 
now assigning semantics to the transition on '\0' that are distinct from 
either of the two roles previously enumerated as possible.

> +        [0x20 ... 0xFD] = IN_RECOVERY,
> +        [0xFE ... 0xFF] = IN_START | LOOKAHEAD,
> +        ['\t'] = IN_RECOVERY,
> +        ['['] = IN_START | LOOKAHEAD,
> +        [']'] = IN_START | LOOKAHEAD,
> +        ['{'] = IN_START | LOOKAHEAD,
> +        ['}'] = IN_START | LOOKAHEAD,
> +        [':'] = IN_START | LOOKAHEAD,
> +        [','] = IN_START | LOOKAHEAD,
> +    },

It took me a couple of reads before I was satisfied that everything is 
initialized as desired (range assignments followed by more-specific 
re-assignment works, but isn't common), but this looks right.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors

Posted by Markus Armbruster 7 years, 5 months ago

Eric Blake <eblake@redhat.com> writes:

> On 08/27/2018 02:00 AM, Markus Armbruster wrote:
>> When the lexer chokes on an input character, it consumes the
>> character, emits a JSON error token, and enters its start state.  This
>> can lead to suboptimal error recovery.  For instance, input
>>
>>      0123 ,
>>
>> produces the tokens
>>
>>      JSON_ERROR    01
>>      JSON_INTEGER  23
>>      JSON_COMMA    ,
>>
>> Make the lexer skip characters after a lexical error until a
>> structural character ('[', ']', '{', '}', ':', ','), an ASCII control
>> character, or '\xFE', or '\xFF'.
>>
>> Note that we must not skip ASCII control characters, '\xFE', '\xFF',
>> because those are documented to force the JSON parser into known-good
>> state, by docs/interop/qmp-spec.txt.
>>
>> The lexer now produces
>>
>>      JSON_ERROR    01
>>      JSON_COMMA    ,
>
> So the lexer has now completely skipped the intermediate input, but
> the resulting error message need only point at the start of where
> input went wrong, and skipping to a sane point results in fewer error
> tokens to be reported.  Makes sense.

Exactly.

We could emit a recovery token to let json_message_process_token()
report what we skipped, but I don't think it's worth the trouble.

>> Update qmp-test for the nicer error recovery: QMP now report just one
>
> s/report/reports/

Will fix.

>> error for input %p instead of two.  Also drop the newline after %p; it
>> was needed to tease out the second error.
>
> That's because pre-patch, 'p' is one of the input characters that
> requires lookahead to determine if it forms a complete token (and the
> newline provides the transition needed to consume it); now post-patch,
> the 'p' is consumed as part of the junk after the error is first
> detected at the '%'.

Exactly.

> And to my earlier complaint about 0x1a resulting in JSON_ERROR then
> JSON_INTEGER then JSON_KEYWORD, that sequence is likewise now
> identified as a single JSON_ERROR at the 'x', with the rest of the
> attempted hex number (invalid in JSON) silently skipped.  Nice.

Correct.

>>
>> Signed-off-by: Markus Armbruster <armbru@redhat.com>
>> ---
>>   qobject/json-lexer.c | 43 +++++++++++++++++++++++++++++--------------
>>   tests/qmp-test.c     |  6 +-----
>>   2 files changed, 30 insertions(+), 19 deletions(-)
>>
>> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
>> index 28582e17d9..39c7ce7adc 100644
>> --- a/qobject/json-lexer.c
>> +++ b/qobject/json-lexer.c
>> @@ -101,6 +101,7 @@
>>     enum json_lexer_state {
>>       IN_ERROR = 0,               /* must really be 0, see json_lexer[] */
>> +    IN_RECOVERY,
>>       IN_DQ_STRING_ESCAPE,
>>       IN_DQ_STRING,
>>       IN_SQ_STRING_ESCAPE,
>> @@ -130,6 +131,28 @@ QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
>>   static const uint8_t json_lexer[][256] =  {
>>       /* Relies on default initialization to IN_ERROR! */
>>   +    /* error recovery */
>> +    [IN_RECOVERY] = {
>> +        /*
>> +         * Skip characters until a structural character, an ASCII
>> +         * control character other than '\t', or impossible UTF-8
>> +         * bytes '\xFE', '\xFF'.  Structural characters and line
>> +         * endings are promising resynchronization points.  Clients
>> +         * may use the others to force the JSON parser into known-good
>> +         * state; see docs/interop/qmp-spec.txt.
>> +         */
>> +        [0 ... 0x1F] = IN_START | LOOKAHEAD,
>
> And here's where the LOOKAHEAD bit has to be separate, because you are
> now assigning semantics to the transition on '\0' that are distinct
> from either of the two roles previously enumerated as possible.

I could do

            TERMINAL(IN_START)
            [0x20 ... 0xFD] = IN_RECOVERY,
            ['\t'] = IN_RECOVERY,

but it would be awful :)

>> +        [0x20 ... 0xFD] = IN_RECOVERY,
>> +        [0xFE ... 0xFF] = IN_START | LOOKAHEAD,
>> +        ['\t'] = IN_RECOVERY,
>> +        ['['] = IN_START | LOOKAHEAD,
>> +        [']'] = IN_START | LOOKAHEAD,
>> +        ['{'] = IN_START | LOOKAHEAD,
>> +        ['}'] = IN_START | LOOKAHEAD,
>> +        [':'] = IN_START | LOOKAHEAD,
>> +        [','] = IN_START | LOOKAHEAD,
>> +    },
>
> It took me a couple of reads before I was satisfied that everything is
> initialized as desired (range assignments followed by more-specific
> re-assignment works, but isn't common), but this looks right.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

[Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F'
[Qemu-devel] [PATCH 2/6] json: Clean up how lexer consumes "end of input"
[Qemu-devel] [PATCH 3/6] json: Make lexer's "character consumed" logic less confusing
[Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors
[Qemu-devel] [PATCH 5/6] json: Eliminate lexer state IN_ERROR
[Qemu-devel] [PATCH 6/6] json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP