tests/functional: collect more info in base.log file

[PATCH 2/2] tests/functional: include the lower level QMP log messages

Posted by Daniel P. Berrangé 3 months, 1 week ago

We've seen a GitLab CI timeout failure in the test_pseries.py test,
where it appears likely that the test has hung in a self.qmp('quit')
call, but we don't have conclusive proof. Adding the QMP log category
to what we capture should help us diagnose this, at the cost of the
base.log file becoming significantly more verbose. The previous
commit to include the logger category name and function should at
least help understanding the more verbose logs.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/functional/qemu_test/testcase.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tests/functional/qemu_test/testcase.py b/tests/functional/qemu_test/testcase.py
index 6fc6e1ac0a..5ab84ce674 100644
--- a/tests/functional/qemu_test/testcase.py
+++ b/tests/functional/qemu_test/testcase.py
@@ -225,6 +225,9 @@ def setUp(self):
         self.machinelog = logging.getLogger('qemu.machine')
         self.machinelog.setLevel(logging.DEBUG)
         self.machinelog.addHandler(self._log_fh)
+        self.qmplog = logging.getLogger('qemu.qmp')
+        self.qmplog.setLevel(logging.DEBUG)
+        self.qmplog.addHandler(self._log_fh)
 
         if not self.assets_available():
             self.skipTest('One or more assets is not available')
@@ -235,6 +238,7 @@ def tearDown(self):
         if self.socketdir is not None:
             shutil.rmtree(self.socketdir.name)
             self.socketdir = None
+        self.qmplog.removeHandler(self._log_fh)
         self.machinelog.removeHandler(self._log_fh)
         self.log.removeHandler(self._log_fh)
         self._log_fh.close()
-- 
2.51.1

Re: [PATCH 2/2] tests/functional: include the lower level QMP log messages

Posted by Thomas Huth 3 months, 1 week ago

On 28/10/2025 19.26, Daniel P. Berrangé wrote:
> We've seen a GitLab CI timeout failure in the test_pseries.py test,
> where it appears likely that the test has hung in a self.qmp('quit')
> call, but we don't have conclusive proof. Adding the QMP log category
> to what we capture should help us diagnose this, at the cost of the
> base.log file becoming significantly more verbose. The previous
> commit to include the logger category name and function should at
> least help understanding the more verbose logs.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/functional/qemu_test/testcase.py | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/tests/functional/qemu_test/testcase.py b/tests/functional/qemu_test/testcase.py
> index 6fc6e1ac0a..5ab84ce674 100644
> --- a/tests/functional/qemu_test/testcase.py
> +++ b/tests/functional/qemu_test/testcase.py
> @@ -225,6 +225,9 @@ def setUp(self):
>           self.machinelog = logging.getLogger('qemu.machine')
>           self.machinelog.setLevel(logging.DEBUG)
>           self.machinelog.addHandler(self._log_fh)
> +        self.qmplog = logging.getLogger('qemu.qmp')
> +        self.qmplog.setLevel(logging.DEBUG)
> +        self.qmplog.addHandler(self._log_fh)
>   
>           if not self.assets_available():
>               self.skipTest('One or more assets is not available')
> @@ -235,6 +238,7 @@ def tearDown(self):
>           if self.socketdir is not None:
>               shutil.rmtree(self.socketdir.name)
>               self.socketdir = None
> +        self.qmplog.removeHandler(self._log_fh)
>           self.machinelog.removeHandler(self._log_fh)
>           self.log.removeHandler(self._log_fh)
>           self._log_fh.close()

  Hi,

I tested this patch, and now I'm seeing Python stack traces in the base.log, 
like this:

2025-10-29 07:22:45,264 - DEBUG: qemu.qmp.protocol.default._bh_loop_forever 
Task.Reader: failure:
   | Traceback (most recent call last):
   |   File "../python/qemu/qmp/protocol.py", line 914, in _bh_loop_forever
   |     await async_fn()
   |   File "../python/qemu/qmp/protocol.py", line 952, in _bh_recv_message
   |     msg = await self._recv()
   |           ^^^^^^^^^^^^^^^^^^
   |   File "../python/qemu/qmp/protocol.py", line 1053, in _recv
   |     message = await self._do_recv()
   |               ^^^^^^^^^^^^^^^^^^^^^
   |   File "../python/qemu/qmp/qmp_client.py", line 459, in _do_recv
   |     msg_bytes = await self._readline()
   |                 ^^^^^^^^^^^^^^^^^^^^^^
   |   File "../python/qemu/qmp/protocol.py", line 1021, in _readline
   |     raise EOFError
   | EOFError

If if it's harmless, I think that's very confusing for the casual user when 
trying to debug a problem. Can we avoid these?

Or maybe rather log the qmp stuff into a separate file, e.g. qmp.log ?

  Thomas

Re: [PATCH 2/2] tests/functional: include the lower level QMP log messages

Posted by Daniel P. Berrangé 3 months, 1 week ago

On Wed, Oct 29, 2025 at 08:12:31AM +0100, Thomas Huth wrote:
> On 28/10/2025 19.26, Daniel P. Berrangé wrote:
> > We've seen a GitLab CI timeout failure in the test_pseries.py test,
> > where it appears likely that the test has hung in a self.qmp('quit')
> > call, but we don't have conclusive proof. Adding the QMP log category
> > to what we capture should help us diagnose this, at the cost of the
> > base.log file becoming significantly more verbose. The previous
> > commit to include the logger category name and function should at
> > least help understanding the more verbose logs.
> > 
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >   tests/functional/qemu_test/testcase.py | 4 ++++
> >   1 file changed, 4 insertions(+)
> > 
> > diff --git a/tests/functional/qemu_test/testcase.py b/tests/functional/qemu_test/testcase.py
> > index 6fc6e1ac0a..5ab84ce674 100644
> > --- a/tests/functional/qemu_test/testcase.py
> > +++ b/tests/functional/qemu_test/testcase.py
> > @@ -225,6 +225,9 @@ def setUp(self):
> >           self.machinelog = logging.getLogger('qemu.machine')
> >           self.machinelog.setLevel(logging.DEBUG)
> >           self.machinelog.addHandler(self._log_fh)
> > +        self.qmplog = logging.getLogger('qemu.qmp')
> > +        self.qmplog.setLevel(logging.DEBUG)
> > +        self.qmplog.addHandler(self._log_fh)
> >           if not self.assets_available():
> >               self.skipTest('One or more assets is not available')
> > @@ -235,6 +238,7 @@ def tearDown(self):
> >           if self.socketdir is not None:
> >               shutil.rmtree(self.socketdir.name)
> >               self.socketdir = None
> > +        self.qmplog.removeHandler(self._log_fh)
> >           self.machinelog.removeHandler(self._log_fh)
> >           self.log.removeHandler(self._log_fh)
> >           self._log_fh.close()
> 
>  Hi,
> 
> I tested this patch, and now I'm seeing Python stack traces in the base.log,
> like this:
> 
> 2025-10-29 07:22:45,264 - DEBUG: qemu.qmp.protocol.default._bh_loop_forever
> Task.Reader: failure:
>   | Traceback (most recent call last):
>   |   File "../python/qemu/qmp/protocol.py", line 914, in _bh_loop_forever
>   |     await async_fn()
>   |   File "../python/qemu/qmp/protocol.py", line 952, in _bh_recv_message
>   |     msg = await self._recv()
>   |           ^^^^^^^^^^^^^^^^^^
>   |   File "../python/qemu/qmp/protocol.py", line 1053, in _recv
>   |     message = await self._do_recv()
>   |               ^^^^^^^^^^^^^^^^^^^^^
>   |   File "../python/qemu/qmp/qmp_client.py", line 459, in _do_recv
>   |     msg_bytes = await self._readline()
>   |                 ^^^^^^^^^^^^^^^^^^^^^^
>   |   File "../python/qemu/qmp/protocol.py", line 1021, in _readline
>   |     raise EOFError
>   | EOFError
> 
> If if it's harmless, I think that's very confusing for the casual user when
> trying to debug a problem. Can we avoid these?

That comes from this code in qemu/qmp/protocol.py

        except BaseException as err:
            self.logger.log(
                logging.INFO if isinstance(err, EOFError) else logging.ERROR,
                "Task.%s: %s",
                name, exception_summary(err)
            )
            self.logger.debug("Task.%s: failure:\n%s\n",
                              name, pretty_traceback())
            self._schedule_disconnect()
            raise

Given EOF is a "normal" thing, I'm not convinced it justifies printing
a traceback.  I'd be inclined to suggest a new "except EOFError" clause
that catches that normal exception and thus skips the traceback.

I don't think we need to block this patch waiting for that improvement
though. When collecting DEBUG logs there are always going to be weird
things like this present, and anything misleading just needs to be
fixed when we uncover it.

> Or maybe rather log the qmp stuff into a separate file, e.g. qmp.log ?

I really don't want separate files for different log categories, as then
you have created a problem of merging the two files in order to interpret
the flow. That doesn't doesn't scale as we capture yet more log categories.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 2/2] tests/functional: include the lower level QMP log messages

Posted by Thomas Huth 3 months, 1 week ago

On 29/10/2025 09.12, Daniel P. Berrangé wrote:
> On Wed, Oct 29, 2025 at 08:12:31AM +0100, Thomas Huth wrote:
>> On 28/10/2025 19.26, Daniel P. Berrangé wrote:
>>> We've seen a GitLab CI timeout failure in the test_pseries.py test,
>>> where it appears likely that the test has hung in a self.qmp('quit')
>>> call, but we don't have conclusive proof. Adding the QMP log category
>>> to what we capture should help us diagnose this, at the cost of the
>>> base.log file becoming significantly more verbose. The previous
>>> commit to include the logger category name and function should at
>>> least help understanding the more verbose logs.
>>>
>>> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
>>> ---
>>>    tests/functional/qemu_test/testcase.py | 4 ++++
>>>    1 file changed, 4 insertions(+)
>>>
>>> diff --git a/tests/functional/qemu_test/testcase.py b/tests/functional/qemu_test/testcase.py
>>> index 6fc6e1ac0a..5ab84ce674 100644
>>> --- a/tests/functional/qemu_test/testcase.py
>>> +++ b/tests/functional/qemu_test/testcase.py
>>> @@ -225,6 +225,9 @@ def setUp(self):
>>>            self.machinelog = logging.getLogger('qemu.machine')
>>>            self.machinelog.setLevel(logging.DEBUG)
>>>            self.machinelog.addHandler(self._log_fh)
>>> +        self.qmplog = logging.getLogger('qemu.qmp')
>>> +        self.qmplog.setLevel(logging.DEBUG)
>>> +        self.qmplog.addHandler(self._log_fh)
>>>            if not self.assets_available():
>>>                self.skipTest('One or more assets is not available')
>>> @@ -235,6 +238,7 @@ def tearDown(self):
>>>            if self.socketdir is not None:
>>>                shutil.rmtree(self.socketdir.name)
>>>                self.socketdir = None
>>> +        self.qmplog.removeHandler(self._log_fh)
>>>            self.machinelog.removeHandler(self._log_fh)
>>>            self.log.removeHandler(self._log_fh)
>>>            self._log_fh.close()
>>
>>   Hi,
>>
>> I tested this patch, and now I'm seeing Python stack traces in the base.log,
>> like this:
>>
>> 2025-10-29 07:22:45,264 - DEBUG: qemu.qmp.protocol.default._bh_loop_forever
>> Task.Reader: failure:
>>    | Traceback (most recent call last):
>>    |   File "../python/qemu/qmp/protocol.py", line 914, in _bh_loop_forever
>>    |     await async_fn()
>>    |   File "../python/qemu/qmp/protocol.py", line 952, in _bh_recv_message
>>    |     msg = await self._recv()
>>    |           ^^^^^^^^^^^^^^^^^^
>>    |   File "../python/qemu/qmp/protocol.py", line 1053, in _recv
>>    |     message = await self._do_recv()
>>    |               ^^^^^^^^^^^^^^^^^^^^^
>>    |   File "../python/qemu/qmp/qmp_client.py", line 459, in _do_recv
>>    |     msg_bytes = await self._readline()
>>    |                 ^^^^^^^^^^^^^^^^^^^^^^
>>    |   File "../python/qemu/qmp/protocol.py", line 1021, in _readline
>>    |     raise EOFError
>>    | EOFError
>>
>> If if it's harmless, I think that's very confusing for the casual user when
>> trying to debug a problem. Can we avoid these?
> 
> That comes from this code in qemu/qmp/protocol.py
> 
>          except BaseException as err:
>              self.logger.log(
>                  logging.INFO if isinstance(err, EOFError) else logging.ERROR,
>                  "Task.%s: %s",
>                  name, exception_summary(err)
>              )
>              self.logger.debug("Task.%s: failure:\n%s\n",
>                                name, pretty_traceback())
>              self._schedule_disconnect()
>              raise
> 
> Given EOF is a "normal" thing, I'm not convinced it justifies printing
> a traceback.  I'd be inclined to suggest a new "except EOFError" clause
> that catches that normal exception and thus skips the traceback.

Sounds reasonable, could you please send a patch for qemu.qmp?

> I don't think we need to block this patch waiting for that improvement
> though. When collecting DEBUG logs there are always going to be weird
> things like this present, and anything misleading just needs to be
> fixed when we uncover it.

Ack.

>> Or maybe rather log the qmp stuff into a separate file, e.g. qmp.log ?
> 
> I really don't want separate files for different log categories, as then
> you have created a problem of merging the two files in order to interpret
> the flow. That doesn't doesn't scale as we capture yet more log categories.

Ok, fine with me, we still can refine later if we feel that it is necessary. So:

Reviewed-by: Thomas Huth <thuth@redhat.com>

I'll queue this for my next pull request.

  Thomas

Re: [PATCH 2/2] tests/functional: include the lower level QMP log messages

Posted by Daniel P. Berrangé 3 months, 1 week ago

On Wed, Oct 29, 2025 at 09:30:03AM +0100, Thomas Huth wrote:
> On 29/10/2025 09.12, Daniel P. Berrangé wrote:
> > On Wed, Oct 29, 2025 at 08:12:31AM +0100, Thomas Huth wrote:
> > > On 28/10/2025 19.26, Daniel P. Berrangé wrote:
> > > > We've seen a GitLab CI timeout failure in the test_pseries.py test,
> > > > where it appears likely that the test has hung in a self.qmp('quit')
> > > > call, but we don't have conclusive proof. Adding the QMP log category
> > > > to what we capture should help us diagnose this, at the cost of the
> > > > base.log file becoming significantly more verbose. The previous
> > > > commit to include the logger category name and function should at
> > > > least help understanding the more verbose logs.
> > > > 
> > > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > > ---
> > > >    tests/functional/qemu_test/testcase.py | 4 ++++
> > > >    1 file changed, 4 insertions(+)
> > > > 
> > > > diff --git a/tests/functional/qemu_test/testcase.py b/tests/functional/qemu_test/testcase.py
> > > > index 6fc6e1ac0a..5ab84ce674 100644
> > > > --- a/tests/functional/qemu_test/testcase.py
> > > > +++ b/tests/functional/qemu_test/testcase.py
> > > > @@ -225,6 +225,9 @@ def setUp(self):
> > > >            self.machinelog = logging.getLogger('qemu.machine')
> > > >            self.machinelog.setLevel(logging.DEBUG)
> > > >            self.machinelog.addHandler(self._log_fh)
> > > > +        self.qmplog = logging.getLogger('qemu.qmp')
> > > > +        self.qmplog.setLevel(logging.DEBUG)
> > > > +        self.qmplog.addHandler(self._log_fh)
> > > >            if not self.assets_available():
> > > >                self.skipTest('One or more assets is not available')
> > > > @@ -235,6 +238,7 @@ def tearDown(self):
> > > >            if self.socketdir is not None:
> > > >                shutil.rmtree(self.socketdir.name)
> > > >                self.socketdir = None
> > > > +        self.qmplog.removeHandler(self._log_fh)
> > > >            self.machinelog.removeHandler(self._log_fh)
> > > >            self.log.removeHandler(self._log_fh)
> > > >            self._log_fh.close()
> > > 
> > >   Hi,
> > > 
> > > I tested this patch, and now I'm seeing Python stack traces in the base.log,
> > > like this:
> > > 
> > > 2025-10-29 07:22:45,264 - DEBUG: qemu.qmp.protocol.default._bh_loop_forever
> > > Task.Reader: failure:
> > >    | Traceback (most recent call last):
> > >    |   File "../python/qemu/qmp/protocol.py", line 914, in _bh_loop_forever
> > >    |     await async_fn()
> > >    |   File "../python/qemu/qmp/protocol.py", line 952, in _bh_recv_message
> > >    |     msg = await self._recv()
> > >    |           ^^^^^^^^^^^^^^^^^^
> > >    |   File "../python/qemu/qmp/protocol.py", line 1053, in _recv
> > >    |     message = await self._do_recv()
> > >    |               ^^^^^^^^^^^^^^^^^^^^^
> > >    |   File "../python/qemu/qmp/qmp_client.py", line 459, in _do_recv
> > >    |     msg_bytes = await self._readline()
> > >    |                 ^^^^^^^^^^^^^^^^^^^^^^
> > >    |   File "../python/qemu/qmp/protocol.py", line 1021, in _readline
> > >    |     raise EOFError
> > >    | EOFError
> > > 
> > > If if it's harmless, I think that's very confusing for the casual user when
> > > trying to debug a problem. Can we avoid these?
> > 
> > That comes from this code in qemu/qmp/protocol.py
> > 
> >          except BaseException as err:
> >              self.logger.log(
> >                  logging.INFO if isinstance(err, EOFError) else logging.ERROR,
> >                  "Task.%s: %s",
> >                  name, exception_summary(err)
> >              )
> >              self.logger.debug("Task.%s: failure:\n%s\n",
> >                                name, pretty_traceback())
> >              self._schedule_disconnect()
> >              raise
> > 
> > Given EOF is a "normal" thing, I'm not convinced it justifies printing
> > a traceback.  I'd be inclined to suggest a new "except EOFError" clause
> > that catches that normal exception and thus skips the traceback.
> 
> Sounds reasonable, could you please send a patch for qemu.qmp?

https://gitlab.com/qemu-project/python-qemu-qmp/-/merge_requests/46


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

[PATCH 1/2] tests/functional: include logger name and function in messages
[PATCH 2/2] tests/functional: include the lower level QMP log messages