[PATCH] tests/avocado: Fix console data loss

Nicholas Piggin posted 1 patch 7 months, 2 weeks ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230912131340.405619-1-npiggin@gmail.com
Maintainers: John Snow <jsnow@redhat.com>, Cleber Rosa <crosa@redhat.com>, Beraldo Leal <bleal@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Wainer dos Santos Moschetta <wainersm@redhat.com>
python/qemu/machine/machine.py         | 19 +++++++++++++++++++
tests/avocado/avocado_qemu/__init__.py |  2 +-
2 files changed, 20 insertions(+), 1 deletion(-)
[PATCH] tests/avocado: Fix console data loss
Posted by Nicholas Piggin 7 months, 2 weeks ago
Occasionally some avocado tests will fail waiting for console line
despite the machine running correctly. Console data goes missing, as can
be seen in the console log. This is due to _console_interaction calling
makefile() on the console socket each time it is invoked, which must be
losing old buffer contents when going out of scope.

It is not enough to makefile() with buffered=0. That helps significantly
but data loss is still possible. My guess is that readline() has a line
buffer even when the file is in unbuffered mode, that can eat data.

Fix this by providing a console file that persists for the life of the
console.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---

For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m
was flakey for me due to this bug. I don't know why that in particular,
3 calls to wait_for_console_pattern probably helps.

I didn't pinpoint when the bug was introduced because the original
was probably not buggy because it was only run once at the end of the
test. At some point after it was moved to common code, something would
have started to call it more than once which is where potential for bug
is introduced.

Thanks,
Nick

 python/qemu/machine/machine.py         | 19 +++++++++++++++++++
 tests/avocado/avocado_qemu/__init__.py |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
index c16a0b6fed..35d5a672db 100644
--- a/python/qemu/machine/machine.py
+++ b/python/qemu/machine/machine.py
@@ -191,6 +191,7 @@ def __init__(self,
             self.sock_dir, f"{self._name}.con"
         )
         self._console_socket: Optional[socket.socket] = None
+        self._console_file: Optional[socket.SocketIO] = None
         self._remove_files: List[str] = []
         self._user_killed = False
         self._quit_issued = False
@@ -509,6 +510,11 @@ def _early_cleanup(self) -> None:
         # If we keep the console socket open, we may deadlock waiting
         # for QEMU to exit, while QEMU is waiting for the socket to
         # become writable.
+        if self._console_file is not None:
+            LOG.debug("Closing console file")
+            self._console_file.close()
+            self._console_file = None
+
         if self._console_socket is not None:
             LOG.debug("Closing console socket")
             self._console_socket.close()
@@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket:
         Returns a socket connected to the console
         """
         if self._console_socket is None:
+            LOG.debug("Opening console socket")
             self._console_socket = console_socket.ConsoleSocket(
                 self._console_address,
                 file=self._console_log_path,
                 drain=self._drain_console)
         return self._console_socket
 
+    @property
+    def console_file(self) -> socket.SocketIO:
+        """
+        Returns a file associated with the console socket
+        """
+        if self._console_file is None:
+            LOG.debug("Opening console file")
+            self._console_file = self.console_socket.makefile(mode='rb',
+                                                              buffering=0,
+                                                              encoding='utf-8')
+        return self._console_file
+
     @property
     def temp_dir(self) -> str:
         """
diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
index 33090903f1..0172a359b7 100644
--- a/tests/avocado/avocado_qemu/__init__.py
+++ b/tests/avocado/avocado_qemu/__init__.py
@@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message,
     assert not keep_sending or send_string
     if vm is None:
         vm = test.vm
-    console = vm.console_socket.makefile(mode='rb', encoding='utf-8')
+    console = vm.console_file
     console_logger = logging.getLogger('console')
     while True:
         if send_string:
-- 
2.40.1
Re: [PATCH] tests/avocado: Fix console data loss
Posted by Alex Bennée 7 months, 2 weeks ago
Nicholas Piggin <npiggin@gmail.com> writes:

> Occasionally some avocado tests will fail waiting for console line
> despite the machine running correctly. Console data goes missing, as can
> be seen in the console log. This is due to _console_interaction calling
> makefile() on the console socket each time it is invoked, which must be
> losing old buffer contents when going out of scope.
>
> It is not enough to makefile() with buffered=0. That helps significantly
> but data loss is still possible. My guess is that readline() has a line
> buffer even when the file is in unbuffered mode, that can eat data.
>
> Fix this by providing a console file that persists for the life of the
> console.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Queued to testing/next, thanks.

> ---
>
> For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m
> was flakey for me due to this bug. I don't know why that in particular,
> 3 calls to wait_for_console_pattern probably helps.
>
> I didn't pinpoint when the bug was introduced because the original
> was probably not buggy because it was only run once at the end of the
> test. At some point after it was moved to common code, something would
> have started to call it more than once which is where potential for bug
> is introduced.

There is a sprawling mass somewhere between:

  - pythons buffering of IO
  - device models dropping chars when blocked
  - noisy tests with competing console output

that adds up to unreliable tests that rely on seeing certain patterns on
the console. 

>
> Thanks,
> Nick
>
>  python/qemu/machine/machine.py         | 19 +++++++++++++++++++
>  tests/avocado/avocado_qemu/__init__.py |  2 +-
>  2 files changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> index c16a0b6fed..35d5a672db 100644
> --- a/python/qemu/machine/machine.py
> +++ b/python/qemu/machine/machine.py
> @@ -191,6 +191,7 @@ def __init__(self,
>              self.sock_dir, f"{self._name}.con"
>          )
>          self._console_socket: Optional[socket.socket] = None
> +        self._console_file: Optional[socket.SocketIO] = None
>          self._remove_files: List[str] = []
>          self._user_killed = False
>          self._quit_issued = False
> @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None:
>          # If we keep the console socket open, we may deadlock waiting
>          # for QEMU to exit, while QEMU is waiting for the socket to
>          # become writable.
> +        if self._console_file is not None:
> +            LOG.debug("Closing console file")
> +            self._console_file.close()
> +            self._console_file = None
> +
>          if self._console_socket is not None:
>              LOG.debug("Closing console socket")
>              self._console_socket.close()
> @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket:
>          Returns a socket connected to the console
>          """
>          if self._console_socket is None:
> +            LOG.debug("Opening console socket")
>              self._console_socket = console_socket.ConsoleSocket(
>                  self._console_address,
>                  file=self._console_log_path,
>                  drain=self._drain_console)
>          return self._console_socket
>  
> +    @property
> +    def console_file(self) -> socket.SocketIO:
> +        """
> +        Returns a file associated with the console socket
> +        """
> +        if self._console_file is None:
> +            LOG.debug("Opening console file")
> +            self._console_file = self.console_socket.makefile(mode='rb',
> +                                                              buffering=0,
> +                                                              encoding='utf-8')
> +        return self._console_file
> +
>      @property
>      def temp_dir(self) -> str:
>          """
> diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
> index 33090903f1..0172a359b7 100644
> --- a/tests/avocado/avocado_qemu/__init__.py
> +++ b/tests/avocado/avocado_qemu/__init__.py
> @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message,
>      assert not keep_sending or send_string
>      if vm is None:
>          vm = test.vm
> -    console = vm.console_socket.makefile(mode='rb', encoding='utf-8')
> +    console = vm.console_file
>      console_logger = logging.getLogger('console')
>      while True:
>          if send_string:


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro
Re: [PATCH] tests/avocado: Fix console data loss
Posted by Nicholas Piggin 7 months, 2 weeks ago
On Wed Sep 13, 2023 at 6:51 PM AEST, Alex Bennée wrote:
>
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > Occasionally some avocado tests will fail waiting for console line
> > despite the machine running correctly. Console data goes missing, as can
> > be seen in the console log. This is due to _console_interaction calling
> > makefile() on the console socket each time it is invoked, which must be
> > losing old buffer contents when going out of scope.
> >
> > It is not enough to makefile() with buffered=0. That helps significantly
> > but data loss is still possible. My guess is that readline() has a line
> > buffer even when the file is in unbuffered mode, that can eat data.
> >
> > Fix this by providing a console file that persists for the life of the
> > console.
> >
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>
> Queued to testing/next, thanks.
>
> > ---
> >
> > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m
> > was flakey for me due to this bug. I don't know why that in particular,
> > 3 calls to wait_for_console_pattern probably helps.
> >
> > I didn't pinpoint when the bug was introduced because the original
> > was probably not buggy because it was only run once at the end of the
> > test. At some point after it was moved to common code, something would
> > have started to call it more than once which is where potential for bug
> > is introduced.
>
> There is a sprawling mass somewhere between:
>
>   - pythons buffering of IO
>   - device models dropping chars when blocked
>   - noisy tests with competing console output
>
> that adds up to unreliable tests that rely on seeing certain patterns on
> the console. 

Yeah it's a tricky bug and a difficult stack to diagnose. I started to
look at 40p machine firmware console at first since it was happening on
there.

It's actually not too bad now, I was irritating it by putting delays in
various avocado console socket reading, which can trigger it easily (my
guess is due to delay allowing file buffer to pull in more data than is
consumed). With patch the only check-avocado failures I was getting was
some OS watchdog timeouts in their console print code caused by back
pressure.

Thanks,
Nick
Re: [PATCH] tests/avocado: Fix console data loss
Posted by John Snow 7 months, 2 weeks ago
On Wed, Sep 13, 2023 at 4:53 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > Occasionally some avocado tests will fail waiting for console line
> > despite the machine running correctly. Console data goes missing, as can
> > be seen in the console log. This is due to _console_interaction calling
> > makefile() on the console socket each time it is invoked, which must be
> > losing old buffer contents when going out of scope.
> >
> > It is not enough to makefile() with buffered=0. That helps significantly
> > but data loss is still possible. My guess is that readline() has a line
> > buffer even when the file is in unbuffered mode, that can eat data.
> >
> > Fix this by providing a console file that persists for the life of the
> > console.
> >
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>
> Queued to testing/next, thanks.

Sure, ACK.

I didn't realize that avocado was technically creating multiple
buffered files like this.

--js

>
> > ---
> >
> > For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m
> > was flakey for me due to this bug. I don't know why that in particular,
> > 3 calls to wait_for_console_pattern probably helps.
> >
> > I didn't pinpoint when the bug was introduced because the original
> > was probably not buggy because it was only run once at the end of the
> > test. At some point after it was moved to common code, something would
> > have started to call it more than once which is where potential for bug
> > is introduced.
>
> There is a sprawling mass somewhere between:
>
>   - pythons buffering of IO
>   - device models dropping chars when blocked
>   - noisy tests with competing console output
>
> that adds up to unreliable tests that rely on seeing certain patterns on
> the console.
>
> >
> > Thanks,
> > Nick
> >
> >  python/qemu/machine/machine.py         | 19 +++++++++++++++++++
> >  tests/avocado/avocado_qemu/__init__.py |  2 +-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> >
> > diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> > index c16a0b6fed..35d5a672db 100644
> > --- a/python/qemu/machine/machine.py
> > +++ b/python/qemu/machine/machine.py
> > @@ -191,6 +191,7 @@ def __init__(self,
> >              self.sock_dir, f"{self._name}.con"
> >          )
> >          self._console_socket: Optional[socket.socket] = None
> > +        self._console_file: Optional[socket.SocketIO] = None
> >          self._remove_files: List[str] = []
> >          self._user_killed = False
> >          self._quit_issued = False
> > @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None:
> >          # If we keep the console socket open, we may deadlock waiting
> >          # for QEMU to exit, while QEMU is waiting for the socket to
> >          # become writable.
> > +        if self._console_file is not None:
> > +            LOG.debug("Closing console file")
> > +            self._console_file.close()
> > +            self._console_file = None
> > +
> >          if self._console_socket is not None:
> >              LOG.debug("Closing console socket")
> >              self._console_socket.close()
> > @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket:
> >          Returns a socket connected to the console
> >          """
> >          if self._console_socket is None:
> > +            LOG.debug("Opening console socket")
> >              self._console_socket = console_socket.ConsoleSocket(
> >                  self._console_address,
> >                  file=self._console_log_path,
> >                  drain=self._drain_console)
> >          return self._console_socket
> >
> > +    @property
> > +    def console_file(self) -> socket.SocketIO:
> > +        """
> > +        Returns a file associated with the console socket
> > +        """
> > +        if self._console_file is None:
> > +            LOG.debug("Opening console file")
> > +            self._console_file = self.console_socket.makefile(mode='rb',
> > +                                                              buffering=0,
> > +                                                              encoding='utf-8')
> > +        return self._console_file
> > +
> >      @property
> >      def temp_dir(self) -> str:
> >          """
> > diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
> > index 33090903f1..0172a359b7 100644
> > --- a/tests/avocado/avocado_qemu/__init__.py
> > +++ b/tests/avocado/avocado_qemu/__init__.py
> > @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message,
> >      assert not keep_sending or send_string
> >      if vm is None:
> >          vm = test.vm
> > -    console = vm.console_socket.makefile(mode='rb', encoding='utf-8')
> > +    console = vm.console_file
> >      console_logger = logging.getLogger('console')
> >      while True:
> >          if send_string:
>
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
>
Re: [PATCH] tests/avocado: Fix console data loss
Posted by Philippe Mathieu-Daudé 7 months, 2 weeks ago
+Alex who also looked at that IIRC.

On 12/9/23 15:13, Nicholas Piggin wrote:
> Occasionally some avocado tests will fail waiting for console line
> despite the machine running correctly. Console data goes missing, as can
> be seen in the console log. This is due to _console_interaction calling
> makefile() on the console socket each time it is invoked, which must be
> losing old buffer contents when going out of scope.
> 
> It is not enough to makefile() with buffered=0. That helps significantly
> but data loss is still possible. My guess is that readline() has a line
> buffer even when the file is in unbuffered mode, that can eat data.
> 
> Fix this by providing a console file that persists for the life of the
> console.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> 
> For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m
> was flakey for me due to this bug. I don't know why that in particular,
> 3 calls to wait_for_console_pattern probably helps.
> 
> I didn't pinpoint when the bug was introduced because the original
> was probably not buggy because it was only run once at the end of the
> test. At some point after it was moved to common code, something would
> have started to call it more than once which is where potential for bug
> is introduced.
> 
> Thanks,
> Nick
> 
>   python/qemu/machine/machine.py         | 19 +++++++++++++++++++
>   tests/avocado/avocado_qemu/__init__.py |  2 +-
>   2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> index c16a0b6fed..35d5a672db 100644
> --- a/python/qemu/machine/machine.py
> +++ b/python/qemu/machine/machine.py
> @@ -191,6 +191,7 @@ def __init__(self,
>               self.sock_dir, f"{self._name}.con"
>           )
>           self._console_socket: Optional[socket.socket] = None
> +        self._console_file: Optional[socket.SocketIO] = None
>           self._remove_files: List[str] = []
>           self._user_killed = False
>           self._quit_issued = False
> @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None:
>           # If we keep the console socket open, we may deadlock waiting
>           # for QEMU to exit, while QEMU is waiting for the socket to
>           # become writable.
> +        if self._console_file is not None:
> +            LOG.debug("Closing console file")
> +            self._console_file.close()
> +            self._console_file = None
> +
>           if self._console_socket is not None:
>               LOG.debug("Closing console socket")
>               self._console_socket.close()
> @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket:
>           Returns a socket connected to the console
>           """
>           if self._console_socket is None:
> +            LOG.debug("Opening console socket")
>               self._console_socket = console_socket.ConsoleSocket(
>                   self._console_address,
>                   file=self._console_log_path,
>                   drain=self._drain_console)
>           return self._console_socket
>   
> +    @property
> +    def console_file(self) -> socket.SocketIO:
> +        """
> +        Returns a file associated with the console socket
> +        """
> +        if self._console_file is None:
> +            LOG.debug("Opening console file")
> +            self._console_file = self.console_socket.makefile(mode='rb',
> +                                                              buffering=0,
> +                                                              encoding='utf-8')
> +        return self._console_file
> +
>       @property
>       def temp_dir(self) -> str:
>           """
> diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
> index 33090903f1..0172a359b7 100644
> --- a/tests/avocado/avocado_qemu/__init__.py
> +++ b/tests/avocado/avocado_qemu/__init__.py
> @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message,
>       assert not keep_sending or send_string
>       if vm is None:
>           vm = test.vm
> -    console = vm.console_socket.makefile(mode='rb', encoding='utf-8')
> +    console = vm.console_file
>       console_logger = logging.getLogger('console')
>       while True:
>           if send_string:
Re: [PATCH] tests/avocado: Fix console data loss
Posted by Daniel P. Berrangé 7 months, 2 weeks ago
On Tue, Sep 12, 2023 at 11:13:40PM +1000, Nicholas Piggin wrote:
> Occasionally some avocado tests will fail waiting for console line
> despite the machine running correctly. Console data goes missing, as can
> be seen in the console log. This is due to _console_interaction calling
> makefile() on the console socket each time it is invoked, which must be
> losing old buffer contents when going out of scope.
> 
> It is not enough to makefile() with buffered=0. That helps significantly
> but data loss is still possible. My guess is that readline() has a line
> buffer even when the file is in unbuffered mode, that can eat data.
> 
> Fix this by providing a console file that persists for the life of the
> console.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> 
> For some reason, ppc_prep_40p.py:IbmPrep40pMachine.test_openbios_192m
> was flakey for me due to this bug. I don't know why that in particular,
> 3 calls to wait_for_console_pattern probably helps.
> 
> I didn't pinpoint when the bug was introduced because the original
> was probably not buggy because it was only run once at the end of the
> test. At some point after it was moved to common code, something would
> have started to call it more than once which is where potential for bug
> is introduced.

I'm suspecting this patch might also fix many other failures we
see in avocado, all with wait_for_console_pattern in their trace

https://gitlab.com/qemu-project/qemu/-/issues/1884

>  python/qemu/machine/machine.py         | 19 +++++++++++++++++++
>  tests/avocado/avocado_qemu/__init__.py |  2 +-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> index c16a0b6fed..35d5a672db 100644
> --- a/python/qemu/machine/machine.py
> +++ b/python/qemu/machine/machine.py
> @@ -191,6 +191,7 @@ def __init__(self,
>              self.sock_dir, f"{self._name}.con"
>          )
>          self._console_socket: Optional[socket.socket] = None
> +        self._console_file: Optional[socket.SocketIO] = None
>          self._remove_files: List[str] = []
>          self._user_killed = False
>          self._quit_issued = False
> @@ -509,6 +510,11 @@ def _early_cleanup(self) -> None:
>          # If we keep the console socket open, we may deadlock waiting
>          # for QEMU to exit, while QEMU is waiting for the socket to
>          # become writable.
> +        if self._console_file is not None:
> +            LOG.debug("Closing console file")
> +            self._console_file.close()
> +            self._console_file = None
> +
>          if self._console_socket is not None:
>              LOG.debug("Closing console socket")
>              self._console_socket.close()
> @@ -874,12 +880,25 @@ def console_socket(self) -> socket.socket:
>          Returns a socket connected to the console
>          """
>          if self._console_socket is None:
> +            LOG.debug("Opening console socket")
>              self._console_socket = console_socket.ConsoleSocket(
>                  self._console_address,
>                  file=self._console_log_path,
>                  drain=self._drain_console)
>          return self._console_socket
>  
> +    @property
> +    def console_file(self) -> socket.SocketIO:
> +        """
> +        Returns a file associated with the console socket
> +        """
> +        if self._console_file is None:
> +            LOG.debug("Opening console file")
> +            self._console_file = self.console_socket.makefile(mode='rb',
> +                                                              buffering=0,
> +                                                              encoding='utf-8')
> +        return self._console_file
> +
>      @property
>      def temp_dir(self) -> str:
>          """
> diff --git a/tests/avocado/avocado_qemu/__init__.py b/tests/avocado/avocado_qemu/__init__.py
> index 33090903f1..0172a359b7 100644
> --- a/tests/avocado/avocado_qemu/__init__.py
> +++ b/tests/avocado/avocado_qemu/__init__.py
> @@ -137,7 +137,7 @@ def _console_interaction(test, success_message, failure_message,
>      assert not keep_sending or send_string
>      if vm is None:
>          vm = test.vm
> -    console = vm.console_socket.makefile(mode='rb', encoding='utf-8')
> +    console = vm.console_file
>      console_logger = logging.getLogger('console')
>      while True:
>          if send_string:

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|