[PATCH 06/13] scripts/qmp_helper: add support for a timeout logic

Mauro Carvalho Chehab posted 13 patches 2 weeks, 4 days ago
Maintainers: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>, John Snow <jsnow@redhat.com>, Cleber Rosa <crosa@redhat.com>
There is a newer version of this series
[PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Mauro Carvalho Chehab 2 weeks, 4 days ago
We can't inject a new GHES record to the same source before
it has been acked. There is an async mechanism to verify when
the Kernel is ready, which is implemented at QEMU's ghes
driver.

If errors are injected too fast, QEMU may return an error. When
such errors occur, retry the command until a maximum timeout
is reached.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 scripts/qmp_helper.py | 47 +++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
index 40059cd105f6..63f3df2d75c3 100755
--- a/scripts/qmp_helper.py
+++ b/scripts/qmp_helper.py
@@ -14,6 +14,7 @@
 
 from datetime import datetime
 from os import path as os_path
+from time import sleep
 
 try:
     qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
@@ -324,7 +325,8 @@ class qmp:
     Opens a connection and send/receive QMP commands.
     """
 
-    def send_cmd(self, command, args=None, may_open=False, return_error=True):
+    def send_cmd(self, command, args=None, may_open=False, return_error=True,
+                 timeout=None):
         """Send a command to QMP, optinally opening a connection"""
 
         if may_open:
@@ -336,12 +338,31 @@ def send_cmd(self, command, args=None, may_open=False, return_error=True):
         if args:
             msg['arguments'] = args
 
-        try:
-            obj = self.qmp_monitor.cmd_obj(msg)
-        # Can we use some other exception class here?
-        except Exception as e:                         # pylint: disable=W0718
-            print(f"Command: {command}")
-            print(f"Failed to inject error: {e}.")
+        if timeout and timeout > 0:
+            attempts = int(timeout * 10)
+        else:
+            attempts = 1
+
+        # Try up to attempts
+        for i in range(0, attempts):
+            try:
+                obj = self.qmp_monitor.cmd_obj(msg)
+
+                if obj and "return" in obj and not obj["return"]:
+                    break
+
+            except Exception as e:                     # pylint: disable=W0718
+                print(f"Command: {command}")
+                print(f"Failed to inject error: {e}.")
+                obj = None
+
+            if attempts > 1:
+                print(f"Error inject attempt {i + 1}/{attempts} failed.")
+
+            if i + 1 < attempts:
+                sleep(0.1)
+
+        if not obj:
             return None
 
         if "return" in obj:
@@ -531,7 +552,7 @@ def __init__(self, host, port, debug=False):
     #
     # Socket QMP send command
     #
-    def send_cper_raw(self, cper_data):
+    def send_cper_raw(self, cper_data, timeout=None):
         """
         Send a raw CPER data to QEMU though QMP TCP socket.
 
@@ -546,11 +567,11 @@ def send_cper_raw(self, cper_data):
 
         self._connect()
 
-        if self.send_cmd("inject-ghes-v2-error", cmd_arg):
+        ret = self.send_cmd("inject-ghes-v2-error", cmd_arg, timeout=timeout)
+        if ret:
             print("Error injected.")
-            return True
 
-        return False
+        return ret
 
     def get_gede(self, notif_type, payload_length):
         """
@@ -597,7 +618,7 @@ def get_gebs(self, payload_length):
         return gebs
 
     def send_cper(self, notif_type, payload,
-                  gede=None, gebs=None, raw_data=None):
+                  gede=None, gebs=None, raw_data=None, timeout=None):
         """
         Send commands to QEMU though QMP TCP socket.
 
@@ -656,7 +677,7 @@ def send_cper(self, notif_type, payload,
 
             util.dump_bytearray("Payload", payload)
 
-        return self.send_cper_raw(cper_data)
+        return self.send_cper_raw(cper_data, timeout=timeout)
 
     def search_qom(self, path, prop, regex):
         """
-- 
2.52.0
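
The retry logic in the send_cmd() hunk above can be tried outside of QEMU. What follows is a minimal sketch, not the code from the patch: `send_with_retry`, `send_once` and the injectable `sleep_fn` are hypothetical names introduced for illustration, with `send_once` standing in for `self.qmp_monitor.cmd_obj(msg)`. It assumes, as the hunk does, that a successful reply is a dict whose "return" value is empty. One deliberate difference: the sketch always makes at least one attempt, whereas in the posted hunk a timeout below 0.1 would give `int(timeout * 10) == 0` and the loop body would never run.

```python
from time import sleep


def send_with_retry(send_once, timeout=None, sleep_fn=sleep):
    """Call send_once() until it succeeds or the attempts run out.

    send_once is a hypothetical callable standing in for
    self.qmp_monitor.cmd_obj(msg); it returns a reply dict or raises.
    """
    # Ten attempts per second of timeout, as in the patch, but never
    # fewer than one attempt (this lower bound is not in the patch).
    if timeout and timeout > 0:
        attempts = max(1, int(timeout * 10))
    else:
        attempts = 1

    obj = None
    for i in range(attempts):
        try:
            obj = send_once()
            # As in the patch: an empty "return" value means success.
            if obj and "return" in obj and not obj["return"]:
                break
        except Exception as e:  # pylint: disable=W0718
            print(f"Attempt {i + 1}/{attempts} failed: {e}")
            obj = None

        # Sleep only when another attempt will follow.
        if i + 1 < attempts:
            sleep_fn(0.1)

    return obj
```

Passing `sleep_fn` as a parameter is only a testing convenience: it lets a caller count the sleeps without actually waiting.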
Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Jonathan Cameron via qemu development 2 weeks, 4 days ago
On Wed, 21 Jan 2026 12:25:14 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> We can't inject a new GHES record to the same source before
> it has been acked. There is an async mechanism to verify when
> the Kernel is ready, which is implemented at QEMU's ghes
> driver.
> 
> If error inject is too fast, QEMU may return an error. When
> such errors occur, implement a retry mechanism, based on a
> maximum timeout.
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
A few trivial comments below. Either way this seems fine to me and
should make the tooling easier to use.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  scripts/qmp_helper.py | 47 +++++++++++++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> index 40059cd105f6..63f3df2d75c3 100755
> --- a/scripts/qmp_helper.py
> +++ b/scripts/qmp_helper.py
> @@ -14,6 +14,7 @@
>  
>  from datetime import datetime
>  from os import path as os_path
> +from time import sleep
>  
>  try:
>      qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
> @@ -324,7 +325,8 @@ class qmp:
>      Opens a connection and send/receive QMP commands.
>      """
>  
> -    def send_cmd(self, command, args=None, may_open=False, return_error=True):
> +    def send_cmd(self, command, args=None, may_open=False, return_error=True,
> +                 timeout=None):
>          """Send a command to QMP, optinally opening a connection"""
>  
>          if may_open:
> @@ -336,12 +338,31 @@ def send_cmd(self, command, args=None, may_open=False, return_error=True):
>          if args:
>              msg['arguments'] = args
>  
> -        try:
> -            obj = self.qmp_monitor.cmd_obj(msg)
> -        # Can we use some other exception class here?
> -        except Exception as e:                         # pylint: disable=W0718
> -            print(f"Command: {command}")
> -            print(f"Failed to inject error: {e}.")
> +        if timeout and timeout > 0:
> +            attempts = int(timeout * 10)
> +        else:
> +            attempts = 1
> +
> +        # Try up to attempts
That reads oddly because of the variable name.  Made me ask myself
"How many attempts?"
Maybe  " Retry up to attempts times" or something like that.

> +        for i in range(0, attempts):
> +            try:
> +                obj = self.qmp_monitor.cmd_obj(msg)
> +
> +                if obj and "return" in obj and not obj["return"]:
> +                    break
> +
> +            except Exception as e:                     # pylint: disable=W0718
> +                print(f"Command: {command}")
> +                print(f"Failed to inject error: {e}.")
> +                obj = None
> +
> +            if attempts > 1:
> +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> +
> +            if i + 1 < attempts:
> +                sleep(0.1)

Do we care about a sleep at the end?  Feels like a micro optimization that
isn't needed.

> +
> +        if not obj:
>              return None
Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Mauro Carvalho Chehab 2 weeks, 4 days ago
On Wed, Jan 21, 2026 at 12:39:27PM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 12:25:14 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:
> 
> > We can't inject a new GHES record to the same source before
> > it has been acked. There is an async mechanism to verify when
> > the Kernel is ready, which is implemented at QEMU's ghes
> > driver.
> > 
> > If error inject is too fast, QEMU may return an error. When
> > such errors occur, implement a retry mechanism, based on a
> > maximum timeout.
> > 
> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> A few trivial comments below. Either way this seems fine to me and
> should make the tooling easier to use.
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 
> > ---
> >  scripts/qmp_helper.py | 47 +++++++++++++++++++++++++++++++------------
> >  1 file changed, 34 insertions(+), 13 deletions(-)
> > 
> > diff --git a/scripts/qmp_helper.py b/scripts/qmp_helper.py
> > index 40059cd105f6..63f3df2d75c3 100755
> > --- a/scripts/qmp_helper.py
> > +++ b/scripts/qmp_helper.py
> > @@ -14,6 +14,7 @@
> >  
> >  from datetime import datetime
> >  from os import path as os_path
> > +from time import sleep
> >  
> >  try:
> >      qemu_dir = os_path.abspath(os_path.dirname(os_path.dirname(__file__)))
> > @@ -324,7 +325,8 @@ class qmp:
> >      Opens a connection and send/receive QMP commands.
> >      """
> >  
> > -    def send_cmd(self, command, args=None, may_open=False, return_error=True):
> > +    def send_cmd(self, command, args=None, may_open=False, return_error=True,
> > +                 timeout=None):
> >          """Send a command to QMP, optinally opening a connection"""
> >  
> >          if may_open:
> > @@ -336,12 +338,31 @@ def send_cmd(self, command, args=None, may_open=False, return_error=True):
> >          if args:
> >              msg['arguments'] = args
> >  
> > -        try:
> > -            obj = self.qmp_monitor.cmd_obj(msg)
> > -        # Can we use some other exception class here?
> > -        except Exception as e:                         # pylint: disable=W0718
> > -            print(f"Command: {command}")
> > -            print(f"Failed to inject error: {e}.")
> > +        if timeout and timeout > 0:
> > +            attempts = int(timeout * 10)
> > +        else:
> > +            attempts = 1
> > +
> > +        # Try up to attempts
> That reads oddly because of the variable name.  Made me ask myself
> "How many attempts?"
> Maybe  " Retry up to attempts times" or something like that.

I'll improve the comment. The goal here is to keep trying for about
"timeout" seconds.

That's why we multiply it by 10...

> 
> > +        for i in range(0, attempts):
> > +            try:
> > +                obj = self.qmp_monitor.cmd_obj(msg)
> > +
> > +                if obj and "return" in obj and not obj["return"]:
> > +                    break
> > +
> > +            except Exception as e:                     # pylint: disable=W0718
> > +                print(f"Command: {command}")
> > +                print(f"Failed to inject error: {e}.")
> > +                obj = None
> > +
> > +            if attempts > 1:
> > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > +
> > +            if i + 1 < attempts:
> > +                sleep(0.1)

... and here, we sleep for 0.1 seconds.

> 
> Do we care about a sleep at the end?  Feels like a micro optimization that
> isn't needed.

This is not a micro-optimization. It is more to ensure that we won't
respin it too fast.

What happens is that the QMP interface asks the BIOS to send an async
message to the OSPM, clearing an ack register. When the OSPM reads the
error, it writes 1 to the ack register.

If we send messages too fast, the logic at ghes.c will detect that
the ack didn't happen, immediately returning an error code.

In such a case, we sleep for 100ms before trying again.

In practice, on my Ryzen 9 machines with QEMU emulating ARM,
even under massive error injection, 99% of the time no retries
happen. The worst-case scenario I saw here is that the kernel
sometimes got stuck and took between 5s and 10s to accept the
error submission.

> 
> > +
> > +        if not obj:
> >              return None
> 
> 

-- 
Thanks,
Mauro
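
The handshake described in this message can be pictured with a toy model. It is only a sketch of the behaviour as described above, not QEMU's ghes.c: the class `AckRegisterModel` and its methods are hypothetical, and the real mechanism goes through firmware and the guest kernel rather than Python method calls.

```python
class AckRegisterModel:
    """Toy model of one GHES error source and its ack register."""

    def __init__(self):
        # 1 means the OSPM has consumed the previous record.
        self.ack = 1

    def inject(self):
        """Try to inject a record; refuse if the last one is not acked."""
        if self.ack != 1:
            return {"error": "previous record not yet acknowledged"}
        # Injection clears the ack register and notifies the OSPM.
        self.ack = 0
        return {"return": {}}

    def ospm_reads_error(self):
        """The OSPM consumes the record and writes 1 to the ack register."""
        self.ack = 1


source = AckRegisterModel()
first = source.inject()      # accepted
second = source.inject()     # too fast: the ack has not happened yet
source.ospm_reads_error()    # the guest acknowledges the record
third = source.inject()      # accepted again
```

In this model the second injection is the case the retry loop is meant to absorb: waiting and trying again succeeds once the guest has written the ack.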
Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Jonathan Cameron via qemu development 2 weeks, 2 days ago
> >   
> > > +        for i in range(0, attempts):
> > > +            try:
> > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > +
> > > +                if obj and "return" in obj and not obj["return"]:
> > > +                    break
> > > +
> > > +            except Exception as e:                     # pylint: disable=W0718
> > > +                print(f"Command: {command}")
> > > +                print(f"Failed to inject error: {e}.")
> > > +                obj = None
> > > +
> > > +            if attempts > 1:
> > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > +
> > > +            if i + 1 < attempts:
> > > +                sleep(0.1)  
> 
> ... and here, we sleep for 0.1 seconds.
> 
> > 
> > Do we care about a sleep at the end?  Feels like a micro optimization that
> > isn't needed.  
> 
> This is not a micro-optimization. It is more to ensure that we won't
> respin it too fast.
> 
> What happens is that QMP interface asks the BIOS to send an async
> message to OSPM, cleaning an ack register. When the OSPM reads the
> error, it writes 1 to the ack register.
> 
> If we send messages too fast, the logic at ghes.c will detect that
> the ack didn't happen, immediately returning an error code.
> 
> On such case, we sleep for 100ms before trying again.
I was suggesting the opposite.  Just sleep one more time at the end
before timing out.
So instead of
	if i + 1 < attempts:
		sleep(0.1)

simply
	sleep(0.1)



> 
> In practice, on my Ryzen 9 machines with QEMU emulating ARM,
> even under massive error injection, 99% of the time no retries
> happen. The worse case scenario I got here is that sometimes
> Kernel got stuck and took between 5s to 10s to accept the error
> submission.
> 
> >   
> > > +
> > > +        if not obj:
> > >              return None  
> > 
> >   
>
Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Mauro Carvalho Chehab 1 week, 6 days ago
On Fri, 23 Jan 2026 16:16:03 +0000
Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:

> > >     
> > > > +        for i in range(0, attempts):
> > > > +            try:
> > > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > > +
> > > > +                if obj and "return" in obj and not obj["return"]:
> > > > +                    break
> > > > +
> > > > +            except Exception as e:                     # pylint: disable=W0718
> > > > +                print(f"Command: {command}")
> > > > +                print(f"Failed to inject error: {e}.")
> > > > +                obj = None
> > > > +
> > > > +            if attempts > 1:
> > > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > > +
> > > > +            if i + 1 < attempts:
> > > > +                sleep(0.1)    
> > 
> > ... and here, we sleep for 0.1 seconds.
> >   
> > > 
> > > Do we care about a sleep at the end?  Feels like a micro optimization that
> > > isn't needed.    
> > 
> > This is not a micro-optimization. It is more to ensure that we won't
> > respin it too fast.
> > 
> > What happens is that QMP interface asks the BIOS to send an async
> > message to OSPM, cleaning an ack register. When the OSPM reads the
> > error, it writes 1 to the ack register.
> > 
> > If we send messages too fast, the logic at ghes.c will detect that
> > the ack didn't happen, immediately returning an error code.
> > 
> > On such case, we sleep for 100ms before trying again.  
> I was suggesting the opposite.  Just sleep one more time at the end
> before timing out.
> So instead of
> 	if i + 1 < attempts
> 		sleep(0.1)
> 
> simply
> 	sleep(0.1)

If one writes an external loop calling fuzzy with different parameters,
like:

	for i in $(seq 1 360000); do
            scripts/ghes_inject.py fuzzy -T proc-arm;
            scripts/ghes_inject.py fuzzy -T firmware-error;
        done

The extra unneeded sleep would waste 10 hours doing nothing.

Regards,
Mauro
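
The 10-hour figure can be checked with a little arithmetic. This is a rough sketch with illustrative names only. It assumes the worst case, in which every invocation ends on a failed final attempt and so pays one extra 0.1 s sleep; going by the posted loop, a successful attempt breaks out before the sleep is reached. The figure matches 360000 invocations; the shell loop as written runs two invocations per iteration, which would double the total.

```python
EXTRA_SLEEP_MS = 100       # one extra sleep(0.1) per failed invocation
MS_PER_HOUR = 3_600_000


def wasted_hours(invocations):
    """Hours spent in the extra sleep if every invocation fails."""
    return invocations * EXTRA_SLEEP_MS / MS_PER_HOUR


print(wasted_hours(360_000))      # 360000 invocations: 10.0
print(wasted_hours(2 * 360_000))  # two invocations per iteration: 20.0
```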
Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Mauro Carvalho Chehab 1 week, 6 days ago
On Mon, 26 Jan 2026 12:23:30 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> (by way of Mauro Carvalho Chehab <mchehab+huawei@kernel.org>) wrote:

> On Fri, 23 Jan 2026 16:16:03 +0000
> Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:
> 
> > > >     
> > > > > +        for i in range(0, attempts):
> > > > > +            try:
> > > > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > > > +
> > > > > +                if obj and "return" in obj and not obj["return"]:
> > > > > +                    break
> > > > > +
> > > > > +            except Exception as e:                     # pylint: disable=W0718
> > > > > +                print(f"Command: {command}")
> > > > > +                print(f"Failed to inject error: {e}.")
> > > > > +                obj = None
> > > > > +
> > > > > +            if attempts > 1:
> > > > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > > > +
> > > > > +            if i + 1 < attempts:
> > > > > +                sleep(0.1)    
> > > 
> > > ... and here, we sleep for 0.1 seconds.
> > >   
> > > > 
> > > > Do we care about a sleep at the end?  Feels like a micro optimization that
> > > > isn't needed.    
> > > 
> > > This is not a micro-optimization. It is more to ensure that we won't
> > > respin it too fast.
> > > 
> > > What happens is that QMP interface asks the BIOS to send an async
> > > message to OSPM, cleaning an ack register. When the OSPM reads the
> > > error, it writes 1 to the ack register.
> > > 
> > > If we send messages too fast, the logic at ghes.c will detect that
> > > the ack didn't happen, immediately returning an error code.
> > > 
> > > On such case, we sleep for 100ms before trying again.  
> > I was suggesting the opposite.  Just sleep one more time at the end
> > before timing out.
> > So instead of
> > 	if i + 1 < attempts
> > 		sleep(0.1)
> > 
> > simply
> > 	sleep(0.1)
> 
> If one writes an external loop calling fuzzy with different parameters,
> like:
> 
> 	for i in $(seq 1 360000); do
>             scripts/ghes_inject.py fuzzy -T proc-arm;
>             scripts/ghes_inject.py fuzzy -T firmware-error;
>         done
> 
> The extra unneeded sleep would waste 10 hours doing nothing.

Btw, the same applies when using the -c parameter:

             scripts/ghes_inject.py fuzzy -T proc-arm -c 360000

The goal here is to optimize in a way that we could one day have a
CI running lots of tests in a reasonable time to detect regressions
in QEMU + Linux Kernel + rasdaemon.

So, we don't want unneeded delays. We only need to sleep if an
attempt failed and another attempt will follow.

Regards,
Re: [PATCH 06/13] scripts/qmp_helper: add support for a timeout logic
Posted by Jonathan Cameron via qemu development 1 week, 6 days ago
On Mon, 26 Jan 2026 12:29:32 +0100
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> On Mon, 26 Jan 2026 12:23:30 +0100
> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> (by way of Mauro Carvalho Chehab <mchehab+huawei@kernel.org>) wrote:
> 
> > On Fri, 23 Jan 2026 16:16:03 +0000
> > Jonathan Cameron via qemu development <qemu-devel@nongnu.org> wrote:
> >   
> > > > >       
> > > > > > +        for i in range(0, attempts):
> > > > > > +            try:
> > > > > > +                obj = self.qmp_monitor.cmd_obj(msg)
> > > > > > +
> > > > > > +                if obj and "return" in obj and not obj["return"]:
> > > > > > +                    break
> > > > > > +
> > > > > > +            except Exception as e:                     # pylint: disable=W0718
> > > > > > +                print(f"Command: {command}")
> > > > > > +                print(f"Failed to inject error: {e}.")
> > > > > > +                obj = None
> > > > > > +
> > > > > > +            if attempts > 1:
> > > > > > +                print(f"Error inject attempt {i + 1}/{attempts} failed.")
> > > > > > +
> > > > > > +            if i + 1 < attempts:
> > > > > > +                sleep(0.1)      
> > > > 
> > > > ... and here, we sleep for 0.1 seconds.
> > > >     
> > > > > 
> > > > > Do we care about a sleep at the end?  Feels like a micro optimization that
> > > > > isn't needed.      
> > > > 
> > > > This is not a micro-optimization. It is more to ensure that we won't
> > > > respin it too fast.
> > > > 
> > > > What happens is that QMP interface asks the BIOS to send an async
> > > > message to OSPM, cleaning an ack register. When the OSPM reads the
> > > > error, it writes 1 to the ack register.
> > > > 
> > > > If we send messages too fast, the logic at ghes.c will detect that
> > > > the ack didn't happen, immediately returning an error code.
> > > > 
> > > > On such case, we sleep for 100ms before trying again.    
> > > I was suggesting the opposite.  Just sleep one more time at the end
> > > before timing out.
> > > So instead of
> > > 	if i + 1 < attempts
> > > 		sleep(0.1)
> > > 
> > > simply
> > > 	sleep(0.1)  
> > 
> > If one writes an external loop calling fuzzy with different parameters,
> > like:
> > 
> > 	for i in $(seq 1 360000); do
> >             scripts/ghes_inject.py fuzzy -T proc-arm;
> >             scripts/ghes_inject.py fuzzy -T firmware-error;
> >         done
> > 
> > The extra unneeded sleep would waste 10 hours doing nothing.

True if it fails every time, which you were suggesting was very rare. 

Anyhow, I really don't mind that much; it just seemed a tiny
bit over-engineered for a rare case.
 
> 
> Btw, the same applies when using the -c parameter:
> 
>              scripts/ghes_inject.py fuzzy -T proc-arm -c 360000
> 
> The goal here is to optimize in a way that we could one day have a
> CI running lots of tests in a reasonable time to detect regressions
> at QEMU + Linux Kernel + rasdaemon.
> 
> So, we don't want unneeded delays. We only need to sleep if a
> retry attempt failed and it will be retrying again.
> 
> Regards,
>