CI: Rework RISCV smoke test

[XEN PATCH] CI: Rework RISCV smoke test

Posted by Anthony PERARD 5 months, 3 weeks ago

Currently, the test rely on QEMU and Xen finishing the boot in under
two seconds. That's both very long and very short. Xen usually managed
to print "All set up" under a second. Unless for some reason we try to
run the test on a machine that's busy doing something else.

Rework the test to exit as soon as Xen is done.

There's two `tail -f`, the first one is there simply to monitor test
progress in GitLab console. The second one is used to detect the test
result as soon as QEMU add it to the file. Both `tail` exit as soon as
QEMU exit.

If QEMU fails at start, and exit early, both `tail` will simply exit,
resulting in a failure.

If the line we are looking for is never printed, the `timeout` on the
second `tail` will force the test to exit with an error.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
---

Notes:
    The "machine might be busy" bits refere to having a gitlab-runner
    running two jobs, one been a build with no restriction on CPU resource use.
    
    We could use "-daemonize" with "-pidfile" to more easly detect qemu's
    initialisation failure, but needs "-display none" instead of
    "-nographic"

 automation/scripts/qemu-smoke-riscv64.sh | 32 ++++++++++++++++++------
 1 file changed, 24 insertions(+), 8 deletions(-)

diff --git a/automation/scripts/qemu-smoke-riscv64.sh b/automation/scripts/qemu-smoke-riscv64.sh
index 4008191302..ba7b61db8b 100755
--- a/automation/scripts/qemu-smoke-riscv64.sh
+++ b/automation/scripts/qemu-smoke-riscv64.sh
@@ -2,19 +2,35 @@
 
 set -ex
 
-# Run the test
-rm -f smoke.serial
-set +e
+# Truncate or create serial output file
+echo -n > smoke.serial
 
-timeout -k 1 2 \
+# cleanup: kill QEMU when the script exit for any reason
+qemu_pid=
+cleanup() {
+    if [ "$qemu_pid" ]; then
+        kill -s SIGKILL "$qemu_pid"
+    fi
+}
+
+trap 'cleanup' EXIT
+
+# Run the test
 qemu-system-riscv64 \
     -M virt \
     -smp 1 \
     -nographic \
     -m 2g \
+    -monitor none \
+    -chardev file,id=serial-out,path=smoke.serial \
+    -serial chardev:serial-out \
     -kernel binaries/xen \
-    |& tee smoke.serial
+    &
+qemu_pid=$!
 
-set -e
-(grep -q "All set up" smoke.serial) || exit 1
-exit 0
+# Monitor test progression until QEMU exit
+tail --pid=$qemu_pid -f smoke.serial &
+
+# Check boot test result
+timeout 60 tail --pid=$qemu_pid -f smoke.serial | \
+    grep -a -q "All set up"
-- 
Anthony PERARD

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Andrew Cooper 5 months, 3 weeks ago

On 09/11/2023 3:49 pm, Anthony PERARD wrote:
> Currently, the test rely on QEMU and Xen finishing the boot in under
> two seconds. That's both very long and very short. Xen usually managed
> to print "All set up" under a second. Unless for some reason we try to
> run the test on a machine that's busy doing something else.
>
> Rework the test to exit as soon as Xen is done.
>
> There's two `tail -f`, the first one is there simply to monitor test
> progress in GitLab console. The second one is used to detect the test
> result as soon as QEMU add it to the file. Both `tail` exit as soon as
> QEMU exit.
>
> If QEMU fails at start, and exit early, both `tail` will simply exit,
> resulting in a failure.
>
> If the line we are looking for is never printed, the `timeout` on the
> second `tail` will force the test to exit with an error.
>
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>

Looks plausible, but all these qemu-smoke scripts are pretty similar,
and copied from one-another.

We should make this change consistently to all testing (there's nothing
RISC-V specific about why this test is failing on this runner), and it
would be really nice if we could try to make it a bit more common than
it currently is.

~Andrew

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Anthony PERARD 5 months, 2 weeks ago

On Thu, Nov 09, 2023 at 04:52:36PM +0000, Andrew Cooper wrote:
> On 09/11/2023 3:49 pm, Anthony PERARD wrote:
> > Currently, the test rely on QEMU and Xen finishing the boot in under
> > two seconds. That's both very long and very short. Xen usually managed
> > to print "All set up" under a second. Unless for some reason we try to
> > run the test on a machine that's busy doing something else.
> >
> > Rework the test to exit as soon as Xen is done.
> >
> > There's two `tail -f`, the first one is there simply to monitor test
> > progress in GitLab console. The second one is used to detect the test
> > result as soon as QEMU add it to the file. Both `tail` exit as soon as
> > QEMU exit.
> >
> > If QEMU fails at start, and exit early, both `tail` will simply exit,
> > resulting in a failure.
> >
> > If the line we are looking for is never printed, the `timeout` on the
> > second `tail` will force the test to exit with an error.
> >
> > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> 
> Looks plausible, but all these qemu-smoke scripts are pretty similar,
> and copied from one-another.
> 
> We should make this change consistently to all testing (there's nothing
> RISC-V specific about why this test is failing on this runner), and it
> would be really nice if we could try to make it a bit more common than
> it currently is.

Yes, it would be nice if a change to a qemu-smoke script was applied to
every other one. But making those scripts more generic is a lot more
works, which would be useful to apply a change once for all.

The problem I'm trying to resolve only appear with this script, because
of a timeout been too short, a solution could just be to increase the
timeout (or not allowing runners to do more than one thing at a time).

BTW, the last time I've been told to apply a change to other things, I
never managed to finish it and the change wasn't applied at all (would
have result in some containers been smaller).

I guess will see if anyone complain about the test randomly failing. :-)

Cheers,

-- 
Anthony PERARD

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Stefano Stabellini 5 months, 2 weeks ago

On Mon, 13 Nov 2023, Anthony PERARD wrote:
> On Thu, Nov 09, 2023 at 04:52:36PM +0000, Andrew Cooper wrote:
> > On 09/11/2023 3:49 pm, Anthony PERARD wrote:
> > > Currently, the test rely on QEMU and Xen finishing the boot in under
> > > two seconds. That's both very long and very short. Xen usually managed
> > > to print "All set up" under a second. Unless for some reason we try to
> > > run the test on a machine that's busy doing something else.
> > >
> > > Rework the test to exit as soon as Xen is done.
> > >
> > > There's two `tail -f`, the first one is there simply to monitor test
> > > progress in GitLab console. The second one is used to detect the test
> > > result as soon as QEMU add it to the file. Both `tail` exit as soon as
> > > QEMU exit.
> > >
> > > If QEMU fails at start, and exit early, both `tail` will simply exit,
> > > resulting in a failure.
> > >
> > > If the line we are looking for is never printed, the `timeout` on the
> > > second `tail` will force the test to exit with an error.
> > >
> > > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> > 
> > Looks plausible, but all these qemu-smoke scripts are pretty similar,
> > and copied from one-another.
> > 
> > We should make this change consistently to all testing (there's nothing
> > RISC-V specific about why this test is failing on this runner), and it
> > would be really nice if we could try to make it a bit more common than
> > it currently is.
> 
> Yes, it would be nice if a change to a qemu-smoke script was applied to
> every other one. But making those scripts more generic is a lot more
> works, which would be useful to apply a change once for all.
> 
> The problem I'm trying to resolve only appear with this script, because
> of a timeout been too short, a solution could just be to increase the
> timeout (or not allowing runners to do more than one thing at a time).
> 
> BTW, the last time I've been told to apply a change to other things, I
> never managed to finish it and the change wasn't applied at all (would
> have result in some containers been smaller).
> 
> I guess will see if anyone complain about the test randomly failing. :-)

Hi Anthony, I think it is OK if you only fix this script for now. As
part of the tests we are writing for Xen FuSa we want to improve the
basic gitlab testing infrastructure and make the code common like Andrew
asked (we haven't done it yet but soon). We should get to it in the next
few months.

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Stefano Stabellini 5 months, 3 weeks ago

On Thu, 9 Nov 2023, Andrew Cooper wrote:
> On 09/11/2023 3:49 pm, Anthony PERARD wrote:
> > Currently, the test rely on QEMU and Xen finishing the boot in under
> > two seconds. That's both very long and very short. Xen usually managed
> > to print "All set up" under a second. Unless for some reason we try to
> > run the test on a machine that's busy doing something else.
> >
> > Rework the test to exit as soon as Xen is done.
> >
> > There's two `tail -f`, the first one is there simply to monitor test
> > progress in GitLab console. The second one is used to detect the test
> > result as soon as QEMU add it to the file. Both `tail` exit as soon as
> > QEMU exit.
> >
> > If QEMU fails at start, and exit early, both `tail` will simply exit,
> > resulting in a failure.
> >
> > If the line we are looking for is never printed, the `timeout` on the
> > second `tail` will force the test to exit with an error.
> >
> > Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> 
> Looks plausible, but all these qemu-smoke scripts are pretty similar,
> and copied from one-another.
> 
> We should make this change consistently to all testing (there's nothing
> RISC-V specific about why this test is failing on this runner), and it
> would be really nice if we could try to make it a bit more common than
> it currently is.

One idea you might like is do to the following (this was developed by
Oleksandr).

---

# Run the test
rm -f smoke.serial
set +e
export qemu_cmd="./binaries/qemu-system-aarch64 \
    -machine virtualization=true \
    -cpu cortex-a57 -machine type=virt \
    -m 2048 -monitor none -serial stdio \
    -smp 2 \
    -no-reboot \
    -device virtio-net-pci,netdev=n0 \
    -netdev user,id=n0,tftp=binaries \
    -bios /usr/lib/u-boot/qemu_arm64/u-boot.bin"

export qemu_arg="virtio scan; dhcp; tftpb 0x40000000 boot.scr; source 0x40000000"
export qemu_log="smoke.serial"
export LOG_MSG="Run /init as init process"
timeout -k 1 60 ./automation/scripts/qemu_key.sh

### qemu_key.sh is using "expect", see below. I think we should be able
### to achieve the same by using expect to close on the expected string
### (instead of waiting for eof)

#!/usr/bin/expect -f

set timeout -1

log_file -a $env(qemu_log)

match_max 10000

eval spawn $env(qemu_cmd)

expect "=>"

send "$env(qemu_arg)\r"

expect "$env(LOG_MSG)\r"

send "$env(XEN_CMD)\r"

expect eof

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Anthony PERARD 5 months, 2 weeks ago

On Thu, Nov 09, 2023 at 05:02:08PM -0800, Stefano Stabellini wrote:
> ### qemu_key.sh is using "expect", see below. I think we should be able
> ### to achieve the same by using expect to close on the expected string
> ### (instead of waiting for eof)

So, `expect` is just a different kind of language than shell is?
Also, `grep -q` doesn't wait for EOF, and just exit as soon as it found
the pattern.

Cheers,

-- 
Anthony PERARD

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Stefano Stabellini 5 months, 2 weeks ago

+Artem, Oleksandr

On Mon, 13 Nov 2023, Anthony PERARD wrote:
> On Thu, Nov 09, 2023 at 05:02:08PM -0800, Stefano Stabellini wrote:
> > ### qemu_key.sh is using "expect", see below. I think we should be able
> > ### to achieve the same by using expect to close on the expected string
> > ### (instead of waiting for eof)
> 
> So, `expect` is just a different kind of language than shell is?
> Also, `grep -q` doesn't wait for EOF, and just exit as soon as it found
> the pattern.

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Artem Mygaiev 5 months, 2 weeks ago

Hi Stefano, Anthony

On Mon, 13 Nov 2023, Anthony PERARD wrote:
> On Thu, Nov 09, 2023 at 05:02:08PM -0800, Stefano Stabellini wrote:
> > ### qemu_key.sh is using "expect", see below. I think we should be able
> > ### to achieve the same by using expect to close on the expected string
> > ### (instead of waiting for eof)
> 
> So, `expect` is just a different kind of language than shell is?
> Also, `grep -q` doesn't wait for EOF, and just exit as soon as it found
> the pattern.

Just a brief background on `expect`: it is a nice tool widely used in testing of
embedded Linux systems with CLI. It is popular because it is small and available
for basically every distro including very old ones, and it is very flexible. We
were using it for ages in automated testing, so for me it was no-brainer to pick
it when we needed to "interact" with console.

One minor drawback though is that `expect` uses Tcl for scripting, not shell :)

 -- Artem

Re: [XEN PATCH] CI: Rework RISCV smoke test

Posted by Oleksii 5 months, 3 weeks ago

Hello Anthony,

On Thu, 2023-11-09 at 15:49 +0000, Anthony PERARD wrote:
> Currently, the test rely on QEMU and Xen finishing the boot in under
> two seconds. That's both very long and very short. Xen usually
> managed
> to print "All set up" under a second. Unless for some reason we try
> to
> run the test on a machine that's busy doing something else.
> 
> Rework the test to exit as soon as Xen is done.
> 
> There's two `tail -f`, the first one is there simply to monitor test
> progress in GitLab console. The second one is used to detect the test
> result as soon as QEMU add it to the file. Both `tail` exit as soon
> as
> QEMU exit.
> 
> If QEMU fails at start, and exit early, both `tail` will simply exit,
> resulting in a failure.
> 
> If the line we are looking for is never printed, the `timeout` on the
> second `tail` will force the test to exit with an error.
> 
> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
> ---
> 
> Notes:
>     The "machine might be busy" bits refere to having a gitlab-runner
>     running two jobs, one been a build with no restriction on CPU
> resource use.
>     
>     We could use "-daemonize" with "-pidfile" to more easly detect
> qemu's
>     initialisation failure, but needs "-display none" instead of
>     "-nographic"
> 
>  automation/scripts/qemu-smoke-riscv64.sh | 32 ++++++++++++++++++----
> --
>  1 file changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/automation/scripts/qemu-smoke-riscv64.sh
> b/automation/scripts/qemu-smoke-riscv64.sh
> index 4008191302..ba7b61db8b 100755
> --- a/automation/scripts/qemu-smoke-riscv64.sh
> +++ b/automation/scripts/qemu-smoke-riscv64.sh
> @@ -2,19 +2,35 @@
>  
>  set -ex
>  
> -# Run the test
> -rm -f smoke.serial
> -set +e
> +# Truncate or create serial output file
> +echo -n > smoke.serial
>  
> -timeout -k 1 2 \
> +# cleanup: kill QEMU when the script exit for any reason
> +qemu_pid=
> +cleanup() {
> +    if [ "$qemu_pid" ]; then
> +        kill -s SIGKILL "$qemu_pid"
> +    fi
> +}
> +
> +trap 'cleanup' EXIT
> +
> +# Run the test
>  qemu-system-riscv64 \
>      -M virt \
>      -smp 1 \
>      -nographic \
>      -m 2g \
> +    -monitor none \
> +    -chardev file,id=serial-out,path=smoke.serial \
> +    -serial chardev:serial-out \
>      -kernel binaries/xen \
> -    |& tee smoke.serial
> +    &
> +qemu_pid=$!
>  
> -set -e
> -(grep -q "All set up" smoke.serial) || exit 1
> -exit 0
> +# Monitor test progression until QEMU exit
> +tail --pid=$qemu_pid -f smoke.serial &
> +
> +# Check boot test result
> +timeout 60 tail --pid=$qemu_pid -f smoke.serial | \
> +    grep -a -q "All set up"

I am OK with provided changes. Thanks!

Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~ Oleksii