All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs
@ 2023-02-09  8:09 Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 1/8] oeqa ssh.py: move output prints to new line Mikko Rapeli
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

I get a qemu hang on kirkstone, swtpm and optee. One of the
optee-test/xtest hangs the qemu machine in some kind of deadlock.
While this needs to be debugged and fixed, the oeqa runtime tests
also hanged and never returned. Thus this patch set. With these changes
qemu deadlock is detected and do_testimage() task eventually exits
with all correct tests failing and the hangin qemu system gets killed.

Tested on kirkstone and cherry-picked to master. If something blows up
I'll do more testing on master branch based setup.

v2: removed all debug prints

v1: https://lists.openembedded.org/g/openembedded-core/message/176192

Mikko Rapeli (8):
  oeqa ssh.py: move output prints to new line
  oeqa ssh.py: fix hangs in run()
  oeqa ssh.py: add connection keep alive options to ssh client
  oeqa dump.py: add error counter and stop after 5 failures
  oeqa qemurunner: read more data at a time from serial
  oeqa qemurunner.py: add timeout to QMP calls
  oeqa qemurunner.py: kill qemu if it hangs
  oeqa qemurunner.py: try to avoid reading one character at a time

 meta/lib/oeqa/core/target/ssh.py  | 39 ++++++++++++++++++++++++-------
 meta/lib/oeqa/utils/dump.py       | 23 ++++++++++++++++--
 meta/lib/oeqa/utils/qemurunner.py | 16 ++++++++++++-
 3 files changed, 66 insertions(+), 12 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/8] oeqa ssh.py: move output prints to new line
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 2/8] oeqa ssh.py: fix hangs in run() Mikko Rapeli
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

The output from is garbled otherwise and it's not
easy to remove debug output form real command output on target.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/core/target/ssh.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
index 7a10ba9763..13fd5b2a49 100644
--- a/meta/lib/oeqa/core/target/ssh.py
+++ b/meta/lib/oeqa/core/target/ssh.py
@@ -240,7 +240,7 @@ def SSHCall(command, logger, timeout=None, **opts):
                             eof = True
                         else:
                             output += data
-                            logger.debug('Partial data from SSH call: %s' % data)
+                            logger.debug('Partial data from SSH call:\n%s' % data)
                             endtime = time.time() + timeout
                 except InterruptedError:
                     continue
@@ -256,12 +256,12 @@ def SSHCall(command, logger, timeout=None, **opts):
                 endtime = time.time() - starttime
                 lastline = ("\nProcess killed - no output for %d seconds. Total"
                             " running time: %d seconds." % (timeout, endtime))
-                logger.debug('Received data from SSH call %s ' % lastline)
+                logger.debug('Received data from SSH call:\n%s ' % lastline)
                 output += lastline
 
         else:
             output = process.communicate()[0].decode('utf-8', errors='ignore')
-            logger.debug('Data from SSH call: %s' % output.rstrip())
+            logger.debug('Data from SSH call:\n%s' % output.rstrip())
 
     options = {
         "stdout": subprocess.PIPE,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 1/8] oeqa ssh.py: move output prints to new line Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  9:15   ` [OE-core] " Richard Purdie
  2023-02-09 10:20   ` Richard Purdie
  2023-02-09  8:09 ` [PATCH v2 3/8] oeqa ssh.py: add connection keep alive options to ssh client Mikko Rapeli
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

When qemu machine hangs, the ssh commands done by tests
are not timing out. do_testimage() task has last logs like this:

DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502

The test process is stuck for hours, or for ever if the
executing command or test case did not set a timeout correctly.
The default 300 second timeout is not working when target hangs.
Note that timeout is really a "inactive timeout" since data returned
by the process will reset the timeout.

Make the process stdout non-blocking so read() will always return
right away using os.set_blocking() available in python 3.5 and later.

Then change from python codec reader to plain read() and make
the ssh subprocess stdout non-blocking. Even with select()
making sure the file had input to be read, the codec reader was
trying to find more stuff and blocking for ever when process hangs.

While at it, add a small timeout to read data in larger chunks if
possible. This avoids reading data one or few characters at a time
and makes the debug logs more readable.

close() the stdout file in all cases after read loop is complete.

Then make sure to wait or kill the ssh subprocess in all cases.
Just reading the output stream and receiving EOF there does not mean
that the process exited, and wait() needs a timeout if the process
is hanging. In the end kill the process and return the return value
and captured output utf-8 encoded, just like before these changes.

This fixes ssh run() related deadlocks when a qemu target hangs
completely.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
index 13fd5b2a49..466a795eb4 100644
--- a/meta/lib/oeqa/core/target/ssh.py
+++ b/meta/lib/oeqa/core/target/ssh.py
@@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
         if timeout:
             endtime = starttime + timeout
             eof = False
+            os.set_blocking(process.stdout.fileno(), False)
             while time.time() < endtime and not eof:
-                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
                 try:
+                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
                     if select.select([process.stdout], [], [], 5)[0] != []:
-                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
-                        data = reader.read(1024, 4096)
+                        # wait a bit for more data, tries to avoid reading single characters
+                        time.sleep(0.2)
+                        data = process.stdout.read()
                         if not data:
-                            process.stdout.close()
                             eof = True
                         else:
-                            output += data
-                            logger.debug('Partial data from SSH call:\n%s' % data)
+                            # ignore errors to capture as much as possible
+                            string = data.decode('utf-8', errors='ignore')
+                            output += string
+                            logger.debug('Partial data from SSH call:\n%s' % string)
                             endtime = time.time() + timeout
                 except InterruptedError:
+                    logger.debug('InterruptedError')
                     continue
 
+            process.stdout.close()
+
             # process hasn't returned yet
             if not eof:
                 process.terminate()
@@ -252,6 +258,7 @@ def SSHCall(command, logger, timeout=None, **opts):
                 try:
                     process.kill()
                 except OSError:
+                    logger.debug('OSError when killing process')
                     pass
                 endtime = time.time() - starttime
                 lastline = ("\nProcess killed - no output for %d seconds. Total"
@@ -263,6 +270,17 @@ def SSHCall(command, logger, timeout=None, **opts):
             output = process.communicate()[0].decode('utf-8', errors='ignore')
             logger.debug('Data from SSH call:\n%s' % output.rstrip())
 
+        # timout or not, make sure process exits and is not hanging
+        if process.returncode == None:
+            try:
+                process.wait(timeout=5)
+            except TimeoutExpired:
+                try:
+                    process.kill()
+                except OSError:
+                    logger.debug('OSError')
+                    pass
+
     options = {
         "stdout": subprocess.PIPE,
         "stderr": subprocess.STDOUT,
@@ -290,4 +308,5 @@ def SSHCall(command, logger, timeout=None, **opts):
             process.kill()
         logger.debug('Something went wrong, killing SSH process')
         raise
-    return (process.wait(), output.rstrip())
+
+    return (process.returncode, output.rstrip())
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 3/8] oeqa ssh.py: add connection keep alive options to ssh client
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 1/8] oeqa ssh.py: move output prints to new line Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 2/8] oeqa ssh.py: fix hangs in run() Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 4/8] oeqa dump.py: add error counter and stop after 5 failures Mikko Rapeli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

Configure ssh client to test that connection with server is up.
If the server does not respond within a minute then the connection,
target machine or sshd daemon are stuck and it's better to exit
the command execution with errors.

Some tests can execute a long time without returning stdout/stderror
data and it's difficult to adjust timers for those cases if
connection to target machine or the target machine itself hangs
and output is not expected in minutes or even hours.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/core/target/ssh.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
index 466a795eb4..bb3f2e3dc3 100644
--- a/meta/lib/oeqa/core/target/ssh.py
+++ b/meta/lib/oeqa/core/target/ssh.py
@@ -34,6 +34,8 @@ class OESSHTarget(OETarget):
         self.timeout = timeout
         self.user = user
         ssh_options = [
+                '-o', 'ServerAliveCountMax=2',
+                '-o', 'ServerAliveInterval=30',
                 '-o', 'UserKnownHostsFile=/dev/null',
                 '-o', 'StrictHostKeyChecking=no',
                 '-o', 'LogLevel=ERROR'
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 4/8] oeqa dump.py: add error counter and stop after 5 failures
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
                   ` (2 preceding siblings ...)
  2023-02-09  8:09 ` [PATCH v2 3/8] oeqa ssh.py: add connection keep alive options to ssh client Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 5/8] oeqa qemurunner: read more data at a time from serial Mikko Rapeli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

If test target qemu machine hangs completely, dump_target() calls
over serial console are taking a long time to time out, possibly
for every failing ssh command execution and a lot of test cases,
and same with dump_monitor().

Instead of trying for ever, count errors and after 5 stop trying
to dump_target() and dump_monitor() completely.

These help to end testing earlier when a test target is completely
deadlocked and all ssh, serial and QMP communication with it are
failing.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/utils/dump.py | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/meta/lib/oeqa/utils/dump.py b/meta/lib/oeqa/utils/dump.py
index bcee03b576..d420b497f9 100644
--- a/meta/lib/oeqa/utils/dump.py
+++ b/meta/lib/oeqa/utils/dump.py
@@ -93,37 +93,55 @@ class HostDumper(BaseDumper):
             self._write_dump(cmd.split()[0], result.output)
 
 class TargetDumper(BaseDumper):
-    """ Class to get dumps from target, it only works with QemuRunner """
+    """ Class to get dumps from target, it only works with QemuRunner.
+        Will give up permanently after 5 errors from running commands over
+        serial console. This helps to end testing when target is really dead, hanging
+        or unresponsive.
+    """
 
     def __init__(self, cmds, parent_dir, runner):
         super(TargetDumper, self).__init__(cmds, parent_dir)
         self.runner = runner
+        self.errors = 0
 
     def dump_target(self, dump_dir=""):
+        if self.errors >= 5:
+                print("Too many errors when dumping data from target, assuming it is dead! Will not dump data anymore!")
+                return
         if dump_dir:
             self.dump_dir = dump_dir
         for cmd in self.cmds:
             # We can continue with the testing if serial commands fail
             try:
                 (status, output) = self.runner.run_serial(cmd)
+                if status == 0:
+                    self.errors = self.errors + 1
                 self._write_dump(cmd.split()[0], output)
             except:
+                self.errors = self.errors + 1
                 print("Tried to dump info from target but "
                         "serial console failed")
                 print("Failed CMD: %s" % (cmd))
 
 class MonitorDumper(BaseDumper):
-    """ Class to get dumps via the Qemu Monitor, it only works with QemuRunner """
+    """ Class to get dumps via the Qemu Monitor, it only works with QemuRunner
+        Will stop completely if there are more than 5 errors when dumping monitor data.
+        This helps to end testing when target is really dead, hanging or unresponsive.
+    """
 
     def __init__(self, cmds, parent_dir, runner):
         super(MonitorDumper, self).__init__(cmds, parent_dir)
         self.runner = runner
+        self.errors = 0
 
     def dump_monitor(self, dump_dir=""):
         if self.runner is None:
             return
         if dump_dir:
             self.dump_dir = dump_dir
+        if self.errors >= 5:
+                print("Too many errors when dumping data from qemu monitor, assuming it is dead! Will not dump data anymore!")
+                return
         for cmd in self.cmds:
             cmd_name = cmd.split()[0]
             try:
@@ -137,4 +155,5 @@ class MonitorDumper(BaseDumper):
                     output = self.runner.run_monitor(cmd_name)
                 self._write_dump(cmd_name, output)
             except Exception as e:
+                self.errors = self.errors + 1
                 print("Failed to dump QMP CMD: %s with\nException: %s" % (cmd_name, e))
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 5/8] oeqa qemurunner: read more data at a time from serial
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
                   ` (3 preceding siblings ...)
  2023-02-09  8:09 ` [PATCH v2 4/8] oeqa dump.py: add error counter and stop after 5 failures Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 6/8] oeqa qemurunner.py: add timeout to QMP calls Mikko Rapeli
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

Use a short sleep to bundle serial console reads so that
we are not reading one character at a time which reduces busy
looping.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/utils/qemurunner.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
index 8b893601d4..2e054447fc 100644
--- a/meta/lib/oeqa/utils/qemurunner.py
+++ b/meta/lib/oeqa/utils/qemurunner.py
@@ -202,7 +202,7 @@ class QemuRunner:
         qmp_file = "." + next(tempfile._get_candidate_names())
         qmp_param = ' -S -qmp unix:./%s,server,wait' % (qmp_file)
         qmp_port = self.tmpdir + "/" + qmp_file
-        # Create a second socket connection for debugging use, 
+        # Create a second socket connection for debugging use,
         # note this will NOT cause qemu to block waiting for the connection
         qmp_file2 = "." + next(tempfile._get_candidate_names())
         qmp_param += ' -qmp unix:./%s,server,nowait' % (qmp_file2)
@@ -468,6 +468,8 @@ class QemuRunner:
                     socklist.remove(self.server_socket)
                     self.logger.debug("Connection from %s:%s" % addr)
                 else:
+                    # try to avoid reading only a single character at a time
+                    time.sleep(0.1)
                     data = data + sock.recv(1024)
                     if data:
                         bootlog += data
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 6/8] oeqa qemurunner.py: add timeout to QMP calls
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
                   ` (4 preceding siblings ...)
  2023-02-09  8:09 ` [PATCH v2 5/8] oeqa qemurunner: read more data at a time from serial Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs Mikko Rapeli
  2023-02-09  8:09 ` [PATCH v2 8/8] oeqa qemurunner.py: try to avoid reading one character at a time Mikko Rapeli
  7 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

When a qemu machine hangs, the QMP calls can hang for ever
too, and when this happens any failing test commands from ssh
runner may be followed by dump_monitor() calls which
then also hang. Hangs followed by hangs.

Use runqemutime at setup and run_monitor() specific timeout
for later calls.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/utils/qemurunner.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
index 2e054447fc..bce00c696a 100644
--- a/meta/lib/oeqa/utils/qemurunner.py
+++ b/meta/lib/oeqa/utils/qemurunner.py
@@ -350,6 +350,8 @@ class QemuRunner:
                     return False
 
             try:
+                # set timeout value for all QMP calls
+                self.qmp.settimeout(self.runqemutime)
                 self.qmp.connect()
                 connect_time = time.time()
                 self.logger.info("QMP connected to QEMU at %s and took %s seconds" %
@@ -628,6 +630,7 @@ class QemuRunner:
 
     def run_monitor(self, command, args=None, timeout=60):
         if hasattr(self, 'qmp') and self.qmp:
+            self.qmp.settimeout(timeout)
             if args is not None:
                 return self.qmp.cmd(command, args)
             else:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
                   ` (5 preceding siblings ...)
  2023-02-09  8:09 ` [PATCH v2 6/8] oeqa qemurunner.py: add timeout to QMP calls Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  2023-02-09  9:45   ` [OE-core] " Alexander Kanavin
  2023-02-09  8:09 ` [PATCH v2 8/8] oeqa qemurunner.py: try to avoid reading one character at a time Mikko Rapeli
  7 siblings, 1 reply; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

qemu doesn't always behave well and can hang too.
kill it with force if was still alive.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/utils/qemurunner.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
index bce00c696a..8e3484385d 100644
--- a/meta/lib/oeqa/utils/qemurunner.py
+++ b/meta/lib/oeqa/utils/qemurunner.py
@@ -589,6 +589,13 @@ class QemuRunner:
                 os.kill(self.qemupid, signal.SIGTERM)
             except ProcessLookupError as e:
                 self.logger.warning('qemu-system ended unexpectedly')
+            time.sleep(5)
+            try:
+                # qemu-system did not behave well
+                os.kill(self.qemupid, signal.SIGKILL)
+            except ProcessLookupError as e:
+                # already dead
+                pass
 
     def stop_thread(self):
         if self.thread and self.thread.is_alive():
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 8/8] oeqa qemurunner.py: try to avoid reading one character at a time
  2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
                   ` (6 preceding siblings ...)
  2023-02-09  8:09 ` [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs Mikko Rapeli
@ 2023-02-09  8:09 ` Mikko Rapeli
  7 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  8:09 UTC (permalink / raw)
  To: openembedded-core; +Cc: Mikko Rapeli

Read from serial console with a small delay to bundle data to e.g.
full lines. Reading one character at a time is not needed and causes
busy looping.

Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
---
 meta/lib/oeqa/utils/qemurunner.py | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
index 8e3484385d..0f4157d318 100644
--- a/meta/lib/oeqa/utils/qemurunner.py
+++ b/meta/lib/oeqa/utils/qemurunner.py
@@ -665,6 +665,8 @@ class QemuRunner:
             except InterruptedError:
                 continue
             if sread:
+                # try to avoid reading single character at a time
+                time.sleep(0.1)
                 answer = self.server_socket.recv(1024)
                 if answer:
                     data += answer.decode('utf-8')
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09  8:09 ` [PATCH v2 2/8] oeqa ssh.py: fix hangs in run() Mikko Rapeli
@ 2023-02-09  9:15   ` Richard Purdie
  2023-02-09  9:36     ` Mikko Rapeli
  2023-02-09 10:20   ` Richard Purdie
  1 sibling, 1 reply; 22+ messages in thread
From: Richard Purdie @ 2023-02-09  9:15 UTC (permalink / raw)
  To: Mikko Rapeli, openembedded-core

On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> When qemu machine hangs, the ssh commands done by tests
> are not timing out. do_testimage() task has last logs like this:
> 
> DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> 
> The test process is stuck for hours, or for ever if the
> executing command or test case did not set a timeout correctly.
> The default 300 second timeout is not working when target hangs.
> Note that timeout is really a "inactive timeout" since data returned
> by the process will reset the timeout.
> 
> Make the process stdout non-blocking so read() will always return
> right away using os.set_blocking() available in python 3.5 and later.
> 
> Then change from python codec reader to plain read() and make
> the ssh subprocess stdout non-blocking. Even with select()
> making sure the file had input to be read, the codec reader was
> trying to find more stuff and blocking for ever when process hangs.
> 
> While at it, add a small timeout to read data in larger chunks if
> possible. This avoids reading data one or few characters at a time
> and makes the debug logs more readable.
> 
> close() the stdout file in all cases after read loop is complete.
> 
> Then make sure to wait or kill the ssh subprocess in all cases.
> Just reading the output stream and receiving EOF there does not mean
> that the process exited, and wait() needs a timeout if the process
> is hanging. In the end kill the process and return the return value
> and captured output utf-8 encoded, just like before these changes.
> 
> This fixes ssh run() related deadlocks when a qemu target hangs
> completely.
> 
> Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> ---
>  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> index 13fd5b2a49..466a795eb4 100644
> --- a/meta/lib/oeqa/core/target/ssh.py
> +++ b/meta/lib/oeqa/core/target/ssh.py
> @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
>          if timeout:
>              endtime = starttime + timeout
>              eof = False
> +            os.set_blocking(process.stdout.fileno(), False)
>              while time.time() < endtime and not eof:
> -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
>                  try:
> +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
>                      if select.select([process.stdout], [], [], 5)[0] != []:
> -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> -                        data = reader.read(1024, 4096)
> +                        # wait a bit for more data, tries to avoid reading single characters
> +                        time.sleep(0.2)
> +                        data = process.stdout.read()
>                          if not data:
> -                            process.stdout.close()
>                              eof = True

I'm not sure I understand the use of eof now. If stdout is non-
blocking, it would be perfectly normal for it to return with no data
but that doesn't mean we're at eof, it just means there is no data
ready?

I suspect the code might be ok but we should stop calling it eof since
it no longer is?

I also tend to dislike sleep() codepaths as it usually means the code
is sub optimal. I was going to make further suggestions but I stepped
back and looked at the bigger picture of the code.

The code is basically doing:

        if timeout:
            <complex code>
        else:
            output = process.communicate()


Python 3.3 added a timeout parameter to subprocess.communicate(). Can
we just use that instead? Or is there something special about our needs
which means we can't?

Cheers,

Richard



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09  9:15   ` [OE-core] " Richard Purdie
@ 2023-02-09  9:36     ` Mikko Rapeli
  2023-02-09 10:53       ` Alexander Kanavin
  0 siblings, 1 reply; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  9:36 UTC (permalink / raw)
  To: Richard Purdie; +Cc: openembedded-core

Hi,

On Thu, Feb 09, 2023 at 09:15:10AM +0000, Richard Purdie wrote:
> On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > When qemu machine hangs, the ssh commands done by tests
> > are not timing out. do_testimage() task has last logs like this:
> > 
> > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > 
> > The test process is stuck for hours, or for ever if the
> > executing command or test case did not set a timeout correctly.
> > The default 300 second timeout is not working when target hangs.
> > Note that timeout is really a "inactive timeout" since data returned
> > by the process will reset the timeout.
> > 
> > Make the process stdout non-blocking so read() will always return
> > right away using os.set_blocking() available in python 3.5 and later.
> > 
> > Then change from python codec reader to plain read() and make
> > the ssh subprocess stdout non-blocking. Even with select()
> > making sure the file had input to be read, the codec reader was
> > trying to find more stuff and blocking for ever when process hangs.
> > 
> > While at it, add a small timeout to read data in larger chunks if
> > possible. This avoids reading data one or few characters at a time
> > and makes the debug logs more readable.
> > 
> > close() the stdout file in all cases after read loop is complete.
> > 
> > Then make sure to wait or kill the ssh subprocess in all cases.
> > Just reading the output stream and receiving EOF there does not mean
> > that the process exited, and wait() needs a timeout if the process
> > is hanging. In the end kill the process and return the return value
> > and captured output utf-8 encoded, just like before these changes.
> > 
> > This fixes ssh run() related deadlocks when a qemu target hangs
> > completely.
> > 
> > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > ---
> >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> >  1 file changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> > index 13fd5b2a49..466a795eb4 100644
> > --- a/meta/lib/oeqa/core/target/ssh.py
> > +++ b/meta/lib/oeqa/core/target/ssh.py
> > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> >          if timeout:
> >              endtime = starttime + timeout
> >              eof = False
> > +            os.set_blocking(process.stdout.fileno(), False)
> >              while time.time() < endtime and not eof:
> > -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
> >                  try:
> > +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
> >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> > -                        data = reader.read(1024, 4096)
> > +                        # wait a bit for more data, tries to avoid reading single characters
> > +                        time.sleep(0.2)
> > +                        data = process.stdout.read()
> >                          if not data:
> > -                            process.stdout.close()
> >                              eof = True
> 
> I'm not sure I understand the use of eof now. If stdout is non-
> blocking, it would be perfectly normal for it to return with no data
> but that doesn't mean we're at eof, it just means there is no data
> ready?

The select() makes sure that there must be data available for read(), or eof when the
file descriptor is closed. But closed file descriptor does not mean that
the subprocess is dead. For that we must wait() with a timeout and
possibly after that kill().

> I suspect the code might be ok but we should stop calling it eof since
> it no longer is?
> 
> I also tend to dislike sleep() codepaths as it usually means the code
> is sub optimal. I was going to make further suggestions but I stepped
> back and looked at the bigger picture of the code.

I think sleep() is ok in this case to reduce busy looping over serial
console and read() more data than what select() initially saw. Reading
single character at a time is a bit silly and just wastes CPU time when
we don't need ms level accuracy.

> The code is basically doing:
> 
>         if timeout:
>             <complex code>
>         else:
>             output = process.communicate()
> 
> 
> Python 3.3 added a timeout parameter to subprocess.communicate(). Can
> we just use that instead? Or is there something special about our needs
> which means we can't?

The current code uses timeout as "inactivity timeout" and thus code
reads output using select() and read() until the end or if there is no
output until timeout.

Code could use communicate() with a timeout but would still need to use
select() and read() loop to check if there was any output. I did not see
a simple way to rewrite this in a more pythonic way. The reader
iterators did the wrong thing and ended up dead locking when subprocess
hangs.

Cheers,

-Mikko


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs
  2023-02-09  8:09 ` [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs Mikko Rapeli
@ 2023-02-09  9:45   ` Alexander Kanavin
  2023-02-09  9:56     ` Mikko Rapeli
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Kanavin @ 2023-02-09  9:45 UTC (permalink / raw)
  To: Mikko Rapeli; +Cc: openembedded-core

Isn't the code waiting 5 seconds and then sending SIGKILL, regardless
of whether SIGTERM was successful or not here?

I would actually remove this function altogether. qemu process is
started by runqemu and it's the job of that to clean up the actual
qemu process properly.

Alex

On Thu, 9 Feb 2023 at 09:10, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
>
> qemu doesn't always behave well and can hang too.
> kill it with force if was still alive.
>
> Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> ---
>  meta/lib/oeqa/utils/qemurunner.py | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
> index bce00c696a..8e3484385d 100644
> --- a/meta/lib/oeqa/utils/qemurunner.py
> +++ b/meta/lib/oeqa/utils/qemurunner.py
> @@ -589,6 +589,13 @@ class QemuRunner:
>                  os.kill(self.qemupid, signal.SIGTERM)
>              except ProcessLookupError as e:
>                  self.logger.warning('qemu-system ended unexpectedly')
> +            time.sleep(5)
> +            try:
> +                # qemu-system did not behave well
> +                os.kill(self.qemupid, signal.SIGKILL)
> +            except ProcessLookupError as e:
> +                # already dead
> +                pass
>
>      def stop_thread(self):
>          if self.thread and self.thread.is_alive():
> --
> 2.34.1
>
>
> -=-=-=-=-=-=-=-=-=-=-=-
> Links: You receive all messages sent to this group.
> View/Reply Online (#176926): https://lists.openembedded.org/g/openembedded-core/message/176926
> Mute This Topic: https://lists.openembedded.org/mt/96849162/1686489
> Group Owner: openembedded-core+owner@lists.openembedded.org
> Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [alex.kanavin@gmail.com]
> -=-=-=-=-=-=-=-=-=-=-=-
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs
  2023-02-09  9:45   ` [OE-core] " Alexander Kanavin
@ 2023-02-09  9:56     ` Mikko Rapeli
  2023-02-09 10:01       ` Alexander Kanavin
  0 siblings, 1 reply; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09  9:56 UTC (permalink / raw)
  To: Alexander Kanavin; +Cc: openembedded-core

Hi,

On Thu, Feb 09, 2023 at 10:45:36AM +0100, Alexander Kanavin wrote:
> Isn't the code waiting 5 seconds and then sending SIGKILL, regardless
> of whether SIGTERM was successful or not here?

Yes. Not nice but better than leaking the process completely.

> I would actually remove this function altogether. qemu process is
> started by runqemu and it's the job of that to clean up the actual
> qemu process properly.

It doesn't currently. Should the sigterm_handler() also use SIGKILL if
it has to?

Cheers,

-Mikko

> Alex
> 
> On Thu, 9 Feb 2023 at 09:10, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> >
> > qemu doesn't always behave well and can hang too.
> > kill it with force if was still alive.
> >
> > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > ---
> >  meta/lib/oeqa/utils/qemurunner.py | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/meta/lib/oeqa/utils/qemurunner.py b/meta/lib/oeqa/utils/qemurunner.py
> > index bce00c696a..8e3484385d 100644
> > --- a/meta/lib/oeqa/utils/qemurunner.py
> > +++ b/meta/lib/oeqa/utils/qemurunner.py
> > @@ -589,6 +589,13 @@ class QemuRunner:
> >                  os.kill(self.qemupid, signal.SIGTERM)
> >              except ProcessLookupError as e:
> >                  self.logger.warning('qemu-system ended unexpectedly')
> > +            time.sleep(5)
> > +            try:
> > +                # qemu-system did not behave well
> > +                os.kill(self.qemupid, signal.SIGKILL)
> > +            except ProcessLookupError as e:
> > +                # already dead
> > +                pass
> >
> >      def stop_thread(self):
> >          if self.thread and self.thread.is_alive():
> > --
> > 2.34.1
> >
> >
> > -=-=-=-=-=-=-=-=-=-=-=-
> > Links: You receive all messages sent to this group.
> > View/Reply Online (#176926): https://lists.openembedded.org/g/openembedded-core/message/176926
> > Mute This Topic: https://lists.openembedded.org/mt/96849162/1686489
> > Group Owner: openembedded-core+owner@lists.openembedded.org
> > Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [alex.kanavin@gmail.com]
> > -=-=-=-=-=-=-=-=-=-=-=-
> >


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs
  2023-02-09  9:56     ` Mikko Rapeli
@ 2023-02-09 10:01       ` Alexander Kanavin
  0 siblings, 0 replies; 22+ messages in thread
From: Alexander Kanavin @ 2023-02-09 10:01 UTC (permalink / raw)
  To: Mikko Rapeli; +Cc: openembedded-core

On Thu, 9 Feb 2023 at 10:56, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> It doesn't currently. Should the sigterm_handler() also use SIGKILL if
> it has to?

Yes please. Let's not duplicate functionality. And please do try to
avoid adding hard sleep(), there are usually ways to do things
synchronously.

Alex


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09  8:09 ` [PATCH v2 2/8] oeqa ssh.py: fix hangs in run() Mikko Rapeli
  2023-02-09  9:15   ` [OE-core] " Richard Purdie
@ 2023-02-09 10:20   ` Richard Purdie
  2023-02-09 10:35     ` Mikko Rapeli
  1 sibling, 1 reply; 22+ messages in thread
From: Richard Purdie @ 2023-02-09 10:20 UTC (permalink / raw)
  To: Mikko Rapeli, openembedded-core

On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> When qemu machine hangs, the ssh commands done by tests
> are not timing out. do_testimage() task has last logs like this:
> 
> DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> 
> The test process is stuck for hours, or for ever if the
> executing command or test case did not set a timeout correctly.
> The default 300 second timeout is not working when target hangs.
> Note that timeout is really a "inactive timeout" since data returned
> by the process will reset the timeout.
> 
> Make the process stdout non-blocking so read() will always return
> right away using os.set_blocking() available in python 3.5 and later.
> 
> Then change from python codec reader to plain read() and make
> the ssh subprocess stdout non-blocking. Even with select()
> making sure the file had input to be read, the codec reader was
> trying to find more stuff and blocking for ever when process hangs.
> 
> While at it, add a small timeout to read data in larger chunks if
> possible. This avoids reading data one or few characters at a time
> and makes the debug logs more readable.
> 
> close() the stdout file in all cases after read loop is complete.
> 
> Then make sure to wait or kill the ssh subprocess in all cases.
> Just reading the output stream and receiving EOF there does not mean
> that the process exited, and wait() needs a timeout if the process
> is hanging. In the end kill the process and return the return value
> and captured output utf-8 encoded, just like before these changes.
> 
> This fixes ssh run() related deadlocks when a qemu target hangs
> completely.
> 
> Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> ---
>  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> index 13fd5b2a49..466a795eb4 100644
> --- a/meta/lib/oeqa/core/target/ssh.py
> +++ b/meta/lib/oeqa/core/target/ssh.py
> @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
>          if timeout:
>              endtime = starttime + timeout
>              eof = False
> +            os.set_blocking(process.stdout.fileno(), False)
>              while time.time() < endtime and not eof:
> -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
>                  try:
> +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
>                      if select.select([process.stdout], [], [], 5)[0] != []:
> -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> -                        data = reader.read(1024, 4096)
> +                        # wait a bit for more data, tries to avoid reading single characters
> +                        time.sleep(0.2)
> +                        data = process.stdout.read()
>                          if not data:
> -                            process.stdout.close()
>                              eof = True
>                          else:
> -                            output += data
> -                            logger.debug('Partial data from SSH call:\n%s' % data)
> +                            # ignore errors to capture as much as possible
> +                            string = data.decode('utf-8', errors='ignore')

I've been trying to remember why we have the reader here and  I think
the issue was you can't call decode on partial data :(.

If it happens that there is a multi byte character on the stream
crossing the boundary then you'll get errors. "ignoring" them just
means you'll corrupt data. The case is rare but with long running
commands with utf8 data like some of the ptests, it can be an issue.

That is why the code was written the way it was...

Cheers,

Richard




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09 10:20   ` Richard Purdie
@ 2023-02-09 10:35     ` Mikko Rapeli
  2023-02-09 10:52       ` Richard Purdie
  0 siblings, 1 reply; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09 10:35 UTC (permalink / raw)
  To: Richard Purdie; +Cc: openembedded-core

Hi,

On Thu, Feb 09, 2023 at 10:20:36AM +0000, Richard Purdie wrote:
> On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > When qemu machine hangs, the ssh commands done by tests
> > are not timing out. do_testimage() task has last logs like this:
> > 
> > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > 
> > The test process is stuck for hours, or for ever if the
> > executing command or test case did not set a timeout correctly.
> > The default 300 second timeout is not working when target hangs.
> > Note that timeout is really a "inactive timeout" since data returned
> > by the process will reset the timeout.
> > 
> > Make the process stdout non-blocking so read() will always return
> > right away using os.set_blocking() available in python 3.5 and later.
> > 
> > Then change from python codec reader to plain read() and make
> > the ssh subprocess stdout non-blocking. Even with select()
> > making sure the file had input to be read, the codec reader was
> > trying to find more stuff and blocking for ever when process hangs.
> > 
> > While at it, add a small timeout to read data in larger chunks if
> > possible. This avoids reading data one or few characters at a time
> > and makes the debug logs more readable.
> > 
> > close() the stdout file in all cases after read loop is complete.
> > 
> > Then make sure to wait or kill the ssh subprocess in all cases.
> > Just reading the output stream and receiving EOF there does not mean
> > that the process exited, and wait() needs a timeout if the process
> > is hanging. In the end kill the process and return the return value
> > and captured output utf-8 encoded, just like before these changes.
> > 
> > This fixes ssh run() related deadlocks when a qemu target hangs
> > completely.
> > 
> > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > ---
> >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> >  1 file changed, 26 insertions(+), 7 deletions(-)
> > 
> > diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> > index 13fd5b2a49..466a795eb4 100644
> > --- a/meta/lib/oeqa/core/target/ssh.py
> > +++ b/meta/lib/oeqa/core/target/ssh.py
> > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> >          if timeout:
> >              endtime = starttime + timeout
> >              eof = False
> > +            os.set_blocking(process.stdout.fileno(), False)
> >              while time.time() < endtime and not eof:
> > -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
> >                  try:
> > +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
> >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> > -                        data = reader.read(1024, 4096)
> > +                        # wait a bit for more data, tries to avoid reading single characters
> > +                        time.sleep(0.2)
> > +                        data = process.stdout.read()
> >                          if not data:
> > -                            process.stdout.close()
> >                              eof = True
> >                          else:
> > -                            output += data
> > -                            logger.debug('Partial data from SSH call:\n%s' % data)
> > +                            # ignore errors to capture as much as possible
> > +                            string = data.decode('utf-8', errors='ignore')
> 
> I've been trying to remember why we have the reader here and  I think
> the issue was you can't call decode on partial data :(.

I get this. Sadly readers read more than file has available and they don't have
timeouts...

> If it happens that there is a multi byte character on the stream
> crossing the boundary then you'll get errors. "ignoring" them just
> means you'll corrupt data. The case is rare but with long running
> commands with utf8 data like some of the ptests, it can be an issue.
>
> That is why the code was written the way it was...

For logging purposes errors are fine. Should the raw data be captured to
output then? Or another conversion to utf-8 after all data has been
read?

Reading more than one character at a time for sure helps here but
granted there may be windows where we hit the in character bytes..

Cheers,

-Mikko


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09 10:35     ` Mikko Rapeli
@ 2023-02-09 10:52       ` Richard Purdie
  0 siblings, 0 replies; 22+ messages in thread
From: Richard Purdie @ 2023-02-09 10:52 UTC (permalink / raw)
  To: Mikko Rapeli; +Cc: openembedded-core

On Thu, 2023-02-09 at 12:35 +0200, Mikko Rapeli wrote:
> Hi,
> 
> On Thu, Feb 09, 2023 at 10:20:36AM +0000, Richard Purdie wrote:
> > On Thu, 2023-02-09 at 10:09 +0200, Mikko Rapeli wrote:
> > > When qemu machine hangs, the ssh commands done by tests
> > > are not timing out. do_testimage() task has last logs like this:
> > > 
> > > DEBUG: time: 1673531086.3155053, endtime: 1673531686.315502
> > > 
> > > The test process is stuck for hours, or for ever if the
> > > executing command or test case did not set a timeout correctly.
> > > The default 300 second timeout is not working when target hangs.
> > > Note that timeout is really a "inactive timeout" since data returned
> > > by the process will reset the timeout.
> > > 
> > > Make the process stdout non-blocking so read() will always return
> > > right away using os.set_blocking() available in python 3.5 and later.
> > > 
> > > Then change from python codec reader to plain read() and make
> > > the ssh subprocess stdout non-blocking. Even with select()
> > > making sure the file had input to be read, the codec reader was
> > > trying to find more stuff and blocking for ever when process hangs.
> > > 
> > > While at it, add a small timeout to read data in larger chunks if
> > > possible. This avoids reading data one or few characters at a time
> > > and makes the debug logs more readable.
> > > 
> > > close() the stdout file in all cases after read loop is complete.
> > > 
> > > Then make sure to wait or kill the ssh subprocess in all cases.
> > > Just reading the output stream and receiving EOF there does not mean
> > > that the process exited, and wait() needs a timeout if the process
> > > is hanging. In the end kill the process and return the return value
> > > and captured output utf-8 encoded, just like before these changes.
> > > 
> > > This fixes ssh run() related deadlocks when a qemu target hangs
> > > completely.
> > > 
> > > Signed-off-by: Mikko Rapeli <mikko.rapeli@linaro.org>
> > > ---
> > >  meta/lib/oeqa/core/target/ssh.py | 33 +++++++++++++++++++++++++-------
> > >  1 file changed, 26 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/meta/lib/oeqa/core/target/ssh.py b/meta/lib/oeqa/core/target/ssh.py
> > > index 13fd5b2a49..466a795eb4 100644
> > > --- a/meta/lib/oeqa/core/target/ssh.py
> > > +++ b/meta/lib/oeqa/core/target/ssh.py
> > > @@ -229,22 +229,28 @@ def SSHCall(command, logger, timeout=None, **opts):
> > >          if timeout:
> > >              endtime = starttime + timeout
> > >              eof = False
> > > +            os.set_blocking(process.stdout.fileno(), False)
> > >              while time.time() < endtime and not eof:
> > > -                logger.debug('time: %s, endtime: %s' % (time.time(), endtime))
> > >                  try:
> > > +                    logger.debug('Waiting for process output: time: %s, endtime: %s' % (time.time(), endtime))
> > >                      if select.select([process.stdout], [], [], 5)[0] != []:
> > > -                        reader = codecs.getreader('utf-8')(process.stdout, 'ignore')
> > > -                        data = reader.read(1024, 4096)
> > > +                        # wait a bit for more data, tries to avoid reading single characters
> > > +                        time.sleep(0.2)
> > > +                        data = process.stdout.read()
> > >                          if not data:
> > > -                            process.stdout.close()
> > >                              eof = True
> > >                          else:
> > > -                            output += data
> > > -                            logger.debug('Partial data from SSH call:\n%s' % data)
> > > +                            # ignore errors to capture as much as possible
> > > +                            string = data.decode('utf-8', errors='ignore')
> > 
> > I've been trying to remember why we have the reader here and  I think
> > the issue was you can't call decode on partial data :(.
> 
> I get this. Sadly readers read more than file has available and they don't have
> timeouts...
> 
> > If it happens that there is a multi byte character on the stream
> > crossing the boundary then you'll get errors. "ignoring" them just
> > means you'll corrupt data. The case is rare but with long running
> > commands with utf8 data like some of the ptests, it can be an issue.
> > 
> > That is why the code was written the way it was...
> 
> For logging purposes errors are fine. Should the raw data be captured to
> output then? Or another conversion to utf-8 after all data has been
> read?

Converting at the end is one option.

The other option would be to feed the reader our own stream object
where it's read method would only return the data we have. I think you
can create dummy "file like" objects in python, it might be easiest
with io.BytesIO(). The codec reader could then interact with that as a
dummy stream. 


Cheers,

Richard



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09  9:36     ` Mikko Rapeli
@ 2023-02-09 10:53       ` Alexander Kanavin
  2023-02-09 10:56         ` Richard Purdie
  2023-02-09 11:12         ` Mikko Rapeli
  0 siblings, 2 replies; 22+ messages in thread
From: Alexander Kanavin @ 2023-02-09 10:53 UTC (permalink / raw)
  To: Mikko Rapeli; +Cc: Richard Purdie, openembedded-core

On Thu, 9 Feb 2023 at 10:36, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> I think sleep() is ok in this case to reduce busy looping over serial
> console and read() more data than what select() initially saw. Reading
> single character at a time is a bit silly and just wastes CPU time when
> we don't need ms level accuracy.

Something doesn't compute here. The writer should be sending data in
chunks, and there are buffers all the way to the reader, so where does
the splitting into single characters happen?

Alex


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09 10:53       ` Alexander Kanavin
@ 2023-02-09 10:56         ` Richard Purdie
  2023-02-09 11:12         ` Mikko Rapeli
  1 sibling, 0 replies; 22+ messages in thread
From: Richard Purdie @ 2023-02-09 10:56 UTC (permalink / raw)
  To: Alexander Kanavin, Mikko Rapeli; +Cc: openembedded-core

On Thu, 2023-02-09 at 11:53 +0100, Alexander Kanavin wrote:
> On Thu, 9 Feb 2023 at 10:36, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> > I think sleep() is ok in this case to reduce busy looping over serial
> > console and read() more data than what select() initially saw. Reading
> > single character at a time is a bit silly and just wastes CPU time when
> > we don't need ms level accuracy.
> 
> Something doesn't compute here. The writer should be sending data in
> chunks, and there are buffers all the way to the reader, so where does
> the splitting into single characters happen?

Just guessing but in the case of the serial emulation in qemu,
115200bps is probably "slow" compared to the system speed so they
arrive as bytes?

Cheers,

Richard


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09 10:53       ` Alexander Kanavin
  2023-02-09 10:56         ` Richard Purdie
@ 2023-02-09 11:12         ` Mikko Rapeli
  2023-02-09 11:42           ` Alexander Kanavin
  1 sibling, 1 reply; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09 11:12 UTC (permalink / raw)
  To: Alexander Kanavin; +Cc: Richard Purdie, openembedded-core

Hi,

On Thu, Feb 09, 2023 at 11:53:00AM +0100, Alexander Kanavin wrote:
> On Thu, 9 Feb 2023 at 10:36, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> > I think sleep() is ok in this case to reduce busy looping over serial
> > console and read() more data than what select() initially saw. Reading
> > single character at a time is a bit silly and just wastes CPU time when
> > we don't need ms level accuracy.
> 
> Something doesn't compute here. The writer should be sending data in
> chunks, and there are buffers all the way to the reader, so where does
> the splitting into single characters happen?

Sorry I mixed with the serial console patch, but the output was similar
over ssh.

select() returns with any data, even single character. Fancy readers
will read larger chunks but will read more data than is available and thus
block for ever. Plain read() will read the data which is available from
select() and thus really small amounts like single characters if that's
available over the socket. This is visible in the partial read output of
do_testimage() task output log which becomes quite unreadable without change.

Cheers,

-Mikko


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09 11:12         ` Mikko Rapeli
@ 2023-02-09 11:42           ` Alexander Kanavin
  2023-02-09 12:23             ` Mikko Rapeli
  0 siblings, 1 reply; 22+ messages in thread
From: Alexander Kanavin @ 2023-02-09 11:42 UTC (permalink / raw)
  To: Mikko Rapeli; +Cc: Richard Purdie, openembedded-core

On Thu, 9 Feb 2023 at 12:12, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> select() returns with any data, even single character. Fancy readers
> will read larger chunks but will read more data than is available and thus
> block for ever. Plain read() will read the data which is available from
> select() and thus really small amounts like single characters if that's
> available over the socket. This is visible in the partial read output of
> do_testimage() task output log which becomes quite unreadable without change.

Yes, but this does not explain why single characters happen to begin
with - RP offered a hypothesis, but the problem is that data arrives
that way, not the tight-loop manner in which we're reading it.

Alex


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [OE-core] [PATCH v2 2/8] oeqa ssh.py: fix hangs in run()
  2023-02-09 11:42           ` Alexander Kanavin
@ 2023-02-09 12:23             ` Mikko Rapeli
  0 siblings, 0 replies; 22+ messages in thread
From: Mikko Rapeli @ 2023-02-09 12:23 UTC (permalink / raw)
  To: Alexander Kanavin; +Cc: Richard Purdie, openembedded-core

Hi,

On Thu, Feb 09, 2023 at 12:42:09PM +0100, Alexander Kanavin wrote:
> On Thu, 9 Feb 2023 at 12:12, Mikko Rapeli <mikko.rapeli@linaro.org> wrote:
> > select() returns with any data, even single character. Fancy readers
> > will read larger chunks but will read more data than is available and thus
> > block for ever. Plain read() will read the data which is available from
> > select() and thus really small amounts like single characters if that's
> > available over the socket. This is visible in the partial read output of
> > do_testimage() task output log which becomes quite unreadable without change.
> 
> Yes, but this does not explain why single characters happen to begin
> with - RP offered a hypothesis, but the problem is that data arrives
> that way, not the tight-loop manner in which we're reading it.

Without the sleep(0.2) with ssh, the do_testimage task output can be like this (which
isn't too bad, just annoying):

DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; cat /etc/controllerimage
DEBUG: Waiting for process output: time: 1675942983.6053402, endtime: 1675943283.6033137
DEBUG: Partial data from SSH call:
cat: 
DEBUG: Waiting for process output: time: 1675942986.2085373, endtime: 1675943286.208536
DEBUG: Partial data from SSH call:
/etc/controllerimage
DEBUG: Waiting for process output: time: 1675942986.2111514, endtime: 1675943286.2111504
DEBUG: Partial data from SSH call:
: No such file or directory
DEBUG: Waiting for process output: time: 1675942986.2140906, endtime: 1675943286.2140894
DEBUG: Partial data from SSH call:


DEBUG: Waiting for process output: time: 1675942986.216261, endtime: 1675943286.21626
DEBUG: [Command returned '1' after 2.64 seconds]

I think qemu is just slower than what anyone expects, including the ssh
protocol which does some buffer but then gives up. It may also be the
kernel and image running under qemu which busy starting up, e.g.
generating entropy or keys.

With the sleep(0.2) output is:

DEBUG: [Running]$ ssh -l root -o ServerAliveCountMax=2 -o ServerAliveInterval=30 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o LogLevel=ERROR -p 2222 127.0.0.1 export PATH=/usr/sbin:/sbin:/usr/bin:/bin; cat /etc/controllerimage
DEBUG: Waiting for process output: time: 1675945065.6815689, endtime: 1675945365.6790204
DEBUG: Partial data from SSH call:
cat: /etc/controllerimage: No such file or directory

DEBUG: Waiting for process output: time: 1675945068.308055, endtime: 1675945368.3080513
DEBUG: [Command returned '1' after 2.83 seconds]
DEBUG: Command: cat /etc/controllerimage
Status: 1 Output:
cat: /etc/controllerimage: No such file or directory

NOTE:  ... ok

Similar thing happens then over serial console which is even slower. I
know debugging the data read and written to serial are not wanted but I
had the changes in place and saw bootloader and kernel messages being
read initally one character at a time and later on 5-10 characters at a time
which made debugging problems really annoying. The small sleep does
improve things, doesn't solve everything but also doesn't really harm.

No data is lost and less time is used looping around the file
descriptor.

Cheers,

-Mikko


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-02-09 12:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-09  8:09 [PATCH v2 0/8] fix oeqa runtime test framework when qemu hangs Mikko Rapeli
2023-02-09  8:09 ` [PATCH v2 1/8] oeqa ssh.py: move output prints to new line Mikko Rapeli
2023-02-09  8:09 ` [PATCH v2 2/8] oeqa ssh.py: fix hangs in run() Mikko Rapeli
2023-02-09  9:15   ` [OE-core] " Richard Purdie
2023-02-09  9:36     ` Mikko Rapeli
2023-02-09 10:53       ` Alexander Kanavin
2023-02-09 10:56         ` Richard Purdie
2023-02-09 11:12         ` Mikko Rapeli
2023-02-09 11:42           ` Alexander Kanavin
2023-02-09 12:23             ` Mikko Rapeli
2023-02-09 10:20   ` Richard Purdie
2023-02-09 10:35     ` Mikko Rapeli
2023-02-09 10:52       ` Richard Purdie
2023-02-09  8:09 ` [PATCH v2 3/8] oeqa ssh.py: add connection keep alive options to ssh client Mikko Rapeli
2023-02-09  8:09 ` [PATCH v2 4/8] oeqa dump.py: add error counter and stop after 5 failures Mikko Rapeli
2023-02-09  8:09 ` [PATCH v2 5/8] oeqa qemurunner: read more data at a time from serial Mikko Rapeli
2023-02-09  8:09 ` [PATCH v2 6/8] oeqa qemurunner.py: add timeout to QMP calls Mikko Rapeli
2023-02-09  8:09 ` [PATCH v2 7/8] oeqa qemurunner.py: kill qemu if it hangs Mikko Rapeli
2023-02-09  9:45   ` [OE-core] " Alexander Kanavin
2023-02-09  9:56     ` Mikko Rapeli
2023-02-09 10:01       ` Alexander Kanavin
2023-02-09  8:09 ` [PATCH v2 8/8] oeqa qemurunner.py: try to avoid reading one character at a time Mikko Rapeli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.