[PATCH 4/5] crash_watchdog.py: add generic crash watchdog

From: Luis Chamberlain <mcgrof@kernel.org>
To: Chuck Lever <cel@kernel.org>, Daniel Gomez <da.gomez@kruces.com>,
	kdevops@lists.linux.dev
Cc: Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH 4/5] crash_watchdog.py: add generic crash watchdog
Date: Sat, 19 Apr 2025 22:48:20 -0700
Message-ID: <20250420054822.533987-5-mcgrof@kernel.org>
In-Reply-To: <20250420054822.533987-1-mcgrof@kernel.org>

This can be used by any workflow. Specialized workflows can use the
library and customize it as they see fit to provide CIs more output.

It's easy to forget where the hell the kernel logs are, so this also
provides a symlink helper which can be used to get the kernel logs
from a host.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
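The symlink helper mentioned above relies on the script checking the name
it was invoked as. A minimal sketch of that pattern follows, assuming the
overrides map onto the --no-reset, --save-warnings and --full-log options;
the function name mode_from_argv0 and the returned dict are hypothetical
and exist only for illustration, they are not part of this patch.

```python
import os


def mode_from_argv0(argv0: str) -> dict:
    """Return option overrides implied by the name the script was run as.

    Hypothetical helper: the patch applies the overrides inline in main().
    """
    name = os.path.basename(argv0)
    if name == "get_console.py":
        # Log retrieval only: never reset the guest, skip warning capture,
        # and ask for the whole kernel log rather than just crash context.
        return {"no_reset": True, "save_warnings": False, "full_log": True}
    # Any other name keeps the defaults chosen on the command line.
    return {}


if __name__ == "__main__":
    import sys

    # A symlink keeps its own name in sys.argv[0], so one file serves both tools.
    print(mode_from_argv0(sys.argv[0]))
```

Keeping the name check in one place like this makes it easy to add further
symlinked entry points later without duplicating the script.
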
 scripts/workflows/generic/crash_watchdog.py | 186 ++++++++++++++++++++
 scripts/workflows/generic/get_console.py    |   1 +
 scripts/workflows/generic/lib               |   1 +
 3 files changed, 188 insertions(+)
 create mode 100755 scripts/workflows/generic/crash_watchdog.py
 create mode 120000 scripts/workflows/generic/get_console.py
 create mode 120000 scripts/workflows/generic/lib

diff --git a/scripts/workflows/generic/crash_watchdog.py b/scripts/workflows/generic/crash_watchdog.py
new file mode 100755
index 000000000000..3860de9d5592
--- /dev/null
+++ b/scripts/workflows/generic/crash_watchdog.py
@@ -0,0 +1,186 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: copyleft-next-0.3.1
+
+"""
+This script is intended to run as a kernel-ci agent. It monitors for crashes and
+kernel warnings and resets the host after capturing essential information.
+It can also be invoked as 'get_console.py' to retrieve the entire kernel log.
+"""
+
+import os
+import sys
+import subprocess
+import re
+import logging
+import argparse
+import yaml
+from datetime import datetime, timedelta
+from pathlib import Path
+from lib.crash import KernelCrashWatchdog
+
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger("crash_watchdog")
+
+def get_active_hosts():
+    """Get the list of active hosts from kdevops configuration."""
+    try:
+        # First try to get the hosts from the ansible inventory
+        result = subprocess.run(
+            ["ansible-inventory", "-i", "hosts", "--list"],
+            capture_output=True, text=True, check=True
+        )
+        inventory = yaml.safe_load(result.stdout)
+        hosts = inventory.get("baseline", {}).get("hosts", [])
+        return sorted(set(hosts))
+    except Exception as e:
+        logger.error(f"Error getting active hosts: {e}")
+        return []
+
+def run_crash_watchdog_on_host(args, this_host_name):
+    watchdog = KernelCrashWatchdog(
+        host_name=this_host_name,
+        output_dir=args.output_dir,
+        full_log=args.full_log,
+        decode_crash=not args.no_decode,
+        reset_host=not args.no_reset,
+        save_warnings=args.save_warnings,
+    )
+
+    crashed = False
+    warnings_found = False
+
+    crash_file, warning_file = watchdog.check_and_reset_host(method=args.method, get_fstests_log=args.fstests_log)
+
+    if warning_file:
+        logger.warning(f"Kernel warning detected and logged to {warning_file}")
+        warnings_found = True
+    elif args.save_warnings:
+        logger.info(f"No kernel warnings detected for host {this_host_name}")
+    if crash_file:
+        crashed = True
+        logger.warning(f"Crash detected and logged to {crash_file}")
+    else:
+        logger.info(f"No crash detected for host {this_host_name}")
+    return crashed, [crash_file], warnings_found, warning_file
+
+def run_crash_watchdog_all_hosts(args):
+    """Check all active hosts for kernel crashes."""
+    hosts = get_active_hosts()
+    crash_detected = False
+    crash_files = []
+    warnings_detected = False
+    warning_files = []
+
+    logger.info(
+        f"Checking {len(hosts)} hosts for kernel crashes: {', '.join(hosts)}"
+    )
+
+    for host in hosts:
+        host_crash_detected, host_crash_files, host_warnings_detected, warning_file = run_crash_watchdog_on_host(args, host)
+        if host_crash_detected and host_crash_files:
+            crash_detected = True
+            crash_files.extend(host_crash_files)  # the per-host result is a list
+            logger.info(f"Crash detected in host {host}, logs saved to {', '.join(map(str, host_crash_files))}")
+        if host_warnings_detected and warning_file:
+            warnings_detected = True
+            warning_files.append(warning_file)
+            logger.info(f"Kernel warning found on host {host}, logs saved to {warning_file}")
+
+    return crash_detected, crash_files, warnings_detected, warning_files
+
+def write_log_section(f, title, files, label):
+    f.write(f"# {title}\n\n")
+    for path in files:
+        f.write(f"- {label} detected: {path}\n")
+        try:
+            with open(path, "r") as content_file:
+                snippet = "".join(content_file.readlines()[:10]) + "\n...(truncated)..."
+                f.write("\n```\n" + snippet + "\n```\n\n")
+        except Exception as e:
+            f.write(f"\nError reading {label.lower()} file: {e}\n\n")
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Detect and handle kernel crashes or kernel warnings in hosts.",
+        epilog="""
+Examples:
+  Detect crashes on all hosts and reset those that crashed (default):
+    ./crash_watchdog.py
+
+  Detect a crash and reset only the e3-ext4-2k guest:
+    ./crash_watchdog.py --host-name e3-ext4-2k
+
+  Detect using systemd-remote journal and show full kernel log:
+    ./crash_watchdog.py --host-name e3-ext4-2k --method remote --full-log
+
+  Skip decoding and skip reset:
+    ./crash_watchdog.py --host-name e3-ext4-2k --no-decode --no-reset
+
+  Just fetch the full kernel log using the symlinked name:
+    ln -s crash_watchdog.py get_console.py
+    ./get_console.py --host-name e3-ext4-2k
+
+  Use guestfs console log and do not decode:
+    ./crash_watchdog.py --host-name e3-ext4-2k --method console --no-decode
+
+  Use SSH to query the live journalctl output:
+    ./crash_watchdog.py --host-name e3-ext4-2k --method ssh
+
+  Disable guest reset when using libvirt:
+    ./crash_watchdog.py --host-name e3-ext4-2k --no-reset
+
+  Print all kernel log lines for a specific fstests test:
+    ./crash_watchdog.py --host-name e3-ext4-2k --fstests-log generic/750
+
+  Get all kernel warnings only:
+    ./crash_watchdog.py --host-name e3-ext4-2k --method remote --save-warnings sad.warn
+        """,
+        formatter_class=argparse.RawTextHelpFormatter
+    )
+
+    parser.add_argument("--host-name", help="Optional name of the host to check", default="all")
+    parser.add_argument("--output-dir", help="Directory to store crash logs", default="crashes")
+    parser.add_argument(
+        "--method",
+        choices=["auto", "remote", "console", "ssh"],
+        default="auto",
+        help="Choose method to collect logs: auto, remote, console, or ssh"
+    )
+    parser.add_argument("--full-log", action="store_true", help="Get full kernel log instead of only crash context")
+    parser.add_argument("--no-decode", action="store_true", help="Disable decoding crash logs with decode_stacktrace.sh")
+    parser.add_argument("--no-reset", action="store_true", help="Do not reset the guest even if a crash is detected")
+    parser.add_argument("--fstests-log", help="Show all kernel log lines for a specific fstests test ID (e.g., generic/750)")
+    parser.add_argument("--save-warnings", help="Detect and save kernel warnings", default=True)
+    args = parser.parse_args()
+    crash_files = []
+    warnings_files = []
+
+    invoked_name = os.path.basename(sys.argv[0])
+    if invoked_name == "get_console.py":
+        args.no_reset = True
+        args.save_warnings = False
+        args.full_log = True
+
+    if args.host_name != "all":
+        crash_detected, crash_files, warnings_detected, warnings_files = run_crash_watchdog_on_host(args, args.host_name)
+    else:
+        crash_detected, crash_files, warnings_detected, warnings_files = run_crash_watchdog_all_hosts(args)
+
+    if warnings_detected:
+        logger.warning("Kernel warnings detected in one or more hosts")
+    else:
+        logger.info("No kernel warnings detected")
+
+    if crash_detected:
+        logger.warning("Kernel crashes detected in one or more hosts")
+        sys.exit(1)
+    else:
+        logger.info("No kernel crashes detected")
+        sys.exit(0)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/workflows/generic/get_console.py b/scripts/workflows/generic/get_console.py
new file mode 120000
index 000000000000..0169b0dd6188
--- /dev/null
+++ b/scripts/workflows/generic/get_console.py
@@ -0,0 +1 @@
+crash_watchdog.py
\ No newline at end of file
diff --git a/scripts/workflows/generic/lib b/scripts/workflows/generic/lib
new file mode 120000
index 000000000000..5bf80bf1392c
--- /dev/null
+++ b/scripts/workflows/generic/lib
@@ -0,0 +1 @@
+../lib/
\ No newline at end of file
-- 
2.47.2
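
Note on host discovery: get_active_hosts() in the patch reads the hosts of
the "baseline" group from `ansible-inventory --list`, which prints JSON by
default with one key per group. A minimal sketch of that lookup on canned
output follows; the SAMPLE inventory, the e3-ext4-4k host and the helper
name hosts_in_group are assumptions made up for illustration.

```python
import json

# Hypothetical, trimmed output of `ansible-inventory -i hosts --list`.
SAMPLE = """{
  "_meta": {"hostvars": {}},
  "all": {"children": ["ungrouped", "baseline"]},
  "baseline": {"hosts": ["e3-ext4-4k", "e3-ext4-2k", "e3-ext4-2k"]}
}"""


def hosts_in_group(inventory_json: str, group: str = "baseline") -> list:
    """Return the sorted, de-duplicated hosts of one inventory group."""
    inventory = json.loads(inventory_json)
    # A missing group or a group without "hosts" yields an empty list.
    hosts = inventory.get(group, {}).get("hosts", [])
    return sorted(set(hosts))


if __name__ == "__main__":
    print(hosts_in_group(SAMPLE))
```

An unknown group yields an empty list rather than an error, which matches
how the patch falls back to checking no hosts when discovery fails.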

Thread overview: 9+ messages
2025-04-20  5:48 [PATCH 0/5] crash: provide a crash watchdog Luis Chamberlain
2025-04-20  5:48 ` [PATCH 1/5] systemd-remote: use ip address for systemd-remote journal Luis Chamberlain
2025-04-20  5:48 ` [PATCH 2/5] crash: add kernel crash watchdog library Luis Chamberlain
2025-04-20  5:48 ` [PATCH 3/5] fstests_watchdog.py: use the new " Luis Chamberlain
2025-04-20  5:48 ` Luis Chamberlain [this message]
2025-04-20  5:48 ` [PATCH 5/5] crash_report.py: add a crash report Luis Chamberlain
2025-04-20 15:19 ` [PATCH 0/5] crash: provide a crash watchdog Chuck Lever
2025-04-21 23:16   ` Luis Chamberlain
2025-04-22  2:38     ` Luis Chamberlain
