Subject: Re: [EXTERNAL] [PATCH v4 07/11] selftests: ceph: add reset consistency checker
From: Viacheslav Dubeyko
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com
Date: Thu, 07 May 2026 12:24:30 -0700
In-Reply-To: <20260507122737.2804094-8-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>
 <20260507122737.2804094-8-amarkuze@redhat.com>

On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add a Python post-run validator for the CephFS client reset stress
> test. The script reads data files written by the stress runner and
> checks that every file was either written completely or is missing,
> with no partial or corrupted content.
>
> This is a prerequisite for the stress test script which invokes it
> after each run.
>
> Signed-off-by: Alex Markuze
> ---
>  .../filesystems/ceph/validate_consistency.py | 297 ++++++++++++++++++
>  1 file changed, 297 insertions(+)
>  create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
>
> diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
> new file mode 100755
> index 000000000000..c230a59bdb3a
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
> @@ -0,0 +1,297 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0
> +
> +import argparse
> +import bisect
> +import hashlib
> +import json
> +import os
> +from pathlib import Path
> +
> +
> +def sha256_file(path: Path) -> str:
> +    digest = hashlib.sha256()
> +    with path.open("rb") as handle:
> +        while True:
> +            chunk = handle.read(1 << 20)
> +            if not chunk:
> +                break
> +            digest.update(chunk)
> +    return digest.hexdigest()
> +
> +
> +def parse_io_log(path: Path):
> +    records = []
> +    if not path.exists():
> +        return records
> +    with path.open("r", encoding="utf-8") as handle:
> +        for line_no, line in enumerate(handle, 1):
> +            line = line.strip()
> +            if not line:
> +                continue
> +            parts = line.split(",")
> +            if len(parts) != 5:
> +                raise ValueError(f"io log line {line_no}: expected 5 columns, got {len(parts)}")
> +            ts_ms, seq, logical_id, relpath, digest = parts
> +            records.append(
> +                {
> +                    "ts_ms": int(ts_ms),
> +                    "seq": int(seq),
> +                    "logical_id": int(logical_id),
> +                    "relpath": relpath,
> +                    "digest": digest,
> +                }
> +            )
> +    return records
> +
> +
> +def parse_rename_log(path: Path):
> +    records = []
> +    if not path.exists():
> +        return records
> +    with path.open("r", encoding="utf-8") as handle:
> +        for line_no, line in enumerate(handle, 1):
> +            line = line.strip()
> +            if not line:
> +                continue
> +            parts = line.split(",")
> +            if len(parts) == 6:
> +                ts_ms, seq, logical_id, src_rel, dst_rel, rc = parts
> +            elif len(parts) == 7:
> +                ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc = parts
> +                _ = worker_id  # worker id is informational only
> +            else:
> +                raise ValueError(
> +                    f"rename log line {line_no}: expected 6 or 7 columns, got {len(parts)}"
> +                )
> +            records.append(
> +                {
> +                    "ts_ms": int(ts_ms),
> +                    "seq": int(seq),
> +                    "logical_id": int(logical_id),
> +                    "src_rel": src_rel,
> +                    "dst_rel": dst_rel,
> +                    "rc": int(rc),
> +                }
> +            )
> +    return records
> +
> +
> +def parse_reset_log(path: Path):
> +    records = []
> +    if not path.exists():
> +        return records
> +    with path.open("r", encoding="utf-8") as handle:
> +        for line_no, line in enumerate(handle, 1):
> +            line = line.strip()
> +            if not line:
> +                continue
> +            parts = line.split(",")
> +            if len(parts) != 4:
> +                raise ValueError(f"reset log line {line_no}: expected 4 columns, got {len(parts)}")
> +            ts_ms, seq, reason, rc = parts
> +            records.append(
> +                {
> +                    "ts_ms": int(ts_ms),
> +                    "seq": int(seq),
> +                    "reason": reason,
> +                    "rc": int(rc),
> +                }
> +            )
> +    return records
> +
> +
> +def parse_status_file(path: Path):
> +    status = {}
> +    if not path.exists():
> +        return status
> +    with path.open("r", encoding="utf-8") as handle:
> +        for line in handle:
> +            line = line.strip()
> +            if not line or ":" not in line:
> +                continue
> +            key, value = line.split(":", 1)
> +            status[key.strip()] = value.strip()
> +    return status
> +
> +
> +def to_int(value: str, default: int = 0):
> +    try:
> +        return int(value)
> +    except Exception:
> +        return default
> +
> +
> +def validate_namespace(data_dir: Path, file_count: int, issues):
> +    actual_locations = {}
> +    actual_paths = {}
> +    for logical_id in range(file_count):
> +        name = f"file_{logical_id:05d}"
> +        found = []
> +        for subdir in ("A", "B"):
> +            candidate = data_dir / subdir / name
> +            if candidate.exists():
> +                found.append((subdir, candidate))
> +        if len(found) != 1:
> +            issues.append(
> +                f"namespace invariant failed for logical_id={logical_id:05d}: expected exactly one file in A/B, found {len(found)}"
> +            )
> +            continue
> +        actual_locations[logical_id] = found[0][0]
> +        actual_paths[logical_id] = found[0][1]
> +    return actual_locations, actual_paths
> +
> +
> +def validate_rename_invariant(rename_records, actual_locations, issues):
> +    expected_locations = {}
> +    for rec in rename_records:
> +        if rec["rc"] != 0:
> +            continue
> +        dst = rec["dst_rel"]
> +        if "/" not in dst:
> +            continue
> +        expected_locations[rec["logical_id"]] = dst.split("/", 1)[0]
> +
> +    for logical_id, expected in expected_locations.items():
> +        actual = actual_locations.get(logical_id)
> +        if actual is None:
> +            continue
> +        if actual != expected:
> +            issues.append(
> +                f"rename invariant failed for logical_id={logical_id:05d}: expected location={expected}, actual={actual}"
> +            )
> +
> +
> +def validate_data_invariant(io_records, actual_paths, issues):
> +    expected_hash = {}
> +    for rec in io_records:
> +        digest = rec["digest"]
> +        if not digest:
> +            continue
> +        expected_hash[rec["logical_id"]] = digest
> +
> +    for logical_id, digest in expected_hash.items():
> +        path = actual_paths.get(logical_id)
> +        if path is None:
> +            continue
> +        actual_digest = sha256_file(path)
> +        if digest != actual_digest:
> +            issues.append(
> +                f"data invariant failed for logical_id={logical_id:05d}: expected digest={digest}, actual digest={actual_digest}"
> +            )
> +
> +
> +def validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues):
> +    if not args.expect_reset:
> +        return
> +
> +    successful_reset_times = [rec["ts_ms"] for rec in reset_records if rec["rc"] == 0]
> +    if not successful_reset_times:
> +        issues.append("expected reset activity but no successful reset trigger was observed")
> +
> +    phase = status.get("phase")
> +    blocked_requests = to_int(status.get("blocked_requests", "0"), default=-1)
> +    last_errno = to_int(status.get("last_errno", "0"), default=1)
> +    failure_count = to_int(status.get("failure_count", "0"), default=-1)
> +
> +    if phase is None:
> +        issues.append("missing final reset status file or phase field")
> +    elif phase.lower() != "idle":
> +        issues.append(f"recovery invariant failed: phase={phase}, expected idle")
> +
> +    if blocked_requests != 0:
> +        issues.append(f"recovery invariant failed: blocked_requests={blocked_requests}, expected 0")
> +    if last_errno != 0:
> +        issues.append(f"recovery invariant failed: last_errno={last_errno}, expected 0")
> +    if failure_count > 0:
> +        issues.append(
> +            f"recovery invariant failed: failure_count={failure_count}, "
> +            "one or more resets failed during the run"
> +        )
> +
> +    op_times = [rec["ts_ms"] for rec in io_records]
> +    op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] == 0)
> +    op_times.sort()
> +
> +    if successful_reset_times and not op_times:
> +        issues.append("recovery SLO failed: no workload completion events were recorded")
> +        return
> +
> +    slo_ms = args.slo_seconds * 1000
> +    for ts in successful_reset_times:
> +        idx = bisect.bisect_left(op_times, ts)
> +        if idx >= len(op_times):
> +            issues.append(f"recovery SLO failed: no operation completion observed after reset at ts_ms={ts}")
> +            continue
> +        delta = op_times[idx] - ts
> +        if delta > slo_ms:
> +            issues.append(
> +                f"recovery SLO failed: first post-reset completion at {delta}ms exceeds threshold {slo_ms}ms (reset ts_ms={ts})"
> +            )
> +
> +
> +def main():
> +    parser = argparse.ArgumentParser(description="Validate Ceph reset stress artifacts")
> +    parser.add_argument("--data-dir", required=True)
> +    parser.add_argument("--file-count", required=True, type=int)
> +    parser.add_argument("--io-log", required=True)
> +    parser.add_argument("--rename-log", required=True)
> +    parser.add_argument("--reset-log", required=True)
> +    parser.add_argument("--status-final", required=False, default="")
> +    parser.add_argument("--slo-seconds", required=False, type=int, default=30)
> +    parser.add_argument("--expect-reset", action="store_true")
> +    parser.add_argument("--report-json", required=False, default="")
> +    args = parser.parse_args()
> +
> +    data_dir = Path(args.data_dir)
> +    io_log = Path(args.io_log)
> +    rename_log = Path(args.rename_log)
> +    reset_log = Path(args.reset_log)
> +    status_final = Path(args.status_final) if args.status_final else Path("__missing_status__")
> +
> +    issues = []
> +
> +    if not data_dir.exists():
> +        issues.append(f"data directory is missing: {data_dir}")
> +
> +    try:
> +        io_records = parse_io_log(io_log)
> +        rename_records = parse_rename_log(rename_log)
> +        reset_records = parse_reset_log(reset_log)
> +    except Exception as exc:
> +        issues.append(f"log parsing failed: {exc}")
> +        io_records = []
> +        rename_records = []
> +        reset_records = []
> +
> +    status = parse_status_file(status_final)
> +
> +    actual_locations, actual_paths = validate_namespace(data_dir, args.file_count, issues)
> +    validate_rename_invariant(rename_records, actual_locations, issues)
> +    validate_data_invariant(io_records, actual_paths, issues)
> +    validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues)
> +
> +    report = {
> +        "file_count": args.file_count,
> +        "io_records": len(io_records),
> +        "rename_records": len(rename_records),
> +        "reset_records": len(reset_records),
> +        "expect_reset": args.expect_reset,
> +        "issues": issues,
> +    }
> +
> +    if args.report_json:
> +        report_path = Path(args.report_json)
> +        report_path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
> +
> +    if issues:
> +        print("FAIL: consistency validation found issues")
> +        for issue in issues:
> +            print(f"  - {issue}")
> +        raise SystemExit(1)
> +
> +    print("PASS: consistency validation succeeded")
> +
> +
> +if __name__ == "__main__":
> +    main()

Reviewed-by: Viacheslav Dubeyko

Thanks,
Slava.
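[Editor's note] For readers outside the thread, a minimal sketch of the log format and the recovery-SLO check the validator performs. Field names follow the patch's parse_io_log() and validate_reset_and_slo(); the timestamps and digest value are invented for illustration.

```python
import bisect

# io log: one CSV record per completed write, "ts_ms,seq,logical_id,relpath,digest"
io_line = "1000,1,7,A/file_00007,0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33"
ts_ms, seq, logical_id, relpath, digest = io_line.split(",")
assert int(logical_id) == 7

# recovery SLO: the first workload completion after a successful reset
# trigger must land within slo_ms of the reset timestamp
op_times = [900, 1200, 5000]   # sorted completion timestamps (ms)
reset_ts = 1000                # timestamp of a successful reset trigger (ms)
slo_ms = 30 * 1000             # default --slo-seconds is 30

idx = bisect.bisect_left(op_times, reset_ts)   # first completion at/after reset
delta = op_times[idx] - reset_ts               # 1200 - 1000 = 200 ms
print(delta <= slo_ms)                         # True
```

bisect_left on the sorted completion times mirrors how the patch finds the first post-reset operation without scanning the whole log.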
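[Editor's note] The namespace invariant checked by validate_namespace() — each logical file must exist in exactly one of the A/ and B/ rename targets — can be sketched in isolation. The temporary directory layout below is constructed purely for illustration and is not part of the patch.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "A").mkdir()
    (data_dir / "B").mkdir()
    (data_dir / "A" / "file_00000").write_text("x")  # ok: exactly one copy
    (data_dir / "A" / "file_00001").write_text("x")  # violation: present in
    (data_dir / "B" / "file_00001").write_text("x")  # both A/ and B/

    issues = []
    for logical_id in range(2):
        name = f"file_{logical_id:05d}"
        found = [s for s in ("A", "B") if (data_dir / s / name).exists()]
        if len(found) != 1:
            issues.append(f"{name}: found in {len(found)} locations")
    print(issues)  # ['file_00001: found in 2 locations']
```

A file seen in both directories (or neither) means a rename was torn by the reset, which is exactly what the validator flags.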