Message-ID: <2ec3b7c8d69c7a5daae26918075247593009accf.camel@redhat.com>
Subject: Re: [EXTERNAL] [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush()
From: Viacheslav Dubeyko
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com
Date: Thu, 07 May 2026 12:01:07 -0700
In-Reply-To:
<20260507122737.2804094-5-amarkuze@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>
 <20260507122737.2804094-5-amarkuze@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
User-Agent: Evolution 3.60.0 (3.60.0-1.fc44app2)

On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
> wait loop that periodically dumps pending cap flush state.
>
> The underlying wait semantics remain intact: callers still wait until the
> requested cap flushes complete. The difference is that long stalls now
> produce actionable diagnostics instead of looking like a silent hang.
>
> CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries
> emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS
> limits the number of timed diagnostic dumps before the wait
> continues silently. When more entries exist than the per-dump
> limit, a truncation count is reported. When the dump iteration
> limit is reached, a final suppression message is emitted so the
> transition to silence is explicit.
>
> The diagnostic dump collects flush entry data under cap_dirty_lock into
> a bounded on-stack array, then prints after releasing the lock. This
> avoids holding the spinlock across printk calls.
>
> A null cf->ci on the global flush list indicates a bug since all
> cap_flush entries are initialized with a valid ci before being added.
> Signal this with WARN_ON_ONCE while still printing enough context for
> debugging.
>
> READ_ONCE is used for the i_last_cap_flush_ack field, which is read
> outside the inode lock domain. Flush tids are monotonically increasing
> and acks are processed in order under i_ceph_lock, so the latest ack
> tid is always the most recently written value.
>
> Add a ci pointer to struct ceph_cap_flush so that the diagnostic
> dump can identify which inode each pending flush belongs to. The
> new i_last_cap_flush_ack field tracks the latest acknowledged flush
> tid per inode for diagnostic correlation.
>
> This improves reset-drain observability and is also useful for
> existing sync and writeback troubleshooting paths.
>
> Signed-off-by: Alex Markuze
> ---
>  fs/ceph/caps.c       |  10 +++++
>  fs/ceph/inode.c      |   1 +
>  fs/ceph/mds_client.c | 100 +++++++++++++++++++++++++++++++++++++++++--
>  fs/ceph/mds_client.h |   3 ++
>  fs/ceph/super.h      |   6 +++
>  5 files changed, 116 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index cb9e78b713d9..4b37d9ffdf7f 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
>
>  	spin_lock(&mdsc->cap_dirty_lock);
>  	capsnap->cap_flush.tid = ++mdsc->last_cap_flush_tid;
> +	capsnap->cap_flush.ci = ci;
>  	list_add_tail(&capsnap->cap_flush.g_list,
>  		      &mdsc->cap_flush_list);
>  	if (oldest_flush_tid == 0)
> @@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
>  		return NULL;
>
>  	cf->is_capsnap = false;
> +	cf->ci = NULL;
>  	return cf;
>  }
>
> @@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
>  	doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
>
>  	swap(cf, ci->i_prealloc_cap_flush);
> +	cf->ci = ci;
>  	cf->caps = flushing;
>  	cf->wake = wake;
>
> @@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
>  	bool wake_ci = false;
>  	bool wake_mdsc = false;
>
> +	/*
> +	 * Flush tids are monotonically increasing and acks arrive in
> +	 * order under i_ceph_lock, so this is always the latest tid.
> +	 * Diagnostic readers use READ_ONCE() without holding the lock.
> +	 */
> +	WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid);
> +
>  	list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
>  		/* Is this the one that was flushed? */
>  		if (cf->tid == flush_tid)
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 4871d7ab2730..61d7c0b8161f 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -671,6 +671,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>  	INIT_LIST_HEAD(&ci->i_cap_snaps);
>  	ci->i_head_snapc = NULL;
>  	ci->i_snap_caps = 0;
> +	ci->i_last_cap_flush_ack = 0;
>
>  	ci->i_last_rd = ci->i_last_wr = jiffies - 3600 * HZ;
>  	for (i = 0; i < CEPH_FILE_MODE_BITS; i++)
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 249419c17d3c..6ab5031e697a 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2330,19 +2330,111 @@ static int check_caps_flush(struct ceph_mds_client *mdsc,
>  }
>
>  /*
> - * flush all dirty inode data to disk.
> + * Snapshot of a single cap_flush entry for diagnostic dump.
> + * Collected under cap_dirty_lock, printed after releasing it.
> + */
> +struct flush_dump_entry {
> +	u64 ino;		/* inode number */
> +	u64 snap;		/* snap id */
> +	int caps;		/* dirty cap bits */
> +	u64 tid;		/* flush transaction id */
> +	u64 last_ack;		/* most recent ack tid for this inode */
> +	bool wake;		/* whether completion was requested */
> +	bool is_capsnap;	/* true if this is a cap snap flush */
> +	bool ci_null;		/* true if cf->ci was unexpectedly NULL */
> +};
> +
> +/*
> + * Dump pending cap flushes for diagnostic purposes.
>   *
> - * returns true if we've flushed through want_flush_tid
> + * cf->ci is safe to dereference here: cap_flush entries hold a
> + * reference on the inode (via the cap), and entries are removed from
> + * cap_flush_list under cap_dirty_lock before the cap (and thus the
> + * inode reference) is released. Holding cap_dirty_lock therefore
> + * guarantees the inode remains valid for the lifetime of the scan.
> + */
> +
> +static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
> +{
> +	struct ceph_client *cl = mdsc->fsc->client;
> +	struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES];
> +	struct ceph_cap_flush *cf;
> +	int n = 0, remaining = 0;
> +
> +	spin_lock(&mdsc->cap_dirty_lock);
> +	list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
> +		if (cf->tid > want_tid)
> +			break;
> +		if (n < CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES) {
> +			struct flush_dump_entry *e = &entries[n++];
> +
> +			e->ci_null = WARN_ON_ONCE(!cf->ci);
> +			if (!e->ci_null) {
> +				e->ino = ceph_ino(&cf->ci->netfs.inode);
> +				e->snap = ceph_snap(&cf->ci->netfs.inode);
> +				e->last_ack = READ_ONCE(cf->ci->i_last_cap_flush_ack);
> +			}
> +			e->caps = cf->caps;
> +			e->tid = cf->tid;
> +			e->wake = cf->wake;
> +			e->is_capsnap = cf->is_capsnap;
> +		} else {
> +			remaining++;
> +		}

We don't need brackets here. :)

> +	}
> +	spin_unlock(&mdsc->cap_dirty_lock);
> +
> +	pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
> +		       want_tid);
> +	for (int i = 0; i < n; i++) {

I am slightly worried that we could have warnings because of this
declaration.

> +		struct flush_dump_entry *e = &entries[i];
> +
> +		if (e->ci_null)
> +			pr_info_client(cl,
> +				" (null ci) %s tid=%llu wake=%d%s\n",
> +				ceph_cap_string(e->caps), e->tid,
> +				e->wake,
> +				e->is_capsnap ? " is_capsnap" : "");
> +		else
> +			pr_info_client(cl,
> +				" %llx.%llx %s tid=%llu last_ack=%llu wake=%d%s\n",
> +				e->ino, e->snap,
> +				ceph_cap_string(e->caps), e->tid,
> +				e->last_ack, e->wake,
> +				e->is_capsnap ? " is_capsnap" : "");
> +	}
> +	if (remaining)
> +		pr_info_client(cl, " ... and %d more pending flushes\n",
> +			       remaining);
> +}
> +
> +/*
> + * Wait for all cap flushes through @want_flush_tid to complete.
> + * Periodically dumps pending cap flush state for diagnostics.
>   */
>  static void wait_caps_flush(struct ceph_mds_client *mdsc,
>  			    u64 want_flush_tid)
>  {
>  	struct ceph_client *cl = mdsc->fsc->client;
> +	int i = 0;
> +	long ret;
>
>  	doutc(cl, "want %llu\n", want_flush_tid);
>
> -	wait_event(mdsc->cap_flushing_wq,
> -		   check_caps_flush(mdsc, want_flush_tid));
> +	do {
> +		/* 60 * HZ fits in a long on all supported architectures. */
> +		ret = wait_event_timeout(mdsc->cap_flushing_wq,
> +				check_caps_flush(mdsc, want_flush_tid),
> +				CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
> +		if (ret == 0) {
> +			if (i < CEPH_CAP_FLUSH_MAX_DUMP_ITERS)
> +				dump_cap_flushes(mdsc, want_flush_tid);
> +			else if (i == CEPH_CAP_FLUSH_MAX_DUMP_ITERS)
> +				pr_info_client(cl,
> +					"still waiting for cap flushes; suppressing further dumps\n");
> +			i++;
> +		}
> +	} while (ret == 0);
>
>  	doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
>  }
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index d873e784b025..8208fdf02efe 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -77,6 +77,9 @@ struct ceph_fs_client;
>  struct ceph_cap;
>
>  #define MDS_AUTH_UID_ANY -1
> +#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
> +#define CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES 5
> +#define CEPH_CAP_FLUSH_MAX_DUMP_ITERS 5
>
>  struct ceph_mds_cap_match {
>  	s64 uid;  /* default to MDS_AUTH_UID_ANY */
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 8afc6f3a10da..a4993644d543 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -239,6 +239,7 @@ struct ceph_cap_flush {
>  	bool is_capsnap;	/* true means capsnap */
>  	struct list_head g_list; // global
>  	struct list_head i_list; // per inode
> +	struct ceph_inode_info *ci;
>  };
>
>  /*
> @@ -453,6 +454,11 @@ struct ceph_inode_info {
>  	struct ceph_snap_context *i_head_snapc;	/* set if wr_buffer_head > 0 or
>  						   dirty|flushing caps */
>  	unsigned i_snap_caps;	/* cap bits for snapped files */
> +	/*
> +	 * Written under i_ceph_lock, read via READ_ONCE()
> +	 * from diagnostic paths.
> +	 */
> +	u64 i_last_cap_flush_ack;
>
>  	unsigned long i_last_rd;
>  	unsigned long i_last_wr;

I have two minor remarks. The rest looks good.

Reviewed-by: Viacheslav Dubeyko

Thanks,
Slava.