linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] ceph: fix generic/421 test failure
@ 2025-02-05  0:02 Viacheslav Dubeyko
  2025-02-05  0:02 ` [RFC PATCH 1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring Viacheslav Dubeyko
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-05  0:02 UTC (permalink / raw)
  To: ceph-devel
  Cc: idryomov, dhowells, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko, slava

From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The generic/421 test fails to finish because of the following issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait for the flushing of all
dirty folios/pages to finish because of the racy nature
of mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) call
fails to increment mdsc->stopping_blockers. As a result,
already locked folios/pages are never unlocked and the
logic tries to lock the same page a second time.
(3) The folio_batch of dirty pages found by
filemap_get_folios_tag() is not processed properly.
This is why some dirty pages are simply never processed
and we still have dirty folios/pages after unmount.

This patchset refactors the ceph_writepages_start()
method and fixes the issues by means of:
(1) introducing a dirty_folios counter and a flush_end_wq
waiting queue in struct ceph_mds_client;
(2) incrementing the dirty_folios counter in
ceph_dirty_folio();
(3) decrementing the dirty_folios counter in
writepages_finish() and waking up all waiters on the
queue if the dirty_folios counter is equal to or less
than zero;
(4) adding logic to the ceph_kill_sb() method that checks
the value of the dirty_folios counter and waits if it is
bigger than zero;
(5) adding a ceph_inc_osd_stopping_blocker() call at the
beginning of ceph_writepages_start() and a
ceph_dec_osd_stopping_blocker() call at the end of
ceph_writepages_start() with the goal of resolving
the racy nature of mdsc->stopping_blockers.
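Steps (1)-(4) above can be sketched as a minimal userspace model
(the real patch uses kernel primitives such as atomic_t and a
wait_queue_head_t; the names below mirror the description and are
illustrative, not the actual patch code):

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_long dirty_folios;            /* models mdsc->dirty_folios */
static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t flush_end_wq = PTHREAD_COND_INITIALIZER;

/* step (2): ceph_dirty_folio() bumps the counter for each dirtied folio */
static void model_dirty_folio(void)
{
	atomic_fetch_add(&dirty_folios, 1);
}

/* step (3): writepages_finish() drops the counter and wakes all
 * waiters once it reaches zero (or below) */
static void model_writepages_finish(void)
{
	if (atomic_fetch_sub(&dirty_folios, 1) - 1 <= 0) {
		pthread_mutex_lock(&flush_lock);
		pthread_cond_broadcast(&flush_end_wq);
		pthread_mutex_unlock(&flush_lock);
	}
}

/* step (4): ceph_kill_sb() blocks until every dirty folio has been
 * flushed, instead of tearing the mount down too early */
static void model_kill_sb_wait(void)
{
	pthread_mutex_lock(&flush_lock);
	while (atomic_load(&dirty_folios) > 0)
		pthread_cond_wait(&flush_end_wq, &flush_lock);
	pthread_mutex_unlock(&flush_lock);
}
```

The key property is that the unmount path waits on the counter
rather than on the racy stopping_blockers state, so it cannot
observe "all flushed" while writeback requests are still in flight.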

sudo ./check generic/421
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch

generic/421 7s ...  4s
Ran: generic/421
Passed all 1 tests

Viacheslav Dubeyko (4):
  ceph: extend ceph_writeback_ctl for ceph_writepages_start()
    refactoring
  ceph: introduce ceph_process_folio_batch() method
  ceph: introduce ceph_submit_write() method
  ceph: fix generic/421 test failure

 fs/ceph/addr.c       | 1110 +++++++++++++++++++++++++++---------------
 fs/ceph/mds_client.c |    2 +
 fs/ceph/mds_client.h |    3 +
 fs/ceph/super.c      |   11 +
 4 files changed, 746 insertions(+), 380 deletions(-)

-- 
2.48.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
@ 2025-02-05  0:02 ` Viacheslav Dubeyko
  2025-02-05  0:02 ` [RFC PATCH 2/4] ceph: introduce ceph_process_folio_batch() method Viacheslav Dubeyko
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-05  0:02 UTC (permalink / raw)
  To: ceph-devel
  Cc: idryomov, dhowells, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko, slava

From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The ceph_writepages_start() method is unreasonably huge
and has complex logic that makes it hard to understand.
The current state of the method's logic makes bug fixing
a really hard task. This patch extends struct
ceph_writeback_ctl with the goal of making the
ceph_writepages_start() method more compact and easier
to understand by means of deep refactoring.

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
---
 fs/ceph/addr.c | 487 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 302 insertions(+), 185 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f5224a566b69..d002ff62d867 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -568,7 +568,36 @@ struct ceph_writeback_ctl
 	u64 truncate_size;
 	u32 truncate_seq;
 	bool size_stable;
+
 	bool head_snapc;
+	struct ceph_snap_context *snapc;
+	struct ceph_snap_context *last_snapc;
+
+	bool done;
+	bool should_loop;
+	bool range_whole;
+	pgoff_t start_index;
+	pgoff_t index;
+	pgoff_t end;
+	xa_mark_t tag;
+
+	pgoff_t strip_unit_end;
+	unsigned int wsize;
+	unsigned int nr_folios;
+	unsigned int max_pages;
+	unsigned int locked_pages;
+
+	int op_idx;
+	int num_ops;
+	u64 offset;
+	u64 len;
+
+	struct folio_batch fbatch;
+	unsigned int processed_in_fbatch;
+
+	bool from_pool;
+	struct page **pages;
+	struct page **data_pages;
 };
 
 /*
@@ -949,6 +978,74 @@ static void writepages_finish(struct ceph_osd_request *req)
 	ceph_dec_osd_stopping_blocker(fsc->mdsc);
 }
 
+static inline
+unsigned int ceph_define_write_size(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	unsigned int wsize = i_blocksize(inode);
+
+	if (fsc->mount_options->wsize < wsize)
+		wsize = fsc->mount_options->wsize;
+
+	return wsize;
+}
+
+static inline
+void ceph_folio_batch_init(struct ceph_writeback_ctl *ceph_wbc)
+{
+	folio_batch_init(&ceph_wbc->fbatch);
+	ceph_wbc->processed_in_fbatch = 0;
+}
+
+static inline
+void ceph_folio_batch_reinit(struct ceph_writeback_ctl *ceph_wbc)
+{
+	folio_batch_release(&ceph_wbc->fbatch);
+	ceph_folio_batch_init(ceph_wbc);
+}
+
+static inline
+void ceph_init_writeback_ctl(struct address_space *mapping,
+			     struct writeback_control *wbc,
+			     struct ceph_writeback_ctl *ceph_wbc)
+{
+	ceph_wbc->snapc = NULL;
+	ceph_wbc->last_snapc = NULL;
+
+	ceph_wbc->strip_unit_end = 0;
+	ceph_wbc->wsize = ceph_define_write_size(mapping);
+
+	ceph_wbc->nr_folios = 0;
+	ceph_wbc->max_pages = 0;
+	ceph_wbc->locked_pages = 0;
+
+	ceph_wbc->done = false;
+	ceph_wbc->should_loop = false;
+	ceph_wbc->range_whole = false;
+
+	ceph_wbc->start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
+	ceph_wbc->index = ceph_wbc->start_index;
+	ceph_wbc->end = -1;
+
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
+		ceph_wbc->tag = PAGECACHE_TAG_TOWRITE;
+	} else {
+		ceph_wbc->tag = PAGECACHE_TAG_DIRTY;
+	}
+
+	ceph_wbc->op_idx = -1;
+	ceph_wbc->num_ops = 0;
+	ceph_wbc->offset = 0;
+	ceph_wbc->len = 0;
+	ceph_wbc->from_pool = false;
+
+	ceph_folio_batch_init(ceph_wbc);
+
+	ceph_wbc->pages = NULL;
+	ceph_wbc->data_pages = NULL;
+}
+
 /*
  * initiate async writeback
  */
@@ -960,17 +1057,11 @@ static int ceph_writepages_start(struct address_space *mapping,
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
 	struct ceph_client *cl = fsc->client;
 	struct ceph_vino vino = ceph_vino(inode);
-	pgoff_t index, start_index, end = -1;
-	struct ceph_snap_context *snapc = NULL, *last_snapc = NULL, *pgsnapc;
-	struct folio_batch fbatch;
-	int rc = 0;
-	unsigned int wsize = i_blocksize(inode);
-	struct ceph_osd_request *req = NULL;
+	struct ceph_snap_context *pgsnapc;
 	struct ceph_writeback_ctl ceph_wbc;
-	bool should_loop, range_whole = false;
-	bool done = false;
+	struct ceph_osd_request *req = NULL;
+	int rc = 0;
 	bool caching = ceph_is_cache_enabled(inode);
-	xa_mark_t tag;
 
 	if (wbc->sync_mode == WB_SYNC_NONE &&
 	    fsc->write_congested)
@@ -989,86 +1080,78 @@ static int ceph_writepages_start(struct address_space *mapping,
 		mapping_set_error(mapping, -EIO);
 		return -EIO; /* we're in a forced umount, don't write! */
 	}
-	if (fsc->mount_options->wsize < wsize)
-		wsize = fsc->mount_options->wsize;
 
-	folio_batch_init(&fbatch);
+	ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
 
-	start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
-	index = start_index;
-
-	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) {
-		tag = PAGECACHE_TAG_TOWRITE;
-	} else {
-		tag = PAGECACHE_TAG_DIRTY;
-	}
 retry:
 	/* find oldest snap context with dirty data */
-	snapc = get_oldest_context(inode, &ceph_wbc, NULL);
-	if (!snapc) {
+	ceph_wbc.snapc = get_oldest_context(inode, &ceph_wbc, NULL);
+	if (!ceph_wbc.snapc) {
 		/* hmm, why does writepages get called when there
 		   is no dirty data? */
 		doutc(cl, " no snap context with dirty data?\n");
 		goto out;
 	}
-	doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n", snapc,
-	      snapc->seq, snapc->num_snaps);
+	doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
+	      ceph_wbc.snapc, ceph_wbc.snapc->seq,
+	      ceph_wbc.snapc->num_snaps);
 
-	should_loop = false;
-	if (ceph_wbc.head_snapc && snapc != last_snapc) {
+	ceph_wbc.should_loop = false;
+	if (ceph_wbc.head_snapc && ceph_wbc.snapc != ceph_wbc.last_snapc) {
 		/* where to start/end? */
 		if (wbc->range_cyclic) {
-			index = start_index;
-			end = -1;
-			if (index > 0)
-				should_loop = true;
-			doutc(cl, " cyclic, start at %lu\n", index);
+			ceph_wbc.index = ceph_wbc.start_index;
+			ceph_wbc.end = -1;
+			if (ceph_wbc.index > 0)
+				ceph_wbc.should_loop = true;
+			doutc(cl, " cyclic, start at %lu\n", ceph_wbc.index);
 		} else {
-			index = wbc->range_start >> PAGE_SHIFT;
-			end = wbc->range_end >> PAGE_SHIFT;
+			ceph_wbc.index = wbc->range_start >> PAGE_SHIFT;
+			ceph_wbc.end = wbc->range_end >> PAGE_SHIFT;
 			if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-				range_whole = true;
-			doutc(cl, " not cyclic, %lu to %lu\n", index, end);
+				ceph_wbc.range_whole = true;
+			doutc(cl, " not cyclic, %lu to %lu\n",
+			      ceph_wbc.index, ceph_wbc.end);
 		}
 	} else if (!ceph_wbc.head_snapc) {
 		/* Do not respect wbc->range_{start,end}. Dirty pages
 		 * in that range can be associated with newer snapc.
 		 * They are not writeable until we write all dirty pages
 		 * associated with 'snapc' get written */
-		if (index > 0)
-			should_loop = true;
+		if (ceph_wbc.index > 0)
+			ceph_wbc.should_loop = true;
 		doutc(cl, " non-head snapc, range whole\n");
 	}
 
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
-		tag_pages_for_writeback(mapping, index, end);
+		tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
 
-	ceph_put_snap_context(last_snapc);
-	last_snapc = snapc;
+	ceph_put_snap_context(ceph_wbc.last_snapc);
+	ceph_wbc.last_snapc = ceph_wbc.snapc;
 
-	while (!done && index <= end) {
-		int num_ops = 0, op_idx;
-		unsigned i, nr_folios, max_pages, locked_pages = 0;
-		struct page **pages = NULL, **data_pages;
+	while (!ceph_wbc.done && ceph_wbc.index <= ceph_wbc.end) {
+		unsigned i;
 		struct page *page;
-		pgoff_t strip_unit_end = 0;
-		u64 offset = 0, len = 0;
-		bool from_pool = false;
 
-		max_pages = wsize >> PAGE_SHIFT;
+		ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
 
 get_more_pages:
-		nr_folios = filemap_get_folios_tag(mapping, &index,
-						   end, tag, &fbatch);
-		doutc(cl, "pagevec_lookup_range_tag got %d\n", nr_folios);
-		if (!nr_folios && !locked_pages)
+		ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
+							    &ceph_wbc.index,
+							    ceph_wbc.end,
+							    ceph_wbc.tag,
+							    &ceph_wbc.fbatch);
+		doutc(cl, "pagevec_lookup_range_tag got %d\n",
+			ceph_wbc.nr_folios);
+		if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
 			break;
-		for (i = 0; i < nr_folios && locked_pages < max_pages; i++) {
-			struct folio *folio = fbatch.folios[i];
+		for (i = 0; i < ceph_wbc.nr_folios &&
+			    ceph_wbc.locked_pages < ceph_wbc.max_pages; i++) {
+			struct folio *folio = ceph_wbc.fbatch.folios[i];
 
 			page = &folio->page;
 			doutc(cl, "? %p idx %lu\n", page, page->index);
-			if (locked_pages == 0)
+			if (ceph_wbc.locked_pages == 0)
 				lock_page(page);  /* first page */
 			else if (!trylock_page(page))
 				break;
@@ -1082,13 +1165,14 @@ static int ceph_writepages_start(struct address_space *mapping,
 			}
 			/* only if matching snap context */
 			pgsnapc = page_snap_context(page);
-			if (pgsnapc != snapc) {
+			if (pgsnapc != ceph_wbc.snapc) {
 				doutc(cl, "page snapc %p %lld != oldest %p %lld\n",
-				      pgsnapc, pgsnapc->seq, snapc, snapc->seq);
-				if (!should_loop &&
+				      pgsnapc, pgsnapc->seq,
+				      ceph_wbc.snapc, ceph_wbc.snapc->seq);
+				if (!ceph_wbc.should_loop &&
 				    !ceph_wbc.head_snapc &&
 				    wbc->sync_mode != WB_SYNC_NONE)
-					should_loop = true;
+					ceph_wbc.should_loop = true;
 				unlock_page(page);
 				continue;
 			}
@@ -1103,7 +1187,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 				folio_unlock(folio);
 				continue;
 			}
-			if (strip_unit_end && (page->index > strip_unit_end)) {
+			if (ceph_wbc.strip_unit_end &&
+			    (page->index > ceph_wbc.strip_unit_end)) {
 				doutc(cl, "end of strip unit %p\n", page);
 				unlock_page(page);
 				break;
@@ -1132,47 +1217,52 @@ static int ceph_writepages_start(struct address_space *mapping,
 			 * calculate max possinle write size and
 			 * allocate a page array
 			 */
-			if (locked_pages == 0) {
+			if (ceph_wbc.locked_pages == 0) {
 				u64 objnum;
 				u64 objoff;
 				u32 xlen;
 
 				/* prepare async write request */
-				offset = (u64)page_offset(page);
+				ceph_wbc.offset = (u64)page_offset(page);
 				ceph_calc_file_object_mapping(&ci->i_layout,
-							      offset, wsize,
+							      ceph_wbc.offset,
+							      ceph_wbc.wsize,
 							      &objnum, &objoff,
 							      &xlen);
-				len = xlen;
+				ceph_wbc.len = xlen;
 
-				num_ops = 1;
-				strip_unit_end = page->index +
-					((len - 1) >> PAGE_SHIFT);
+				ceph_wbc.num_ops = 1;
+				ceph_wbc.strip_unit_end = page->index +
+					((ceph_wbc.len - 1) >> PAGE_SHIFT);
 
-				BUG_ON(pages);
-				max_pages = calc_pages_for(0, (u64)len);
-				pages = kmalloc_array(max_pages,
-						      sizeof(*pages),
+				BUG_ON(ceph_wbc.pages);
+				ceph_wbc.max_pages =
+					calc_pages_for(0, (u64)ceph_wbc.len);
+				ceph_wbc.pages = kmalloc_array(ceph_wbc.max_pages,
+						      sizeof(*ceph_wbc.pages),
 						      GFP_NOFS);
-				if (!pages) {
-					from_pool = true;
-					pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
-					BUG_ON(!pages);
+				if (!ceph_wbc.pages) {
+					ceph_wbc.from_pool = true;
+					ceph_wbc.pages =
+						mempool_alloc(ceph_wb_pagevec_pool,
+								GFP_NOFS);
+					BUG_ON(!ceph_wbc.pages);
 				}
 
-				len = 0;
+				ceph_wbc.len = 0;
 			} else if (page->index !=
-				   (offset + len) >> PAGE_SHIFT) {
-				if (num_ops >= (from_pool ?  CEPH_OSD_SLAB_OPS :
+				   (ceph_wbc.offset + ceph_wbc.len) >> PAGE_SHIFT) {
+				if (ceph_wbc.num_ops >=
+				    (ceph_wbc.from_pool ?  CEPH_OSD_SLAB_OPS :
 							     CEPH_OSD_MAX_OPS)) {
 					redirty_page_for_writepage(wbc, page);
 					unlock_page(page);
 					break;
 				}
 
-				num_ops++;
-				offset = (u64)page_offset(page);
-				len = 0;
+				ceph_wbc.num_ops++;
+				ceph_wbc.offset = (u64)page_offset(page);
+				ceph_wbc.len = 0;
 			}
 
 			/* note position of first page in fbatch */
@@ -1185,78 +1275,85 @@ static int ceph_writepages_start(struct address_space *mapping,
 				fsc->write_congested = true;
 
 			if (IS_ENCRYPTED(inode)) {
-				pages[locked_pages] =
+				ceph_wbc.pages[ceph_wbc.locked_pages] =
 					fscrypt_encrypt_pagecache_blocks(page,
 						PAGE_SIZE, 0,
-						locked_pages ? GFP_NOWAIT : GFP_NOFS);
-				if (IS_ERR(pages[locked_pages])) {
-					if (PTR_ERR(pages[locked_pages]) == -EINVAL)
+						ceph_wbc.locked_pages ?
+							GFP_NOWAIT : GFP_NOFS);
+				if (IS_ERR(ceph_wbc.pages[ceph_wbc.locked_pages])) {
+					if (PTR_ERR(ceph_wbc.pages[ceph_wbc.locked_pages]) == -EINVAL)
 						pr_err_client(cl,
 							"inode->i_blkbits=%hhu\n",
 							inode->i_blkbits);
 					/* better not fail on first page! */
-					BUG_ON(locked_pages == 0);
-					pages[locked_pages] = NULL;
+					BUG_ON(ceph_wbc.locked_pages == 0);
+					ceph_wbc.pages[ceph_wbc.locked_pages] = NULL;
 					redirty_page_for_writepage(wbc, page);
 					unlock_page(page);
 					break;
 				}
-				++locked_pages;
+				++ceph_wbc.locked_pages;
 			} else {
-				pages[locked_pages++] = page;
+				ceph_wbc.pages[ceph_wbc.locked_pages++] = page;
 			}
 
-			fbatch.folios[i] = NULL;
-			len += thp_size(page);
+			ceph_wbc.fbatch.folios[i] = NULL;
+			ceph_wbc.len += thp_size(page);
 		}
 
 		/* did we get anything? */
-		if (!locked_pages)
+		if (!ceph_wbc.locked_pages)
 			goto release_folios;
 		if (i) {
 			unsigned j, n = 0;
 			/* shift unused page to beginning of fbatch */
-			for (j = 0; j < nr_folios; j++) {
-				if (!fbatch.folios[j])
+			for (j = 0; j < ceph_wbc.nr_folios; j++) {
+				if (!ceph_wbc.fbatch.folios[j])
 					continue;
-				if (n < j)
-					fbatch.folios[n] = fbatch.folios[j];
+				if (n < j) {
+					ceph_wbc.fbatch.folios[n] =
+						ceph_wbc.fbatch.folios[j];
+				}
 				n++;
 			}
-			fbatch.nr = n;
+			ceph_wbc.fbatch.nr = n;
 
-			if (nr_folios && i == nr_folios &&
-			    locked_pages < max_pages) {
+			if (ceph_wbc.nr_folios && i == ceph_wbc.nr_folios &&
+			    ceph_wbc.locked_pages < ceph_wbc.max_pages) {
 				doutc(cl, "reached end fbatch, trying for more\n");
-				folio_batch_release(&fbatch);
+				folio_batch_release(&ceph_wbc.fbatch);
 				goto get_more_pages;
 			}
 		}
 
 new_request:
-		offset = ceph_fscrypt_page_offset(pages[0]);
-		len = wsize;
+		ceph_wbc.offset = ceph_fscrypt_page_offset(ceph_wbc.pages[0]);
+		ceph_wbc.len = ceph_wbc.wsize;
 
 		req = ceph_osdc_new_request(&fsc->client->osdc,
 					&ci->i_layout, vino,
-					offset, &len, 0, num_ops,
+					ceph_wbc.offset, &ceph_wbc.len,
+					0, ceph_wbc.num_ops,
 					CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
-					snapc, ceph_wbc.truncate_seq,
+					ceph_wbc.snapc, ceph_wbc.truncate_seq,
 					ceph_wbc.truncate_size, false);
 		if (IS_ERR(req)) {
 			req = ceph_osdc_new_request(&fsc->client->osdc,
 						&ci->i_layout, vino,
-						offset, &len, 0,
-						min(num_ops,
+						ceph_wbc.offset, &ceph_wbc.len,
+						0, min(ceph_wbc.num_ops,
 						    CEPH_OSD_SLAB_OPS),
 						CEPH_OSD_OP_WRITE,
 						CEPH_OSD_FLAG_WRITE,
-						snapc, ceph_wbc.truncate_seq,
+						ceph_wbc.snapc,
+						ceph_wbc.truncate_seq,
 						ceph_wbc.truncate_size, true);
 			BUG_ON(IS_ERR(req));
 		}
-		BUG_ON(len < ceph_fscrypt_page_offset(pages[locked_pages - 1]) +
-			     thp_size(pages[locked_pages - 1]) - offset);
+		BUG_ON(ceph_wbc.len <
+			ceph_fscrypt_page_offset(ceph_wbc.pages[ceph_wbc.locked_pages - 1]) +
+				thp_size(ceph_wbc.pages[ceph_wbc.locked_pages - 1]) -
+					ceph_wbc.offset);
 
 		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
 			rc = -EIO;
@@ -1266,100 +1363,118 @@ static int ceph_writepages_start(struct address_space *mapping,
 		req->r_inode = inode;
 
 		/* Format the osd request message and submit the write */
-		len = 0;
-		data_pages = pages;
-		op_idx = 0;
-		for (i = 0; i < locked_pages; i++) {
-			struct page *page = ceph_fscrypt_pagecache_page(pages[i]);
+		ceph_wbc.len = 0;
+		ceph_wbc.data_pages = ceph_wbc.pages;
+		ceph_wbc.op_idx = 0;
+		for (i = 0; i < ceph_wbc.locked_pages; i++) {
+			struct page *page =
+				ceph_fscrypt_pagecache_page(ceph_wbc.pages[i]);
 
 			u64 cur_offset = page_offset(page);
 			/*
 			 * Discontinuity in page range? Ceph can handle that by just passing
 			 * multiple extents in the write op.
 			 */
-			if (offset + len != cur_offset) {
+			if (ceph_wbc.offset + ceph_wbc.len != cur_offset) {
 				/* If it's full, stop here */
-				if (op_idx + 1 == req->r_num_ops)
+				if (ceph_wbc.op_idx + 1 == req->r_num_ops)
 					break;
 
 				/* Kick off an fscache write with what we have so far. */
-				ceph_fscache_write_to_cache(inode, offset, len, caching);
+				ceph_fscache_write_to_cache(inode, ceph_wbc.offset,
+							    ceph_wbc.len, caching);
 
 				/* Start a new extent */
-				osd_req_op_extent_dup_last(req, op_idx,
-							   cur_offset - offset);
-				doutc(cl, "got pages at %llu~%llu\n", offset,
-				      len);
-				osd_req_op_extent_osd_data_pages(req, op_idx,
-							data_pages, len, 0,
-							from_pool, false);
-				osd_req_op_extent_update(req, op_idx, len);
-
-				len = 0;
-				offset = cur_offset;
-				data_pages = pages + i;
-				op_idx++;
+				osd_req_op_extent_dup_last(req, ceph_wbc.op_idx,
+							   cur_offset -
+								ceph_wbc.offset);
+				doutc(cl, "got pages at %llu~%llu\n",
+					ceph_wbc.offset,
+					ceph_wbc.len);
+				osd_req_op_extent_osd_data_pages(req,
+							ceph_wbc.op_idx,
+							ceph_wbc.data_pages,
+							ceph_wbc.len, 0,
+							ceph_wbc.from_pool, false);
+				osd_req_op_extent_update(req, ceph_wbc.op_idx,
+							 ceph_wbc.len);
+
+				ceph_wbc.len = 0;
+				ceph_wbc.offset = cur_offset;
+				ceph_wbc.data_pages = ceph_wbc.pages + i;
+				ceph_wbc.op_idx++;
 			}
 
 			set_page_writeback(page);
 			if (caching)
 				ceph_set_page_fscache(page);
-			len += thp_size(page);
+			ceph_wbc.len += thp_size(page);
 		}
-		ceph_fscache_write_to_cache(inode, offset, len, caching);
+		ceph_fscache_write_to_cache(inode, ceph_wbc.offset,
+					    ceph_wbc.len, caching);
 
 		if (ceph_wbc.size_stable) {
-			len = min(len, ceph_wbc.i_size - offset);
-		} else if (i == locked_pages) {
+			ceph_wbc.len = min(ceph_wbc.len,
+					    ceph_wbc.i_size - ceph_wbc.offset);
+		} else if (i == ceph_wbc.locked_pages) {
 			/* writepages_finish() clears writeback pages
 			 * according to the data length, so make sure
 			 * data length covers all locked pages */
-			u64 min_len = len + 1 - thp_size(page);
-			len = get_writepages_data_length(inode, pages[i - 1],
-							 offset);
-			len = max(len, min_len);
+			u64 min_len = ceph_wbc.len + 1 - thp_size(page);
+			ceph_wbc.len =
+				get_writepages_data_length(inode,
+							ceph_wbc.pages[i - 1],
+							ceph_wbc.offset);
+			ceph_wbc.len = max(ceph_wbc.len, min_len);
+		}
+		if (IS_ENCRYPTED(inode)) {
+			ceph_wbc.len = round_up(ceph_wbc.len,
+						CEPH_FSCRYPT_BLOCK_SIZE);
 		}
-		if (IS_ENCRYPTED(inode))
-			len = round_up(len, CEPH_FSCRYPT_BLOCK_SIZE);
 
-		doutc(cl, "got pages at %llu~%llu\n", offset, len);
+		doutc(cl, "got pages at %llu~%llu\n",
+			ceph_wbc.offset, ceph_wbc.len);
 
 		if (IS_ENCRYPTED(inode) &&
-		    ((offset | len) & ~CEPH_FSCRYPT_BLOCK_MASK))
+		    ((ceph_wbc.offset | ceph_wbc.len) & ~CEPH_FSCRYPT_BLOCK_MASK))
 			pr_warn_client(cl,
 				"bad encrypted write offset=%lld len=%llu\n",
-				offset, len);
+				ceph_wbc.offset, ceph_wbc.len);
 
-		osd_req_op_extent_osd_data_pages(req, op_idx, data_pages, len,
-						 0, from_pool, false);
-		osd_req_op_extent_update(req, op_idx, len);
+		osd_req_op_extent_osd_data_pages(req, ceph_wbc.op_idx,
+						 ceph_wbc.data_pages,
+						 ceph_wbc.len,
+						 0, ceph_wbc.from_pool, false);
+		osd_req_op_extent_update(req, ceph_wbc.op_idx, ceph_wbc.len);
 
-		BUG_ON(op_idx + 1 != req->r_num_ops);
+		BUG_ON(ceph_wbc.op_idx + 1 != req->r_num_ops);
 
-		from_pool = false;
-		if (i < locked_pages) {
-			BUG_ON(num_ops <= req->r_num_ops);
-			num_ops -= req->r_num_ops;
-			locked_pages -= i;
+		ceph_wbc.from_pool = false;
+		if (i < ceph_wbc.locked_pages) {
+			BUG_ON(ceph_wbc.num_ops <= req->r_num_ops);
+			ceph_wbc.num_ops -= req->r_num_ops;
+			ceph_wbc.locked_pages -= i;
 
 			/* allocate new pages array for next request */
-			data_pages = pages;
-			pages = kmalloc_array(locked_pages, sizeof(*pages),
-					      GFP_NOFS);
-			if (!pages) {
-				from_pool = true;
-				pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
-				BUG_ON(!pages);
+			ceph_wbc.data_pages = ceph_wbc.pages;
+			ceph_wbc.pages = kmalloc_array(ceph_wbc.locked_pages,
+							sizeof(*ceph_wbc.pages),
+							GFP_NOFS);
+			if (!ceph_wbc.pages) {
+				ceph_wbc.from_pool = true;
+				ceph_wbc.pages =
+					mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
+				BUG_ON(!ceph_wbc.pages);
 			}
-			memcpy(pages, data_pages + i,
-			       locked_pages * sizeof(*pages));
-			memset(data_pages + i, 0,
-			       locked_pages * sizeof(*pages));
+			memcpy(ceph_wbc.pages, ceph_wbc.data_pages + i,
+			       ceph_wbc.locked_pages * sizeof(*ceph_wbc.pages));
+			memset(ceph_wbc.data_pages + i, 0,
+			       ceph_wbc.locked_pages * sizeof(*ceph_wbc.pages));
 		} else {
-			BUG_ON(num_ops != req->r_num_ops);
-			index = pages[i - 1]->index + 1;
+			BUG_ON(ceph_wbc.num_ops != req->r_num_ops);
+			ceph_wbc.index = ceph_wbc.pages[i - 1]->index + 1;
 			/* request message now owns the pages array */
-			pages = NULL;
+			ceph_wbc.pages = NULL;
 		}
 
 		req->r_mtime = inode_get_mtime(inode);
@@ -1367,7 +1482,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 		req = NULL;
 
 		wbc->nr_to_write -= i;
-		if (pages)
+		if (ceph_wbc.pages)
 			goto new_request;
 
 		/*
@@ -1377,54 +1492,56 @@ static int ceph_writepages_start(struct address_space *mapping,
 		 * we tagged for writeback prior to entering this loop.
 		 */
 		if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE)
-			done = true;
+			ceph_wbc.done = true;
 
 release_folios:
 		doutc(cl, "folio_batch release on %d folios (%p)\n",
-		      (int)fbatch.nr, fbatch.nr ? fbatch.folios[0] : NULL);
-		folio_batch_release(&fbatch);
+		      (int)ceph_wbc.fbatch.nr,
+		      ceph_wbc.fbatch.nr ? ceph_wbc.fbatch.folios[0] : NULL);
+		folio_batch_release(&ceph_wbc.fbatch);
 	}
 
-	if (should_loop && !done) {
+	if (ceph_wbc.should_loop && !ceph_wbc.done) {
 		/* more to do; loop back to beginning of file */
 		doutc(cl, "looping back to beginning of file\n");
-		end = start_index - 1; /* OK even when start_index == 0 */
+		ceph_wbc.end = ceph_wbc.start_index - 1; /* OK even when start_index == 0 */
 
 		/* to write dirty pages associated with next snapc,
 		 * we need to wait until current writes complete */
 		if (wbc->sync_mode != WB_SYNC_NONE &&
-		    start_index == 0 && /* all dirty pages were checked */
+		    ceph_wbc.start_index == 0 && /* all dirty pages were checked */
 		    !ceph_wbc.head_snapc) {
 			struct page *page;
 			unsigned i, nr;
-			index = 0;
-			while ((index <= end) &&
-			       (nr = filemap_get_folios_tag(mapping, &index,
+			ceph_wbc.index = 0;
+			while ((ceph_wbc.index <= ceph_wbc.end) &&
+			       (nr = filemap_get_folios_tag(mapping,
+						&ceph_wbc.index,
 						(pgoff_t)-1,
 						PAGECACHE_TAG_WRITEBACK,
-						&fbatch))) {
+						&ceph_wbc.fbatch))) {
 				for (i = 0; i < nr; i++) {
-					page = &fbatch.folios[i]->page;
-					if (page_snap_context(page) != snapc)
+					page = &ceph_wbc.fbatch.folios[i]->page;
+					if (page_snap_context(page) != ceph_wbc.snapc)
 						continue;
 					wait_on_page_writeback(page);
 				}
-				folio_batch_release(&fbatch);
+				folio_batch_release(&ceph_wbc.fbatch);
 				cond_resched();
 			}
 		}
 
-		start_index = 0;
-		index = 0;
+		ceph_wbc.start_index = 0;
+		ceph_wbc.index = 0;
 		goto retry;
 	}
 
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = index;
+	if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0))
+		mapping->writeback_index = ceph_wbc.index;
 
 out:
 	ceph_osdc_put_request(req);
-	ceph_put_snap_context(last_snapc);
+	ceph_put_snap_context(ceph_wbc.last_snapc);
 	doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
 	      rc);
 	return rc;
-- 
2.48.0



* [RFC PATCH 2/4] ceph: introduce ceph_process_folio_batch() method
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
  2025-02-05  0:02 ` [RFC PATCH 1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring Viacheslav Dubeyko
@ 2025-02-05  0:02 ` Viacheslav Dubeyko
  2025-02-05  0:02 ` [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method Viacheslav Dubeyko
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-05  0:02 UTC (permalink / raw)
  To: ceph-devel
  Cc: idryomov, dhowells, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko, slava

From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The first step of the ceph_writepages_start() logic is
finding the dirty memory folios and processing them.
This patch introduces the ceph_process_folio_batch()
method that moves this logic into a dedicated method.

The ceph_writepages_start() has this logic:

if (ceph_wbc.locked_pages == 0)
    lock_page(page);  /* first page */
else if (!trylock_page(page))
    break;

<skipped>

if (folio_test_writeback(folio) ||
    folio_test_private_2(folio) /* [DEPRECATED] */) {
      if (wbc->sync_mode == WB_SYNC_NONE) {
          doutc(cl, "%p under writeback\n", folio);
          folio_unlock(folio);
          continue;
      }
      doutc(cl, "waiting on writeback %p\n", folio);
      folio_wait_writeback(folio);
      folio_wait_private_2(folio); /* [DEPRECATED] */
}

The problem here is that the folio/page is locked first
and only marked as under writeback by
set_page_writeback(page) later, before submitting the
write request. The folio/page is unlocked by
writepages_finish() after the write request has finished.
It means that the logic of checking
folio_test_writeback() and calling folio_wait_writeback()
never works, because the page is locked and cannot be
locked again until the write request completes.
However, for the majority of folios/pages, trylock_page()
is used. As a result, multiple threads can try to lock
the same folios/pages multiple times, even if they are
under writeback already. It makes this logic more compute
intensive than necessary.

This patch changes this logic:

if (folio_test_writeback(folio) ||
    folio_test_private_2(folio) /* [DEPRECATED] */) {
      if (wbc->sync_mode == WB_SYNC_NONE) {
          doutc(cl, "%p under writeback\n", folio);
          folio_unlock(folio);
          continue;
      }
      doutc(cl, "waiting on writeback %p\n", folio);
      folio_wait_writeback(folio);
      folio_wait_private_2(folio); /* [DEPRECATED] */
}

if (ceph_wbc.locked_pages == 0)
    lock_page(page);  /* first page */
else if (!trylock_page(page))
    break;

This order guarantees that the writeback state of
folios/pages is no longer ignored.

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
---
 fs/ceph/addr.c | 568 +++++++++++++++++++++++++++++++------------------
 1 file changed, 365 insertions(+), 203 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d002ff62d867..739329846a07 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -978,6 +978,27 @@ static void writepages_finish(struct ceph_osd_request *req)
 	ceph_dec_osd_stopping_blocker(fsc->mdsc);
 }
 
+static inline
+bool is_forced_umount(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+
+	if (ceph_inode_is_shutdown(inode)) {
+		if (ci->i_wrbuffer_ref > 0) {
+			pr_warn_ratelimited_client(cl,
+				"%llx.%llx %lld forced umount\n",
+				ceph_vinop(inode), ceph_ino(inode));
+		}
+		mapping_set_error(mapping, -EIO);
+		return true;
+	}
+
+	return false;
+}
+
 static inline
 unsigned int ceph_define_write_size(struct address_space *mapping)
 {
@@ -1046,6 +1067,334 @@ void ceph_init_writeback_ctl(struct address_space *mapping,
 	ceph_wbc->data_pages = NULL;
 }
 
+static inline
+int ceph_define_writeback_range(struct address_space *mapping,
+				struct writeback_control *wbc,
+				struct ceph_writeback_ctl *ceph_wbc)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+
+	/* find oldest snap context with dirty data */
+	ceph_wbc->snapc = get_oldest_context(inode, ceph_wbc, NULL);
+	if (!ceph_wbc->snapc) {
+		/* hmm, why does writepages get called when there
+		   is no dirty data? */
+		doutc(cl, " no snap context with dirty data?\n");
+		return -ENODATA;
+	}
+
+	doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
+	      ceph_wbc->snapc, ceph_wbc->snapc->seq,
+	      ceph_wbc->snapc->num_snaps);
+
+	ceph_wbc->should_loop = false;
+
+	if (ceph_wbc->head_snapc && ceph_wbc->snapc != ceph_wbc->last_snapc) {
+		/* where to start/end? */
+		if (wbc->range_cyclic) {
+			ceph_wbc->index = ceph_wbc->start_index;
+			ceph_wbc->end = -1;
+			if (ceph_wbc->index > 0)
+				ceph_wbc->should_loop = true;
+			doutc(cl, " cyclic, start at %lu\n", ceph_wbc->index);
+		} else {
+			ceph_wbc->index = wbc->range_start >> PAGE_SHIFT;
+			ceph_wbc->end = wbc->range_end >> PAGE_SHIFT;
+			if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+				ceph_wbc->range_whole = true;
+			doutc(cl, " not cyclic, %lu to %lu\n",
+				ceph_wbc->index, ceph_wbc->end);
+		}
+	} else if (!ceph_wbc->head_snapc) {
+	/* Do not respect wbc->range_{start,end}. Dirty pages
+	 * in that range can be associated with newer snapc.
+	 * They are not writeable until all dirty pages
+	 * associated with 'snapc' get written. */
+		if (ceph_wbc->index > 0)
+			ceph_wbc->should_loop = true;
+		doutc(cl, " non-head snapc, range whole\n");
+	}
+
+	ceph_put_snap_context(ceph_wbc->last_snapc);
+	ceph_wbc->last_snapc = ceph_wbc->snapc;
+
+	return 0;
+}
+
+static inline
+bool has_writeback_done(struct ceph_writeback_ctl *ceph_wbc)
+{
+	return ceph_wbc->done && ceph_wbc->index > ceph_wbc->end;
+}
+
+static inline
+bool can_next_page_be_processed(struct ceph_writeback_ctl *ceph_wbc,
+				unsigned index)
+{
+	return index < ceph_wbc->nr_folios &&
+		ceph_wbc->locked_pages < ceph_wbc->max_pages;
+}
+
+static
+int ceph_check_page_before_write(struct address_space *mapping,
+				 struct writeback_control *wbc,
+				 struct ceph_writeback_ctl *ceph_wbc,
+				 struct folio *folio)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+	struct ceph_snap_context *pgsnapc;
+	struct page *page = &folio->page;
+
+	/* only dirty pages, or our accounting breaks */
+	if (unlikely(!PageDirty(page)) || unlikely(page->mapping != mapping)) {
+		doutc(cl, "!dirty or !mapping %p\n", page);
+		return -ENODATA;
+	}
+
+	/* only if matching snap context */
+	pgsnapc = page_snap_context(page);
+	if (pgsnapc != ceph_wbc->snapc) {
+		doutc(cl, "page snapc %p %lld != oldest %p %lld\n",
+		      pgsnapc, pgsnapc->seq,
+		      ceph_wbc->snapc, ceph_wbc->snapc->seq);
+
+		if (!ceph_wbc->should_loop && !ceph_wbc->head_snapc &&
+		    wbc->sync_mode != WB_SYNC_NONE)
+			ceph_wbc->should_loop = true;
+
+		return -ENODATA;
+	}
+
+	if (page_offset(page) >= ceph_wbc->i_size) {
+		doutc(cl, "folio at %lu beyond eof %llu\n",
+		      folio->index, ceph_wbc->i_size);
+
+		if ((ceph_wbc->size_stable ||
+		    folio_pos(folio) >= i_size_read(inode)) &&
+		    folio_clear_dirty_for_io(folio))
+			folio_invalidate(folio, 0, folio_size(folio));
+
+		return -ENODATA;
+	}
+
+	if (ceph_wbc->strip_unit_end &&
+	    (page->index > ceph_wbc->strip_unit_end)) {
+		doutc(cl, "end of strip unit %p\n", page);
+		return -E2BIG;
+	}
+
+	return 0;
+}
+
+static inline
+void __ceph_allocate_page_array(struct ceph_writeback_ctl *ceph_wbc,
+				unsigned int max_pages)
+{
+	ceph_wbc->pages = kmalloc_array(max_pages,
+					sizeof(*ceph_wbc->pages),
+					GFP_NOFS);
+	if (!ceph_wbc->pages) {
+		ceph_wbc->from_pool = true;
+		ceph_wbc->pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
+		BUG_ON(!ceph_wbc->pages);
+	}
+}
+
+static inline
+void ceph_allocate_page_array(struct address_space *mapping,
+			      struct ceph_writeback_ctl *ceph_wbc,
+			      struct page *page)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	u64 objnum;
+	u64 objoff;
+	u32 xlen;
+
+	/* prepare async write request */
+	ceph_wbc->offset = (u64)page_offset(page);
+	ceph_calc_file_object_mapping(&ci->i_layout,
+					ceph_wbc->offset, ceph_wbc->wsize,
+					&objnum, &objoff, &xlen);
+
+	ceph_wbc->num_ops = 1;
+	ceph_wbc->strip_unit_end = page->index + ((xlen - 1) >> PAGE_SHIFT);
+
+	BUG_ON(ceph_wbc->pages);
+	ceph_wbc->max_pages = calc_pages_for(0, (u64)xlen);
+	__ceph_allocate_page_array(ceph_wbc, ceph_wbc->max_pages);
+
+	ceph_wbc->len = 0;
+}
+
+static inline
+bool is_page_index_contiguous(struct ceph_writeback_ctl *ceph_wbc,
+			      struct page *page)
+{
+	return page->index == (ceph_wbc->offset + ceph_wbc->len) >> PAGE_SHIFT;
+}
+
+static inline
+bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc)
+{
+	return ceph_wbc->num_ops >=
+		(ceph_wbc->from_pool ?  CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS);
+}
+
+static inline
+bool is_write_congestion_happened(struct ceph_fs_client *fsc)
+{
+	return atomic_long_inc_return(&fsc->writeback_count) >
+		CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb);
+}
+
+static inline
+int ceph_move_dirty_page_in_page_array(struct address_space *mapping,
+					struct writeback_control *wbc,
+					struct ceph_writeback_ctl *ceph_wbc,
+					struct page *page)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+	struct page **pages = ceph_wbc->pages;
+	unsigned int index = ceph_wbc->locked_pages;
+	gfp_t gfp_flags = ceph_wbc->locked_pages ? GFP_NOWAIT : GFP_NOFS;
+
+	if (IS_ENCRYPTED(inode)) {
+		pages[index] = fscrypt_encrypt_pagecache_blocks(page,
+								PAGE_SIZE,
+								0,
+								gfp_flags);
+		if (IS_ERR(pages[index])) {
+			int err = PTR_ERR(pages[index]);
+
+			if (err == -EINVAL)
+				pr_err_client(cl, "inode->i_blkbits=%hhu\n",
+						inode->i_blkbits);
+
+			/* better not fail on first page! */
+			BUG_ON(ceph_wbc->locked_pages == 0);
+			pages[index] = NULL;
+			return err;
+		}
+	} else {
+		pages[index] = page;
+	}
+
+	ceph_wbc->locked_pages++;
+
+	return 0;
+}
+
+static
+int ceph_process_folio_batch(struct address_space *mapping,
+			     struct writeback_control *wbc,
+			     struct ceph_writeback_ctl *ceph_wbc)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+	struct folio *folio = NULL;
+	struct page *page = NULL;
+	unsigned i;
+	int rc = 0;
+
+	for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
+		folio = ceph_wbc->fbatch.folios[i];
+
+		if (!folio)
+			continue;
+
+		page = &folio->page;
+
+		doutc(cl, "? %p idx %lu, folio_test_writeback %#x, "
+			"folio_test_dirty %#x, folio_test_locked %#x\n",
+			page, page->index, folio_test_writeback(folio),
+			folio_test_dirty(folio),
+			folio_test_locked(folio));
+
+		if (folio_test_writeback(folio) ||
+		    folio_test_private_2(folio) /* [DEPRECATED] */) {
+			doutc(cl, "waiting on writeback %p\n", folio);
+			folio_wait_writeback(folio);
+			folio_wait_private_2(folio); /* [DEPRECATED] */
+			continue;
+		}
+
+		if (ceph_wbc->locked_pages == 0)
+			lock_page(page);  /* first page */
+		else if (!trylock_page(page))
+			break;
+
+		rc = ceph_check_page_before_write(mapping, wbc,
+						  ceph_wbc, folio);
+		if (rc == -ENODATA) {
+			rc = 0;
+			unlock_page(page);
+			ceph_wbc->fbatch.folios[i] = NULL;
+			continue;
+		} else if (rc == -E2BIG) {
+			rc = 0;
+			unlock_page(page);
+			ceph_wbc->fbatch.folios[i] = NULL;
+			break;
+		}
+
+		if (!clear_page_dirty_for_io(page)) {
+			doutc(cl, "%p !clear_page_dirty_for_io\n", page);
+			unlock_page(page);
+			ceph_wbc->fbatch.folios[i] = NULL;
+			continue;
+		}
+
+		/*
+		 * We have something to write.  If this is
+		 * the first locked page this time through,
+		 * calculate max possible write size and
+		 * allocate a page array
+		 */
+		if (ceph_wbc->locked_pages == 0) {
+			ceph_allocate_page_array(mapping, ceph_wbc, page);
+		} else if (!is_page_index_contiguous(ceph_wbc, page)) {
+			if (is_num_ops_too_big(ceph_wbc)) {
+				redirty_page_for_writepage(wbc, page);
+				unlock_page(page);
+				break;
+			}
+
+			ceph_wbc->num_ops++;
+			ceph_wbc->offset = (u64)page_offset(page);
+			ceph_wbc->len = 0;
+		}
+
+		/* note position of first page in fbatch */
+		doutc(cl, "%llx.%llx will write page %p idx %lu\n",
+		      ceph_vinop(inode), page, page->index);
+
+		fsc->write_congested = is_write_congestion_happened(fsc);
+
+		rc = ceph_move_dirty_page_in_page_array(mapping, wbc,
+							ceph_wbc, page);
+		if (rc) {
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+			break;
+		}
+
+		ceph_wbc->fbatch.folios[i] = NULL;
+		ceph_wbc->len += thp_size(page);
+	}
+
+	ceph_wbc->processed_in_fbatch = i;
+
+	return rc;
+}
+
 /*
  * initiate async writeback
  */
@@ -1057,7 +1406,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
 	struct ceph_client *cl = fsc->client;
 	struct ceph_vino vino = ceph_vino(inode);
-	struct ceph_snap_context *pgsnapc;
 	struct ceph_writeback_ctl ceph_wbc;
 	struct ceph_osd_request *req = NULL;
 	int rc = 0;
@@ -1071,235 +1419,49 @@ static int ceph_writepages_start(struct address_space *mapping,
 	      wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
 	      (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
 
-	if (ceph_inode_is_shutdown(inode)) {
-		if (ci->i_wrbuffer_ref > 0) {
-			pr_warn_ratelimited_client(cl,
-				"%llx.%llx %lld forced umount\n",
-				ceph_vinop(inode), ceph_ino(inode));
-		}
-		mapping_set_error(mapping, -EIO);
-		return -EIO; /* we're in a forced umount, don't write! */
+	if (is_forced_umount(mapping)) {
+		/* we're in a forced umount, don't write! */
+		return -EIO;
 	}
 
 	ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
 
 retry:
-	/* find oldest snap context with dirty data */
-	ceph_wbc.snapc = get_oldest_context(inode, &ceph_wbc, NULL);
-	if (!ceph_wbc.snapc) {
+	rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
+	if (rc == -ENODATA) {
 		/* hmm, why does writepages get called when there
 		   is no dirty data? */
-		doutc(cl, " no snap context with dirty data?\n");
+		rc = 0;
 		goto out;
 	}
-	doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
-	      ceph_wbc.snapc, ceph_wbc.snapc->seq,
-	      ceph_wbc.snapc->num_snaps);
-
-	ceph_wbc.should_loop = false;
-	if (ceph_wbc.head_snapc && ceph_wbc.snapc != ceph_wbc.last_snapc) {
-		/* where to start/end? */
-		if (wbc->range_cyclic) {
-			ceph_wbc.index = ceph_wbc.start_index;
-			ceph_wbc.end = -1;
-			if (ceph_wbc.index > 0)
-				ceph_wbc.should_loop = true;
-			doutc(cl, " cyclic, start at %lu\n", ceph_wbc.index);
-		} else {
-			ceph_wbc.index = wbc->range_start >> PAGE_SHIFT;
-			ceph_wbc.end = wbc->range_end >> PAGE_SHIFT;
-			if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-				ceph_wbc.range_whole = true;
-			doutc(cl, " not cyclic, %lu to %lu\n",
-			      ceph_wbc.index, ceph_wbc.end);
-		}
-	} else if (!ceph_wbc.head_snapc) {
-		/* Do not respect wbc->range_{start,end}. Dirty pages
-		 * in that range can be associated with newer snapc.
-		 * They are not writeable until we write all dirty pages
-		 * associated with 'snapc' get written */
-		if (ceph_wbc.index > 0)
-			ceph_wbc.should_loop = true;
-		doutc(cl, " non-head snapc, range whole\n");
-	}
 
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
 
-	ceph_put_snap_context(ceph_wbc.last_snapc);
-	ceph_wbc.last_snapc = ceph_wbc.snapc;
-
-	while (!ceph_wbc.done && ceph_wbc.index <= ceph_wbc.end) {
+	while (!has_writeback_done(&ceph_wbc)) {
 		unsigned i;
 		struct page *page;
 
+		ceph_wbc.locked_pages = 0;
 		ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
 
 get_more_pages:
+		ceph_folio_batch_reinit(&ceph_wbc);
+
 		ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
 							    &ceph_wbc.index,
 							    ceph_wbc.end,
 							    ceph_wbc.tag,
 							    &ceph_wbc.fbatch);
-		doutc(cl, "pagevec_lookup_range_tag got %d\n",
-			ceph_wbc.nr_folios);
+		doutc(cl, "pagevec_lookup_range_tag for tag %#x got %d\n",
+			ceph_wbc.tag, ceph_wbc.nr_folios);
+
 		if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
 			break;
-		for (i = 0; i < ceph_wbc.nr_folios &&
-			    ceph_wbc.locked_pages < ceph_wbc.max_pages; i++) {
-			struct folio *folio = ceph_wbc.fbatch.folios[i];
-
-			page = &folio->page;
-			doutc(cl, "? %p idx %lu\n", page, page->index);
-			if (ceph_wbc.locked_pages == 0)
-				lock_page(page);  /* first page */
-			else if (!trylock_page(page))
-				break;
-
-			/* only dirty pages, or our accounting breaks */
-			if (unlikely(!PageDirty(page)) ||
-			    unlikely(page->mapping != mapping)) {
-				doutc(cl, "!dirty or !mapping %p\n", page);
-				unlock_page(page);
-				continue;
-			}
-			/* only if matching snap context */
-			pgsnapc = page_snap_context(page);
-			if (pgsnapc != ceph_wbc.snapc) {
-				doutc(cl, "page snapc %p %lld != oldest %p %lld\n",
-				      pgsnapc, pgsnapc->seq,
-				      ceph_wbc.snapc, ceph_wbc.snapc->seq);
-				if (!ceph_wbc.should_loop &&
-				    !ceph_wbc.head_snapc &&
-				    wbc->sync_mode != WB_SYNC_NONE)
-					ceph_wbc.should_loop = true;
-				unlock_page(page);
-				continue;
-			}
-			if (page_offset(page) >= ceph_wbc.i_size) {
-				doutc(cl, "folio at %lu beyond eof %llu\n",
-				      folio->index, ceph_wbc.i_size);
-				if ((ceph_wbc.size_stable ||
-				    folio_pos(folio) >= i_size_read(inode)) &&
-				    folio_clear_dirty_for_io(folio))
-					folio_invalidate(folio, 0,
-							folio_size(folio));
-				folio_unlock(folio);
-				continue;
-			}
-			if (ceph_wbc.strip_unit_end &&
-			    (page->index > ceph_wbc.strip_unit_end)) {
-				doutc(cl, "end of strip unit %p\n", page);
-				unlock_page(page);
-				break;
-			}
-			if (folio_test_writeback(folio) ||
-			    folio_test_private_2(folio) /* [DEPRECATED] */) {
-				if (wbc->sync_mode == WB_SYNC_NONE) {
-					doutc(cl, "%p under writeback\n", folio);
-					folio_unlock(folio);
-					continue;
-				}
-				doutc(cl, "waiting on writeback %p\n", folio);
-				folio_wait_writeback(folio);
-				folio_wait_private_2(folio); /* [DEPRECATED] */
-			}
-
-			if (!clear_page_dirty_for_io(page)) {
-				doutc(cl, "%p !clear_page_dirty_for_io\n", page);
-				unlock_page(page);
-				continue;
-			}
-
-			/*
-			 * We have something to write.  If this is
-			 * the first locked page this time through,
-			 * calculate max possinle write size and
-			 * allocate a page array
-			 */
-			if (ceph_wbc.locked_pages == 0) {
-				u64 objnum;
-				u64 objoff;
-				u32 xlen;
-
-				/* prepare async write request */
-				ceph_wbc.offset = (u64)page_offset(page);
-				ceph_calc_file_object_mapping(&ci->i_layout,
-							      ceph_wbc.offset,
-							      ceph_wbc.wsize,
-							      &objnum, &objoff,
-							      &xlen);
-				ceph_wbc.len = xlen;
-
-				ceph_wbc.num_ops = 1;
-				ceph_wbc.strip_unit_end = page->index +
-					((ceph_wbc.len - 1) >> PAGE_SHIFT);
-
-				BUG_ON(ceph_wbc.pages);
-				ceph_wbc.max_pages =
-					calc_pages_for(0, (u64)ceph_wbc.len);
-				ceph_wbc.pages = kmalloc_array(ceph_wbc.max_pages,
-						      sizeof(*ceph_wbc.pages),
-						      GFP_NOFS);
-				if (!ceph_wbc.pages) {
-					ceph_wbc.from_pool = true;
-					ceph_wbc.pages =
-						mempool_alloc(ceph_wb_pagevec_pool,
-								GFP_NOFS);
-					BUG_ON(!ceph_wbc.pages);
-				}
 
-				ceph_wbc.len = 0;
-			} else if (page->index !=
-				   (ceph_wbc.offset + ceph_wbc.len) >> PAGE_SHIFT) {
-				if (ceph_wbc.num_ops >=
-				    (ceph_wbc.from_pool ?  CEPH_OSD_SLAB_OPS :
-							     CEPH_OSD_MAX_OPS)) {
-					redirty_page_for_writepage(wbc, page);
-					unlock_page(page);
-					break;
-				}
-
-				ceph_wbc.num_ops++;
-				ceph_wbc.offset = (u64)page_offset(page);
-				ceph_wbc.len = 0;
-			}
-
-			/* note position of first page in fbatch */
-			doutc(cl, "%llx.%llx will write page %p idx %lu\n",
-			      ceph_vinop(inode), page, page->index);
-
-			if (atomic_long_inc_return(&fsc->writeback_count) >
-			    CONGESTION_ON_THRESH(
-				    fsc->mount_options->congestion_kb))
-				fsc->write_congested = true;
-
-			if (IS_ENCRYPTED(inode)) {
-				ceph_wbc.pages[ceph_wbc.locked_pages] =
-					fscrypt_encrypt_pagecache_blocks(page,
-						PAGE_SIZE, 0,
-						ceph_wbc.locked_pages ?
-							GFP_NOWAIT : GFP_NOFS);
-				if (IS_ERR(ceph_wbc.pages[ceph_wbc.locked_pages])) {
-					if (PTR_ERR(ceph_wbc.pages[ceph_wbc.locked_pages]) == -EINVAL)
-						pr_err_client(cl,
-							"inode->i_blkbits=%hhu\n",
-							inode->i_blkbits);
-					/* better not fail on first page! */
-					BUG_ON(ceph_wbc.locked_pages == 0);
-					ceph_wbc.pages[ceph_wbc.locked_pages] = NULL;
-					redirty_page_for_writepage(wbc, page);
-					unlock_page(page);
-					break;
-				}
-				++ceph_wbc.locked_pages;
-			} else {
-				ceph_wbc.pages[ceph_wbc.locked_pages++] = page;
-			}
-
-			ceph_wbc.fbatch.folios[i] = NULL;
-			ceph_wbc.len += thp_size(page);
-		}
+		rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
+		if (rc)
+			goto release_folios;
 
 		/* did we get anything? */
 		if (!ceph_wbc.locked_pages)
-- 
2.48.0



* [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
  2025-02-05  0:02 ` [RFC PATCH 1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring Viacheslav Dubeyko
  2025-02-05  0:02 ` [RFC PATCH 2/4] ceph: introduce ceph_process_folio_batch() method Viacheslav Dubeyko
@ 2025-02-05  0:02 ` Viacheslav Dubeyko
  2025-02-14 21:11   ` Matthew Wilcox
  2025-02-05  0:02 ` [RFC PATCH 4/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-05  0:02 UTC (permalink / raw)
  To: ceph-devel
  Cc: idryomov, dhowells, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko, slava

From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The final responsibility of ceph_writepages_start() is
to submit write requests for the processed dirty
folios/pages. The new ceph_submit_write() method gathers
all of this logic in one place.

The generic/421 fails to finish because of the issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are two problems here:

if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
     rc = -EIO;
     goto release_folios;
}

(1) ceph_kill_sb() doesn't wait for the flushing of all
dirty folios/pages to finish because of the racy nature
of mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) ceph_inc_osd_stopping_blocker(fsc->mdsc) then fails
to increment mdsc->stopping_blockers. As a result, the
already locked folios/pages are never unlocked, and the
logic tries to lock the same page a second time.

This patch moves the submission logic into the new
ceph_submit_write() method and also solves the second
issue.

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
---
 fs/ceph/addr.c | 461 +++++++++++++++++++++++++++----------------------
 1 file changed, 257 insertions(+), 204 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 739329846a07..02d20c000dc5 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1395,6 +1395,245 @@ int ceph_process_folio_batch(struct address_space *mapping,
 	return rc;
 }
 
+static inline
+void ceph_shift_unused_folios_left(struct folio_batch *fbatch)
+{
+	unsigned j, n = 0;
+
+	/* shift unused pages to beginning of fbatch */
+	for (j = 0; j < folio_batch_count(fbatch); j++) {
+		if (!fbatch->folios[j])
+			continue;
+
+		if (n < j) {
+			fbatch->folios[n] = fbatch->folios[j];
+		}
+
+		n++;
+	}
+
+	fbatch->nr = n;
+}
+
+static
+int ceph_submit_write(struct address_space *mapping,
+			struct writeback_control *wbc,
+			struct ceph_writeback_ctl *ceph_wbc)
+{
+	struct inode *inode = mapping->host;
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_client *cl = fsc->client;
+	struct ceph_vino vino = ceph_vino(inode);
+	struct ceph_osd_request *req = NULL;
+	struct page *page = NULL;
+	bool caching = ceph_is_cache_enabled(inode);
+	u64 offset;
+	u64 len;
+	unsigned i;
+
+new_request:
+	offset = ceph_fscrypt_page_offset(ceph_wbc->pages[0]);
+	len = ceph_wbc->wsize;
+
+	req = ceph_osdc_new_request(&fsc->client->osdc,
+				    &ci->i_layout, vino,
+				    offset, &len, 0, ceph_wbc->num_ops,
+				    CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
+				    ceph_wbc->snapc, ceph_wbc->truncate_seq,
+				    ceph_wbc->truncate_size, false);
+	if (IS_ERR(req)) {
+		req = ceph_osdc_new_request(&fsc->client->osdc,
+					    &ci->i_layout, vino,
+					    offset, &len, 0,
+					    min(ceph_wbc->num_ops,
+						CEPH_OSD_SLAB_OPS),
+					    CEPH_OSD_OP_WRITE,
+					    CEPH_OSD_FLAG_WRITE,
+					    ceph_wbc->snapc,
+					    ceph_wbc->truncate_seq,
+					    ceph_wbc->truncate_size,
+					    true);
+		BUG_ON(IS_ERR(req));
+	}
+
+	page = ceph_wbc->pages[ceph_wbc->locked_pages - 1];
+	BUG_ON(len < ceph_fscrypt_page_offset(page) + thp_size(page) - offset);
+
+	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
+		for (i = 0; i < folio_batch_count(&ceph_wbc->fbatch); i++) {
+			struct folio *folio = ceph_wbc->fbatch.folios[i];
+
+			if (!folio)
+				continue;
+
+			page = &folio->page;
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+		}
+
+		for (i = 0; i < ceph_wbc->locked_pages; i++) {
+			page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
+
+			if (!page)
+				continue;
+
+			redirty_page_for_writepage(wbc, page);
+			unlock_page(page);
+		}
+
+		ceph_osdc_put_request(req);
+		return -EIO;
+	}
+
+	req->r_callback = writepages_finish;
+	req->r_inode = inode;
+
+	/* Format the osd request message and submit the write */
+	len = 0;
+	ceph_wbc->data_pages = ceph_wbc->pages;
+	ceph_wbc->op_idx = 0;
+	for (i = 0; i < ceph_wbc->locked_pages; i++) {
+		u64 cur_offset;
+
+		page = ceph_fscrypt_pagecache_page(ceph_wbc->pages[i]);
+		cur_offset = page_offset(page);
+
+		/*
+		 * Discontinuity in page range? Ceph can handle that by just passing
+		 * multiple extents in the write op.
+		 */
+		if (offset + len != cur_offset) {
+			/* If it's full, stop here */
+			if (ceph_wbc->op_idx + 1 == req->r_num_ops)
+				break;
+
+			/* Kick off an fscache write with what we have so far. */
+			ceph_fscache_write_to_cache(inode, offset, len, caching);
+
+			/* Start a new extent */
+			osd_req_op_extent_dup_last(req, ceph_wbc->op_idx,
+						   cur_offset - offset);
+
+			doutc(cl, "got pages at %llu~%llu\n", offset, len);
+
+			osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
+							 ceph_wbc->data_pages,
+							 len, 0,
+							 ceph_wbc->from_pool,
+							 false);
+			osd_req_op_extent_update(req, ceph_wbc->op_idx, len);
+
+			len = 0;
+			offset = cur_offset;
+			ceph_wbc->data_pages = ceph_wbc->pages + i;
+			ceph_wbc->op_idx++;
+		}
+
+		set_page_writeback(page);
+
+		if (caching)
+			ceph_set_page_fscache(page);
+
+		len += thp_size(page);
+	}
+
+	ceph_fscache_write_to_cache(inode, offset, len, caching);
+
+	if (ceph_wbc->size_stable) {
+		len = min(len, ceph_wbc->i_size - offset);
+	} else if (i == ceph_wbc->locked_pages) {
+		/* writepages_finish() clears writeback pages
+		 * according to the data length, so make sure
+		 * data length covers all locked pages */
+		u64 min_len = len + 1 - thp_size(page);
+		len = get_writepages_data_length(inode,
+						 ceph_wbc->pages[i - 1],
+						 offset);
+		len = max(len, min_len);
+	}
+
+	if (IS_ENCRYPTED(inode))
+		len = round_up(len, CEPH_FSCRYPT_BLOCK_SIZE);
+
+	doutc(cl, "got pages at %llu~%llu\n", offset, len);
+
+	if (IS_ENCRYPTED(inode) &&
+	    ((offset | len) & ~CEPH_FSCRYPT_BLOCK_MASK)) {
+		pr_warn_client(cl,
+			"bad encrypted write offset=%lld len=%llu\n",
+			offset, len);
+	}
+
+	osd_req_op_extent_osd_data_pages(req, ceph_wbc->op_idx,
+					 ceph_wbc->data_pages, len,
+					 0, ceph_wbc->from_pool, false);
+	osd_req_op_extent_update(req, ceph_wbc->op_idx, len);
+
+	BUG_ON(ceph_wbc->op_idx + 1 != req->r_num_ops);
+
+	ceph_wbc->from_pool = false;
+	if (i < ceph_wbc->locked_pages) {
+		BUG_ON(ceph_wbc->num_ops <= req->r_num_ops);
+		ceph_wbc->num_ops -= req->r_num_ops;
+		ceph_wbc->locked_pages -= i;
+
+		/* allocate new pages array for next request */
+		ceph_wbc->data_pages = ceph_wbc->pages;
+		__ceph_allocate_page_array(ceph_wbc, ceph_wbc->locked_pages);
+		memcpy(ceph_wbc->pages, ceph_wbc->data_pages + i,
+			ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
+		memset(ceph_wbc->data_pages + i, 0,
+			ceph_wbc->locked_pages * sizeof(*ceph_wbc->pages));
+	} else {
+		BUG_ON(ceph_wbc->num_ops != req->r_num_ops);
+		/* request message now owns the pages array */
+		ceph_wbc->pages = NULL;
+	}
+
+	req->r_mtime = inode_get_mtime(inode);
+	ceph_osdc_start_request(&fsc->client->osdc, req);
+	req = NULL;
+
+	wbc->nr_to_write -= i;
+	if (ceph_wbc->pages)
+		goto new_request;
+
+	return 0;
+}
+
+static
+void ceph_wait_until_current_writes_complete(struct address_space *mapping,
+					     struct writeback_control *wbc,
+					     struct ceph_writeback_ctl *ceph_wbc)
+{
+	struct page *page;
+	unsigned i, nr;
+
+	if (wbc->sync_mode != WB_SYNC_NONE &&
+	    ceph_wbc->start_index == 0 && /* all dirty pages were checked */
+	    !ceph_wbc->head_snapc) {
+		ceph_wbc->index = 0;
+
+		while ((ceph_wbc->index <= ceph_wbc->end) &&
+			(nr = filemap_get_folios_tag(mapping,
+						     &ceph_wbc->index,
+						     (pgoff_t)-1,
+						     PAGECACHE_TAG_WRITEBACK,
+						     &ceph_wbc->fbatch))) {
+			for (i = 0; i < nr; i++) {
+				page = &ceph_wbc->fbatch.folios[i]->page;
+				if (page_snap_context(page) != ceph_wbc->snapc)
+					continue;
+				wait_on_page_writeback(page);
+			}
+
+			folio_batch_release(&ceph_wbc->fbatch);
+			cond_resched();
+		}
+	}
+}
+
 /*
  * initiate async writeback
  */
@@ -1402,17 +1641,12 @@ static int ceph_writepages_start(struct address_space *mapping,
 				 struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
 	struct ceph_client *cl = fsc->client;
-	struct ceph_vino vino = ceph_vino(inode);
 	struct ceph_writeback_ctl ceph_wbc;
-	struct ceph_osd_request *req = NULL;
 	int rc = 0;
-	bool caching = ceph_is_cache_enabled(inode);
 
-	if (wbc->sync_mode == WB_SYNC_NONE &&
-	    fsc->write_congested)
+	if (wbc->sync_mode == WB_SYNC_NONE && fsc->write_congested)
 		return 0;
 
 	doutc(cl, "%llx.%llx (mode=%s)\n", ceph_vinop(inode),
@@ -1439,9 +1673,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 		tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
 
 	while (!has_writeback_done(&ceph_wbc)) {
-		unsigned i;
-		struct page *page;
-
 		ceph_wbc.locked_pages = 0;
 		ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
 
@@ -1459,6 +1690,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 		if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
 			break;
 
+process_folio_batch:
 		rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
 		if (rc)
 			goto release_folios;
@@ -1466,187 +1698,30 @@ static int ceph_writepages_start(struct address_space *mapping,
 		/* did we get anything? */
 		if (!ceph_wbc.locked_pages)
 			goto release_folios;
-		if (i) {
-			unsigned j, n = 0;
-			/* shift unused page to beginning of fbatch */
-			for (j = 0; j < ceph_wbc.nr_folios; j++) {
-				if (!ceph_wbc.fbatch.folios[j])
-					continue;
-				if (n < j) {
-					ceph_wbc.fbatch.folios[n] =
-						ceph_wbc.fbatch.folios[j];
-				}
-				n++;
-			}
-			ceph_wbc.fbatch.nr = n;
 
-			if (ceph_wbc.nr_folios && i == ceph_wbc.nr_folios &&
+		if (ceph_wbc.processed_in_fbatch) {
+			ceph_shift_unused_folios_left(&ceph_wbc.fbatch);
+
+			if (folio_batch_count(&ceph_wbc.fbatch) == 0 &&
 			    ceph_wbc.locked_pages < ceph_wbc.max_pages) {
 				doutc(cl, "reached end fbatch, trying for more\n");
-				folio_batch_release(&ceph_wbc.fbatch);
 				goto get_more_pages;
 			}
 		}
 
-new_request:
-		ceph_wbc.offset = ceph_fscrypt_page_offset(ceph_wbc.pages[0]);
-		ceph_wbc.len = ceph_wbc.wsize;
-
-		req = ceph_osdc_new_request(&fsc->client->osdc,
-					&ci->i_layout, vino,
-					ceph_wbc.offset, &ceph_wbc.len,
-					0, ceph_wbc.num_ops,
-					CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
-					ceph_wbc.snapc, ceph_wbc.truncate_seq,
-					ceph_wbc.truncate_size, false);
-		if (IS_ERR(req)) {
-			req = ceph_osdc_new_request(&fsc->client->osdc,
-						&ci->i_layout, vino,
-						ceph_wbc.offset, &ceph_wbc.len,
-						0, min(ceph_wbc.num_ops,
-						    CEPH_OSD_SLAB_OPS),
-						CEPH_OSD_OP_WRITE,
-						CEPH_OSD_FLAG_WRITE,
-						ceph_wbc.snapc,
-						ceph_wbc.truncate_seq,
-						ceph_wbc.truncate_size, true);
-			BUG_ON(IS_ERR(req));
-		}
-		BUG_ON(ceph_wbc.len <
-			ceph_fscrypt_page_offset(ceph_wbc.pages[ceph_wbc.locked_pages - 1]) +
-				thp_size(ceph_wbc.pages[ceph_wbc.locked_pages - 1]) -
-					ceph_wbc.offset);
-
-		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
-			rc = -EIO;
+		rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
+		if (rc)
 			goto release_folios;
-		}
-		req->r_callback = writepages_finish;
-		req->r_inode = inode;
-
-		/* Format the osd request message and submit the write */
-		ceph_wbc.len = 0;
-		ceph_wbc.data_pages = ceph_wbc.pages;
-		ceph_wbc.op_idx = 0;
-		for (i = 0; i < ceph_wbc.locked_pages; i++) {
-			struct page *page =
-				ceph_fscrypt_pagecache_page(ceph_wbc.pages[i]);
-
-			u64 cur_offset = page_offset(page);
-			/*
-			 * Discontinuity in page range? Ceph can handle that by just passing
-			 * multiple extents in the write op.
-			 */
-			if (ceph_wbc.offset + ceph_wbc.len != cur_offset) {
-				/* If it's full, stop here */
-				if (ceph_wbc.op_idx + 1 == req->r_num_ops)
-					break;
-
-				/* Kick off an fscache write with what we have so far. */
-				ceph_fscache_write_to_cache(inode, ceph_wbc.offset,
-							    ceph_wbc.len, caching);
-
-				/* Start a new extent */
-				osd_req_op_extent_dup_last(req, ceph_wbc.op_idx,
-							   cur_offset -
-								ceph_wbc.offset);
-				doutc(cl, "got pages at %llu~%llu\n",
-					ceph_wbc.offset,
-					ceph_wbc.len);
-				osd_req_op_extent_osd_data_pages(req,
-							ceph_wbc.op_idx,
-							ceph_wbc.data_pages,
-							ceph_wbc.len, 0,
-							ceph_wbc.from_pool, false);
-				osd_req_op_extent_update(req, ceph_wbc.op_idx,
-							 ceph_wbc.len);
-
-				ceph_wbc.len = 0;
-				ceph_wbc.offset = cur_offset;
-				ceph_wbc.data_pages = ceph_wbc.pages + i;
-				ceph_wbc.op_idx++;
-			}
-
-			set_page_writeback(page);
-			if (caching)
-				ceph_set_page_fscache(page);
-			ceph_wbc.len += thp_size(page);
-		}
-		ceph_fscache_write_to_cache(inode, ceph_wbc.offset,
-					    ceph_wbc.len, caching);
-
-		if (ceph_wbc.size_stable) {
-			ceph_wbc.len = min(ceph_wbc.len,
-					    ceph_wbc.i_size - ceph_wbc.offset);
-		} else if (i == ceph_wbc.locked_pages) {
-			/* writepages_finish() clears writeback pages
-			 * according to the data length, so make sure
-			 * data length covers all locked pages */
-			u64 min_len = ceph_wbc.len + 1 - thp_size(page);
-			ceph_wbc.len =
-				get_writepages_data_length(inode,
-							ceph_wbc.pages[i - 1],
-							ceph_wbc.offset);
-			ceph_wbc.len = max(ceph_wbc.len, min_len);
-		}
-		if (IS_ENCRYPTED(inode)) {
-			ceph_wbc.len = round_up(ceph_wbc.len,
-						CEPH_FSCRYPT_BLOCK_SIZE);
-		}
 
-		doutc(cl, "got pages at %llu~%llu\n",
-			ceph_wbc.offset, ceph_wbc.len);
+		ceph_wbc.locked_pages = 0;
+		ceph_wbc.strip_unit_end = 0;
 
-		if (IS_ENCRYPTED(inode) &&
-		    ((ceph_wbc.offset | ceph_wbc.len) & ~CEPH_FSCRYPT_BLOCK_MASK))
-			pr_warn_client(cl,
-				"bad encrypted write offset=%lld len=%llu\n",
-				ceph_wbc.offset, ceph_wbc.len);
-
-		osd_req_op_extent_osd_data_pages(req, ceph_wbc.op_idx,
-						 ceph_wbc.data_pages,
-						 ceph_wbc.len,
-						 0, ceph_wbc.from_pool, false);
-		osd_req_op_extent_update(req, ceph_wbc.op_idx, ceph_wbc.len);
-
-		BUG_ON(ceph_wbc.op_idx + 1 != req->r_num_ops);
-
-		ceph_wbc.from_pool = false;
-		if (i < ceph_wbc.locked_pages) {
-			BUG_ON(ceph_wbc.num_ops <= req->r_num_ops);
-			ceph_wbc.num_ops -= req->r_num_ops;
-			ceph_wbc.locked_pages -= i;
-
-			/* allocate new pages array for next request */
-			ceph_wbc.data_pages = ceph_wbc.pages;
-			ceph_wbc.pages = kmalloc_array(ceph_wbc.locked_pages,
-							sizeof(*ceph_wbc.pages),
-							GFP_NOFS);
-			if (!ceph_wbc.pages) {
-				ceph_wbc.from_pool = true;
-				ceph_wbc.pages =
-					mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
-				BUG_ON(!ceph_wbc.pages);
-			}
-			memcpy(ceph_wbc.pages, ceph_wbc.data_pages + i,
-			       ceph_wbc.locked_pages * sizeof(*ceph_wbc.pages));
-			memset(ceph_wbc.data_pages + i, 0,
-			       ceph_wbc.locked_pages * sizeof(*ceph_wbc.pages));
-		} else {
-			BUG_ON(ceph_wbc.num_ops != req->r_num_ops);
-			ceph_wbc.index = ceph_wbc.pages[i - 1]->index + 1;
-			/* request message now owns the pages array */
-			ceph_wbc.pages = NULL;
+		if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
+			ceph_wbc.nr_folios =
+				folio_batch_count(&ceph_wbc.fbatch);
+			goto process_folio_batch;
 		}
 
-		req->r_mtime = inode_get_mtime(inode);
-		ceph_osdc_start_request(&fsc->client->osdc, req);
-		req = NULL;
-
-		wbc->nr_to_write -= i;
-		if (ceph_wbc.pages)
-			goto new_request;
-
 		/*
 		 * We stop writing back only if we are not doing
 		 * integrity sync. In case of integrity sync we have to
@@ -1666,32 +1741,12 @@ static int ceph_writepages_start(struct address_space *mapping,
 	if (ceph_wbc.should_loop && !ceph_wbc.done) {
 		/* more to do; loop back to beginning of file */
 		doutc(cl, "looping back to beginning of file\n");
-		ceph_wbc.end = ceph_wbc.start_index - 1; /* OK even when start_index == 0 */
+		/* OK even when start_index == 0 */
+		ceph_wbc.end = ceph_wbc.start_index - 1;
 
 		/* to write dirty pages associated with next snapc,
 		 * we need to wait until current writes complete */
-		if (wbc->sync_mode != WB_SYNC_NONE &&
-		    ceph_wbc.start_index == 0 && /* all dirty pages were checked */
-		    !ceph_wbc.head_snapc) {
-			struct page *page;
-			unsigned i, nr;
-			ceph_wbc.index = 0;
-			while ((ceph_wbc.index <= ceph_wbc.end) &&
-			       (nr = filemap_get_folios_tag(mapping,
-						&ceph_wbc.index,
-						(pgoff_t)-1,
-						PAGECACHE_TAG_WRITEBACK,
-						&ceph_wbc.fbatch))) {
-				for (i = 0; i < nr; i++) {
-					page = &ceph_wbc.fbatch.folios[i]->page;
-					if (page_snap_context(page) != ceph_wbc.snapc)
-						continue;
-					wait_on_page_writeback(page);
-				}
-				folio_batch_release(&ceph_wbc.fbatch);
-				cond_resched();
-			}
-		}
+		ceph_wait_until_current_writes_complete(mapping, wbc, &ceph_wbc);
 
 		ceph_wbc.start_index = 0;
 		ceph_wbc.index = 0;
@@ -1702,15 +1757,13 @@ static int ceph_writepages_start(struct address_space *mapping,
 		mapping->writeback_index = ceph_wbc.index;
 
 out:
-	ceph_osdc_put_request(req);
 	ceph_put_snap_context(ceph_wbc.last_snapc);
 	doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
 	      rc);
+
 	return rc;
 }
 
-
-
 /*
  * See if a given @snapc is either writeable, or already written.
  */
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH 4/4] ceph: fix generic/421 test failure
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
                   ` (2 preceding siblings ...)
  2025-02-05  0:02 ` [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method Viacheslav Dubeyko
@ 2025-02-05  0:02 ` Viacheslav Dubeyko
  2025-02-12 18:05 ` [RFC PATCH 0/4] " Viacheslav Dubeyko
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-05  0:02 UTC (permalink / raw)
  To: ceph-devel
  Cc: idryomov, dhowells, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko, slava

From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The generic/421 fails to finish because of the issue:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait for the flushing of
all dirty folios/pages to finish because of the racy
nature of mdsc->stopping_blockers. As a result,
mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) call fails
to increment mdsc->stopping_blockers. Consequently,
already locked folios/pages are never unlocked
and the logic tries to lock the same page a second time.
(3) The folio_batch of dirty pages found by
filemap_get_folios_tag() is not processed properly,
so some dirty pages are simply never processed and
dirty folios/pages remain after unmount.

This patch fixes the issues by means of:
(1) introducing a dirty_folios counter and a flush_end_wq
wait queue in struct ceph_mds_client;
(2) making ceph_dirty_folio() increment the dirty_folios
counter;
(3) making writepages_finish() decrement the dirty_folios
counter and wake up all waiters on the queue
when the dirty_folios counter drops to zero or below;
(4) adding to ceph_kill_sb() logic that checks
the value of the dirty_folios counter and
waits if it is bigger than zero;
(5) adding a ceph_inc_osd_stopping_blocker() call at the
beginning of ceph_writepages_start() and a
ceph_dec_osd_stopping_blocker() call at the end of
ceph_writepages_start() to resolve the racy nature
of mdsc->stopping_blockers.

sudo ./check generic/421
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch

generic/421 7s ...  4s
Ran: generic/421
Passed all 1 tests

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
---
 fs/ceph/addr.c       | 20 +++++++++++++++++++-
 fs/ceph/mds_client.c |  2 ++
 fs/ceph/mds_client.h |  3 +++
 fs/ceph/super.c      | 11 +++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 02d20c000dc5..d82ce4867fca 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -82,6 +82,7 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
 {
 	struct inode *inode = mapping->host;
 	struct ceph_client *cl = ceph_inode_to_client(inode);
+	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
 	struct ceph_inode_info *ci;
 	struct ceph_snap_context *snapc;
 
@@ -92,6 +93,8 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
 		return false;
 	}
 
+	atomic64_inc(&mdsc->dirty_folios);
+
 	ci = ceph_inode(inode);
 
 	/* dirty the head */
@@ -894,6 +897,7 @@ static void writepages_finish(struct ceph_osd_request *req)
 	struct ceph_snap_context *snapc = req->r_snapc;
 	struct address_space *mapping = inode->i_mapping;
 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
 	unsigned int len = 0;
 	bool remove_page;
 
@@ -949,6 +953,12 @@ static void writepages_finish(struct ceph_osd_request *req)
 
 			ceph_put_snap_context(detach_page_private(page));
 			end_page_writeback(page);
+
+			if (atomic64_dec_return(&mdsc->dirty_folios) <= 0) {
+				wake_up_all(&mdsc->flush_end_wq);
+				WARN_ON(atomic64_read(&mdsc->dirty_folios) < 0);
+			}
+
 			doutc(cl, "unlocking %p\n", page);
 
 			if (remove_page)
@@ -1660,13 +1670,18 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 	ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
 
+	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
+		rc = -EIO;
+		goto out;
+	}
+
 retry:
 	rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
 	if (rc == -ENODATA) {
 		/* hmm, why does writepages get called when there
 		   is no dirty data? */
 		rc = 0;
-		goto out;
+		goto dec_osd_stopping_blocker;
 	}
 
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
@@ -1756,6 +1771,9 @@ static int ceph_writepages_start(struct address_space *mapping,
 	if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0))
 		mapping->writeback_index = ceph_wbc.index;
 
+dec_osd_stopping_blocker:
+	ceph_dec_osd_stopping_blocker(fsc->mdsc);
+
 out:
 	ceph_put_snap_context(ceph_wbc.last_snapc);
 	doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 54b3421501e9..230e0c3f341f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5489,6 +5489,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
 	spin_lock_init(&mdsc->stopping_lock);
 	atomic_set(&mdsc->stopping_blockers, 0);
 	init_completion(&mdsc->stopping_waiter);
+	atomic64_set(&mdsc->dirty_folios, 0);
+	init_waitqueue_head(&mdsc->flush_end_wq);
 	init_waitqueue_head(&mdsc->session_close_wq);
 	INIT_LIST_HEAD(&mdsc->waiting_for_map);
 	mdsc->quotarealms_inodes = RB_ROOT;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 7c9fee9e80d4..3e2a6fa7c19a 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -458,6 +458,9 @@ struct ceph_mds_client {
 	atomic_t                stopping_blockers;
 	struct completion	stopping_waiter;
 
+	atomic64_t		dirty_folios;
+	wait_queue_head_t	flush_end_wq;
+
 	atomic64_t		quotarealms_count; /* # realms with quota */
 	/*
 	 * We keep a list of inodes we don't see in the mountpoint but that we
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 4344e1f11806..f3951253e393 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1563,6 +1563,17 @@ static void ceph_kill_sb(struct super_block *s)
 	 */
 	sync_filesystem(s);
 
+	if (atomic64_read(&mdsc->dirty_folios) > 0) {
+		wait_queue_head_t *wq = &mdsc->flush_end_wq;
+		long timeleft = wait_event_killable_timeout(*wq,
+					atomic64_read(&mdsc->dirty_folios) <= 0,
+					fsc->client->options->mount_timeout);
+		if (!timeleft) /* timed out */
+			pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
+		else if (timeleft < 0) /* killed */
+			pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
+	}
+
 	spin_lock(&mdsc->stopping_lock);
 	mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHING;
 	wait = !!atomic_read(&mdsc->stopping_blockers);
-- 
2.48.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re:  [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
                   ` (3 preceding siblings ...)
  2025-02-05  0:02 ` [RFC PATCH 4/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
@ 2025-02-12 18:05 ` Viacheslav Dubeyko
  2025-02-14 17:19 ` David Howells
  2025-02-28 10:22 ` Christian Brauner
  6 siblings, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-12 18:05 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org, David Howells
  Cc: idryomov@gmail.com, linux-fsdevel@vger.kernel.org,
	Patrick Donnelly, Alex Markuze, slava@dubeyko.com

Hi David,

Have you tried the fix? Does it fix the issue on your side?

Thanks,
Slava.

On Tue, 2025-02-04 at 16:02 -0800, Viacheslav Dubeyko wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> The generic/421 fails to finish because of the issue:
> 
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>
> 
> There are several issues here:
> (1) ceph_kill_sb() doesn't wait for the flushing of
> all dirty folios/pages to finish because of the racy
> nature of mdsc->stopping_blockers. As a result,
> mdsc->stopping becomes CEPH_MDSC_STOPPING_FLUSHED too early.
> (2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) call fails
> to increment mdsc->stopping_blockers. Consequently,
> already locked folios/pages are never unlocked
> and the logic tries to lock the same page a second time.
> (3) The folio_batch of dirty pages found by
> filemap_get_folios_tag() is not processed properly,
> so some dirty pages are simply never processed and
> dirty folios/pages remain after unmount.
> 
> This patchset refactors the ceph_writepages_start()
> method and fixes the issues by means of:
> (1) introducing a dirty_folios counter and a flush_end_wq
> wait queue in struct ceph_mds_client;
> (2) making ceph_dirty_folio() increment the dirty_folios
> counter;
> (3) making writepages_finish() decrement the dirty_folios
> counter and wake up all waiters on the queue
> when the dirty_folios counter drops to zero or below;
> (4) adding to ceph_kill_sb() logic that checks
> the value of the dirty_folios counter and
> waits if it is bigger than zero;
> (5) adding a ceph_inc_osd_stopping_blocker() call at the
> beginning of ceph_writepages_start() and a
> ceph_dec_osd_stopping_blocker() call at the end of
> ceph_writepages_start() to resolve the racy nature
> of mdsc->stopping_blockers.
> 
> sudo ./check generic/421
> FSTYP         -- ceph
> PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
> MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
> MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch
> 
> generic/421 7s ...  4s
> Ran: generic/421
> Passed all 1 tests
> 
> Viacheslav Dubeyko (4):
>   ceph: extend ceph_writeback_ctl for ceph_writepages_start()
>     refactoring
>   ceph: introduce ceph_process_folio_batch() method
>   ceph: introduce ceph_submit_write() method
>   ceph: fix generic/421 test failure
> 
>  fs/ceph/addr.c       | 1110 +++++++++++++++++++++++++++---------------
>  fs/ceph/mds_client.c |    2 +
>  fs/ceph/mds_client.h |    3 +
>  fs/ceph/super.c      |   11 +
>  4 files changed, 746 insertions(+), 380 deletions(-)
> 

-- 
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
                   ` (4 preceding siblings ...)
  2025-02-12 18:05 ` [RFC PATCH 0/4] " Viacheslav Dubeyko
@ 2025-02-14 17:19 ` David Howells
  2025-02-14 17:36   ` Viacheslav Dubeyko
  2025-02-14 20:35   ` David Howells
  2025-02-28 10:22 ` Christian Brauner
  6 siblings, 2 replies; 21+ messages in thread
From: David Howells @ 2025-02-14 17:19 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: dhowells, ceph-devel, idryomov, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko

Okay...   I *think* that fixes the hang.  There was one case where I saw the
hang, but I'm not sure that I had your patches applied or whether I'd managed
to boot the previous kernel that didn't.

So, just with respect to fixing the hang:

	Tested-by: David Howells <dhowells@redhat.com>

There's still the issue of encrypted filenames occasionally showing through
which generic/397 is showing up - but I don't think your patches here fix
that, right?

David


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-14 17:19 ` David Howells
@ 2025-02-14 17:36   ` Viacheslav Dubeyko
  2025-02-14 20:35   ` David Howells
  1 sibling, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-14 17:36 UTC (permalink / raw)
  To: slava@dubeyko.com, David Howells
  Cc: Alex Markuze, ceph-devel@vger.kernel.org, idryomov@gmail.com,
	linux-fsdevel@vger.kernel.org, Patrick Donnelly

On Fri, 2025-02-14 at 17:19 +0000, David Howells wrote:
> Okay...   I *think* that fixes the hang.  There was one case where I saw the
> hang, but I'm not sure that I had your patches applied or whether I'd managed
> to boot the previous kernel that didn't.
> 
> So, just with respect to fixing the hang:
> 
> 	Tested-by: David Howells <dhowells@redhat.com>
> 
> There's still the issue of encrypted filenames occasionally showing through
> which generic/397 is showing up - but I don't think your patches here fix
> that, right?
> 

This patchset doesn't fix the generic/397 issue. I sent another patch ([PATCH
v2] ceph: Fix kernel crash in generic/397 test) [1] before this one with the
fix.

Thanks,
Slava.

[1]
https://lore.kernel.org/all/CAO8a2SjrDL5TqW70P3yyqv8X-B5jfQRg-eMTs9Nbntr8=Mwbog@mail.gmail.com/T/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-14 17:19 ` David Howells
  2025-02-14 17:36   ` Viacheslav Dubeyko
@ 2025-02-14 20:35   ` David Howells
  2025-02-14 20:45     ` Viacheslav Dubeyko
  2025-02-14 20:52     ` David Howells
  1 sibling, 2 replies; 21+ messages in thread
From: David Howells @ 2025-02-14 20:35 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: dhowells, slava@dubeyko.com, Alex Markuze,
	ceph-devel@vger.kernel.org, idryomov@gmail.com,
	linux-fsdevel@vger.kernel.org, Patrick Donnelly

Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> > There's still the issue of encrypted filenames occasionally showing through
> > which generic/397 is showing up - but I don't think your patches here fix
> > that, right?
> > 
> 
> This patchset doesn't fix the generic/397 issue. I sent another patch ([PATCH
> v2] ceph: Fix kernel crash in generic/397 test) [1] before this one with the
> fix.

That doesn't fix the problem either.  That seems to be fixing a crash, not:

generic/397       - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
    --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
    +++ /root/xfstests-dev/results//generic/397.out.bad 2025-02-14 20:34:10.365900035 +0000
    @@ -1,13 +1,27 @@
     QA output created by 397
    +Only in /xfstest.scratch/ref_dir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
    +Only in /xfstest.scratch/edir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy����Sd�S�e��[�@���7,��
                                                                                            [�g��
    +Only in /xfstest.scratch/edir: 70h6RnwpEg1PMtJp9yQ,2g
    +Only in /xfstest.scratch/edir: HHBOImQ7cdmsZKNhc5yPCX+XKu0+dn4VViEQzd0q3Ig
    +Only in /xfstest.scratch/edir: HXYO3UK3FrxqwSZaNnQ5zQ
    +Only in /xfstest.scratch/edir: PecH6opy8KkkB8ir8Oz0pw
    ...
    (Run 'diff -u /root/xfstests-dev/tests/generic/397.out /root/xfstests-dev/results//generic/397.out.bad'  to see the entire diff)


David


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-14 20:35   ` David Howells
@ 2025-02-14 20:45     ` Viacheslav Dubeyko
  2025-02-14 20:52     ` David Howells
  1 sibling, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-14 20:45 UTC (permalink / raw)
  To: David Howells
  Cc: idryomov@gmail.com, Alex Markuze, slava@dubeyko.com,
	linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
	Patrick Donnelly

On Fri, 2025-02-14 at 20:35 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > > There's still the issue of encrypted filenames occasionally showing through
> > > which generic/397 is showing up - but I don't think your patches here fix
> > > that, right?
> > > 
> > 
> > This patchset doesn't fix the generic/397 issue. I sent another patch ([PATCH
> > v2] ceph: Fix kernel crash in generic/397 test) [1] before this one with the
> > fix.
> 
> That doesn't fix the problem either.  That seems to be fixing a crash, not:
> 
> generic/397       - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
>     --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
>     +++ /root/xfstests-dev/results//generic/397.out.bad 2025-02-14 20:34:10.365900035 +0000
>     @@ -1,13 +1,27 @@
>      QA output created by 397
>     +Only in /xfstest.scratch/ref_dir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
>     +Only in /xfstest.scratch/edir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy����Sd�S�e��[�@���7,��
>                                                                                             [�g��
>     +Only in /xfstest.scratch/edir: 70h6RnwpEg1PMtJp9yQ,2g
>     +Only in /xfstest.scratch/edir: HHBOImQ7cdmsZKNhc5yPCX+XKu0+dn4VViEQzd0q3Ig
>     +Only in /xfstest.scratch/edir: HXYO3UK3FrxqwSZaNnQ5zQ
>     +Only in /xfstest.scratch/edir: PecH6opy8KkkB8ir8Oz0pw
>     ...
>     (Run 'diff -u /root/xfstests-dev/tests/generic/397.out /root/xfstests-dev/results//generic/397.out.bad'  to see the entire diff)
> 
> 
> 

Do you mean that you applied this modification?

---
 fs/ceph/addr.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 85936f6d2bf7..5e6ba92219f3 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -396,6 +396,15 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 		struct page **pages;
 		size_t page_off;
 
+		/*
+		 * The io_iter.count needs to be corrected to aligned length.
+		 * Otherwise, iov_iter_get_pages_alloc2() operates with
+		 * the initial unaligned length value. As a result,
+		 * ceph_msg_data_cursor_init() triggers BUG_ON() when
+		 * msg->sparse_read_total > msg->data_length.
+		 */
+		subreq->io_iter.count = len;
+
 		err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len,
 						&page_off);
 		if (err < 0) {
 			doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
@@ -405,6 +414,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 
 		/* should always give us a page-aligned read */
 		WARN_ON_ONCE(page_off);
+
 		len = err;
 		err = 0;
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-14 20:35   ` David Howells
  2025-02-14 20:45     ` Viacheslav Dubeyko
@ 2025-02-14 20:52     ` David Howells
  2025-02-14 20:57       ` Viacheslav Dubeyko
  2025-02-14 23:52       ` Viacheslav Dubeyko
  1 sibling, 2 replies; 21+ messages in thread
From: David Howells @ 2025-02-14 20:52 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: dhowells, idryomov@gmail.com, Alex Markuze, slava@dubeyko.com,
	linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
	Patrick Donnelly

Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Do you mean that you applied this modification?

See:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes

for what I have applied.

David


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-14 20:52     ` David Howells
@ 2025-02-14 20:57       ` Viacheslav Dubeyko
  2025-02-14 23:52       ` Viacheslav Dubeyko
  1 sibling, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-14 20:57 UTC (permalink / raw)
  To: David Howells
  Cc: idryomov@gmail.com, Alex Markuze, slava@dubeyko.com,
	ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Patrick Donnelly

On Fri, 2025-02-14 at 20:52 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Do you mean that you applied this modification?
> 
> See:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes  
> 
> for what I have applied.
> 

OK. I didn't see such output during my testing:

generic/397       - output mismatch (see /root/xfstests-
dev/results//generic/397.out.bad)
    --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
    +++ /root/xfstests-dev/results//generic/397.out.bad 2025-02-14
20:34:10.365900035 +0000
    @@ -1,13 +1,27 @@
     QA output created by 397
    +Only in /xfstest.scratch/ref_dir:
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyy
    +Only in /xfstest.scratch/edir:
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy����Sd�S�e��[�@�
��7,��
                                                                               
[�g��
    +Only in /xfstest.scratch/edir: 70h6RnwpEg1PMtJp9yQ,2g
    +Only in /xfstest.scratch/edir: HHBOImQ7cdmsZKNhc5yPCX+XKu0+dn4VViEQzd0q3Ig
    +Only in /xfstest.scratch/edir: HXYO3UK3FrxqwSZaNnQ5zQ
    +Only in /xfstest.scratch/edir: PecH6opy8KkkB8ir8Oz0pw
    ...
    (Run 'diff -u /root/xfstests-dev/tests/generic/397.out /root/xfstests-
dev/results//generic/397.out.bad'  to see the entire diff)

Let me double check the test again.

Thanks,
Slava.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-02-05  0:02 ` [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method Viacheslav Dubeyko
@ 2025-02-14 21:11   ` Matthew Wilcox
  2025-02-14 21:21     ` Viacheslav Dubeyko
  2025-08-26 22:06     ` Max Kellermann
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2025-02-14 21:11 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: ceph-devel, idryomov, dhowells, linux-fsdevel, pdonnell, amarkuze,
	Slava.Dubeyko

On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> This patch implements refactoring of ceph_submit_write()
> and also it solves the second issue.
> 
> Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

This kind of giant refactoring to solve a bug is a really bad idea.
First, it's going to need to be backported to older kernels.  How far
back?  You need to identify that with a Fixes: line.

It's also really hard to review and know whether it's right.  You might
have introduced several new bugs while doing it.  In general, bugfixes
first, refactor later.  I *think* this means we can do without 1/7 of
the patches I resent earlier today, but it's really hard to be sure.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-02-14 21:11   ` Matthew Wilcox
@ 2025-02-14 21:21     ` Viacheslav Dubeyko
  2025-08-26 22:06     ` Max Kellermann
  1 sibling, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-14 21:21 UTC (permalink / raw)
  To: willy@infradead.org, slava@dubeyko.com
  Cc: Alex Markuze, ceph-devel@vger.kernel.org, idryomov@gmail.com,
	linux-fsdevel@vger.kernel.org, Patrick Donnelly, David Howells

On Fri, 2025-02-14 at 21:11 +0000, Matthew Wilcox wrote:
> On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> > This patch implements refactoring of ceph_submit_write()
> > and also it solves the second issue.
> > 
> > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> This kind of giant refactoring to solve a bug is a really bad idea.
> First, it's going to need to be backported to older kernels.  How far
> back?  You need to identify that with a Fixes: line.
> 
> It's also really hard to review and know whether it's right.  You might
> have introduced several new bugs while doing it.  In general, bugfixes
> first, refactor later.  I *think* this means we can do without 1/7 of
> the patches I resent earlier today, but it's really hard to be sure.

I really would like to avoid refactoring here. But the function is huge and its
logic is pretty complicated; it is hard to follow. The refactoring only moves
the existing code into smaller functions and nothing more.

I could skip the refactoring and simply add the fixes, but then the function
would become even bigger and more complex, from my point of view.

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-14 20:52     ` David Howells
  2025-02-14 20:57       ` Viacheslav Dubeyko
@ 2025-02-14 23:52       ` Viacheslav Dubeyko
  1 sibling, 0 replies; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-02-14 23:52 UTC (permalink / raw)
  To: David Howells
  Cc: idryomov@gmail.com, Alex Markuze, slava@dubeyko.com,
	ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Patrick Donnelly

On Fri, 2025-02-14 at 20:52 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Do you mean that you applied this modification?
> 
> See:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes  
> 
> for what I have applied.
> 

I took your branch [1] and compiled the kernel:

git status
HEAD detached at origin/netfs-fixes

But I cannot reproduce the issue:

sudo ./check -g encrypt
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.14.0-rc2+ #1 SMP
PREEMPT_DYNAMIC Fri Feb 14 23:04:17 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40137:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom
127.0.0.1:40137:/scratch /mnt/scratch

generic/395 15s ...  10s
generic/396 12s ...  9s
generic/397 13s ...  11s
generic/398 1s ... [not run] kernel doesn't support renameat2 syscall
generic/399 28s ... [not run] Filesystem ceph not supported in
_scratch_mkfs_sized_encrypted
generic/419 1s ... [not run] kernel doesn't support renameat2 syscall
generic/421 17s ...  13s
generic/429 24s ...  22s
generic/435 1115s ...  873s
generic/440 18s ...  13s
generic/548 2s ... [not run] xfs_io fiemap  failed (old kernel/wrong fs?)
generic/549 2s ... [not run] encryption policy '-c 5 -n 6 -f 0' is unusable;
probably missing kernel crypto API support
generic/550 4s ... [not run] encryption policy '-c 9 -n 9 -f 0' is unusable;
probably missing kernel crypto API support
generic/576       [not run] fsverity utility required, skipped this test
generic/580 18s ...  15s
generic/581 21s ...  20s
generic/582 2s ... [not run] xfs_io fiemap  failed (old kernel/wrong fs?)
generic/583 2s ... [not run] encryption policy '-c 5 -n 6 -v 2 -f 0' is
unusable; probably missing kernel crypto API support
generic/584 3s ... [not run] encryption policy '-c 9 -n 9 -v 2 -f 0' is
unusable; probably missing kernel crypto API support
generic/592 3s ... [not run] kernel does not support encryption policy: '-c 1 -n
4 -v 2 -f 8'
generic/593 18s ...  14s
generic/595 20s ...  19s
generic/602 2s ... [not run] kernel does not support encryption policy: '-c 1 -n
4 -v 2 -f 16'
generic/613 5s ... [not run] _get_encryption_nonce() isn't implemented on ceph
generic/621 6s ... [not run] kernel doesn't support renameat2 syscall
generic/693 6s ... [not run] encryption policy '-c 1 -n 10 -v 2 -f 0' is
unusable; probably missing kernel crypto API support
generic/739       [not run] xfs_io set_encpolicy doesn't support -s
Ran: generic/395 generic/396 generic/397 generic/398 generic/399 generic/419
generic/421 generic/429 generic/435 generic/440 generic/548 generic/549
generic/550 generic/576 generic/580 generic/581 generic/582 generic/583
generic/584 generic/592 generic/593 generic/595 generic/602 generic/613
generic/621 generic/693 generic/739
Not run: generic/398 generic/399 generic/419 generic/548 generic/549 generic/550
generic/576 generic/582 generic/583 generic/584 generic/592 generic/602
generic/613 generic/621 generic/693 generic/739
Passed all 27 tests

[1] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/4] ceph: fix generic/421 test failure
  2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
                   ` (5 preceding siblings ...)
  2025-02-14 17:19 ` David Howells
@ 2025-02-28 10:22 ` Christian Brauner
  6 siblings, 0 replies; 21+ messages in thread
From: Christian Brauner @ 2025-02-28 10:22 UTC (permalink / raw)
  To: ceph-devel, Viacheslav Dubeyko
  Cc: Christian Brauner, idryomov, dhowells, linux-fsdevel, pdonnell,
	amarkuze, Slava.Dubeyko

On Tue, 04 Feb 2025 16:02:45 -0800, Viacheslav Dubeyko wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> The generic/421 fails to finish because of the issue:
> 
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>
> 
> [...]

Applied to the vfs-6.15.ceph branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.ceph branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.ceph

[1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
      https://git.kernel.org/vfs/vfs/c/f08068df4aa4
[2/4] ceph: introduce ceph_process_folio_batch() method
      https://git.kernel.org/vfs/vfs/c/ce80b76dd327
[3/4] ceph: introduce ceph_submit_write() method
      https://git.kernel.org/vfs/vfs/c/1551ec61dc55
[4/4] ceph: fix generic/421 test failure
      https://git.kernel.org/vfs/vfs/c/fd7449d937e7

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-02-14 21:11   ` Matthew Wilcox
  2025-02-14 21:21     ` Viacheslav Dubeyko
@ 2025-08-26 22:06     ` Max Kellermann
  2025-08-26 22:33       ` Viacheslav Dubeyko
  1 sibling, 1 reply; 21+ messages in thread
From: Max Kellermann @ 2025-08-26 22:06 UTC (permalink / raw)
  To: Matthew Wilcox, brauner
  Cc: Viacheslav Dubeyko, ceph-devel, idryomov, dhowells, linux-fsdevel,
	pdonnell, amarkuze, Slava.Dubeyko, linux-kernel

On 2025/02/14 22:11, Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> > This patch implements refactoring of ceph_submit_write()
> > and also it solves the second issue.
> > 
> > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> This kind of giant refactoring to solve a bug is a really bad idea.
> First, it's going to need to be backported to older kernels.  How far
> back?  You need to identify that with a Fixes: line.
> 
> It's also really hard to review and know whether it's right.  You might
> have introduced several new bugs while doing it.  In general, bugfixes
> first, refactor later.  I *think* this means we can do without 1/7 of
> the patches I resent earlier today, but it's really hard to be sure.

I'm very disappointed that nobody has listened to Matthew's complaint.
Viacheslav has only hand-waved it away with a lazy non-argument.

From Documentation/process/submitting-patches.rst:

 "you should not modify the moved code at all in the same patch which
 moves it"

Obviously, this patch set violates this rule.  There are lots of
semantic/behavior changes in the patches that move code around.

In the end, Christian Brauner has merged this into Linux 6.15 and that
merge has wreaked havoc in our production clusters.  We have been
testing 6.15 for a month with no problems (after David Howells had
fixed yet-another netfs regression that was stable-backported to 6.15,
ugh!), but when we updated a production cluster, all servers crashed
after a few hours, and our ops team had a very bad night.

This patch set is obviously bad.  It pretends to fix a bug, but really
rewrites almost everything in two patches documented as "introduce XY
method" with no real explanation for why Viacheslav has decided to do
it, instead of just fixing the bug (as Matthew asked him to).

Look at this line modified by patch "ceph: extend ceph_writeback_ctl
for ceph_writepages_start() refactoring":

> +             ceph_wbc.fbatch.folios[i] = NULL;

This sets a folio_batch element to NULL, which will, of course, crash
in folios_put_refs() (but only if the global huge zero page has
already been created).  Fortunately, there's code that removes all
NULL elements from the folio_batch array.  That is code that already
existed before Viacheslav's patch set (code which I already dislike
because it's a fragile mess that is just waiting to crash), and the
code was only being moved around.

Did I mention that I think this is a fragile mess?  Fast-forward to
Viacheslav's patch "ceph: introduce ceph_process_folio_batch() method"
which moves the NULL-setting loop to ceph_process_folio_batch().  Look
at this (untouched) piece of code after the ceph_process_folio_batch()
call:

>   if (i) {
>        unsigned j, n = 0;
>        /* shift unused page to beginning of fbatch */

Shifting only happens if at least one folio has been processed ("i"
was incremented).  But does it really happen?

No, the loop was moved to ceph_process_folio_batch(), and nobody ever
increments "i" anymore.  Voila, crash due to NULL pointer dereference:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0 
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE 
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

Viacheslav's third patch "ceph: introduce ceph_submit_write() method"
messes up the logic a bit more and makes it more fragile by hiding the
shifting code behind more conditions:

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for some of various
  reasons), shifting never happens

- if processed_in_fbatch is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

If shifting doesn't happen, then the kernel crashes (unless
huge-zero-page doesn't exist, see above).  Obviously, nobody has ever
looked closely enough at the code.  I'm still new to Linux memory
management and file systems, but these problems were obvious when I
first saw this patch set (which was my candidate for the other 6.15
crashes which then turned out to be netfs regressions, not Ceph).

This whole patch set is a huge mess and has caused my team a good
amount of pain.  This could and should have been avoided, had only
somebody listened to Matthew.

(Also look at all those "checkpatch.pl" complaints on all patches in
this patch set.  There are many coding style violations.)

Can we please revert the whole patch set?  I don't think it's possible
to fix all the weird undocumented side effects that may cause more
crashes once we fix this one.  Refactoring the Ceph code sure is
necessary, it's not in a good shape, but it should be done more
carefully.  Some people (like me) depend on Ceph's stability, and this
mess is doing a disservice to Ceph's reputation.

Max

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-26 22:06     ` Max Kellermann
@ 2025-08-26 22:33       ` Viacheslav Dubeyko
  2025-08-26 22:53         ` Max Kellermann
  0 siblings, 1 reply; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-26 22:33 UTC (permalink / raw)
  To: brauner@kernel.org, Patrick Donnelly,
	linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, slava@dubeyko.com, David Howells,
	Alex Markuze, willy@infradead.org, idryomov@gmail.com

On Wed, 2025-08-27 at 00:06 +0200, Max Kellermann wrote:
> On 2025/02/14 22:11, Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> > > This patch implements refactoring of ceph_submit_write()
> > > and also it solves the second issue.
> > > 
> > > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> > 
> > This kind of giant refactoring to solve a bug is a really bad idea.
> > First, it's going to need to be backported to older kernels.  How far
> > back?  You need to identify that with a Fixes: line.
> > 
> > It's also really hard to review and know whether it's right.  You might
> > have introduced several new bugs while doing it.  In general, bugfixes
> > first, refactor later.  I *think* this means we can do without 1/7 of
> > the patches I resent earlier today, but it's really hard to be sure.
> 
> I'm very disappointed that nobody has listened to Matthew's complaint.
> Viacheslav has only hand-waved it away with a lazy non-argument.
> 
> From Documentation/process/submitting-patches.rst:
> 
>  "you should not modify the moved code at all in the same patch which
>  moves it"
> 
> Obviously, this patch set violates this rule.  There are lots of
> semantic/behavior changes in the patches that move code around.
> 
> In the end, Christian Brauner has merged this into Linux 6.15 and that
> merge has wreaked havoc in our production clusters.  We have been
> testing 6.15 for a month with no problems (after David Howells had
> fixed yet-another netfs regression that was stable-backported to 6.15,
> ugh!), but when we updated a production cluster, all servers crashed
> after a few hours, and our ops team had a very bad night.
> 
> This patch set is obviously bad.  It pretends to fix a bug, but really
> rewrites almost everything in two patches documented as "introduce XY
> method" with no real explanation for why Viacheslav has decided to do
> it, instead of just fixing the bug (as Matthew asked him to).
> 
> Look at this line modified by patch "ceph: extend ceph_writeback_ctl
> for ceph_writepages_start() refactoring":
> 
> > +             ceph_wbc.fbatch.folios[i] = NULL;
> 
> This sets a folio_batch element to NULL, which will, of course, crash
> in folios_put_refs() (but only if the global huge zero page has
> already been created).  Fortunately, there's code that removes all
> NULL elements from the folio_batch array.  That is code that already
> existed before Viacheslav's patch set (code which I already dislike
> because it's a fragile mess that is just waiting to crash), and the
> code was only being moved around.
> 
> Did I mention that I think this is a fragile mess?  Fast-forward to
> Viacheslav's patch "ceph: introduce ceph_process_folio_batch() method"
> which moves the NULL-setting loop to ceph_process_folio_batch().  Look
> at this (untouched) piece of code after the ceph_process_folio_batch()
> call:
> 
> >   if (i) {
> >        unsigned j, n = 0;
> >        /* shift unused page to beginning of fbatch */
> 
> Shifting only happens if at least one folio has been processed ("i"
> was incremented).  But does it really happen?
> 
> No, the loop was moved to ceph_process_folio_batch(), and nobody ever
> increments "i" anymore.  Voila, crash due to NULL pointer dereference:
> 
>  BUG: kernel NULL pointer dereference, address: 0000000000000034
>  #PF: supervisor write access in kernel mode
>  #PF: error_code(0x0002) - not-present page
>  PGD 0 P4D 0 
>  Oops: Oops: 0002 [#1] SMP NOPTI
>  CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE 
>  Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
>  Workqueue: writeback wb_workfn (flush-ceph-1)
>  RIP: 0010:folios_put_refs+0x85/0x140
>  Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
>  RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
>  RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
>  RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
>  RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
>  R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
>  R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
>  FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
>  PKRU: 55555554
>  Call Trace:
>   <TASK>
>   ceph_writepages_start+0xeb9/0x1410
> 
> Viacheslav's third patch "ceph: introduce ceph_submit_write() method"
> messes up the logic a bit more and makes it more fragile by hiding the
> shifting code behind more conditions:
> 
> - if ceph_process_folio_batch() fails, shifting never happens
> 
> - if ceph_move_dirty_page_in_page_array() was never called (because
>   ceph_process_folio_batch() has returned early for some of various
>   reasons), shifting never happens
> 
> - if processed_in_fbatch is zero (because ceph_process_folio_batch()
>   has returned early for some of the reasons mentioned above or
>   because ceph_move_dirty_page_in_page_array() has failed), shifting
>   never happens
> 
> If shifting doesn't happen, then the kernel crashes (unless
> huge-zero-page doesn't exist, see above).  Obviously, nobody has ever
> looked closely enough at the code.  I'm still new to Linux memory
> management and file systems, but these problems were obvious when I
> first saw this patch set (which was my candidate for the other 6.15
> crashes which then turned out to be netfs regressions, not Ceph).
> 
> This whole patch set is a huge mess and has caused my team a good
> amount of pain.  This could and should have been avoided, had only
> somebody listened to Matthew.
> 
> (Also look at all those "checkpatch.pl" complaints on all patches in
> this patch set.  There are many coding style violations.)
> 
> Can we please revert the whole patch set?  I don't think it's possible
> to fix all the weird undocumented side effects that may cause more
> crashes once we fix this one.  Refactoring the Ceph code sure is
> necessary, it's not in a good shape, but it should be done more
> carefully.  Some people (like me) depend on Ceph's stability, and this
> mess is doing a disservice to Ceph's reputation.

Of course, we can revert any patch. However, this patchset was not sent as pure
refactoring; it fixes several bugs, and reverting would bring those bugs back.
The patchset was available for review for a long time, so anybody was able to
review it and share comments. I am always happy to rework and to make any patch
more better. From my point of view, reverting is not the answer; it makes more
sense to continue fixing bugs and making the CephFS code more stable.

Thanks,
Slava.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-26 22:33       ` Viacheslav Dubeyko
@ 2025-08-26 22:53         ` Max Kellermann
  2025-08-27  3:06           ` Viacheslav Dubeyko
  0 siblings, 1 reply; 21+ messages in thread
From: Max Kellermann @ 2025-08-26 22:53 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: brauner@kernel.org, Patrick Donnelly,
	linux-fsdevel@vger.kernel.org, ceph-devel@vger.kernel.org,
	linux-kernel@vger.kernel.org, slava@dubeyko.com, David Howells,
	Alex Markuze, willy@infradead.org, idryomov@gmail.com

On 2025/08/27 00:33, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> Of course, we can revert any patch. However, this patchset was not sent as
> pure refactoring; it fixes several bugs, and reverting would bring those
> bugs back.

You should have listened to Matthew and submitted separate minimal
bug-fix patches instead of posting huge patches which move code
around, change semantics, and bury some bug fix somewhere deep within
(and then introduce new bugs).

> This patchset was available for review for a long time.

There was exactly one review, and no, you were not "happy to rework
and to make any patch more better" - you openly rejected Matthew's
review.

> From my point of view, reverting is not the answer; it makes more
> sense to continue fixing bugs and making the CephFS code more stable.

Your argument only appears to sound right, but it is detached from the
reality I'm living in.

Your patches made Ceph less stable.  6.14 had one Ceph-related crash
every other week, but 6.15 with your patches made all servers crash
within hours.

The point is: the Linux kernel was better without your patches.  Your
patches may have fixed a bug, but they introduced a dozen new bugs,
including one that very quickly crashes the whole kernel.  That one was
obvious enough, but nobody read the code closely after you rejected
Matthew's review.  Too bad no maintainer stopped you!

Of course, the bug that was fixed by your patch set should be fixed -
but not the way you did it.  Every aspect of your approach to fixing
the bug was bad.

The best way forward for you would be to revert this patch set and
write a minimal patch that only fixes the bug.  If you want to be
helpful here, please give this a try.

Max

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-26 22:53         ` Max Kellermann
@ 2025-08-27  3:06           ` Viacheslav Dubeyko
  2025-08-27  4:57             ` Max Kellermann
  0 siblings, 1 reply; 21+ messages in thread
From: Viacheslav Dubeyko @ 2025-08-27  3:06 UTC (permalink / raw)
  To: David Howells, brauner@kernel.org, slava@dubeyko.com,
	ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Alex Markuze, Patrick Donnelly, willy@infradead.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

On Wed, 2025-08-27 at 00:53 +0200, Max Kellermann wrote:
> On 2025/08/27 00:33, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> > Of course, we can revert any patch. This patchset has been sent not with the
> > goal of pure refactoring but it fixes several bugs. Reverting means returning
> > these bugs back.
> 
> You should have listened to Matthew and submitted separate minimal
> bug-fixing patches instead of posting huge patches which move code
> around, change semantics, and fix some bug hidden somewhere deep
> within (and then introduce new bugs).
> 
> > This patchset was available for review for a long time.
> 
> There was exactly one review, and no, you were not "happy to rework
> and to make any patch more better" - you openly rejected Matthew's
> review.
> 
> > From my point of view, reverting is not the answer, and it makes
> > sense to continue fixing bugs and to make the CephFS code more
> > stable.
> 
> Your argument may sound right, but it is detached from the reality
> I'm living in.
> 
> Your patches made Ceph less stable.  6.14 had one Ceph-related crash
> every other week, but 6.15 with your patches made all servers crash
> within hours.
> 
> The point is: the Linux kernel was better without your patches.  Your
> patches may have fixed a bug, but they introduced a dozen new bugs,
> including one that very quickly crashes the whole kernel - one that
> was really quite obvious; it's just that nobody cared to read deeply
> enough after you rejected Matthew's review.  Too bad no maintainer
> stopped you!
> 
> Of course, the bug that was fixed by your patch set should be fixed -
> but not the way you did it.  Every aspect of your approach to fixing
> the bug was bad.
> 
> The best way forward for you would be to revert this patch set and
> write a minimal patch that only fixes the bug.  If you want to be
> helpful here, please give this a try.
> 
> 

This patchset was tested by xfstests, and no issues were triggered.
We don't have enough Ceph-related test cases, so you are welcome to
add a test case for the issue.
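(For readers following the thread: running this test through xfstests
requires a local.config describing the CephFS mounts.  A minimal
sketch is below; the monitor address, credentials, and mount points
are placeholders/assumptions, not values taken from this thread.)

```shell
# xfstests local.config sketch for running generic/421 against kernel CephFS.
# All concrete values below are placeholders for illustration only.
export FSTYP=ceph
export TEST_DIR=/mnt/test                 # mount point for the test fs
export TEST_DEV=192.168.0.10:6789:/test   # ceph mon address + path (placeholder)
export SCRATCH_MNT=/mnt/scratch           # mount point for the scratch fs
export SCRATCH_DEV=192.168.0.10:6789:/scratch
export TEST_FS_MOUNT_OPTS="-o name=admin,secretfile=/etc/ceph/admin.secret"

# Then, from the xfstests checkout:
#   ./check generic/421
```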

This issue was reported more than a month ago.  I tried to reproduce
it, but I had no luck.  So, if you are lucky enough, then simply share
the patch or the way to reproduce the issue.  The patchset simply
moved old code into dedicated functions with the goal of managing
complexity.  So, the old code is still there, and the patchset cannot
have introduced a "dozen new bugs".  Otherwise, xfstests could easily
reproduce this gazillion of bugs. :)  Currently, we have only one
reported issue.

Open source is a space for collaboration and mutual respect.  It is
not a space for offensive or bullying behavior.

Thanks,
Slava.


* Re: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
  2025-08-27  3:06           ` Viacheslav Dubeyko
@ 2025-08-27  4:57             ` Max Kellermann
  0 siblings, 0 replies; 21+ messages in thread
From: Max Kellermann @ 2025-08-27  4:57 UTC (permalink / raw)
  To: Viacheslav Dubeyko
  Cc: David Howells, brauner@kernel.org, slava@dubeyko.com,
	ceph-devel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Alex Markuze, Patrick Donnelly, willy@infradead.org,
	idryomov@gmail.com, linux-fsdevel@vger.kernel.org

On 2025/08/27 05:06, Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> Open source is a space for collaboration and mutual respect.  It is
> not a space for offensive or bullying behavior.

This is now your third reply to a comment on your patch set (one to
Matthew and two to me), and you are still sidestepping the valid
criticism.  This time, instead of talking about how to fix it, you
are explaining how well you tested this, followed by this
meta-comment.

No, for the bug discussion, it does not matter how well you tested
this.  If it's buggy, it's buggy.  For me, as an amateur, the bug was
obvious enough to see in the source code, and after my long
explanation, it should be easy enough for everybody else to see,
shouldn't it?  And no, I don't owe you a test case.  I explained the
exact conditions that need to occur for the crash to happen, and I
have to trust that you are capable of easily injecting such an error
yourself.

This is not what I call "collaboration", nor "respect"; to me it
rather seems like ignorance.  You were not collaborating with the
reviewer, and you are not respecting the bug report.  You just divert
the discussion instead of talking about the actual problems.


And, note: in both of your replies to me, you removed me from the
recipient list.  Everybody else you kept, you only removed me.  I had
to look up your replies on lore.kernel.org.  (You did not do that with
your replies to Matthew and David.)

What kind of funny behavior is that?

Max


end of thread, other threads:[~2025-08-27  4:57 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-05  0:02 [RFC PATCH 0/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
2025-02-05  0:02 ` [RFC PATCH 1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring Viacheslav Dubeyko
2025-02-05  0:02 ` [RFC PATCH 2/4] ceph: introduce ceph_process_folio_batch() method Viacheslav Dubeyko
2025-02-05  0:02 ` [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method Viacheslav Dubeyko
2025-02-14 21:11   ` Matthew Wilcox
2025-02-14 21:21     ` Viacheslav Dubeyko
2025-08-26 22:06     ` Max Kellermann
2025-08-26 22:33       ` Viacheslav Dubeyko
2025-08-26 22:53         ` Max Kellermann
2025-08-27  3:06           ` Viacheslav Dubeyko
2025-08-27  4:57             ` Max Kellermann
2025-02-05  0:02 ` [RFC PATCH 4/4] ceph: fix generic/421 test failure Viacheslav Dubeyko
2025-02-12 18:05 ` [RFC PATCH 0/4] " Viacheslav Dubeyko
2025-02-14 17:19 ` David Howells
2025-02-14 17:36   ` Viacheslav Dubeyko
2025-02-14 20:35   ` David Howells
2025-02-14 20:45     ` Viacheslav Dubeyko
2025-02-14 20:52     ` David Howells
2025-02-14 20:57       ` Viacheslav Dubeyko
2025-02-14 23:52       ` Viacheslav Dubeyko
2025-02-28 10:22 ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).