From: Gao Xiang <xiang@kernel.org>
To: Nithurshen <nithurshen.dev@gmail.com>
Cc: linux-erofs@lists.ozlabs.org, hsiangkao@linux.alibaba.com,
xiang@kernel.org
Subject: Re: [PATCH 1/2] fsck.erofs: introduce multi-threaded decompression with static batching
Date: Sun, 7 Jun 2026 09:50:39 +0800 [thread overview]
Message-ID: <aiTObzdT2gThkwln@debian> (raw)
In-Reply-To: <20260523003757.13078-2-nithurshen.dev@gmail.com>
Hi Nithurshen,
On Sat, May 23, 2026 at 06:07:56AM +0530, Nithurshen wrote:
> Currently, fsck.erofs extracts files synchronously. When decompressing
> heavily packed images (like LZ4HC with 4K pclusters), the main thread
> spends the majority of its time blocked on a combination of synchronous
> vfs_write() syscalls and LZ4_decompress_safe(), bottlenecking overall
> extraction speed.
>
> This patch introduces a scalable, multi-threaded decompression framework
> using the existing erofs_workqueue infrastructure to decouple compute
> from the main thread's I/O.
>
> To prevent massive scheduling overhead (futex contention) where worker
> threads spend more CPU time waking up than actually decompressing small
> 4KB clusters, this implementation introduces a batching context. The
> main thread collects an array of sequential pclusters (temporarily hard-
> capped at Z_EROFS_PCLUSTER_BATCH_SIZE = 32) before submitting a single
> erofs_work unit.
>
> Key details of this implementation:
> - The worker pool is dynamically sized based on available system CPUs.
> - Decompression tasks take strict ownership of the raw and output
> buffers (safely tracking memory via a `free_out` flag) to prevent
> data races and memory leaks.
> - Output buffers are explicitly zero-initialized via calloc() to
> prevent trailing garbage bytes from leaking into extracted files.
> - Tail-end packed fragments are processed synchronously by the main
> thread, as their minimal overhead does not benefit from asynchronous
> offloading.
>
> Signed-off-by: Nithurshen <nithurshen.dev@gmail.com>
> ---
> fsck/main.c | 234 +++++++++++++++++----------------------
> include/erofs/internal.h | 18 ++-
> lib/data.c | 203 +++++++++++++++++++++++----------
> 3 files changed, 265 insertions(+), 190 deletions(-)
>
> diff --git a/fsck/main.c b/fsck/main.c
> index 16cc627..d7810e8 100644
> --- a/fsck/main.c
> +++ b/fsck/main.c
> @@ -8,14 +8,18 @@
> #include <time.h>
> #include <utime.h>
> #include <unistd.h>
> +#include <pthread.h>
> #include <sys/stat.h>
> #include "erofs/print.h"
> #include "erofs/decompress.h"
> #include "erofs/dir.h"
> #include "erofs/xattr.h"
> +#include "erofs/workqueue.h"
> #include "../lib/compressor.h"
> #include "../lib/liberofs_compress.h"
>
> +extern struct erofs_workqueue erofs_wq;
> +
> static int erofsfsck_check_inode(erofs_nid_t pnid, erofs_nid_t nid);
>
> struct erofsfsck_dirstack {
> @@ -505,135 +509,95 @@ out:
>
> static int erofs_verify_inode_data(struct erofs_inode *inode, int outfd)
> {
> - struct erofs_map_blocks map = {
> - .buf = __EROFS_BUF_INITIALIZER,
> - };
> - bool needdecode = fsckcfg.check_decomp && !erofs_is_packed_inode(inode);
> - int ret = 0;
> - bool compressed;
> - erofs_off_t pos = 0;
> - u64 pchunk_len = 0;
> - unsigned int raw_size = 0, buffer_size = 0;
> - char *raw = NULL, *buffer = NULL;
> -
> - erofs_dbg("verify data chunk of nid(%llu): type(%d)",
> - inode->nid | 0ULL, inode->datalayout);
> -
> - compressed = erofs_inode_is_data_compressed(inode->datalayout);
> - while (pos < inode->i_size) {
> - unsigned int alloc_rawsize;
> -
> - map.m_la = pos;
> - ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_FIEMAP);
> - if (ret)
> - goto out;
> -
> - if (!compressed && map.m_llen != map.m_plen) {
> - erofs_err("broken chunk length m_la %" PRIu64 " m_llen %" PRIu64 " m_plen %" PRIu64,
> - map.m_la, map.m_llen, map.m_plen);
> - ret = -EFSCORRUPTED;
> - goto out;
> - }
> -
> - /* the last lcluster can be divided into 3 parts */
> - if (map.m_la + map.m_llen > inode->i_size)
> - map.m_llen = inode->i_size - map.m_la;
> -
> - pchunk_len += map.m_plen;
> - pos += map.m_llen;
> -
> - /* should skip decomp? */
> - if (map.m_la >= inode->i_size || !needdecode)
> - continue;
> + struct erofs_map_blocks map = { .buf = __EROFS_BUF_INITIALIZER };
> + bool needdecode = fsckcfg.check_decomp && !erofs_is_packed_inode(inode);
> + int ret = 0;
> + bool compressed = erofs_inode_is_data_compressed(inode->datalayout);
> + erofs_off_t pos = 0;
> + u64 pchunk_len = 0;
> +
> + struct z_erofs_read_ctx ctx = {
> + .pending_tasks = 0,
> + .final_err = 0,
> + .outfd = outfd,
> + .free_out = true,
> + .current_task = NULL
> + };
> + pthread_mutex_init(&ctx.lock, NULL);
> + pthread_cond_init(&ctx.cond, NULL);
Please avoid barely used pthread interface.
For example, erofs_mutex_t is needed for erofs-utils
instead of pthread_mutex_t;
pthread_cond_init needs to be replaced by an abstract
too.
Also, I may forget to mention that, your new
implementation should be workable for both EROFS_MT_ENABLED
and !EROFS_MT_ENABLED, not only EROFS_MT_ENABLED.
> +
> + erofs_dbg("verify data chunk of nid(%llu): type(%d)", inode->nid | 0ULL, inode->datalayout);
> +
> + while (pos < inode->i_size) {
> + map.m_la = pos;
> + ret = erofs_map_blocks(inode, &map, EROFS_GET_BLOCKS_FIEMAP);
> + if (ret) goto out;
> +
> + if (map.m_la + map.m_llen > inode->i_size)
> + map.m_llen = inode->i_size - map.m_la;
> +
> + pchunk_len += map.m_plen;
> + pos += map.m_llen;
> +
> + if (map.m_la >= inode->i_size || !needdecode)
> + continue;
> +
> + if (outfd >= 0 && !(map.m_flags & EROFS_MAP_MAPPED)) {
> + ret = lseek(outfd, map.m_llen, SEEK_CUR);
> + if (ret < 0) {
> + ret = -errno;
> + goto out;
> + }
> + continue;
> + }
> +
> + if (compressed) {
> + char *raw = malloc(map.m_plen);
> + size_t buffer_size = map.m_llen > erofs_blksiz(inode->sbi) ? map.m_llen : erofs_blksiz(inode->sbi);
> + char *buffer = calloc(1, buffer_size);
> +
> + if (!raw || !buffer) {
> + free(raw); free(buffer);
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = z_erofs_read_one_data(inode, &map, raw, buffer, 0, map.m_llen, false, map.m_la, &ctx);
> + if (ret) {
> + /* DO NOT free(raw) or free(buffer) here. z_erofs_read_one_data took ownership! */
> + goto out;
> + }
> + } else {
> + char *raw = calloc(1, map.m_llen);
> + ret = erofs_read_one_data(inode, &map, raw, 0, map.m_llen);
> + if (ret >= 0 && outfd >= 0)
> + pwrite(outfd, raw, map.m_llen, map.m_la);
> + free(raw);
> + if (ret) goto out;
> + }
I think the file read should be multithreaded too, especially the inode
could be large.
Thanks,
Gao Xiang
next prev parent reply other threads:[~2026-06-07 1:50 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-23 0:37 [PATCH 0/2] fsck.erofs: introduce multi-threaded decompression Nithurshen
2026-05-23 0:37 ` [PATCH 1/2] fsck.erofs: introduce multi-threaded decompression with static batching Nithurshen
2026-06-07 1:50 ` Gao Xiang [this message]
2026-05-23 0:37 ` [PATCH 2/2] fsck.erofs: implement dynamic pcluster batching based on algorithm complexity Nithurshen
2026-06-07 1:52 ` Gao Xiang
2026-06-08 5:07 ` [PATCH v2 0/2] fsck.erofs: add multi-threaded decompression Nithurshen
2026-06-08 5:07 ` [PATCH v2 1/2] fsck.erofs: introduce multi-threaded decompression with static batching Nithurshen
2026-06-08 6:25 ` Gao Xiang
2026-06-08 5:07 ` [PATCH v2 2/2] fsck.erofs: implement algorithm-aware pcluster batching Nithurshen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aiTObzdT2gThkwln@debian \
--to=xiang@kernel.org \
--cc=hsiangkao@linux.alibaba.com \
--cc=linux-erofs@lists.ozlabs.org \
--cc=nithurshen.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.