public inbox for linux-fsdevel@vger.kernel.org
From: Chenglong Tang <chenglongtang@google.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-unionfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	 regressions@lists.linux.dev, jack@suse.com,
	Jan Kara <jack@suse.cz>,
	 miklos@szeredi.hu, tytso@mit.edu, adilger.kernel@dilger.ca,
	 viro@zeniv.linux.org.uk, brauner@kernel.org,
	Kevin Berry <kpberry@google.com>,
	 Robert Kolchmeyer <rkolchmeyer@google.com>,
	Deepa Dinamani <deepadinamani@google.com>,
	 He Gao <hegao@google.com>, Fei Lv <feilv@asrmicro.com>
Subject: Re: [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
Date: Tue, 24 Mar 2026 01:28:11 -0700	[thread overview]
Message-ID: <CAOdxtTY9yKa5ZfzkoJYK_P=_mKygy1aQmCYfDdXo-xeU+pxqbA@mail.gmail.com> (raw)
In-Reply-To: <CAOQ4uxjuAfCJXGRchDf-7d+uCS+8=Du_Y8OzgX15w4-fOR_oHQ@mail.gmail.com>

Hi Amir,

You absolutely nailed it. Thank you.

Regarding the test, you are correct: the rm -rf happens in the Docker
build phase, so the timed test purely measures the burst creation of
the new __pycache__ directories and .pyc files on a clean slate.
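
To put a number on that burst, here is a sketch of how one could count the
objects landing in a container's upper layer (on a real host, with Docker's
overlay2 driver, the upperdir path can be read via
`docker inspect --format '{{.GraphDriver.Data.UpperDir}}' <container>`; the
block below builds a small stand-in tree instead, so it runs anywhere):

```shell
# Sketch: count copy-up objects (directories vs. files) in an overlay
# upperdir. On a real host UPPER would come from docker inspect's
# GraphDriver.Data.UpperDir (overlay2 driver assumed); here we create a
# small stand-in tree so the example is self-contained.
UPPER=$(mktemp -d)
mkdir -p "$UPPER/usr/local/lib/python3.10/__pycache__"
touch "$UPPER/usr/local/lib/python3.10/__pycache__/mod.cpython-310.pyc"

dirs=$(find "$UPPER" -type d | wc -l)
files=$(find "$UPPER" -type f | wc -l)
echo "dirs=$dirs files=$files"

rm -rf "$UPPER"
```

Comparing those two counts before and after the import run would answer the
directories-vs-files question directly.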

I checked our environment, and metacopy is indeed disabled by default.
We generally keep it disabled for broader compatibility with various
container runtimes and user-namespace tooling that expect a full
copy-up.

To test your theory, I dynamically enabled it (echo Y >
/sys/module/overlay/parameters/metacopy) and re-ran the 20-container
concurrent test. The journal lock contention completely vanished, and
the times dropped from ~27 seconds back down to ~4.3 seconds, fully
restoring the 6.6 performance.
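
For the record, the toggle-and-retest looked roughly like this (a sketch:
the image name comes from the reproducer quoted below, the log-file names
are made up here, and the script restores the previous value afterwards;
it needs root and skips cleanly where the parameter is absent):

```shell
# Sketch: enable overlayfs metacopy at runtime, re-run the concurrent
# import test, then restore the previous value. Requires root; the
# clean-import-test image is built by the reproducer in the quoted mail.
PARAM=/sys/module/overlay/parameters/metacopy
if [ -w "$PARAM" ]; then
    orig=$(cat "$PARAM")
    echo Y > "$PARAM"
    for i in $(seq 1 20); do
        docker run --rm clean-import-test \
            bash -c 'time python -c "import google.cloud.compute_v1"' \
            > "metacopy_test_$i.log" 2>&1 &
    done
    wait
    grep real metacopy_test_*.log
    echo "$orig" > "$PARAM"   # restore the boot-time default
    status=reran
else
    echo "metacopy parameter not writable here; skipping"
    status=skipped
fi
```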

I am currently building a custom COS image with your suggested one-line
patch (ctx.metadata_fsync = 0 && ...) to verify it on our 96-core test
rig. I fully expect this to confirm the root cause, and I will report
back with the benchmark results as soon as the build finishes.

Given that constraint on enabling metacopy in production, your proposed
patch to make metadata_fsync opt-in would be a good fix for us.

Thanks again for the pointers!

Best,

Chenglong

On Tue, Mar 24, 2026 at 12:53 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 11:03 PM Chenglong Tang
> <chenglongtang@google.com> wrote:
> >
> > Hi all,
>
> Hi Chenglong,
>
> >
> > We are tracking a severe performance regression in Google's
> > Container-Optimized OS (COS) that appeared when moving from the 6.6
> > LTS kernel to the 6.12 LTS kernel.
> >
> > Under concurrent CI workloads (specifically, many containers doing
> > Python package compilation / .pyc generation simultaneously), the 6.12
> > kernel suffers from massive jbd2 journal contention. Processes hang
> > for 20-30 seconds waiting for VFS locks and journal space. On 6.6, the
> > exact same workload completes in ~4 seconds.
> >
> > # Environment:
> > * Host FS: ext4 (backed by standard cloud block storage)
> > * Container FS: OverlayFS (Docker)
> > * Machine: n2d-highmem-96 (96 vCPU, high memory)
> > * Good Kernel: 6.6.87
> > * Bad Kernels: 6.12.55, 6.12.68
> >
> > # The Bottleneck
> > During the 20+ second hang, `cat /proc/<pid>/stack` reveals three
> > distinct groups of blocked processes thrashing on the jbd2 journal.
> > The OverlayFS copy-up mechanism seems to be generating so many
> > synchronous ext4 transactions that it exhausts the jbd2 transaction
> > buffers.
> >
> > 1. Journal Space Exhaustion (Waiting to start transaction):
> > [<0>] __jbd2_log_wait_for_space+0xa3/0x240
> > [<0>] start_this_handle+0x42d/0x8a0
> > [<0>] jbd2__journal_start+0x103/0x1e0
> > [<0>] __ext4_journal_start_sb+0x129/0x1c0
> > [<0>] __ext4_new_inode+0x7cd/0x1290
> > [<0>] ext4_create+0xbc/0x1b0
> > [<0>] vfs_create+0x192/0x250
> > [<0>] ovl_create_real+0xd5/0x170
> > [<0>] ovl_create_or_link+0x1d7/0x7f0
> >
> > 2. VFS Rename / Copy-up Contention (Blocked by the slow sync):
> > [<0>] lock_rename+0x29/0x50
> > [<0>] ovl_copy_up_flags+0x84c/0x12e0
> > [<0>] ovl_create_object+0x4a/0x120
> > [<0>] vfs_mkdir+0x1aa/0x260
> > [<0>] do_mkdirat+0xb9/0x240
> >
> > 3. Synchronous Flush Blocking:
> > [<0>] jbd2_log_wait_commit+0x107/0x150
> > [<0>] jbd2_journal_force_commit+0x9c/0xc0
> > [<0>] ext4_sync_file+0x278/0x310
> > [<0>] ovl_sync_file+0x2f/0x50
> > [<0>] ovl_copy_up_metadata+0x455/0x4b0
> >
> > # Minimal Reproducer
> > The issue is easily reproducible by triggering 20 concurrent cold
> > Python imports in Docker, which forces OverlayFS to copy-up the
> > `__pycache__` directories and write the `.pyc` files.
> >
> > ```bash
> > # 1. Build a clean image with no pre-compiled bytecode
> > cat << 'EOF' > Dockerfile
> > FROM python:3.10-slim
> > RUN pip install --quiet google-cloud-compute
> > RUN find /usr/local -type d -name "__pycache__" -prune -exec rm -rf {} +
> > EOF
> > docker build -t clean-import-test .
> >
> > # 2. Fire 20 concurrent imports
> > for i in {1..20}; do
> >   docker run --rm clean-import-test \
> >     bash -c 'time python -c "import google.cloud.compute_v1"' \
> >     > clean_test_cold_$i.log 2>&1 &
> > done
> > wait
> > grep "real" clean_test_cold_*.log
> > ```
>
> I don't understand.
>
> You write that Python imports in Docker forces OverlayFS to copy-up the
> `__pycache__` directories, but the prep stage removes all the
> `__pycache__` directories.
>
> My guess would be that rm -rf __pycache__ would generate a lot of
> metadata copy ups, but you write that the issue occurs during the
> 2nd stage. Maybe I misunderstood.
>
> Please try to figure out how many copy-up objects this translates to:
> how many directories, and how many files?
>
> >
> > On 6.6.87, all 20 containers finish in ~4.3s.
> > On 6.12.x, they hang and finish between 17s and 27s. Skipping the
> > bytecode writes entirely (PYTHONDONTWRITEBYTECODE=1) completely
> > mitigates the regression on 6.12, confirming it is an ext4/overlayfs
> > I/O contention issue rather than a CPU scheduling one.
> >
> > Because the regression spans from 6.6 to 6.12, bisection is quite
> > heavy. Before we initiate a full kernel bisect, does this symptom ring
> > a bell for any ext4 fast_commit, jbd2 locking, or OverlayFS
> > metacopy/sync changes introduced during this window?
> >
> > Any pointers or patches you'd like us to test would be greatly appreciated.
> >
>
> Very high suspect:
>
> 7d6899fb69d25 ovl: fsync after metadata copy-up
>
> As you can see from this discussion [1] this performance regression
> was somewhat anticipated:
>
> "Now we just need to hope that users won't come shouting about
> performance regressions."
>
> [1] https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/
>
> With metacopy disabled, this change introduced fsyncs on metadata-only
> changes made by overlayfs, which can generate a lot of journal stress
> and would explain the regression.
>
> But we had not anticipated that workloads with metacopy disabled
> would be affected this badly, because we expected the data fsync
> to be the more significant bottleneck.
>
> Do your containers have metacopy enabled?
> If not, why not? Is it because metacopy is conflicting with some
> other overlayfs feature that you need like userxattr?
>
> Thinking out loud, I wonder if the metadata copy-up code would benefit
> from calling export_ops->commit_metadata() when supported by the upper
> fs instead of open+vfs_fsync(), but I doubt that would relieve the
> journal stress in this case.
>
> Anyway, please see if forcing metadata_fsync off solves the regression
> and I will stage the original patch from Fei to make metadata_fsync
> opt-in.
>
> Thanks,
> Amir.
>
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -1154,7 +1154,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>          * that will hurt performance of workloads such as chown -R, so we
>          * only fsync on data copyup as legacy behavior.
>          */
> -       ctx.metadata_fsync = !OVL_FS(dentry->d_sb)->config.metacopy &&
> +       ctx.metadata_fsync = 0 && !OVL_FS(dentry->d_sb)->config.metacopy &&
>                              (S_ISREG(ctx.stat.mode) || S_ISDIR(ctx.stat.mode));
>         ctx.metacopy = ovl_need_meta_copy_up(dentry, ctx.stat.mode, flags);


Thread overview: 3+ messages
2026-03-23 22:03 [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12) Chenglong Tang
2026-03-24  7:53 ` Amir Goldstein
2026-03-24  8:28   ` Chenglong Tang [this message]
