* [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
@ 2026-03-23 22:03 Chenglong Tang
2026-03-24 7:53 ` Amir Goldstein
0 siblings, 1 reply; 3+ messages in thread
From: Chenglong Tang @ 2026-03-23 22:03 UTC (permalink / raw)
To: linux-unionfs, linux-ext4, linux-fsdevel, linux-kernel,
regressions
Cc: jack, Jan Kara, miklos, Amir Goldstein, tytso, adilger.kernel,
viro, brauner, Kevin Berry, Robert Kolchmeyer, Deepa Dinamani,
He Gao
Hi all,
We are tracking a severe performance regression in Google's
Container-Optimized OS (COS) that appeared when moving from the 6.6
LTS kernel to the 6.12 LTS kernel.
Under concurrent CI workloads (specifically, many containers doing
Python package compilation / .pyc generation simultaneously), the 6.12
kernel suffers from massive jbd2 journal contention. Processes hang
for 20-30 seconds waiting for VFS locks and journal space. On 6.6, the
exact same workload completes in ~4 seconds.
# Environment:
* Host FS: ext4 (backed by standard cloud block storage)
* Container FS: OverlayFS (Docker)
* Machine: n2d-highmem-96 (96 vCPU, high memory)
* Good Kernel: 6.6.87
* Bad Kernels: 6.12.55, 6.12.68
# The Bottleneck
During the 20+ second hang, `cat /proc/<pid>/stack` reveals three
distinct groups of blocked processes thrashing on the jbd2 journal.
The OverlayFS copy-up path appears to generate so many synchronous
ext4 transactions that it exhausts jbd2 journal space.
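A sketch of how such stacks can be gathered and tallied across all blocked
tasks (the `tally_frames` helper is ours, not a standard tool; reading
/proc/<pid>/stack requires root, and the D-state task set varies per run):

```bash
# Sketch: tally the most common frames from a set of /proc/<pid>/stack dumps.
tally_frames() {
    awk '/^\[<0>\]/ {sub(/\+.*/, "", $2); print $2}' | sort | uniq -c | sort -rn
}

# Dump stacks of all uninterruptible (D-state) tasks; on an idle
# machine this loop may produce no output at all.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    cat "/proc/$pid/stack" 2>/dev/null
done | tally_frames
```

The three groups below are the dominant signatures from such a tally.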
1. Journal Space Exhaustion (Waiting to start transaction):
[<0>] __jbd2_log_wait_for_space+0xa3/0x240
[<0>] start_this_handle+0x42d/0x8a0
[<0>] jbd2__journal_start+0x103/0x1e0
[<0>] __ext4_journal_start_sb+0x129/0x1c0
[<0>] __ext4_new_inode+0x7cd/0x1290
[<0>] ext4_create+0xbc/0x1b0
[<0>] vfs_create+0x192/0x250
[<0>] ovl_create_real+0xd5/0x170
[<0>] ovl_create_or_link+0x1d7/0x7f0
2. VFS Rename / Copy-up Contention (Blocked by the slow sync):
[<0>] lock_rename+0x29/0x50
[<0>] ovl_copy_up_flags+0x84c/0x12e0
[<0>] ovl_create_object+0x4a/0x120
[<0>] vfs_mkdir+0x1aa/0x260
[<0>] do_mkdirat+0xb9/0x240
3. Synchronous Flush Blocking:
[<0>] jbd2_log_wait_commit+0x107/0x150
[<0>] jbd2_journal_force_commit+0x9c/0xc0
[<0>] ext4_sync_file+0x278/0x310
[<0>] ovl_sync_file+0x2f/0x50
[<0>] ovl_copy_up_metadata+0x455/0x4b0
# Minimal Reproducer
The issue is easily reproducible by triggering 20 concurrent cold
Python imports in Docker, which forces OverlayFS to copy-up the
`__pycache__` directories and write the `.pyc` files.
```bash
# 1. Build a clean image with no pre-compiled bytecode
cat << 'EOF' > Dockerfile
FROM python:3.10-slim
RUN pip install --quiet google-cloud-compute
RUN find /usr/local -type d -name "__pycache__" -prune -exec rm -rf {} +
EOF
docker build -t clean-import-test .
# 2. Fire 20 concurrent imports
for i in {1..20}; do
  docker run --rm clean-import-test \
    bash -c 'time python -c "import google.cloud.compute_v1"' \
    > clean_test_cold_$i.log 2>&1 &
done
wait
grep "real" clean_test_cold_*.log
```
On 6.6.87, all 20 containers finish in ~4.3s.
On 6.12.x, they hang and finish in 17-27s. Skipping the bytecode
writes entirely (PYTHONDONTWRITEBYTECODE=1) fully mitigates the
regression on 6.12, confirming this is an ext4/overlayfs I/O
contention issue rather than a CPU scheduling one.
Because the regression spans from 6.6 to 6.12, bisection is quite
heavy. Before we initiate a full kernel bisect, does this symptom ring
a bell for any ext4 fast_commit, jbd2 locking, or OverlayFS
metacopy/sync changes introduced during this window?
Any pointers or patches you'd like us to test would be greatly appreciated.
Thanks,
Chenglong Tang
Google Container-Optimized OS Team
* Re: [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
@ 2026-03-24 7:53 Amir Goldstein
From: Amir Goldstein @ 2026-03-24 7:53 UTC (permalink / raw)
To: Chenglong Tang
Cc: linux-unionfs, linux-ext4, linux-fsdevel, linux-kernel,
regressions, jack, Jan Kara, miklos, tytso, adilger.kernel,
viro, brauner, Kevin Berry, Robert Kolchmeyer, Deepa Dinamani,
He Gao, Fei Lv
On Mon, Mar 23, 2026 at 11:03 PM Chenglong Tang
<chenglongtang@google.com> wrote:
>
> Hi all,

Hi Chenglong,

[...]

> # Minimal Reproducer
> The issue is easily reproducible by triggering 20 concurrent cold
> Python imports in Docker, which forces OverlayFS to copy-up the
> `__pycache__` directories and write the `.pyc` files.
>
> [...]

I don't understand.

You write that Python imports in Docker force OverlayFS to copy up the
`__pycache__` directories, but the prep stage removes all the
`__pycache__` directories.

My guess would be that rm -rf __pycache__ would generate a lot of
metadata copy-ups, but you write that the issue occurs during the
2nd stage. Maybe I misunderstood.

Please try to figure out which and how many copy-up objects this
translates to, for directories and for files?

[...]

> Any pointers or patches you'd like us to test would be greatly appreciated.

Very high suspect:

7d6899fb69d25 ovl: fsync after metadata copy-up

As you can see from this discussion [1], this performance regression
was somewhat anticipated:

"Now we just need to hope that users won't come shouting about
performance regressions."

[1] https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/

With metacopy disabled, this change introduced fsyncs on metadata-only
changes made by overlayfs, which could generate a lot of journal stress
and explain the regression.

But we had not anticipated that workloads would be affected with
metacopy disabled, because it was anticipated that data fsync
would be the more significant bottleneck.

Do your containers have metacopy enabled?
If not, why not? Is it because metacopy conflicts with some other
overlayfs feature that you need, like userxattr?

Thinking out loud, I wonder if the metadata copy-up code would benefit
from calling export_ops->commit_metadata() when supported by the upper
fs instead of open+vfs_fsync(), but I doubt that would relieve journal
stress in this case.

Anyway, please see if forcing metadata_fsync off solves the regression,
and I will stage the original patch from Fei to make metadata_fsync
opt-in.

Thanks,
Amir.

--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -1154,7 +1154,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 	 * that will hurt performance of workloads such as chown -R, so we
 	 * only fsync on data copyup as legacy behavior.
 	 */
-	ctx.metadata_fsync = !OVL_FS(dentry->d_sb)->config.metacopy &&
+	ctx.metadata_fsync = 0 && !OVL_FS(dentry->d_sb)->config.metacopy &&
 		(S_ISREG(ctx.stat.mode) || S_ISDIR(ctx.stat.mode));
 	ctx.metacopy = ovl_need_meta_copy_up(dentry, ctx.stat.mode, flags);
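[A side note for anyone reproducing this: whether metacopy is in effect
can be checked from userspace via the module parameter and mount options;
a sketch, where availability of both interfaces depends on kernel config
and whether the overlay module is loaded:]

```bash
# Sketch: report overlayfs metacopy status, globally and per mount.
cat /sys/module/overlay/parameters/metacopy 2>/dev/null \
    || echo "overlay module parameter not available"
grep -o 'metacopy=[^,[:space:]]*' /proc/mounts 2>/dev/null \
    || echo "no overlay mounts with an explicit metacopy option"
```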
* Re: [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
@ 2026-03-24 8:28 Chenglong Tang
From: Chenglong Tang @ 2026-03-24 8:28 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-unionfs, linux-ext4, linux-fsdevel, linux-kernel,
regressions, jack, Jan Kara, miklos, tytso, adilger.kernel,
viro, brauner, Kevin Berry, Robert Kolchmeyer, Deepa Dinamani,
He Gao, Fei Lv
Hi Amir,

You absolutely nailed it. Thank you.

Regarding the test, you are correct: the rm -rf happens in the Docker
build phase, so the timed test purely measures the burst creation of
the new __pycache__ directories and .pyc files on a clean slate.

I checked our environment, and metacopy is indeed disabled by default.
We generally keep it disabled for broader compatibility with various
container runtimes and user-namespace tooling that expect a full
copy-up.

To test your theory, I dynamically enabled it
(echo Y > /sys/module/overlay/parameters/metacopy) and re-ran the
20-container concurrent test. The journal lock contention completely
vanished, and the times dropped from ~27 seconds back down to ~4.3
seconds, fully restoring the 6.6 performance.

I am currently building a custom COS image with your suggested one-line
patch (ctx.metadata_fsync = 0 && ...) to verify it on our 96-core test
rig. I highly expect it to be the root cause, and I will report back
with benchmark results as soon as the build finishes.

Given that we keep metacopy disabled, your proposed patch to make
metadata_fsync opt-in would be a good fix for us.

Thanks again for the pointers!

Best,
Chenglong

On Tue, Mar 24, 2026 at 12:53 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [...]