* [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
@ 2026-03-23 22:03 Chenglong Tang
2026-03-24 7:53 ` Amir Goldstein
0 siblings, 1 reply; 3+ messages in thread
From: Chenglong Tang @ 2026-03-23 22:03 UTC (permalink / raw)
To: linux-unionfs, linux-ext4, linux-fsdevel, linux-kernel,
regressions
Cc: jack, Jan Kara, miklos, Amir Goldstein, tytso, adilger.kernel,
viro, brauner, Kevin Berry, Robert Kolchmeyer, Deepa Dinamani,
He Gao
Hi all,
We are tracking a severe performance regression in Google's
Container-Optimized OS (COS) that appeared when moving from the 6.6
LTS kernel to the 6.12 LTS kernel.
Under concurrent CI workloads (specifically, many containers doing
Python package compilation / .pyc generation simultaneously), the 6.12
kernel suffers from massive jbd2 journal contention. Processes hang
for 20-30 seconds waiting for VFS locks and journal space. On 6.6, the
exact same workload completes in ~4 seconds.
# Environment:
* Host FS: ext4 (backed by standard cloud block storage)
* Container FS: OverlayFS (Docker)
* Machine: n2d-highmem-96 (96 vCPU, high memory)
* Good Kernel: 6.6.87
* Bad Kernels: 6.12.55, 6.12.68
# The Bottleneck
During the 20+ second hang, `cat /proc/<pid>/stack` reveals three
distinct groups of blocked processes thrashing on the jbd2 journal.
The OverlayFS copy-up path appears to generate so many synchronous
ext4 transactions that it exhausts jbd2 journal space.
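A sketch of how such stacks can be gathered and tallied across all blocked
tasks (the `tally_frames` helper is ours, not a standard tool; reading
/proc/<pid>/stack requires root, and the D-state task set varies per run):

```bash
# Sketch: tally the most common frames from a set of /proc/<pid>/stack dumps.
tally_frames() {
    awk '/^\[<0>\]/ {sub(/\+.*/, "", $2); print $2}' | sort | uniq -c | sort -rn
}

# Dump stacks of all uninterruptible (D-state) tasks; on an idle
# machine this loop may produce no output at all.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    cat "/proc/$pid/stack" 2>/dev/null
done | tally_frames
```

The three groups below are the dominant signatures from such a tally.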
1. Journal Space Exhaustion (Waiting to start transaction):
[<0>] __jbd2_log_wait_for_space+0xa3/0x240
[<0>] start_this_handle+0x42d/0x8a0
[<0>] jbd2__journal_start+0x103/0x1e0
[<0>] __ext4_journal_start_sb+0x129/0x1c0
[<0>] __ext4_new_inode+0x7cd/0x1290
[<0>] ext4_create+0xbc/0x1b0
[<0>] vfs_create+0x192/0x250
[<0>] ovl_create_real+0xd5/0x170
[<0>] ovl_create_or_link+0x1d7/0x7f0
2. VFS Rename / Copy-up Contention (Blocked by the slow sync):
[<0>] lock_rename+0x29/0x50
[<0>] ovl_copy_up_flags+0x84c/0x12e0
[<0>] ovl_create_object+0x4a/0x120
[<0>] vfs_mkdir+0x1aa/0x260
[<0>] do_mkdirat+0xb9/0x240
3. Synchronous Flush Blocking:
[<0>] jbd2_log_wait_commit+0x107/0x150
[<0>] jbd2_journal_force_commit+0x9c/0xc0
[<0>] ext4_sync_file+0x278/0x310
[<0>] ovl_sync_file+0x2f/0x50
[<0>] ovl_copy_up_metadata+0x455/0x4b0
# Minimal Reproducer
The issue is easily reproducible by triggering 20 concurrent cold
Python imports in Docker, which forces OverlayFS to copy-up the
`__pycache__` directories and write the `.pyc` files.
```bash
# 1. Build a clean image with no pre-compiled bytecode
cat << 'EOF' > Dockerfile
FROM python:3.10-slim
RUN pip install --quiet google-cloud-compute
RUN find /usr/local -type d -name "__pycache__" -prune -exec rm -rf {} +
EOF
docker build -t clean-import-test .
# 2. Fire 20 concurrent imports
for i in {1..20}; do
  docker run --rm clean-import-test \
    bash -c 'time python -c "import google.cloud.compute_v1"' \
    > clean_test_cold_$i.log 2>&1 &
done
wait
grep "real" clean_test_cold_*.log
```
On 6.6.87, all 20 containers finish in ~4.3s.
On 6.12.x, they hang and finish in 17-27s. Skipping the bytecode
writes entirely (PYTHONDONTWRITEBYTECODE=1) fully mitigates the
regression on 6.12, confirming this is an ext4/overlayfs I/O
contention issue rather than a CPU scheduling one.
Because the regression spans from 6.6 to 6.12, bisection is quite
heavy. Before we initiate a full kernel bisect, does this symptom ring
a bell for any ext4 fast_commit, jbd2 locking, or OverlayFS
metacopy/sync changes introduced during this window?
Any pointers or patches you'd like us to test would be greatly appreciated.
Thanks,
Chenglong Tang
Google Container-Optimized OS Team
* Re: [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
@ 2026-03-24 7:53 Amir Goldstein
From: Amir Goldstein @ 2026-03-24 7:53 UTC (permalink / raw)
To: Chenglong Tang
Cc: linux-unionfs, linux-ext4, linux-fsdevel, linux-kernel,
regressions, jack, Jan Kara, miklos, tytso, adilger.kernel,
viro, brauner, Kevin Berry, Robert Kolchmeyer, Deepa Dinamani,
He Gao, Fei Lv
On Mon, Mar 23, 2026 at 11:03 PM Chenglong Tang
<chenglongtang@google.com> wrote:
>
> Hi all,

Hi Chenglong,

[...]

> # Minimal Reproducer
> The issue is easily reproducible by triggering 20 concurrent cold
> Python imports in Docker, which forces OverlayFS to copy-up the
> `__pycache__` directories and write the `.pyc` files.
>
> [...]

I don't understand.

You write that Python imports in Docker force OverlayFS to copy up the
`__pycache__` directories, but the prep stage removes all the
`__pycache__` directories.

My guess would be that rm -rf __pycache__ would generate a lot of
metadata copy-ups, but you write that the issue occurs during the
2nd stage. Maybe I misunderstood.

Please try to figure out which and how many copy-up objects this
translates to, for directories and for files?

[...]

> Any pointers or patches you'd like us to test would be greatly appreciated.

Very high suspect:

7d6899fb69d25 ovl: fsync after metadata copy-up

As you can see from this discussion [1], this performance regression
was somewhat anticipated:

"Now we just need to hope that users won't come shouting about
performance regressions."

[1] https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/

With metacopy disabled, this change introduced fsyncs on metadata-only
changes made by overlayfs, which could generate a lot of journal stress
and explain the regression.

But we had not anticipated that workloads would be affected with
metacopy disabled, because it was anticipated that data fsync
would be the more significant bottleneck.

Do your containers have metacopy enabled?
If not, why not? Is it because metacopy conflicts with some other
overlayfs feature that you need, like userxattr?

Thinking out loud, I wonder if the metadata copy-up code would benefit
from calling export_ops->commit_metadata() when supported by the upper
fs instead of open+vfs_fsync(), but I doubt that would relieve journal
stress in this case.

Anyway, please see if forcing metadata_fsync off solves the regression,
and I will stage the original patch from Fei to make metadata_fsync
opt-in.

Thanks,
Amir.

--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -1154,7 +1154,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 	 * that will hurt performance of workloads such as chown -R, so we
 	 * only fsync on data copyup as legacy behavior.
 	 */
-	ctx.metadata_fsync = !OVL_FS(dentry->d_sb)->config.metacopy &&
+	ctx.metadata_fsync = 0 && !OVL_FS(dentry->d_sb)->config.metacopy &&
 		(S_ISREG(ctx.stat.mode) || S_ISDIR(ctx.stat.mode));
 	ctx.metacopy = ovl_need_meta_copy_up(dentry, ctx.stat.mode, flags);
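[A side note for anyone reproducing this: whether metacopy is in effect
can be checked from userspace via the module parameter and mount options;
a sketch, where availability of both interfaces depends on kernel config
and whether the overlay module is loaded:]

```bash
# Sketch: report overlayfs metacopy status, globally and per mount.
cat /sys/module/overlay/parameters/metacopy 2>/dev/null \
    || echo "overlay module parameter not available"
grep -o 'metacopy=[^,[:space:]]*' /proc/mounts 2>/dev/null \
    || echo "no overlay mounts with an explicit metacopy option"
```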
* Re: [REGRESSION] fs/overlayfs/ext4: Severe jbd2 lock contention and journal starvation on concurrent copy-up (v6.6 -> v6.12)
@ 2026-03-24 8:28 Chenglong Tang
From: Chenglong Tang @ 2026-03-24 8:28 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-unionfs, linux-ext4, linux-fsdevel, linux-kernel,
regressions, jack, Jan Kara, miklos, tytso, adilger.kernel,
viro, brauner, Kevin Berry, Robert Kolchmeyer, Deepa Dinamani,
He Gao, Fei Lv
Hi Amir,

You absolutely nailed it. Thank you.

Regarding the test, you are correct: the rm -rf happens in the Docker
build phase, so the timed test purely measures the burst creation of
the new __pycache__ directories and .pyc files on a clean slate.

I checked our environment, and metacopy is indeed disabled by default.
We generally keep it disabled for broader compatibility with various
container runtimes and user-namespace tooling that expect a full
copy-up.

To test your theory, I dynamically enabled it
(echo Y > /sys/module/overlay/parameters/metacopy) and re-ran the
20-container concurrent test. The journal lock contention completely
vanished, and the times dropped from ~27 seconds back down to ~4.3
seconds, fully restoring the 6.6 performance.

I am currently building a custom COS image with your suggested one-line
patch (ctx.metadata_fsync = 0 && ...) to verify it on our 96-core test
rig. I highly expect it to be the root cause, and I will report back
with benchmark results as soon as the build finishes.

Given that we keep metacopy disabled, your proposed patch to make
metadata_fsync opt-in would be a good fix for us.

Thanks again for the pointers!

Best,
Chenglong

On Tue, Mar 24, 2026 at 12:53 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [...]