Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Uschakow, Stanislav" <suschako@amazon.de>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: "trix@redhat.com" <trix@redhat.com>,
	"ndesaulniers@google.com" <ndesaulniers@google.com>,
	"nathan@kernel.org" <nathan@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>,
	"mike.kravetz@oracle.com" <mike.kravetz@oracle.com>,
	"jannh@google.com" <jannh@google.com>,
	"lorenzo.stoakes@oracle.com" <lorenzo.stoakes@oracle.com>,
	"liam.howlett@oracle.com" <liam.howlett@oracle.com>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>,
	"osalvador@suse.de" <osalvador@suse.de>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"jannh@google.com" <jannh@google.com>
Subject: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
Date: Fri, 29 Aug 2025 14:30:46 +0000	[thread overview]
Message-ID: <4d3878531c76479d9f8ca9789dc6485d@amazon.de> (raw)

Hello.

We have observed a huge latency increase using `fork()` after ingesting the CVE-2025-38085 fix which leads to the commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory with 196 cores, we identified mmapping of 1.2TB of shared memory and forking itself dozens or hundreds of times we see a increase of execution times of a factor of 4. The reproducer is at the end of the email.

Comparing the a kernel without this patch with a kernel with this patch applied when spawning 1000 children we see those execution times:


Patched kernel: 
$ time make stress
...
real    0m11.275s
user    0m0.177s
sys     0m23.905s

Original kernel : 

$ time make stress
...real    0m2.475s
user    0m1.398s
sys     0m2.501s


The patch in question: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=1013af4f585fccc4d3e5c5824d174de2257f7d6d


My observation/assumption is:

each child touches 100 random pages and despawns
on each despawn `huge_pmd_unshare()` is called
each call to `huge_pmd_unshare()` syncrhonizes all threads using `tlb_remove_table_sync_one()` leading to the regression



I'm happy to provide more information.




Thank you
Stanislav Uschakow








=== Reproducer ===

Setup:


#!/bin/bash
echo "Setting up hugepages for reproduction..."

# hugepages (1.2TB / 2MB = 614400 pages)
REQUIRED_PAGES=614400

# Check current hugepage allocation
CURRENT_PAGES=$(cat /proc/sys/vm/nr_hugepages)
echo "Current hugepages: $CURRENT_PAGES"

if [ "$CURRENT_PAGES" -lt "$REQUIRED_PAGES" ]; then
    echo "Allocating $REQUIRED_PAGES hugepages..."
    echo $REQUIRED_PAGES | sudo tee /proc/sys/vm/nr_hugepages

    ALLOCATED=$(cat /proc/sys/vm/nr_hugepages)
    echo "Allocated hugepages: $ALLOCATED"
    
    if [ "$ALLOCATED" -lt "$REQUIRED_PAGES" ]; then
        echo "Warning: Could not allocate all required hugepages"
        echo "Available: $ALLOCATED, Required: $REQUIRED_PAGES"
    fi
fi

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

echo -e "\nHugepage information:"
cat /proc/meminfo | grep -i huge

echo -e "\nSetup complete. You can now run the reproduction test."



Makefile:


CXX = gcc
CXXFLAGS = -O2 -Wall
TARGET = hugepage_repro
SOURCE = hugepage_repro.c

$(TARGET): $(SOURCE)
    $(CXX) $(CXXFLAGS) -o $(TARGET) $(SOURCE)

clean:
    rm -f $(TARGET)

setup:
    chmod +x setup_hugepages.sh
    ./setup_hugepages.sh

test: $(TARGET)
    ./$(TARGET) 20 3

stress: $(TARGET)
    ./$(TARGET) 1000 1

.PHONY: clean setup test stress



hugepage_repro.c:


#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <stdio.h>

#define HUGEPAGE_SIZE (2 * 1024 * 1024) // 2MB
#define TOTAL_SIZE (1200ULL * 1024 * 1024 * 1024) // 1.2TB
#define NUM_HUGEPAGES (TOTAL_SIZE / HUGEPAGE_SIZE)

void* create_hugepage_mapping() {
    void* addr = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap hugepages failed");
        exit(1);
    }
    return addr;
}

void touch_random_pages(void* addr, int num_touches) {
    char* base = (char*)addr;
    for (int i = 0; i < num_touches; ++i) {
        size_t offset = (rand() % NUM_HUGEPAGES) * HUGEPAGE_SIZE;
        volatile char val = base[offset];
        (void)val;
    }
}

void child_process(void* shared_mem, int child_id) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    touch_random_pages(shared_mem, 100);
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    long duration = (end.tv_sec - start.tv_sec) * 1000000 + 
                   (end.tv_nsec - start.tv_nsec) / 1000;
    
    printf("Child %d completed in %ld μs\n", child_id, duration);
}

int main(int argc, char* argv[]) {
    int num_processes = argc > 1 ? atoi(argv[1]) : 50;
    int iterations = argc > 2 ? atoi(argv[2]) : 5;
    
    printf("Creating %lluGB hugepage mapping...\n", TOTAL_SIZE / (1024*1024*1024));
    void* shared_mem = create_hugepage_mapping();
    
    for (int iter = 0; iter < iterations; ++iter) {
        printf("\nIteration %d: Forking %d processes\n", iter + 1, num_processes);
        
        pid_t children[num_processes];
        struct timespec iter_start, iter_end;
        clock_gettime(CLOCK_MONOTONIC, &iter_start);
        
        for (int i = 0; i < num_processes; ++i) {
            pid_t pid = fork();
            if (pid == 0) {
                child_process(shared_mem, i);
                exit(0);
            } else if (pid > 0) {
                children[i] = pid;
            }
        }
        
        for (int i = 0; i < num_processes; ++i) {
            waitpid(children[i], NULL, 0);
        }
        
        clock_gettime(CLOCK_MONOTONIC, &iter_end);
        long iter_duration = (iter_end.tv_sec - iter_start.tv_sec) * 1000 + 
                            (iter_end.tv_nsec - iter_start.tv_nsec) / 1000000;
        printf("Iteration completed in %ld ms\n", iter_duration);
    }
    
    munmap(shared_mem, TOTAL_SIZE);
    return 0;
}




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

next             reply	other threads:[~2025-08-29 14:31 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-29 14:30 Uschakow, Stanislav [this message]
2025-09-01 10:58 ` Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race Jann Horn
2025-09-01 11:26   ` David Hildenbrand
2025-09-04 12:39     ` Uschakow, Stanislav
2025-10-08 22:54     ` Prakash Sangappa
2025-10-09  7:23       ` David Hildenbrand
2025-10-09 15:06         ` Prakash Sangappa
2025-10-09  7:40   ` David Hildenbrand
2025-10-09  8:19     ` David Hildenbrand
2025-10-16  9:21     ` Lorenzo Stoakes
2025-10-16 19:13       ` David Hildenbrand
2025-10-16 18:44     ` Jann Horn
2025-10-16 19:10       ` David Hildenbrand
2025-10-16 19:26         ` Jann Horn
2025-10-16 19:44           ` David Hildenbrand
2025-10-16 20:25             ` Jann Horn
2025-10-20 15:00       ` Lorenzo Stoakes
2025-10-20 15:33         ` Jann Horn
2025-10-24 12:24           ` Lorenzo Stoakes
2025-10-24 18:22             ` Jann Horn
2025-10-24 19:02               ` Lorenzo Stoakes
2025-10-24 19:43                 ` Jann Horn
2025-10-24 19:58                   ` Lorenzo Stoakes
2025-10-24 21:41                     ` Jann Horn
2025-10-29 16:19                   ` David Hildenbrand
2025-10-29 18:02                     ` Lorenzo Stoakes
2025-11-18 10:03                       ` David Hildenbrand (Red Hat)
2025-11-19 16:08                         ` Lorenzo Stoakes
2025-11-19 16:29                           ` David Hildenbrand (Red Hat)
2025-11-19 16:31                             ` David Hildenbrand (Red Hat)
2025-11-20 15:47                               ` David Hildenbrand (Red Hat)
2025-12-03 17:22                                 ` Prakash Sangappa
2025-12-03 19:45                                   ` David Hildenbrand (Red Hat)
2025-10-20 17:18         ` David Hildenbrand
2025-10-24  9:59           ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4d3878531c76479d9f8ca9789dc6485d@amazon.de \
    --to=suschako@amazon.de \
    --cc=akpm@linux-foundation.org \
    --cc=jannh@google.com \
    --cc=liam.howlett@oracle.com \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mike.kravetz@oracle.com \
    --cc=muchun.song@linux.dev \
    --cc=nathan@kernel.org \
    --cc=ndesaulniers@google.com \
    --cc=osalvador@suse.de \
    --cc=stable@vger.kernel.org \
    --cc=trix@redhat.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.