Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race


From: "Uschakow, Stanislav" <suschako@amazon.de>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: "trix@redhat.com" <trix@redhat.com>,
	"ndesaulniers@google.com" <ndesaulniers@google.com>,
	"nathan@kernel.org" <nathan@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>,
	"mike.kravetz@oracle.com" <mike.kravetz@oracle.com>,
	"jannh@google.com" <jannh@google.com>,
	"lorenzo.stoakes@oracle.com" <lorenzo.stoakes@oracle.com>,
	"liam.howlett@oracle.com" <liam.howlett@oracle.com>,
	"muchun.song@linux.dev" <muchun.song@linux.dev>,
	"osalvador@suse.de" <osalvador@suse.de>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"jannh@google.com" <jannh@google.com>
Subject: Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race
Date: Fri, 29 Aug 2025 14:30:46 +0000
Message-ID: <4d3878531c76479d9f8ca9789dc6485d@amazon.de>

Hello.

We have observed a large latency increase in `fork()`-heavy workloads after picking up the fix for CVE-2025-38085, which corresponds to commit `1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race`. On large machines with 1.5TB of memory and 196 cores, a process that mmaps 1.2TB of shared hugetlb memory and then forks itself dozens or hundreds of times sees its execution time increase by roughly a factor of 4. The reproducer is at the end of the email.

Comparing a kernel without this patch against a kernel with this patch applied, we see the following execution times when spawning 1000 children:


Patched kernel:

$ time make stress
...
real    0m11.275s
user    0m0.177s
sys     0m23.905s

Original kernel:

$ time make stress
...
real    0m2.475s
user    0m1.398s
sys     0m2.501s


The patch in question: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=1013af4f585fccc4d3e5c5824d174de2257f7d6d


My observation/assumption is:

* each child touches 100 random pages and exits
* on each exit, `huge_pmd_unshare()` is called
* each call to `huge_pmd_unshare()` synchronizes all threads using `tlb_remove_table_sync_one()`, leading to the regression



I'm happy to provide more information.




Thank you
Stanislav Uschakow








=== Reproducer ===

Setup:


#!/bin/bash
echo "Setting up hugepages for reproduction..."

# hugepages (1.2TB / 2MB = 614400 pages)
REQUIRED_PAGES=614400

# Check current hugepage allocation
CURRENT_PAGES=$(cat /proc/sys/vm/nr_hugepages)
echo "Current hugepages: $CURRENT_PAGES"

if [ "$CURRENT_PAGES" -lt "$REQUIRED_PAGES" ]; then
    echo "Allocating $REQUIRED_PAGES hugepages..."
    echo $REQUIRED_PAGES | sudo tee /proc/sys/vm/nr_hugepages

    ALLOCATED=$(cat /proc/sys/vm/nr_hugepages)
    echo "Allocated hugepages: $ALLOCATED"
    
    if [ "$ALLOCATED" -lt "$REQUIRED_PAGES" ]; then
        echo "Warning: Could not allocate all required hugepages"
        echo "Available: $ALLOCATED, Required: $REQUIRED_PAGES"
    fi
fi

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

echo -e "\nHugepage information:"
cat /proc/meminfo | grep -i huge

echo -e "\nSetup complete. You can now run the reproduction test."



Makefile:


CC = gcc
CFLAGS = -O2 -Wall
TARGET = hugepage_repro
SOURCE = hugepage_repro.c

$(TARGET): $(SOURCE)
	$(CC) $(CFLAGS) -o $(TARGET) $(SOURCE)

clean:
	rm -f $(TARGET)

setup:
	chmod +x setup_hugepages.sh
	./setup_hugepages.sh

test: $(TARGET)
	./$(TARGET) 20 3

stress: $(TARGET)
	./$(TARGET) 1000 1

.PHONY: clean setup test stress



hugepage_repro.c:


#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <stdio.h>

#define HUGEPAGE_SIZE (2 * 1024 * 1024) // 2MB
#define TOTAL_SIZE (1200ULL * 1024 * 1024 * 1024) // 1.2TB
#define NUM_HUGEPAGES (TOTAL_SIZE / HUGEPAGE_SIZE)

void* create_hugepage_mapping() {
    void* addr = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap hugepages failed");
        exit(1);
    }
    return addr;
}

void touch_random_pages(void* addr, int num_touches) {
    char* base = (char*)addr;
    for (int i = 0; i < num_touches; ++i) {
        size_t offset = (rand() % NUM_HUGEPAGES) * HUGEPAGE_SIZE;
        volatile char val = base[offset];
        (void)val;
    }
}

void child_process(void* shared_mem, int child_id) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    touch_random_pages(shared_mem, 100);
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    long duration = (end.tv_sec - start.tv_sec) * 1000000 + 
                   (end.tv_nsec - start.tv_nsec) / 1000;
    
    printf("Child %d completed in %ld μs\n", child_id, duration);
}

int main(int argc, char* argv[]) {
    int num_processes = argc > 1 ? atoi(argv[1]) : 50;
    int iterations = argc > 2 ? atoi(argv[2]) : 5;
    
    printf("Creating %lluGB hugepage mapping...\n", TOTAL_SIZE / (1024*1024*1024));
    void* shared_mem = create_hugepage_mapping();
    
    for (int iter = 0; iter < iterations; ++iter) {
        printf("\nIteration %d: Forking %d processes\n", iter + 1, num_processes);
        
        pid_t children[num_processes];
        struct timespec iter_start, iter_end;
        clock_gettime(CLOCK_MONOTONIC, &iter_start);
        
        for (int i = 0; i < num_processes; ++i) {
            pid_t pid = fork();
            if (pid == 0) {
                child_process(shared_mem, i);
                exit(0);
            }
            if (pid < 0)
                perror("fork failed");
            // Store -1 on failure so the wait loop below can skip this slot.
            children[i] = pid;
        }
        
        for (int i = 0; i < num_processes; ++i) {
            if (children[i] > 0)
                waitpid(children[i], NULL, 0);
        }
        
        clock_gettime(CLOCK_MONOTONIC, &iter_end);
        long iter_duration = (iter_end.tv_sec - iter_start.tv_sec) * 1000 + 
                            (iter_end.tv_nsec - iter_start.tv_nsec) / 1000000;
        printf("Iteration completed in %ld ms\n", iter_duration);
    }
    
    munmap(shared_mem, TOTAL_SIZE);
    return 0;
}




Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Managing directors: Christian Schlaeger, Jonathan Weiss
Registered at the Charlottenburg district court under HRB 257764 B
Registered office: Berlin
VAT ID: DE 365 538 597

Thread overview: 4+ messages
2025-08-29 14:30 Uschakow, Stanislav [this message]
2025-09-01 10:58 ` Bug: Performance regression in 1013af4f585f: mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race Jann Horn
2025-09-01 11:26   ` David Hildenbrand
2025-09-04 12:39     ` Uschakow, Stanislav
