From: Nico Pache
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com,
 sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
Subject: [RFC 00/11] khugepaged: mTHP support
Date: Wed, 8 Jan 2025 16:31:16 -0700
Message-ID: <20250108233128.14484-1-npache@redhat.com>

The following series provides khugepaged and madvise collapse with the
capability to collapse regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then, during the PMD scan, we keep track of chunks of pages
(defined by MTHP_MIN_ORDER) that are fully utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction on
max_ptes_none is removed during the scan, to make sure we account for the
whole PMD range. max_ptes_none is mapped to a 0-100 range to determine how
full an mTHP order needs to be before collapsing it.

Some design choices to note:
 - Bitmap structures are allocated dynamically because on some
   architectures (like PowerPC) the value of MTHP_BITMAP_SIZE cannot be
   computed at compile time, leading to warnings.
 - The recursion is masked through a stack structure, i.e. it is done
   iteratively with an explicit stack.
 - MTHP_MIN_ORDER was added to compress the bitmap and ensure it is 64 bits
   on x86. This provides some optimization on the bitmap operations. If
   other arches/configs that have more than 512 PTEs per PMD want to
   compress their bitmap further, we can change this value per arch.

Patch 1-2:  Some refactoring to combine madvise_collapse and khugepaged
Patch 3:    A minor "fix"/optimization
Patch 4:    Refactor/rename hpage_collapse
Patch 5-7:  Generalize khugepaged functions for arbitrary orders
Patch 8-11: The mTHP patches

This series acts as an alternative to Dev Jain's approach [1].
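
To make the bitmap scan and the stack-based binary recursion described
above more concrete, here is a minimal userspace sketch. It is not the code
from the series. All names (plan_collapses, struct range, struct collapse,
the *_SKETCH constants) are hypothetical. It assumes 512 PTEs per PMD, a
minimum order of 3 so that the bitmap fits in 64 bits, that every order is
enabled, and that the threshold is passed in directly as a percentage
rather than being derived from max_ptes_none.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for illustration only, not taken from the patches. */
#define PMD_ORDER_SKETCH 9   /* 512 PTEs per PMD */
#define MIN_ORDER_SKETCH 3   /* one bitmap bit covers 8 PTEs, 64 bits total */
#define STACK_MAX 16         /* ample for the 7 levels between order 9 and 3 */

struct range { unsigned order; unsigned offset; };    /* offset in PTEs */
struct collapse { unsigned order; unsigned offset; }; /* chosen mTHP */

/* Count the set bits among nbits bits starting at first_bit. */
static unsigned count_bits(uint64_t bitmap, unsigned first_bit, unsigned nbits)
{
    uint64_t mask = (nbits >= 64) ? ~0ULL : ((1ULL << nbits) - 1) << first_bit;
    uint64_t v = bitmap & mask;
    unsigned n = 0;

    while (v) {          /* clear the lowest set bit each iteration */
        v &= v - 1;
        n++;
    }
    return n;
}

/*
 * Walk the bitmap top-down: take a range at the largest order whose
 * utilization meets threshold_percent, otherwise split it in half.
 * An explicit stack replaces recursive calls.
 * Returns the number of entries written to out.
 */
size_t plan_collapses(uint64_t bitmap, unsigned threshold_percent,
                      struct collapse *out, size_t out_cap)
{
    struct range stack[STACK_MAX];
    size_t top = 0, n = 0;

    stack[top++] = (struct range){ PMD_ORDER_SKETCH, 0 };
    while (top > 0) {
        struct range r = stack[--top];
        unsigned nbits = 1u << (r.order - MIN_ORDER_SKETCH);
        unsigned first_bit = r.offset >> MIN_ORDER_SKETCH;
        unsigned set = count_bits(bitmap, first_bit, nbits);

        if (set == 0)
            continue;    /* nothing utilized in this range */
        if (set * 100 >= threshold_percent * nbits) {
            if (n < out_cap)
                out[n++] = (struct collapse){ r.order, r.offset };
            continue;    /* full enough: collapse at this order */
        }
        if (r.order == MIN_ORDER_SKETCH)
            continue;    /* cannot split any further */

        /* Push the upper half first so the lower half is handled first. */
        assert(top + 2 <= STACK_MAX);
        stack[top++] = (struct range){ r.order - 1,
                                       r.offset + (1u << (r.order - 1)) };
        stack[top++] = (struct range){ r.order - 1, r.offset };
    }
    return n;
}
```

With a threshold of 100, a bitmap whose lower 32 bits are set yields a
single order-8 candidate at offset 0, while lowering the threshold to 50
makes the same bitmap qualify for a full PMD-order collapse. That is the
intended effect of turning max_ptes_none into a fullness scale.
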
The two series differ in a few ways:
  - My approach uses a bitmap to store the state of the linear scan_pmd to
    then determine potential mTHP batches. Dev incorporates his directly
    into the scan, and will try each available order.
  - Dev is attempting to optimize the locking, while my approach keeps the
    locking changes to a minimum. I believe his changes are not safe for
    uffd.
  - Dev's changes only work for khugepaged, not madvise_collapse (although
    I think that was by choice and it could easily support madvise).
  - Dev scales all khugepaged sysfs tunables by order, while I'm removing
    the restriction of max_ptes_none and converting it to a scale to
    determine a (m)THP threshold.
  - Dev turns on khugepaged if any order is available, while mine still
    only runs if PMDs are enabled. I like Dev's approach and will most
    likely do the same in my PATCH posting.
  - mTHPs need their ref count updated to 1<