Date: Fri, 24 Jun 2022 17:36:51 +0000
In-Reply-To: <20220624173656.2033256-1-jthoughton@google.com>
Message-Id: <20220624173656.2033256-22-jthoughton@google.com>
Mime-Version: 1.0
References: <20220624173656.2033256-1-jthoughton@google.com>
Subject: [RFC PATCH 21/26] hugetlb: add hugetlb_collapse
From: James Houghton
To: Mike Kravetz , Muchun Song , Peter Xu
Cc: David Hildenbrand , David Rientjes , Axel
 Rasmussen , Mina Almasry , Jue Wang , Manish Mishra ,
 "Dr. David Alan Gilbert" , linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, James Houghton
Content-Type: text/plain; charset="UTF-8"

This implements MADV_COLLAPSE for HugeTLB pages, a necessary extension
to the UFFDIO_CONTINUE changes. When userspace finishes mapping an
entire hugepage with UFFDIO_CONTINUE, the kernel has no mechanism to
automatically collapse the page table to map the whole hugepage
normally. We therefore require userspace to tell us when it wants the
hugepages collapsed; it does this with MADV_COLLAPSE.

If userspace has mapped only part of a hugepage with UFFDIO_CONTINUE,
hugetlb_collapse will cause the requested range to be mapped as if it
had already been UFFDIO_CONTINUE'd.
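For context, the intended userspace flow looks roughly like the sketch
below. This is not part of the patch: uffd creation and UFFDIO_REGISTER
with minor-fault handling are omitted, error handling is dropped, and
the 2M hugepage size, the MADV_COLLAPSE fallback value, and the helper
name are illustrative stand-ins.

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* illustrative; not yet in installed headers */
#endif

#define HPAGE_SIZE (2UL << 20)	/* assumed 2M hugepage */
#define BASE_PAGE_SIZE 4096UL

static void continue_then_collapse(int uffd, char *addr)
{
	unsigned long off;

	/* Resolve minor faults one 4K base page at a time. */
	for (off = 0; off < HPAGE_SIZE; off += BASE_PAGE_SIZE) {
		struct uffdio_continue cont = {
			.range.start = (unsigned long)addr + off,
			.range.len = BASE_PAGE_SIZE,
		};

		ioctl(uffd, UFFDIO_CONTINUE, &cont);
	}

	/*
	 * The range is now mapped at 4K granularity; ask the kernel to
	 * collapse it back into an optimal hugepage mapping.
	 */
	madvise(addr, HPAGE_SIZE, MADV_COLLAPSE);
}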
Signed-off-by: James Houghton
---
 include/linux/hugetlb.h |  7 ++++
 mm/hugetlb.c            | 88 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c207b1ac6195..438057dc3b75 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1197,6 +1197,8 @@ int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 				    unsigned int desired_sz,
 				    enum split_mode mode,
 				    bool write_locked);
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end);
 #else
 static inline bool hugetlb_hgm_enabled(struct vm_area_struct *vma)
 {
@@ -1221,6 +1223,11 @@ static inline int huge_pte_alloc_high_granularity(struct hugetlb_pte *hpte,
 {
 	return -EINVAL;
 }
+static inline int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end)
+{
+	return -EINVAL;
+}
 #endif
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 09fa57599233..70bb3a1342d9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7280,6 +7280,94 @@ int hugetlb_alloc_largest_pte(struct hugetlb_pte *hpte, struct mm_struct *mm,
 	return -EINVAL;
 }
 
+/*
+ * Collapse the address range from @start to @end to be mapped optimally.
+ *
+ * This is only valid for shared mappings. The main use case for this function
+ * is following UFFDIO_CONTINUE. If a user UFFDIO_CONTINUEs an entire hugepage
+ * by calling UFFDIO_CONTINUE once for each 4K region, the kernel doesn't know
+ * to collapse the mapping after the final UFFDIO_CONTINUE. Instead, we leave
+ * it up to userspace to tell us to do so, via MADV_COLLAPSE.
+ *
+ * Any holes in the mapping will be filled. If there is no page in the
+ * pagecache for a region we're collapsing, the PTEs will be cleared.
+ */
+int hugetlb_collapse(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long start, unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+	struct hstate *tmp_h;
+	unsigned int shift;
+	unsigned long curr = start;
+	int ret = 0;
+	struct page *hpage, *subpage;
+	pgoff_t idx;
+	bool writable = vma->vm_flags & VM_WRITE;
+	bool shared = vma->vm_flags & VM_SHARED;
+	pte_t entry;
+
+	/*
+	 * This is only supported for shared VMAs, because we need to look up
+	 * the page to use for any PTEs we end up creating.
+	 */
+	if (!shared)
+		return -EINVAL;
+
+	i_mmap_assert_write_locked(mapping);
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
+				start, end);
+	mmu_notifier_invalidate_range_start(&range);
+	tlb_gather_mmu(&tlb, mm);
+
+	while (curr < end) {
+		for_each_hgm_shift(h, tmp_h, shift) {
+			unsigned long sz = 1UL << shift;
+			struct hugetlb_pte hpte;
+
+			if (!IS_ALIGNED(curr, sz) || curr + sz > end)
+				continue;
+
+			hugetlb_pte_init(&hpte);
+			ret = hugetlb_walk_to(mm, &hpte, curr, sz,
+					      /*stop_at_none=*/false);
+			if (ret)
+				goto out;
+			if (hugetlb_pte_size(&hpte) >= sz)
+				goto hpte_finished;
+
+			idx = vma_hugecache_offset(h, vma, curr);
+			hpage = find_lock_page(mapping, idx);
+			hugetlb_free_range(&tlb, &hpte, curr,
+					   curr + hugetlb_pte_size(&hpte));
+			if (!hpage) {
+				hugetlb_pte_clear(mm, &hpte, curr);
+				goto hpte_finished;
+			}
+
+			subpage = hugetlb_find_subpage(h, hpage, curr);
+			entry = make_huge_pte_with_shift(vma, subpage,
+							 writable, shift);
+			set_huge_pte_at(mm, curr, hpte.ptep, entry);
+			unlock_page(hpage);
+hpte_finished:
+			curr += hugetlb_pte_size(&hpte);
+			goto next;
+		}
+		ret = -EINVAL;
+		goto out;
+next:
+		continue;
+	}
+out:
+	tlb_finish_mmu(&tlb);
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * Given a particular address, split the HugeTLB PTE that currently maps it
  * so that, for the given address, the PTE that maps it is `desired_shift`.
-- 
2.37.0.rc0.161.g10f37bed90-goog
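A note on locking, since it is not visible in this patch by itself:
hugetlb_collapse() asserts that i_mmap_rwsem is held for writing, so
whatever wires MADV_COLLAPSE up to it must take that lock around the
call. A hypothetical caller (the name madvise_hugetlb_collapse is made
up here, not part of this series) might look like:

static int madvise_hugetlb_collapse(struct mm_struct *mm,
				    struct vm_area_struct *vma,
				    unsigned long start, unsigned long end)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	int ret;

	/* hugetlb_collapse() requires the i_mmap write lock. */
	i_mmap_lock_write(mapping);
	ret = hugetlb_collapse(mm, vma, start, end);
	i_mmap_unlock_write(mapping);
	return ret;
}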