From mboxrd@z Thu Jan  1 00:00:00 1970
From: Michał Mirosław
Date: Wed, 7 Jun 2023 16:52:53 +0200
Subject: Re: [PATCH v17 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
To: Muhammad Usama Anjum
Cc: Peter Xu, David Hildenbrand, Andrew Morton, Andrei Vagin,
 Danylo Mocherniuk, Paul Gofman, Cyrill Gorcunov, Mike Rapoport, Nadav Amit,
 Alexander Viro, Shuah Khan, Christian Brauner, Yang Shi, Vlastimil Babka,
 "Liam R. Howlett", Yun Zhou, Suren Baghdasaryan, Alex Sierra,
 Matthew Wilcox, Pasha Tatashin, Axel Rasmussen, "Gustavo A. R. Silva",
 Dan Williams, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH,
 kernel@collabora.com
In-Reply-To: <20230606060822.1065182-3-usama.anjum@collabora.com>
References: <20230606060822.1065182-1-usama.anjum@collabora.com>
 <20230606060822.1065182-3-usama.anjum@collabora.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

On Tue, 6 Jun 2023 at 08:08, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
> (PAGE_IS_SWAPPED).
> - Find pages which have been written-to and/or write protect the pages
> (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP)
>
> This IOCTL can be extended to get information about more PTE bits.
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
> +{
> +        return (pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> +               (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd));
> +}
[...]
> +#ifdef CONFIG_HUGETLB_PAGE
> +static inline bool is_huge_pte_uffd_wp(pte_t pte)
> +{
> +        return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
> +                pte_swp_uffd_wp_any(pte));

Nit: please remove the outer parentheses (it is already done for similar
functions above).

> +}

> +static inline bool pagemap_scan_check_page_written(struct pagemap_scan_private *p)
> +{
> +        return (p->required_mask | p->anyof_mask | p->excluded_mask) &
> +               PAGE_IS_WRITTEN;
> +}

This could be precalculated and put as a flag into pagemap_scan_private -
it is a kernel-private structure and there are a few spare bits in `flags`
if you'd prefer not to add an explicit boolean.
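Just to illustrate what I mean (untested, and the flag name below is only a
placeholder, not something from the patch): the mask check could be done once
in do_pagemap_scan() and cached in one of the spare `flags` bits, so the walk
callbacks only test a single bit:

        /* kernel-internal bit, never accepted from or returned to userspace */
        #define PM_SCAN_CHECK_WRITTEN        BIT(31)

        if ((arg.required_mask | arg.anyof_mask | arg.excluded_mask) &
            PAGE_IS_WRITTEN)
                p.flags |= PM_SCAN_CHECK_WRITTEN;

and the helper above would then reduce to `p->flags & PM_SCAN_CHECK_WRITTEN`.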
[...]

> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +                               struct pagemap_scan_private *p,
> +                               unsigned long addr, unsigned int n_pages)
> +{
> +        unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
> +        struct page_region *cur = &p->cur;
> +
> +        if (!n_pages)
> +                return -EINVAL;
> +
> +        if ((p->required_mask & bitmap) != p->required_mask)
> +                return 0;
> +        if (p->anyof_mask && !(p->anyof_mask & bitmap))
> +                return 0;
> +        if (p->excluded_mask & bitmap)
> +                return 0;
> +
> +        bitmap &= p->return_mask;
> +        if (!bitmap)
> +                return 0;
> +
> +        if (cur->bitmap == bitmap &&
> +            cur->start + cur->len * PAGE_SIZE == addr) {
> +                cur->len += n_pages;
> +                p->found_pages += n_pages;
> +        } else {
> +                /*
> +                 * All data is copied to cur first. When more data is found, we
> +                 * push cur to vec and copy new data to cur. The vec_index
> +                 * represents the current index of vec array. We add 1 to the
> +                 * vec_index while performing checks to account for data in cur.
> +                 */
> +                if (cur->len && (p->vec_index + 1) >= p->vec_len)
> +                        return -ENOSPC;
> +
> +                if (cur->len) {
> +                        memcpy(&p->vec[p->vec_index], cur, sizeof(*p->vec));
> +                        p->vec_index++;
> +                }
> +
> +                cur->start = addr;
> +                cur->len = n_pages;
> +                cur->bitmap = bitmap;
> +                p->found_pages += n_pages;
> +        }
> +
> +        if (p->max_pages && (p->found_pages == p->max_pages))
> +                return PM_SCAN_FOUND_MAX_PAGES;
> +
> +        return 0;
> +}
> +
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                  unsigned long end, struct mm_walk *walk)
> +{
> +        struct pagemap_scan_private *p = walk->private;
> +        struct vm_area_struct *vma = walk->vma;
> +        unsigned long addr = end;
> +        pte_t *pte, *orig_pte;
> +        spinlock_t *ptl;
> +        bool is_written;
> +        int ret = 0;
> +
> +        arch_enter_lazy_mmu_mode();
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +        ptl = pmd_trans_huge_lock(pmd, vma);
> +        if (ptl) {
> +                unsigned long n_pages = (end - start)/PAGE_SIZE;
> +
> +                if (p->max_pages && n_pages > p->max_pages - p->found_pages)
> +                        n_pages = p->max_pages - p->found_pages;

Since p->found_pages is only ever increased in `pagemap_scan_output()` and
that function is only called for GET or GET+WP operations, maybe the logic
could be folded into pagemap_scan_output() to avoid duplication? In this
function the calculation is used only when a WP op is done, to split the
huge page if the n_pages limit would be hit, but with plain WP (without GET)
it doesn't make sense to apply the limit.

(pagemap_scan_output() is trivial enough so I think it could be pulled
inside the spinlocked region.)

> +
> +                is_written = !is_pmd_uffd_wp(*pmd);
> +
> +                /*
> +                 * Break huge page into small pages if the WP operation need to
> +                 * be performed is on a portion of the huge page.
> +                 */
> +                if (is_written && IS_PM_SCAN_WP(p->flags) &&
> +                    n_pages < HPAGE_SIZE/PAGE_SIZE) {
> +                        spin_unlock(ptl);
> +
> +                        split_huge_pmd(vma, pmd, start);
> +                        goto process_smaller_pages;
> +                }
> +
> +                if (IS_PM_SCAN_GET(p->flags))
> +                        ret = pagemap_scan_output(is_written, vma->vm_file,
> +                                                  pmd_present(*pmd),
> +                                                  is_swap_pmd(*pmd),
> +                                                  p, start, n_pages);
> +
> +                if (ret >= 0 && is_written && IS_PM_SCAN_WP(p->flags))
> +                        make_uffd_wp_pmd(vma, addr, pmd);
> +
> +                if (IS_PM_SCAN_WP(p->flags))

Why is `is_written` not checked here? If is_written is false, then the WP op
should be a no-op and so won't need a TLB flush, will it? [Same for the PTE
case below.]

> +                        flush_tlb_range(vma, start, end);
> +
[...]
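To make the previous point concrete, I would have expected something like
this (untested) for the THP case, so that a PMD which is already uffd-wp
protected is neither touched nor flushed:

                if (ret >= 0 && is_written && IS_PM_SCAN_WP(p->flags)) {
                        make_uffd_wp_pmd(vma, addr, pmd);
                        flush_tlb_range(vma, start, end);
                }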
> +        if (IS_PM_SCAN_WP(p->flags))
> +                flush_tlb_range(vma, start, addr);
> +
> +        pte_unmap_unlock(orig_pte, ptl);
> +        arch_leave_lazy_mmu_mode();
> +
> +        cond_resched();
> +        return ret;
> +}
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
> +                                      unsigned long start, unsigned long end,
> +                                      struct mm_walk *walk)
> +{
> +        unsigned long n_pages = (end - start)/PAGE_SIZE;
> +        struct pagemap_scan_private *p = walk->private;
> +        struct vm_area_struct *vma = walk->vma;
> +        struct hstate *h = hstate_vma(vma);
> +        spinlock_t *ptl;
> +        bool is_written;
> +        int ret = 0;
> +        pte_t pte;
> +
> +        if (p->max_pages && n_pages > p->max_pages - p->found_pages)
> +                n_pages = p->max_pages - p->found_pages;
> +
> +        if (IS_PM_SCAN_WP(p->flags)) {
> +                i_mmap_lock_write(vma->vm_file->f_mapping);
> +                ptl = huge_pte_lock(h, vma->vm_mm, ptep);
> +        }
> +
> +        pte = huge_ptep_get(ptep);
> +        is_written = !is_huge_pte_uffd_wp(pte);
> +
> +        /*
> +         * Partial hugetlb page clear isn't supported
> +         */
> +        if (is_written && IS_PM_SCAN_WP(p->flags) &&
> +            n_pages < HPAGE_SIZE/PAGE_SIZE) {
> +                ret = -EPERM;

Shouldn't this be -ENOSPC, conveying that the operation would overflow the
n_pages limit?

> +                goto unlock_and_return;
> +        }
> +
> +        if (IS_PM_SCAN_GET(p->flags)) {
> +                ret = pagemap_scan_output(is_written, vma->vm_file,
> +                                          pte_present(pte), is_swap_pte(pte),
> +                                          p, start, n_pages);
> +                if (ret < 0)
> +                        goto unlock_and_return;
> +        }
> +
> +        if (is_written && IS_PM_SCAN_WP(p->flags)) {

Oh, this case does check `is_written` before flushing the TLB, contrary to
what the cases above do.

> +                make_uffd_wp_huge_pte(vma, start, ptep, pte);
> +                flush_hugetlb_tlb_range(vma, start, end);
> +        }
> +
> +unlock_and_return:
> +        if (IS_PM_SCAN_WP(p->flags)) {
> +                spin_unlock(ptl);
> +                i_mmap_unlock_write(vma->vm_file->f_mapping);
> +        }
> +
> +        return ret;
> +}
> +#else
> +#define pagemap_scan_hugetlb_entry NULL
> +#endif
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
> +                                 int depth, struct mm_walk *walk)
> +{
> +        unsigned long n_pages = (end - addr)/PAGE_SIZE;
> +        struct pagemap_scan_private *p = walk->private;
> +        struct vm_area_struct *vma = walk->vma;
> +        int ret = 0;
> +
> +        if (!vma || !IS_PM_SCAN_GET(p->flags))
> +                return 0;
> +
> +        if (p->max_pages && n_pages > p->max_pages - p->found_pages)
> +                n_pages = p->max_pages - p->found_pages;

Nit: If the page flags don't match (so nothing would be output), the limit
would not be hit and the calculation is unnecessary. But if it was done in
pagemap_scan_output() instead, after all the flag checks...

> +        ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> +                                  n_pages);
> +
> +        return ret;
> +}
[...]
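For the clamping discussed above (here and in the THP case), I'd imagine
pagemap_scan_output() doing it itself once the flag checks have passed,
roughly (untested):

        /* in pagemap_scan_output(), after the required/anyof/excluded checks: */
        if (p->max_pages && n_pages > p->max_pages - p->found_pages)
                n_pages = p->max_pages - p->found_pages;

Then the callers wouldn't need to repeat the max_pages arithmetic; the THP
path would still need the value to decide about splitting, but only when WP
is combined with GET.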
> +static long do_pagemap_scan(struct mm_struct *mm,
> +                            struct pm_scan_arg __user *uarg)
> +{
> +        unsigned long start, end, walk_start, walk_end;
> +        unsigned long empty_slots, vec_index = 0;
> +        struct mmu_notifier_range range;
> +        struct page_region __user *vec;
> +        struct pagemap_scan_private p;
> +        struct pm_scan_arg arg;
> +        int ret = 0;
> +
> +        if (copy_from_user(&arg, uarg, sizeof(arg)))
> +                return -EFAULT;
> +
> +        start = untagged_addr((unsigned long)arg.start);
> +        vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);
> +
> +        ret = pagemap_scan_args_valid(&arg, start, vec);
> +        if (ret)
> +                return ret;
> +
> +        end = start + arg.len;
> +        p.max_pages = arg.max_pages;
> +        p.found_pages = 0;
> +        p.flags = arg.flags;
> +        p.required_mask = arg.required_mask;
> +        p.anyof_mask = arg.anyof_mask;
> +        p.excluded_mask = arg.excluded_mask;
> +        p.return_mask = arg.return_mask;
> +        p.cur.start = p.cur.len = p.cur.bitmap = 0;
> +        p.vec = NULL;
> +        p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;

If p.vec_len did not count the entry held in `cur` (IOW: vec_len =
WALK_SIZE - 1), then pagemap_scan_output() wouldn't need the big comment
about adding or subtracting 1 when checking for overflow. The output vector
needs to have space for at least one entry to make GET useful.

Maybe `cur` could be renamed or annotated to express that it always holds
the last entry?

> +
> +        /*
> +         * Allocate smaller buffer to get output from inside the page walk
> +         * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +         * we want to return output to user in compact form where no two
> +         * consecutive regions should be continuous and have the same flags.
> +         * So store the latest element in p.cur between different walks and
> +         * store the p.cur at the end of the walk to the user buffer.
> +         */
> +        if (IS_PM_SCAN_GET(p.flags)) {
> +                p.vec = kmalloc_array(p.vec_len, sizeof(*p.vec), GFP_KERNEL);
> +                if (!p.vec)
> +                        return -ENOMEM;
> +        }
> +
> +        if (IS_PM_SCAN_WP(p.flags)) {
> +                mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> +                                        mm, start, end);
> +                mmu_notifier_invalidate_range_start(&range);
> +        }
> +
> +        walk_start = walk_end = start;
> +        while (walk_end < end && !ret) {
> +                if (IS_PM_SCAN_GET(p.flags)) {
> +                        p.vec_index = 0;
> +
> +                        empty_slots = arg.vec_len - vec_index;

Can `empty_slots` be zero here? I don't see anything prohibiting this case.

> +                        p.vec_len = min(p.vec_len, empty_slots);

(If not counting `cur`, it would be min(p.vec_len, empty_slots - 1).)
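Putting those two remarks together, the per-chunk setup might look something
like this (untested, and the interaction with the final copy of p.cur below
would still need a closer look):

                if (IS_PM_SCAN_GET(p.flags)) {
                        p.vec_index = 0;

                        empty_slots = arg.vec_len - vec_index;
                        if (!empty_slots)
                                break;  /* user buffer already full */
                        p.vec_len = min(p.vec_len, empty_slots - 1);
                }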
> +                }
> +
> +                walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +                if (walk_end > end)
> +                        walk_end = end;
> +
> +                ret = mmap_read_lock_killable(mm);
> +                if (ret)
> +                        goto free_data;
> +                ret = walk_page_range(mm, walk_start, walk_end,
> +                                      &pagemap_scan_ops, &p);
> +                mmap_read_unlock(mm);
> +
> +                if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
> +                        goto free_data;
> +
> +                walk_start = walk_end;
> +                if (IS_PM_SCAN_GET(p.flags) && p.vec_index) {
> +                        if (copy_to_user(&vec[vec_index], p.vec,
> +                                         p.vec_index * sizeof(*p.vec))) {
> +                                /*
> +                                 * Return error even though the OP succeeded
> +                                 */
> +                                ret = -EFAULT;
> +                                goto free_data;
> +                        }
> +                        vec_index += p.vec_index;
> +                }
> +        }
> +
> +        if (IS_PM_SCAN_GET(p.flags) && p.cur.len) {

Nit: p.cur.len can be non-zero only if we do a GET (or GET+WP) operation.

> +                if (copy_to_user(&vec[vec_index], &p.cur, sizeof(*p.vec))) {

Nit: sizeof(p.cur); (even though it is the same type).

> +                        ret = -EFAULT;
> +                        goto free_data;
> +                }
> +                vec_index++;
> +        }
> +
> +        ret = vec_index;
> +
> +free_data:
> +        if (IS_PM_SCAN_WP(p.flags))
> +                mmu_notifier_invalidate_range_end(&range);
> +
> +        kfree(p.vec);
> +        return ret;
> +}
> +
> +static long do_pagemap_cmd(struct file *file, unsigned int cmd,
> +                           unsigned long arg)
> +{
> +        struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;

The cast should be done in do_pagemap_scan() instead: if another `cmd` is
added later, it might use a different argument type.

> +        struct mm_struct *mm = file->private_data;
> +
> +        switch (cmd) {
> +        case PAGEMAP_SCAN:
> +                return do_pagemap_scan(mm, uarg);
> +
> +        default:
> +                return -EINVAL;
> +        }
> +}

Best Regards
Michał Mirosław