Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Khalid Aziz <khalid.aziz@oracle.com>
To: Andy Lutomirski <luto@kernel.org>, Barry Song <21cnbao@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Matthew Wilcox <willy@infradead.org>,
	Aneesh Kumar <aneesh.kumar@linux.ibm.com>,
	Arnd Bergmann <arnd@arndb.de>, Jonathan Corbet <corbet@lwn.net>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	ebiederm@xmission.com, hagen@jauu.net, jack@suse.cz,
	Kees Cook <keescook@chromium.org>,
	kirill@shutemov.name, kucharsk@gmail.com, linkinjeon@kernel.org,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	longpeng2@huawei.com, markhemm@googlemail.com,
	Peter Collingbourne <pcc@google.com>,
	Mike Rapoport <rppt@kernel.org>,
	sieberf@amazon.com, sjpark@amazon.de,
	Suren Baghdasaryan <surenb@google.com>,
	tst@schoebel-theuer.de, Iurii Zaikin <yzaikin@google.com>
Subject: Re: [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs
Date: Wed, 6 Jul 2022 14:33:09 -0600	[thread overview]
Message-ID: <a8fce124-149f-ee46-6b23-a1805a32ebc1@oracle.com> (raw)
In-Reply-To: <48e40b61-f506-72a1-0839-08bc9db483cc@kernel.org>

On 7/3/22 14:54, Andy Lutomirski wrote:
> On 6/29/22 10:38, Khalid Aziz wrote:
>> On 5/30/22 22:24, Barry Song wrote:
>>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> mshare'd PTEs should not be removed when a task exits. These PTEs
>>>> are removed when the last task sharing the PTEs exits. Add a check
>>>> for shared PTEs and skip them.
>>>>
>>>> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
>>>> ---
>>>>   mm/memory.c | 22 +++++++++++++++++++---
>>>>   1 file changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index c77c0d643ea8..e7c5bc6f8836 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -419,16 +419,25 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>                  } else {
>>>>                          /*
>>>>                           * Optimization: gather nearby vmas into one call down
>>>> +                        * as long as they all belong to the same mm (that
>>>> +                        * may not be the case if a vma is part of mshare'd
>>>> +                        * range
>>>>                           */
>>>>                          while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>>> -                              && !is_vm_hugetlb_page(next)) {
>>>> +                              && !is_vm_hugetlb_page(next)
>>>> +                              && vma->vm_mm == tlb->mm) {
>>>>                                  vma = next;
>>>>                                  next = vma->vm_next;
>>>>                                  unlink_anon_vmas(vma);
>>>>                                  unlink_file_vma(vma);
>>>>                          }
>>>> -                       free_pgd_range(tlb, addr, vma->vm_end,
>>>> -                               floor, next ? next->vm_start : ceiling);
>>>> +                       /*
>>>> +                        * Free pgd only if pgd is not allocated for an
>>>> +                        * mshare'd range
>>>> +                        */
>>>> +                       if (vma->vm_mm == tlb->mm)
>>>> +                               free_pgd_range(tlb, addr, vma->vm_end,
>>>> +                                       floor, next ? next->vm_start : ceiling);
>>>>                  }
>>>>                  vma = next;
>>>>          }
>>>> @@ -1551,6 +1560,13 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>>          pgd_t *pgd;
>>>>          unsigned long next;
>>>>
>>>> +       /*
>>>> +        * If this is an mshare'd page, do not unmap it since it might
>>>> +        * still be in use.
>>>> +        */
>>>> +       if (vma->vm_mm != tlb->mm)
>>>> +               return;
>>>> +
>>>
>>> expect unmap, have you ever tested reverse mapping in vmscan, especially
>>> folio_referenced()? are all vmas in those processes sharing page table still
>>> in the rmap of the shared page?
>>> without shared PTE, if 1000 processes share one page, we are reading 1000
>>> PTEs, with it, are we reading just one? or are we reading the same PTE
>>> 1000 times? Have you tested it?
>>>
>>
>> We are treating mshared region same as threads sharing address space. There is one PTE that is being used by all 
>> processes and the VMA maintained in the separate mshare mm struct that also holds the shared PTE is the one that gets 
>> added to rmap. This is a different model with mshare in that it adds an mm struct that is separate from the mm structs 
>> of the processes that refer to the vma and pte in mshare mm struct. Do you see issues with rmap in this model?
> 
> I think this patch is actually the most interesting bit of the series by far.  Most of the rest is defining an API 
> (which is important!) and figuring out semantics.  This patch changes something rather fundamental about how user 
> address spaces work: what vmas live in them.  So let's figure out its effects.
> 
> I admit I'm rather puzzled about what vm_mm is for in the first place. In current kernels (without your patch), I think 
> it's a pretty hard requirement for vm_mm to equal the mm for all vmas in an mm.  After a quick and incomplete survey, 
> vm_mm seems to be mostly used as a somewhat lazy way to find the mm.  Let's see:
> 
> file_operations->mmap doesn't receive an mm_struct.  Instead it infers the mm from vm_mm.  (Why?  I don't know.)
> 
> Some walk_page_range users seem to dig the mm out of vm_mm instead of mm_walk.
> 
> Some manual address space walkers start with an mm, don't bother passing it around, and dig it back out of of vm_mm.  
> For example, unuse_vma() and all its helpers.
> 
> The only real exception I've found so far is rmap: AFAICS (on quick inspection -- I could be wrong), rmap can map from a 
> folio to a bunch of vmas, and the vmas' mms are not stored separately but instead determined by vm_mm.
> 
> 
> 
> Your patch makes me quite nervous.  You're potentially breaking any kernel code path that assumes that mms only contain 
> vmas that have vm_mm == mm.  And you're potentially causing rmap to be quite confused.  I think that if you're going to 
> take this approach, you need to clearly define the new semantics of vm_mm and audit or clean up every user of vm_mm in 
> the tree.  This may be nontrivial (especially rmap), although a cleanup of everything else to stop using vm_mm might be 
> valuable.
> 
> But I'm wondering if it would be better to attack this from a different direction.  Right now, there's a hardcoded 
> assumption that an mm owns every page table it references.  That's really the thing you're changing.  To me, it seems 
> that a magical vma that shares page tables should still be a vma that belongs to its mm_struct -- munmap() and 
> potentialy other m***() operations should all work on it, existing find_vma() users should work, etc.
> 
> So maybe instead there should be new behavior (by a VM_ flag or otherwise) that indicates that a vma owns its PTEs.  It 
> could even be a vm_operation, although if anyone ever wants regular file mappings to share PTEs, then a vm_operation 
> doesn't really make sense.
> 

Hi Andy,

You are absolutely right. Dragons lie on the path to changing the sense of vm_mm. Dave H pointed out potential issues 
with rb_tree as well. As I have looked at more code, it seems breaking the assumption that vm_mm always points to 
containing mm struct opens up the possibility of code breaking in strange ways in odd places. As a result, I have 
changed the code in v2 patches to not break this assumption about vm_mm and instead I rewrote the code to use the vm 
flag VM_SHARED_PT and vm_private_data everywhere it needed to find the mshare mm struct. All the vmas belonging to the 
new mm struct for mshare region also have their vm_mm pointing to the mshare mm_struct and that keeps all vma operations 
working normally.

Thanks,
Khalid

next prev parent reply	other threads:[~2022-07-06 20:34 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-11 16:05 [PATCH v1 00/14] Add support for shared PTEs across processes Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 01/14] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 02/14] mm/mshare: Add msharefs filesystem Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 03/14] mm/mshare: Add read for msharefs Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 04/14] mm/mshare: implement mshare_unlink syscall Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 05/14] mm/mshare: Add locking to msharefs syscalls Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 06/14] mm/mshare: Check for mounted filesystem Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 07/14] mm/mshare: Add vm flag for shared PTE Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 08/14] mm/mshare: Add basic page table sharing using mshare Khalid Aziz
2022-04-11 18:48   ` Dave Hansen
2022-04-11 20:39     ` Khalid Aziz
2022-05-30 11:11   ` Barry Song
2022-06-28 20:11     ` Khalid Aziz
2022-05-31  3:46   ` Barry Song
2022-06-28 20:16     ` Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 09/14] mm/mshare: Do not free PTEs for mshare'd PTEs Khalid Aziz
2022-05-31  4:24   ` Barry Song
2022-06-29 17:38     ` Khalid Aziz
2022-07-03 20:54       ` Andy Lutomirski
2022-07-06 20:33         ` Khalid Aziz [this message]
2022-04-11 16:05 ` [PATCH v1 10/14] mm/mshare: Check for mapped vma when mshare'ing existing mshare'd range Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 11/14] mm/mshare: unmap vmas in mshare_unlink Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 12/14] mm/mshare: Add a proc file with mshare alignment/size information Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 13/14] mm/mshare: Enforce mshare'd region permissions Khalid Aziz
2022-04-11 16:05 ` [PATCH v1 14/14] mm/mshare: Copy PTEs to host mm Khalid Aziz
2022-04-11 17:37 ` [PATCH v1 00/14] Add support for shared PTEs across processes Matthew Wilcox
2022-04-11 18:51   ` Dave Hansen
2022-04-11 19:08     ` Matthew Wilcox
2022-04-11 19:52   ` Khalid Aziz
2022-04-11 18:47 ` Dave Hansen
2022-04-11 20:10 ` Eric W. Biederman
2022-04-11 22:21   ` Khalid Aziz
2022-05-30 10:48 ` Barry Song
2022-05-30 11:18   ` David Hildenbrand
2022-05-30 11:49     ` Barry Song
2022-06-29 17:48     ` Khalid Aziz
2022-06-29 17:40   ` Khalid Aziz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a8fce124-149f-ee46-6b23-a1805a32ebc1@oracle.com \
    --to=khalid.aziz@oracle.com \
    --cc=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=arnd@arndb.de \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=hagen@jauu.net \
    --cc=jack@suse.cz \
    --cc=keescook@chromium.org \
    --cc=kirill@shutemov.name \
    --cc=kucharsk@gmail.com \
    --cc=linkinjeon@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longpeng2@huawei.com \
    --cc=luto@kernel.org \
    --cc=markhemm@googlemail.com \
    --cc=pcc@google.com \
    --cc=rppt@kernel.org \
    --cc=sieberf@amazon.com \
    --cc=sjpark@amazon.de \
    --cc=surenb@google.com \
    --cc=tst@schoebel-theuer.de \
    --cc=willy@infradead.org \
    --cc=yzaikin@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).