From: Souptick Joarder <jrdr.linux@gmail.com>
To: Kaiyang Zhao <zhao776@purdue.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>, Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH] Shared page tables during fork
Date: Wed, 7 Jul 2021 00:17:12 +0530
Message-ID: <CAFqt6zaEDscxPmsdZffySBcutsaAzV_iJDO9a8Kkz0COacHypw@mail.gmail.com>
In-Reply-To: <20210701134618.18376-1-zhao776@purdue.edu>
On Thu, Jul 1, 2021 at 7:17 PM Kaiyang Zhao <zhao776@purdue.edu> wrote:
>
> In our research work [https://dl.acm.org/doi/10.1145/3447786.3456258], we
> have identified a method that can significantly speed up the fork system
> call for large applications (i.e., a few hundred MB and larger). Currently,
> the time the fork system call takes to complete is proportional to the
> size of a process's allocated memory; in our experiments, our design
> speeds up fork invocation by up to 270x at 50 GB.
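
[Not part of the patch -- a minimal user-space sketch of the baseline
behaviour being described: it times a plain fork() after faulting in a
large anonymous mapping, the regime where copying page tables dominates
fork latency. The 1 GiB size and touch pattern are arbitrary choices.]

/* Illustrative only: measure fork() latency with a large mapped heap. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;              /* 1 GiB of anonymous memory */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 1, len);                 /* fault in every page */

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	pid_t pid = fork();                  /* copies page tables for buf */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (pid == 0)
		_exit(0);                    /* child does nothing */
	waitpid(pid, NULL, 0);

	printf("fork took %ld us\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L);
	return 0;
}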
>
> The design is that instead of copying the entire page table tree during
> fork, we make the child and the parent process share the same set of
> last-level page tables, which are reference counted. To preserve
> copy-on-write semantics, we clear the write permission in the PMD entries
> during fork and copy PTE tables as needed in the page fault handler.
Does the application have the option to choose between the default fork()
and the new on-demand fork()?
>
> We tested a prototype with large workloads that call fork to take
> snapshots, such as fuzzers (e.g., AFL), and it yielded over 2x the
> execution throughput for AFL. The patch is an x86-only prototype that
> does not support huge pages or swapping; it is meant to demonstrate the
> potential performance gains for fork. Applications can opt in via a
> use_odf switch in procfs.
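
[For illustration, a minimal sketch of how a process might opt in,
assuming the use_odf file added by this patch appears as
/proc/self/use_odf and accepts "1"/"0" as parsed by proc_use_odf_write()
below; error handling is kept to a minimum.]

/* Illustrative only: enable ODF for the current process, then fork. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/self/use_odf", O_WRONLY);
	if (fd < 0) {
		perror("open /proc/self/use_odf"); /* kernel without the patch */
		return 1;
	}
	if (write(fd, "1", 1) != 1) {
		perror("write");
		return 1;
	}
	close(fd);

	pid_t pid = fork();          /* fork now shares PTE tables on demand */
	if (pid == 0)
		_exit(0);
	waitpid(pid, NULL, 0);
	return 0;
}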
>
> On a side note, an approach that shares page tables was proposed by Dave
> McCracken [http://lkml.iu.edu/hypermail/linux/kernel/0508.3/1623.html,
> https://www.kernel.org/doc/ols/2006/ols2006v2-pages-125-130.pdf], but never
> made it into the kernel. We believe that with the increasing memory
> consumption of modern applications and modern use cases of fork such as
> snapshotting, the shared page table approach in the context of fork is
> worth exploring.
>
> Please let us know your level of interest in this approach, or any
> comments on the general design. Thank you.
>
> Signed-off-by: Kaiyang Zhao <zhao776@purdue.edu>
> ---
> arch/x86/include/asm/pgtable.h | 19 +-
> fs/proc/base.c | 74 ++++++
> include/linux/mm.h | 11 +
> include/linux/mm_types.h | 2 +
> include/linux/pgtable.h | 11 +
> include/linux/sched/coredump.h | 5 +-
> kernel/fork.c | 7 +-
> mm/gup.c | 61 ++++-
> mm/memory.c | 401 +++++++++++++++++++++++++++++++--
> mm/mmap.c | 91 +++++++-
> mm/mprotect.c | 6 +
> 11 files changed, 668 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index b6c97b8f59ec..0fda05a5c7a1 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -410,6 +410,16 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
> return native_make_pmd(v & ~clear);
> }
>
> +static inline pmd_t pmd_mknonpresent(pmd_t pmd)
> +{
> + return pmd_clear_flags(pmd, _PAGE_PRESENT);
> +}
> +
> +static inline pmd_t pmd_mkpresent(pmd_t pmd)
> +{
> + return pmd_set_flags(pmd, _PAGE_PRESENT);
> +}
> +
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> static inline int pmd_uffd_wp(pmd_t pmd)
> {
> @@ -798,6 +808,11 @@ static inline int pmd_present(pmd_t pmd)
> return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> }
>
> +static inline int pmd_iswrite(pmd_t pmd)
> +{
> + return pmd_flags(pmd) & (_PAGE_RW);
> +}
> +
> #ifdef CONFIG_NUMA_BALANCING
> /*
> * These work without NUMA balancing but the kernel does not care. See the
> @@ -833,7 +848,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
> * Currently stuck as a macro due to indirect forward reference to
> * linux/mmzone.h's __section_mem_map_addr() definition:
> */
> -#define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd))
> +#define pmd_page(pmd) pfn_to_page(pmd_pfn(pmd_mkpresent(pmd)))
>
> /*
> * Conversion functions: convert a page and protection to a page entry,
> @@ -846,7 +861,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
>
> static inline int pmd_bad(pmd_t pmd)
> {
> - return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
> + return ((pmd_flags(pmd) & ~(_PAGE_USER)) | (_PAGE_RW | _PAGE_PRESENT)) != _KERNPG_TABLE;
> }
>
> static inline unsigned long pages_to_mb(unsigned long npg)
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index e5b5f7709d48..936f33594539 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -2935,6 +2935,79 @@ static const struct file_operations proc_coredump_filter_operations = {
> };
> #endif
>
> +static ssize_t proc_use_odf_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct task_struct *task = get_proc_task(file_inode(file));
> + struct mm_struct *mm;
> + char buffer[PROC_NUMBUF];
> + size_t len;
> + int ret;
> +
> + if (!task)
> + return -ESRCH;
> +
> + ret = 0;
> + mm = get_task_mm(task);
> + if (mm) {
> + len = snprintf(buffer, sizeof(buffer), "%lu\n",
> + ((mm->flags & MMF_USE_ODF_MASK) >> MMF_USE_ODF));
> + mmput(mm);
> + ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> + }
> +
> + put_task_struct(task);
> +
> + return ret;
> +}
> +
> +static ssize_t proc_use_odf_write(struct file *file,
> + const char __user *buf,
> + size_t count,
> + loff_t *ppos)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> + unsigned int val;
> + int ret;
> +
> + ret = kstrtouint_from_user(buf, count, 0, &val);
> + if (ret < 0)
> + return ret;
> +
> + ret = -ESRCH;
> + task = get_proc_task(file_inode(file));
> + if (!task)
> + goto out_no_task;
> +
> + mm = get_task_mm(task);
> + if (!mm)
> + goto out_no_mm;
> + ret = 0;
> +
> + if (val == 1) {
> + set_bit(MMF_USE_ODF, &mm->flags);
> + } else if (val == 0) {
> + clear_bit(MMF_USE_ODF, &mm->flags);
> + } else {
> + //ignore
> + }
> +
> + mmput(mm);
> + out_no_mm:
> + put_task_struct(task);
> + out_no_task:
> + if (ret < 0)
> + return ret;
> + return count;
> +}
> +
> +static const struct file_operations proc_use_odf_operations = {
> + .read = proc_use_odf_read,
> + .write = proc_use_odf_write,
> + .llseek = generic_file_llseek,
> +};
> +
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> static int do_io_accounting(struct task_struct *task, struct seq_file *m, int whole)
> {
> @@ -3253,6 +3326,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> #ifdef CONFIG_ELF_CORE
> REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> #endif
> + REG("use_odf", S_IRUGO|S_IWUSR, proc_use_odf_operations),
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> ONE("io", S_IRUSR, proc_tgid_io_accounting),
> #endif
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 57453dba41b9..a30eca9e236a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -664,6 +664,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> memset(vma, 0, sizeof(*vma));
> vma->vm_mm = mm;
> vma->vm_ops = &dummy_vm_ops;
> + vma->pte_table_counter_pending = true;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> }
>
> @@ -2250,6 +2251,9 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
> return false;
> __SetPageTable(page);
> inc_lruvec_page_state(page, NR_PAGETABLE);
> +
> + atomic64_set(&(page->pte_table_refcount), 0);
> +
> return true;
> }
>
> @@ -2276,6 +2280,8 @@ static inline void pgtable_pte_page_dtor(struct page *page)
>
> #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
>
> +#define tfork_pte_alloc(mm, pmd) (__tfork_pte_alloc(mm, pmd))
> +
> #define pte_alloc_map(mm, pmd, address) \
> (pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
>
> @@ -2283,6 +2289,10 @@ static inline void pgtable_pte_page_dtor(struct page *page)
> (pte_alloc(mm, pmd) ? \
> NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
>
> +#define tfork_pte_alloc_map_lock(mm, pmd, address, ptlp) \
> + (tfork_pte_alloc(mm, pmd) ? \
> + NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
> +
> #define pte_alloc_kernel(pmd, address) \
> ((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
> NULL: pte_offset_kernel(pmd, address))
> @@ -2616,6 +2626,7 @@ extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> #ifdef CONFIG_MMU
> extern int __mm_populate(unsigned long addr, unsigned long len,
> int ignore_errors);
> +extern int __mm_populate_nolock(unsigned long addr, unsigned long len, int ignore_errors);
> static inline void mm_populate(unsigned long addr, unsigned long len)
> {
> /* Ignore errors */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index f37abb2d222e..e06c677ce279 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -158,6 +158,7 @@ struct page {
> union {
> struct mm_struct *pt_mm; /* x86 pgds only */
> atomic_t pt_frag_refcount; /* powerpc */
> + atomic64_t pte_table_refcount;
> };
> #if USE_SPLIT_PTE_PTLOCKS
> #if ALLOC_SPLIT_PTLOCKS
> @@ -379,6 +380,7 @@ struct vm_area_struct {
> struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> #endif
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> + bool pte_table_counter_pending;
> } __randomize_layout;
>
> struct core_thread {
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index d147480cdefc..6afd77ff82e6 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -90,6 +90,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
> return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
> }
> #define pte_offset_kernel pte_offset_kernel
> +static inline pte_t *tfork_pte_offset_kernel(pmd_t pmd_val, unsigned long address)
> +{
> + return (pte_t *)pmd_page_vaddr(pmd_val) + pte_index(address);
> +}
> +#define tfork_pte_offset_kernel tfork_pte_offset_kernel
> #endif
>
> #if defined(CONFIG_HIGHPTE)
> @@ -782,6 +787,12 @@ static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
> })
> #endif
>
> +#define pte_table_start(addr) \
> +(addr & PMD_MASK)
> +
> +#define pte_table_end(addr) \
> +(((addr) + PMD_SIZE) & PMD_MASK)
> +
> /*
> * When walking page tables, we usually want to skip any p?d_none entries;
> * and any p?d_bad entries - reporting the error before resetting to none.
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 4d9e3a656875..8f6e50bc04ab 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -83,7 +83,10 @@ static inline int get_dumpable(struct mm_struct *mm)
> #define MMF_HAS_PINNED 28 /* FOLL_PIN has run, never cleared */
> #define MMF_DISABLE_THP_MASK (1 << MMF_DISABLE_THP)
>
> +#define MMF_USE_ODF 29
> +#define MMF_USE_ODF_MASK (1 << MMF_USE_ODF)
> +
> #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
> - MMF_DISABLE_THP_MASK)
> + MMF_DISABLE_THP_MASK | MMF_USE_ODF_MASK)
>
> #endif /* _LINUX_SCHED_COREDUMP_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d738aae40f9e..4f21ea4f4f38 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -594,8 +594,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> rb_parent = &tmp->vm_rb;
>
> mm->map_count++;
> - if (!(tmp->vm_flags & VM_WIPEONFORK))
> + if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> retval = copy_page_range(tmp, mpnt);
> + if (oldmm->flags & MMF_USE_ODF_MASK) {
> + tmp->pte_table_counter_pending = false; // reference of the shared PTE table by the new VMA is counted in copy_pmd_range_tfork
> + mpnt->pte_table_counter_pending = false; // don't double count when forking again
> + }
> + }
>
> if (tmp->vm_ops && tmp->vm_ops->open)
> tmp->vm_ops->open(tmp);
> diff --git a/mm/gup.c b/mm/gup.c
> index 42b8b1fa6521..5768f339b0ff 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1489,8 +1489,11 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> * to break COW, except for shared mappings because these don't COW
> * and we would not want to dirty them for nothing.
> */
> - if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> - gup_flags |= FOLL_WRITE;
> + if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) {
> + if (!(mm->flags & MMF_USE_ODF_MASK)) { //for ODF processes, only allocate page tables
> + gup_flags |= FOLL_WRITE;
> + }
> + }
>
> /*
> * We want mlock to succeed for regions that have any permissions
> @@ -1669,6 +1672,60 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> }
> #endif /* !CONFIG_MMU */
>
> +int __mm_populate_nolock(unsigned long start, unsigned long len, int ignore_errors)
> +{
> + struct mm_struct *mm = current->mm;
> + unsigned long end, nstart, nend;
> + struct vm_area_struct *vma = NULL;
> + int locked = 0;
> + long ret = 0;
> +
> + end = start + len;
> +
> + for (nstart = start; nstart < end; nstart = nend) {
> + /*
> + * We want to fault in pages for [nstart; end) address range.
> + * Find first corresponding VMA.
> + */
> + if (!locked) {
> + locked = 1;
> + //down_read(&mm->mmap_sem);
> + vma = find_vma(mm, nstart);
> + } else if (nstart >= vma->vm_end)
> + vma = vma->vm_next;
> + if (!vma || vma->vm_start >= end)
> + break;
> + /*
> + * Set [nstart; nend) to intersection of desired address
> + * range with the first VMA. Also, skip undesirable VMA types.
> + */
> + nend = min(end, vma->vm_end);
> + if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> + continue;
> + if (nstart < vma->vm_start)
> + nstart = vma->vm_start;
> + /*
> + * Now fault in a range of pages. populate_vma_page_range()
> + * double checks the vma flags, so that it won't mlock pages
> + * if the vma was already munlocked.
> + */
> + ret = populate_vma_page_range(vma, nstart, nend, &locked);
> + if (ret < 0) {
> + if (ignore_errors) {
> + ret = 0;
> + continue; /* continue at next VMA */
> + }
> + break;
> + }
> + nend = nstart + ret * PAGE_SIZE;
> + ret = 0;
> + }
> + /*if (locked)
> + up_read(&mm->mmap_sem);
> + */
> + return ret; /* 0 or negative error code */
> +}
> +
> /**
> * get_dump_page() - pin user page in memory while writing it to core dump
> * @addr: user address
> diff --git a/mm/memory.c b/mm/memory.c
> index db86558791f1..2b28766e4213 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -83,6 +83,9 @@
> #include <asm/tlb.h>
> #include <asm/tlbflush.h>
>
> +static bool tfork_one_pte_table(struct mm_struct *, pmd_t *, unsigned long, unsigned long);
> +static inline void init_rss_vec(int *rss);
> +static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss);
> #include "pgalloc-track.h"
> #include "internal.h"
>
> @@ -227,7 +230,16 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
> unsigned long addr)
> {
> pgtable_t token = pmd_pgtable(*pmd);
> + long counter;
> pmd_clear(pmd);
> + counter = atomic64_read(&(token->pte_table_refcount));
> + if (counter > 0) {
> + //the pte table can only be shared in this case
> +#ifdef CONFIG_DEBUG_VM
> + printk("free_pte_range: addr=%lx, counter=%ld, not freeing table", addr, counter);
> +#endif
> + return; //pte table is still in use
> + }
> pte_free_tlb(tlb, token, addr);
> mm_dec_nr_ptes(tlb->mm);
> }
> @@ -433,6 +445,118 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> }
> }
>
> +// frees every page described by the pte table
> +void zap_one_pte_table(pmd_t pmd_val, unsigned long addr, struct mm_struct *mm)
> +{
> + int rss[NR_MM_COUNTERS];
> + pte_t *pte;
> + unsigned long end;
> +
> + init_rss_vec(rss);
> + addr = pte_table_start(addr);
> + end = pte_table_end(addr);
> + pte = tfork_pte_offset_kernel(pmd_val, addr);
> + do {
> + pte_t ptent = *pte;
> +
> + if (pte_none(ptent))
> + continue;
> +
> + if (pte_present(ptent)) {
> + struct page *page;
> +
> + if (pte_special(ptent)) { //known special pte: vvar VMA, which has just one page shared system-wide. Shouldn't matter
> + continue;
> + }
> + page = vm_normal_page(NULL, addr, ptent); //kyz : vma is not important
> + if (unlikely(!page))
> + continue;
> + rss[mm_counter(page)]--;
> +#ifdef CONFIG_DEBUG_VM
> + // printk("zap_one_pte_table: addr=%lx, end=%lx, (before) mapcount=%d, refcount=%d\n", addr, end, page_mapcount(page), page_ref_count(page));
> +#endif
> + page_remove_rmap(page, false);
> + put_page(page);
> + }
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + add_mm_rss_vec(mm, rss);
> +}
> +
> +/* pmd lock should be held
> + * returns 1 if the table becomes unused
> + */
> +int dereference_pte_table(pmd_t pmd_val, bool free_table, struct mm_struct *mm, unsigned long addr)
> +{
> + struct page *table_page;
> +
> + table_page = pmd_page(pmd_val);
> +
> + if (atomic64_dec_and_test(&(table_page->pte_table_refcount))) {
> +#ifdef CONFIG_DEBUG_VM
> + printk("dereference_pte_table: addr=%lx, free_table=%d, pte table reached end of life\n", addr, free_table);
> +#endif
> +
> + zap_one_pte_table(pmd_val, addr, mm);
> + if (free_table) {
> + pgtable_pte_page_dtor(table_page);
> + __free_page(table_page);
> + mm_dec_nr_ptes(mm);
> + }
> + return 1;
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("dereference_pte_table: addr=%lx, (after) pte_table_count=%lld\n", addr, atomic64_read(&(table_page->pte_table_refcount)));
> +#endif
> + }
> + return 0;
> +}
> +
> +int dereference_pte_table_multiple(pmd_t pmd_val, bool free_table, struct mm_struct *mm, unsigned long addr, int num)
> +{
> + struct page *table_page;
> + int count_after;
> +
> + table_page = pmd_page(pmd_val);
> + count_after = atomic64_sub_return(num, &(table_page->pte_table_refcount));
> + if (count_after <= 0) {
> +#ifdef CONFIG_DEBUG_VM
> + printk("dereference_pte_table_multiple: addr=%lx, free_table=%d, num=%d, after count=%d, table reached end of life\n", addr, free_table, num, count_after);
> +#endif
> +
> + zap_one_pte_table(pmd_val, addr, mm);
> + if (free_table) {
> + pgtable_pte_page_dtor(table_page);
> + __free_page(table_page);
> + mm_dec_nr_ptes(mm);
> + }
> + return 1;
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("dereference_pte_table_multiple: addr=%lx, num=%d, (after) count=%lld\n", addr, num, atomic64_read(&(table_page->pte_table_refcount)));
> +#endif
> + }
> + return 0;
> +}
> +
> +int __tfork_pte_alloc(struct mm_struct *mm, pmd_t *pmd)
> +{
> + pgtable_t new = pte_alloc_one(mm);
> +
> + if (!new)
> + return -ENOMEM;
> + smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> +
> + mm_inc_nr_ptes(mm);
> + //kyz: won't check if the pte table already exists
> + pmd_populate(mm, pmd, new);
> + new = NULL;
> + if (new)
> + pte_free(mm, new);
> + return 0;
> +}
> +
> +
> int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
> {
> spinlock_t *ptl;
> @@ -928,6 +1052,45 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> return 0;
> }
>
> +static inline unsigned long
> +copy_one_pte_tfork(struct mm_struct *dst_mm,
> + pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> + unsigned long addr, int *rss)
> +{
> + unsigned long vm_flags = vma->vm_flags;
> + pte_t pte = *src_pte;
> + struct page *page;
> +
> + /*
> + * If it's a COW mapping
> + * only protect in the child (the faulting process)
> + */
> + if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> + pte = pte_wrprotect(pte);
> + }
> +
> + /*
> + * If it's a shared mapping, mark it clean in
> + * the child
> + */
> + if (vm_flags & VM_SHARED)
> + pte = pte_mkclean(pte);
> + pte = pte_mkold(pte);
> +
> + page = vm_normal_page(vma, addr, pte);
> + if (page) {
> + get_page(page);
> + page_dup_rmap(page, false);
> + rss[mm_counter(page)]++;
> +#ifdef CONFIG_DEBUG_VM
> +// printk("copy_one_pte_tfork: addr=%lx, (after) mapcount=%d, refcount=%d\n", addr, page_mapcount(page), page_ref_count(page));
> +#endif
> + }
> +
> + set_pte_at(dst_mm, addr, dst_pte, pte);
> + return 0;
> +}
> +
> /*
> * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
> * is required to copy this pte.
> @@ -999,6 +1162,59 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
> return new_page;
> }
>
> +static int copy_pte_range_tfork(struct mm_struct *dst_mm,
> + pmd_t *dst_pmd, pmd_t src_pmd_val, struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> +{
> + pte_t *orig_src_pte, *orig_dst_pte;
> + pte_t *src_pte, *dst_pte;
> + spinlock_t *dst_ptl;
> + int rss[NR_MM_COUNTERS];
> + swp_entry_t entry = (swp_entry_t){0};
> + struct page *dst_pte_page;
> +
> + init_rss_vec(rss);
> +
> + src_pte = tfork_pte_offset_kernel(src_pmd_val, addr); //src_pte points to the old table
> + if (!pmd_iswrite(*dst_pmd)) {
> + dst_pte = tfork_pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl); //dst_pte points to a new table
> +#ifdef CONFIG_DEBUG_VM
> + printk("copy_pte_range_tfork: allocated new table. addr=%lx, prev_table_page=%px, table_page=%px\n", addr, pmd_page(src_pmd_val), pmd_page(*dst_pmd));
> +#endif
> + } else {
> + dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> + }
> + if (!dst_pte)
> + return -ENOMEM;
> +
> + dst_pte_page = pmd_page(*dst_pmd);
> + atomic64_inc(&(dst_pte_page->pte_table_refcount)); //kyz: associates the VMA with the new table
> +#ifdef CONFIG_DEBUG_VM
> + printk("copy_pte_range_tfork: addr = %lx, end = %lx, new pte table counter (after)=%lld\n", addr, end, atomic64_read(&(dst_pte_page->pte_table_refcount)));
> +#endif
> +
> + orig_src_pte = src_pte;
> + orig_dst_pte = dst_pte;
> + arch_enter_lazy_mmu_mode();
> +
> + do {
> + if (pte_none(*src_pte)) {
> + continue;
> + }
> + entry.val = copy_one_pte_tfork(dst_mm, dst_pte, src_pte,
> + vma, addr, rss);
> + if (entry.val)
> + printk("kyz: failed copy_one_pte_tfork call\n");
> + } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> + pte_unmap(orig_src_pte);
> + add_mm_rss_vec(dst_mm, rss);
> + pte_unmap_unlock(orig_dst_pte, dst_ptl);
> +
> + return 0;
> +}
> +
> static int
> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -1130,8 +1346,9 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> {
> struct mm_struct *dst_mm = dst_vma->vm_mm;
> struct mm_struct *src_mm = src_vma->vm_mm;
> - pmd_t *src_pmd, *dst_pmd;
> + pmd_t *src_pmd, *dst_pmd, src_pmd_value;
> unsigned long next;
> + struct page *table_page;
>
> dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
> if (!dst_pmd)
> @@ -1153,9 +1370,43 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> }
> if (pmd_none_or_clear_bad(src_pmd))
> continue;
> - if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> - addr, next))
> - return -ENOMEM;
> + if (src_mm->flags & MMF_USE_ODF_MASK) {
> +#ifdef CONFIG_DEBUG_VM
> + printk("copy_pmd_range: vm_start=%lx, addr=%lx, vm_end=%lx, end=%lx\n", src_vma->vm_start, addr, src_vma->vm_end, end);
> +#endif
> +
> + src_pmd_value = *src_pmd;
> + //kyz: sets write-protect to the pmd entry if the vma is writable
> + if (src_vma->vm_flags & VM_WRITE) {
> + src_pmd_value = pmd_wrprotect(src_pmd_value);
> + set_pmd_at(src_mm, addr, src_pmd, src_pmd_value);
> + }
> + table_page = pmd_page(*src_pmd);
> + if (src_vma->pte_table_counter_pending) { // kyz : the old VMA hasn't been counted in the PTE table, count it now
> + atomic64_add(2, &(table_page->pte_table_refcount));
> +#ifdef CONFIG_DEBUG_VM
> + printk("copy_pmd_range: addr=%lx, pte table counter (after counting old&new)=%lld\n", addr, atomic64_read(&(table_page->pte_table_refcount)));
> +#endif
> + } else {
> + atomic64_inc(&(table_page->pte_table_refcount)); //increments the pte table counter
> + if (atomic64_read(&(table_page->pte_table_refcount)) == 1) { //the VMA is old, but the pte table is new (created by a fault after the last odf call)
> + atomic64_set(&(table_page->pte_table_refcount), 2);
> +#ifdef CONFIG_DEBUG_VM
> + printk("copy_pmd_range: addr=%lx, pte table counter (old VMA, new pte table)=%lld\n", addr, atomic64_read(&(table_page->pte_table_refcount)));
> +#endif
> + }
> +#ifdef CONFIG_DEBUG_VM
> + else {
> + printk("copy_pmd_range: addr=%lx, pte table counter (after counting new)=%lld\n", addr, atomic64_read(&(table_page->pte_table_refcount)));
> + }
> +#endif
> + }
> + set_pmd_at(dst_mm, addr, dst_pmd, src_pmd_value); //shares the table with the child
> + } else {
> + if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> + addr, next))
> + return -ENOMEM;
> + }
> } while (dst_pmd++, src_pmd++, addr = next, addr != end);
> return 0;
> }
> @@ -1240,9 +1491,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> * readonly mappings. The tradeoff is that copy_page_range is more
> * efficient than faulting.
> */
> - if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
> +/* if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
> !src_vma->anon_vma)
> return 0;
> +*/
>
> if (is_vm_hugetlb_page(src_vma))
> return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
> @@ -1304,7 +1556,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end,
> - struct zap_details *details)
> + struct zap_details *details, bool invalidate_pmd)
> {
> struct mm_struct *mm = tlb->mm;
> int force_flush = 0;
> @@ -1343,8 +1595,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> details->check_mapping != page_rmapping(page))
> continue;
> }
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> + if (!invalidate_pmd) {
> + ptent = ptep_get_and_clear_full(mm, addr, pte,
> + tlb->fullmm);
> + }
> tlb_remove_tlb_entry(tlb, pte, addr);
> if (unlikely(!page))
> continue;
> @@ -1358,8 +1612,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> likely(!(vma->vm_flags & VM_SEQ_READ)))
> mark_page_accessed(page);
> }
> - rss[mm_counter(page)]--;
> - page_remove_rmap(page, false);
> + if (!invalidate_pmd) {
> + rss[mm_counter(page)]--;
> + page_remove_rmap(page, false);
> + } else {
> + continue;
> + }
> if (unlikely(page_mapcount(page) < 0))
> print_bad_pte(vma, addr, ptent, page);
> if (unlikely(__tlb_remove_page(tlb, page))) {
> @@ -1446,12 +1704,16 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> struct zap_details *details)
> {
> pmd_t *pmd;
> - unsigned long next;
> + unsigned long next, table_start, table_end;
> + spinlock_t *ptl;
> + struct page *table_page;
> + bool got_new_table = false;
>
> pmd = pmd_offset(pud, addr);
> do {
> + ptl = pmd_lock(vma->vm_mm, pmd);
> next = pmd_addr_end(addr, end);
> - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
> if (next - addr != HPAGE_PMD_SIZE)
> __split_huge_pmd(vma, pmd, addr, false, NULL);
> else if (zap_huge_pmd(tlb, vma, pmd, addr))
> @@ -1478,8 +1740,49 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> */
> if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> goto next;
> - next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> + //kyz: copy if the pte table is shared and VMA does not cover fully the 2MB region
> + table_page = pmd_page(*pmd);
> + table_start = pte_table_start(addr);
> +
> + if ((!pmd_iswrite(*pmd)) && (!vma->pte_table_counter_pending)) {//shared pte table. vma has gone through odf
> + table_end = pte_table_end(addr);
> + if (table_start < vma->vm_start || table_end > vma->vm_end) {
> +#ifdef CONFIG_DEBUG_VM
> + printk("%s: addr=%lx, end=%lx, table_start=%lx, table_end=%lx, copy then zap\n", __func__, addr, end, table_start, table_end);
> +#endif
> + if (dereference_pte_table(*pmd, false, vma->vm_mm, addr) != 1) { //dec the counter of the shared table. tfork_one_pte_table cannot find the current VMA (which is being unmapped)
> + got_new_table = tfork_one_pte_table(vma->vm_mm, pmd, addr, vma->vm_end);
> + if (got_new_table) {
> + next = zap_pte_range(tlb, vma, pmd, addr, next, details, false);
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("zap_pmd_range: no more VMAs in this process are using the table, but there are other processes using it\n");
> +#endif
> + pmd_clear(pmd);
> + }
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("zap_pmd_range: the shared table is dead. NOT copying after all.\n");
> +#endif
> + // the shared table will be freed by unmap_single_vma()
> + }
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("%s: addr=%lx, end=%lx, table_start=%lx, table_end=%lx, zap while preserving pte entries\n", __func__, addr, end, table_start, table_end);
> +#endif
> + //kyz: shared and fully covered by the VMA, preserve the pte entries
> + next = zap_pte_range(tlb, vma, pmd, addr, next, details, true);
> + dereference_pte_table(*pmd, true, vma->vm_mm, addr);
> + pmd_clear(pmd);
> + }
> + } else {
> + next = zap_pte_range(tlb, vma, pmd, addr, next, details, false);
> + if (!vma->pte_table_counter_pending) {
> + atomic64_dec(&(table_page->pte_table_refcount));
> + }
> + }
> next:
> + spin_unlock(ptl);
> cond_resched();
> } while (pmd++, addr = next, addr != end);
>
> @@ -4476,6 +4779,66 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
> return VM_FAULT_FALLBACK;
> }
>
> +/* kyz: Handles an entire pte-level page table, covering multiple VMAs (if they exist)
> + * Returns true if a new table is put in place, false otherwise.
> + * if exclude is not 0, the vma that covers addr to exclude will not be copied
> + */
> +static bool tfork_one_pte_table(struct mm_struct *mm, pmd_t *dst_pmd, unsigned long addr, unsigned long exclude)
> +{
> + unsigned long table_end, end, orig_addr;
> + struct vm_area_struct *vma;
> + pmd_t orig_pmd_val;
> + bool copied = false;
> + struct page *orig_pte_page;
> + int num_vmas = 0;
> +
> + if (!pmd_none(*dst_pmd)) {
> + orig_pmd_val = *dst_pmd;
> + } else {
> + BUG();
> + }
> +
> + //kyz: Starts from the beginning of the range covered by the table
> + orig_addr = addr;
> + table_end = pte_table_end(addr);
> + addr = pte_table_start(addr);
> +#ifdef CONFIG_DEBUG_VM
> + orig_pte_page = pmd_page(orig_pmd_val);
> + printk("tfork_one_pte_table: shared pte table counter=%lld, Covered Range: start=%lx, end=%lx\n", atomic64_read(&(orig_pte_page->pte_table_refcount)), addr, table_end);
> +#endif
> + do {
> + vma = find_vma(mm, addr);
> + if (!vma) {
> + break; //inexplicable
> + }
> + if (vma->vm_start >= table_end) {
> + break;
> + }
> + end = pmd_addr_end(addr, vma->vm_end);
> + if (vma->pte_table_counter_pending) { //this vma is newly mapped (clean) and (fully/partly) described by this pte table
> + addr = end;
> + continue;
> + }
> + if (vma->vm_start > addr) {
> + addr = vma->vm_start;
> + }
> + if (exclude > 0 && vma->vm_start <= orig_addr && vma->vm_end >= exclude) {
> + addr = end;
> + continue;
> + }
> +#ifdef CONFIG_DEBUG_VM
> + printk("tfork_one_pte_table: vm_start=%lx, vm_end=%lx\n", vma->vm_start, vma->vm_end);
> +#endif
> + num_vmas++;
> + copy_pte_range_tfork(mm, dst_pmd, orig_pmd_val, vma, addr, end);
> + copied = true;
> + addr = end;
> + } while (addr < table_end);
> +
> + dereference_pte_table_multiple(orig_pmd_val, true, mm, orig_addr, num_vmas);
> + return copied;
> +}
> +
> /*
> * These routines also need to handle stuff like marking pages dirty
> * and/or accessed for architectures that don't do it in hardware (most
> @@ -4610,6 +4973,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pgd_t *pgd;
> p4d_t *p4d;
> vm_fault_t ret;
> + spinlock_t *ptl;
>
> pgd = pgd_offset(mm, address);
> p4d = p4d_alloc(mm, pgd, address);
> @@ -4659,6 +5023,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> vmf.orig_pmd = *vmf.pmd;
>
> barrier();
> + /*
> if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
> VM_BUG_ON(thp_migration_supported() &&
> !is_pmd_migration_entry(vmf.orig_pmd));
> @@ -4666,6 +5031,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> }
> + */
> if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf);
> @@ -4679,6 +5045,15 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return 0;
> }
> }
> + //kyz: checks if the pmd entry prohibits writes
> + if ((!pmd_none(vmf.orig_pmd)) && (!pmd_iswrite(vmf.orig_pmd)) && (vma->vm_flags & VM_WRITE)) {
> +#ifdef CONFIG_DEBUG_VM
> + printk("__handle_mm_fault: PID=%d, addr=%lx\n", current->pid, address);
> +#endif
> + ptl = pmd_lock(mm, vmf.pmd);
> + tfork_one_pte_table(mm, vmf.pmd, vmf.address, 0u);
> + spin_unlock(ptl);
> + }
> }
>
> return handle_pte_fault(&vmf);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ca54d36d203a..308d86cfe544 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -47,6 +47,7 @@
> #include <linux/pkeys.h>
> #include <linux/oom.h>
> #include <linux/sched/mm.h>
> +#include <linux/pagewalk.h>
>
> #include <linux/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -276,6 +277,9 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>
> success:
> populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;
> + if (mm->flags & MMF_USE_ODF_MASK) { //for ODF
> + populate = true;
> + }
> if (downgraded)
> mmap_read_unlock(mm);
> else
> @@ -1115,6 +1119,50 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
> return 0;
> }
>
> +static int pgtable_counter_fixup_pmd_entry(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct page *table_page;
> +
> + table_page = pmd_page(*pmd);
> + atomic64_inc(&(table_page->pte_table_refcount));
> +
> +#ifdef CONFIG_DEBUG_VM
> + printk("fixup inc: addr=%lx\n", addr);
> +#endif
> +
> + walk->action = ACTION_CONTINUE; //skip pte level
> + return 0;
> +}
> +
> +static int pgtable_counter_fixup_test(unsigned long addr, unsigned long next,
> + struct mm_walk *walk)
> +{
> + return 0;
> +}
> +
> +static const struct mm_walk_ops pgtable_counter_fixup_walk_ops = {
> +.pmd_entry = pgtable_counter_fixup_pmd_entry,
> +.test_walk = pgtable_counter_fixup_test
> +};
> +
> +int merge_vma_pgtable_counter_fixup(struct vm_area_struct *vma, unsigned long start, unsigned long end)
> +{
> + if (vma->pte_table_counter_pending) {
> + return 0;
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("merge fixup: vm_start=%lx, vm_end=%lx, inc start=%lx, inc end=%lx\n", vma->vm_start, vma->vm_end, start, end);
> +#endif
> + start = pte_table_end(start);
> + end = pte_table_start(end);
> + __mm_populate_nolock(start, end-start, 1); //popuate tables for extended address range so that we can increment counters
> + walk_page_range(vma->vm_mm, start, end, &pgtable_counter_fixup_walk_ops, NULL);
> + }
> +
> + return 0;
> +}
> +
> /*
> * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
> * whether that can be merged with its predecessor or its successor.
> @@ -1215,6 +1263,9 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> if (err)
> return NULL;
> khugepaged_enter_vma_merge(prev, vm_flags);
> +
> + merge_vma_pgtable_counter_fixup(prev, addr, end);
> +
> return prev;
> }
>
> @@ -1242,6 +1293,9 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> if (err)
> return NULL;
> khugepaged_enter_vma_merge(area, vm_flags);
> +
> + merge_vma_pgtable_counter_fixup(area, addr, end);
> +
> return area;
> }
>
> @@ -1584,8 +1638,15 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
> if (!IS_ERR_VALUE(addr) &&
> ((vm_flags & VM_LOCKED) ||
> - (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
> + (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE ||
> + (mm->flags & MMF_USE_ODF_MASK))) {
> +#ifdef CONFIG_DEBUG_VM
> + if (mm->flags & MMF_USE_ODF_MASK) {
> + printk("mmap: force populate, addr=%lx, len=%lx\n", addr, len);
> + }
> +#endif
> *populate = len;
> + }
> return addr;
> }
>
> @@ -2799,6 +2860,31 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> return __split_vma(mm, vma, addr, new_below);
> }
>
> +/* left and right vma after the split, address of split */
> +int split_vma_pgtable_counter_fixup(struct vm_area_struct *lvma, struct vm_area_struct *rvma, bool orig_pending_flag)
> +{
> + if (orig_pending_flag) {
> + return 0; //the new vma will have pending flag as true by default, just as the old vma
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("split fixup: set vma flag to false, rvma_start=%lx\n", rvma->vm_start);
> +#endif
> + lvma->pte_table_counter_pending = false;
> + rvma->pte_table_counter_pending = false;
> +
> + if (pte_table_start(rvma->vm_start) == rvma->vm_start) { //the split was right at the pte table boundary
> + return 0; //the only case where we don't increment pte table counter
> + } else {
> +#ifdef CONFIG_DEBUG_VM
> + printk("split fixup: rvma_start=%lx\n", rvma->vm_start);
> +#endif
> + walk_page_range(rvma->vm_mm, pte_table_start(rvma->vm_start), pte_table_end(rvma->vm_start), &pgtable_counter_fixup_walk_ops, NULL);
> + }
> + }
> +
> + return 0;
> +}
> +
> static inline void
> unlock_range(struct vm_area_struct *start, unsigned long limit)
> {
> @@ -2869,6 +2955,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> if (error)
> return error;
> prev = vma;
> +
> + split_vma_pgtable_counter_fixup(prev, prev->vm_next, prev->pte_table_counter_pending);
> }
>
> /* Does it split the last one? */
> @@ -2877,6 +2965,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> int error = __split_vma(mm, last, end, 1);
> if (error)
> return error;
> + split_vma_pgtable_counter_fixup(last->vm_prev, last, last->pte_table_counter_pending);
> }
> vma = vma_next(mm, prev);
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 4cb240fd9936..d396b1d38fab 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -445,6 +445,8 @@ static const struct mm_walk_ops prot_none_walk_ops = {
> .test_walk = prot_none_test,
> };
>
> +int split_vma_pgtable_counter_fixup(struct vm_area_struct *lvma, struct vm_area_struct *rvma, bool orig_pending_flag);
> +
> int
> mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> unsigned long start, unsigned long end, unsigned long newflags)
> @@ -517,12 +519,16 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> error = split_vma(mm, vma, start, 1);
> if (error)
> goto fail;
> +
> + split_vma_pgtable_counter_fixup(vma->vm_prev, vma, vma->pte_table_counter_pending);
> }
>
> if (end != vma->vm_end) {
> error = split_vma(mm, vma, end, 0);
> if (error)
> goto fail;
> +
> + split_vma_pgtable_counter_fixup(vma, vma->vm_next, vma->pte_table_counter_pending);
> }
>
> success:
> --
> 2.30.2
>
>
Thread overview: 4+ messages
2021-07-01 13:46 [PATCH] Shared page tables during fork Kaiyang Zhao
2021-07-06 18:47 ` Souptick Joarder [this message]
2021-07-06 23:01 ` Dave Hansen
2021-07-07 7:00 ` Hillf Danton