linux-fsdevel.vger.kernel.org archive mirror
* [PATCH mm-unstable v2 1/2] mm: add vma_has_recency()
@ 2022-12-30 21:52 Yu Zhao
  2022-12-30 21:52 ` [PATCH mm-unstable v2 2/2] mm: support POSIX_FADV_NOREUSE Yu Zhao
  2023-01-06  4:00 ` [PATCH mm-unstable v2 1/2] mm: add vma_has_recency() Andrew Morton
  0 siblings, 2 replies; 4+ messages in thread
From: Yu Zhao @ 2022-12-30 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andrea Righi, Johannes Weiner, Michael Larabel,
	linux-mm, linux-fsdevel, linux-kernel, Yu Zhao

This patch adds vma_has_recency() to indicate whether a VMA may
exhibit temporal locality that the LRU algorithm relies on.

This function returns false for VMAs marked by VM_SEQ_READ or
VM_RAND_READ. While the former flag indicates linear access, i.e., a
special case of spatial locality, both flags indicate a lack of
temporal locality, i.e., the reuse of an area within a relatively
short period.

"Recency" is chosen over "locality" to avoid confusion between
temporal and spatial localities.

Before this patch, the active/inactive LRU only ignored the accessed
bit from VMAs marked by VM_SEQ_READ. After this patch, the
active/inactive LRU and MGLRU share the same logic: they both ignore
the accessed bit if vma_has_recency() returns false.

For the active/inactive LRU, the following fio test showed a [6, 8]%
increase in IOPS when randomly accessing mapped files under memory
pressure.

  kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
  kb=$((kb - 8*1024*1024))

  modprobe brd rd_nr=1 rd_size=$kb
  dd if=/dev/zero of=/dev/ram0 bs=1M

  mkfs.ext4 /dev/ram0
  mount /dev/ram0 /mnt/
  swapoff -a

  fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \
      --size=8G --rw=randrw --time_based --runtime=10m \
      --group_reporting

The discussion that led to this patch is here [1]. Additional test
results are available in that thread.

[1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/mm_inline.h |  8 ++++++++
 mm/memory.c               |  7 +++----
 mm/rmap.c                 | 42 +++++++++++++++++----------------------
 mm/vmscan.c               |  5 ++++-
 4 files changed, 33 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index d1c1f211a86f..fe5b8449e14a 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -595,4 +595,12 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 #endif
 }
 
+static inline bool vma_has_recency(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
+		return false;
+
+	return true;
+}
+
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 4000e9f017e0..ee72badad847 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1402,8 +1402,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 						force_flush = 1;
 					}
 				}
-				if (pte_young(ptent) &&
-				    likely(!(vma->vm_flags & VM_SEQ_READ)))
+				if (pte_young(ptent) && likely(vma_has_recency(vma)))
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
@@ -5148,8 +5147,8 @@ static inline void mm_account_fault(struct pt_regs *regs,
 #ifdef CONFIG_LRU_GEN
 static void lru_gen_enter_fault(struct vm_area_struct *vma)
 {
-	/* the LRU algorithm doesn't apply to sequential or random reads */
-	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
+	/* the LRU algorithm only applies to accesses with recency */
+	current->in_lru_fault = vma_has_recency(vma);
 }
 
 static void lru_gen_exit_fault(void)
diff --git a/mm/rmap.c b/mm/rmap.c
index 8a24b90d9531..9abffdd63a6a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -823,25 +823,14 @@ static bool folio_referenced_one(struct folio *folio,
 		}
 
 		if (pvmw.pte) {
-			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
-			    !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+			if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
 				lru_gen_look_around(&pvmw);
 				referenced++;
 			}
 
 			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte)) {
-				/*
-				 * Don't treat a reference through
-				 * a sequentially read mapping as such.
-				 * If the folio has been used in another mapping,
-				 * we will catch it; if this other mapping is
-				 * already gone, the unmap path will have set
-				 * the referenced flag or activated the folio.
-				 */
-				if (likely(!(vma->vm_flags & VM_SEQ_READ)))
-					referenced++;
-			}
+						pvmw.pte))
+				referenced++;
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			if (pmdp_clear_flush_young_notify(vma, address,
 						pvmw.pmd))
@@ -875,7 +864,20 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
 	struct folio_referenced_arg *pra = arg;
 	struct mem_cgroup *memcg = pra->memcg;
 
-	if (!mm_match_cgroup(vma->vm_mm, memcg))
+	/*
+	 * Ignore references from this mapping if it has no recency. If the
+	 * folio has been used in another mapping, we will catch it; if this
+	 * other mapping is already gone, the unmap path will have set the
+	 * referenced flag or activated the folio in zap_pte_range().
+	 */
+	if (!vma_has_recency(vma))
+		return true;
+
+	/*
+	 * If we are reclaiming on behalf of a cgroup, skip counting on behalf
+	 * of references from different cgroups.
+	 */
+	if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
 		return true;
 
 	return false;
@@ -906,6 +908,7 @@ int folio_referenced(struct folio *folio, int is_locked,
 		.arg = (void *)&pra,
 		.anon_lock = folio_lock_anon_vma_read,
 		.try_lock = true,
+		.invalid_vma = invalid_folio_referenced_vma,
 	};
 
 	*vm_flags = 0;
@@ -921,15 +924,6 @@ int folio_referenced(struct folio *folio, int is_locked,
 			return 1;
 	}
 
-	/*
-	 * If we are reclaiming on behalf of a cgroup, skip
-	 * counting on behalf of references from different
-	 * cgroups
-	 */
-	if (memcg) {
-		rwc.invalid_vma = invalid_folio_referenced_vma;
-	}
-
 	rmap_walk(folio, &rwc);
 	*vm_flags = pra.vm_flags;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6929402db149..cdf96aec39dc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3794,7 +3794,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
 	if (is_vm_hugetlb_page(vma))
 		return true;
 
-	if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ))
+	if (!vma_has_recency(vma))
+		return true;
+
+	if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL))
 		return true;
 
 	if (vma == get_gate_vma(vma->vm_mm))
-- 
2.39.0.314.g84b9a713c41-goog



* [PATCH mm-unstable v2 2/2] mm: support POSIX_FADV_NOREUSE
  2022-12-30 21:52 [PATCH mm-unstable v2 1/2] mm: add vma_has_recency() Yu Zhao
@ 2022-12-30 21:52 ` Yu Zhao
  2023-01-06  4:00 ` [PATCH mm-unstable v2 1/2] mm: add vma_has_recency() Andrew Morton
  1 sibling, 0 replies; 4+ messages in thread
From: Yu Zhao @ 2022-12-30 21:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andrea Righi, Johannes Weiner, Michael Larabel,
	linux-mm, linux-fsdevel, linux-kernel, Yu Zhao

This patch adds POSIX_FADV_NOREUSE to vma_has_recency() so that the
LRU algorithm can ignore accesses to mapped files marked by this flag.

The advantages of POSIX_FADV_NOREUSE are:
1. Unlike MADV_SEQUENTIAL and MADV_RANDOM, it does not alter the
   default readahead behavior.
2. Unlike MADV_SEQUENTIAL and MADV_RANDOM, it does not split VMAs and
   therefore does not take mmap_lock.
3. Unlike MADV_COLD, setting it has a negligible cost, regardless of
   how many pages it affects.

Its limitations are:
1. Like POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL, it currently does
   not support ranges, i.e., its scope is the entire file.
2. It currently does not ignore accesses through file descriptors.
   Specifically, for the active/inactive LRU, if a file page is shared
   by two users and one of them has set POSIX_FADV_NOREUSE on the
   file, the page will still be activated when the second user
   accesses it. This corner case could be covered by checking
   POSIX_FADV_NOREUSE before calling folio_mark_accessed() on the
   read path, but it is not considered worth the effort.

There have been a few attempts to support POSIX_FADV_NOREUSE, e.g.,
[1]. This time the goal is to fill a niche: a few desktop
applications, e.g., large file transfers and video
encoding/decoding, want fast file streaming with mmap() rather than
direct IO. Among those applications, an SVT-AV1 regression was
reported when running with MGLRU [2]. The following test can reproduce
that regression.

  kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
  kb=$((kb - 8*1024*1024))

  modprobe brd rd_nr=1 rd_size=$kb
  dd if=/dev/zero of=/dev/ram0 bs=1M

  mkfs.ext4 /dev/ram0
  mount /dev/ram0 /mnt/
  swapoff -a

  fallocate -l 8G /mnt/swapfile
  mkswap /mnt/swapfile
  swapon /mnt/swapfile

  wget http://ultravideo.cs.tut.fi/video/Bosphorus_3840x2160_120fps_420_8bit_YUV_Y4M.7z
  7z e -o/mnt/ Bosphorus_3840x2160_120fps_420_8bit_YUV_Y4M.7z
  SvtAv1EncApp --preset 12 -w 3840 -h 2160 \
               -i /mnt/Bosphorus_3840x2160.y4m

For MGLRU, the following change showed a [9, 11]% increase in FPS,
which brings it on par with the active/inactive LRU.

  patch Source/App/EncApp/EbAppMain.c <<EOF
  31a32
  > #include <fcntl.h>
  35d35
  < #include <fcntl.h> /* _O_BINARY */
  117a118
  >             posix_fadvise(config->mmap.fd, 0, 0, POSIX_FADV_NOREUSE);
  EOF

[1] https://lore.kernel.org/r/1308923350-7932-1-git-send-email-andrea@betterlinux.com/
[2] https://openbenchmarking.org/result/2209259-PTS-MGLRU8GB57

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 include/linux/fs.h        | 2 ++
 include/linux/mm_inline.h | 3 +++
 mm/fadvise.c              | 5 ++++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 066555ad1bf8..5660ed0edf1a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -166,6 +166,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports DIRECT IO */
 #define	FMODE_CAN_ODIRECT	((__force fmode_t)0x400000)
 
+#define	FMODE_NOREUSE		((__force fmode_t)0x800000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x4000000)
 
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fe5b8449e14a..064f92c78bfa 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -600,6 +600,9 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
 	if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
 		return false;
 
+	if (vma->vm_file && (vma->vm_file->f_mode & FMODE_NOREUSE))
+		return false;
+
 	return true;
 }
 
diff --git a/mm/fadvise.c b/mm/fadvise.c
index bf04fec87f35..fb7c5f43fd2a 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -80,7 +80,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	case POSIX_FADV_NORMAL:
 		file->f_ra.ra_pages = bdi->ra_pages;
 		spin_lock(&file->f_lock);
-		file->f_mode &= ~FMODE_RANDOM;
+		file->f_mode &= ~(FMODE_RANDOM | FMODE_NOREUSE);
 		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_RANDOM:
@@ -107,6 +107,9 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 		force_page_cache_readahead(mapping, file, start_index, nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
+		spin_lock(&file->f_lock);
+		file->f_mode |= FMODE_NOREUSE;
+		spin_unlock(&file->f_lock);
 		break;
 	case POSIX_FADV_DONTNEED:
 		__filemap_fdatawrite_range(mapping, offset, endbyte,
-- 
2.39.0.314.g84b9a713c41-goog



* Re: [PATCH mm-unstable v2 1/2] mm: add vma_has_recency()
  2022-12-30 21:52 [PATCH mm-unstable v2 1/2] mm: add vma_has_recency() Yu Zhao
  2022-12-30 21:52 ` [PATCH mm-unstable v2 2/2] mm: support POSIX_FADV_NOREUSE Yu Zhao
@ 2023-01-06  4:00 ` Andrew Morton
  2023-01-09 22:51   ` T.J. Alumbaugh
  1 sibling, 1 reply; 4+ messages in thread
From: Andrew Morton @ 2023-01-06  4:00 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Alexander Viro, Andrea Righi, Johannes Weiner, Michael Larabel,
	linux-mm, linux-fsdevel, linux-kernel

On Fri, 30 Dec 2022 14:52:51 -0700 Yu Zhao <yuzhao@google.com> wrote:

> This patch adds vma_has_recency() to indicate whether a VMA may
> exhibit temporal locality that the LRU algorithm relies on.
> 
> This function returns false for VMAs marked by VM_SEQ_READ or
> VM_RAND_READ. While the former flag indicates linear access, i.e., a
> special case of spatial locality, both flags indicate a lack of
> temporal locality, i.e., the reuse of an area within a relatively
> short period.
> 
> "Recency" is chosen over "locality" to avoid confusion between
> temporal and spatial localities.
> 
> Before this patch, the active/inactive LRU only ignored the accessed
> bit from VMAs marked by VM_SEQ_READ. After this patch, the
> active/inactive LRU and MGLRU share the same logic: they both ignore
> the accessed bit if vma_has_recency() returns false.
> 
> For the active/inactive LRU, the following fio test showed a [6, 8]%
> increase in IOPS when randomly accessing mapped files under memory
> pressure.
> 
>   kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
>   kb=$((kb - 8*1024*1024))
> 
>   modprobe brd rd_nr=1 rd_size=$kb
>   dd if=/dev/zero of=/dev/ram0 bs=1M
> 
>   mkfs.ext4 /dev/ram0
>   mount /dev/ram0 /mnt/
>   swapoff -a
> 
>   fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \
>       --size=8G --rw=randrw --time_based --runtime=10m \
>       --group_reporting
> 
> The discussion that led to this patch is here [1]. Additional test
> results are available in that thread.
> 
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -595,4 +595,12 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
>  #endif
>  }
>  
> +static inline bool vma_has_recency(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
> +		return false;

I guess it's fairly obvious why these hints imply "doesn't have
recency".  But still, some comments wouldn't hurt!

> +	return true;
> +}
>  #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index 4000e9f017e0..ee72badad847 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1402,8 +1402,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  						force_flush = 1;
>  					}
>  				}
> -				if (pte_young(ptent) &&
> -				    likely(!(vma->vm_flags & VM_SEQ_READ)))
> +				if (pte_young(ptent) && likely(vma_has_recency(vma)))

So we're newly using VM_RAND_READ for the legacy LRU?  Deliberate?  If
so, what are the effects and why?

>  					mark_page_accessed(page);
>  			}
>  			rss[mm_counter(page)]--;
> @@ -5148,8 +5147,8 @@ static inline void mm_account_fault(struct pt_regs *regs,
>  #ifdef CONFIG_LRU_GEN
>  static void lru_gen_enter_fault(struct vm_area_struct *vma)
>  {
> -	/* the LRU algorithm doesn't apply to sequential or random reads */
> -	current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
> +	/* the LRU algorithm only applies to accesses with recency */
> +	current->in_lru_fault = vma_has_recency(vma);
>  }
>  
>  static void lru_gen_exit_fault(void)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8a24b90d9531..9abffdd63a6a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -823,25 +823,14 @@ static bool folio_referenced_one(struct folio *folio,
>  		}
>  
>  		if (pvmw.pte) {
> -			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> -			    !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> +			if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
>  				lru_gen_look_around(&pvmw);
>  				referenced++;
>  			}

I'd expect a call to vma_has_recency() here, but I'll trust you ;)


>  			if (ptep_clear_flush_young_notify(vma, address,
> -						pvmw.pte)) {
> -				/*
> -				 * Don't treat a reference through
> -				 * a sequentially read mapping as such.
> -				 * If the folio has been used in another mapping,
> -				 * we will catch it; if this other mapping is
> -				 * already gone, the unmap path will have set
> -				 * the referenced flag or activated the folio.
> -				 */
> -				if (likely(!(vma->vm_flags & VM_SEQ_READ)))
> -					referenced++;
> -			}
> +						pvmw.pte))
> +				referenced++;
>  		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>  			if (pmdp_clear_flush_young_notify(vma, address,
>  						pvmw.pmd))
> ...
>

The posix_fadvise() manpage will need an update, please.  Not now, but
if/when these changes are heading into mainline.  "merged into
mm-stable" would be a good trigger for this activity.

The legacy LRU has had used-once drop-behind for a long time (Johannes
touched it last).  Have you noticed whether that's all working OK?


* Re: [PATCH mm-unstable v2 1/2] mm: add vma_has_recency()
  2023-01-06  4:00 ` [PATCH mm-unstable v2 1/2] mm: add vma_has_recency() Andrew Morton
@ 2023-01-09 22:51   ` T.J. Alumbaugh
  0 siblings, 0 replies; 4+ messages in thread
From: T.J. Alumbaugh @ 2023-01-09 22:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yu Zhao, Alexander Viro, Andrea Righi, Johannes Weiner,
	Michael Larabel, linux-mm, linux-fsdevel, linux-kernel

On Thu, Jan 5, 2023 at 9:00 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> >
> The posix_fadvise() manpage will need an update, please.  Not now, but
> if/when these changes are heading into mainline.  "merged into
> mm-stable" would be a good trigger for this activity.
>
>

Thanks. Yu is OOO for a while. I think the follow-up tasks (as listed
in [1]) are the man page update, the SVT-AV1 benchmark, and the fio
flag addition. I'll be following up on these soon.

[1] https://lore.kernel.org/linux-mm/CAOUHufZhcJh8PdVtFuoOPWBWw_fWNAB61GndXoWjekYaubXTAQ@mail.gmail.com

