Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH 13/13] resctrl: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the resctrl filesystem.  If these options
are supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 83d3c358f95e..16b110d31457 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2053,7 +2053,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	struct fs_parse_result result;
 	int ret, opt;
 
-	ret = vfs_parse_sb_flag(fc, param);
+	ret = vfs_parse_ro_rw(fc, param);
 	if (ret != -ENOPARAM)
 		return ret;
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 05/13] vfs: don't parse "silent" option
From: Miklos Szeredi @ 2019-06-19 12:54 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel

While this is a standard option as documented in mount(8), it is ignored by
most filesystems.  So reject, unless filesystem explicitly wants to handle
it.

The exception is unconverted filesystems, where it is unknown if the
filesystem handles this or not.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
Changes:
 - v1 didn't return on matching option in legacy_parse_param()
 
 fs/fs_context.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index 49636e541293..bd8f8ab8358b 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -51,7 +51,6 @@ static const struct constant_table common_clear_sb_flag[] = {
 	{ "nolazytime",	SB_LAZYTIME },
 	{ "nomand",	SB_MANDLOCK },
 	{ "rw",		SB_RDONLY },
-	{ "silent",	SB_SILENT },
 };
 
 /*
@@ -535,6 +534,15 @@ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	if (ret != -ENOPARAM)
 		return ret;
 
+	if (strcmp(param->key, "silent") == 0) {
+		if (param->type != fs_value_is_flag)
+			return invalf(fc, "%s: Unexpected value for '%s'",
+				      fc->fs_type->name, param->key);
+
+		fc->sb_flags |= SB_SILENT;
+		return 0;
+	}
+
 	if (strcmp(param->key, "source") == 0) {
 		if (param->type != fs_value_is_string)
 			return invalf(fc, "VFS: Legacy: Non-string source");
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v2 1/5] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-06-19 12:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190610111252.239156-2-minchan@kernel.org>

On Mon 10-06-19 20:12:48, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use.  This could reduce
> workingset eviction so it ends up increasing performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
> 
> It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> 
> 	active file page -> inactive file LRU
> 	active anon page -> inacdtive anon LRU
> 
> Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> file LRU's head because MADV_COLD is a little bit different symantic.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage* so freeing such pages is almost zero
> overhead since we don't need to swap out and access afterward causes just
> minor fault. Thus, it would make sense to put those freeable pages in
> inactive file LRU to compete other used-once pages. It makes sense for
> implmentaion point of view, too because it's not swapbacked memory any
> longer until it would be re-dirtied. Even, it could give a bonus to make
> them be reclaimed on swapless system. However, MADV_COLD doesn't mean
> garbage so reclaiming them requires swap-out/in in the end so it's bigger
> cost. Since we have designed VM LRU aging based on cost-model, anonymous
> cold pages would be better to position inactive anon's LRU list, not file
> LRU. Furthermore, it would help to avoid unnecessary scanning if system
> doesn't have a swap device. Let's start simpler way without adding
> complexity at this moment.

I would only add that it is a caveat that workloads with a lot of page
cache are likely to ignore MADV_COLD on anonymous memory because we
rarely age anonymous LRU lists.

[...]
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{

This is duplicating a large part of madvise_free_pte_range with some
subtle differences which are not explained anywhere (e.g. why does
madvise_free_huge_pmd need try_lock on a page while not here? etc.).

Why cannot we reuse a large part of that code and differ essentially on
the reclaim target check and action? Have you considered to consolidate
the code to share as much as possible? Maybe that is easier said than
done because the devil is always in details...

I would definitely feel much more comfortable to review the code without
thinking about all those subtle details that have been already solved
before. Especially all the THP ones.

Other than that the patch looks sane to me.

> +	struct mmu_gather *tlb = walk->private;
> +	struct mm_struct *mm = tlb->mm;
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		pmd_t orig_pmd;
> +
> +		tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		orig_pmd = *pmd;
> +		if (is_huge_zero_pmd(orig_pmd))
> +			goto huge_unlock;
> +
> +		if (unlikely(!pmd_present(orig_pmd))) {
> +			VM_BUG_ON(thp_migration_supported() &&
> +					!is_pmd_migration_entry(orig_pmd));
> +			goto huge_unlock;
> +		}
> +
> +		page = pmd_page(orig_pmd);
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			if (page_mapcount(page) != 1)
> +				goto huge_unlock;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		if (pmd_young(orig_pmd)) {
> +			pmdp_invalidate(vma, addr, pmd);
> +			orig_pmd = pmd_mkold(orig_pmd);
> +
> +			set_pmd_at(mm, addr, pmd, orig_pmd);
> +			tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> +		}
> +
> +		test_and_clear_page_young(page);
> +		deactivate_page(page);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +regular_page:
> +	tlb_change_page_size(tlb, PAGE_SIZE);
> +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	flush_tlb_batched_pending(mm);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (pte_young(ptent)) {
> +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> +							tlb->fullmm);
> +			ptent = pte_mkold(ptent);
> +			set_pte_at(mm, addr, pte, ptent);
> +			tlb_remove_tlb_entry(tlb, pte, addr);
> +		}
> +
> +		/*
> +		 * We are deactivating a page for accelerating reclaiming.
> +		 * VM couldn't reclaim the page unless we clear PG_young.
> +		 * As a side effect, it makes confuse idle-page tracking
> +		 * because they will miss recent referenced history.
> +		 */
> +		test_and_clear_page_young(page);
> +		deactivate_page(page);
> +	}
> +
> +	arch_enter_lazy_mmu_mode();
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 2/5] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
From: Michal Hocko @ 2019-06-19 13:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190610111252.239156-3-minchan@kernel.org>

On Mon 10-06-19 20:12:49, Minchan Kim wrote:
> The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
> as default. It is for preventing to reclaim dirty pages when CMA try to
> migrate pages. Strictly speaking, we don't need it because CMA didn't allow
> to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
> 
> Moreover, it has a problem to prevent anonymous pages's swap out even
> though force_reclaim = true in shrink_page_list on upcoming patch.
> So this patch makes references's default value to PAGEREF_RECLAIM and
> rename force_reclaim with ignore_references to make it more clear.
> 
> This is a preparatory work for next patch.
> 
> * RFCv1
>  * use ignore_referecnes as parameter name - hannes
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

The code path is quite tricky to follow but the patch looks OK to me.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 84dcb651d05c..0973a46a0472 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				      struct scan_control *sc,
>  				      enum ttu_flags ttu_flags,
>  				      struct reclaim_stat *stat,
> -				      bool force_reclaim)
> +				      bool ignore_references)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> @@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		struct address_space *mapping;
>  		struct page *page;
>  		int may_enter_fs;
> -		enum page_references references = PAGEREF_RECLAIM_CLEAN;
> +		enum page_references references = PAGEREF_RECLAIM;
>  		bool dirty, writeback;
>  		unsigned int nr_pages;
>  
> @@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -		if (!force_reclaim)
> +		if (!ignore_references)
>  			references = page_check_references(page, sc);
>  
>  		switch (references) {
> -- 
> 2.22.0.rc2.383.gf4fbbf30c2-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-06-19 13:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190610111252.239156-5-minchan@kernel.org>

On Mon 10-06-19 20:12:51, Minchan Kim wrote:
[...]
> +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)

Again the same question about a potential code reuse...
[...]
> +regular_page:
> +	tlb_change_page_size(tlb, PAGE_SIZE);
> +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	flush_tlb_batched_pending(mm);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (isolate_lru_page(page))
> +			continue;
> +
> +		isolated++;
> +		if (pte_young(ptent)) {
> +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> +							tlb->fullmm);
> +			ptent = pte_mkold(ptent);
> +			set_pte_at(mm, addr, pte, ptent);
> +			tlb_remove_tlb_entry(tlb, pte, addr);
> +		}
> +		ClearPageReferenced(page);
> +		test_and_clear_page_young(page);
> +		list_add(&page->lru, &page_list);
> +		if (isolated >= SWAP_CLUSTER_MAX) {

Why do we need SWAP_CLUSTER_MAX batching? Especially when we need ...
[...]

> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> +	int nid = -1;
> +	unsigned long nr_reclaimed = 0;
> +	LIST_HEAD(node_page_list);
> +	struct reclaim_stat dummy_stat;
> +	struct scan_control sc = {
> +		.gfp_mask = GFP_KERNEL,
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = 1,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +	};
> +
> +	while (!list_empty(page_list)) {
> +		struct page *page;
> +
> +		page = lru_to_page(page_list);
> +		if (nid == -1) {
> +			nid = page_to_nid(page);
> +			INIT_LIST_HEAD(&node_page_list);
> +		}
> +
> +		if (nid == page_to_nid(page)) {
> +			list_move(&page->lru, &node_page_list);
> +			continue;
> +		}
> +
> +		nr_reclaimed += shrink_page_list(&node_page_list,
> +						NODE_DATA(nid),
> +						&sc, 0,
> +						&dummy_stat, false);

per-node batching in fact. Other than that nothing really jumped at me.
Except for the shared page cache side channel timing aspect not being
considered AFAICS. To be more specific. Pushing out a shared page cache
is possible even now but this interface gives a much easier tool to
evict shared state and perform all sorts of timing attacks. Unless I am
missing something we should be doing something similar to mincore and
ignore shared pages without a writeable access or at least document why
we do not care.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
From: Joel Fernandes @ 2019-06-19 17:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton
In-Reply-To: <20190613044824.GF55602@google.com>

Sorry Minchan for late reply, I had a visa interview travel and then the weekend
came. I replied below:

On Thu, Jun 13, 2019 at 01:48:24PM +0900, Minchan Kim wrote:
> On Wed, Jun 12, 2019 at 01:21:04PM -0400, Joel Fernandes wrote:
> > On Mon, Jun 10, 2019 at 07:09:04PM +0900, Minchan Kim wrote:
> > > Hi Joel,
> > > 
> > > On Tue, Jun 04, 2019 at 04:38:41PM -0400, Joel Fernandes wrote:
> > > > On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> > > > > When a process expects no accesses to a certain memory range, it could
> > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > happens but data should be preserved for future use.  This could reduce
> > > > > workingset eviction so it ends up increasing performance.
> > > > > 
> > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > pages to evict early during memory pressure.
> > > > > 
> > > > > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > > > > 
> > > > > 	active file page -> inactive file LRU
> > > > > 	active anon page -> inacdtive anon LRU
> > > > > 
> > > > > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > > > > files's head because MADV_COLD is a little bit different symantic.
> > > > > MADV_FREE means it's okay to discard when the memory pressure because
> > > > > the content of the page is *garbage* so freeing such pages is almost zero
> > > > > overhead since we don't need to swap out and access afterward causes just
> > > > > minor fault. Thus, it would make sense to put those freeable pages in
> > > > > inactive file LRU to compete other used-once pages. Even, it could
> > > > > give a bonus to make them be reclaimed on swapless system. However,
> > > > > MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> > > > > in the end. So it's better to move inactive anon's LRU list, not file LRU.
> > > > > Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> > > > > if system doesn't have a swap device.
> > > > > 
> > > > > All of error rule is same with MADV_DONTNEED.
> > > > > 
> > > > > Note:
> > > > > This hint works with only private pages(IOW, page_mapcount(page) < 2)
> > > > > because shared page could have more chance to be accessed from other
> > > > > processes sharing the page although the caller reset the reference bits.
> > > > > It ends up preventing the reclaim of the page and wastes CPU cycle.
> > > > > 
> > > > > * RFCv2
> > > > >  * add more description - mhocko
> > > > > 
> > > > > * RFCv1
> > > > >  * renaming from MADV_COOL to MADV_COLD - hannes
> > > > > 
> > > > > * internal review
> > > > >  * use clear_page_youn in deactivate_page - joelaf
> > > > >  * Revise the description - surenb
> > > > >  * Renaming from MADV_WARM to MADV_COOL - surenb
> > > > > 
> > > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > > ---
> > > > >  include/linux/page-flags.h             |   1 +
> > > > >  include/linux/page_idle.h              |  15 ++++
> > > > >  include/linux/swap.h                   |   1 +
> > > > >  include/uapi/asm-generic/mman-common.h |   1 +
> > > > >  mm/internal.h                          |   2 +-
> > > > >  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
> > > > >  mm/oom_kill.c                          |   2 +-
> > > > >  mm/swap.c                              |  43 +++++++++
> > > > >  8 files changed, 176 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > > > > index 9f8712a4b1a5..58b06654c8dd 100644
> > > > > --- a/include/linux/page-flags.h
> > > > > +++ b/include/linux/page-flags.h
> > > > > @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
> > > > >  TESTPAGEFLAG(Young, young, PF_ANY)
> > > > >  SETPAGEFLAG(Young, young, PF_ANY)
> > > > >  TESTCLEARFLAG(Young, young, PF_ANY)
> > > > > +CLEARPAGEFLAG(Young, young, PF_ANY)
> > > > >  PAGEFLAG(Idle, idle, PF_ANY)
> > > > >  #endif
> > > > >  
> > > > > diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> > > > > index 1e894d34bdce..f3f43b317150 100644
> > > > > --- a/include/linux/page_idle.h
> > > > > +++ b/include/linux/page_idle.h
> > > > > @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
> > > > >  	SetPageYoung(page);
> > > > >  }
> > > > >  
> > > > > +static inline void clear_page_young(struct page *page)
> > > > > +{
> > > > > +	ClearPageYoung(page);
> > > > > +}
> > > > > +
> > > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > > >  {
> > > > >  	return TestClearPageYoung(page);
> > > > > @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
> > > > >  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > >  }
> > > > >  
> > > > > +static void clear_page_young(struct page *page)
> > > > > +{
> > > > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > > > +
> > > > > +	if (unlikely(!page_ext))
> > > > > +		return;
> > > > > +
> > > > > +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> > > > > +}
> > > > > +
> > > > >  static inline bool test_and_clear_page_young(struct page *page)
> > > > >  {
> > > > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index de2c67a33b7e..0ce997edb8bb 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
> > > > >  extern void lru_add_drain_all(void);
> > > > >  extern void rotate_reclaimable_page(struct page *page);
> > > > >  extern void deactivate_file_page(struct page *page);
> > > > > +extern void deactivate_page(struct page *page);
> > > > >  extern void mark_page_lazyfree(struct page *page);
> > > > >  extern void swap_setup(void);
> > > > >  
> > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > > > > index bea0278f65ab..1190f4e7f7b9 100644
> > > > > --- a/include/uapi/asm-generic/mman-common.h
> > > > > +++ b/include/uapi/asm-generic/mman-common.h
> > > > > @@ -43,6 +43,7 @@
> > > > >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> > > > >  #define MADV_WILLNEED	3		/* will need these pages */
> > > > >  #define MADV_DONTNEED	4		/* don't need these pages */
> > > > > +#define MADV_COLD	5		/* deactivatie these pages */
> > > > >  
> > > > >  /* common parameters: try to keep these consistent across architectures */
> > > > >  #define MADV_FREE	8		/* free pages only if memory pressure */
> > > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > > index 9eeaf2b95166..75a4f96ec0fb 100644
> > > > > --- a/mm/internal.h
> > > > > +++ b/mm/internal.h
> > > > > @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
> > > > >  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> > > > >  		unsigned long floor, unsigned long ceiling);
> > > > >  
> > > > > -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> > > > > +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
> > > > >  {
> > > > >  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
> > > > >  }
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 628022e674a7..ab158766858a 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
> > > > >  	case MADV_REMOVE:
> > > > >  	case MADV_WILLNEED:
> > > > >  	case MADV_DONTNEED:
> > > > > +	case MADV_COLD:
> > > > >  	case MADV_FREE:
> > > > >  		return 0;
> > > > >  	default:
> > > > > @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > > > >  	return 0;
> > > > >  }
> > > > >  
> > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > +				unsigned long end, struct mm_walk *walk)
> > > > > +{
> > > > > +	pte_t *orig_pte, *pte, ptent;
> > > > > +	spinlock_t *ptl;
> > > > > +	struct page *page;
> > > > > +	struct vm_area_struct *vma = walk->vma;
> > > > > +	unsigned long next;
> > > > > +
> > > > > +	next = pmd_addr_end(addr, end);
> > > > > +	if (pmd_trans_huge(*pmd)) {
> > > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > +		if (!ptl)
> > > > > +			return 0;
> > > > > +
> > > > > +		if (is_huge_zero_pmd(*pmd))
> > > > > +			goto huge_unlock;
> > > > > +
> > > > > +		page = pmd_page(*pmd);
> > > > > +		if (page_mapcount(page) > 1)
> > > > > +			goto huge_unlock;
> > > > > +
> > > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > > +			int err;
> > > > > +
> > > > > +			get_page(page);
> > > > > +			spin_unlock(ptl);
> > > > > +			lock_page(page);
> > > > > +			err = split_huge_page(page);
> > > > > +			unlock_page(page);
> > > > > +			put_page(page);
> > > > > +			if (!err)
> > > > > +				goto regular_page;
> > > > > +			return 0;
> > > > > +		}
> > > > > +
> > > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > +		deactivate_page(page);
> > > > > +huge_unlock:
> > > > > +		spin_unlock(ptl);
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > > +	if (pmd_trans_unstable(pmd))
> > > > > +		return 0;
> > > > > +
> > > > > +regular_page:
> > > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > +		ptent = *pte;
> > > > > +
> > > > > +		if (pte_none(ptent))
> > > > > +			continue;
> > > > > +
> > > > > +		if (!pte_present(ptent))
> > > > > +			continue;
> > > > > +
> > > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > > +		if (!page)
> > > > > +			continue;
> > > > > +
> > > > > +		if (page_mapcount(page) > 1)
> > > > > +			continue;
> > > > > +
> > > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > > 
> > > > Wondering here how it interacts with idle page tracking. Here since young
> > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > > if it was previously set since it does not know any more that a page was
> > > > actively referenced.
> > > 
> > > ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> > > doesn't affect.
> 
> You said *young flag* in the comment, which made me confused. I thought you meant
> PG_young flag but you mean PTE access bit.
> 
> > 
> > Clearing of the young bit in the PTE does affect idle tracking.
> > 
> > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> > 
> > > > bit was previously set, just so that page-idle tracking works smoothly when
> > > > this hint is concurrently applied?
> > > 
> > > deactivate_page will remove PG_young bit so that the page will be reclaimed.
> > > Do I miss your point?
> > 
> > Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
> > is set. Now the page is accessed from userspace thus setting the accessed PTE
> > bit.  Now a remote process passes this process_madvise cold hint (I know your
> > current series does not support remote process, but I am saying for future
> > when you post this). Because you cleared the PTE accessed bit through the
> > hint, idle tracking no longer will know that the page is referenced and the
> > user gets confused because accessed page appears to be idle.
> 
> Right.
> 
> > 
> > I think to fix this, what you should do is clear the PG_Idle flag if the
> > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > don't need to do anything.
> 
> I'm not sure. What does it make MADV_COLD special?
> How about MADV_FREE|MADV_DONTNEED?
> Why don't they clear PG_Idle if pte was young at tearing down pte? 

Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?

> The page could be shared by other processes so if we miss to clear out
> PG_idle in there, page idle tracking could miss the access history forever.

I did not understand this. So say a page X is shared process P and Q and
assume the PG_idle flag is set on the page.

P accesses memory and has the pte accessed bit set. P now gets the MADV_COLD
hint and forgets to clear the idle flag while clearing the pte accessed bit.

Now the page appears to be idle, even though it was not. This has nothing to
do with Q and whether the page is shared or not.

> If it's not what you want, maybe we need to fix all places all at once.
> However, I'm not sure. Rather than, I want to keep PG_idle in those hints
> even though pte was accesssed because the process now gives strong hint
> "The page is idle from now on". It's valid because he knows himself better than
> others, even admin. IOW, he declare the page is not workingset any more.

Even if the PG_idle flag is not cleared - it is not a strong hint for working
set size IMHO, because the page *was* accessed so the process definitely needed the
page at some point even though now it says it is MADV_COLD. So that is part
of working set. I don't think we should implicitly provide such hints and we
should fix it.

Also I was saying in previous email, if process_madvise (future extension) is
called from say activity manager, then the process and the user running the
idle tracking feature has no idea that the page was accessed because the idle
flag is still set. That is a bit weird and is loss of information.

It may not be a big deal in the long run if the page is accessed a lot, since
the PTE accessed bit will be set again and idle-tracking feature may not miss
it, but why leave it to chance if it is a simple fix?

> What's the problem if page idle tracking feature miss it?

What's the problem if PG_idle flag is cleared here? It is just a software
flag.

> If other processs still have access bit of their pte for the page, page idle
> tracking could find the page as non-idle so it's no problem, either.

Yes, but if the other process also does not access the page, then the access
information is lost.

thanks!

 - Joel

^ permalink raw reply

* Re: [PATCH v4 14/16] ext4: add basic fs-verity support
From: Eric Biggers @ 2019-06-19 19:13 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J . Wong, linux-api, Dave Chinner, linux-f2fs-devel,
	linux-fscrypt, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190619030522.GA28351@mit.edu>

On Tue, Jun 18, 2019 at 11:05:22PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 18, 2019 at 04:41:34PM -0700, Eric Biggers wrote:
> > 
> > I don't think your proposed solution is so simple.  By definition the last
> > extent ends on a filesystem block boundary, while the Merkle tree ends on a
> > Merkle tree block boundary.  In the future we might support the case where these
> > differ, so we don't want to preclude that in the on-disk format we choose now.
> > Therefore, just storing the desc_size isn't enough; we'd actually have to store
> > (desc_pos, desc_size), like I'm doing in the xattr.
> 
> I don't think any of this matters much, since what you're describing
> above is all about the Merkle tree, and that doesn't affect how we
> find the fsverity descriptor information.  We can just say that
> fsverity descriptor block begins on the next file system block
> boundary after the Merkle tree.  And in the case where say, the Merkle
> tree is 4k and the file system block size is 64k, that's fine --- the
> fs descriptor would just begin at the next 64k (fs blocksize)
> boundary.
> 

Sure, that works.

I implemented this for ext4 and extents only, and it does work, though it's a
bit more complex than the xattr solution -- about 70 extra lines of code
including comments.  See diff for fs/ext4/verity.c below.

But we can go with it if you think it's worthwhile to avoid using xattrs at all.

diff --git a/fs/ext4/verity.c b/fs/ext4/verity.c
index 6333b9dd2dff2a..9ae89489f01bf3 100644
--- a/fs/ext4/verity.c
+++ b/fs/ext4/verity.c
@@ -9,7 +9,7 @@
  * Implementation of fsverity_operations for ext4.
  *
  * ext4 stores the verity metadata (Merkle tree and fsverity_descriptor) past
- * the end of the file, starting at the first page fully beyond i_size.  This
+ * the end of the file, starting at the first 64K boundary beyond i_size.  This
  * approach works because (a) verity files are readonly, and (b) pages fully
  * beyond i_size aren't visible to userspace but can be read/written internally
  * by ext4 with only some relatively small changes to ext4.  This approach
@@ -17,13 +17,22 @@
  * ext4's xattr support to support paging multi-gigabyte xattrs into memory, and
  * to support encrypting xattrs.  Note that the verity metadata *must* be
  * encrypted when the file is, since it contains hashes of the plaintext data.
+ *
+ * Using a 64K boundary rather than a 4K one keeps things ready for
+ * architectures with 64K pages, and it doesn't necessarily waste space on-disk
+ * since there can be a hole between i_size and the start of the Merkle tree.
  */
 
 #include <linux/quotaops.h>
 
 #include "ext4.h"
+#include "ext4_extents.h"
 #include "ext4_jbd2.h"
-#include "xattr.h"
+
+static inline loff_t ext4_verity_metadata_pos(const struct inode *inode)
+{
+	return round_up(inode->i_size, 65536);
+}
 
 /*
  * Read some verity metadata from the inode.  __vfs_read() can't be used because
@@ -32,8 +41,6 @@
 static int pagecache_read(struct inode *inode, void *buf, size_t count,
 			  loff_t pos)
 {
-	const size_t orig_count = count;
-
 	while (count) {
 		size_t n = min_t(size_t, count,
 				 PAGE_SIZE - offset_in_page(pos));
@@ -55,7 +62,7 @@ static int pagecache_read(struct inode *inode, void *buf, size_t count,
 		pos += n;
 		count -= n;
 	}
-	return orig_count;
+	return 0;
 }
 
 /*
@@ -96,22 +103,10 @@ static int pagecache_write(struct inode *inode, const void *buf, size_t count,
 	return 0;
 }
 
-/*
- * Format of ext4 verity xattr.  This points to the location of the verity
- * descriptor within the file data rather than containing it directly because
- * the verity descriptor *must* be encrypted when ext4 encryption is used.  But,
- * ext4 encryption does not encrypt xattrs.
- */
-struct fsverity_descriptor_location {
-	__le32 version;
-	__le32 size;
-	__le64 pos;
-};
-
 static int ext4_begin_enable_verity(struct file *filp)
 {
 	struct inode *inode = file_inode(filp);
-	int credits = 2; /* superblock and inode for ext4_orphan_add() */
+	const int credits = 2; /* superblock and inode for ext4_orphan_add() */
 	handle_t *handle;
 	int err;
 
@@ -119,10 +114,24 @@ static int ext4_begin_enable_verity(struct file *filp)
 	if (err)
 		return err;
 
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+		ext4_warning_inode(inode,
+				   "verity is only allowed on extent-based files");
+		return -EINVAL;
+	}
+
 	err = ext4_inode_attach_jinode(inode);
 	if (err)
 		return err;
 
+	/*
+	 * ext4 uses the last allocated block to find the verity descriptor, so
+	 * we must remove any other blocks which might confuse things.
+	 */
+	err = ext4_truncate(inode);
+	if (err)
+		return err;
+
 	err = dquot_initialize(inode);
 	if (err)
 		return err;
@@ -139,32 +148,55 @@ static int ext4_begin_enable_verity(struct file *filp)
 	return err;
 }
 
+/*
+ * ext4 stores the verity descriptor beginning on the next filesystem block
+ * boundary after the Merkle tree.  Then, the descriptor size is stored in the
+ * last 4 bytes of the last allocated filesystem block --- which is either the
+ * block in which the descriptor ends, or the next block after that if there
+ * weren't at least 4 bytes remaining.
+ *
+ * We can't simply store the descriptor in an xattr because it *must* be
+ * encrypted when ext4 encryption is used, but ext4 encryption doesn't encrypt
+ * xattrs.  Also, if the descriptor includes a large signature blob it may be
+ * too large to store in an xattr without the EA_INODE feature.
+ */
+static int ext4_write_verity_descriptor(struct inode *inode, const void *desc,
+					size_t desc_size, u64 merkle_tree_size)
+{
+	const u64 desc_pos = round_up(ext4_verity_metadata_pos(inode) +
+				      merkle_tree_size, i_blocksize(inode));
+	const u64 desc_end = desc_pos + desc_size;
+	const __le32 desc_size_disk = cpu_to_le32(desc_size);
+	const u64 desc_size_pos = round_up(desc_end + sizeof(desc_size_disk),
+					   i_blocksize(inode)) -
+				  sizeof(desc_size_disk);
+	int err;
+
+	err = pagecache_write(inode, desc, desc_size, desc_pos);
+	if (err)
+		return err;
+
+	return pagecache_write(inode, &desc_size_disk, sizeof(desc_size_disk),
+			       desc_size_pos);
+}
+
 static int ext4_end_enable_verity(struct file *filp, const void *desc,
 				  size_t desc_size, u64 merkle_tree_size)
 {
 	struct inode *inode = file_inode(filp);
-	u64 desc_pos = round_up(inode->i_size, PAGE_SIZE) + merkle_tree_size;
-	struct fsverity_descriptor_location dloc = {
-		.version = cpu_to_le32(1),
-		.size = cpu_to_le32(desc_size),
-		.pos = cpu_to_le64(desc_pos),
-	};
-	int credits = 0;
+	const int credits = 2; /* superblock and inode for ext4_orphan_add() */
 	handle_t *handle;
 	int err1 = 0;
 	int err;
 
 	if (desc != NULL) {
 		/* Succeeded; write the verity descriptor. */
-		err1 = pagecache_write(inode, desc, desc_size, desc_pos);
+		err1 = ext4_write_verity_descriptor(inode, desc, desc_size,
+						    merkle_tree_size);
 
 		/* Write all pages before clearing VERITY_IN_PROGRESS. */
 		if (!err1)
 			err1 = filemap_write_and_wait(inode->i_mapping);
-
-		if (!err1)
-			err1 = ext4_xattr_set_credits(inode, sizeof(dloc), true,
-						      &credits);
 	} else {
 		/* Failed; truncate anything we wrote past i_size. */
 		ext4_truncate(inode);
@@ -173,14 +205,12 @@ static int ext4_end_enable_verity(struct file *filp, const void *desc,
 	/*
 	 * We must always clean up by clearing EXT4_STATE_VERITY_IN_PROGRESS and
 	 * deleting the inode from the orphan list, even if something failed.
-	 * If everything succeeded, we'll also set the verity bit and descriptor
-	 * location xattr in the same transaction.
+	 * If everything succeeded, we'll also set the verity bit in the same
+	 * transaction.
 	 */
 
 	ext4_clear_inode_state(inode, EXT4_STATE_VERITY_IN_PROGRESS);
 
-	credits += 2; /* superblock and inode for ext4_orphan_del() */
-
 	handle = ext4_journal_start(inode, EXT4_HT_INODE, credits);
 	if (IS_ERR(handle)) {
 		ext4_orphan_del(NULL, inode);
@@ -194,13 +224,6 @@ static int ext4_end_enable_verity(struct file *filp, const void *desc,
 	if (desc != NULL && !err1) {
 		struct ext4_iloc iloc;
 
-		err = ext4_xattr_set_handle(handle, inode,
-					    EXT4_XATTR_INDEX_VERITY,
-					    EXT4_XATTR_NAME_VERITY,
-					    &dloc, sizeof(dloc), XATTR_CREATE);
-		if (err)
-			goto out_stop;
-
 		err = ext4_reserve_inode_write(handle, inode, &iloc);
 		if (err)
 			goto out_stop;
@@ -213,43 +236,103 @@ static int ext4_end_enable_verity(struct file *filp, const void *desc,
 	return err ?: err1;
 }
 
-static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
-				      size_t buf_size)
+static int ext4_get_verity_descriptor_location(struct inode *inode,
+					       size_t *desc_size_ret,
+					       u64 *desc_pos_ret)
 {
-	struct fsverity_descriptor_location dloc;
-	int res;
-	u32 size;
-	u64 pos;
-
-	/* Get the descriptor location */
-	res = ext4_xattr_get(inode, EXT4_XATTR_INDEX_VERITY,
-			     EXT4_XATTR_NAME_VERITY, &dloc, sizeof(dloc));
-	if (res < 0 && res != -ERANGE)
-		return res;
-	if (res != sizeof(dloc) || dloc.version != cpu_to_le32(1)) {
-		ext4_warning_inode(inode, "unknown verity xattr format");
-		return -EINVAL;
+	struct ext4_ext_path *path;
+	struct ext4_extent *last_extent;
+	u32 end_lblk;
+	u64 desc_size_pos;
+	__le32 desc_size_disk;
+	u32 desc_size;
+	u64 desc_pos;
+	int err;
+
+	/*
+	 * Descriptor size is in last 4 bytes of last allocated block.
+	 * See ext4_write_verity_descriptor().
+	 */
+
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+		EXT4_ERROR_INODE(inode, "verity file doesn't use extents");
+		return -EFSCORRUPTED;
 	}
-	size = le32_to_cpu(dloc.size);
-	pos = le64_to_cpu(dloc.pos);
 
-	/* Get the descriptor */
-	if (pos + size < pos || pos + size > inode->i_sb->s_maxbytes ||
-	    pos < round_up(inode->i_size, PAGE_SIZE) || size > INT_MAX) {
-		ext4_warning_inode(inode, "invalid verity xattr");
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL, 0);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+
+	last_extent = path[path->p_depth].p_ext;
+	if (!last_extent) {
+		EXT4_ERROR_INODE(inode, "verity file has no extents");
+		ext4_ext_drop_refs(path);
+		kfree(path);
 		return -EFSCORRUPTED;
 	}
-	if (buf_size == 0)
-		return size;
-	if (size > buf_size)
-		return -ERANGE;
-	return pagecache_read(inode, buf, size, pos);
+
+	end_lblk = le32_to_cpu(last_extent->ee_block) +
+		   ext4_ext_get_actual_len(last_extent);
+	desc_size_pos = (u64)end_lblk << inode->i_blkbits;
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	if (desc_size_pos < sizeof(desc_size_disk))
+		goto bad;
+	desc_size_pos -= sizeof(desc_size_disk);
+
+	err = pagecache_read(inode, &desc_size_disk, sizeof(desc_size_disk),
+			     desc_size_pos);
+	if (err)
+		return err;
+	desc_size = le32_to_cpu(desc_size_disk);
+
+	/*
+	 * The descriptor is stored just before the desc_size_disk, but starting
+	 * on a filesystem block boundary.
+	 */
+
+	if (desc_size > INT_MAX || desc_size > desc_size_pos)
+		goto bad;
+
+	desc_pos = round_down(desc_size_pos - desc_size, i_blocksize(inode));
+	if (desc_pos < ext4_verity_metadata_pos(inode))
+		goto bad;
+
+	*desc_size_ret = desc_size;
+	*desc_pos_ret = desc_pos;
+	return 0;
+
+bad:
+	EXT4_ERROR_INODE(inode, "verity file corrupted; can't find descriptor");
+	return -EFSCORRUPTED;
+}
+
+static int ext4_get_verity_descriptor(struct inode *inode, void *buf,
+				      size_t buf_size)
+{
+	size_t desc_size = 0;
+	u64 desc_pos = 0;
+	int err;
+
+	err = ext4_get_verity_descriptor_location(inode, &desc_size, &desc_pos);
+	if (err)
+		return err;
+
+	if (buf_size) {
+		if (desc_size > buf_size)
+			return -ERANGE;
+		err = pagecache_read(inode, buf, desc_size, desc_pos);
+		if (err)
+			return err;
+	}
+	return desc_size;
 }
 
 static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 					       pgoff_t index)
 {
-	index += DIV_ROUND_UP(inode->i_size, PAGE_SIZE);
+	index += ext4_verity_metadata_pos(inode) >> PAGE_SHIFT;
 
 	return read_mapping_page(inode->i_mapping, index, NULL);
 }
@@ -257,8 +340,7 @@ static struct page *ext4_read_merkle_tree_page(struct inode *inode,
 static int ext4_write_merkle_tree_block(struct inode *inode, const void *buf,
 					u64 index, int log_blocksize)
 {
-	loff_t pos = round_up(inode->i_size, PAGE_SIZE) +
-		     (index << log_blocksize);
+	loff_t pos = ext4_verity_metadata_pos(inode) + (index << log_blocksize);
 
 	return pagecache_write(inode, buf, 1 << log_blocksize, pos);
 }

^ permalink raw reply related

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Minchan Kim @ 2019-06-19 23:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190619122750.GN2968@dhcp22.suse.cz>

On Wed, Jun 19, 2019 at 02:27:50PM +0200, Michal Hocko wrote:
> On Mon 10-06-19 20:12:47, Minchan Kim wrote:
> > This patch is part of previous series:
> > https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/T/#u
> > Originally, it was created for external madvise hinting feature.
> > 
> > https://lkml.org/lkml/2019/5/31/463
> > Michal wanted to separte the discussion from external hinting interface
> > so this patchset includes only first part of my entire patchset
> > 
> >   - introduce MADV_COLD and MADV_PAGEOUT hint to madvise.
> > 
> > However, I keep entire description for others for easier understanding
> > why this kinds of hint was born.
> > 
> > Thanks.
> > 
> > This patchset is against on next-20190530.
> > 
> > Below is description of previous entire patchset.
> > ================= &< =====================
> > 
> > - Background
> > 
> > The Android terminology used for forking a new process and starting an app
> > from scratch is a cold start, while resuming an existing app is a hot start.
> > While we continually try to improve the performance of cold starts, hot
> > starts will always be significantly less power hungry as well as faster so
> > we are trying to make hot start more likely than cold start.
> > 
> > To increase hot start, Android userspace manages the order that apps should
> > be killed in a process called ActivityManagerService. ActivityManagerService
> > tracks every Android app or service that the user could be interacting with
> > at any time and translates that into a ranked list for lmkd(low memory
> > killer daemon). They are likely to be killed by lmkd if the system has to
> > reclaim memory. In that sense they are similar to entries in any other cache.
> > Those apps are kept alive for opportunistic performance improvements but
> > those performance improvements will vary based on the memory requirements of
> > individual workloads.
> > 
> > - Problem
> > 
> > Naturally, cached apps were dominant consumers of memory on the system.
> > However, they were not significant consumers of swap even though they are
> > good candidate for swap. Under investigation, swapping out only begins
> > once the low zone watermark is hit and kswapd wakes up, but the overall
> > allocation rate in the system might trip lmkd thresholds and cause a cached
> > process to be killed(we measured performance swapping out vs. zapping the
> > memory by killing a process. Unsurprisingly, zapping is 10x times faster
> > even though we use zram which is much faster than real storage) so kill
> > from lmkd will often satisfy the high zone watermark, resulting in very
> > few pages actually being moved to swap.
> > 
> > - Approach
> > 
> > The approach we chose was to use a new interface to allow userspace to
> > proactively reclaim entire processes by leveraging platform information.
> > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> > that are known to be cold from userspace and to avoid races with lmkd
> > by reclaiming apps as soon as they entered the cached state. Additionally,
> > it could provide many chances for platform to use much information to
> > optimize memory efficiency.
> > 
> > To achieve the goal, the patchset introduce two new options for madvise.
> > One is MADV_COLD which will deactivate activated pages and the other is
> > MADV_PAGEOUT which will reclaim private pages instantly. These new options
> > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> > gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way
> > that it hints the kernel that memory region is not currently needed and
> > should be reclaimed when memory pressure rises.
> 
> This all is a very good background information suitable for the cover
> letter.
> 
> > This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> > information required to make the reclaim decision is not known to the app.
> > Instead, it is known to a centralized userspace daemon, and that daemon
> > must be able to initiate reclaim on its own without any app involvement.
> > To solve the concern, this patch introduces new syscall -
> > 
> >     struct pr_madvise_param {
> >             int size;               /* the size of this structure */
> >             int cookie;             /* reserved to support atomicity */
> >             int nr_elem;            /* count of below arrary fields */
> >             int __user *hints;      /* hints for each range */
> >             /* to store result of each operation */
> >             const struct iovec __user *results;
> >             /* input address ranges */
> >             const struct iovec __user *ranges;
> >     };
> >     
> >     int process_madvise(int pidfd, struct pr_madvise_param *u_param,
> >                             unsigned long flags);
> 
> But this and the following paragraphs are referring to the later step
> when the madvise gains a remote process capabilities and that is out
> of the scope of this patch series so I would simply remove it from
> here. Andrew tends to put the cover letter into the first patch of the
> series and that would be indeed
> confusing here.

Okay, I will remove the part in next revision.

^ permalink raw reply

* Re: [PATCH v2 1/5] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-20  0:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190619125611.GO2968@dhcp22.suse.cz>

On Wed, Jun 19, 2019 at 02:56:12PM +0200, Michal Hocko wrote:
> On Mon 10-06-19 20:12:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use.  This could reduce
> > workingset eviction so it ends up increasing performance.
> > 
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> > 
> > It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> > 
> > 	active file page -> inactive file LRU
> > 	active anon page -> inacdtive anon LRU
> > 
> > Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> > file LRU's head because MADV_COLD is a little bit different symantic.
> > MADV_FREE means it's okay to discard when the memory pressure because
> > the content of the page is *garbage* so freeing such pages is almost zero
> > overhead since we don't need to swap out and access afterward causes just
> > minor fault. Thus, it would make sense to put those freeable pages in
> > inactive file LRU to compete other used-once pages. It makes sense for
> > implmentaion point of view, too because it's not swapbacked memory any
> > longer until it would be re-dirtied. Even, it could give a bonus to make
> > them be reclaimed on swapless system. However, MADV_COLD doesn't mean
> > garbage so reclaiming them requires swap-out/in in the end so it's bigger
> > cost. Since we have designed VM LRU aging based on cost-model, anonymous
> > cold pages would be better to position inactive anon's LRU list, not file
> > LRU. Furthermore, it would help to avoid unnecessary scanning if system
> > doesn't have a swap device. Let's start simpler way without adding
> > complexity at this moment.
> 
> I would only add that it is a caveat that workloads with a lot of page
> cache are likely to ignore MADV_COLD on anonymous memory because we
> rarely age anonymous LRU lists.

Okay, I will add some more.

> 
> [...]
> > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +{
> 
> This is duplicating a large part of madvise_free_pte_range with some
> subtle differences which are not explained anywhere (e.g. why does
> madvise_free_huge_pmd need try_lock on a page while not here? etc.).

madvise_free_huge_pmd handle dirty bit but this is not.

> 
> Why cannot we reuse a large part of that code and differ essentially on
> the reclaim target check and action? Have you considered to consolidate
> the code to share as much as possible? Maybe that is easier said than
> done because the devil is always in details...

Yub, it was not pretty when I tried. Please see last patch in this
patchset.

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-06-20  4:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190619132450.GQ2968@dhcp22.suse.cz>

On Wed, Jun 19, 2019 at 03:24:50PM +0200, Michal Hocko wrote:
> On Mon 10-06-19 20:12:51, Minchan Kim wrote:
> [...]
> > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> 
> Again the same question about a potential code reuse...
> [...]
> > +regular_page:
> > +	tlb_change_page_size(tlb, PAGE_SIZE);
> > +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +	flush_tlb_batched_pending(mm);
> > +	arch_enter_lazy_mmu_mode();
> > +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (isolate_lru_page(page))
> > +			continue;
> > +
> > +		isolated++;
> > +		if (pte_young(ptent)) {
> > +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +							tlb->fullmm);
> > +			ptent = pte_mkold(ptent);
> > +			set_pte_at(mm, addr, pte, ptent);
> > +			tlb_remove_tlb_entry(tlb, pte, addr);
> > +		}
> > +		ClearPageReferenced(page);
> > +		test_and_clear_page_young(page);
> > +		list_add(&page->lru, &page_list);
> > +		if (isolated >= SWAP_CLUSTER_MAX) {
> 
> Why do we need SWAP_CLUSTER_MAX batching? Especially when we need ...
> [...]

It aims for preventing early OOM kill since we isolate too many LRU
pages concurrently.

> 
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > +	int nid = -1;
> > +	unsigned long nr_reclaimed = 0;
> > +	LIST_HEAD(node_page_list);
> > +	struct reclaim_stat dummy_stat;
> > +	struct scan_control sc = {
> > +		.gfp_mask = GFP_KERNEL,
> > +		.priority = DEF_PRIORITY,
> > +		.may_writepage = 1,
> > +		.may_unmap = 1,
> > +		.may_swap = 1,
> > +	};
> > +
> > +	while (!list_empty(page_list)) {
> > +		struct page *page;
> > +
> > +		page = lru_to_page(page_list);
> > +		if (nid == -1) {
> > +			nid = page_to_nid(page);
> > +			INIT_LIST_HEAD(&node_page_list);
> > +		}
> > +
> > +		if (nid == page_to_nid(page)) {
> > +			list_move(&page->lru, &node_page_list);
> > +			continue;
> > +		}
> > +
> > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > +						NODE_DATA(nid),
> > +						&sc, 0,
> > +						&dummy_stat, false);
> 
> per-node batching in fact. Other than that nothing really jumped at me.
> Except for the shared page cache side channel timing aspect not being
> considered AFAICS. To be more specific. Pushing out a shared page cache
> is possible even now but this interface gives a much easier tool to
> evict shared state and perform all sorts of timing attacks. Unless I am
> missing something we should be doing something similar to mincore and
> ignore shared pages without a writeable access or at least document why
> we do not care.

I'm not sure IIUC side channel attach. As you mentioned, without this syscall,
1. they already can do that simply by memory hogging
2. If we need fix MADV_PAGEOUT, that means we need to fix MADV_DONTNEED, too?

^ permalink raw reply

* Re: [PATCH 05/13] vfs: don't parse "silent" option
From: Ian Kent @ 2019-06-20  4:40 UTC (permalink / raw)
  To: Miklos Szeredi, David Howells
  Cc: Al Viro, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-5-mszeredi@redhat.com>

On Wed, 2019-06-19 at 14:30 +0200, Miklos Szeredi wrote:
> While this is a standard option as documented in mount(8), it is ignored by
> most filesystems.  So reject, unless filesystem explicitly wants to handle
> it.
> 
> The exception is unconverted filesystems, where it is unknown if the
> filesystem handles this or not.
> 
> Any implementation, such as mount(8) that needs to parse this option
> without failing should simply ignore the return value from fsconfig().

In theory this is fine but every time someone has attempted
to change the handling of this in the past autofs has had
problems so I'm a bit wary of the change.

It was originally meant to tell the file system to ignore
invalid options such as could be found in automount maps that
are used with multiple OS implementations that have differences
in their options.

That was, IIRC, primarily NFS although NFS should handle most
(if not all of those) cases these days.

Nevertheless I'm a bit nervous about it, ;)

> 
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/fs_context.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs_context.c b/fs/fs_context.c
> index 49636e541293..c26b353aa858 100644
> --- a/fs/fs_context.c
> +++ b/fs/fs_context.c
> @@ -51,7 +51,6 @@ static const struct constant_table common_clear_sb_flag[] =
> {
>  	{ "nolazytime",	SB_LAZYTIME },
>  	{ "nomand",	SB_MANDLOCK },
>  	{ "rw",		SB_RDONLY },
> -	{ "silent",	SB_SILENT },
>  };
>  
>  /*
> @@ -535,6 +534,9 @@ static int legacy_parse_param(struct fs_context *fc,
> struct fs_parameter *param)
>  	if (ret != -ENOPARAM)
>  		return ret;
>  
> +	if (strcmp(param->key, "silent") == 0)
> +		fc->sb_flags |= SB_SILENT;
> +
>  	if (strcmp(param->key, "source") == 0) {
>  		if (param->type != fs_value_is_string)
>  			return invalf(fc, "VFS: Legacy: Non-string source");

^ permalink raw reply

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-20  5:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton, vdavydov.dev
In-Reply-To: <20190619171340.GA83620@google.com>

On Wed, Jun 19, 2019 at 01:13:40PM -0400, Joel Fernandes wrote:
< snip >

Ccing Vladimir

> > > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> > > > > > +				unsigned long end, struct mm_walk *walk)
> > > > > > +{
> > > > > > +	pte_t *orig_pte, *pte, ptent;
> > > > > > +	spinlock_t *ptl;
> > > > > > +	struct page *page;
> > > > > > +	struct vm_area_struct *vma = walk->vma;
> > > > > > +	unsigned long next;
> > > > > > +
> > > > > > +	next = pmd_addr_end(addr, end);
> > > > > > +	if (pmd_trans_huge(*pmd)) {
> > > > > > +		ptl = pmd_trans_huge_lock(pmd, vma);
> > > > > > +		if (!ptl)
> > > > > > +			return 0;
> > > > > > +
> > > > > > +		if (is_huge_zero_pmd(*pmd))
> > > > > > +			goto huge_unlock;
> > > > > > +
> > > > > > +		page = pmd_page(*pmd);
> > > > > > +		if (page_mapcount(page) > 1)
> > > > > > +			goto huge_unlock;
> > > > > > +
> > > > > > +		if (next - addr != HPAGE_PMD_SIZE) {
> > > > > > +			int err;
> > > > > > +
> > > > > > +			get_page(page);
> > > > > > +			spin_unlock(ptl);
> > > > > > +			lock_page(page);
> > > > > > +			err = split_huge_page(page);
> > > > > > +			unlock_page(page);
> > > > > > +			put_page(page);
> > > > > > +			if (!err)
> > > > > > +				goto regular_page;
> > > > > > +			return 0;
> > > > > > +		}
> > > > > > +
> > > > > > +		pmdp_test_and_clear_young(vma, addr, pmd);
> > > > > > +		deactivate_page(page);
> > > > > > +huge_unlock:
> > > > > > +		spin_unlock(ptl);
> > > > > > +		return 0;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (pmd_trans_unstable(pmd))
> > > > > > +		return 0;
> > > > > > +
> > > > > > +regular_page:
> > > > > > +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > > > +		ptent = *pte;
> > > > > > +
> > > > > > +		if (pte_none(ptent))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (!pte_present(ptent))
> > > > > > +			continue;
> > > > > > +
> > > > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > > > +		if (!page)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		if (page_mapcount(page) > 1)
> > > > > > +			continue;
> > > > > > +
> > > > > > +		ptep_test_and_clear_young(vma, addr, pte);
> > > > > 
> > > > > Wondering here how it interacts with idle page tracking. Here since young
> > > > > flag is cleared by the cold hint, page_referenced_one() or
> > > > > page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
> > > > > if it was previously set since it does not know any more that a page was
> > > > > actively referenced.
> > > > 
> > > > ptep_test_and_clear_young doesn't change PG_idle/young so idle page tracking
> > > > doesn't affect.
> > 
> > You said *young flag* in the comment, which made me confused. I thought you meant
> > PG_young flag but you mean PTE access bit.
> > 
> > > 
> > > Clearing of the young bit in the PTE does affect idle tracking.
> > > 
> > > Both page_referenced_one() and page_idle_clear_pte_refs_one() check this bit.
> > > 
> > > > > bit was previously set, just so that page-idle tracking works smoothly when
> > > > > this hint is concurrently applied?
> > > > 
> > > > deactivate_page will remove PG_young bit so that the page will be reclaimed.
> > > > Do I miss your point?
> > > 
> > > Say a process had accessed PTE bit not set, then idle tracking is run and PG_Idle
> > > is set. Now the page is accessed from userspace thus setting the accessed PTE
> > > bit.  Now a remote process passes this process_madvise cold hint (I know your
> > > current series does not support remote process, but I am saying for future
> > > when you post this). Because you cleared the PTE accessed bit through the
> > > hint, idle tracking no longer will know that the page is referenced and the
> > > user gets confused because accessed page appears to be idle.
> > 
> > Right.
> > 
> > > 
> > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > don't need to do anything.
> > 
> > I'm not sure. What does it make MADV_COLD special?
> > How about MADV_FREE|MADV_DONTNEED?
> > Why don't they clear PG_Idle if pte was young at tearing down pte? 
> 
> Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?

Not sure. If you want it, maybe you need to fix every pte clearing and pte_mkold
part, which is more general to cover every sites like munmap, get_user_pages and
so on. Anyway, I don't think it's related to this patchset.

> 
> > The page could be shared by other processes so if we miss to clear out
> > PG_idle in there, page idle tracking could miss the access history forever.
> 
> I did not understand this. So say a page X is shared process P and Q and
> assume the PG_idle flag is set on the page.
> 
> P accesses memory and has the pte accessed bit set. P now gets the MADV_COLD
> hint and forgets to clear the idle flag while clearing the pte accessed bit.
> 
> Now the page appears to be idle, even though it was not. This has nothing to
> do with Q and whether the page is shared or not.

What I meant was MADV_FREE|MADV_DONTNEED.

> 
> > If it's not what you want, maybe we need to fix all places all at once.
> > However, I'm not sure. Rather than, I want to keep PG_idle in those hints
> > even though pte was accesssed because the process now gives strong hint
> > "The page is idle from now on". It's valid because he knows himself better than
> > others, even admin. IOW, he declare the page is not workingset any more.
> 
> Even if the PG_idle flag is not cleared - it is not a strong hint for working
> set size IMHO, because the page *was* accessed so the process definitely needed the
> page at some point even though now it says it is MADV_COLD. So that is part
> of working set. I don't think we should implicitly provide such hints and we
> should fix it.
> 
> Also I was saying in previous email, if process_madvise (future extension) is
> called from say activity manager, then the process and the user running the
> idle tracking feature has no idea that the page was accessed because the idle
> flag is still set. That is a bit weird and is loss of information.
> 
> It may not be a big deal in the long run if the page is accessed a lot, since
> the PTE accessed bit will be set again and idle-tracking feature may not miss
> it, but why leave it to chance if it is a simple fix?

Consistency with other madvise hints.

There are many places you could lose the information as I mentioned and I'm
really not conviced we need fixing because currently page-idle tracking
feature is biased to work with . If you believe we need to fix it,
it would be better to have a separate discussion, not here.

> 
> > What's the problem if page idle tracking feature miss it?
> 
> What's the problem if PG_idle flag is cleared here? It is just a software
> flag.

Again consistency. I don't think it's a MADV_PAGEOUT specific issue.
Since I pointed out other places idle tracking is missing(? not sure),
Let's discuss it separately if you feel we need fix.

Furthermore, once the page is reclaimed, that means the page could be
deallocated so you automatically don't see any PG_idle from the page.

> 
> > If other processs still have access bit of their pte for the page, page idle
> > tracking could find the page as non-idle so it's no problem, either.
> 
> Yes, but if the other process also does not access the page, then the access
> information is lost.
> 
> thanks!
> 
>  - Joel
> 

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-06-20  7:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620041620.GB105727@google.com>

On Thu 20-06-19 13:16:20, Minchan Kim wrote:
> On Wed, Jun 19, 2019 at 03:24:50PM +0200, Michal Hocko wrote:
> > On Mon 10-06-19 20:12:51, Minchan Kim wrote:
> > [...]
> > > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > +				unsigned long end, struct mm_walk *walk)
> > 
> > Again the same question about a potential code reuse...
> > [...]
> > > +regular_page:
> > > +	tlb_change_page_size(tlb, PAGE_SIZE);
> > > +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +	flush_tlb_batched_pending(mm);
> > > +	arch_enter_lazy_mmu_mode();
> > > +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> > > +		ptent = *pte;
> > > +		if (!pte_present(ptent))
> > > +			continue;
> > > +
> > > +		page = vm_normal_page(vma, addr, ptent);
> > > +		if (!page)
> > > +			continue;
> > > +
> > > +		if (isolate_lru_page(page))
> > > +			continue;
> > > +
> > > +		isolated++;
> > > +		if (pte_young(ptent)) {
> > > +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > +							tlb->fullmm);
> > > +			ptent = pte_mkold(ptent);
> > > +			set_pte_at(mm, addr, pte, ptent);
> > > +			tlb_remove_tlb_entry(tlb, pte, addr);
> > > +		}
> > > +		ClearPageReferenced(page);
> > > +		test_and_clear_page_young(page);
> > > +		list_add(&page->lru, &page_list);
> > > +		if (isolated >= SWAP_CLUSTER_MAX) {
> > 
> > Why do we need SWAP_CLUSTER_MAX batching? Especially when we need ...
> > [...]
> 
> It aims for preventing early OOM kill since we isolate too many LRU
> pages concurrently.

This is a good point. For some reason I thought that we consider
isolated pages in should_reclaim_retry but we do not anymore (since we
move from zone to node LRUs I guess). Please stick a comment there.

> > > +unsigned long reclaim_pages(struct list_head *page_list)
> > > +{
> > > +	int nid = -1;
> > > +	unsigned long nr_reclaimed = 0;
> > > +	LIST_HEAD(node_page_list);
> > > +	struct reclaim_stat dummy_stat;
> > > +	struct scan_control sc = {
> > > +		.gfp_mask = GFP_KERNEL,
> > > +		.priority = DEF_PRIORITY,
> > > +		.may_writepage = 1,
> > > +		.may_unmap = 1,
> > > +		.may_swap = 1,
> > > +	};
> > > +
> > > +	while (!list_empty(page_list)) {
> > > +		struct page *page;
> > > +
> > > +		page = lru_to_page(page_list);
> > > +		if (nid == -1) {
> > > +			nid = page_to_nid(page);
> > > +			INIT_LIST_HEAD(&node_page_list);
> > > +		}
> > > +
> > > +		if (nid == page_to_nid(page)) {
> > > +			list_move(&page->lru, &node_page_list);
> > > +			continue;
> > > +		}
> > > +
> > > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > > +						NODE_DATA(nid),
> > > +						&sc, 0,
> > > +						&dummy_stat, false);
> > 
> > per-node batching in fact. Other than that nothing really jumped at me.
> > Except for the shared page cache side channel timing aspect not being
> > considered AFAICS. To be more specific. Pushing out a shared page cache
> > is possible even now but this interface gives a much easier tool to
> > evict shared state and perform all sorts of timing attacks. Unless I am
> > missing something we should be doing something similar to mincore and
> > ignore shared pages without a writeable access or at least document why
> > we do not care.
> 
> I'm not sure IIUC side channel attach. As you mentioned, without this syscall,
> 1. they already can do that simply by memory hogging

This is way much more harder for practical attacks because the reclaim
logic is not fully under the attackers control. Having a direct tool to
reclaim memory directly then just opens doors to measure the other
consumers of that memory and all sorts of side channel.

> 2. If we need fix MADV_PAGEOUT, that means we need to fix MADV_DONTNEED, too?

nope because MADV_DONTNEED doesn't unmap from other processes.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 1/5] mm: introduce MADV_COLD
From: Michal Hocko @ 2019-06-20  7:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620000650.GB52978@google.com>

On Thu 20-06-19 09:06:51, Minchan Kim wrote:
> On Wed, Jun 19, 2019 at 02:56:12PM +0200, Michal Hocko wrote:
[...]
> > Why cannot we reuse a large part of that code and differ essentially on
> > the reclaim target check and action? Have you considered to consolidate
> > the code to share as much as possible? Maybe that is easier said than
> > done because the devil is always in details...
> 
> Yub, it was not pretty when I tried. Please see last patch in this
> patchset.

That is bad because this code is quite subtle - especially the THP part
of it. I will be staring at the code some more. Maybe some
simplification pops out.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-06-20  8:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620070444.GB12083@dhcp22.suse.cz>

On Thu, Jun 20, 2019 at 09:04:44AM +0200, Michal Hocko wrote:
> On Thu 20-06-19 13:16:20, Minchan Kim wrote:
> > On Wed, Jun 19, 2019 at 03:24:50PM +0200, Michal Hocko wrote:
> > > On Mon 10-06-19 20:12:51, Minchan Kim wrote:
> > > [...]
> > > > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > > +				unsigned long end, struct mm_walk *walk)
> > > 
> > > Again the same question about a potential code reuse...
> > > [...]
> > > > +regular_page:
> > > > +	tlb_change_page_size(tlb, PAGE_SIZE);
> > > > +	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +	flush_tlb_batched_pending(mm);
> > > > +	arch_enter_lazy_mmu_mode();
> > > > +	for (; addr < end; pte++, addr += PAGE_SIZE) {
> > > > +		ptent = *pte;
> > > > +		if (!pte_present(ptent))
> > > > +			continue;
> > > > +
> > > > +		page = vm_normal_page(vma, addr, ptent);
> > > > +		if (!page)
> > > > +			continue;
> > > > +
> > > > +		if (isolate_lru_page(page))
> > > > +			continue;
> > > > +
> > > > +		isolated++;
> > > > +		if (pte_young(ptent)) {
> > > > +			ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > +							tlb->fullmm);
> > > > +			ptent = pte_mkold(ptent);
> > > > +			set_pte_at(mm, addr, pte, ptent);
> > > > +			tlb_remove_tlb_entry(tlb, pte, addr);
> > > > +		}
> > > > +		ClearPageReferenced(page);
> > > > +		test_and_clear_page_young(page);
> > > > +		list_add(&page->lru, &page_list);
> > > > +		if (isolated >= SWAP_CLUSTER_MAX) {
> > > 
> > > Why do we need SWAP_CLUSTER_MAX batching? Especially when we need ...
> > > [...]
> > 
> > It aims for preventing early OOM kill since we isolate too many LRU
> > pages concurrently.
> 
> This is a good point. For some reason I thought that we consider
> isolated pages in should_reclaim_retry but we do not anymore (since we
> move from zone to node LRUs I guess). Please stick a comment there.

Sure.

> 
> > > > +unsigned long reclaim_pages(struct list_head *page_list)
> > > > +{
> > > > +	int nid = -1;
> > > > +	unsigned long nr_reclaimed = 0;
> > > > +	LIST_HEAD(node_page_list);
> > > > +	struct reclaim_stat dummy_stat;
> > > > +	struct scan_control sc = {
> > > > +		.gfp_mask = GFP_KERNEL,
> > > > +		.priority = DEF_PRIORITY,
> > > > +		.may_writepage = 1,
> > > > +		.may_unmap = 1,
> > > > +		.may_swap = 1,
> > > > +	};
> > > > +
> > > > +	while (!list_empty(page_list)) {
> > > > +		struct page *page;
> > > > +
> > > > +		page = lru_to_page(page_list);
> > > > +		if (nid == -1) {
> > > > +			nid = page_to_nid(page);
> > > > +			INIT_LIST_HEAD(&node_page_list);
> > > > +		}
> > > > +
> > > > +		if (nid == page_to_nid(page)) {
> > > > +			list_move(&page->lru, &node_page_list);
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		nr_reclaimed += shrink_page_list(&node_page_list,
> > > > +						NODE_DATA(nid),
> > > > +						&sc, 0,
> > > > +						&dummy_stat, false);
> > > 
> > > per-node batching in fact. Other than that nothing really jumped at me.
> > > Except for the shared page cache side channel timing aspect not being
> > > considered AFAICS. To be more specific. Pushing out a shared page cache
> > > is possible even now but this interface gives a much easier tool to
> > > evict shared state and perform all sorts of timing attacks. Unless I am
> > > missing something we should be doing something similar to mincore and
> > > ignore shared pages without a writeable access or at least document why
> > > we do not care.
> > 
> > I'm not sure IIUC side channel attach. As you mentioned, without this syscall,
> > 1. they already can do that simply by memory hogging
> 
> This is way much more harder for practical attacks because the reclaim
> logic is not fully under the attackers control. Having a direct tool to
> reclaim memory directly then just opens doors to measure the other
> consumers of that memory and all sorts of side channel.

Not sure it's much more harder. It's really easy on my experience.
Just creating new memory hogger and consume memory step by step until
you newly allocated pages will be reclaimed.
Anyway, we fixed mincore so attacker cannot see when the page fault-in
if he don't enough permission for the file. Right?
What's the concern of you even though we reclaim more aggressively?


> 
> > 2. If we need fix MADV_PAGEOUT, that means we need to fix MADV_DONTNEED, too?
> 
> nope because MADV_DONTNEED doesn't unmap from other processes.

Hmm, I don't understand. MADV_PAGEOUT doesn't unmap from other
processes, either. Could you elborate it a bit more what's your concern?


> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 1/5] mm: introduce MADV_COLD
From: Minchan Kim @ 2019-06-20  8:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620070854.GC12083@dhcp22.suse.cz>

On Thu, Jun 20, 2019 at 09:08:54AM +0200, Michal Hocko wrote:
> On Thu 20-06-19 09:06:51, Minchan Kim wrote:
> > On Wed, Jun 19, 2019 at 02:56:12PM +0200, Michal Hocko wrote:
> [...]
> > > Why cannot we reuse a large part of that code and differ essentially on
> > > the reclaim target check and action? Have you considered to consolidate
> > > the code to share as much as possible? Maybe that is easier said than
> > > done because the devil is always in details...
> > 
> > Yub, it was not pretty when I tried. Please see last patch in this
> > patchset.
> 
> That is bad because this code is quite subtle - especially the THP part
> of it. I will be staring at the code some more. Maybe some
> simplification pops out.

Yeah, I couldn't come up with better idea. Actually, I wanted to be
left. More suggestion to make simple/readable would be great.

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-06-20  9:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620084040.GD105727@google.com>

On Thu 20-06-19 17:40:40, Minchan Kim wrote:
> > > > Pushing out a shared page cache
> > > > is possible even now but this interface gives a much easier tool to
> > > > evict shared state and perform all sorts of timing attacks. Unless I am
> > > > missing something we should be doing something similar to mincore and
> > > > ignore shared pages without a writeable access or at least document why
> > > > we do not care.
> > > 
> > > I'm not sure IIUC side channel attach. As you mentioned, without this syscall,
> > > 1. they already can do that simply by memory hogging
> > 
> > This is way much more harder for practical attacks because the reclaim
> > logic is not fully under the attackers control. Having a direct tool to
> > reclaim memory directly then just opens doors to measure the other
> > consumers of that memory and all sorts of side channel.
> 
> Not sure it's much more harder. It's really easy on my experience.
> Just creating new memory hogger and consume memory step by step until
> you newly allocated pages will be reclaimed.

You can contain an untrusted application into a memcg and it will only
reclaim its own working set.

> > > 2. If we need fix MADV_PAGEOUT, that means we need to fix MADV_DONTNEED, too?
> > 
> > nope because MADV_DONTNEED doesn't unmap from other processes.
> 
> Hmm, I don't understand. MADV_PAGEOUT doesn't unmap from other
> processes, either.

Either I am confused or missing something. shrink_page_list does
try_to_unmap and that unmaps from all processes, right?

> Could you elborate it a bit more what's your concern?

If you manage to unmap from a remote process then you can measure delays
implied from the refault and that information can be used to infer what
the remote application is doing.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Minchan Kim @ 2019-06-20 10:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620092209.GD12083@dhcp22.suse.cz>

On Thu, Jun 20, 2019 at 11:22:09AM +0200, Michal Hocko wrote:
> On Thu 20-06-19 17:40:40, Minchan Kim wrote:
> > > > > Pushing out a shared page cache
> > > > > is possible even now but this interface gives a much easier tool to
> > > > > evict shared state and perform all sorts of timing attacks. Unless I am
> > > > > missing something we should be doing something similar to mincore and
> > > > > ignore shared pages without a writeable access or at least document why
> > > > > we do not care.
> > > > 
> > > > I'm not sure IIUC side channel attach. As you mentioned, without this syscall,
> > > > 1. they already can do that simply by memory hogging
> > > 
> > > This is way much more harder for practical attacks because the reclaim
> > > logic is not fully under the attackers control. Having a direct tool to
> > > reclaim memory directly then just opens doors to measure the other
> > > consumers of that memory and all sorts of side channel.
> > 
> > Not sure it's much more harder. It's really easy on my experience.
> > Just creating new memory hogger and consume memory step by step until
> > you newly allocated pages will be reclaimed.
> 
> You can contain an untrusted application into a memcg and it will only
> reclaim its own working set.
> 
> > > > 2. If we need fix MADV_PAGEOUT, that means we need to fix MADV_DONTNEED, too?
> > > 
> > > nope because MADV_DONTNEED doesn't unmap from other processes.
> > 
> > Hmm, I don't understand. MADV_PAGEOUT doesn't unmap from other
> > processes, either.
> 
> Either I am confused or missing something. shrink_page_list does
> try_to_unmap and that unmaps from all processes, right?

You don't miss it. It seems now I undetstand what you pointed out.
What you meant is attacker can see what page was faulting-in from other processes
via measuring access delay from his address space and MADV_PAGEOUT makes it more
easiler. Thus, it's an issue regardless of recent mincore fix. Right?
Then, okay, I will add can_do_mincore similar check for the MADV_PAGEOUT syscall
if others have different ideas.

Thanks.

^ permalink raw reply

* Re: [PATCH v2 4/5] mm: introduce MADV_PAGEOUT
From: Michal Hocko @ 2019-06-20 10:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190620103215.GF105727@google.com>

On Thu 20-06-19 19:32:15, Minchan Kim wrote:
[...]
> Then, okay, I will add can_do_mincore similar check for the MADV_PAGEOUT syscall
> if others have different ideas.

Great that we are on the same page. We can simply skip over those pages.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH] samples: make pidfd-metadata fail gracefully on older kernels
From: Dmitry V. Levin @ 2019-06-20 11:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jann Horn, Oleg Nesterov, Arnd Bergmann, linux-api, linux-kernel
In-Reply-To: <20190620103105.cdxgqfelzlnkmblv@brauner.io>

Cc'ed more people as the issue is not just with the example but
with the interface itself.

On Thu, Jun 20, 2019 at 12:31:06PM +0200, Christian Brauner wrote:
> On Thu, Jun 20, 2019 at 06:11:44AM +0300, Dmitry V. Levin wrote:
> > Initialize pidfd to an invalid descriptor, to fail gracefully on
> > those kernels that do not implement CLONE_PIDFD and leave pidfd
> > unchanged.
> > 
> > Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
> > ---
> >  samples/pidfd/pidfd-metadata.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
> > index 14b454448429..ff109fdac3a5 100644
> > --- a/samples/pidfd/pidfd-metadata.c
> > +++ b/samples/pidfd/pidfd-metadata.c
> > @@ -83,7 +83,7 @@ static int pidfd_metadata_fd(pid_t pid, int pidfd)
> >  
> >  int main(int argc, char *argv[])
> >  {
> > -	int pidfd = 0, ret = EXIT_FAILURE;
> > +	int pidfd = -1, ret = EXIT_FAILURE;
> 
> Hm, that currently won't work since we added a check in fork.c for
> pidfd == 0. If it isn't you'll get EINVAL.

Sorry, I must've missed that check.  But this makes things even worse.

> This was done to ensure that
> we can potentially extend CLONE_PIDFD by passing in flags through the
> return argument.
> However, I find this increasingly unlikely. Especially since the
> interface would be horrendous and an absolute last resort.
> If clone3() gets merged for 5.3 (currently in linux-next) we also have
> no real need anymore to extend legacy clone() this way. So either wait
> until (if) we merge clone3() where the check I mentioned is gone anyway,
> or remove the pidfd == 0 check from fork.c in a preliminary patch.
> Thoughts?

Userspace needs a reliable way to tell whether CLONE_PIDFD is supported
by the kernel or not.

If CLONE_PIDFD is not supported, then pidfd remains unchanged.

If CLONE_PIDFD is supported and fd 0 is closed, then mandatory pidfd == 0
also remains unchanged, which effectively means that userspace must ensure
that fd 0 is not closed when invoking CLONE_PIDFD.  This is ugly.

If we can assume that clone(CLONE_PIDFD) is not going to be extended,
then I'm for removing the pidfd == 0 check along with recommending
userspace to initialize pidfd with -1.

-- 
ldv

^ permalink raw reply

* Re: [PATCH] samples: make pidfd-metadata fail gracefully on older kernels
From: Christian Brauner @ 2019-06-20 11:10 UTC (permalink / raw)
  To: Dmitry V. Levin
  Cc: Jann Horn, Oleg Nesterov, Arnd Bergmann, linux-api, linux-kernel
In-Reply-To: <20190620110037.GA4998@altlinux.org>

On Thu, Jun 20, 2019 at 02:00:37PM +0300, Dmitry V. Levin wrote:
> Cc'ed more people as the issue is not just with the example but
> with the interface itself.
> 
> On Thu, Jun 20, 2019 at 12:31:06PM +0200, Christian Brauner wrote:
> > On Thu, Jun 20, 2019 at 06:11:44AM +0300, Dmitry V. Levin wrote:
> > > Initialize pidfd to an invalid descriptor, to fail gracefully on
> > > those kernels that do not implement CLONE_PIDFD and leave pidfd
> > > unchanged.
> > > 
> > > Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
> > > ---
> > >  samples/pidfd/pidfd-metadata.c | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/samples/pidfd/pidfd-metadata.c b/samples/pidfd/pidfd-metadata.c
> > > index 14b454448429..ff109fdac3a5 100644
> > > --- a/samples/pidfd/pidfd-metadata.c
> > > +++ b/samples/pidfd/pidfd-metadata.c
> > > @@ -83,7 +83,7 @@ static int pidfd_metadata_fd(pid_t pid, int pidfd)
> > >  
> > >  int main(int argc, char *argv[])
> > >  {
> > > -	int pidfd = 0, ret = EXIT_FAILURE;
> > > +	int pidfd = -1, ret = EXIT_FAILURE;
> > 
> > Hm, that currently won't work since we added a check in fork.c for
> > pidfd == 0. If it isn't you'll get EINVAL.
> 
> Sorry, I must've missed that check.  But this makes things even worse.
> 
> > This was done to ensure that
> > we can potentially extend CLONE_PIDFD by passing in flags through the
> > return argument.
> > However, I find this increasingly unlikely. Especially since the
> > interface would be horrendous and an absolute last resort.
> > If clone3() gets merged for 5.3 (currently in linux-next) we also have
> > no real need anymore to extend legacy clone() this way. So either wait
> > until (if) we merge clone3() where the check I mentioned is gone anyway,
> > or remove the pidfd == 0 check from fork.c in a preliminary patch.
> > Thoughts?
> 
> Userspace needs a reliable way to tell whether CLONE_PIDFD is supported
> by the kernel or not.

Right, that's the general problem with legacy clone(): it ignores
unknown flags... clone3() will EINVAL you if you pass any flag it
doesn't know about.

For legacy clone you can pass

(CLONE_PIDFD | CLONE_DETACHED)

on all relevant kernels >= 2.6.2. CLONE_DETACHED will be silently
ignored by the kernel if specified in flags. But if you specify both
CLONE_PIDFD and CLONE_DETACHED on a kernel that does support CLONE_PIDFD
you'll get EINVALed. (We did this because we wanted to have the ability
to make CLONE_DETACHED reuseable with CLONE_PIDFD.)
Does that help?

> 
> If CLONE_PIDFD is not supported, then pidfd remains unchanged.
> 
> If CLONE_PIDFD is supported and fd 0 is closed, then mandatory pidfd == 0
> also remains unchanged, which effectively means that userspace must ensure
> that fd 0 is not closed when invoking CLONE_PIDFD.  This is ugly.
> 
> If we can assume that clone(CLONE_PIDFD) is not going to be extended,
> then I'm for removing the pidfd == 0 check along with recommending
> userspace to initialize pidfd with -1.

Right, I'm ok with that too.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
From: Joel Fernandes @ 2019-06-20 15:57 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Linux API, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon,
	Jann Horn, Oleg Nesterov, Christian Brauner, oleksandr, hdanton,
	Vladimir Davydov
In-Reply-To: <20190620050132.GC105727@google.com>

On Thu, Jun 20, 2019 at 1:01 AM Minchan Kim <minchan@kernel.org> wrote:
[snip]
> > > >
> > > > I think to fix this, what you should do is clear the PG_Idle flag if the
> > > > young/accessed PTE bits are set. If PG_Idle is already cleared, then you
> > > > don't need to do anything.
> > >
> > > I'm not sure. What does it make MADV_COLD special?
> > > How about MADV_FREE|MADV_DONTNEED?
> > > Why don't they clear PG_Idle if pte was young at tearing down pte?
> >
> > Good point, so it sounds like those (MADV_FREE|MADV_DONTNEED) also need to be fixed then?
>
> Not sure. If you want it, maybe you need to fix every pte clearing and pte_mkold
> part, which is more general to cover every sites like munmap, get_user_pages and
> so on. Anyway, I don't think it's related to this patchset.

Ok, I can look into this issue on my own when I get time. I'll add it
to my list. No problems with your patch otherwise from my side.

 -Joel

^ permalink raw reply

* Re: [PATCH v3 2/2] arch: wire-up clone3() syscall
From: Guenter Roeck @ 2019-06-20 18:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, linux-kernel, torvalds, jannh, keescook, fweimer, oleg,
	arnd, dhowells, Andrew Morton, Adrian Reber, linux-api,
	linux-arch, x86
In-Reply-To: <20190604160944.4058-2-christian@brauner.io>

On Tue, Jun 04, 2019 at 06:09:44PM +0200, Christian Brauner wrote:
> Wire up the clone3() call on all arches that don't require hand-rolled
> assembly.
> 
> Some of the arches look like they need special assembly massaging and it is
> probably smarter if the appropriate arch maintainers would do the actual
> wiring. Arches that are wired-up are:
> - x86{_32,64}
> - arm{64}
> - xtensa
> 

This patch results in build failures on various architecetures.

h8300-linux-ld: arch/h8300/kernel/syscalls.o:(.data+0x6d0): undefined reference to `sys_clone3'

nios2-linux-ld: arch/nios2/kernel/syscall_table.o:(.data+0x6d0): undefined reference to `sys_clone3'

There may be others; -next is in too bad shape right now to get a complete
picture. Wondering, though: What is special with this syscall ? Normally
one would only get a warning that a syscall is not wired up.

Guenter

> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> Cc: linux-arch@vger.kernel.org
> Cc: x86@kernel.org
> ---
> v1: unchanged
> v2: unchanged
> v3:
> - Christian Brauner <christian@brauner.io>:
>   - wire up clone3 on all arches that don't have hand-rolled entry points
>     for clone
> ---
>  arch/arm/tools/syscall.tbl                  | 1 +
>  arch/arm64/include/asm/unistd.h             | 2 +-
>  arch/arm64/include/asm/unistd32.h           | 2 ++
>  arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
>  include/uapi/asm-generic/unistd.h           | 4 +++-
>  8 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index aaf479a9e92d..e99a82bdb93a 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -447,3 +447,4 @@
>  431	common	fsconfig			sys_fsconfig
>  432	common	fsmount				sys_fsmount
>  433	common	fspick				sys_fspick
> +436	common	clone3				sys_clone3
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 70e6882853c0..24480c2d95da 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -44,7 +44,7 @@
>  #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls		434
> +#define __NR_compat_syscalls		437
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index c39e90600bb3..b144ea675d70 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
>  __SYSCALL(__NR_fsmount, sys_fsmount)
>  #define __NR_fspick 433
>  __SYSCALL(__NR_fspick, sys_fspick)
> +#define __NR_clone3 436
> +__SYSCALL(__NR_clone3, sys_clone3)
>  
>  /*
>   * Please add new compat syscalls above this comment and update
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index 26339e417695..3110440bcc31 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -439,3 +439,4 @@
>  431	common	fsconfig			sys_fsconfig
>  432	common	fsmount				sys_fsmount
>  433	common	fspick				sys_fspick
> +436	common	clone3				sys_clone3
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index ad968b7bac72..80e26211feff 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
>  432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
>  433	i386	fspick			sys_fspick			__ia32_sys_fspick
> +436	i386	clone3			sys_clone3			__ia32_sys_clone3
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index b4e6f9e6204a..7968f0b5b5e8 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  431	common	fsconfig		__x64_sys_fsconfig
>  432	common	fsmount			__x64_sys_fsmount
>  433	common	fspick			__x64_sys_fspick
> +436	common	clone3			__x64_sys_clone3/ptregs
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 5fa0ee1c8e00..b2767c8c2b4e 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -404,3 +404,4 @@
>  431	common	fsconfig			sys_fsconfig
>  432	common	fsmount				sys_fsmount
>  433	common	fspick				sys_fspick
> +436	common	clone3				sys_clone3
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index a87904daf103..45bc87687c47 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
>  __SYSCALL(__NR_fsmount, sys_fsmount)
>  #define __NR_fspick 433
>  __SYSCALL(__NR_fspick, sys_fspick)
> +#define __NR_clone3 436
> +__SYSCALL(__NR_clone3, sys_clone3)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 434
> +#define __NR_syscalls 437
>  
>  /*
>   * 32 bit systems traditionally used different

^ permalink raw reply

* [PATCH v5 00/16] fs-verity: read-only file-based authenticity protection
From: Eric Biggers @ 2019-06-20 20:50 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh

Hello,

This is a redesigned version of the fs-verity patchset, implementing
Ted's suggestion to build the Merkle tree in the kernel
(https://lore.kernel.org/linux-fsdevel/20190207031101.GA7387@mit.edu/).
This greatly simplifies the UAPI, since the verity metadata no longer
needs to be transferred to the kernel.  Now to enable fs-verity on a
file, one simply calls FS_IOC_ENABLE_VERITY, passing it this structure:

	struct fsverity_enable_arg {
		__u32 version;
		__u32 hash_algorithm;
		__u32 block_size;
		__u32 salt_size;
		__u64 salt_ptr;
		__u32 sig_size;
		__u32 __reserved1;
		__u64 sig_ptr;
		__u64 __reserved2[11];
	};

The filesystem then builds the file's Merkle tree and stores it in a
filesystem-specific location associated with the file.  Afterwards,
FS_IOC_MEASURE_VERITY can be used to retrieve the file measurement
("root hash").  The way the file measurement is computed is also
effectively part of the API (it has to be), but it's logically
independent of where/how the filesystem stores the Merkle tree.

The API is fully documented in Documentation/filesystems/fsverity.rst,
along with other aspects of fs-verity.  I also added an FAQ section that
answers frequently asked questions about fs-verity, e.g. why isn't it
all at the VFS level, why isn't it part of IMA, why does the Merkle tree
need to be stored on-disk, etc.

Overview
--------

This patchset implements fs-verity for ext4 and f2fs.  fs-verity is
similar to dm-verity, but implemented on a per-file basis: a Merkle tree
is used to measure (hash) a read-only file's data as it is paged in.
ext4 and f2fs hide this Merkle tree beyond the end of the file, but
other filesystems can implement it differently if desired.

In general, fs-verity is intended for use on writable filesystems;
dm-verity is still recommended on read-only ones.

Similar to fscrypt, most of the code is in fs/verity/, and not too many
filesystem-specific changes are needed.  The Merkle tree is built by the
filesystem when the FS_IOC_ENABLE_VERITY ioctl is executed.

fs-verity provides a file measurement (hash) in constant time and
verifies data on-demand.  Thus, it is useful for efficiently verifying
the authenticity of large files of which only a small portion may be
accessed, such as Android application package (APK) files.  It may also
be useful in "audit" use cases where file hashes are logged.

fs-verity can also provide better protection against malicious disks
than an ahead-of-time hash, since fs-verity re-verifies data each time
it's paged in.  Note, however, that any authenticity guarantee is still
dependent on verification of the file measurement and other relevant
metadata in a way that makes sense for the overall system; fs-verity is
only a tool to help with this.

This patchset doesn't include IMA support for fs-verity file
measurements.  This is planned and we'd like to collaborate with the IMA
maintainers.  Although fs-verity can be used on its own without IMA,
fs-verity is primarily a lower level feature (think of it as a way of
hashing a file), so some users may still need IMA's policy mechanism.
However, an optional in-kernel signature verification mechanism within
fs-verity itself is also included.

This patchset is based on v5.2-rc3.  It can also be found in git at tag
fsverity_2019-06-20 of:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git

fs-verity has a userspace utility:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git

xfstests for fs-verity can be found at branch "fsverity" of:

	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/xfstests-dev.git

fs-verity is supported by f2fs-tools v1.11.0+ and e2fsprogs v1.45.2+.

Examples of setting up fs-verity protected files can be found in the
README.md file of fsverity-utils.

Other useful references include:

  - Documentation/filesystems/fsverity.rst, added by the first patch.

  - LWN coverage of v3 patchset: https://lwn.net/Articles/790185/

  - LWN coverage of v2 patchset: https://lwn.net/Articles/775872/

  - LWN coverage of v1 patchset: https://lwn.net/Articles/763729/

  - Presentation at Linux Security Summit North America 2018:
      - Slides: https://schd.ws/hosted_files/lssna18/af/fs-verity%20slide%20deck.pdf
      - Video: https://www.youtube.com/watch?v=Aw5h6aBhu6M
      (This corresponded to the v1 patchset; changes have been made since then.)

  - LWN coverage of LSFMM 2018 discussion: https://lwn.net/Articles/752614/

Changed since v4:

  - Made ext4 and f2fs store the verity metadata beginning at a 64K
    aligned boundary, to be ready for architectures with 64K pages.

  - Made ext4 store the verity descriptor size in the file data stream,
    so that no xattr is needed.

  - Added support for empty files.

  - A few minor cleanups.

Changed since v3:

  - The FS_IOC_GETFLAGS ioctl now returns the verity flag.

  - Fixed setting i_verity_info too early.

  - Restored pagecache invalidation in FS_IOC_ENABLE_VERITY.

  - Fixed truncation of fsverity_enable_arg::hash_algorithm.

  - Reject empty files for both open and enable, not just enable.

  - Added a couple more FAQ entries to the documentation.

  - A few minor cleanups.

  - Rebased onto v5.2-rc3.

Changed since v2:

  - Large redesign: the Merkle tree is now built by
    FS_IOC_ENABLE_VERITY, rather than being provided by userspace.  The
    fsverity_operations provide an interface for filesystems to read and
    write the Merkle tree from/to a filesystem-specific location.

  - Lot of refactoring, cleanups, and documentation improvements.

  - Many simplifications, such as simplifying the fsverity_descriptor
    format, dropping CRC-32 support, and limiting the salt size.

  - ext4 and f2fs now store an xattr that gives the location of the
    fsverity_descriptor, so loading it is more straightforward.

  - f2fs no longer counts the verity metadata in the on-disk i_size,
    making it consistent with ext4.

  - Replaced the filesystem-specific fs-verity kconfig options with
    CONFIG_FS_VERITY.

  - Replaced the filesystem-specific verity bit checks with IS_VERITY().

Changed since v1:

  - Added documentation file.

  - Require write permission for FS_IOC_ENABLE_VERITY, rather than
    CAP_SYS_ADMIN.

  - Eliminated dependency on CONFIG_BLOCK and clarified that filesystems
    can verify a page at a time rather than a bio at a time.

  - Fixed conditions for verifying holes.

  - ext4 now only allows fs-verity on extent-based files.

  - Eliminated most of the assumptions that the verity metadata is
    stored beyond EOF, in case filesystems want to do things
    differently.

  - Other cleanups.

Eric Biggers (16):
  fs-verity: add a documentation file
  fs-verity: add MAINTAINERS file entry
  fs-verity: add UAPI header
  fs: uapi: define verity bit for FS_IOC_GETFLAGS
  fs-verity: add Kconfig and the helper functions for hashing
  fs-verity: add inode and superblock fields
  fs-verity: add the hook for file ->open()
  fs-verity: add the hook for file ->setattr()
  fs-verity: add data verification hooks for ->readpages()
  fs-verity: implement FS_IOC_ENABLE_VERITY ioctl
  fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
  fs-verity: add SHA-512 support
  fs-verity: support builtin file signatures
  ext4: add basic fs-verity support
  ext4: add fs-verity read support
  f2fs: add fs-verity support

 Documentation/filesystems/fsverity.rst | 710 +++++++++++++++++++++++++
 Documentation/filesystems/index.rst    |   1 +
 Documentation/ioctl/ioctl-number.txt   |   1 +
 MAINTAINERS                            |  12 +
 fs/Kconfig                             |   2 +
 fs/Makefile                            |   1 +
 fs/ext4/Makefile                       |   1 +
 fs/ext4/ext4.h                         |  23 +-
 fs/ext4/file.c                         |   4 +
 fs/ext4/inode.c                        |  48 +-
 fs/ext4/ioctl.c                        |  12 +
 fs/ext4/readpage.c                     | 207 ++++++-
 fs/ext4/super.c                        |  18 +-
 fs/ext4/sysfs.c                        |   6 +
 fs/ext4/verity.c                       | 354 ++++++++++++
 fs/f2fs/Makefile                       |   1 +
 fs/f2fs/data.c                         |  72 ++-
 fs/f2fs/f2fs.h                         |  23 +-
 fs/f2fs/file.c                         |  40 ++
 fs/f2fs/inode.c                        |   5 +-
 fs/f2fs/super.c                        |   3 +
 fs/f2fs/sysfs.c                        |  11 +
 fs/f2fs/verity.c                       | 233 ++++++++
 fs/f2fs/xattr.h                        |   2 +
 fs/verity/Kconfig                      |  55 ++
 fs/verity/Makefile                     |  10 +
 fs/verity/enable.c                     | 355 +++++++++++++
 fs/verity/fsverity_private.h           | 185 +++++++
 fs/verity/hash_algs.c                  | 279 ++++++++++
 fs/verity/init.c                       |  61 +++
 fs/verity/measure.c                    |  57 ++
 fs/verity/open.c                       | 357 +++++++++++++
 fs/verity/signature.c                  | 207 +++++++
 fs/verity/verify.c                     | 281 ++++++++++
 include/linux/fs.h                     |  11 +
 include/linux/fsverity.h               | 209 ++++++++
 include/uapi/linux/fs.h                |   1 +
 include/uapi/linux/fsverity.h          |  40 ++
 38 files changed, 3839 insertions(+), 59 deletions(-)
 create mode 100644 Documentation/filesystems/fsverity.rst
 create mode 100644 fs/ext4/verity.c
 create mode 100644 fs/f2fs/verity.c
 create mode 100644 fs/verity/Kconfig
 create mode 100644 fs/verity/Makefile
 create mode 100644 fs/verity/enable.c
 create mode 100644 fs/verity/fsverity_private.h
 create mode 100644 fs/verity/hash_algs.c
 create mode 100644 fs/verity/init.c
 create mode 100644 fs/verity/measure.c
 create mode 100644 fs/verity/open.c
 create mode 100644 fs/verity/signature.c
 create mode 100644 fs/verity/verify.c
 create mode 100644 include/linux/fsverity.h
 create mode 100644 include/uapi/linux/fsverity.h

-- 
2.22.0.410.gd8fdbe21b5-goog

^ permalink raw reply

* [PATCH v5 01/16] fs-verity: add a documentation file
From: Eric Biggers @ 2019-06-20 20:50 UTC (permalink / raw)
  To: linux-fscrypt
  Cc: Theodore Y . Ts'o, Darrick J . Wong, linux-api, Dave Chinner,
	linux-f2fs-devel, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190620205043.64350-1-ebiggers@kernel.org>

From: Eric Biggers <ebiggers@google.com>

Add a documentation file for fs-verity, covering:

- Introduction
- Use cases
- User API
    - FS_IOC_ENABLE_VERITY
    - FS_IOC_MEASURE_VERITY
    - FS_IOC_GETFLAGS
- Accessing verity files
- File measurement computation
    - Merkle tree
    - fs-verity descriptor
- Built-in signature verification
- Filesystem support
    - ext4
    - f2fs
- Implementation details
    - Verifying data
        - Pagecache
        - Block device based filesystems
- Userspace utility
- Tests
- FAQ

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 Documentation/filesystems/fsverity.rst | 710 +++++++++++++++++++++++++
 Documentation/filesystems/index.rst    |   1 +
 2 files changed, 711 insertions(+)
 create mode 100644 Documentation/filesystems/fsverity.rst

diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
new file mode 100644
index 00000000000000..49524d7ea190e5
--- /dev/null
+++ b/Documentation/filesystems/fsverity.rst
@@ -0,0 +1,710 @@
+=======================================================
+fs-verity: read-only file-based authenticity protection
+=======================================================
+
+Introduction
+============
+
+fs-verity (``fs/verity/``) is a support layer that filesystems can
+hook into to support transparent integrity and authenticity protection
+of read-only files.  Currently, it is supported by the ext4 and f2fs
+filesystems.  Like fscrypt, not too much filesystem-specific code is
+needed to support fs-verity.
+
+fs-verity is similar to `dm-verity
+<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+but works on files rather than block devices.  On regular files on
+filesystems supporting fs-verity, userspace can execute an ioctl that
+causes the filesystem to build a Merkle tree for the file and persist
+it to a filesystem-specific location associated with the file.
+
+After this, the file is made readonly, and all reads from the file are
+automatically verified against the file's Merkle tree.  Reads of any
+corrupted data, including mmap reads, will fail.
+
+Userspace can use another ioctl to retrieve the root hash (actually
+the "file measurement", which is a hash that includes the root hash)
+that fs-verity is enforcing for the file.  This ioctl executes in
+constant time, regardless of the file size.
+
+fs-verity is essentially a way to hash a file in constant time,
+subject to the caveat that reads which would violate the hash will
+fail at runtime.
+
+Use cases
+=========
+
+By itself, the base fs-verity feature only provides integrity
+protection, i.e. detection of accidental (non-malicious) corruption.
+
+However, because fs-verity makes retrieving the file hash extremely
+efficient, it's primarily meant to be used as a tool to support
+authentication (detection of malicious modifications) or auditing
+(logging file hashes before use).
+
+Trusted userspace code (e.g. operating system code running on a
+read-only partition that is itself authenticated by dm-verity) can
+authenticate the contents of an fs-verity file by using the
+`FS_IOC_MEASURE_VERITY`_ ioctl to retrieve its hash, then verifying a
+digital signature of it.
+
+A standard file hash could be used instead of fs-verity.  However,
+this is inefficient if the file is large and only a small portion may
+be accessed.  This is often the case for Android application package
+(APK) files, for example.  These typically contain many translations,
+classes, and other resources that are infrequently or even never
+accessed on a particular device.  It would be slow and wasteful to
+read and hash the entire file before starting the application.
+
+Unlike an ahead-of-time hash, fs-verity also re-verifies data each
+time it's paged in.  This ensures that malicious disk firmware can't
+undetectably change the contents of the file at runtime.
+
+fs-verity does not replace or obsolete dm-verity.  dm-verity should
+still be used on read-only filesystems.  fs-verity is for files that
+must live on a read-write filesystem because they are independently
+updated and potentially user-installed, so dm-verity cannot be used.
+
+The base fs-verity feature is a hashing mechanism only; actually
+authenticating the files is up to userspace.  However, to meet some
+users' needs, fs-verity optionally supports a simple signature
+verification mechanism where users can configure the kernel to require
+that all fs-verity files be signed by a key loaded into a keyring; see
+`Built-in signature verification`_.  Support for fs-verity file hashes
+in IMA (Integrity Measurement Architecture) policies is also planned.
+
+User API
+========
+
+FS_IOC_ENABLE_VERITY
+--------------------
+
+The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a file.  It takes
+in a pointer to a :c:type:`struct fsverity_enable_arg`, defined as
+follows::
+
+    struct fsverity_enable_arg {
+            __u32 version;
+            __u32 hash_algorithm;
+            __u32 block_size;
+            __u32 salt_size;
+            __u64 salt_ptr;
+            __u32 sig_size;
+            __u32 __reserved1;
+            __u64 sig_ptr;
+            __u64 __reserved2[11];
+    };
+
+This structure contains the parameters of the Merkle tree to build for
+the file, and optionally contains a signature.  It must be initialized
+as follows:
+
+- ``version`` must be 1.
+- ``hash_algorithm`` must be the identifier for the hash algorithm to
+  use for the Merkle tree, such as FS_VERITY_HASH_ALG_SHA256.  See
+  ``include/uapi/linux/fsverity.h`` for the list of possible values.
+- ``block_size`` must be the Merkle tree block size.  Currently, this
+  must be equal to the system page size, which is usually 4096 bytes.
+  Other sizes may be supported in the future.  This value is not
+  necessarily the same as the filesystem block size.
+- ``salt_size`` is the size of the salt in bytes, or 0 if no salt is
+  provided.  The salt is a value that is prepended to every hashed
+  block; it can be used to personalize the hashing for a particular
+  file or device.  Currently the maximum salt size is 32 bytes.
+- ``salt_ptr`` is the pointer to the salt, or NULL if no salt is
+  provided.
+- ``sig_size`` is the size of the signature in bytes, or 0 if no
+  signature is provided.  Currently the signature is (somewhat
+  arbitrarily) limited to 16128 bytes.  See `Built-in signature
+  verification`_ for more information.
+- ``sig_ptr``  is the pointer to the signature, or NULL if no
+  signature is provided.
+- All reserved fields must be zeroed.
+
+FS_IOC_ENABLE_VERITY causes the filesystem to build a Merkle tree for
+the file and persist it to a filesystem-specific location associated
+with the file, then mark the file as a verity file.  This ioctl may
+take a long time to execute on large files, and it is interruptible by
+fatal signals.
+
+FS_IOC_ENABLE_VERITY checks for write access to the inode.  However,
+it must be executed on an O_RDONLY file descriptor and no processes
+can have the file open for writing.  Attempts to open the file for
+writing while this ioctl is executing will fail with ETXTBSY.  (This
+is necessary to guarantee that no writable file descriptors will exist
+after verity is enabled, and to guarantee that the file's contents are
+stable while the Merkle tree is being built over it.)
+
+On success, FS_IOC_ENABLE_VERITY returns 0, and the file becomes a
+verity file.  On failure (including the case of interruption by a
+fatal signal), no changes are made to the file.
+
+FS_IOC_ENABLE_VERITY can fail with the following errors:
+
+- ``EACCES``: the process does not have write access to the file
+- ``EEXIST``: the file already has verity enabled
+- ``EFAULT``: the caller provided inaccessible memory
+- ``EINTR``: the operation was interrupted by a fatal signal
+- ``EINVAL``: unsupported version, hash algorithm, or block size; or
+  reserved bits are set; or the file descriptor refers to neither a
+  regular file nor a directory.
+- ``EISDIR``: the file descriptor refers to a directory
+- ``EMSGSIZE``: the salt or signature is too long
+- ``ENOENT``: fs-verity recognizes the hash algorithm, but it's not
+  available in the kernel's crypto API as currently configured (e.g.
+  for SHA-512, missing CONFIG_CRYPTO_SHA512).
+- ``ENOTTY``: this type of filesystem does not implement fs-verity
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
+  support; or the filesystem superblock has not had the 'verity'
+  feature enabled on it; or the filesystem does not support fs-verity
+  on this file.  (See `Filesystem support`_.)
+- ``EPERM``: the file is append-only
+- ``EROFS``: the filesystem is read-only
+- ``ETXTBSY``: someone has the file open for writing.  This can be the
+  caller's file descriptor, another open file descriptor, or the file
+  reference held by a writable memory map.
+
+FS_IOC_MEASURE_VERITY
+---------------------
+
+The FS_IOC_MEASURE_VERITY ioctl retrieves the measurement of a verity
+file.  The file measurement is a digest that cryptographically
+identifies the file contents that are being enforced on reads.
+
+This ioctl takes in a pointer to a variable-length structure::
+
+    struct fsverity_digest {
+            __u16 digest_algorithm;
+            __u16 digest_size; /* input/output */
+            __u8 digest[];
+    };
+
+``digest_size`` is an input/output field.  On input, it must be
+initialized to the number of bytes allocated for the variable-length
+``digest`` field.
+
+On success, 0 is returned and the kernel fills in the structure as
+follows:
+
+- ``digest_algorithm`` will be the hash algorithm used for the file
+  measurement.  It will match ``fsverity_enable_arg::hash_algorithm``.
+- ``digest_size`` will be the size of the digest in bytes, e.g. 32
+  for SHA-256.  (This can be redundant with ``digest_algorithm``.)
+- ``digest`` will be the actual bytes of the digest.
+
+FS_IOC_MEASURE_VERITY is guaranteed to execute in constant time,
+regardless of the size of the file.
+
+FS_IOC_MEASURE_VERITY can fail with the following errors:
+
+- ``EFAULT``: the caller provided inaccessible memory
+- ``ENODATA``: the file is not a verity file
+- ``ENOTTY``: this type of filesystem does not implement fs-verity
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
+  support, or the filesystem superblock has not had the 'verity'
+  feature enabled on it.  (See `Filesystem support`_.)
+- ``EOVERFLOW``: the digest is longer than the specified
+  ``digest_size`` bytes.  Try providing a larger buffer.
+
+FS_IOC_GETFLAGS
+---------------
+
+The existing ioctl FS_IOC_GETFLAGS (which isn't specific to fs-verity)
+can also be used to check whether a file has fs-verity enabled or not.
+To do so, check for FS_VERITY_FL (0x00100000) in the returned flags.
+
+The verity flag is not settable via FS_IOC_SETFLAGS.  You must use
+FS_IOC_ENABLE_VERITY instead, since parameters must be provided.
+
+Accessing verity files
+======================
+
+Applications can transparently access a verity file just like a
+non-verity one, with the following exceptions:
+
+- Verity files are readonly.  They cannot be opened for writing or
+  truncate()d, even if the file mode bits allow it.  Attempts to do
+  one of these things will fail with EPERM.  However, changes to
+  metadata such as owner, mode, timestamps, and xattrs are still
+  allowed, since these are not measured by fs-verity.  Verity files
+  can also still be renamed, deleted, and linked to.
+
+- Direct I/O is not supported on verity files.  Attempts to use direct
+  I/O on such files will fall back to buffered I/O.
+
+- DAX (Direct Access) is not supported on verity files, because this
+  would circumvent the data verification.
+
+- Reads of data that doesn't match the verity Merkle tree will fail
+  with EIO (for read()) or SIGBUS (for mmap() reads).
+
+- If the sysctl "fs.verity.require_signatures" is set to 1 and the
+  file's verity measurement is not signed by a key in the fs-verity
+  keyring, then opening the file will fail.  See `Built-in signature
+  verification`_.
+
+Direct access to the Merkle tree is not supported.  Therefore, if a
+verity file is copied, or is backed up and restored, then it will lose
+its "verity"-ness.  fs-verity is primarily meant for files like
+executables that are managed by a package manager.
+
+File measurement computation
+============================
+
+This section describes how fs-verity hashes the file contents using a
+Merkle tree to produce the "file measurement" which cryptographically
+identifies the file contents.  This algorithm is the same for all
+filesystems that support fs-verity.
+
+Userspace only needs to be aware of this algorithm if it needs to
+compute the file measurement itself, e.g. in order to sign the file.
+
+Merkle tree
+-----------
+
+The file contents is divided into blocks, where the block size is
+configurable but is usually 4096 bytes.  The end of the last block is
+zero-padded if needed.  Each block is then hashed, producing the first
+level of hashes.  Then, the hashes in this first level are grouped
+into 'blocksize'-byte blocks (zero-padding the ends as needed) and
+these blocks are hashed, producing the second level of hashes.  This
+proceeds up the tree until only a single block remains.  The hash of
+this block is the "Merkle tree root hash".
+
+If the file is nonempty and fits in one block, then the "Merkle tree
+root hash" is simply the hash of the single data block.  If the file
+is empty, then the "Merkle tree root hash" is all zeroes.
+
+The "blocks" here are not necessarily the same as "filesystem blocks".
+
+If a salt was specified, then it's zero-padded to the closest multiple
+of the input size of the hash algorithm's compression function, e.g.
+64 bytes for SHA-256 or 128 bytes for SHA-512.  The padded salt is
+prepended to every data or Merkle tree block that is hashed.
+
+The purpose of the block padding is to cause every hash to be taken
+over the same amount of data, which simplifies the implementation and
+keeps open more possibilities for hardware acceleration.  The purpose
+of the salt padding is to make the salting "free" when the salted hash
+state is precomputed, then imported for each hash.
+
+Example: in the recommended configuration of SHA-256 and 4K blocks,
+128 hash values fit in each block.  Thus, each level of the Merkle
+tree is approximately 128 times smaller than the previous, and for
+large files the Merkle tree's size converges to approximately 1/127 of
+the original file size.  However, for small files, the padding is
+significant, making the space overhead proportionally more.
+
+fs-verity descriptor
+--------------------
+
+By itself, the Merkle tree root hash is ambiguous.  For example, it
+can't a distinguish a large file from a small second file whose data
+is exactly the top-level hash block of the first file.  Ambiguities
+also arise from the convention of padding to the next block boundary.
+
+To solve this problem, the verity file measurement is actually
+computed as a hash of the following structure, which contains the
+Merkle tree root hash as well as other fields such as the file size::
+
+    struct fsverity_descriptor {
+            __u8 version;           /* must be 1 */
+            __u8 hash_algorithm;    /* Merkle tree hash algorithm */
+            __u8 log_blocksize;     /* log2 of size of data and tree blocks */
+            __u8 salt_size;         /* size of salt in bytes; 0 if none */
+            __le32 sig_size;        /* must be 0 */
+            __le64 data_size;       /* size of file the Merkle tree is built over */
+            __u8 root_hash[64];     /* Merkle tree root hash */
+            __u8 salt[32];          /* salt prepended to each hashed block */
+            __u8 __reserved[144];   /* must be 0's */
+    };
+
+Note that the ``sig_size`` field must be set to 0 for the purpose of
+computing the file measurement, even if a signature was provided (or
+will be provided) to `FS_IOC_ENABLE_VERITY`_.
+
+Built-in signature verification
+===============================
+
+With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
+a portion of an authentication policy (see `Use cases`_) in the
+kernel.  Specifically, it adds support for:
+
+1. At fs-verity module initialization time, a keyring ".fs-verity" is
+   created.  The root user can add trusted X.509 certificates to this
+   keyring using the add_key() system call, then (when done)
+   optionally use keyctl_restrict_keyring() to prevent additional
+   certificates from being added.
+
+2. `FS_IOC_ENABLE_VERITY`_ accepts a pointer to a PKCS#7 formatted
+   signature in DER format of the file measurement.  On success, this
+   signature is persisted alongside the Merkle tree.  Then, any time
+   the file is opened, the kernel will verify this signature against
+   the certificates in the ".fs-verity" keyring, and verify that it
+   matches the actual file measurement.
+
+3. A new sysctl "fs.verity.require_signatures" is made available.
+   When set to 1, the kernel requires that all verity files have a
+   correctly signed file measurement as described in (2).
+
+File measurements must be signed in the following format, which is
+similar to the structure used by `FS_IOC_MEASURE_VERITY`_::
+
+    struct fsverity_signed_digest {
+            char magic[8];                  /* must be "FSVerity" */
+            __le16 digest_algorithm;
+            __le16 digest_size;
+            __u8 digest[];
+    };
+
+fs-verity's built-in signature verification support is meant as a
+relatively simple mechanism that can be used to provide some level of
+authenticity protection for verity files, as an alternative to doing
+the signature verification in userspace or using IMA-appraisal.
+However, with this mechanism, userspace programs still need to check
+that the verity bit is set, and there is no protection against verity
+files being swapped around.
+
+Filesystem support
+==================
+
+fs-verity is currently supported by the ext4 and f2fs filesystems.
+The CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity
+on either filesystem.
+
+``include/linux/fsverity.h`` declares the interface between the
+``fs/verity/`` support layer and filesystems.  Briefly, filesystems
+must provide an ``fsverity_operations`` structure that provides
+methods to read and write the verity metadata to a filesystem-specific
+location, including the Merkle tree blocks and
+``fsverity_descriptor``.  Filesystems must also call functions in
+``fs/verity/`` at certain times, such as when a file is opened or when
+pages have been read into the pagecache.  (See `Verifying data`_.)
+
+ext4
+----
+
+ext4 supports fs-verity since Linux TODO and e2fsprogs v1.45.2.
+
+To create verity files on an ext4 filesystem, the filesystem must have
+been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on
+it.  "verity" is an RO_COMPAT filesystem feature, so once set, old
+kernels will only be able to mount the filesystem readonly, and old
+versions of e2fsck will be unable to check the filesystem.  Moreover,
+currently ext4 only supports mounting a filesystem with the "verity"
+feature when its block size is equal to PAGE_SIZE (often 4096 bytes).
+
+ext4 sets the EXT4_VERITY_FL on-disk inode flag on verity files.  It
+can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be cleared.
+
+ext4 also supports encryption, which can be used simultaneously with
+fs-verity.  In this case, the plaintext data is verified rather than
+the ciphertext.  This is necessary in order to make the file
+measurement meaningful, since every file is encrypted differently.
+
+ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
+past the end of the file, starting at the first 64K boundary beyond
+i_size.  This approach works because (a) verity files are readonly,
+and (b) pages fully beyond i_size aren't visible to userspace but can
+be read/written internally by ext4 with only some relatively small
+changes to ext4.  This approach avoids having to depend on the
+EA_INODE feature and on rearchitecturing ext4's xattr support to
+support paging multi-gigabyte xattrs into memory, and to support
+encrypting xattrs.  Note that the verity metadata *must* be encrypted
+when the file is, since it contains hashes of the plaintext data.
+
+Currently, ext4 verity only supports the case where the Merkle tree
+block size, filesystem block size, and page size are all the same.  It
+also only supports extent-based files.
+
+f2fs
+----
+
+f2fs supports fs-verity since Linux TODO and f2fs-tools v1.11.0.
+
+To create verity files on an f2fs filesystem, the filesystem must have
+been formatted with ``-O verity``.
+
+f2fs sets the FADVISE_VERITY_BIT on-disk inode flag on verity files.
+It can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be
+cleared.
+
+Like ext4, f2fs stores the verity metadata (Merkle tree and
+fsverity_descriptor) past the end of the file, starting at the first
+64K boundary beyond i_size.  See explanation for ext4 above.
+Moreover, f2fs supports at most 4096 bytes of xattr entries per inode
+which wouldn't be enough for even a single Merkle tree block.
+
+Currently, f2fs verity only supports a Merkle tree block size of 4096.
+
+Implementation details
+======================
+
+Verifying data
+--------------
+
+fs-verity ensures that all reads of a verity file's data are verified,
+regardless of which syscall is used to do the read (e.g. mmap(),
+read(), pread()) and regardless of whether it's the first read or a
+later read (unless the later read can return cached data that was
+already verified).  Below, we describe how filesystems implement this.
+
+Pagecache
+~~~~~~~~~
+
+For filesystems using Linux's pagecache, the ``->readpage()`` and
+``->readpages()`` methods must be modified to verify pages before they
+are marked Uptodate.  Merely hooking ``->read_iter()`` would be
+insufficient, since ``->read_iter()`` is not used for memory maps.
+
+Therefore, fs/verity/ provides a function fsverity_verify_page() which
+verifies a page that has been read into the pagecache of a verity
+inode, but is still locked and not Uptodate, so it's not yet readable
+by userspace.  As needed to do the verification,
+fsverity_verify_page() will call back into the filesystem to read
+Merkle tree pages via fsverity_operations::read_merkle_tree_page().
+
+fsverity_verify_page() returns false if verification failed; in this
+case, the filesystem must not set the page Uptodate.  Following this,
+as per the usual Linux pagecache behavior, attempts by userspace to
+read() from the part of the file containing the page will fail with
+EIO, and accesses to the page within a memory map will raise SIGBUS.
+
+fsverity_verify_page() currently only supports the case where the
+Merkle tree block size is equal to PAGE_SIZE (often 4096 bytes).
+
+In principle, fsverity_verify_page() verifies the entire path in the
+Merkle tree from the data page to the root hash.  However, for
+efficiency the filesystem may cache the hash pages.  Therefore,
+fsverity_verify_page() only ascends the tree reading hash pages until
+an already-verified hash page is seen, as indicated by the PageChecked
+bit being set.  It then verifies the path to that page.
+
+This optimization, which is also used by dm-verity, results in
+excellent sequential read performance.  This is because usually (e.g.
+127 in 128 times for 4K blocks and SHA-256) the hash page from the
+bottom level of the tree will already be cached and checked from
+reading a previous data page.  However, random reads perform worse.
+
+Block device based filesystems
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Block device based filesystems (e.g. ext4 and f2fs) in Linux also use
+the pagecache, so the above subsection applies too.  However, they
+also usually read many pages from a file at once, grouped into a
+structure called a "bio".  To make it easier for these types of
+filesystems to support fs-verity, fs/verity/ also provides a function
+fsverity_verify_bio() which verifies all pages in a bio.
+
+ext4 and f2fs also support encryption.  If a verity file is also
+encrypted, the pages must be decrypted before being verified.  To
+support this, these filesystems allocate a "post-read context" for
+each bio and store it in ``->bi_private``::
+
+    struct bio_post_read_ctx {
+           struct bio *bio;
+           struct work_struct work;
+           unsigned int cur_step;
+           unsigned int enabled_steps;
+    };
+
+``enabled_steps`` is a bitmask that specifies whether decryption,
+verity, or both is enabled.  After the bio completes, for each needed
+postprocessing step the filesystem enqueues the bio_post_read_ctx on a
+workqueue, and then the workqueue work does the decryption or
+verification.  Finally, pages where no decryption or verity error
+occurred are marked Uptodate, and the pages are unlocked.
+
+Files on ext4 and f2fs may contain holes.  Normally, ``->readpages()``
+simply zeroes holes and sets the corresponding pages Uptodate; no bios
+are issued.  To prevent this case from bypassing fs-verity, these
+filesystems use fsverity_verify_page() to verify hole pages.
+
+ext4 and f2fs disable direct I/O on verity files, since otherwise
+direct I/O would bypass fs-verity.  (They also do the same for
+encrypted files.)
+
+Userspace utility
+=================
+
+This document focuses on the kernel, but a userspace utility for
+fs-verity can be found at:
+
+	https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git
+
+See the README.md file in the fsverity-utils source tree for details,
+including examples of setting up fs-verity protected files.
+
+Tests
+=====
+
+To test fs-verity, use xfstests.  For example, using `kvm-xfstests
+<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
+
+    kvm-xfstests -c ext4,f2fs -g verity
+
+FAQ
+===
+
+This section answers frequently asked questions about fs-verity that
+weren't already directly answered in other parts of this document.
+
+:Q: Why isn't fs-verity part of IMA?
+:A: fs-verity and IMA (Integrity Measurement Architecture) have
+    different focuses.  fs-verity is a filesystem-level mechanism for
+    hashing individual files using a Merkle tree.  In contrast, IMA
+    specifies a system-wide policy that specifies which files are
+    hashed and what to do with those hashes, such as log them,
+    authenticate them, or add them to a measurement list.
+
+    IMA is planned to support the fs-verity hashing mechanism as an
+    alternative to doing full file hashes, for people who want the
+    performance and security benefits of the Merkle tree based hash.
+    But it doesn't make sense to force all uses of fs-verity to be
+    through IMA.  As a standalone filesystem feature, fs-verity
+    already meets many users' needs, and it's testable like other
+    filesystem features e.g. with xfstests.
+
+:Q: Isn't fs-verity useless because the attacker can just modify the
+    hashes in the Merkle tree, which is stored on-disk?
+:A: To verify the authenticity of an fs-verity file you must verify
+    the authenticity of the "file measurement", which is basically the
+    root hash of the Merkle tree.  See `Use cases`_.
+
+:Q: Isn't fs-verity useless because the attacker can just replace a
+    verity file with a non-verity one?
+:A: See `Use cases`_.  In the initial use case, it's really trusted
+    userspace code that authenticates the files; fs-verity is just a
+    tool to do this job efficiently and securely.  The trusted
+    userspace code will consider non-verity files to be inauthentic.
+
+:Q: Why does the Merkle tree need to be stored on-disk?  Couldn't you
+    store just the root hash?
+:A: If the Merkle tree wasn't stored on-disk, then you'd have to
+    compute the entire tree when the file is first accessed, even if
+    just one byte is being read.  This is a fundamental consequence of
+    how Merkle tree hashing works.  To verify a leaf node, you need to
+    verify the whole path to the root hash, including the root node
+    (the thing which the root hash is a hash of).  But if the root
+    node isn't stored on-disk, you have to compute it by hashing its
+    children, and so on until you've actually hashed the entire file.
+
+    That defeats most of the point of doing a Merkle tree-based hash,
+    since if you have to hash the whole file ahead of time anyway,
+    then you could simply do sha256(file) instead.  That would be much
+    simpler, and a bit faster too.
+
+    It's true that an in-memory Merkle tree could still provide the
+    advantage of verification on every read rather than just on the
+    first read.  However, it would be inefficient because every time a
+    hash page gets evicted (you can't pin the entire Merkle tree into
+    memory, since it may be very large), in order to restore it you
+    again need to hash everything below it in the tree.  This again
+    defeats most of the point of doing a Merkle tree-based hash, since
+    a single block read could trigger re-hashing gigabytes of data.
+
+:Q: But couldn't you store just the leaf nodes and compute the rest?
+:A: See previous answer; this really just moves up one level, since
+    one could alternatively interpret the data blocks as being the
+    leaf nodes of the Merkle tree.  It's true that the tree can be
+    computed much faster if the leaf level is stored rather than just
+    the data, but that's only because each level is less than 1% the
+    size of the level below (assuming the recommended settings of
+    SHA-256 and 4K blocks).  For the exact same reason, by storing
+    "just the leaf nodes" you'd already be storing over 99% of the
+    tree, so you might as well simply store the whole tree.
+
+:Q: Can the Merkle tree be built ahead of time, e.g. distributed as
+    part of a package that is installed to many computers?
+:A: This isn't currently supported.  It was part of the original
+    design, but was removed to simplify the kernel UAPI and because it
+    wasn't a critical use case.  Files are usually installed once and
+    used many times, and cryptographic hashing is somewhat fast on
+    most modern processors.
+
+:Q: Why doesn't fs-verity support writes?
+:A: Write support would be very difficult and would require a
+    completely different design, so it's well outside the scope of
+    fs-verity.  Write support would require:
+
+    - A way to maintain consistency between the data and hashes,
+      including all levels of hashes, since corruption after a crash
+      (especially of potentially the entire file!) is unacceptable.
+      The main options for solving this are data journalling,
+      copy-on-write, and log-structured volume.  But it's very hard to
+      retrofit existing filesystems with new consistency mechanisms.
+      Data journalling is available on ext4, but is very slow.
+
+    - Rebuilding the the Merkle tree after every write, which would be
+      extremely inefficient.  Alternatively, a different authenticated
+      dictionary structure such as an "authenticated skiplist" could
+      be used.  However, this would be far more complex.
+
+    Compare it to dm-verity vs. dm-integrity.  dm-verity is very
+    simple: the kernel just verifies read-only data against a
+    read-only Merkle tree.  In contrast, dm-integrity supports writes
+    but is slow, is much more complex, and doesn't actually support
+    full-device authentication since it authenticates each sector
+    independently, i.e. there is no "root hash".  It doesn't really
+    make sense for the same device-mapper target to support these two
+    very different cases; the same applies to fs-verity.
+
+:Q: Since verity files are immutable, why isn't the immutable bit set?
+:A: The existing "immutable" bit (FS_IMMUTABLE_FL) already has a
+    specific set of semantics which not only make the file contents
+    read-only, but also prevent the file from being deleted, renamed,
+    linked to, or having its owner or mode changed.  These extra
+    properties are unwanted for fs-verity, so reusing the immutable
+    bit isn't appropriate.
+
+:Q: Why does the API use ioctls instead of setxattr() and getxattr()?
+:A: Abusing the xattr interface for basically arbitrary syscalls is
+    heavily frowned upon by most of the Linux filesystem developers.
+    An xattr should really just be an xattr on-disk, not an API to
+    e.g. magically trigger construction of a Merkle tree.
+
+:Q: Does fs-verity support remote filesystems?
+:A: Only ext4 and f2fs support is implemented currently, but in
+    principle any filesystem that can store per-file verity metadata
+    can support fs-verity, regardless of whether it's local or remote.
+    Some filesystems may have fewer options of where to store the
+    verity metadata; one possibility is to store it past the end of
+    the file and "hide" it from userspace by manipulating i_size.  The
+    data verification functions provided by ``fs/verity/`` also assume
+    that the filesystem uses the Linux pagecache, but both local and
+    remote filesystems normally do so.
+
+:Q: Why is anything filesystem-specific at all?  Shouldn't fs-verity
+    be implemented entirely at the VFS level?
+:A: There are many reasons why this is not possible or would be very
+    difficult, including the following:
+
+    - To prevent bypassing verification, pages must not be marked
+      Uptodate until they've been verified.  Currently, each
+      filesystem is responsible for marking pages Uptodate via
+      ``->readpages()``.  Therefore, currently it's not possible for
+      the VFS to do the verification on its own.  Changing this would
+      require significant changes to the VFS and all filesystems.
+
+    - It would require defining a filesystem-independent way to store
+      the verity metadata.  Extended attributes don't work for this
+      because (a) the Merkle tree may be gigabytes, but many
+      filesystems assume that all xattrs fit into a single 4K
+      filesystem block, and (b) ext4 and f2fs encryption doesn't
+      encrypt xattrs, yet the Merkle tree *must* be encrypted when the
+      file contents are, because it stores hashes of the plaintext
+      file contents.
+
+      So the verity metadata would have to be stored in an actual
+      file.  Using a separate file would be very ugly, since the
+      metadata is fundamentally part of the file to be protected, and
+      it could cause problems where users could delete the real file
+      but not the metadata file or vice versa.  On the other hand,
+      having it be in the same file would break applications unless
+      filesystems' notion of i_size were divorced from the VFS's,
+      which would be complex and require changes to all filesystems.
+
+    - It's desirable that FS_IOC_ENABLE_VERITY uses the filesystem's
+      transaction mechanism so that either the file ends up with
+      verity enabled, or no changes were made.  Allowing intermediate
+      states to occur after a crash may cause problems.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 1131c34d77f6f1..416c7f0e123af7 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -31,6 +31,7 @@ filesystem implementations.
 
    journalling
    fscrypt
+   fsverity
 
 Filesystem-specific documentation
 =================================
-- 
2.22.0.410.gd8fdbe21b5-goog

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox