From: Byungchul Park <byungchul@sk.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>,
	Byungchul Park <lkml.byungchul.park@gmail.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	kernel_team@skhynix.com, akpm@linux-foundation.org,
	ying.huang@intel.com, vernhao@tencent.com,
	mgorman@techsingularity.net, hughd@google.com,
	willy@infradead.org, peterz@infradead.org, luto@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, rjgolo@gmail.com
Subject: Re: [PATCH v11 09/12] mm: implement LUF(Lazy Unmap Flush) defering tlb flush when folios get unmapped
Date: Tue, 11 Jun 2024 18:12:14 +0900
Message-ID: <20240611091214.GA16469@system.software.com>
In-Reply-To: <d650c29b-129f-4fac-9a9d-ea1fbdae2c3a@intel.com>

On Mon, Jun 03, 2024 at 06:23:46AM -0700, Dave Hansen wrote:
> On 6/3/24 02:35, Byungchul Park wrote:
> ...
> > In luf's point of view, the points where the deferred flush should be
> > performed are simply:
> > 
> > 	1. when changing the vma maps, that might be luf'ed.
> > 	2. when updating data of the pages, that might be luf'ed.
> 
> It's simple, but the devil is in the details as always.
> 
> > All we need to do is to identify the points:
> > 
> > 	1. when changing the vma maps, that might be luf'ed.
> > 
> > 	   a) mmap and munmap i.e. fault handler or unmap_region().
> > 	   b) permission to writable i.e. mprotect or fault handler.
> > 	   c) what I'm missing.
> 
> I'd say it even more generally: anything that installs a PTE which is
> inconsistent with the original PTE.  That, of course, includes writes.
> But it also includes crazy things that we do like uprobes.  Take a look
> at __replace_page().
> 
> I think the page_vma_mapped_walk() checks plus the ptl keep LUF at bay
> there.  But it needs some really thorough review.
> 
> But the bigger concern is that, if there was a problem, I can't think of
> a systematic way to find it.
> 
> > 	2. when updating data of the pages, that might be luf'ed.
> > 
> > 	   a) updating files through vfs e.g. file_end_write().
> > 	   b) updating files through writable maps i.e. 1-a) or 1-b).
> > 	   c) what I'm missing.
> 
> Filesystems or block devices that change content without a "write" from
> the local system.  Network filesystems and block devices come to mind.
> I honestly don't know what all the rules are around these, but they
> could certainly be troublesome.
> 
> There appear to be some interactions for NFS between file locking and
> page cache flushing.
> 
> But, stepping back ...
> 
> I'd honestly be a lot more comfortable if there was even a debugging LUF
> mode that enforced a rule that said:
> 
>   1. A LUF'd PTE can't be rewritten until after a luf_flush() occurs
>   2. A LUF'd page's position in the page cache can't be replaced until
>      after a luf_flush()

I'm thinking of a debug mode doing the following *pseudo* code. Please
check the logic only, since the grammar might be wrong:

   0-a) Introduce new fields in page_ext:

	#ifdef LUF_DEBUG
	struct list_head __percpu *luf_node;
	#endif

   0-b) Introduce new fields in struct address_space:

	#ifdef LUF_DEBUG
	struct list_head __percpu *luf_node;
	#endif

   0-c) Introduce new fields in struct task_struct:

	#ifdef LUF_DEBUG
	cpumask_t luf_pending_cpus;
	#endif

   0-d) Define percpu list_head to link luf'd folios:

	#ifdef LUF_DEBUG
	DEFINE_PER_CPU(struct list_head, luf_folios);
	DEFINE_PER_CPU(struct list_head, luf_address_spaces);
	#endif

   1) When skipping tlb flush in reclaim or migration for a folio:

	#ifdef LUF_DEBUG
	ext = get_page_ext_for_luf_debug(folio);
	as = folio_mapping(folio);

	for_each_cpu(cpu, skip_cpus) {
		list_add(per_cpu_ptr(ext->luf_node, cpu),
			 per_cpu_ptr(&luf_folios, cpu));
		if (as)
			list_add(per_cpu_ptr(as->luf_node, cpu),
				 per_cpu_ptr(&luf_address_spaces, cpu));
	}
	put_page_ext(ext);
	#endif

   2) When performing tlb flush in try_to_unmap_flush():
      Recall that luf only applies to unmapping during reclaim and migration.

	#ifdef LUF_DEBUG
	for_each_cpu(cpu, now_flushing_cpus) {
		for_each_node_safe(folio, per_cpu_ptr(&luf_folios, cpu)) {
			ext = get_page_ext_for_luf_debug(folio);
			list_del_init(per_cpu_ptr(ext->luf_node, cpu));
			put_page_ext(ext);
		}

		for_each_node_safe(as, per_cpu_ptr(&luf_address_spaces, cpu))
			list_del_init(per_cpu_ptr(as->luf_node, cpu));

		cpumask_clear_cpu(cpu, &current->luf_pending_cpus);
	}
	#endif
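
The bookkeeping in 1) and 2) can be modelled in userspace roughly as
below. This is only a sketch under assumptions: a 64-bit mask per folio
stands in for the per-cpu list nodes, and struct luf_folio,
luf_debug_defer(), luf_debug_flush() and luf_debug_pending() are
hypothetical names, not existing kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-folio debug state kept in page_ext. */
struct luf_folio {
	uint64_t pending_cpus;	/* bit n set: cpu n has not flushed yet */
};

/* Step 1: the tlb flush was skipped for @folio on every cpu in @skip_cpus. */
void luf_debug_defer(struct luf_folio *folio, uint64_t skip_cpus)
{
	folio->pending_cpus |= skip_cpus;
}

/* Step 2: the cpus in @flushing_cpus flush now, so drop them everywhere. */
void luf_debug_flush(struct luf_folio *folios, int nr, uint64_t flushing_cpus)
{
	for (int i = 0; i < nr; i++)
		folios[i].pending_cpus &= ~flushing_cpus;
}

/* True while at least one cpu still owes a flush for @folio. */
bool luf_debug_pending(const struct luf_folio *folio)
{
	return folio->pending_cpus != 0;
}
```

A folio deferred on cpus 0 and 1 stays pending after only cpu 0 flushes,
and stops being pending once cpu 1 has flushed too.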

   3) In pte_mkwrite():

	#ifdef LUF_DEBUG
	ext = get_page_ext_for_luf_debug(folio);

	for_each_online_cpu(cpu)
		if (!list_empty(per_cpu_ptr(ext->luf_node, cpu)))
			cpumask_set_cpu(cpu, &current->luf_pending_cpus);
	put_page_ext(ext);
	#endif

   4) On returning to user:

	#ifdef LUF_DEBUG
	WARN_ON(!cpumask_empty(&current->luf_pending_cpus));
	#endif

   5) Right after every a_ops->write_end() call:

	#ifdef LUF_DEBUG
	as = get_address_space_to_write_to();
	for_each_online_cpu(cpu)
		if (!list_empty(per_cpu_ptr(as->luf_node, cpu)))
			cpumask_set_cpu(cpu, &current->luf_pending_cpus);
	#endif

	luf_flush_or_its_optimized_version();

	#ifdef LUF_DEBUG
	WARN_ON(!cpumask_empty(&current->luf_pending_cpus));
	#endif
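
The checks in 3) to 5) boil down to "a task that touched something luf'd
must not proceed until the owed flushes are done". A minimal userspace
sketch of that rule follows, assuming plain bitmasks instead of
cpumask_t and per-cpu lists. struct luf_obj, struct luf_task and the
three helpers are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: a folio (step 3) or an address_space (step 5). */
struct luf_obj {
	uint64_t pending_cpus;	/* cpus whose tlb flush was deferred */
};

/* Hypothetical: models current->luf_pending_cpus. */
struct luf_task {
	uint64_t luf_pending_cpus;
};

/* Steps 3 and 5: touching a luf'd object makes the task owe its flushes. */
void luf_debug_touch(struct luf_task *t, const struct luf_obj *obj)
{
	t->luf_pending_cpus |= obj->pending_cpus;
}

/* Models a full luf flush: nothing stays pending afterwards. */
void luf_debug_flush_all(struct luf_task *t, struct luf_obj *obj)
{
	obj->pending_cpus = 0;
	t->luf_pending_cpus = 0;
}

/* Steps 4 and 5: true where the pseudo code above would WARN_ON(). */
bool luf_debug_would_warn(const struct luf_task *t)
{
	return t->luf_pending_cpus != 0;
}
```

In this model, touching an object that is pending on cpus 0 and 2 makes
the check fire until the flush runs. Touching the same object again after
the flush leaves the task clean.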

I will implement the debug mode this way, with everything serialized.
Do you think it would work for what we want?

	Byungchul

> or *some* other independent set of rules that can tell us when something
> goes wrong.  That uprobes code, for instance, seems like it will work.
> But I can also imagine writing it ten other ways where it would break
> when combined with LUF.

