Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andy Lutomirski <luto@amacapital.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Denys Vlasenko <dvlasenk@redhat.com>,
	Brian Gerst <brgerst@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@alien8.de>, "H. Peter Anvin" <hpa@zytor.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Waiman Long <Waiman.Long@hp.com>
Subject: Re: [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code
Date: Sun, 14 Jun 2015 17:40:30 -0700	[thread overview]
Message-ID: <20150615004030.GK3913@linux.vnet.ibm.com> (raw)
In-Reply-To: <20150614193825.GA19582@redhat.com>

On Sun, Jun 14, 2015 at 09:38:25PM +0200, Oleg Nesterov wrote:
> On 06/14, Oleg Nesterov wrote:
> >
> > On 06/14, Ingo Molnar wrote:
> > >
> > > * Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > > > +		spin_lock(&pgd_lock); /* Implies rcu_read_lock() for the task list iteration: */
> > > >                                          ^^^^^^^^^^^^^^^^^^^^^^^
> > > >
> > > > Hmm, but it doesn't if PREEMPT_RCU? No, no, I do not pretend I understand how it
> > > > actually works ;) But, say, rcu_check_callbacks() can be called from irq and
> > > > since spin_lock() doesn't increment current->rcu_read_lock_nesting this can lead
> > > > to rcu_preempt_qs()?
> > >
> > > No, RCU grace periods are still defined by 'heavy' context boundaries such as
> > > context switches, entering idle or user-space mode.
> > >
> > > PREEMPT_RCU is like traditional RCU, except that blocking is allowed within the
> > > RCU read critical section - that is why it uses a separate nesting counter
> > > (current->rcu_read_lock_nesting), not the preempt count.
> >
> > Yes.
> >
> > > But if a piece of kernel code is non-preemptible, such as a spinlocked region or
> > > an irqs-off region, then those are still natural RCU read lock regions, regardless
> > > of the RCU model, and need no additional RCU locking.
> >
> > I do not think so. Yes I understand that rcu_preempt_qs() itself doesn't
> > finish the gp, but if there are no other rcu-read-lock holders then it
> > seems synchronize_rcu() on another CPU can return _before_ spin_unlock(),
> > this CPU no longer needs rcu_preempt_note_context_switch().
> >
> > OK, I can be easily wrong, I do not really understand the implementation
> > of PREEMPT_RCU. Perhaps preempt_disable() can actually act as rcu_read_lock()
> > with the _current_ implementation. Still this doesn't look right even if
> > happens to work, and Documentation/RCU/checklist.txt says:
> >
> > 11.	Note that synchronize_rcu() -only- guarantees to wait until
> > 	all currently executing rcu_read_lock()-protected RCU read-side
> > 	critical sections complete.  It does -not- necessarily guarantee
> > 	that all currently running interrupts, NMIs, preempt_disable()
> > 	code, or idle loops will complete.  Therefore, if your
> > 	read-side critical sections are protected by something other
> > 	than rcu_read_lock(), do -not- use synchronize_rcu().
> 
> 
> I've even checked this ;) I applied the stupid patch below and then
> 
> 	$ taskset 2 perl -e 'syscall 157, 666, 5000' &
> 	[1] 565
> 
> 	$ taskset 1 perl -e 'syscall 157, 777'
> 
> 	$
> 	[1]+  Done                    taskset 2 perl -e 'syscall 157, 666, 5000'
> 
> 	$ dmesg -c
> 	SPIN start
> 	SYNC start
> 	SYNC done!
> 	SPIN done!

Please accept my apologies for my late entry to this thread.
Youngest kid graduated from university this weekend, so my
attention has been elsewhere.

If you were to disable interrupts instead of preemption, I would expect
that the preemptible-RCU grace period would be blocked -- though I am
not particularly comfortable with people relying on disabled interrupts
blocking a preemptible-RCU grace period.

Here is what can happen if you try to block a preemptible-RCU grace
period by disabling preemption, assuming that there are at least two
online CPUs in the system:

1.	CPU 0 does spin_lock(), which disables preemption.

2.	CPU 1 starts a grace period.

3.	CPU 0 takes a scheduling-clock interrupt.  It raises softirq,
	and the RCU_SOFTIRQ handler notes that there is a new grace
	period and sets state so that a subsequent quiescent state on
	this CPU will be noted.

4.	CPU 0 takes another scheduling-clock interrupt, which checks
	current->rcu_read_lock_nesting, and notes that there is no
	preemptible-RCU read-side critical section in progress.  It
	again raises softirq, and the RCU_SOFTIRQ handler reports
	the quiescent state to core RCU.

5.	Once each of the other CPUs report a quiescent state, the
	grace period can end, despite CPU 0 having preemption
	disabled the whole time.

So Oleg's test is correct, disabling preemption is not sufficient
to block a preemptible-RCU grace period.

The usual suggestion would be to add rcu_read_lock() just after the
lock is acquired and rcu_read_unlock() just before each release of that
same lock.  Putting the entire RCU read-side critical section under
the lock prevents RCU from having to invoke rcu_read_unlock_special()
due to preemption.  (It might still invoke it if the RCU read-side
critical section was overly long, but that is much cheaper than the
preemption-handling case.)

							Thanx, Paul

> Oleg.
> 
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2049,6 +2049,9 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
>  }
>  #endif
> 
> +#include <linux/delay.h>
> +
> +
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2062,6 +2065,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> 
>  	error = 0;
>  	switch (option) {
> +	case 666:
> +		preempt_disable();
> +		pr_crit("SPIN start\n");
> +		while (arg2--)
> +			mdelay(1);
> +		pr_crit("SPIN done!\n");
> +		preempt_enable();
> +		break;
> +	case 777:
> +		pr_crit("SYNC start\n");
> +		synchronize_rcu();
> +		pr_crit("SYNC done!\n");
> +		break;
>  	case PR_SET_PDEATHSIG:
>  		if (!valid_signal(arg2)) {
>  			error = -EINVAL;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2015-06-15  0:40 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-13  9:49 [PATCH 00/12, v2] x86/mm: Implement lockless pgd_alloc()/pgd_free() Ingo Molnar
2015-06-13  9:49 ` [PATCH 01/12] x86/mm/pat: Don't free PGD entries on memory unmap Ingo Molnar
2015-06-13  9:49 ` [PATCH 02/12] x86/mm/hotplug: Remove pgd_list use from the memory hotplug code Ingo Molnar
2015-06-13 19:24   ` Oleg Nesterov
2015-06-14  7:36     ` Ingo Molnar
2015-06-14 19:24       ` Oleg Nesterov
2015-06-14 19:38         ` Oleg Nesterov
2015-06-15  0:40           ` Paul E. McKenney [this message]
2015-06-15 20:33             ` Ingo Molnar
2015-06-13  9:49 ` [PATCH 03/12] x86/mm/hotplug: Don't remove PGD entries in remove_pagetable() Ingo Molnar
2015-06-13  9:49 ` [PATCH 04/12] x86/mm/hotplug: Simplify sync_global_pgds() Ingo Molnar
2015-06-13  9:49 ` [PATCH 05/12] mm: Introduce arch_pgd_init_late() Ingo Molnar
2015-06-13  9:49 ` [PATCH 06/12] x86/mm: Enable and use the arch_pgd_init_late() method Ingo Molnar
2015-06-13  9:49 ` [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from the Xen guest code Ingo Molnar
2015-06-14  8:26   ` Ingo Molnar
2015-06-15  9:05   ` Ian Campbell
2015-06-15 10:30     ` David Vrabel
2015-06-15 20:35       ` Ingo Molnar
2015-06-16 14:15         ` David Vrabel
2015-06-16 14:19           ` Boris Ostrovsky
2015-06-16 14:27             ` David Vrabel
2015-06-13  9:49 ` [PATCH 08/12] x86/mm: Remove pgd_list use from vmalloc_sync_all() Ingo Molnar
2015-06-13  9:49 ` [PATCH 09/12] x86/mm/pat/32: Remove pgd_list use from the PAT code Ingo Molnar
2015-06-13  9:49 ` [PATCH 10/12] x86/mm: Make pgd_alloc()/pgd_free() lockless Ingo Molnar
2015-06-13  9:49 ` [PATCH 11/12] x86/mm: Remove pgd_list leftovers Ingo Molnar
2015-06-13  9:49 ` [PATCH 12/12] x86/mm: Simplify pgd_alloc() Ingo Molnar
2015-06-13 18:58 ` why do we need vmalloc_sync_all? Oleg Nesterov
2015-06-14  7:59   ` Ingo Molnar
2015-06-14 20:06     ` Oleg Nesterov
2015-06-15  2:47       ` Andi Kleen
2015-06-15  2:57         ` Andy Lutomirski
2015-06-15 20:28           ` Ingo Molnar
2015-06-15 20:48             ` Andy Lutomirski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150615004030.GK3913@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=Waiman.Long@hp.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=brgerst@gmail.com \
    --cc=dvlasenk@redhat.com \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@amacapital.net \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).