[RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout
@ 2015-05-29 15:48 Clark Williams
  2015-05-29 19:11 ` Johannes Weiner
  2015-05-29 21:26 ` Andrew Morton
  0 siblings, 2 replies; 13+ messages in thread
From: Clark Williams @ 2015-05-29 15:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Thomas Gleixner, linux-mm, Andrew Morton, RT,
	Fernando Lopez-Lezcano, Steven Rostedt, Sebastian Andrzej Siewior

Johannes,

We are seeing a panic in the latest RT kernel (4.0.4-rt1) where a
VM_BUG_ON(!irqs_disabled()) in mm/memcontrol.c fires. This is because on
an RT kernel, rt_mutexes (which replace spinlocks) don't disable
interrupts while the lock is held. I talked to Steven and he suggested
that we replace the irqs_disabled() with spin_is_held(). 

Does this patch work for you?

Clark

From: Clark Williams <williams@redhat.com>
Date: Fri, 29 May 2015 10:28:55 -0500
Subject: [PATCH] mm: change irqs_disabled() to spin_is_locked() in mem_cgroup_swapout()

The irqs_disabled() check in mem_cgroup_swapout() fails on the latest
RT kernel because RT mutexes do not disable interrupts when held. Change
the test for the lock being held to use spin_is_locked.

Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Clark Williams <williams@redhat.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9da0f3e9c1f3..70befa14a8ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5845,7 +5845,7 @@ void mem_cgroup_swapout(struct page *page,
swp_entry_t entry) page_counter_uncharge(&memcg->memory, 1);
 
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
-	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(!spin_is_locked(&page_mapping(page)->tree_lock));
 
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout
  2015-05-29 15:48 [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout Clark Williams
@ 2015-05-29 19:11 ` Johannes Weiner
  2015-05-29 19:21   ` Steven Rostedt
  2015-05-29 21:26 ` Andrew Morton
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2015-05-29 19:11 UTC (permalink / raw)
  To: Clark Williams
  Cc: Thomas Gleixner, linux-mm, Andrew Morton, RT,
	Fernando Lopez-Lezcano, Steven Rostedt, Sebastian Andrzej Siewior

Hi Clark,

On Fri, May 29, 2015 at 10:48:15AM -0500, Clark Williams wrote:
> @@ -5845,7 +5845,7 @@ void mem_cgroup_swapout(struct page *page,
> swp_entry_t entry) page_counter_uncharge(&memcg->memory, 1);
>  
>  	/* XXX: caller holds IRQ-safe mapping->tree_lock */
> -	VM_BUG_ON(!irqs_disabled());
> +	VM_BUG_ON(!spin_is_locked(&page_mapping(page)->tree_lock));
>  
>  	mem_cgroup_charge_statistics(memcg, page, -1);

It's not about the lock, it's about preemption.  The charge statistics
use __this_cpu operations and they're updated from process context and
interrupt context both.

This function really should do a local_irq_save().  I only added the
VM_BUG_ON() to document that we know the caller is holding an IRQ-safe
lock and so we don't need to bother with another level of IRQ saving.

So how does this translate to RT?  I don't know.  But if switching to
explicit IRQ toggling would help you guys out we can do that.  It is
in the swapout path after all, the optimization isn't that important.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout
  2015-05-29 19:11 ` Johannes Weiner
@ 2015-05-29 19:21   ` Steven Rostedt
  0 siblings, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2015-05-29 19:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Clark Williams, Thomas Gleixner, linux-mm, Andrew Morton, RT,
	Fernando Lopez-Lezcano, Sebastian Andrzej Siewior

On Fri, 29 May 2015 15:11:59 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hi Clark,
> 
> On Fri, May 29, 2015 at 10:48:15AM -0500, Clark Williams wrote:
> > @@ -5845,7 +5845,7 @@ void mem_cgroup_swapout(struct page *page,
> > swp_entry_t entry) page_counter_uncharge(&memcg->memory, 1);
> >  
> >  	/* XXX: caller holds IRQ-safe mapping->tree_lock */
> > -	VM_BUG_ON(!irqs_disabled());
> > +	VM_BUG_ON(!spin_is_locked(&page_mapping(page)->tree_lock));
> >  
> >  	mem_cgroup_charge_statistics(memcg, page, -1);
> 
> It's not about the lock, it's about preemption.  The charge statistics

OK, I just lost my bet with Clark. He said it was about preemption, and
I said it was about the lock ;-)

> use __this_cpu operations and they're updated from process context and
> interrupt context both.
> 
> This function really should do a local_irq_save().  I only added the
> VM_BUG_ON() to document that we know the caller is holding an IRQ-safe
> lock and so we don't need to bother with another level of IRQ saving.
> 
> So how does this translate to RT?  I don't know.  But if switching to
> explicit IRQ toggling would help you guys out we can do that.  It is
> in the swapout path after all, the optimization isn't that important.

You only need to prevent this from preempting with other users here,
right? RT provides a "local_lock_irqsave(var)" which on vanilla linux
will do a local_irq_save(), but more importantly, it provides
documentation of what that local_irq_save is about (the var).

On -rt, that turns into a migrate disable, plus grabbing of the
rt_mutex(var). Thus, the process wont migrate from that CPU, but may be
preempted. If another process (or interrupt thread, as in -rt
interrupts run as preemptable threads) tries to do a local_lock(var) on
the same var, it will block.

Basically, you get the same serialization in both, but you don't cause
latencies in -rt.

-- Steve

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout
  2015-05-29 15:48 [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout Clark Williams
  2015-05-29 19:11 ` Johannes Weiner
@ 2015-05-29 21:26 ` Andrew Morton
  2015-06-01 18:14   ` [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL Clark Williams
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2015-05-29 21:26 UTC (permalink / raw)
  To: Clark Williams
  Cc: Johannes Weiner, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Steven Rostedt, Sebastian Andrzej Siewior

On Fri, 29 May 2015 10:48:15 -0500 Clark Williams <williams@redhat.com> wrote:

> The irqs_disabled() check in mem_cgroup_swapout() fails on the latest
> RT kernel because RT mutexes do not disable interrupts when held. Change
> the test for the lock being held to use spin_is_locked.
>
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5845,7 +5845,7 @@ void mem_cgroup_swapout(struct page *page,
> swp_entry_t entry) page_counter_uncharge(&memcg->memory, 1);
>  
>  	/* XXX: caller holds IRQ-safe mapping->tree_lock */
> -	VM_BUG_ON(!irqs_disabled());
> +	VM_BUG_ON(!spin_is_locked(&page_mapping(page)->tree_lock));
>  
>  	mem_cgroup_charge_statistics(memcg, page, -1);
>  	memcg_check_events(memcg, page);

spin_is_locked() returns zero on uniprocessor builds.  The results will
be unhappy.  

I suggest just deleting the check.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-05-29 21:26 ` Andrew Morton
@ 2015-06-01 18:14   ` Clark Williams
  2015-06-01 19:00     ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Clark Williams @ 2015-06-01 18:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Steven Rostedt, Sebastian Andrzej Siewior

On Fri, 29 May 2015 14:26:14 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 29 May 2015 10:48:15 -0500 Clark Williams <williams@redhat.com> wrote:
> 
> > The irqs_disabled() check in mem_cgroup_swapout() fails on the latest
> > RT kernel because RT mutexes do not disable interrupts when held. Change
> > the test for the lock being held to use spin_is_locked.
> >
> > ...
> >
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5845,7 +5845,7 @@ void mem_cgroup_swapout(struct page *page,
> > swp_entry_t entry) page_counter_uncharge(&memcg->memory, 1);
> >  
> >  	/* XXX: caller holds IRQ-safe mapping->tree_lock */
> > -	VM_BUG_ON(!irqs_disabled());
> > +	VM_BUG_ON(!spin_is_locked(&page_mapping(page)->tree_lock));
> >  
> >  	mem_cgroup_charge_statistics(memcg, page, -1);
> >  	memcg_check_events(memcg, page);
> 
> spin_is_locked() returns zero on uniprocessor builds.  The results will
> be unhappy.  
> 
> I suggest just deleting the check.

Guess this is Johannes call. We can just #ifdef it out and that would
remain the same when we finally merge PREEMPT_RT in mainline. 

If Johannes wants to keep the check on non-RT, here's a patch:

From: Clark Williams <williams@redhat.com>
Date: Mon, 1 Jun 2015 13:10:39 -0500
Subject: [PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL

The irqs_disabled() check in mem_cgroup_swapout() fails on the latest
RT kernel because RT mutexes do not disable interrupts when held. Ifdef
this check out for PREEMPT_RT_FULL.

Signed-off-by: Clark Williams <williams@redhat.com>
---
 mm/memcontrol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9da0f3e9c1f3..f3fcef7713f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5844,8 +5844,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, 1);
 
+#ifndef CONFIG_PREEMPT_RT_FULL
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
 	VM_BUG_ON(!irqs_disabled());
+#endif
 
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-06-01 18:14   ` [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL Clark Williams
@ 2015-06-01 19:00     ` Johannes Weiner
  2015-06-01 19:28       ` Steven Rostedt
  2015-06-11 11:40       ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 13+ messages in thread
From: Johannes Weiner @ 2015-06-01 19:00 UTC (permalink / raw)
  To: Clark Williams
  Cc: Andrew Morton, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Steven Rostedt, Sebastian Andrzej Siewior

On Mon, Jun 01, 2015 at 01:14:52PM -0500, Clark Williams wrote:
> On Fri, 29 May 2015 14:26:14 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Fri, 29 May 2015 10:48:15 -0500 Clark Williams <williams@redhat.com> wrote:
> > 
> > > The irqs_disabled() check in mem_cgroup_swapout() fails on the latest
> > > RT kernel because RT mutexes do not disable interrupts when held. Change
> > > the test for the lock being held to use spin_is_locked.
> > >
> > > ...
> > >
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -5845,7 +5845,7 @@ void mem_cgroup_swapout(struct page *page,
> > > swp_entry_t entry) page_counter_uncharge(&memcg->memory, 1);
> > >  
> > >  	/* XXX: caller holds IRQ-safe mapping->tree_lock */
> > > -	VM_BUG_ON(!irqs_disabled());
> > > +	VM_BUG_ON(!spin_is_locked(&page_mapping(page)->tree_lock));
> > >  
> > >  	mem_cgroup_charge_statistics(memcg, page, -1);
> > >  	memcg_check_events(memcg, page);
> > 
> > spin_is_locked() returns zero on uniprocessor builds.  The results will
> > be unhappy.  
> > 
> > I suggest just deleting the check.
> 
> Guess this is Johannes call. We can just #ifdef it out and that would
> remain the same when we finally merge PREEMPT_RT in mainline. 
> 
> If Johannes wants to keep the check on non-RT, here's a patch:

Andrew's suggestion makes sense, we can probably just delete the check
as long as we keep the comment.

That being said, I think it's a little weird that this doesn't work:

spin_lock_irq()
BUG_ON(!irqs_disabled())
spin_unlock_irq()

I'd expect that if you change the meaning of spin_lock_irq() from
"mask hardware interrupts" to "disable preemption by tophalf", you
would update the irqs_disabled() macro to match.  Most people using
this check probably don't care about the hardware state, only that
they don't get preempted by an interfering interrupt handler, no?

---
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 1 Jun 2015 14:30:49 -0400
Subject: [PATCH] mm: memcontrol: fix false-positive VM_BUG_ON() on -rt

On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
swapout path because the spin_lock_irq(&mapping->tree_lock) in the
caller doesn't actually disable the hardware interrupts - which is
fine, because on -rt the tophalves run in process context and so we
are still safe from preemption while updating the statistics.

Remove the VM_BUG_ON() but keep the comment of what we rely on.

Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f20..977f7cd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5833,9 +5833,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, 1);
 
-	/* XXX: caller holds IRQ-safe mapping->tree_lock */
-	VM_BUG_ON(!irqs_disabled());
-
+	/* Caller disabled preemption with mapping->tree_lock */
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
 }
-- 
2.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-06-01 19:00     ` Johannes Weiner
@ 2015-06-01 19:28       ` Steven Rostedt
  2015-06-11 11:40       ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 13+ messages in thread
From: Steven Rostedt @ 2015-06-01 19:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Clark Williams, Andrew Morton, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Sebastian Andrzej Siewior

On Mon, 1 Jun 2015 15:00:47 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Andrew's suggestion makes sense, we can probably just delete the check
> as long as we keep the comment.
> 
> That being said, I think it's a little weird that this doesn't work:
> 
> spin_lock_irq()
> BUG_ON(!irqs_disabled())
> spin_unlock_irq()
> 
> I'd expect that if you change the meaning of spin_lock_irq() from
> "mask hardware interrupts" to "disable preemption by tophalf", you
> would update the irqs_disabled() macro to match.  Most people using
> this check probably don't care about the hardware state, only that
> they don't get preempted by an interfering interrupt handler, no?

The thing is, in -rt, there's no state to check if a spin_lock_irq()
was done. Adding that would add overhead to the rt_mutexes without much
gain.

The fast path of spin_lock_irq() in -rt looks like this:

	migrate_disable();
	rt_mutex_cmpxchg(lock, NULL, current);

Now, the migrate_disable() is more like preempt disable.

Although, maybe we could have -rt change irq_disabled() just check
that, and add a raw_irq_disabled() for when we need to make sure
interrupts are really off.

-- Steve


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-06-01 19:00     ` Johannes Weiner
  2015-06-01 19:28       ` Steven Rostedt
@ 2015-06-11 11:40       ` Sebastian Andrzej Siewior
  2015-06-19 18:00         ` Johannes Weiner
  1 sibling, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-06-11 11:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Clark Williams, Andrew Morton, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Steven Rostedt

* Johannes Weiner | 2015-06-01 15:00:47 [-0400]:

>Andrew's suggestion makes sense, we can probably just delete the check
>as long as we keep the comment.

that comment didn't get out attention - the BUG_ON() did because the
latter helped to spot a bug in -RT. Also if the comment says that the
preemption is expected to be disabled then I still miss the important
piece of information: WHY. You explained it in an earlier email that
this has something to do with the per CPU variables which are modified.
This piece of information is important. In future updates of the code I
would appreciate BUG_ON() statements like this to catch things I didn't
see originally.

>That being said, I think it's a little weird that this doesn't work:
>
>spin_lock_irq()
>BUG_ON(!irqs_disabled())
>spin_unlock_irq()

This depends on the point of view. You expect interrupts to be disabled
while taking a lock. This is not how the function is defined.
The function ensures that the lock can be taken from process context while
it may also be taken by another caller from interrupt context. The fact
that it disables interrupts on vanilla to achieve its goal is an
implementation detail. Same goes for spin_lock_bh() btw. Based on this
semantic it works on vanilla and -RT. It does not disable interrupts on
-RT because there is no need for it: the interrupt handler runs in thread
context. The function delivers what it is expected to deliver from API
point of view: "take the lock from process context which can also be
taken in interrupt context".

>I'd expect that if you change the meaning of spin_lock_irq() from
>"mask hardware interrupts" to "disable preemption by tophalf", you
>would update the irqs_disabled() macro to match.  Most people using
>this check probably don't care about the hardware state, only that
>they don't get preempted by an interfering interrupt handler, no?

Most people that use irqs_disabled() or preempt_disabled() implement
some kind locking which is not documented. It is either related to CPU
features (which are per-CPU) or protect per-CPU variables (sometimes
even global ones). It often ends with something that they rely on how
the vanilla API works.
For instance: preempt_disable() is used to for locking in all callers
but one and this is because that one caller takes a spin_lock() (a
totally unrelated lock) but since spin_lock() also performs
preempt_disable() the author optimizes the "needed" preempt_disable()
invocation away.

Either way, those functions do perform some kind of locking, don't
document what they are actually protecting and you can't test if the
locks are held properly all the time since lockdep can't see them. This
is bad. It is awful. This should not be done.

So. Back to the actual problem. In short: The per-CPU variables were
protected by local_irq_disable() in all places but one because it
inherited the needed local_irq_disable() from spin_lock_irq().
I did not make the preempt_disable() + spin_lock() case up, it happened
unfortunately more than once - ask tglx.

The function in question was introduced in v4.0-rc1. The other functions
that were calling mem_cgroup_charge_statistics() + memcg_check_events()
were converted to use local_lock_irq() for instance in
mem_cgroup_move_account(). That happened in v3.18 I think. I didn't spot
this new users while doing v4.0 -RT. So the next -RT release will get:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5822,6 +5822,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 {
 	struct mem_cgroup *memcg;
 	unsigned short oldid;
+	unsigned long flags;

 	VM_BUG_ON_PAGE(PageLRU(page), page);
 	VM_BUG_ON_PAGE(page_count(page), page);
@@ -5844,11 +5845,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, 1);

-	/* XXX: caller holds IRQ-safe mapping->tree_lock */
-	VM_BUG_ON(!irqs_disabled());
-
+	local_lock_irqsave(event_lock, flags);
 	mem_cgroup_charge_statistics(memcg, page, -1);
 	memcg_check_events(memcg, page);
+	local_unlock_irqrestore(event_lock, flags);
 }

 /**

The only downside for the non-RT version is that local_lock_irqsave()
expands to local_irq_save() (on non-RT) which disables IRQs which are
already disabled - a minor issue if at all.

Johannes, would you mind using local_lock_irqsave() if it would be
available in vanilla? As you see it documents what is locked :)

Sebastian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-06-11 11:40       ` Sebastian Andrzej Siewior
@ 2015-06-19 18:00         ` Johannes Weiner
  2015-07-08 15:44           ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2015-06-19 18:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Clark Williams, Andrew Morton, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Steven Rostedt

On Thu, Jun 11, 2015 at 01:40:42PM +0200, Sebastian Andrzej Siewior wrote:
> * Johannes Weiner | 2015-06-01 15:00:47 [-0400]:
> 
> >Andrew's suggestion makes sense, we can probably just delete the check
> >as long as we keep the comment.
> 
> that comment didn't get out attention - the BUG_ON() did because the
> latter helped to spot a bug in -RT. Also if the comment says that the
> preemption is expected to be disabled then I still miss the important
> piece of information: WHY. You explained it in an earlier email that
> this has something to do with the per CPU variables which are modified.
> This piece of information is important. In future updates of the code I
> would appreciate BUG_ON() statements like this to catch things I didn't
> see originally.
> 
> >That being said, I think it's a little weird that this doesn't work:
> >
> >spin_lock_irq()
> >BUG_ON(!irqs_disabled())
> >spin_unlock_irq()
> 
> This depends on the point of view. You expect interrupts to be disabled
> while taking a lock. This is not how the function is defined.
> The function ensures that the lock can be taken from process context while
> it may also be taken by another caller from interrupt context. The fact
> that it disables interrupts on vanilla to achieve its goal is an
> implementation detail. Same goes for spin_lock_bh() btw. Based on this
> semantic it works on vanilla and -RT. It does not disable interrupts on
> -RT because there is no need for it: the interrupt handler runs in thread
> context. The function delivers what it is expected to deliver from API
> point of view: "take the lock from process context which can also be
> taken in interrupt context".

Uhm, that's really distorting reality to fit your requirements.  This
helper has been defined to mean local_irq_disable() + spin_lock() for
ages, it's been documented in books on Linux programming.  And people
expect it to prevent interrupt handlers from executing, which it does.

But more importantly, people expect irqs_disabled() to mean that as
well.  Check the callsites.  Except for maybe in the IRQ code itself,
every single caller using irqs_disabled() cares exclusively about the
serialization against interrupt handlers.

> >I'd expect that if you change the meaning of spin_lock_irq() from
> >"mask hardware interrupts" to "disable preemption by tophalf", you
> >would update the irqs_disabled() macro to match.  Most people using
> >this check probably don't care about the hardware state, only that
> >they don't get preempted by an interfering interrupt handler, no?
> 
> Most people that use irqs_disabled() or preempt_disabled() implement
> some kind locking which is not documented. It is either related to CPU
> features (which are per-CPU) or protect per-CPU variables (sometimes
> even global ones). It often ends with something that they rely on how
> the vanilla API works.
> For instance: preempt_disable() is used to for locking in all callers
> but one and this is because that one caller takes a spin_lock() (a
> totally unrelated lock) but since spin_lock() also performs
> preempt_disable() the author optimizes the "needed" preempt_disable()
> invocation away.

That's different, and spin_lock() doesn't imply preemption-disabling
on -rt.  But spin_lock_irq() prevents interrupt handlers from running,
even on -rt.  And people expect irqs_disabled() to test this fact.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5822,6 +5822,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>  {
>  	struct mem_cgroup *memcg;
>  	unsigned short oldid;
> +	unsigned long flags;
>  
>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>  	VM_BUG_ON_PAGE(page_count(page), page);
> @@ -5844,11 +5845,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>  	if (!mem_cgroup_is_root(memcg))
>  		page_counter_uncharge(&memcg->memory, 1);
>  
> -	/* XXX: caller holds IRQ-safe mapping->tree_lock */
> -	VM_BUG_ON(!irqs_disabled());
> -
> +	local_lock_irqsave(event_lock, flags);
>  	mem_cgroup_charge_statistics(memcg, page, -1);
>  	memcg_check_events(memcg, page);
> +	local_unlock_irqrestore(event_lock, flags);
>  }
>  
>  /**
> 
> The only downside for the non-RT version is that local_lock_irqsave()
> expands to local_irq_save() (on non-RT) which disables IRQs which are
> already disabled - a minor issue if at all.
> 
> Johannes, would you mind using local_lock_irqsave() if it would be
> available in vanilla? As you see it documents what is locked :)

WTF is event_lock?  This is even more obscure than anything else we
had before.  Seriously, just fix irqs_disabled() to mean "interrupt
handlers can't run", which is the expectation in pretty much all
callsites that currently use it, except for maybe irq code itself.

Use raw_irqs_disabled() or something for the three callers that care
about the actual hardware state.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-06-19 18:00         ` Johannes Weiner
@ 2015-07-08 15:44           ` Sebastian Andrzej Siewior
  2015-07-09 15:07             ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2015-07-08 15:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Clark Williams, Andrew Morton, Thomas Gleixner, linux-mm, RT,
	Fernando Lopez-Lezcano, Steven Rostedt

* Johannes Weiner | 2015-06-19 14:00:02 [-0400]:

>> This depends on the point of view. You expect interrupts to be disabled
>> while taking a lock. This is not how the function is defined.
>> The function ensures that the lock can be taken from process context while
>> it may also be taken by another caller from interrupt context. The fact
>> that it disables interrupts on vanilla to achieve its goal is an
>> implementation detail. Same goes for spin_lock_bh() btw. Based on this
>> semantic it works on vanilla and -RT. It does not disable interrupts on
>> -RT because there is no need for it: the interrupt handler runs in thread
>> context. The function delivers what it is expected to deliver from API
>> point of view: "take the lock from process context which can also be
>> taken in interrupt context".
>
>Uhm, that's really distorting reality to fit your requirements.  This
>helper has been defined to mean local_irq_disable() + spin_lock() for
>ages, it's been documented in books on Linux programming.  And people
>expect it to prevent interrupt handlers from executing, which it does.

After all it documents the current implementation and the semantic
requirement.

>But more importantly, people expect irqs_disabled() to mean that as
>well.  Check the callsites.  Except for maybe in the IRQ code itself,
>every single caller using irqs_disabled() cares exclusively about the
>serialization against interrupt handlers.

I am aware of some people expect and we fix them in -RT. Using
irqs_disabled() as a form locking for per-CPU variables is another way
of creating a BKL.

>> Most people that use irqs_disabled() or preempt_disabled() implement
>> some kind locking which is not documented. It is either related to CPU
>> features (which are per-CPU) or protect per-CPU variables (sometimes
>> even global ones). It often ends with something that they rely on how
>> the vanilla API works.
>> For instance: preempt_disable() is used to for locking in all callers
>> but one and this is because that one caller takes a spin_lock() (a
>> totally unrelated lock) but since spin_lock() also performs
>> preempt_disable() the author optimizes the "needed" preempt_disable()
>> invocation away.
>
>That's different, and spin_lock() doesn't imply preemption-disabling
>on -rt.  But spin_lock_irq() prevents interrupt handlers from running,
>even on -rt.  And people expect irqs_disabled() to test this fact.

What is the point of
    spin_lock_irq(&a);
    BUG_ON(!irqs_disabled())

It should not trigger. It triggers on -RT. But there is no BUG as long
as it safe to use it from IRQ handler and process context and this is
what it has been built for in the first place.

Lets have another example: Lets assume I come up with a way to lazy
disable interrupts. That means spin_lock_irq() does not disable
interrupts it simply simple sets a per-CPU bit that the interrupts
should remain off. Should an interrupt arrive (because the CPU flag
still enabled them) it will see the bit set and postpones the interrupt
and return.
Now: we don't deadlock with spin_lock_irq() against an interrupt even if
we don't disable interrupts. There is a precaution for this. And my
question: What should irqs_disabled() return within the spin_lock_irq()
section? The real state or what people documented in books ages ago?

>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5822,6 +5822,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>>  {
>>  	struct mem_cgroup *memcg;
>>  	unsigned short oldid;
>> +	unsigned long flags;
>>  
>>  	VM_BUG_ON_PAGE(PageLRU(page), page);
>>  	VM_BUG_ON_PAGE(page_count(page), page);
>> @@ -5844,11 +5845,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>>  	if (!mem_cgroup_is_root(memcg))
>>  		page_counter_uncharge(&memcg->memory, 1);
>>  
>> -	/* XXX: caller holds IRQ-safe mapping->tree_lock */
>> -	VM_BUG_ON(!irqs_disabled());
>> -
>> +	local_lock_irqsave(event_lock, flags);
>>  	mem_cgroup_charge_statistics(memcg, page, -1);
>>  	memcg_check_events(memcg, page);
>> +	local_unlock_irqrestore(event_lock, flags);
>>  }
>>  
>>  /**
>> 
>> The only downside for the non-RT version is that local_lock_irqsave()
>> expands to local_irq_save() (on non-RT) which disables IRQs which are
>> already disabled - a minor issue if at all.
>> 
>> Johannes, would you mind using local_lock_irqsave() if it would be
>> available in vanilla? As you see it documents what is locked :)
>
>WTF is event_lock?  This is even more obscure than anything else we
>had before.  Seriously, just fix irqs_disabled() to mean "interrupt
>handlers can't run", which is the expectation in pretty much all
>callsites that currently use it, except for maybe irq code itself.

How is it more obscure than anything else? After all it disables
interrupts and documents based on the lock named `event_lock' what is
protected. Which means each section that share the same (per-CPU)
ressources belong together and may not be accessed independently.

>Use raw_irqs_disabled() or something for the three callers that care
>about the actual hardware state.

Sebastian

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-07-08 15:44           ` Sebastian Andrzej Siewior
@ 2015-07-09 15:07             ` Thomas Gleixner
  2015-07-09 16:00               ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2015-07-09 15:07 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Johannes Weiner, Clark Williams, Andrew Morton, Thomas Gleixner,
	linux-mm, RT, Fernando Lopez-Lezcano, Steven Rostedt, LKML,
	Peter Zijlstra, Paul E. McKenney

On Wed, 8 Jul 2015, Sebastian Andrzej Siewior wrote:
> * Johannes Weiner | 2015-06-19 14:00:02 [-0400]:
> 
> >> This depends on the point of view. You expect interrupts to be disabled
> >> while taking a lock. This is not how the function is defined.
> >> The function ensures that the lock can be taken from process context while
> >> it may also be taken by another caller from interrupt context. The fact
> >> that it disables interrupts on vanilla to achieve its goal is an
> >> implementation detail. Same goes for spin_lock_bh() btw. Based on this
> >> semantic it works on vanilla and -RT. It does not disable interrupts on
> >> -RT because there is no need for it: the interrupt handler runs in thread
> >> context. The function delivers what it is expected to deliver from API
> >> point of view: "take the lock from process context which can also be
> >> taken in interrupt context".
> >
> >Uhm, that's really distorting reality to fit your requirements.  This
> >helper has been defined to mean local_irq_disable() + spin_lock() for
> >ages, it's been documented in books on Linux programming.  And people
> >expect it to prevent interrupt handlers from executing, which it does.
> 
> After all it documents the current implementation and the semantic
> requirement.

Actually its worse. Most books describe the implementation and pretend
that the implementation defines the semantics, which is the
fundamentally wrong approach.

The sad news is, that a lot of kernel developers tend to believe that
as well.

The result is, that local_irq_disable / preempt_disable have become
per CPU BKLs. And they have the same problem as the BKL:

    The protection scope of these constructs is global and completely
    non-obvious.

So its really hard to figure out what is protected against what. Like
the old BKL its an all or nothing approach. And we all know, or should
know, how well that worked.

This all or nothing protection is a real show stopper for RT, so we
try to identify what needs protection against what and then we
annotate those sections with proper scope markers, which turn into RT
friendly constructs at compile time.

The name of the marker in question (event_lock) might not be the best
choice, but that does not invalidate the general usefulness of fine
granular protection scope markers. We certainly need to revisit the
names which we slapped on the particular bits and pieces, and discuss
with the subsystem experts the correctness of the scope markers, but
that's a completely different story.

> > Seriously, just fix irqs_disabled() to mean "interrupt
> > handlers can't run", which is the expectation in pretty much all
> > callsites that currently use it, except for maybe irq code itself.

And that solves the RT problem in which way? NOT AT ALL. It just
preserves the BKL nature of irq_disable. Great solution, NOT.

Why?

Because it just preserves the status quo of mainline and exposes
everything to the same latency behaviour which mainline has. So we add
lots of mechanisms to avoid that behaviour just to bring it back by
switching the irq disabled BKL on again, which means we are back to
square one.

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-07-09 15:07             ` Thomas Gleixner
@ 2015-07-09 16:00               ` Johannes Weiner
  2015-07-09 16:43                 ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2015-07-09 16:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Sebastian Andrzej Siewior, Clark Williams, Andrew Morton,
	Thomas Gleixner, linux-mm, RT, Fernando Lopez-Lezcano,
	Steven Rostedt, LKML, Peter Zijlstra, Paul E. McKenney

On Thu, Jul 09, 2015 at 05:07:42PM +0200, Thomas Gleixner wrote:
> This all or nothing protection is a real show stopper for RT, so we
> try to identify what needs protection against what and then we
> annotate those sections with proper scope markers, which turn into RT
> friendly constructs at compile time.
> 
> The name of the marker in question (event_lock) might not be the best
> choice, but that does not invalidate the general usefulness of fine
> granular protection scope markers. We certainly need to revisit the
> names which we slapped on the particular bits and pieces, and discuss
> with the subsystem experts the correctness of the scope markers, but
> that's a completely different story.

Actually, I think there was a misunderstanding.  Sebastian's patch did
not include any definition of event_lock, so it looked like this is a
global lock defined by -rt that is simply explicit about being global,
rather than a lock that specifically protects memcg event statistics.

Yeah that doesn't make a lot of sense, thinking more about it.  Sorry.

So localizing these locks for -rt is reasonable, I can see that.  That
being said, does it make sense to have such locking in mainline code?
Is there a concrete plan for process-context interrupt handlers in
mainline?  Because it'd be annoying to maintain fine-grained locking
schemes with explicit lock names in a source tree where it never
amounts to anything more than anonymous cli/sti or preempt toggling.

Maybe I still don't understand what you were proposing for mainline
and what you were proposing as the -rt solution.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL
  2015-07-09 16:00               ` Johannes Weiner
@ 2015-07-09 16:43                 ` Thomas Gleixner
  0 siblings, 0 replies; 13+ messages in thread
From: Thomas Gleixner @ 2015-07-09 16:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Sebastian Andrzej Siewior, Clark Williams, Andrew Morton,
	Thomas Gleixner, linux-mm, RT, Fernando Lopez-Lezcano,
	Steven Rostedt, LKML, Peter Zijlstra, Paul E. McKenney

On Thu, 9 Jul 2015, Johannes Weiner wrote:
> On Thu, Jul 09, 2015 at 05:07:42PM +0200, Thomas Gleixner wrote:
> > This all or nothing protection is a real show stopper for RT, so we
> > try to identify what needs protection against what and then we
> > annotate those sections with proper scope markers, which turn into RT
> > friendly constructs at compile time.
> > 
> > The name of the marker in question (event_lock) might not be the best
> > choice, but that does not invalidate the general usefulness of fine
> > granular protection scope markers. We certainly need to revisit the
> > names which we slapped on the particular bits and pieces, and discuss
> > with the subsystem experts the correctness of the scope markers, but
> > that's a completely different story.
> 
> Actually, I think there was a misunderstanding.  Sebastian's patch did
> not include any definition of event_lock, so it looked like this is a
> global lock defined by -rt that is simply explicit about being global,
> rather than a lock that specifically protects memcg event statistics.
> 
> Yeah that doesn't make a lot of sense, thinking more about it.  Sorry.
> 
> So localizing these locks for -rt is reasonable, I can see that.  That
> being said, does it make sense to have such locking in mainline code?
> Is there a concrete plan for process-context interrupt handlers in
> mainline? 

They exist today. Though they are opt-in while on rt we enforce them.

> Because it'd be annoying to maintain fine-grained locking
> schemes with explicit lock names in a source tree where it never
> amounts to anything more than anonymous cli/sti or preempt toggling.
> 
> Maybe I still don't understand what you were proposing for mainline
> and what you were proposing as the -rt solution.

For the time being it's RT only, but as we are trying to come up with
a way to merge RT into mainline, we start to figure out how to break
that per cpu BKL style protection into understandable bits and
pieces. We are still debating how that final annotation mechanism will
look like, but something like the local lock mechanism might come out
of it. That said, even w/o RT it makes a lot of sense to document
explicitely in the code WHICH resource needs to be protected against
WHAT.

In that very case, you do not care about interrupt handlers per se,
you only care about interrupt handlers which might recurse into that
code, right?

So the protection scope annotation should be able to express that
independent of the underlying implementation details.

	protect_irq_concurrency(protected_resource)
	fiddle_with_resource()
	unprotect_irq_concurrency(protected_resource)

Gives a very clear picture, about what you care and what needs to be
protected. The ideal outcome of such annotations would be, that tools
(runtime or static analysis) are able to find violations of
this. i.e. if some other place just fiddles with resource w/o having
the protection scope annotation in place, then tools can yell at you,
like we do today with lockdep and other mechanisms.

Thanks,

	tglx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-07-09 16:43 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-29 15:48 [RFC] mm: change irqs_disabled() test to spin_is_locked() in mem_cgroup_swapout Clark Williams
2015-05-29 19:11 ` Johannes Weiner
2015-05-29 19:21   ` Steven Rostedt
2015-05-29 21:26 ` Andrew Morton
2015-06-01 18:14   ` [RFC][PATCH] mm: ifdef out VM_BUG_ON check on PREEMPT_RT_FULL Clark Williams
2015-06-01 19:00     ` Johannes Weiner
2015-06-01 19:28       ` Steven Rostedt
2015-06-11 11:40       ` Sebastian Andrzej Siewior
2015-06-19 18:00         ` Johannes Weiner
2015-07-08 15:44           ` Sebastian Andrzej Siewior
2015-07-09 15:07             ` Thomas Gleixner
2015-07-09 16:00               ` Johannes Weiner
2015-07-09 16:43                 ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).