* [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
@ 2013-06-25 22:34 Paul Gortmaker
2013-06-26 8:10 ` Paolo Bonzini
2013-06-27 11:09 ` [PATCH-next] " Gleb Natapov
0 siblings, 2 replies; 14+ messages in thread
From: Paul Gortmaker @ 2013-06-25 22:34 UTC (permalink / raw)
To: Gleb Natapov, Paolo Bonzini; +Cc: linux-rt-users, kvm, Paul Gortmaker
In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held. This leads to the following:
BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
[<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
[<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
[<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
[<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
[<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
[<ffffffff8111824a>] balance_pgdat+0x54a/0x730
[<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
[<ffffffff811185bf>] kswapd+0x18f/0x490
[<ffffffff81070961>] ? get_parent_ip+0x11/0x50
[<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
[<ffffffff81118430>] ? balance_pgdat+0x730/0x730
[<ffffffff81060d2b>] kthread+0xdb/0xe0
[<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
[<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
[<ffffffff81060c50>] ? __init_kthread_worker+0x
Since we only use the lock for protecting the vm_list, once we've
found the instance we want, we can shuffle it to the end of the
list and then drop the kvm_lock before taking the mmu_lock. We
can do this because after the mmu operations are completed, we
break -- i.e. we don't continue list processing, so it doesn't
matter if the list changed around us.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
[Note1: do double check that this solution makes sense for the
mainline kernel; consider this an RFC patch that does want a
review from people in the know.]
[Note2: you'll need to be running a preempt-rt kernel to actually
see this. Also note that this patch is against linux-next.
Alternate solutions welcome; this seemed to me the obvious fix.]
arch/x86/kvm/mmu.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 748e0d8..db93a70 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4322,6 +4322,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
struct kvm *kvm;
int nr_to_scan = sc->nr_to_scan;
+ int found = 0;
unsigned long freed = 0;
raw_spin_lock(&kvm_lock);
@@ -4349,6 +4350,12 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
continue;
idx = srcu_read_lock(&kvm->srcu);
+
+ list_move_tail(&kvm->vm_list, &vm_list);
+ found = 1;
+ /* We can't be holding a raw lock and take non-raw mmu_lock */
+ raw_spin_unlock(&kvm_lock);
+
spin_lock(&kvm->mmu_lock);
if (kvm_has_zapped_obsolete_pages(kvm)) {
@@ -4370,11 +4377,12 @@ unlock:
* per-vm shrinkers cry out
* sadness comes quickly
*/
- list_move_tail(&kvm->vm_list, &vm_list);
break;
}
- raw_spin_unlock(&kvm_lock);
+ if (!found)
+ raw_spin_unlock(&kvm_lock);
+
return freed;
}
--
1.8.1.2
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-25 22:34 [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock Paul Gortmaker
@ 2013-06-26 8:10 ` Paolo Bonzini
2013-06-26 18:11 ` [PATCH-next v2] " Paul Gortmaker
2013-06-27 11:09 ` [PATCH-next] " Gleb Natapov
1 sibling, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-26 8:10 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: Gleb Natapov, linux-rt-users, kvm
On 26/06/2013 00:34, Paul Gortmaker wrote:
> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
> the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
> function tries to grab the (non-raw) mmu_lock within the scope of
> the raw locked kvm_lock being held. This leads to the following:
>
> BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
> in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
> Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
>
> Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
> Call Trace:
> [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
> [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
> [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
> [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
> [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
> [<ffffffff8111824a>] balance_pgdat+0x54a/0x730
> [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
> [<ffffffff811185bf>] kswapd+0x18f/0x490
> [<ffffffff81070961>] ? get_parent_ip+0x11/0x50
> [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
> [<ffffffff81118430>] ? balance_pgdat+0x730/0x730
> [<ffffffff81060d2b>] kthread+0xdb/0xe0
> [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
> [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
> [<ffffffff81060c50>] ? __init_kthread_worker+0x
>
> Since we only use the lock for protecting the vm_list, once we've
> found the instance we want, we can shuffle it to the end of the
> list and then drop the kvm_lock before taking the mmu_lock. We
> can do this because after the mmu operations are completed, we
> break -- i.e. we don't continue list processing, so it doesn't
> matter if the list changed around us.
>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Since the shrinker code is asynchronous with respect to KVM, I think
that the kvm_lock here is also protecting against kvm_destroy_vm running
at the same time.
So the patch is almost okay; all that is missing is a
kvm_get_kvm/kvm_put_kvm pair, where the reference is added just before
releasing the kvm_lock.
Paolo
* [PATCH-next v2] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-26 8:10 ` Paolo Bonzini
@ 2013-06-26 18:11 ` Paul Gortmaker
2013-06-26 21:59 ` Paolo Bonzini
0 siblings, 1 reply; 14+ messages in thread
From: Paul Gortmaker @ 2013-06-26 18:11 UTC (permalink / raw)
To: kvm; +Cc: linux-rt-users, Paul Gortmaker, Paolo Bonzini, Gleb Natapov
In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held. This leads to the following:
BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
[<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
[<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
[<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
[<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
[<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
[<ffffffff8111824a>] balance_pgdat+0x54a/0x730
[<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
[<ffffffff811185bf>] kswapd+0x18f/0x490
[<ffffffff81070961>] ? get_parent_ip+0x11/0x50
[<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
[<ffffffff81118430>] ? balance_pgdat+0x730/0x730
[<ffffffff81060d2b>] kthread+0xdb/0xe0
[<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
[<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
[<ffffffff81060c50>] ? __init_kthread_worker+0x
Note that the above was seen on an earlier 3.4 preempt-rt kernel, where
the lock distinction (raw vs. non-raw) actually matters.
Since we only use the lock for protecting the vm_list, once we've found
the instance we want, we can shuffle it to the end of the list and then
drop the kvm_lock before taking the mmu_lock. We can do this because
after the mmu operations are completed, we break -- i.e. we don't continue
list processing, so it doesn't matter if the list changed around us.
Since the shrinker code runs asynchronously with respect to KVM, we do
need to still protect against the users_count going to zero and then
kvm_destroy_vm() being called, so we use kvm_get_kvm/kvm_put_kvm, as
suggested by Paolo.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
[v2: add the kvm_get_kvm, update comments and log appropriately]
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 748e0d8..662b679 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4322,6 +4322,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
struct kvm *kvm;
int nr_to_scan = sc->nr_to_scan;
+ int found = 0;
unsigned long freed = 0;
raw_spin_lock(&kvm_lock);
@@ -4349,6 +4350,18 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
continue;
idx = srcu_read_lock(&kvm->srcu);
+
+ list_move_tail(&kvm->vm_list, &vm_list);
+ found = 1;
+ /*
+ * We are done with the list, so drop kvm_lock, as we can't be
+ * holding a raw lock and take the non-raw mmu_lock. But we
+ * don't want to be unprotected from kvm_destroy_vm either,
+ * so we bump users_count.
+ */
+ kvm_get_kvm(kvm);
+ raw_spin_unlock(&kvm_lock);
+
spin_lock(&kvm->mmu_lock);
if (kvm_has_zapped_obsolete_pages(kvm)) {
@@ -4363,6 +4376,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
unlock:
spin_unlock(&kvm->mmu_lock);
+ kvm_put_kvm(kvm);
srcu_read_unlock(&kvm->srcu, idx);
/*
@@ -4370,11 +4384,12 @@ unlock:
* per-vm shrinkers cry out
* sadness comes quickly
*/
- list_move_tail(&kvm->vm_list, &vm_list);
break;
}
- raw_spin_unlock(&kvm_lock);
+ if (!found)
+ raw_spin_unlock(&kvm_lock);
+
return freed;
}
--
1.8.1.2
* Re: [PATCH-next v2] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-26 18:11 ` [PATCH-next v2] " Paul Gortmaker
@ 2013-06-26 21:59 ` Paolo Bonzini
2013-06-27 2:56 ` Paul Gortmaker
0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-26 21:59 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: kvm, linux-rt-users, Gleb Natapov
On 26/06/2013 20:11, Paul Gortmaker wrote:
> spin_unlock(&kvm->mmu_lock);
> + kvm_put_kvm(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
>
kvm_put_kvm needs to go last. I can fix when applying, but I'll wait
for Gleb to take a look too.
Paolo
* Re: [PATCH-next v2] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-26 21:59 ` Paolo Bonzini
@ 2013-06-27 2:56 ` Paul Gortmaker
2013-06-27 10:22 ` Paolo Bonzini
0 siblings, 1 reply; 14+ messages in thread
From: Paul Gortmaker @ 2013-06-27 2:56 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, linux-rt-users, Gleb Natapov
[Re: [PATCH-next v2] kvm: don't try to take mmu_lock while holding the main raw kvm_lock] On 26/06/2013 (Wed 23:59) Paolo Bonzini wrote:
> On 26/06/2013 20:11, Paul Gortmaker wrote:
> > spin_unlock(&kvm->mmu_lock);
> > + kvm_put_kvm(kvm);
> > srcu_read_unlock(&kvm->srcu, idx);
> >
>
> kvm_put_kvm needs to go last. I can fix when applying, but I'll wait
> for Gleb to take a look too.
I'm curious why you would say that -- the way I sent it has the lock
teardown symmetrical and opposite to the build-up, e.g.
idx = srcu_read_lock(&kvm->srcu);
[...]
+ kvm_get_kvm(kvm);
[...]
spin_lock(&kvm->mmu_lock);
[...]
unlock:
spin_unlock(&kvm->mmu_lock);
+ kvm_put_kvm(kvm);
srcu_read_unlock(&kvm->srcu, idx);
You'd originally said to put the kvm_get_kvm where it currently is;
perhaps instead we want the get/put to encompass the whole
srcu_read locked section?
P.
--
>
> Paolo
* Re: [PATCH-next v2] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 2:56 ` Paul Gortmaker
@ 2013-06-27 10:22 ` Paolo Bonzini
0 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-27 10:22 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: kvm, linux-rt-users, Gleb Natapov
On 27/06/2013 04:56, Paul Gortmaker wrote:
>> On 26/06/2013 20:11, Paul Gortmaker wrote:
>>> > > spin_unlock(&kvm->mmu_lock);
>>> > > + kvm_put_kvm(kvm);
>>> > > srcu_read_unlock(&kvm->srcu, idx);
>>> > >
>> >
>> > kvm_put_kvm needs to go last. I can fix when applying, but I'll wait
>> > for Gleb to take a look too.
> I'm curious why you would say that -- since the way I sent it has the
> lock tear down be symmetrical and opposite to the build up - e.g.
>
> idx = srcu_read_lock(&kvm->srcu);
>
> [...]
>
> + kvm_get_kvm(kvm);
>
> [...]
> spin_lock(&kvm->mmu_lock);
>
> [...]
>
> unlock:
> spin_unlock(&kvm->mmu_lock);
> + kvm_put_kvm(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
>
> You'd originally said to put the kvm_get_kvm where it currently is;
> perhaps instead we want the get/put to encompass the whole
> srcu_read locked section?
The put really needs to be the last thing you do, as the data structure
can already have been destroyed by the time kvm_put_kvm returns, so
nothing may touch kvm after it. Where you put kvm_get_kvm doesn't
really matter, since you're protected by the kvm_lock. So moving the
kvm_get_kvm earlier would also work; I didn't really mean that
kvm_get_kvm has to be literally just before the raw_spin_unlock.
However, I actually like having the get_kvm right there, because it
makes it explicit that you are using reference counting as a substitute
for holding the lock. I find it quite idiomatic, and in some sense the
lock/unlock is still symmetric: the kvm_put_kvm goes exactly where you'd
have unlocked the kvm_lock.
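In other words, something along these lines (just a rough sketch of the
intended ordering, not the literal patch; the shrink work in the middle
is elided):

	idx = srcu_read_lock(&kvm->srcu);
	...
	kvm_get_kvm(kvm);		/* pin the VM before dropping kvm_lock */
	raw_spin_unlock(&kvm_lock);

	spin_lock(&kvm->mmu_lock);
	/* ... zap/shrink work ... */
	spin_unlock(&kvm->mmu_lock);

	srcu_read_unlock(&kvm->srcu, idx);
	kvm_put_kvm(kvm);		/* may free kvm -- last touch of the structure */

i.e. the put sits exactly where the kvm_lock would otherwise have been
dropped, and nothing dereferences kvm after it.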
Paolo
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-25 22:34 [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock Paul Gortmaker
2013-06-26 8:10 ` Paolo Bonzini
@ 2013-06-27 11:09 ` Gleb Natapov
2013-06-27 11:38 ` Paolo Bonzini
1 sibling, 1 reply; 14+ messages in thread
From: Gleb Natapov @ 2013-06-27 11:09 UTC (permalink / raw)
To: Paul Gortmaker; +Cc: Paolo Bonzini, linux-rt-users, kvm, Jan Kiszka
On Tue, Jun 25, 2013 at 06:34:03PM -0400, Paul Gortmaker wrote:
> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
I am copying Jan, the author of the patch. Commit message says:
"Code under this lock requires non-preemptibility", but which code
exactly is this? Is this still true?
> the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
> function tries to grab the (non-raw) mmu_lock within the scope of
> the raw locked kvm_lock being held. This leads to the following:
>
> BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
> in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
> Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
>
> Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
> Call Trace:
> [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
> [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
> [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
> [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
> [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
> [<ffffffff8111824a>] balance_pgdat+0x54a/0x730
> [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
> [<ffffffff811185bf>] kswapd+0x18f/0x490
> [<ffffffff81070961>] ? get_parent_ip+0x11/0x50
> [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
> [<ffffffff81118430>] ? balance_pgdat+0x730/0x730
> [<ffffffff81060d2b>] kthread+0xdb/0xe0
> [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
> [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
> [<ffffffff81060c50>] ? __init_kthread_worker+0x
>
> Since we only use the lock for protecting the vm_list, once we've
> found the instance we want, we can shuffle it to the end of the
> list and then drop the kvm_lock before taking the mmu_lock. We
> can do this because after the mmu operations are completed, we
> break -- i.e. we don't continue list processing, so it doesn't
> matter if the list changed around us.
>
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
>
> [Note1: do double check that this solution makes sense for the
> mainline kernel; consider this an RFC patch that does want a
> review from people in the know.]
>
> [Note2: you'll need to be running a preempt-rt kernel to actually
> see this. Also note that the above patch is against linux-next.
> Alternate solutions welcome ; this seemed to me the obvious fix.]
>
> arch/x86/kvm/mmu.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 748e0d8..db93a70 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4322,6 +4322,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> {
> struct kvm *kvm;
> int nr_to_scan = sc->nr_to_scan;
> + int found = 0;
> unsigned long freed = 0;
>
> raw_spin_lock(&kvm_lock);
> @@ -4349,6 +4350,12 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> continue;
>
> idx = srcu_read_lock(&kvm->srcu);
> +
> + list_move_tail(&kvm->vm_list, &vm_list);
> + found = 1;
> + /* We can't be holding a raw lock and take non-raw mmu_lock */
> + raw_spin_unlock(&kvm_lock);
> +
> spin_lock(&kvm->mmu_lock);
>
> if (kvm_has_zapped_obsolete_pages(kvm)) {
> @@ -4370,11 +4377,12 @@ unlock:
> * per-vm shrinkers cry out
> * sadness comes quickly
> */
> - list_move_tail(&kvm->vm_list, &vm_list);
> break;
> }
>
> - raw_spin_unlock(&kvm_lock);
> + if (!found)
> + raw_spin_unlock(&kvm_lock);
> +
> return freed;
>
> }
> --
> 1.8.1.2
--
Gleb.
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 11:09 ` [PATCH-next] " Gleb Natapov
@ 2013-06-27 11:38 ` Paolo Bonzini
2013-06-27 11:43 ` Gleb Natapov
2013-06-27 12:16 ` Jan Kiszka
0 siblings, 2 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-27 11:38 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Paul Gortmaker, linux-rt-users, kvm, Jan Kiszka
On 27/06/2013 13:09, Gleb Natapov wrote:
> On Tue, Jun 25, 2013 at 06:34:03PM -0400, Paul Gortmaker wrote:
>> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
> I am copying Jan, the author of the patch. Commit message says:
> "Code under this lock requires non-preemptibility", but which code
> exactly is this? Is this still true?
hardware_enable_nolock/hardware_disable_nolock does.
Paolo
>> the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
>> function tries to grab the (non-raw) mmu_lock within the scope of
>> the raw locked kvm_lock being held. This leads to the following:
>>
>> BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
>> in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
>> Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
>>
>> Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
>> Call Trace:
>> [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
>> [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
>> [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
>> [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
>> [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
>> [<ffffffff8111824a>] balance_pgdat+0x54a/0x730
>> [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
>> [<ffffffff811185bf>] kswapd+0x18f/0x490
>> [<ffffffff81070961>] ? get_parent_ip+0x11/0x50
>> [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
>> [<ffffffff81118430>] ? balance_pgdat+0x730/0x730
>> [<ffffffff81060d2b>] kthread+0xdb/0xe0
>> [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
>> [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
>> [<ffffffff81060c50>] ? __init_kthread_worker+0x
>>
>> Since we only use the lock for protecting the vm_list, once we've
>> found the instance we want, we can shuffle it to the end of the
>> list and then drop the kvm_lock before taking the mmu_lock. We
>> can do this because after the mmu operations are completed, we
>> break -- i.e. we don't continue list processing, so it doesn't
>> matter if the list changed around us.
>>
>> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
>> ---
>>
>> [Note1: do double check that this solution makes sense for the
>> mainline kernel; consider this an RFC patch that does want a
>> review from people in the know.]
>>
>> [Note2: you'll need to be running a preempt-rt kernel to actually
>> see this. Also note that the above patch is against linux-next.
>> Alternate solutions welcome ; this seemed to me the obvious fix.]
>>
>> arch/x86/kvm/mmu.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 748e0d8..db93a70 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -4322,6 +4322,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>> {
>> struct kvm *kvm;
>> int nr_to_scan = sc->nr_to_scan;
>> + int found = 0;
>> unsigned long freed = 0;
>>
>> raw_spin_lock(&kvm_lock);
>> @@ -4349,6 +4350,12 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>> continue;
>>
>> idx = srcu_read_lock(&kvm->srcu);
>> +
>> + list_move_tail(&kvm->vm_list, &vm_list);
>> + found = 1;
>> + /* We can't be holding a raw lock and take non-raw mmu_lock */
>> + raw_spin_unlock(&kvm_lock);
>> +
>> spin_lock(&kvm->mmu_lock);
>>
>> if (kvm_has_zapped_obsolete_pages(kvm)) {
>> @@ -4370,11 +4377,12 @@ unlock:
>> * per-vm shrinkers cry out
>> * sadness comes quickly
>> */
>> - list_move_tail(&kvm->vm_list, &vm_list);
>> break;
>> }
>>
>> - raw_spin_unlock(&kvm_lock);
>> + if (!found)
>> + raw_spin_unlock(&kvm_lock);
>> +
>> return freed;
>>
>> }
>> --
>> 1.8.1.2
>
> --
> Gleb.
>
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 11:38 ` Paolo Bonzini
@ 2013-06-27 11:43 ` Gleb Natapov
2013-06-27 11:54 ` Paolo Bonzini
2013-06-27 12:16 ` Jan Kiszka
1 sibling, 1 reply; 14+ messages in thread
From: Gleb Natapov @ 2013-06-27 11:43 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Paul Gortmaker, linux-rt-users, kvm, Jan Kiszka
On Thu, Jun 27, 2013 at 01:38:29PM +0200, Paolo Bonzini wrote:
> On 27/06/2013 13:09, Gleb Natapov wrote:
> > On Tue, Jun 25, 2013 at 06:34:03PM -0400, Paul Gortmaker wrote:
> >> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
> > I am copying Jan, the author of the patch. Commit message says:
> > "Code under this lock requires non-preemptibility", but which code
> > exactly is this? Is this still true?
>
> hardware_enable_nolock/hardware_disable_nolock does.
>
I suspected this would be the answer and prepared another question :)
From a glance, kvm_lock is used to protect those just to avoid creating
a separate lock, so why not create a raw one to protect them and change
kvm_lock back to non-raw? Admittedly I haven't looked too closely into
this yet.
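Something like this, as a rough sketch (the lock name here is made up,
and I haven't checked all the users):

	/* raw lock just for the hardware enable/disable path */
	static DEFINE_RAW_SPINLOCK(hardware_enable_lock);

	static void hardware_enable(void *junk)
	{
		raw_spin_lock(&hardware_enable_lock);
		hardware_enable_nolock(junk);
		raw_spin_unlock(&hardware_enable_lock);
	}

	/* ...and kvm_lock itself goes back to a plain (non-raw) lock,
	 * used only for vm_list and friends. */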
> Paolo
>
> >> the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
> >> function tries to grab the (non-raw) mmu_lock within the scope of
> >> the raw locked kvm_lock being held. This leads to the following:
> >>
> >> BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
> >> in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
> >> Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]
> >>
> >> Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
> >> Call Trace:
> >> [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
> >> [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
> >> [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
> >> [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
> >> [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
> >> [<ffffffff8111824a>] balance_pgdat+0x54a/0x730
> >> [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
> >> [<ffffffff811185bf>] kswapd+0x18f/0x490
> >> [<ffffffff81070961>] ? get_parent_ip+0x11/0x50
> >> [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
> >> [<ffffffff81118430>] ? balance_pgdat+0x730/0x730
> >> [<ffffffff81060d2b>] kthread+0xdb/0xe0
> >> [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
> >> [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
> >> [<ffffffff81060c50>] ? __init_kthread_worker+0x
> >>
> >> Since we only use the lock for protecting the vm_list, once we've
> >> found the instance we want, we can shuffle it to the end of the
> >> list and then drop the kvm_lock before taking the mmu_lock. We
> >> can do this because after the mmu operations are completed, we
> >> break -- i.e. we don't continue list processing, so it doesn't
> >> matter if the list changed around us.
> >>
> >> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> >> ---
> >>
> >> [Note1: do double check that this solution makes sense for the
> >> mainline kernel; consider this an RFC patch that does want a
> >> review from people in the know.]
> >>
> >> [Note2: you'll need to be running a preempt-rt kernel to actually
> >> see this. Also note that the above patch is against linux-next.
> >> Alternate solutions welcome ; this seemed to me the obvious fix.]
> >>
> >> arch/x86/kvm/mmu.c | 12 ++++++++++--
> >> 1 file changed, 10 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 748e0d8..db93a70 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -4322,6 +4322,7 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >> {
> >> struct kvm *kvm;
> >> int nr_to_scan = sc->nr_to_scan;
> >> + int found = 0;
> >> unsigned long freed = 0;
> >>
> >> raw_spin_lock(&kvm_lock);
> >> @@ -4349,6 +4350,12 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >> continue;
> >>
> >> idx = srcu_read_lock(&kvm->srcu);
> >> +
> >> + list_move_tail(&kvm->vm_list, &vm_list);
> >> + found = 1;
> >> + /* We can't be holding a raw lock and take non-raw mmu_lock */
> >> + raw_spin_unlock(&kvm_lock);
> >> +
> >> spin_lock(&kvm->mmu_lock);
> >>
> >> if (kvm_has_zapped_obsolete_pages(kvm)) {
> >> @@ -4370,11 +4377,12 @@ unlock:
> >> * per-vm shrinkers cry out
> >> * sadness comes quickly
> >> */
> >> - list_move_tail(&kvm->vm_list, &vm_list);
> >> break;
> >> }
> >>
> >> - raw_spin_unlock(&kvm_lock);
> >> + if (!found)
> >> + raw_spin_unlock(&kvm_lock);
> >> +
> >> return freed;
> >>
> >> }
> >> --
> >> 1.8.1.2
> >
> > --
> > Gleb.
> >
--
Gleb.
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 11:43 ` Gleb Natapov
@ 2013-06-27 11:54 ` Paolo Bonzini
0 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-27 11:54 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Paul Gortmaker, linux-rt-users, kvm, Jan Kiszka
On 27/06/2013 13:43, Gleb Natapov wrote:
>>> > > I am copying Jan, the author of the patch. Commit message says:
>>> > > "Code under this lock requires non-preemptibility", but which code
>>> > > exactly is this? Is this still true?
>> >
>> > hardware_enable_nolock/hardware_disable_nolock does.
>> >
> I suspected this will be the answer and prepared another question :)
> From a glance kvm_lock is used to protect those just to avoid creating
> separate lock, so why not create raw one to protect them and change
> kvm_lock to non raw again. Admittedly I haven't looked too close into
> this yet.
I was wondering the same, but I think it's fine. There's just a handful
of uses outside virt/kvm/kvm_main.c.
Paolo
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 11:38 ` Paolo Bonzini
2013-06-27 11:43 ` Gleb Natapov
@ 2013-06-27 12:16 ` Jan Kiszka
2013-06-27 12:32 ` Gleb Natapov
1 sibling, 1 reply; 14+ messages in thread
From: Jan Kiszka @ 2013-06-27 12:16 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Gleb Natapov, Paul Gortmaker, linux-rt-users@vger.kernel.org,
kvm@vger.kernel.org
On 2013-06-27 13:38, Paolo Bonzini wrote:
> On 27/06/2013 13:09, Gleb Natapov wrote:
>> On Tue, Jun 25, 2013 at 06:34:03PM -0400, Paul Gortmaker wrote:
>>> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
>> I am copying Jan, the author of the patch. Commit message says:
>> "Code under this lock requires non-preemptibility", but which code
>> exactly is this? Is this still true?
>
> hardware_enable_nolock/hardware_disable_nolock does.
IIRC, the loop in kvmclock_cpufreq_notifier also needs it, because it
reads the processor ID of the caller. That implies the caller cannot be
preempted, but these days a migration lock should be fine as well.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 12:16 ` Jan Kiszka
@ 2013-06-27 12:32 ` Gleb Natapov
2013-06-27 13:00 ` Paolo Bonzini
0 siblings, 1 reply; 14+ messages in thread
From: Gleb Natapov @ 2013-06-27 12:32 UTC (permalink / raw)
To: Jan Kiszka
Cc: Paolo Bonzini, Paul Gortmaker, linux-rt-users@vger.kernel.org,
kvm@vger.kernel.org, mtosatti
On Thu, Jun 27, 2013 at 02:16:07PM +0200, Jan Kiszka wrote:
> On 2013-06-27 13:38, Paolo Bonzini wrote:
> > On 27/06/2013 13:09, Gleb Natapov wrote:
> >> On Tue, Jun 25, 2013 at 06:34:03PM -0400, Paul Gortmaker wrote:
> >>> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
> >> I am copying Jan, the author of the patch. Commit message says:
> >> "Code under this lock requires non-preemptibility", but which code
> >> exactly is this? Is this still true?
> >
> > hardware_enable_nolock/hardware_disable_nolock does.
>
> IIRC, also the loop in kvmclock_cpufreq_notifier needs it because it
> reads the processor ID of the caller. That implies the caller cannot be
> preempted, but these days a migration lock should be fine as well.
>
OK, adding Marcelo to the party. This code is called from a cpufreq
notifier. I would expect that it would be called from a context that
prevents migration to another CPU.
--
Gleb.
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 12:32 ` Gleb Natapov
@ 2013-06-27 13:00 ` Paolo Bonzini
2013-06-27 13:01 ` Paolo Bonzini
0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-27 13:00 UTC (permalink / raw)
To: Gleb Natapov
Cc: Jan Kiszka, Paul Gortmaker, linux-rt-users@vger.kernel.org,
kvm@vger.kernel.org, mtosatti
On 27/06/2013 14:32, Gleb Natapov wrote:
>>>>> > >>> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
>>>> > >> I am copying Jan, the author of the patch. Commit message says:
>>>> > >> "Code under this lock requires non-preemptibility", but which code
>>>> > >> exactly is this? Is this still true?
>>> > >
>>> > > hardware_enable_nolock/hardware_disable_nolock does.
>> >
>> > IIRC, also the loop in kvmclock_cpufreq_notifier needs it because it
>> > reads the processor ID of the caller. That implies the caller cannot be
>> > preempted, but these days a migration lock should be fine as well.
>> >
> OK, adding Marcelo to the party. This code is called from cpufreq
> notifier. I would expect that it will be called from the context that
> prevents migration to another cpu.
No, the CPU is in freq->cpu and may not even be the CPU that changed
frequency.
But even then I'm not sure the loop needs to be non-preemptible. If it
did, the smp_call_function_single just before/after the loop would have
to be non-preemptible as well. So it is just an optimization, and it can
use raw_smp_processor_id() instead.
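For instance, something along these lines (a simplified sketch of the
notifier loop from memory, variable names approximate, not the exact
code):

	raw_spin_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (vcpu->cpu != freq->cpu)
				continue;
			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
			/* per the above, this check is only an optimization,
			 * so a raw (preemptible) read of the CPU id is fine */
			if (vcpu->cpu != raw_smp_processor_id())
				send_ipi = 1;
		}
	}
	raw_spin_unlock(&kvm_lock);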
Paolo
* Re: [PATCH-next] kvm: don't try to take mmu_lock while holding the main raw kvm_lock
2013-06-27 13:00 ` Paolo Bonzini
@ 2013-06-27 13:01 ` Paolo Bonzini
0 siblings, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2013-06-27 13:01 UTC (permalink / raw)
Cc: Gleb Natapov, Jan Kiszka, Paul Gortmaker,
linux-rt-users@vger.kernel.org, kvm@vger.kernel.org, mtosatti
On 27/06/2013 15:00, Paolo Bonzini wrote:
> On 27/06/2013 14:32, Gleb Natapov wrote:
>>>>>>>>>> In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
>>>>>>>> I am copying Jan, the author of the patch. Commit message says:
>>>>>>>> "Code under this lock requires non-preemptibility", but which code
>>>>>>>> exactly is this? Is this still true?
>>>>>>
>>>>>> hardware_enable_nolock/hardware_disable_nolock does.
>>>>
>>>> IIRC, also the loop in kvmclock_cpufreq_notifier needs it because it
>>>> reads the processor ID of the caller. That implies the caller cannot be
>>>> preempted, but these days a migration lock should be fine as well.
>>>>
>> OK, adding Marcelo to the party. This code is called from cpufreq
>> notifier. I would expect that it will be called from the context that
>> prevents migration to another cpu.
>
> No, the CPU is in freq->cpu and may not even be the CPU that changed
> frequency.
Try again: "No, the CPU is in freq->cpu and smp_processor_id() may not
even be the CPU that changed frequency". It probably makes more sense now.
Paolo
> But even then I'm not sure the loop needs to be non-preemptible. If it
> were, the smp_call_function_single just before/after the loop would have
> to be non-preemptable as well. So it is just an optimization and it can
> use raw_smp_processor_id() instead.
>
> Paolo