* [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-19 3:02 UTC
To: qemu-devel, Anthony Liguori, Avi Kivity, Jan Kiszka, Paolo Bonzini, Marcelo Tosatti

Currently, cpu_physical_memory_rw() can be used directly or indirectly by an
mmio-dispatcher to access another device's memory region.  This can cause
problems when adopting per-device private locks.

Background:
  http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01481.html
The short version:
  http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01878.html

--1st. Recursive acquisition of the big lock.

If we leave c_p_m_rw() as it is, i.e. with no lock taken inside, then we can
have the following (a section of the whole call chain, with private_lockA
held):

  lockA-mmio-dispatcher --> hold biglock --> c_p_m_rw() --> ...

Before c_p_m_rw(), we drop private_lockA to avoid the possibility of
deadlock.  But we cannot prevent this chain from nesting, or from calling
into another lockB-mmio-dispatcher, so we cannot avoid the possibility of
taking the big lock recursively.  Another important factor is that we would
break the lock ordering private_lock --> biglock.  All of this requires us
to push acquisition of the big lock into c_p_m_rw() itself; a wrapper cannot
help.

--2nd. c_p_m_rw(): sync or async?

If we convert all devices to be protected by a refcount, then we can have:

  /* no big lock */
  c_p_m_rw()
  {
      devB->ref++;
      {                                  /* pushed onto another thread */
          lock_privatelock
          mr->ops->write();
          unlock_privatelock
      }
      wait_for_completion();
      devB->ref--;
  }

This model lets c_p_m_rw() present itself as a SYNC API.  But currently we
mix the big lock and private locks together, and wait_for_completion() may
block the release of the big lock, which finally causes a deadlock.  So we
cannot simply rely on this model.  Instead, we need to classify the calling
scenarios into three cases:

  case1. lockA-dispatcher ---> lockB-dispatcher    // can use the
         async+completion model
  case2. lockA-dispatcher ---> biglock-dispatcher  // sync, but can cause
         recursive acquisition of the big lock
  case3. biglock-dispatcher ---> lockB-dispatcher  // async, to avoid the
         lock-ordering problem (as for the completion, it would need to be
         placed outside the top-level big lock, which is hard to do;
         suggest changing this to case 1, or, for now, just leaving it
         async)

This new model requires that the big lock can be nested.

Any comments?

Thanks and regards,
pingfan
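For concreteness, the async+completion model above could be sketched as
follows.  This is a minimal illustration using plain pthreads rather than
QEMU's real primitives, and every name in it (DevB, WriteReq, worker,
dispatch_write) is a hypothetical stand-in:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct DevB {
        pthread_mutex_t private_lock;   /* the device's fine-grained lock */
        int ref;                        /* keeps devB alive while a request
                                         * is in flight */
        uint64_t reg;
    } DevB;

    typedef struct WriteReq {
        DevB *dev;
        uint64_t val;
        bool done;
        pthread_mutex_t done_lock;
        pthread_cond_t done_cond;
    } WriteReq;

    /* Worker thread: only the device's private lock is held here. */
    static void *worker(void *opaque)
    {
        WriteReq *req = opaque;

        pthread_mutex_lock(&req->dev->private_lock);
        req->dev->reg = req->val;       /* stands in for mr->ops->write() */
        pthread_mutex_unlock(&req->dev->private_lock);

        pthread_mutex_lock(&req->done_lock);
        req->done = true;
        pthread_cond_signal(&req->done_cond);
        pthread_mutex_unlock(&req->done_lock);
        return NULL;
    }

    /* Caller side: a synchronous facade over an async dispatch; note that
     * no big lock is taken anywhere on this path. */
    static void dispatch_write(DevB *dev, uint64_t val)
    {
        WriteReq req = { .dev = dev, .val = val, .done = false };
        pthread_t tid;

        pthread_mutex_init(&req.done_lock, NULL);
        pthread_cond_init(&req.done_cond, NULL);

        __atomic_fetch_add(&dev->ref, 1, __ATOMIC_SEQ_CST);  /* devB->ref++ */
        pthread_create(&tid, NULL, worker, &req);

        pthread_mutex_lock(&req.done_lock);          /* wait_for_completion() */
        while (!req.done) {
            pthread_cond_wait(&req.done_cond, &req.done_lock);
        }
        pthread_mutex_unlock(&req.done_lock);
        pthread_join(tid, NULL);

        __atomic_fetch_add(&dev->ref, -1, __ATOMIC_SEQ_CST); /* devB->ref-- */
    }

    int main(void)
    {
        DevB dev = { .private_lock = PTHREAD_MUTEX_INITIALIZER };
        dispatch_write(&dev, 42);
        return 0;
    }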
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 8:06 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori, Paolo Bonzini

On 09/19/2012 06:02 AM, liu ping fan wrote:
> --1st. Recursive acquisition of the big lock.
> [...]
> All of this requires us to push acquisition of the big lock into
> c_p_m_rw() itself; a wrapper cannot help.

I agree that this is unavoidable.

> --2nd. c_p_m_rw(): sync or async?
> [...]
> This new model requires that the big lock can be nested.

I think changing to an async model is too complicated.  It's difficult
enough already.  Isn't dropping private locks + recursive big locks
sufficient?

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-19 9:00 UTC
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori, Paolo Bonzini

On Wed, Sep 19, 2012 at 4:06 PM, Avi Kivity <avi@redhat.com> wrote:
> On 09/19/2012 06:02 AM, liu ping fan wrote:
>> [...]
>
> I think changing to an async model is too complicated.  It's difficult
> enough already.  Isn't dropping private locks + recursive big locks
> sufficient?

I think that "dropping private locks + recursive big locks" only covers
case 2.  Most importantly, it does not describe case 3, which breaks the
lock-ordering rule "private-lock --> biglock".  Scenario:
devA_lock --> (devX_with_biglock --> devB_lock).
I just wanted to classify these cases and post them for discussion.
Maybe we can achieve this without async.

Thanks and regards,
pingfan
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 9:07 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori, Paolo Bonzini

On 09/19/2012 12:00 PM, liu ping fan wrote:
> [...]
> I think that "dropping private locks + recursive big locks" only covers
> case 2.  Most importantly, it does not describe case 3, which breaks
> the lock-ordering rule "private-lock --> biglock".  Scenario:
> devA_lock --> (devX_with_biglock --> devB_lock).

Why not?  devA will drop its local lock, devX will retake the big lock
recursively, devB will take its local lock.  In the end, we have
biglock -> devB.

> I just wanted to classify these cases and post them for discussion.
> Maybe we can achieve this without async.

-- 
error compiling committee.c: too many arguments to function
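The sequence Avi describes can be sketched like this; all names (big_lock,
devA, devB, devX_write) are hypothetical stand-ins, not QEMU code:

    #include <pthread.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    typedef struct { pthread_mutex_t lock; int reg; } Dev;
    static Dev devA = { PTHREAD_MUTEX_INITIALIZER, 0 };
    static Dev devB = { PTHREAD_MUTEX_INITIALIZER, 0 };

    /* devB is fine-grain locked: its dispatcher takes only devB.lock. */
    static void devB_write(int val)
    {
        pthread_mutex_lock(&devB.lock);
        devB.reg = val;
        pthread_mutex_unlock(&devB.lock);
    }

    /* devX is still under the big lock; DMA from it into devB only ever
     * produces the order big_lock -> devB.lock. */
    static void devX_write(int val)
    {
        pthread_mutex_lock(&big_lock);
        devB_write(val);                 /* big_lock -> devB.lock */
        pthread_mutex_unlock(&big_lock);
    }

    /* devA drops its local lock *before* the access that re-enters the
     * big lock, so devA.lock -> big_lock never appears in the lock graph. */
    static void devA_write(int val)
    {
        pthread_mutex_lock(&devA.lock);
        devA.reg = val;
        pthread_mutex_unlock(&devA.lock); /* drop the fine lock first */

        devX_write(val);                  /* retakes the big lock */
    }

    int main(void) { devA_write(1); return 0; }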
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-19 9:11 UTC
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori, Paolo Bonzini

On Wed, Sep 19, 2012 at 5:07 PM, Avi Kivity <avi@redhat.com> wrote:
> [...]
> Why not?  devA will drop its local lock, devX will retake the big lock
> recursively, devB will take its local lock.  In the end, we have
> biglock -> devB.

But when adopting a local lock, we assumed the order "take the local
lock, then the big lock".  Otherwise another thread may take the big
lock and then a local lock, which creates the possibility of deadlock.
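The deadlock pingfan worries about here is the classic ABBA inversion; a
minimal sketch with hypothetical lock names, where one thread takes
fine->coarse while another takes coarse->fine:

    #include <pthread.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *fine_then_coarse(void *arg)  /* the order pingfan assumes */
    {
        pthread_mutex_lock(&dev_lock);
        pthread_mutex_lock(&big_lock);        /* may wait on the other thread */
        pthread_mutex_unlock(&big_lock);
        pthread_mutex_unlock(&dev_lock);
        return NULL;
    }

    static void *coarse_then_fine(void *arg)  /* the order Paolo mandates */
    {
        pthread_mutex_lock(&big_lock);
        pthread_mutex_lock(&dev_lock);        /* ... while holding big_lock */
        pthread_mutex_unlock(&dev_lock);
        pthread_mutex_unlock(&big_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, fine_then_coarse, NULL);
        pthread_create(&t2, NULL, coarse_then_fine, NULL);
        pthread_join(t1, NULL);   /* can hang: that hang is the deadlock */
        pthread_join(t2, NULL);
        return 0;
    }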
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Paolo Bonzini @ 2012-09-19 9:14 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel

Il 19/09/2012 11:11, liu ping fan ha scritto:
>> Why not?  devA will drop its local lock, devX will retake the big lock
>> recursively, devB will take its local lock.  In the end, we have
>> biglock -> devB.
>
> But when adopting a local lock, we assumed the order "take the local
> lock, then the big lock".

No, because the local lock will be dropped before taking the big lock.
The order must always be coarse->fine.

I don't know if the front-end (device) lock should come before or after
the back-end (e.g. netdev) lock in the hierarchy, but that's another
story.

Paolo

> Otherwise another thread may take the big lock and then a local lock,
> which creates the possibility of deadlock.
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-19 9:19 UTC
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel

On Wed, Sep 19, 2012 at 5:14 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> [...]
> No, because the local lock will be dropped before taking the big lock.
> The order must always be coarse->fine.

But if we take the coarse lock first, then the mmio-dispatchers will
still contend for the big lock against each other.

> I don't know if the front-end (device) lock should come before or after
> the back-end (e.g. netdev) lock in the hierarchy, but that's another
> story.

I think fine->coarse may be the rule, since on some code paths it is not
necessary to take the coarse lock at all.

Thanks and regards,
pingfan
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 9:23 UTC
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel, Anthony Liguori, Jan Kiszka

On 09/19/2012 12:19 PM, liu ping fan wrote:
> But if we take the coarse lock first, then the mmio-dispatchers will
> still contend for the big lock against each other.

Can you detail the sequence?

> I think fine->coarse may be the rule, since on some code paths it is
> not necessary to take the coarse lock at all.

coarse->fine doesn't mean you have to take the coarse lock.

Valid:
  lock(coarse)
  lock(fine)

Valid:
  lock(fine)

Valid:
  lock(coarse)

Invalid:
  lock(fine)
  lock(coarse)

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Jan Kiszka @ 2012-09-19 9:27 UTC
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org

On 2012-09-19 11:23, Avi Kivity wrote:
> [...]
> coarse->fine doesn't mean you have to take the coarse lock.
>
> Valid:
>   lock(coarse)
>   lock(fine)

This is invalid due to priority-inversion issues, independent of
potential deadlocks.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Jan Kiszka @ 2012-09-19 9:28 UTC
Cc: Marcelo Tosatti, qemu-devel@nongnu.org, liu ping fan, Avi Kivity, Anthony Liguori, Paolo Bonzini

On 2012-09-19 11:27, Jan Kiszka wrote:
> [...]
> This is invalid due to priority-inversion issues, independent of
> potential deadlocks.

Err, forget it - the other way around is the problem.

Sorry,
Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-20 7:51 UTC
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel, Anthony Liguori, Jan Kiszka

On Wed, Sep 19, 2012 at 5:23 PM, Avi Kivity <avi@redhat.com> wrote:
> [...]
> Can you detail the sequence?

LOCK(local lock)
  .......................
  LOCK(big lock)
  access timer/block/network subsystem
  UNLOCK(big lock)
  .....................
UNLOCK(local lock)

> coarse->fine doesn't mean you have to take the coarse lock.

Sorry, I don't catch your meaning.  Doesn't "coarse->fine" mean
LOCK(coarse) --> LOCK(fine); ... UNLOCK(fine) --> UNLOCK(coarse)?

> Valid:
>   lock(coarse)
>   lock(fine)

But that conflicts with "localLock(fine) --> bigLock(coarse)", which is
the order taken during mmio dispatch.

Regards,
pingfan
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Paolo Bonzini @ 2012-09-20 7:54 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel

Il 20/09/2012 09:51, liu ping fan ha scritto:
> Sorry, I don't catch your meaning.  Doesn't "coarse->fine" mean
> LOCK(coarse) --> LOCK(fine); ... UNLOCK(fine) --> UNLOCK(coarse)?

Yes.

> But that conflicts with "localLock(fine) --> bigLock(coarse)", which is
> the order taken during mmio dispatch.

No, MMIO dispatch has to discard the fine-grained lock before acquiring
the big lock.

If you allow

  lock(fine)
  lock(coarse)

then the (presumably higher-priority) thread that is requesting the
fine-grained lock must wait for the lower-priority thread that holds the
coarse-grained lock.  Then you get priority inversion.

Paolo
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-20 8:09 UTC
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel

On Thu, Sep 20, 2012 at 3:54 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> [...]
> No, MMIO dispatch has to discard the fine-grained lock before acquiring
> the big lock.

This will leave the device state broken, and expose the device to the
risk of concurrent modification.

> If you allow
>
>   lock(fine)
>   lock(coarse)
>
> then the (presumably higher-priority) thread that is requesting the
> fine-grained lock must wait for the lower-priority thread that holds
> the coarse-grained lock.  Then you get priority inversion.

Which thread has higher priority?  Why does the thread with the coarse
lock have lower priority?

Thanks and regards,
pingfan
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Paolo Bonzini @ 2012-09-20 8:27 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel

Il 20/09/2012 10:09, liu ping fan ha scritto:
>> No, MMIO dispatch has to discard the fine-grained lock before
>> acquiring the big lock.
>
> This will leave the device state broken, and expose the device to the
> risk of concurrent modification.

We do it all the time with asynchronous I/O.  It's just an invariant
that fine-grain-locked devices have to obey.  You cannot keep the same
semantics you enjoyed when a single lock covered everything.

>> If you allow
>>
>>   lock(fine)
>>   lock(coarse)
>>
>> then the (presumably higher-priority) thread that is requesting the
>> fine-grained lock must wait for the lower-priority thread that holds
>> the coarse-grained lock.  Then you get priority inversion.
>
> Which thread has higher priority?  Why does the thread with the coarse
> lock have lower priority?

Because there can be so many things happening under the coarse lock that
just _assuming_ it has lower priority is a good thing.  On the other
hand, if the device didn't need immediate servicing (which means higher
priority), you wouldn't bother taking it out of the big QEMU lock.

Paolo
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-20 9:07 UTC
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, qemu-devel, Anthony Liguori, Jan Kiszka

On 09/20/2012 10:51 AM, liu ping fan wrote:
> LOCK(local lock)
>   .......................
>   LOCK(big lock)
>   access timer/block/network subsystem
>   UNLOCK(big lock)
>   .....................
> UNLOCK(local lock)

This is an invalid sequence.  Either the subsystem has to be fine-grain
locked, or the lock order has to be reversed.

Before we finish subsystem conversion, an mmio dispatcher may look like:

  dev_write(...)
  {
      lock(s->lock)
      switch (addr) {
      case REGA:
          ...
      case REGB:
          ...
      case REGC:
          unlock(s->lock)
          lock(big lock)
          lock(s->lock)
          qemu_mod_timer()
          unlock(big lock)
          break;
      ...
      }
      unlock(s->lock)
  }

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-21 7:27 UTC
To: qemu-devel, Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, Anthony Liguori

On Thu, Sep 20, 2012 at 5:07 PM, Avi Kivity <avi@redhat.com> wrote:
> [...]
> This is an invalid sequence.  Either the subsystem has to be fine-grain
> locked, or the lock order has to be reversed.

Oh!  From this thread, my understanding of the reason for the lock-order
rule coarse->fine is this: the big lock implies a higher possibility of
contention, so we try it first, and only then take the fine lock.  That
way we hold the fine lock for a smaller window, which means other
threads can acquire it more smoothly.  Right?

Not wanting to start an argument, just a question: is there any reason
for the order devlock->timerlock?

Regards,
pingfan
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Paolo Bonzini @ 2012-09-21 8:21 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel, Anthony Liguori

> Oh!  From this thread, my understanding of the reason for the
> lock-order rule coarse->fine is this: the big lock implies a higher
> possibility of contention, so we try it first, and only then take the
> fine lock.  That way we hold the fine lock for a smaller window, which
> means other threads can acquire it more smoothly.  Right?

Yes.

> Not wanting to start an argument, just a question: is there any reason
> for the order devlock->timerlock?

Not any particular reason, just that it makes more sense. :)

Backend subsystems (timers, network, I/O handlers, ...) typically will
not call anything else.  So it makes sense that their locks are held
inside the device locks (and inside the big lock).

Also note that while the timer subsystem could call back into the
devices, it is perfectly plausible that the devices will do, for
example, a qemu_mod_timer from a timer callback.  Thus, in some sense
the timer subsystem already has to expect modifications of its state
during the callback.  Releasing the lock during callbacks is "just" an
extension of the current behavior.

Of course some changes are needed to the invariants, such as the one Avi
pointed out elsewhere in the thread, but devlock->timerlock is overall
more natural than the opposite.

Paolo
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 9:21 UTC
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel

On 09/19/2012 12:14 PM, Paolo Bonzini wrote:
> I don't know if the front-end (device) lock should come before or after
> the back-end (e.g. netdev) lock in the hierarchy, but that's another
> story.

I would say device -> backend.  It's natural if the backend is the timer
subsystem, so extend it from there to the block and network layers.  Of
course callbacks want it to be the other way round.

We could solve the callback problem in the same way we're using for mmio
callbacks.  When providing the callback, give the subsystem ->ref() and
->unref() methods.  As long as a callback is pending, the subsystem
holds a reference.  Once the callback has been invoked or cancelled, the
reference can be released.

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Paolo Bonzini @ 2012-09-19 9:51 UTC
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel

Il 19/09/2012 11:21, Avi Kivity ha scritto:
>> I don't know if the front-end (device) lock should come before or
>> after the back-end (e.g. netdev) lock in the hierarchy, but that's
>> another story.
>
> I would say device -> backend.  It's natural if the backend is the
> timer subsystem, so extend it from there to the block and network
> layers.  Of course callbacks want it to be the other way round.

Yes, that's what I wasn't sure about.  In many cases I believe callbacks
can just release the backend lock.  It works for timers, for example:

     for(;;) {
         ts = clock->active_timers;
         if (!qemu_timer_expired_ns(ts, current_time)) {
             break;
         }
         /* remove timer from the list before calling the callback */
         clock->active_timers = ts->next;
         ts->next = NULL;

         /* run the callback (the timer list can be modified) */
-        ts->cb(ts->opaque);
+        cb = ts->cb;
+        opaque = ts->opaque;
+        unlock();
+        cb(opaque);
+        lock();
     }

(The hunch is that ts could be deleted exactly at the moment the
callback is unlocked.  This can be solved with ref/unref on the opaque
value, as you mention below.)

Paolo

> We could solve the callback problem in the same way we're using for
> mmio callbacks.  When providing the callback, give the subsystem
> ->ref() and ->unref() methods.  As long as a callback is pending, the
> subsystem holds a reference.  Once the callback has been invoked or
> cancelled, the reference can be released.
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 10:06 UTC
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel

On 09/19/2012 12:51 PM, Paolo Bonzini wrote:
> Yes, that's what I wasn't sure about.  In many cases I believe
> callbacks can just release the backend lock.  It works for timers, for
> example:
> [...]
> (The hunch is that ts could be deleted exactly at the moment the
> callback is unlocked.  This can be solved with ref/unref on the opaque
> value, as you mention below.)

Are you saying that this works as is, or not?  It does seem broken wrt
deletion; after qemu_del_timer() completes, the caller expects the
callback not to be called.  This isn't trivial to guarantee; we need
something like

  dispatch_timer():
      pending += 1
      timer.ref()
      drop lock
      timer.cb()
      take lock
      timer.unref()
      pending -= 1
      notify

  del_timer():
      take lock
      timer.unlink()
      while pending:
          wait for notification
      drop lock

but, if del_timer is called with the device lock held, we deadlock.  Ugh.

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Paolo Bonzini @ 2012-09-19 10:19 UTC
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel

Il 19/09/2012 12:06, Avi Kivity ha scritto:
>> (The hunch is that ts could be deleted exactly at the moment the
>> callback is unlocked.  This can be solved with ref/unref on the opaque
>> value, as you mention below.)
>
> Are you saying that this works as is, or not?  It does seem broken wrt
> deletion; after qemu_del_timer() completes, the caller expects the
> callback not to be called.

Ouch, then I think the only solution is to remove this invariant if you
add fine-grained locking, and re-check the enable conditions in the
timer callback.

If you allow del_timer to be called with the device lock held, you
cannot fix this without deadlocking.  If you don't, the caller of
del_timer cannot set any expectation, because the timer could run in the
window between dropping the device lock and calling del_timer.

> This isn't trivial to guarantee; we need something like
> [...]
> but, if del_timer is called with the device lock held, we deadlock.  Ugh.

"Wait for notification" should be a cond_wait that drops the lock.
Alternatively:

  dispatch_timer():
      drop lock
      if timer has expired:
          timer.cb()
      take lock
      notify

  del_timer():
      take lock
      timer.unlink()
      drop lock

Paolo
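Paolo's cond_wait variant could look roughly like the following, again with
hypothetical names and plain pthreads instead of QEMU's timer code:

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct Timer {
        void (*cb)(void *opaque);
        void *opaque;
        bool linked;        /* still on the active list? */
        int pending;        /* callbacks currently running */
    } Timer;

    static pthread_mutex_t timer_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t timer_done = PTHREAD_COND_INITIALIZER;

    static void dispatch_timer(Timer *t)
    {
        pthread_mutex_lock(&timer_lock);
        if (t->linked) {
            t->linked = false;
            t->pending++;
            pthread_mutex_unlock(&timer_lock);  /* drop lock around cb */
            t->cb(t->opaque);
            pthread_mutex_lock(&timer_lock);
            t->pending--;
            pthread_cond_broadcast(&timer_done); /* notify */
        }
        pthread_mutex_unlock(&timer_lock);
    }

    /* Must NOT be called with a device lock held, or this recreates the
     * deadlock Avi points out above. */
    static void del_timer(Timer *t)
    {
        pthread_mutex_lock(&timer_lock);
        t->linked = false;                       /* timer.unlink() */
        while (t->pending) {
            /* cond_wait drops timer_lock while waiting */
            pthread_cond_wait(&timer_done, &timer_lock);
        }
        pthread_mutex_unlock(&timer_lock);
    }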
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 10:27 UTC
To: Paolo Bonzini
Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel

On 09/19/2012 01:19 PM, Paolo Bonzini wrote:
> Ouch, then I think the only solution is to remove this invariant if you
> add fine-grained locking, and re-check the enable conditions in the
> timer callback.

I think so too.  I know authors of GUI frameworks suffer from this
problem (the GUI calling into the framework vs. the framework asking the
GUI to repaint).  I don't know what the state-of-the-art solution is.

> If you allow del_timer to be called with the device lock held, you
> cannot fix this without deadlocking.  If you don't, the caller of
> del_timer cannot set any expectation, because the timer could run in
> the window between dropping the device lock and calling del_timer.

Right.  Callbacks should check qemu_timer_deleted() after taking their
little lock.  The reference held by the timer subsystem protects against
qemu_free_timer().

-- 
error compiling committee.c: too many arguments to function
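On the callback side, the discipline Avi describes might look like this;
qemu_timer_deleted() is the predicate he proposes above, not an existing
QEMU function, so it appears here only as a stub, and MyDev is hypothetical:

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct QEMUTimer QEMUTimer;

    /* Stub standing in for the proposed predicate. */
    static bool qemu_timer_deleted(QEMUTimer *t)
    {
        (void)t;
        return false;
    }

    typedef struct MyDev {
        pthread_mutex_t lock;    /* the device's "little lock" */
        QEMUTimer *timer;
        int counter;
    } MyDev;

    static void dev_timer_cb(void *opaque)
    {
        MyDev *s = opaque;

        pthread_mutex_lock(&s->lock);
        if (qemu_timer_deleted(s->timer)) {
            /* deletion won the race in the unlocked window; the timer
             * subsystem's reference still keeps the timer memory alive,
             * so this check is safe, and we simply do nothing */
            pthread_mutex_unlock(&s->lock);
            return;
        }
        s->counter++;            /* normal timer work; state consistent */
        pthread_mutex_unlock(&s->lock);
    }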
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Jan Kiszka @ 2012-09-19 9:34 UTC
To: liu ping fan
Cc: Paolo Bonzini, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel@nongnu.org

On 2012-09-19 11:00, liu ping fan wrote:
> [...]
> I think that "dropping private locks + recursive big locks" only covers
> case 2.  Most importantly, it does not describe case 3, which breaks
> the lock-ordering rule "private-lock --> biglock".  Scenario:
> devA_lock --> (devX_with_biglock --> devB_lock).
> I just wanted to classify these cases and post them for discussion.
> Maybe we can achieve this without async.

What about the following:

What we really need to support in practice is an MMIO access triggering
a RAM access of the device model.  Scenarios where a device access
triggers another MMIO access could likely just be rejected without
causing trouble.

So, when we dispatch a request to a device, we mark that the current
thread is in an MMIO dispatch and reject any follow-up c_p_m_rw that
does _not_ target RAM, i.e. is another, nested MMIO request -
independent of its destination.  How much of the known issues would this
solve?  And what would remain open?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
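Jan's scheme could be sketched roughly as follows; the signature is
simplified and all helpers are toy stand-ins, not QEMU's real dispatch code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static __thread bool in_mmio_dispatch;  /* per-thread dispatch marker */

    static uint8_t ram[0x1000];             /* toy RAM region */

    static bool is_ram(uint64_t addr)
    {
        return addr < sizeof(ram);
    }

    static void mmio_write_handler(uint64_t addr, void *buf, int len);

    /* Simplified stand-in for cpu_physical_memory_rw(): RAM is always
     * served, while a nested MMIO dispatch is rejected instead of
     * recursing. */
    static void c_p_m_rw(uint64_t addr, void *buf, int len, bool is_write)
    {
        if (is_ram(addr)) {
            if (is_write) {
                memcpy(ram + addr, buf, len);
            } else {
                memcpy(buf, ram + addr, len);
            }
            return;
        }
        if (in_mmio_dispatch) {
            /* follow-up request from inside a dispatcher that does not
             * target RAM: reject rather than nest MMIO dispatch */
            fprintf(stderr, "rejected nested MMIO at 0x%llx\n",
                    (unsigned long long)addr);
            return;
        }
        in_mmio_dispatch = true;
        mmio_write_handler(addr, buf, len);  /* device dispatch */
        in_mmio_dispatch = false;
    }

    /* Toy device handler: DMA into RAM is fine, nested MMIO is not. */
    static void mmio_write_handler(uint64_t addr, void *buf, int len)
    {
        c_p_m_rw(0x10, buf, len, true);           /* targets RAM: allowed */
        c_p_m_rw(0xfee00000ull, buf, len, true);  /* nested MMIO: rejected */
    }

    int main(void)
    {
        uint8_t v = 1;
        c_p_m_rw(0xfee00000ull, &v, 1, true);     /* top-level dispatch */
        return 0;
    }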
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-19 9:50 UTC
To: Jan Kiszka
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org

On 09/19/2012 12:34 PM, Jan Kiszka wrote:
> So, when we dispatch a request to a device, we mark that the current
> thread is in an MMIO dispatch and reject any follow-up c_p_m_rw that
> does _not_ target RAM, i.e. is another, nested MMIO request -
> independent of its destination.  How much of the known issues would
> this solve?  And what would remain open?

Various iommu-like devices re-dispatch I/O, for example to change
endianness, or for bitband.  I don't know whether that targets I/O
rather than RAM.

If they do, we can push the support into the memory API.  I think it's
acceptable as a short-term solution (short-term meaning as long as
needed).

-- 
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Jan Kiszka @ 2012-09-19 10:18 UTC
To: Avi Kivity
Cc: Paolo Bonzini, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org

On 2012-09-19 11:50, Avi Kivity wrote:
> Various iommu-like devices re-dispatch I/O, for example to change
> endianness, or for bitband.  I don't know whether that targets I/O
> rather than RAM.

If such dispatching is lock-less and we validate that it's not
recursive, then it should be fine even when not targeting RAM.  And we
could also factor out generic dispatchers like MMIO->PIO.

> If they do, we can push the support into the memory API.  I think it's
> acceptable as a short-term solution (short-term meaning as long as
> needed).

We will always have to avoid loops.  I think all we can improve is how
much we still allow, by better detecting what would be wrong.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: liu ping fan @ 2012-09-24 6:33 UTC
To: Avi Kivity
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini

On Wed, Sep 19, 2012 at 5:50 PM, Avi Kivity <avi@redhat.com> wrote:
> Various iommu-like devices re-dispatch I/O, for example to change
> endianness, or for bitband.  I don't know whether that targets I/O
> rather than RAM.

I have not found the exact code, but I think the call chain may look
like this:

  dev mmio-handler --> c_p_m_rw() --> iommu mmio-handler --> c_p_m_rw()

And I think you worry about the "c_p_m_rw() --> iommu mmio-handler"
case, right?  How about introducing a can_nest member in the
MemoryRegionOps of the iommu's memory region?

Regards,
pingfan

> If they do, we can push the support into the memory API.  I think it's
> acceptable as a short-term solution (short-term meaning as long as
> needed).
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock
From: Avi Kivity @ 2012-09-24 7:44 UTC
To: liu ping fan
Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini

On 09/24/2012 08:33 AM, liu ping fan wrote:
> And I think you worry about the "c_p_m_rw() --> iommu mmio-handler"
> case, right?  How about introducing a can_nest member in the
> MemoryRegionOps of the iommu's memory region?

I would rather push the iommu logic into the memory API:

  memory_region_init_iommu(MemoryRegion *mr, const char *name,
                           MemoryRegion *target, MemoryRegionIOMMUOps *ops,
                           unsigned size)

  struct MemoryRegionIOMMUOps {
      target_physical_addr_t (*translate)(target_physical_addr_t addr,
                                          bool write);
      void (*fault)(target_physical_addr_t addr);
  };

I'll look at a proposal for this.  It's a generalized case of
memory_region_init_alias().

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
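As a sketch of how such an API might be used, keeping in mind that the API
itself is only a proposal in this thread, so every name below (the ops, the
device state, the window offset) is equally tentative:

    /* A toy IOMMU that redirects accesses into a fixed window, layered
     * over system memory; tentative usage of the proposed API. */
    static target_physical_addr_t my_iommu_translate(target_physical_addr_t addr,
                                                     bool write)
    {
        return addr + 0x40000000;     /* redirect into a fixed window */
    }

    static void my_iommu_fault(target_physical_addr_t addr)
    {
        /* raise a device-specific translation fault */
    }

    static MemoryRegionIOMMUOps my_iommu_ops = {
        .translate = my_iommu_translate,
        .fault     = my_iommu_fault,
    };

    /* in the hypothetical device's init code: */
    memory_region_init_iommu(&s->iommu_mr, "my-iommu", sysmem,
                             &my_iommu_ops, 0x1000);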
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-24 7:44 ` Avi Kivity @ 2012-09-24 8:32 ` liu ping fan 2012-09-24 9:42 ` Avi Kivity 0 siblings, 1 reply; 45+ messages in thread From: liu ping fan @ 2012-09-24 8:32 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On Mon, Sep 24, 2012 at 3:44 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/24/2012 08:33 AM, liu ping fan wrote: >> On Wed, Sep 19, 2012 at 5:50 PM, Avi Kivity <avi@redhat.com> wrote: >> > On 09/19/2012 12:34 PM, Jan Kiszka wrote: >> >> >> >> What about the following: >> >> >> >> What we really need to support in practice is MMIO access triggers RAM >> >> access of device model. Scenarios where a device access triggers another >> >> MMIO access could likely just be rejected without causing troubles. >> >> >> >> So, when we dispatch a request to a device, we mark that the current >> >> thread is in a MMIO dispatch and reject any follow-up c_p_m_rw that does >> >> _not_ target RAM, ie. is another, nested MMIO request - independent of >> >> its destination. How much of the known issues would this solve? And what >> >> would remain open? >> > >> > Various iommu-like devices re-dispatch I/O, like changing endianness or >> > bitband. I don't know whether it targets I/O rather than RAM. >> > >> Have not found the exact code. But I think the call chain may look >> like this: dev mmio-handler --> c_p_m_rw() --> iommu mmio-handler --> >> c_p_m_rw() >> And I think you worry about the case for "c_p_m_rw() --> iommu >> mmio-handler". Right? How about introduce an member can_nest for >> MemoryRegionOps of iommu's mr? >> > > I would rather push the iommu logic into the memory API: > > memory_region_init_iommu(MemoryRegion *mr, const char *name, > MemoryRegion *target, MemoryRegionIOMMUOps *ops, > unsigned size) > > struct MemoryRegionIOMMUOps { > target_physical_addr_t (*translate)(target_physical_addr_t addr, > bool write); > void (*fault)(target_physical_addr_t addr); > }; > So I guess, after introduce this, the code logic in c_p_m_rw() will look like this c_p_m_rw(dev_virt_addr, ...) { mr = phys_page_lookup(); if (mr->iommu_ops) real_addr = translate(dev_virt_addr,..); ptr = qemu_get_ram_ptr(real_addr); memcpy(buf, ptr, sz); } > I'll look at a proposal for this. It's a generalized case of > memory_region_init_alias(). > > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-24 8:32 ` liu ping fan @ 2012-09-24 9:42 ` Avi Kivity 2012-09-27 3:13 ` liu ping fan 0 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2012-09-24 9:42 UTC (permalink / raw) To: liu ping fan Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On 09/24/2012 10:32 AM, liu ping fan wrote: > On Mon, Sep 24, 2012 at 3:44 PM, Avi Kivity <avi@redhat.com> wrote: >> On 09/24/2012 08:33 AM, liu ping fan wrote: >>> On Wed, Sep 19, 2012 at 5:50 PM, Avi Kivity <avi@redhat.com> wrote: >>> > On 09/19/2012 12:34 PM, Jan Kiszka wrote: >>> >> >>> >> What about the following: >>> >> >>> >> What we really need to support in practice is MMIO access triggers RAM >>> >> access of device model. Scenarios where a device access triggers another >>> >> MMIO access could likely just be rejected without causing troubles. >>> >> >>> >> So, when we dispatch a request to a device, we mark that the current >>> >> thread is in a MMIO dispatch and reject any follow-up c_p_m_rw that does >>> >> _not_ target RAM, ie. is another, nested MMIO request - independent of >>> >> its destination. How much of the known issues would this solve? And what >>> >> would remain open? >>> > >>> > Various iommu-like devices re-dispatch I/O, like changing endianness or >>> > bitband. I don't know whether it targets I/O rather than RAM. >>> > >>> Have not found the exact code. But I think the call chain may look >>> like this: dev mmio-handler --> c_p_m_rw() --> iommu mmio-handler --> >>> c_p_m_rw() >>> And I think you worry about the case for "c_p_m_rw() --> iommu >>> mmio-handler". Right? How about introduce an member can_nest for >>> MemoryRegionOps of iommu's mr? >>> >> >> I would rather push the iommu logic into the memory API: >> >> memory_region_init_iommu(MemoryRegion *mr, const char *name, >> MemoryRegion *target, MemoryRegionIOMMUOps *ops, >> unsigned size) >> >> struct MemoryRegionIOMMUOps { >> target_physical_addr_t (*translate)(target_physical_addr_t addr, >> bool write); >> void (*fault)(target_physical_addr_t addr); >> }; >> > So I guess, after introduce this, the code logic in c_p_m_rw() will > look like this > > c_p_m_rw(dev_virt_addr, ...) > { > mr = phys_page_lookup(); > if (mr->iommu_ops) > real_addr = translate(dev_virt_addr,..); > > ptr = qemu_get_ram_ptr(real_addr); > memcpy(buf, ptr, sz); > } > Something like that. It will be a while loop, to allow for iommus strung in series. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
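Spelled out, that loop might look as follows. phys_page_lookup() and the iommu_ops member are shorthand carried over from the snippets above, and the depth guard is an extra assumption to keep a mis-programmed chain of iommus from looping forever:

/* Sketch: translate through a series of iommus until a terminal region.
 * phys_page_lookup() and iommu_ops are the pseudocode names used above. */
static void *dma_translate_and_map(target_physical_addr_t addr, bool is_write)
{
    MemoryRegion *mr = phys_page_lookup(addr);
    int depth = 0;

    while (mr->iommu_ops) {
        addr = mr->iommu_ops->translate(addr, is_write);
        mr = phys_page_lookup(addr);      /* re-dispatch the translated address */
        if (++depth > MAX_IOMMU_DEPTH) {  /* assumed guard against cycles */
            return NULL;                  /* flag a bad access */
        }
    }
    return qemu_get_ram_ptr(addr);        /* RAM fast path only, as above */
}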
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-24 9:42 ` Avi Kivity @ 2012-09-27 3:13 ` liu ping fan 2012-09-27 9:16 ` Avi Kivity 0 siblings, 1 reply; 45+ messages in thread From: liu ping fan @ 2012-09-27 3:13 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On Mon, Sep 24, 2012 at 5:42 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/24/2012 10:32 AM, liu ping fan wrote: >> On Mon, Sep 24, 2012 at 3:44 PM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/24/2012 08:33 AM, liu ping fan wrote: >>>> On Wed, Sep 19, 2012 at 5:50 PM, Avi Kivity <avi@redhat.com> wrote: >>>> > On 09/19/2012 12:34 PM, Jan Kiszka wrote: >>>> >> >>>> >> What about the following: >>>> >> >>>> >> What we really need to support in practice is MMIO access triggers RAM >>>> >> access of device model. Scenarios where a device access triggers another >>>> >> MMIO access could likely just be rejected without causing troubles. >>>> >> >>>> >> So, when we dispatch a request to a device, we mark that the current >>>> >> thread is in a MMIO dispatch and reject any follow-up c_p_m_rw that does >>>> >> _not_ target RAM, ie. is another, nested MMIO request - independent of >>>> >> its destination. How much of the known issues would this solve? And what >>>> >> would remain open? >>>> > >>>> > Various iommu-like devices re-dispatch I/O, like changing endianness or >>>> > bitband. I don't know whether it targets I/O rather than RAM. >>>> > >>>> Have not found the exact code. But I think the call chain may look >>>> like this: dev mmio-handler --> c_p_m_rw() --> iommu mmio-handler --> >>>> c_p_m_rw() >>>> And I think you worry about the case for "c_p_m_rw() --> iommu >>>> mmio-handler". Right? How about introduce an member can_nest for >>>> MemoryRegionOps of iommu's mr? >>>> >>> >>> I would rather push the iommu logic into the memory API: >>> >>> memory_region_init_iommu(MemoryRegion *mr, const char *name, >>> MemoryRegion *target, MemoryRegionIOMMUOps *ops, >>> unsigned size) >>> >>> struct MemoryRegionIOMMUOps { >>> target_physical_addr_t (*translate)(target_physical_addr_t addr, >>> bool write); >>> void (*fault)(target_physical_addr_t addr); >>> }; >>> >> So I guess, after introduce this, the code logic in c_p_m_rw() will >> look like this >> >> c_p_m_rw(dev_virt_addr, ...) >> { >> mr = phys_page_lookup(); >> if (mr->iommu_ops) >> real_addr = translate(dev_virt_addr,..); >> >> ptr = qemu_get_ram_ptr(real_addr); >> memcpy(buf, ptr, sz); >> } >> > > Something like that. It will be a while loop, to allow for iommus > strung in series. > Will model the system like the following: --.Introduce iommu address space. It will be the container of the regions which are put under the management of iommu. --.In the system address space, using alias-iommu-mrX with priority=1 to expose iommu address space and obscure the overlapped regions. -- Device's access to address manged by alias-iommu-mrX c_p_m_rw(target_physical_addr_t addrA, ..) { while (len > 0) { mr = phys_page_lookup(); if (mr->iommu_ops) addrB = translate(addrA,..); ptr = qemu_get_ram_ptr(addrB); memcpy(buf, ptr, sz); } } Is it correct? Thanks and regards, pingfan > > -- > error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
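For reference, the container-plus-alias setup described in that message can be expressed with the existing memory API roughly like this. window_base, window_size and the region names are invented, and which subregions move under the iommu is board-specific:

/* Sketch: an iommu address space exposed into the system address space
 * through a higher-priority alias, per the model above. */
static MemoryRegion iommu_as_root;
static MemoryRegion alias_iommu_mr0;

memory_region_init(&iommu_as_root, "iommu-address-space", window_size);
/* regions put under the management of the iommu are added here ... */

memory_region_init_alias(&alias_iommu_mr0, "alias-iommu-mr0",
                         &iommu_as_root, 0, window_size);
memory_region_add_subregion_overlap(get_system_memory(), window_base,
                                    &alias_iommu_mr0, 1 /* priority */);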
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 3:13 ` liu ping fan @ 2012-09-27 9:16 ` Avi Kivity 2012-09-27 9:29 ` Paolo Bonzini 2012-09-29 9:20 ` liu ping fan 0 siblings, 2 replies; 45+ messages in thread From: Avi Kivity @ 2012-09-27 9:16 UTC (permalink / raw) To: liu ping fan Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On 09/27/2012 05:13 AM, liu ping fan wrote: > On Mon, Sep 24, 2012 at 5:42 PM, Avi Kivity <avi@redhat.com> wrote: >> On 09/24/2012 10:32 AM, liu ping fan wrote: >>> On Mon, Sep 24, 2012 at 3:44 PM, Avi Kivity <avi@redhat.com> wrote: >>>> On 09/24/2012 08:33 AM, liu ping fan wrote: >>>>> On Wed, Sep 19, 2012 at 5:50 PM, Avi Kivity <avi@redhat.com> wrote: >>>>> > On 09/19/2012 12:34 PM, Jan Kiszka wrote: >>>>> >> >>>>> >> What about the following: >>>>> >> >>>>> >> What we really need to support in practice is MMIO access triggers RAM >>>>> >> access of device model. Scenarios where a device access triggers another >>>>> >> MMIO access could likely just be rejected without causing troubles. >>>>> >> >>>>> >> So, when we dispatch a request to a device, we mark that the current >>>>> >> thread is in a MMIO dispatch and reject any follow-up c_p_m_rw that does >>>>> >> _not_ target RAM, ie. is another, nested MMIO request - independent of >>>>> >> its destination. How much of the known issues would this solve? And what >>>>> >> would remain open? >>>>> > >>>>> > Various iommu-like devices re-dispatch I/O, like changing endianness or >>>>> > bitband. I don't know whether it targets I/O rather than RAM. >>>>> > >>>>> Have not found the exact code. But I think the call chain may look >>>>> like this: dev mmio-handler --> c_p_m_rw() --> iommu mmio-handler --> >>>>> c_p_m_rw() >>>>> And I think you worry about the case for "c_p_m_rw() --> iommu >>>>> mmio-handler". Right? How about introduce an member can_nest for >>>>> MemoryRegionOps of iommu's mr? >>>>> >>>> >>>> I would rather push the iommu logic into the memory API: >>>> >>>> memory_region_init_iommu(MemoryRegion *mr, const char *name, >>>> MemoryRegion *target, MemoryRegionIOMMUOps *ops, >>>> unsigned size) >>>> >>>> struct MemoryRegionIOMMUOps { >>>> target_physical_addr_t (*translate)(target_physical_addr_t addr, >>>> bool write); >>>> void (*fault)(target_physical_addr_t addr); >>>> }; >>>> >>> So I guess, after introduce this, the code logic in c_p_m_rw() will >>> look like this >>> >>> c_p_m_rw(dev_virt_addr, ...) >>> { >>> mr = phys_page_lookup(); >>> if (mr->iommu_ops) >>> real_addr = translate(dev_virt_addr,..); >>> >>> ptr = qemu_get_ram_ptr(real_addr); >>> memcpy(buf, ptr, sz); >>> } >>> >> >> Something like that. It will be a while loop, to allow for iommus >> strung in series. >> > Will model the system like the following: > > --.Introduce iommu address space. It will be the container of the > regions which are put under the management of iommu. > --.In the system address space, using alias-iommu-mrX with priority=1 > to expose iommu address space and obscure the overlapped regions. > -- Device's access to address manged by alias-iommu-mrX > c_p_m_rw(target_physical_addr_t addrA, ..) > { > while (len > 0) { > mr = phys_page_lookup(); > if (mr->iommu_ops) > addrB = translate(addrA,..); > > ptr = qemu_get_ram_ptr(addrB); > memcpy(buf, ptr, sz); > } > } > > Is it correct? iommus only apply to device accesses, not cpu accesses (as in cpu_p_m_w()). 
So we will need a generic dma function: typedef struct MemoryAddressSpace { MemoryRegion *root; PhysPageEntry phys_map; ... // linked list entry for list of all MemoryAddressSpaces } void memory_address_space_rw(MemoryAddressSpace *mas, ...) { look up mas->phys_map dispatch } void cpu_physical_memory_rw(...) { memory_address_space_rw(&system_memory, ...); } The snippet if (mr->iommu_ops) addrB = translate(addrA,..); needs to be a little more complicated. After translation, we need to look up the address again in a different phys_map. So a MemoryRegion that is an iommu needs to hold its own phys_map pointer for the lookup. But let's ignore the problem for now, we have too much on our plate. With a recursive big lock, there is no problem with iommus, yes? So as long as there is no intersection between converted devices and platforms with iommus, we're safe. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
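Assuming such a structure existed, per-bus DMA could then be routed through it instead of through the global system mapping. A sketch, where the PCIBusAS wrapper and the pci_bus_get_as() accessor are invented for illustration:

/* Sketch: give each bus that can sit behind an iommu its own address
 * space, and let device DMA go through it rather than c_p_m_rw(). */
typedef struct PCIBusAS {
    MemoryAddressSpace dma_as;      /* root = what this bus's masters see */
} PCIBusAS;

static void pci_dma_rw(PCIDevice *dev, target_physical_addr_t addr,
                       uint8_t *buf, int len, bool is_write)
{
    PCIBusAS *bas = pci_bus_get_as(dev->bus);   /* assumed accessor */
    memory_address_space_rw(&bas->dma_as, addr, buf, len, is_write);
}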
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 9:16 ` Avi Kivity @ 2012-09-27 9:29 ` Paolo Bonzini 2012-09-27 9:34 ` Avi Kivity 2012-09-29 9:20 ` liu ping fan 1 sibling, 1 reply; 45+ messages in thread From: Paolo Bonzini @ 2012-09-27 9:29 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org On 27/09/2012 11:16, Avi Kivity wrote: > needs to be a little more complicated. After translation, we need to > look up the address again in a different phys_map. So a MemoryRegion > that is an iommu needs to hold its own phys_map pointer for the lookup. > > But let's ignore the problem for now, we have too much on our plate. > With a recursive big lock, there is no problem with iommus, yes? So as > long as there is no intersection between converted devices and platforms > with iommus, we're safe. pSeries has both PCI and iommu, but I think no one should be using e1000 on a pSeries. virtio-net is per-spec not going through the iommu. Paolo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 9:29 ` Paolo Bonzini @ 2012-09-27 9:34 ` Avi Kivity 2012-09-27 9:36 ` Paolo Bonzini 0 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2012-09-27 9:34 UTC (permalink / raw) To: Paolo Bonzini Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org On 09/27/2012 11:29 AM, Paolo Bonzini wrote: > virtio-net is per-spec not going through the iommu. What's that smell? A feature bit? -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 9:34 ` Avi Kivity @ 2012-09-27 9:36 ` Paolo Bonzini 2012-09-27 10:08 ` Avi Kivity 0 siblings, 1 reply; 45+ messages in thread From: Paolo Bonzini @ 2012-09-27 9:36 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org On 27/09/2012 11:34, Avi Kivity wrote: >> > virtio-net is per-spec not going through the iommu. > What's that smell? A feature bit? Why is it a bad thing that it does not go through the iommu? Paolo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 9:36 ` Paolo Bonzini @ 2012-09-27 10:08 ` Avi Kivity 2012-09-27 10:22 ` Paolo Bonzini 0 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2012-09-27 10:08 UTC (permalink / raw) To: Paolo Bonzini Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org On 09/27/2012 11:36 AM, Paolo Bonzini wrote: > On 27/09/2012 11:34, Avi Kivity wrote: >>> > virtio-net is per-spec not going through the iommu. >> What's that smell? A feature bit? > > Why is it a bad thing that it does not go through the iommu? You can't assign the device to nested guests or to guest userspace via vfio. Some guests may want to use the iommu even in the guest kernel context. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 10:08 ` Avi Kivity @ 2012-09-27 10:22 ` Paolo Bonzini 2012-09-27 10:48 ` Avi Kivity 0 siblings, 1 reply; 45+ messages in thread From: Paolo Bonzini @ 2012-09-27 10:22 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org On 27/09/2012 12:08, Avi Kivity wrote: >>>>> virtio-net is per-spec not going through the iommu. >>> >> What's that smell? A feature bit? >> > >> > Why is it a bad thing that it does not go through the iommu? > You can't assign the device to nested guests or to guest userspace via > vfio. Some guests may want to use the iommu even in the guest kernel > context. All good points. However, using the iommu means you cannot use either vhost-net or (at least for now) per-device lock. So it closes some doors on performance improvements... I can imagine pSeries guys prefer to keep no iommu in virtio devices. Paolo ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 10:22 ` Paolo Bonzini @ 2012-09-27 10:48 ` Avi Kivity 0 siblings, 0 replies; 45+ messages in thread From: Avi Kivity @ 2012-09-27 10:48 UTC (permalink / raw) To: Paolo Bonzini Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, Anthony Liguori, qemu-devel@nongnu.org On 09/27/2012 12:22 PM, Paolo Bonzini wrote: > On 27/09/2012 12:08, Avi Kivity wrote: >>>>>> virtio-net is per-spec not going through the iommu. >>>> >> What's that smell? A feature bit? >>> > >>> > Why is it a bad thing that it does not go through the iommu? >> You can't assign the device to nested guests or to guest userspace via >> vfio. Some guests may want to use the iommu even in the guest kernel >> context. > > All good points. > > However, using the iommu means you cannot use either vhost-net or (at > least for now) per-device lock. So it closes some doors on performance > improvements... I can imagine pSeries guys prefer to keep no iommu in > virtio devices. Eventually we'll have a kernel-emulated iommu for these use cases. We can use the shadow code to generate the gvioaddr -> gpa -> hpa translation (like we use shadow code to fold the ngpa -> gpa -> hpa translation into a single ngpa -> hpa translation with nested npt). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-27 9:16 ` Avi Kivity 2012-09-27 9:29 ` Paolo Bonzini @ 2012-09-29 9:20 ` liu ping fan 2012-09-30 8:13 ` Avi Kivity 1 sibling, 1 reply; 45+ messages in thread From: liu ping fan @ 2012-09-29 9:20 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On Thu, Sep 27, 2012 at 5:16 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/27/2012 05:13 AM, liu ping fan wrote: >> On Mon, Sep 24, 2012 at 5:42 PM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/24/2012 10:32 AM, liu ping fan wrote: >>>> On Mon, Sep 24, 2012 at 3:44 PM, Avi Kivity <avi@redhat.com> wrote: >>>>> On 09/24/2012 08:33 AM, liu ping fan wrote: >>>>>> On Wed, Sep 19, 2012 at 5:50 PM, Avi Kivity <avi@redhat.com> wrote: >>>>>> > On 09/19/2012 12:34 PM, Jan Kiszka wrote: >>>>>> >> >>>>>> >> What about the following: >>>>>> >> >>>>>> >> What we really need to support in practice is MMIO access triggers RAM >>>>>> >> access of device model. Scenarios where a device access triggers another >>>>>> >> MMIO access could likely just be rejected without causing troubles. >>>>>> >> >>>>>> >> So, when we dispatch a request to a device, we mark that the current >>>>>> >> thread is in a MMIO dispatch and reject any follow-up c_p_m_rw that does >>>>>> >> _not_ target RAM, ie. is another, nested MMIO request - independent of >>>>>> >> its destination. How much of the known issues would this solve? And what >>>>>> >> would remain open? >>>>>> > >>>>>> > Various iommu-like devices re-dispatch I/O, like changing endianness or >>>>>> > bitband. I don't know whether it targets I/O rather than RAM. >>>>>> > >>>>>> Have not found the exact code. But I think the call chain may look >>>>>> like this: dev mmio-handler --> c_p_m_rw() --> iommu mmio-handler --> >>>>>> c_p_m_rw() >>>>>> And I think you worry about the case for "c_p_m_rw() --> iommu >>>>>> mmio-handler". Right? How about introduce an member can_nest for >>>>>> MemoryRegionOps of iommu's mr? >>>>>> >>>>> >>>>> I would rather push the iommu logic into the memory API: >>>>> >>>>> memory_region_init_iommu(MemoryRegion *mr, const char *name, >>>>> MemoryRegion *target, MemoryRegionIOMMUOps *ops, >>>>> unsigned size) >>>>> >>>>> struct MemoryRegionIOMMUOps { >>>>> target_physical_addr_t (*translate)(target_physical_addr_t addr, >>>>> bool write); >>>>> void (*fault)(target_physical_addr_t addr); >>>>> }; >>>>> >>>> So I guess, after introduce this, the code logic in c_p_m_rw() will >>>> look like this >>>> >>>> c_p_m_rw(dev_virt_addr, ...) >>>> { >>>> mr = phys_page_lookup(); >>>> if (mr->iommu_ops) >>>> real_addr = translate(dev_virt_addr,..); >>>> >>>> ptr = qemu_get_ram_ptr(real_addr); >>>> memcpy(buf, ptr, sz); >>>> } >>>> >>> >>> Something like that. It will be a while loop, to allow for iommus >>> strung in series. >>> >> Will model the system like the following: >> >> --.Introduce iommu address space. It will be the container of the >> regions which are put under the management of iommu. >> --.In the system address space, using alias-iommu-mrX with priority=1 >> to expose iommu address space and obscure the overlapped regions. >> -- Device's access to address manged by alias-iommu-mrX >> c_p_m_rw(target_physical_addr_t addrA, ..) 
>> { >> while (len > 0) { >> mr = phys_page_lookup(); >> if (mr->iommu_ops) >> addrB = translate(addrA,..); >> >> ptr = qemu_get_ram_ptr(addrB); >> memcpy(buf, ptr, sz); >> } >> } >> >> Is it correct? > > iommus only apply to device accesses, not cpu accesses (as in > cpu_p_m_w()). So we will need a generic dma function: > Yes. While modelling it, I found that c_p_m_rw() operates on the MMU's output, which is like the translation result of an IOMMU. But the iommu here, I think, is just a very limited device -- it adjusts bitband and endianness, so it maps addresses back into itself. > typedef struct MemoryAddressSpace { > MemoryRegion *root; > PhysPageEntry phys_map; > ... > // linked list entry for list of all MemoryAddressSpaces > } > > void memory_address_space_rw(MemoryAddressSpace *mas, ...) > { > look up mas->phys_map > dispatch > } > > void cpu_physical_memory_rw(...) > { > memory_address_space_rw(&system_memory, ...); > } > > The snippet > > if (mr->iommu_ops) > addrB = translate(addrA,..); > > needs to be a little more complicated. After translation, we need to > look up the address again in a different phys_map. So a MemoryRegion > that is an iommu needs to hold its own phys_map pointer for the lookup. > > But let's ignore the problem for now, we have too much on our plate. > With a recursive big lock, there is no problem with iommus, yes? So as Do we have iommus in qemu now, since there are no separate phys_maps for real address and dev's virt address, and I think the iommu is only needed by host, not guest, so need not emulated by qemu. If no, we can just reject the nested DMA, and the c_p_m_rw() can only be nested once, so if introduce a wrapper for c_p_m_rw(), we can avoid recursive big lock, right? Regards, pingfan > long as there is no intersection between converted devices and platforms with iommus, we're safe. > > -- > error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-29 9:20 ` liu ping fan @ 2012-09-30 8:13 ` Avi Kivity 2012-09-30 8:48 ` liu ping fan 2012-09-30 11:04 ` Blue Swirl 0 siblings, 2 replies; 45+ messages in thread From: Avi Kivity @ 2012-09-30 8:13 UTC (permalink / raw) To: liu ping fan Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On 09/29/2012 11:20 AM, liu ping fan wrote: > > Do we have iommus in qemu now, We do, but they're hacked into the scsi layer, see hw/sun4m_iommu.c. I don't know if it's a standalone iommu on real hardware or whether it is part of the HBA. > since there are no separate phys_maps > for real address and dev's virt address, and I think the iommu is only > needed by host, not guest, so need not emulated by qemu. Eventually we will emulate iommus for x86 too, so we need to consider them. > If no, we > can just reject the nested DMA, and the c_p_m_rw() can only be nested > once, so if introduce a wrapper for c_p_m_rw(), we can avoid > recursive big lock, right? Don't we need that for other reasons? If not, we can drop it for now. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-30 8:13 ` Avi Kivity @ 2012-09-30 8:48 ` liu ping fan 2012-09-30 11:18 ` Avi Kivity 0 siblings, 1 reply; 45+ messages in thread From: liu ping fan @ 2012-09-30 8:48 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On Sun, Sep 30, 2012 at 4:13 PM, Avi Kivity <avi@redhat.com> wrote: > On 09/29/2012 11:20 AM, liu ping fan wrote: >> >> Do we have iommus in qemu now, > > We do, but they're hacked into the scsi layer, see hw/sun4m_iommu.c. I > don't know if it's a standalone iommu on real hardware or whether it is > part of the HBA. > >> since there are no separate phys_maps >> for real address and dev's virt address, and I think the iommu is only >> needed by host, not guest, so need not emulated by qemu. > > Eventually we will emulate iommus for x86 too, so we need to consider them. > For nested guest? The address translation is like nested mmu? >> If no, we >> can just reject the nested DMA, and the c_p_m_rw() can only be nested >> once, so if introduce a wrapper for c_p_m_rw(), we can avoid >> recursive big lock, right? > > Don't we need that for other reasons? If not, we can drop it for now. > Yes, there is another reason: the unplug process. As you suggested, DeviceState::uninit() can delete a timer with the biglock nested. Or, as an alternative, we delete the timer at unplug time and check the timer's validity at mmio dispatch in e1000. Regards, pingfan > -- > error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-30 8:48 ` liu ping fan @ 2012-09-30 11:18 ` Avi Kivity 0 siblings, 0 replies; 45+ messages in thread From: Avi Kivity @ 2012-09-30 11:18 UTC (permalink / raw) To: liu ping fan Cc: Jan Kiszka, Marcelo Tosatti, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On 09/30/2012 10:48 AM, liu ping fan wrote: > On Sun, Sep 30, 2012 at 4:13 PM, Avi Kivity <avi@redhat.com> wrote: >> On 09/29/2012 11:20 AM, liu ping fan wrote: >>> >>> Do we have iommus in qemu now, >> >> We do, but they're hacked into the scsi layer, see hw/sun4m_iommu.c. I >> don't know if it's a standalone iommu on real hardware or whether it is >> part of the HBA. >> >>> since there are no separate phys_maps >>> for real address and dev's virt address, and I think the iommu is only >>> needed by host, not guest, so need not emulated by qemu. >> >> Eventually we will emulate iommus for x86 too, so we need to consider them. >> > For nested guest? The address translation is like nested mmu? Yes. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-30 8:13 ` Avi Kivity 2012-09-30 8:48 ` liu ping fan @ 2012-09-30 11:04 ` Blue Swirl 2012-09-30 11:17 ` Avi Kivity 1 sibling, 1 reply; 45+ messages in thread From: Blue Swirl @ 2012-09-30 11:04 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On Sun, Sep 30, 2012 at 8:13 AM, Avi Kivity <avi@redhat.com> wrote: > On 09/29/2012 11:20 AM, liu ping fan wrote: >> >> Do we have iommus in qemu now, > > We do, but they're hacked into the scsi layer, see hw/sun4m_iommu.c. I > don't know if it's a standalone iommu on real hardware or whether it is > part of the HBA. It's standalone or even part of CPU (some uniprocessor systems). IOMMU sits between memory and SBus, so all SBus devices (currently only ESP and Lance) use it. > >> since there are no separate phys_maps >> for real address and dev's virt address, and I think the iommu is only >> needed by host, not guest, so need not emulated by qemu. > > Eventually we will emulate iommus for x86 too, so we need to consider them. > >> If no, we >> can just reject the nested DMA, and the c_p_m_rw() can only be nested >> once, so if introduce a wrapper for c_p_m_rw(), we can avoid >> recursive big lock, right? > > Don't we need that for other reasons? If not, we can drop it for now. I don't think nested DMA is ever needed, it would be a pretty obscure feature and it would need a pretty heavy implementation (recursion) in a real HW IOMMU. In theory the translated address may map to MMIO but that's different. > > -- > error compiling committee.c: too many arguments to function > ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-30 11:04 ` Blue Swirl @ 2012-09-30 11:17 ` Avi Kivity 2012-09-30 11:48 ` Blue Swirl 0 siblings, 1 reply; 45+ messages in thread From: Avi Kivity @ 2012-09-30 11:17 UTC (permalink / raw) To: Blue Swirl Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On 09/30/2012 01:04 PM, Blue Swirl wrote: > On Sun, Sep 30, 2012 at 8:13 AM, Avi Kivity <avi@redhat.com> wrote: >> On 09/29/2012 11:20 AM, liu ping fan wrote: >>> >>> Do we have iommus in qemu now, >> >> We do, but they're hacked into the scsi layer, see hw/sun4m_iommu.c. I >> don't know if it's a standalone iommu on real hardware or whether it is >> part of the HBA. > > It's standalone or even part of CPU (some uniprocessor systems). IOMMU > sits between memory and SBus, so all SBus devices (currently only ESP > and Lance) use it. So, the current modelling is incorrect? I'd like to fix it as a way of getting proper iommu modelling, but I don't know how it should be done, or how to test it. > >> >>> since there are no separate phys_maps >>> for real address and dev's virt address, and I think the iommu is only >>> needed by host, not guest, so need not emulated by qemu. >> >> Eventually we will emulate iommus for x86 too, so we need to consider them. >> >>> If no, we >>> can just reject the nested DMA, and the c_p_m_rw() can only be nested >>> once, so if introduce a wrapper for c_p_m_rw(), we can avoid >>> recursive big lock, right? >> >> Don't we need that for other reasons? If not, we can drop it for now. > > I don't think nested DMA is ever needed, it would be a pretty obscure > feature and it would need a pretty heavy implementation (recursion) in > a real HW IOMMU. In theory the translated address may map to MMIO but > that's different. Sure. But if we can get a model that works for everything at no extra cost, then why not? btw, the model of 'either we take the big lock recursively, or we drop the local lock before issuing dma' seems to cover nested iommus with any mix of big locks and little locks. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
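That rule, sketched from the point of view of a converted device. DevAState, its lock and devA_update_regs() are hypothetical, and cpu_physical_memory_rw() is assumed to take the big lock internally, recursively if the caller already holds it:

/* Sketch: an MMIO handler of a device with a private lock drops that lock
 * before any DMA, preserving the private-lock -> big-lock ordering. */
static void devA_mmio_write(void *opaque, target_phys_addr_t addr,
                            uint64_t val, unsigned size)
{
    DevAState *s = opaque;
    bool kick_dma;

    qemu_mutex_lock(&s->lock);
    kick_dma = devA_update_regs(s, addr, val);  /* assumed helper */
    qemu_mutex_unlock(&s->lock);    /* dropped before issuing DMA */

    if (kick_dma) {
        /* may reach a big-lock device or an iommu; the big lock is taken
         * (recursively, if already held) inside, never under s->lock */
        cpu_physical_memory_rw(s->dma_addr, s->dma_buf, s->dma_len, 1);
    }
}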
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-30 11:17 ` Avi Kivity @ 2012-09-30 11:48 ` Blue Swirl 0 siblings, 0 replies; 45+ messages in thread From: Blue Swirl @ 2012-09-30 11:48 UTC (permalink / raw) To: Avi Kivity Cc: Jan Kiszka, Marcelo Tosatti, liu ping fan, qemu-devel@nongnu.org, Anthony Liguori, Paolo Bonzini On Sun, Sep 30, 2012 at 11:17 AM, Avi Kivity <avi@redhat.com> wrote: > On 09/30/2012 01:04 PM, Blue Swirl wrote: >> On Sun, Sep 30, 2012 at 8:13 AM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/29/2012 11:20 AM, liu ping fan wrote: >>>> >>>> Do we have iommus in qemu now, >>> >>> We do, but they're hacked into the scsi layer, see hw/sun4m_iommu.c. I >>> don't know if it's a standalone iommu on real hardware or whether it is >>> part of the HBA. >> >> It's standalone or even part of CPU (some uniprocessor systems). IOMMU >> sits between memory and SBus, so all SBus devices (currently only ESP >> and Lance) use it. > > So, the current modelling is incorrect? I'd like to fix it as a way of > getting proper iommu modelling, but I don't know how it should be done, > or how to test it. The model (opaque IOMMU handle + ESP/Lance specific accessors) could be improved a lot; it predates all IOMMU discussions. Just grab any ISO (preferably a sparc32 one) and try booting. If it fails, there's a problem, because the IOMMU is enabled by OpenBIOS. > >> >>> >>>> since there are no separate phys_maps >>>> for real address and dev's virt address, and I think the iommu is only >>>> needed by host, not guest, so need not emulated by qemu. >>> >>> Eventually we will emulate iommus for x86 too, so we need to consider them. >>> >>>> If no, we >>>> can just reject the nested DMA, and the c_p_m_rw() can only be nested >>>> once, so if introduce a wrapper for c_p_m_rw(), we can avoid >>>> recursive big lock, right? >>> >>> Don't we need that for other reasons? If not, we can drop it for now. >> >> I don't think nested DMA is ever needed, it would be a pretty obscure >> feature and it would need a pretty heavy implementation (recursion) in >> a real HW IOMMU. In theory the translated address may map to MMIO but >> that's different. > > Sure. But if we can get a model that works for everything at no extra > cost, then why not? That's OK, I'm not opposing it. > > btw, the model of 'either we take the big lock recursively, or we drop > the local lock before issuing dma' seems to cover nested iommus with any > mix of big locks and little locks. > > -- > error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] [big lock] Discussion about the convention of device's DMA each other after breaking down biglock 2012-09-19 9:34 ` Jan Kiszka 2012-09-19 9:50 ` Avi Kivity @ 2012-09-20 8:11 ` liu ping fan 1 sibling, 0 replies; 45+ messages in thread From: liu ping fan @ 2012-09-20 8:11 UTC (permalink / raw) To: Jan Kiszka Cc: Paolo Bonzini, Marcelo Tosatti, Avi Kivity, Anthony Liguori, qemu-devel@nongnu.org On Wed, Sep 19, 2012 at 5:34 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote: > On 2012-09-19 11:00, liu ping fan wrote: >> On Wed, Sep 19, 2012 at 4:06 PM, Avi Kivity <avi@redhat.com> wrote: >>> On 09/19/2012 06:02 AM, liu ping fan wrote: >>>> Currently, cpu_physical_memory_rw() can be used directly or indirectly >>>> by mmio-dispatcher to access other devices' memory region. This can >>>> cause some problem when adopting device's private lock. >>>> >>>> Back ground refer to: >>>> http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01481.html >>>> For lazy, just refer to: >>>> http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg01878.html >>>> >>>> >>>> --1st. the recursive lock of biglock. >>>> If we leave c_p_m_rw() as it is, ie, no lock inside. Then we can have >>>> the following (section of the whole call chain, and with >>>> private_lockA): >>>> lockA-mmio-dispatcher --> hold biglock -- >c_p_m_rw() --- > >>>> Before c_p_m_rw(), we drop private_lockA to anti the possibly of >>>> deadlock. But we can not anti the nested of this chain or calling to >>>> another lockB-mmio-dispatcher. So we can not avoid the possibility of >>>> nested lock of biglock. And another important factor is that we break >>>> the lock sequence: private_lock-->biglock. >>>> All of these require us to push biglock's holding into c_p_m_rw(), the >>>> wrapper can not give help. >>> >>> I agree that this is unavoidable. >>> >>>> >>>> --2nd. c_p_m_rw(), sync or async? >>>> >>>> IF we convert all of the device to be protected by refcount, then we can have >>>> //no big lock >>>> c_p_m_rw() >>>> { >>>> devB->ref++; >>>> { >>>> --------------------------------------->pushed onto another thread. >>>> lock_privatelock >>>> mr->ops->write(); >>>> unlock_privatelock >>>> } >>>> wait_for_completion(); >>>> devB->ref--; >>>> } >>>> This model can help c_p_m_rw() present as a SYNC API. But currently, >>>> we mix biglock and private lock together, and wait_for_completion() >>>> maybe block the release of big lock, which finally causes deadlock. So >>>> we can not simply rely on this model. >>>> Instead, we need to classify the calling scene into three cases: >>>> case1. lockA--dispatcher ---> lockB-dispatcher //can use >>>> async+completion model >>>> case2. lockA--dispatcher ---> biglock-dispatcher // sync, but can >>>> cause the nested lock of biglock >>>> case3. biglock-dispacher ---> lockB-dispatcher // async to avoid >>>> the lock sequence problem, (as to completion, it need to be placed >>>> outside the top level biglock, and it is hard to do so. Suggest to >>>> change to case 1. Or at present, just leave it async) >>>> >>>> This new model will require the biglock can be nested. >>> >>> I think changing to an async model is too complicated. It's difficult >>> enough already. Isn't dropping private locks + recursive big locks >>> sufficient? >>> >> I think that "dropping private locks + recursive big locks" just cover >> case 2. And most of the important, it dont describe case3 which break >> the rule of lock sequence "private-lock --> biglock". Scene: >> devA_lock-->(devX_with-biglock--->devB_lock). 
>> I just want to classify and post these cases to discuss. Maybe we can >> achieve without async. > > What about the following: > > What we really need to support in practice is MMIO access triggers RAM > access of device model. Scenarios where a device access triggers another > MMIO access could likely just be rejected without causing troubles. > > So, when we dispatch a request to a device, we mark that the current > thread is in a MMIO dispatch and reject any follow-up c_p_m_rw that does > _not_ target RAM, ie. is another, nested MMIO request - independent of > its destination. How much of the known issues would this solve? And what > would remain open? > This makes things easier and more realistic. I will dig into the code to see whether there are exceptions or not. Thanks and regards, pingfan > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 45+ messages in thread
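Jan's marking scheme could be prototyped along these lines; the per-thread flag, the dispatch_one() split and the use of memory_region_is_ram() are assumptions about where the check would sit in the dispatch path:

/* Sketch: mark the current thread while an MMIO dispatch is in flight and
 * reject any follow-up access that does not target plain RAM. */
static __thread bool in_mmio_dispatch;

static void dispatch_one(MemoryRegion *mr, target_phys_addr_t addr,
                         uint64_t *val, unsigned size, bool is_write)
{
    if (memory_region_is_ram(mr)) {
        /* RAM target: ordinary memcpy path, always permitted */
        return;
    }
    if (in_mmio_dispatch) {
        /* nested MMIO request: reject it instead of re-entering a device */
        return;
    }
    in_mmio_dispatch = true;
    if (is_write) {
        mr->ops->write(mr->opaque, addr, *val, size);
    } else {
        *val = mr->ops->read(mr->opaque, addr, size);
    }
    in_mmio_dispatch = false;
}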