hexagon: modeling a shared lock state

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* hexagon: modeling a shared lock state
@ 2024-01-24 15:07 Brian Cain
  2024-01-25  0:22 ` Richard Henderson
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Cain @ 2024-01-24 15:07 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: qemu-devel@nongnu.org, Sid Manning, Marco Liebel,
	Matheus Bernardino, richard.henderson@linaro.org

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]

Philippe,

For hexagon sysemu, while internally reviewing patches to be upstreamed we noticed that our design for a lock instruction is not quite suitable.  The k0lock instruction will halt if some other hexagon hardware CPU has already claimed the kernel lock, only to continue executing once some CPU executes the unlock instruction.  We modeled this using a lock state enumeration member { OWNER, WAITING, UNLOCKED } in *each* vCPU and atomically transitioning the lock required us to have vCPU[n] write the updated lock state to vCPU[m] when unlocking.

In context: https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/hexagon/op_helper.c#L1790-L1821

So instead, maybe it makes more sense to have a system device hold a single representation of the lock's state and each vCPU can do some kind of access via load/store and/or interrupts to/from the device?  I was thinking of raising an interrupt on the lock device at the vCPU's k0lock instruction to indicate demand for the lock, and then the device could raise an interrupt to one of the vCPUs when it's granted the lock.  One drawback for this is that this might add significant latency to the uncontested lock case.  So I also considered doing a load of some part of the lock device's memory that could indicate whether the lock was available, and if available it would claim it with a store (both ld/st while holding BQL).  Only if unavailable it would halt and raise the interrupt.  Is it possible to create an address space that's independent of the true system memory map for this use case or would we need to carve out some memory from the existing memory map for this synthetic device?  Or - do you have a suggestion for a better approach overall?

-Brian

[-- Attachment #2: Type: text/html, Size: 3835 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: hexagon: modeling a shared lock state
  2024-01-24 15:07 hexagon: modeling a shared lock state Brian Cain
@ 2024-01-25  0:22 ` Richard Henderson
  2024-01-25 16:28   ` Brian Cain
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Henderson @ 2024-01-25  0:22 UTC (permalink / raw)
  To: Brian Cain, Philippe Mathieu-Daudé
  Cc: qemu-devel@nongnu.org, Sid Manning, Marco Liebel,
	Matheus Bernardino

On 1/25/24 01:07, Brian Cain wrote:
> Philippe,
> 
> For hexagon sysemu, while internally reviewing patches to be upstreamed we noticed that 
> our design for a lock instruction is not quite suitable.  The k0lock instruction will halt 
> if some other hexagon hardware CPU has already claimed the kernel lock, only to continue 
> executing once some CPU executes the unlock instruction.  We modeled this using a lock 
> state enumeration member { OWNER, WAITING, UNLOCKED } in **each** vCPU and atomically 
> transitioning the lock required us to have vCPU[n] write the updated lock state to vCPU[m] 
> when unlocking.
> 
> In context: 
> https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/hexagon/op_helper.c#L1790-L1821 <https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/hexagon/op_helper.c#L1790-L1821>
> 
> So instead, maybe it makes more sense to have a system device hold a single representation 
> of the lock’s state and each vCPU can do some kind of access via load/store and/or 
> interrupts to/from the device?  I was thinking of raising an interrupt on the lock device 
> at the vCPU’s k0lock instruction to indicate demand for the lock, and then the device 
> could raise an interrupt to one of the vCPUs when it’s granted the lock.  One drawback for 
> this is that this might add significant latency to the uncontested lock case.  So I also 
> considered doing a load of some part of the lock device’s memory that could indicate 
> whether the lock was available, and if available it would claim it with a store (both 
> ld/st while holding BQL).  Only if unavailable it would halt and raise the interrupt.  Is 
> it possible to create an address space that’s independent of the true system memory map 
> for this use case or would we need to carve out some memory from the existing memory map 
> for this synthetic device?  Or – do you have a suggestion for a better approach overall?

I think you are over-thinking this.  A system device would just get in the way.  I think 
you want something like

   struct CPUHexagonState {
     ...
     bool hwlock_pending;
   }

   hexagon_cpu_has_work() {
     if (cpu->hwlock_pending) {
       return false;
     }
   }

   static void do_hwlock(CPUHexagonState *env, bool *lock)
   {
     bql_lock();

     if (*lock) {
       env->hwlock_pending = true;
       cs->halted = true;
       cs->exception_index = EXCP_HALTED;
       bql_unlock();
       cpu_loop_exit(cs);
     } else {
       *lock = true;
       bql_unlock();
     }
   }

   static void do_hwunlock(CPUHexagonState *env, bool *lock)
   {
     CPUState *cs;
     BQL_LOCK_GUARD();

     *lock = false;

     CPU_FOREACH(cs) {
       if (cs->hwlock_pending) {
         cs->hwlock_pending = false;
         qemu_cpu_kick(cs);
       }
     }
   }

   static bool k0lock, tlblock;

   void HELPER(k0lock)(CPUHexagonState *env)
   void HELPER(tlblock)(CPUHexagonState *env)
   void HELPER(k0unlock)(CPUHexagonState *env)
   void HELPER(tlbunlock)(CPUHexagonState *env)

Open questions are:

(1) Do interrupts cancel lock wait?  Either way, placement in hexagon_cpu_has_work is key.

(2) I assume that the pc will not be advanced, so that after the kick we will re-execute 
the hwlock instruction.  There will be a thundering herd racing to grab the lock again, 
but it saves queuing logic, which might be complicated if interrupts are involved.


r~


^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: hexagon: modeling a shared lock state
  2024-01-25  0:22 ` Richard Henderson
@ 2024-01-25 16:28   ` Brian Cain
  2024-01-27  2:19     ` Richard Henderson
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Cain @ 2024-01-25 16:28 UTC (permalink / raw)
  To: Richard Henderson, Philippe Mathieu-Daudé
  Cc: qemu-devel@nongnu.org, Sid Manning, Marco Liebel,
	Matheus Bernardino



> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Wednesday, January 24, 2024 6:22 PM
> To: Brian Cain <bcain@quicinc.com>; Philippe Mathieu-Daudé
> <philmd@linaro.org>
> Cc: qemu-devel@nongnu.org; Sid Manning <sidneym@quicinc.com>; Marco
> Liebel <mliebel@qti.qualcomm.com>; Matheus Bernardino
> <mathbern@qti.qualcomm.com>
> Subject: Re: hexagon: modeling a shared lock state
> 
> WARNING: This email originated from outside of Qualcomm. Please be wary of
> any links or attachments, and do not enable macros.
> 
> On 1/25/24 01:07, Brian Cain wrote:
> > Philippe,
> >
> > For hexagon sysemu, while internally reviewing patches to be upstreamed we
> noticed that
> > our design for a lock instruction is not quite suitable.  The k0lock instruction
> will halt
> > if some other hexagon hardware CPU has already claimed the kernel lock,
> only to continue
> > executing once some CPU executes the unlock instruction.  We modeled this
> using a lock
> > state enumeration member { OWNER, WAITING, UNLOCKED } in **each**
> vCPU and atomically
> > transitioning the lock required us to have vCPU[n] write the updated lock
> state to vCPU[m]
> > when unlocking.
> >
> > In context:
> >
> https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/he
> xagon/op_helper.c#L1790-L1821
> <https://github.com/quic/qemu/blob/hexagon_sysemu_20_dec_2023/target/h
> exagon/op_helper.c#L1790-L1821>
> >
> > So instead, maybe it makes more sense to have a system device hold a single
> representation
> > of the lock’s state and each vCPU can do some kind of access via load/store
> and/or
> > interrupts to/from the device?  I was thinking of raising an interrupt on the
> lock device
> > at the vCPU’s k0lock instruction to indicate demand for the lock, and then the
> device
> > could raise an interrupt to one of the vCPUs when it’s granted the lock.  One
> drawback for
> > this is that this might add significant latency to the uncontested lock case.  So
> I also
> > considered doing a load of some part of the lock device’s memory that could
> indicate
> > whether the lock was available, and if available it would claim it with a store
> (both
> > ld/st while holding BQL).  Only if unavailable it would halt and raise the
> interrupt.  Is
> > it possible to create an address space that’s independent of the true system
> memory map
> > for this use case or would we need to carve out some memory from the
> existing memory map
> > for this synthetic device?  Or – do you have a suggestion for a better
> approach overall?
> 
> I think you are over-thinking this.  A system device would just get in the way.  I

Ok, great - our existing design is ~roughly like this.  But we can explore this -- thanks for writing this example!

> think
> you want something like
> 
>    struct CPUHexagonState {
>      ...
>      bool hwlock_pending;
>    }
> 
>    hexagon_cpu_has_work() {
>      if (cpu->hwlock_pending) {
>        return false;
>      }
>    }
> 
>    static void do_hwlock(CPUHexagonState *env, bool *lock)
>    {
>      bql_lock();
> 
>      if (*lock) {
>        env->hwlock_pending = true;
>        cs->halted = true;
>        cs->exception_index = EXCP_HALTED;
>        bql_unlock();
>        cpu_loop_exit(cs);

In place of the above - we have cpu_interrupt(cs, CPU_INTERRUPT_HALT) -- but is that equivalent?  Is there one idiom that's preferred over another?  Somehow it seems simpler if we don't need to longjmp and IIRC some of these patterns do.

>      } else {
>        *lock = true;
>        bql_unlock();
>      }
>    }
> 
>    static void do_hwunlock(CPUHexagonState *env, bool *lock)
>    {
>      CPUState *cs;
>      BQL_LOCK_GUARD();
> 
>      *lock = false;
> 
>      CPU_FOREACH(cs) {
>        if (cs->hwlock_pending) {
>          cs->hwlock_pending = false;
>          qemu_cpu_kick(cs);
>        }
>      }
>    }
> 
>    static bool k0lock, tlblock;
> 
>    void HELPER(k0lock)(CPUHexagonState *env)
>    void HELPER(tlblock)(CPUHexagonState *env)
>    void HELPER(k0unlock)(CPUHexagonState *env)
>    void HELPER(tlbunlock)(CPUHexagonState *env)
> 
> Open questions are:
> 
> (1) Do interrupts cancel lock wait?  Either way, placement in
> hexagon_cpu_has_work is key.

Yeah - they do, we will double check the interaction w has_work.

> (2) I assume that the pc will not be advanced, so that after the kick we will re-
> execute
> the hwlock instruction.  There will be a thundering herd racing to grab the lock
> again,
> but it saves queuing logic, which might be complicated if interrupts are
> involved.

Yes that's right, too.  Thanks!

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: hexagon: modeling a shared lock state
  2024-01-25 16:28   ` Brian Cain
@ 2024-01-27  2:19     ` Richard Henderson
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Henderson @ 2024-01-27  2:19 UTC (permalink / raw)
  To: Brian Cain, Philippe Mathieu-Daudé
  Cc: qemu-devel@nongnu.org, Sid Manning, Marco Liebel,
	Matheus Bernardino

On 1/26/24 02:28, Brian Cain wrote:
>>     static void do_hwlock(CPUHexagonState *env, bool *lock)
>>     {
>>       bql_lock();
>>
>>       if (*lock) {
>>         env->hwlock_pending = true;
>>         cs->halted = true;
>>         cs->exception_index = EXCP_HALTED;
>>         bql_unlock();
>>         cpu_loop_exit(cs);
> 
> In place of the above - we have cpu_interrupt(cs, CPU_INTERRUPT_HALT) -- but is that equivalent?

No, it is not.  Raising an interrupt will not take effect immediately.

>  Is there one idiom that's preferred over another?  Somehow it seems simpler if we don't need to longjmp and IIRC some of these patterns do.

You need the longjmp to halt and stop execution without completing the current insn, so 
that the insn can be restarted later.

Oh!  Just remembered.   You'll want cpu_loop_exit_restore() there, so that pc is updated 
properly.  Better to unwind in the contended case than require the pc be updated along the 
fast path uncontended case.


r~


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-01-27  2:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-24 15:07 hexagon: modeling a shared lock state Brian Cain
2024-01-25  0:22 ` Richard Henderson
2024-01-25 16:28   ` Brian Cain
2024-01-27  2:19     ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).