From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	jbeulich@suse.com, keir@xen.org, xen-devel@lists.xenproject.org
Subject: Re: [RFC PATCH v1 1/5] tasklet: Introduce per-cpu tasklet for softirq.
Date: Wed, 27 Aug 2014 19:53:22 +0100
Message-ID: <53FE2922.7000108@citrix.com>
In-Reply-To: <1409162329-6094-2-git-send-email-konrad.wilk@oracle.com>

On 27/08/14 18:58, Konrad Rzeszutek Wilk wrote:
> This implements a lockless per-cpu tasklet mechanism.
>
> The existing tasklet mechanism has a single global
> spinlock that is taken every time the global list
> is touched, and we use this lock quite a lot: we
> take it in do_tasklet_work - which is called via a
> softirq and from the idle loop - and on any
> operation on the tasklet_list.
>
> The problem we are facing is that quite a lot of
> tasklets get scheduled. The most commonly invoked one
> injects VIRQ_TIMER into the guest. Guests are sane and
> do not set their one-shot or periodic clocks to sub-1ms
> intervals (which would cause said tasklet to be
> scheduled at such small intervals).
>
> The problem appears when PCI passthrough devices are used
> across many sockets and we have a mix of interrupt-heavy
> guests and idle guests. The idle guests end up seeing
> 1/10 of their RUNNING timeslice eaten by the hypervisor
> (and 40% steal time).
>
> PCI interrupts are injected via hvm_do_IRQ_dpci, which
> schedules the hvm_dirq_assist tasklet every time an
> interrupt is received.
> The callchain is:
>
> _asm_vmexit_handler
>  -> vmx_vmexit_handler
>     ->vmx_do_extint
>         -> do_IRQ
>             -> __do_IRQ_guest
>                 -> hvm_do_IRQ_dpci
>                    tasklet_schedule(&dpci->dirq_tasklet);
>                    [takes lock to put the tasklet on]
>
> [later on, schedule_tail is invoked, which is 'vmx_do_resume']
>
> vmx_do_resume
>  -> vmx_asm_do_vmentry
>         -> call vmx_intr_assist
>           -> vmx_process_softirqs
>             -> do_softirq
>               [executes the tasklet function, takes the
>                lock again]
>
> Meanwhile, other CPUs might be sitting in the idle loop
> and be invoked to deliver a VIRQ_TIMER, which also ends
> up taking the lock twice: first to schedule
> v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
> the guest's BLOCKED state); then to execute it - which is
> accounted to the guest's RUNNING state.
>
> The end result is that on an 8-socket machine with
> PCI passthrough, where four sockets are busy with interrupts
> and the other sockets have idle guests, we end up with
> the idle guests having around 40% steal time and 1/10
> of their timeslice (3ms out of 30ms) tied up
> taking the lock. The latency of PCI interrupts delivered
> to guests also suffers.
>
> With this patch the problem disappears completely;
> that is, the lock is removed for the PCI passthrough
> use-case (the 'hvm_dirq_assist' case).
>
> As such this patch introduces the code to set up
> softirq per-cpu tasklets and only converts the PCI
> passthrough case instead of doing it wholesale. This
> is done because:
>  - We want to easily bisect it if things break.
>  - We modify the code one section at a time to
>    make it easier to review this core code.
>
> Now on to the code itself. This is based on the
> per-cpu implementation of tasklets in Linux
> (softirq.c). However, there are differences:
>  - This patch executes one tasklet at a time - similar
>    to how the existing implementation does it.
>  - We use a doubly-linked list instead of a singly-linked
>    list. We could use a singly-linked list, but folks are
>    more familiar with the 'list_*' type macros.
>  - This patch does not have the cross-CPU feeders
>    implemented. That code is in the patch
>    titled: tasklet: Add cross CPU feeding of per-cpu
>    tasklets. This is done to support:
>    "tasklet_schedule_on_cpu"
>  - We add a temporary 'TASKLET_SOFTIRQ_PERCPU' which
>    can co-exist with TASKLET_SOFTIRQ. It will be
>    replaced in 'tasklet: Remove the old-softirq
>    implementation'.
>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  xen/arch/x86/hvm/hvm.c       |   2 +-
>  xen/common/tasklet.c         | 129 +++++++++++++++++++++++++++++++++++++++++--
>  xen/drivers/passthrough/io.c |   2 +-
>  xen/include/xen/softirq.h    |   1 +
>  xen/include/xen/tasklet.h    |  61 ++++++++++++++++++--
>  5 files changed, 184 insertions(+), 11 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 94b18ba..4b4cad1 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2270,7 +2270,7 @@ int hvm_vcpu_initialise(struct vcpu *v)
>      if ( (rc = hvm_funcs.vcpu_initialise(v)) != 0 ) /* teardown: hvm_funcs.vcpu_destroy */
>          goto fail3;
>  
> -    softirq_tasklet_init(
> +    percpu_tasklet_init(
>          &v->arch.hvm_vcpu.assert_evtchn_irq_tasklet,
>          (void(*)(unsigned long))hvm_assert_evtchn_irq,
>          (unsigned long)v);
> diff --git a/xen/common/tasklet.c b/xen/common/tasklet.c
> index 4e42fa7..319866f 100644
> --- a/xen/common/tasklet.c
> +++ b/xen/common/tasklet.c
> @@ -31,10 +31,30 @@ static DEFINE_PER_CPU(struct list_head, softirq_tasklet_list);
>  /* Protects all lists and tasklet structures. */
>  static DEFINE_SPINLOCK(tasklet_lock);
>  
> +static DEFINE_PER_CPU(struct list_head, softirq_list);
> +
>  static void tasklet_enqueue(struct tasklet *t)
>  {
>      unsigned int cpu = t->scheduled_on;
>  
> +    if ( t->is_percpu )
> +    {
> +        unsigned long flags;
> +        struct list_head *list;
> +
> +        INIT_LIST_HEAD(&t->list);
> +        BUG_ON( !t->is_softirq );
> +        BUG_ON( cpu != smp_processor_id() ); /* Not implemented yet. */
> +
> +        local_irq_save(flags);
> +
> +        list = &__get_cpu_var(softirq_list);
> +        list_add_tail(&t->list, list);
> +        raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> +
> +        local_irq_restore(flags);

The raise_softirq() call can be done with interrupts re-enabled, which
reduces the critical window.

__get_cpu_var() does some inspection of the stack pointer behind your
back.  It would be far more efficient in the critical window to take
"unsigned int cpu = smp_processor_id();"  outside and use "per_cpu($FOO,
cpu)" inside.

> +        return;
> +    }
>      if ( t->is_softirq )
>      {
>          struct list_head *list = &per_cpu(softirq_tasklet_list, cpu);
> @@ -56,16 +76,25 @@ void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu)
>  {
>      unsigned long flags;
>  
> -    spin_lock_irqsave(&tasklet_lock, flags);
> +    if ( !tasklets_initialised || t->is_dead )
> +        return;
>  
> -    if ( tasklets_initialised && !t->is_dead )
> +    if ( t->is_percpu )
>      {
> -        t->scheduled_on = cpu;
> -        if ( !t->is_running )
> +        if ( !test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
>          {
> -            list_del(&t->list);
> +            t->scheduled_on = cpu;
>              tasklet_enqueue(t);
>          }
> +        return;
> +    }
> +    spin_lock_irqsave(&tasklet_lock, flags);
> +
> +    t->scheduled_on = cpu;
> +    if ( !t->is_running )
> +    {
> +        list_del(&t->list);
> +        tasklet_enqueue(t);
>      }
>  
>      spin_unlock_irqrestore(&tasklet_lock, flags);
> @@ -104,6 +133,66 @@ static void do_tasklet_work(unsigned int cpu, struct list_head *list)
>      }
>  }
>  
> +void do_tasklet_work_percpu(void)
> +{
> +    struct tasklet *t = NULL;
> +    struct list_head *head;
> +    bool_t poke = 0;
> +
> +    local_irq_disable();
> +    head = &__get_cpu_var(softirq_list);
> +
> +    if ( !list_empty(head) )
> +    {
> +        t = list_entry(head->next, struct tasklet, list);
> +
> +        if ( head->next == head->prev ) /* One singular item. Re-init head. */
> +            INIT_LIST_HEAD(&__get_cpu_var(softirq_list));

It would be most efficient to hoist "struct list_head *this_softirq_list
= &this_cpu(softirq_list);" outside the critical region.

> +        else
> +        {
> +            /* Multiple items, update head to skip 't'. */
> +            struct list_head *list;
> +
> +            /* One item past 't'. */
> +            list = head->next->next;
> +
> +            BUG_ON(list == NULL);
> +
> +            /* And update head to skip 't'. Note that t->list.prev still
> +             * points to head, but we don't care as we only process one
> +             * tasklet, and once done the list is re-initialised one way
> +             * or another.
> +             */
> +            head->next = list;
> +            poke = 1;
> +        }
> +    }
> +    local_irq_enable();
> +
> +    if ( !t )
> +        return; /* Never saw this happen, but there might be a spurious case. */
> +
> +    if ( tasklet_trylock(t) )
> +    {
> +        if ( !test_and_clear_bit(TASKLET_STATE_SCHED, &t->state) )
> +                BUG();
> +        sync_local_execstate();
> +        t->func(t->data);
> +        tasklet_unlock(t);
> +        if ( poke )
> +            raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> +        /* We could reinit the t->list but tasklet_enqueue does it for us. */
> +        return;
> +    }
> +
> +    local_irq_disable();
> +
> +    INIT_LIST_HEAD(&t->list);
> +    list_add_tail(&t->list, &__get_cpu_var(softirq_list));
> +    smp_wmb();

Is this needed?  All of this infrastructure is local to the CPU.

> +    raise_softirq(TASKLET_SOFTIRQ_PERCPU);
> +    local_irq_enable();
> +}
> +
>  /* VCPU context work */
>  void do_tasklet(void)
>  {
> @@ -147,10 +236,29 @@ static void tasklet_softirq_action(void)
>      spin_unlock_irq(&tasklet_lock);
>  }
>  
> +/* Per CPU softirq context work. */
> +static void tasklet_softirq_percpu_action(void)
> +{
> +    do_tasklet_work_percpu();
> +}
> +
>  void tasklet_kill(struct tasklet *t)
>  {
>      unsigned long flags;
>  
> +    if ( t->is_percpu )
> +    {
> +        while ( test_and_set_bit(TASKLET_STATE_SCHED, &t->state) )
> +        {
> +            do {
> +                process_pending_softirqs();
> +            } while ( test_bit(TASKLET_STATE_SCHED, &t->state) );
> +        }
> +        tasklet_unlock_wait(t);
> +        clear_bit(TASKLET_STATE_SCHED, &t->state);
> +        t->is_dead = 1;
> +        return;
> +    }
>      spin_lock_irqsave(&tasklet_lock, flags);
>  
>      if ( !list_empty(&t->list) )
> @@ -208,6 +316,14 @@ void softirq_tasklet_init(
>      t->is_softirq = 1;
>  }
>  
> +void percpu_tasklet_init(
> +    struct tasklet *t, void (*func)(unsigned long), unsigned long data)
> +{
> +    tasklet_init(t, func, data);
> +    t->is_softirq = 1;
> +    t->is_percpu = 1;
> +}
> +
>  static int cpu_callback(
>      struct notifier_block *nfb, unsigned long action, void *hcpu)
>  {
> @@ -218,11 +334,13 @@ static int cpu_callback(
>      case CPU_UP_PREPARE:
>          INIT_LIST_HEAD(&per_cpu(tasklet_list, cpu));
>          INIT_LIST_HEAD(&per_cpu(softirq_tasklet_list, cpu));
> +        INIT_LIST_HEAD(&per_cpu(softirq_list, cpu));
>          break;
>      case CPU_UP_CANCELED:
>      case CPU_DEAD:
>          migrate_tasklets_from_cpu(cpu, &per_cpu(tasklet_list, cpu));
>          migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_tasklet_list, cpu));
> +        migrate_tasklets_from_cpu(cpu, &per_cpu(softirq_list, cpu));
>          break;
>      default:
>          break;
> @@ -242,6 +360,7 @@ void __init tasklet_subsys_init(void)
>      cpu_callback(&cpu_nfb, CPU_UP_PREPARE, hcpu);
>      register_cpu_notifier(&cpu_nfb);
>      open_softirq(TASKLET_SOFTIRQ, tasklet_softirq_action);
> +    open_softirq(TASKLET_SOFTIRQ_PERCPU, tasklet_softirq_percpu_action);
>      tasklets_initialised = 1;
>  }
>  
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ef75b94..740cee5 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -114,7 +114,7 @@ int pt_irq_create_bind(
>              spin_unlock(&d->event_lock);
>              return -ENOMEM;
>          }
> -        softirq_tasklet_init(
> +        percpu_tasklet_init(
>              &hvm_irq_dpci->dirq_tasklet,
>              hvm_dirq_assist, (unsigned long)d);
>          for ( i = 0; i < NR_HVM_IRQS; i++ )
> diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
> index 0c0d481..8b15c8c 100644
> --- a/xen/include/xen/softirq.h
> +++ b/xen/include/xen/softirq.h
> @@ -7,6 +7,7 @@ enum {
>      SCHEDULE_SOFTIRQ,
>      NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
>      RCU_SOFTIRQ,
> +    TASKLET_SOFTIRQ_PERCPU,
>      TASKLET_SOFTIRQ,
>      NR_COMMON_SOFTIRQS
>  };
> diff --git a/xen/include/xen/tasklet.h b/xen/include/xen/tasklet.h
> index 8c3de7e..9497c47 100644
> --- a/xen/include/xen/tasklet.h
> +++ b/xen/include/xen/tasklet.h
> @@ -17,21 +17,24 @@
>  struct tasklet
>  {
>      struct list_head list;
> +    unsigned long state;
>      int scheduled_on;
>      bool_t is_softirq;
>      bool_t is_running;
>      bool_t is_dead;
> +    bool_t is_percpu;
>      void (*func)(unsigned long);
>      unsigned long data;
>  };
>  
> -#define _DECLARE_TASKLET(name, func, data, softirq)                     \
> +#define _DECLARE_TASKLET(name, func, data, softirq, percpu)             \
>      struct tasklet name = {                                             \
> -        LIST_HEAD_INIT(name.list), -1, softirq, 0, 0, func, data }
> +        LIST_HEAD_INIT(name.list), 0, -1, softirq, 0, 0, percpu,        \
> +        func, data }
>  #define DECLARE_TASKLET(name, func, data)               \
> -    _DECLARE_TASKLET(name, func, data, 0)
> +    _DECLARE_TASKLET(name, func, data, 0, 0)
>  #define DECLARE_SOFTIRQ_TASKLET(name, func, data)       \
> -    _DECLARE_TASKLET(name, func, data, 1)
> +    _DECLARE_TASKLET(name, func, data, 1, 0)
>  
>  /* Indicates status of tasklet work on each CPU. */
>  DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
> @@ -40,6 +43,54 @@ DECLARE_PER_CPU(unsigned long, tasklet_work_to_do);
>  #define TASKLET_enqueued   (1ul << _TASKLET_enqueued)
>  #define TASKLET_scheduled  (1ul << _TASKLET_scheduled)
>  
> +/* These fancy bit manipulations (bit 0 and bit 1) along with using a lock
> + * operation allow us to have four stages in tasklet life-time.
> + *  a) 0x0: Completely empty (not scheduled nor running).
> + *  b) 0x1: Scheduled but not running. Used as a guard in 'tasklet_schedule'
> + *     so that we only schedule it once. If it is scheduled but has never
> + *     run (hence never clearing the STATE_SCHED bit), tasklet_kill will
> + *     spin forever on said tasklet. However 'tasklet_schedule' raises the
> + *     per-cpu softirq - so it will run, albeit there might be a race
> + *     (tasklet_kill spinning until the softirq handler runs).
> + *  c) 0x2: running (only on one CPU) and schedulable on any CPU. Bit 0
> + *     (scheduled) is cleared at this stage, allowing 'tasklet_schedule'
> + *     to successfully schedule it again.
> + *  d) 0x3: scheduled and running - only possible if the running tasklet
> + *     calls tasklet_schedule (on the same CPU), or the tasklet is
> + *     scheduled from one CPU while running on another.
> + *
> + * The two bits play a vital role in assuring that the tasklet is scheduled
> + * once and runs only once. The steps are:
> + *
> + *  1) tasklet_schedule: STATE_SCHED bit set (0x1), tasklet added to the
> + *  per-cpu list.
> + *  2) tasklet_softirq_percpu_action picks one tasklet from the list, and
> + *  schedules itself later if there are more tasklets on it. It tries to
> + *  set the STATE_RUN bit (0x3) - if that fails, it adds the tasklet back
> + *  to the per-cpu list; if it succeeds, it clears the STATE_SCHED bit
> + *  (0x2). Once the tasklet completes, STATE_RUN is unset (leaving 0x0,
> + *  or 0x1 if the tasklet called tasklet_schedule).
> + */
> +enum {
> +    TASKLET_STATE_SCHED, /* Bit 0 */
> +    TASKLET_STATE_RUN
> +};
> +
> +static inline int tasklet_trylock(struct tasklet *t)
> +{
> +    return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);

No need for brackets around (t) for these static inlines.

> +}
> +
> +static inline void tasklet_unlock(struct tasklet *t)
> +{
> +    barrier();

clear_bit() has a memory clobber.  This barrier() is entirely redundant.

> +    clear_bit(TASKLET_STATE_RUN, &(t)->state);
> +}
> +static inline void tasklet_unlock_wait(struct tasklet *t)
> +{
> +    while (test_bit(TASKLET_STATE_RUN, &(t)->state))
> +    {
> +        barrier();

cpu_relax();
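
Taken together with the two nits above (the redundant brackets around
't' and the unnecessary barrier()), the helpers would read (a sketch,
assuming clear_bit()/test_bit() keep their current Xen semantics):

    static inline int tasklet_trylock(struct tasklet *t)
    {
        return !test_and_set_bit(TASKLET_STATE_RUN, &t->state);
    }

    static inline void tasklet_unlock(struct tasklet *t)
    {
        clear_bit(TASKLET_STATE_RUN, &t->state); /* Already a full barrier. */
    }

    static inline void tasklet_unlock_wait(struct tasklet *t)
    {
        while ( test_bit(TASKLET_STATE_RUN, &t->state) )
            cpu_relax();
    }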

~Andrew

> +    }
> +}
>  void tasklet_schedule_on_cpu(struct tasklet *t, unsigned int cpu);
>  void tasklet_schedule(struct tasklet *t);
>  void do_tasklet(void);
> @@ -48,6 +99,8 @@ void tasklet_init(
>      struct tasklet *t, void (*func)(unsigned long), unsigned long data);
>  void softirq_tasklet_init(
>      struct tasklet *t, void (*func)(unsigned long), unsigned long data);
> +void percpu_tasklet_init(
> +    struct tasklet *t, void (*func)(unsigned long), unsigned long data);
>  void tasklet_subsys_init(void);
>  
>  #endif /* __XEN_TASKLET_H__ */
