LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 00/18] powerpc/boot: 64bit little endian wrapper
From: Cedric Le Goater @ 2014-03-21  8:37 UTC (permalink / raw)
  To: Geoff Levand; +Cc: linuxppc-dev
In-Reply-To: <1395335024.20123.2.camel@smoke>

Hi Geoff,

> Do you have these in a git branch somewhere so I can merge and test
> them?

No, sorry, I don't have a public git repo. Let me see how I can fix 
that, it might be useful for future patchsets. 

Thanks,

C.  

^ permalink raw reply

* Re: [PATCH v3 0/5] powernv: Enable Dynamic Frequency
From: Gautham R Shenoy @ 2014-03-21  8:19 UTC (permalink / raw)
  To: Viresh Kumar; +Cc: Gautham R. Shenoy, Linux PM list, linuxppc-dev
In-Reply-To: <CAOh2x=mYnRWEqs=qBi4C4OWog+ds85=e+uDTJGpojozEr-XKhQ@mail.gmail.com>

Hi Viresh,

On Fri, Mar 21, 2014 at 01:16:03PM +0530, Viresh Kumar wrote:
> On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy
> <ego@linux.vnet.ibm.com> wrote:
> > This is the v3 of the consolidated patchset consisting
> > patches for enabling cpufreq on IBM POWERNV platforms
> > along with some enhancements.
> 
> This is the first time I saw them. Looks like you never Cc'd linux-pm list.
> Also, would be better to keep Maintainers in --to field so that they can
> review them as soon as possible. Otherwise they might stay on lists for
> a long time..

Yes, I hadn't Cc'ed linux-pm list last time around. Sorry about
that. Will keep this in mind the next time around. Also, your
suggestion regarding keeping the Maintainers in --to field is well
taken.

--
Thanks and Regards
gautham.
> 

^ permalink raw reply

* Re: [PATCH v3 0/5] powernv: Enable Dynamic Frequency
From: Viresh Kumar @ 2014-03-21  7:46 UTC (permalink / raw)
  To: Gautham R. Shenoy; +Cc: linuxppc-dev, Linux PM list
In-Reply-To: <1395317460-14811-1-git-send-email-ego@linux.vnet.ibm.com>

On Thu, Mar 20, 2014 at 5:40 PM, Gautham R. Shenoy
<ego@linux.vnet.ibm.com> wrote:
> This is the v3 of the consolidated patchset consisting
> patches for enabling cpufreq on IBM POWERNV platforms
> along with some enhancements.

This is the first time I saw them. Looks like you never Cc'd linux-pm list.
Also, would be better to keep Maintainers in --to field so that they can
review them as soon as possible. Otherwise they might stay on lists for
a long time..

^ permalink raw reply

* Re: [PATCH v3 5/5] powernv:cpufreq: Implement the driver->get() method
From: Gautham R Shenoy @ 2014-03-21  6:25 UTC (permalink / raw)
  To: Gautham R. Shenoy; +Cc: Preeti U Murthy, linuxppc-dev, linux-pm
In-Reply-To: <1395317460-14811-6-git-send-email-ego@linux.vnet.ibm.com>

On Thu, Mar 20, 2014 at 05:41:00PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> The current frequency of a cpu is reported through the sysfs file
> cpuinfo_cur_freq. This requires the driver to implement a
> "->get(unsigned int cpu)" method which will return the current
> operating frequency.
> 
> Implement a function named powernv_cpufreq_get() which reads the local
> pstate from the PMSR and returns the corresponding frequency.
> 
> Set the powernv_cpufreq_driver.get hook to powernv_cpufreq_get().

Forgot to add 

  Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> 
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 46bee8a..ef6ed8c 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -206,6 +206,43 @@ static inline void set_pmspr(unsigned long sprn, unsigned long val)
>  	BUG();
>  }
> 
> +/*
> + * Computes the current frequency on this cpu
> + * and stores the result in *ret_freq.
> + */
> +static void powernv_read_cpu_freq(void *ret_freq)
> +{
> +	unsigned long pmspr_val;
> +	s8 local_pstate_id;
> +	int *cur_freq, freq, pstate_id;
> +
> +	cur_freq = (int *)ret_freq;
> +	pmspr_val = get_pmspr(SPRN_PMSR);
> +
> +	/* The local pstate id corresponds bits 48..55 in the PMSR.
> +         * Note: Watch out for the sign! */
> +	local_pstate_id = (pmspr_val >> 48) & 0xFF;
> +	pstate_id = local_pstate_id;
> +
> +	freq = pstate_id_to_freq(pstate_id);
> +	pr_debug("cpu %d pmsr %lx pstate_id %d frequency %d \n",
> +		smp_processor_id(), pmspr_val, pstate_id, freq);
> +	*cur_freq = freq;
> +}
> +
> +/*
> + * Returns the cpu frequency as reported by the firmware for 'cpu'.
> + * This value is reported through the sysfs file cpuinfo_cur_freq.
> + */
> +unsigned int powernv_cpufreq_get(unsigned int cpu)
> +{
> +	int ret_freq;
> +
> +	smp_call_function_any(cpu_sibling_mask(cpu), powernv_read_cpu_freq,
> +			&ret_freq, 1);
> +	return ret_freq;
> +}
> +
>  static void set_pstate(void *pstate)
>  {
>  	unsigned long val;
> @@ -297,6 +334,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
>  static struct cpufreq_driver powernv_cpufreq_driver = {
>  	.verify		= powernv_cpufreq_verify,
>  	.target		= powernv_cpufreq_target,
> +	.get		= powernv_cpufreq_get,
>  	.init		= powernv_cpufreq_cpu_init,
>  	.exit		= powernv_cpufreq_cpu_exit,
>  	.name		= "powernv-cpufreq",
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v3 3/5] powernv:cpufreq: Create pstate_id_to_freq() helper
From: Gautham R Shenoy @ 2014-03-21  6:24 UTC (permalink / raw)
  To: Gautham R. Shenoy; +Cc: Preeti U Murthy, linuxppc-dev, linux-pm
In-Reply-To: <1395317460-14811-4-git-send-email-ego@linux.vnet.ibm.com>

On Thu, Mar 20, 2014 at 05:40:58PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> 
> Create a helper routine that can return the cpu-frequency for the
> corresponding pstate_id.
> 
> Also, cache the values of the pstate_max, pstate_min and
> pstate_nominal and nr_pstates in a static structure so that they can
> be reused in the future to perform any validations.
> 

Forgot to add the following line:

  Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 66dae0d..e7b0292 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -39,6 +39,14 @@ static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
>  static struct cpufreq_frequency_table powernv_freqs[POWERNV_MAX_PSTATES+1];
>  static int powernv_pstate_ids[POWERNV_MAX_PSTATES+1];
> 
> +struct powernv_pstate_info {
> +	int pstate_min_id;
> +	int pstate_max_id;
> +	int pstate_nominal_id;
> +	int nr_pstates;
> +};
> +static struct powernv_pstate_info powernv_pstate_info;
> +
>  /*
>   * Initialize the freq table based on data obtained
>   * from the firmware passed via device-tree
> @@ -112,9 +120,28 @@ static int init_powernv_pstates(void)
>  	for (i = 0; powernv_freqs[i].frequency != CPUFREQ_TABLE_END; i++)
>  		pr_debug("%d: %d\n", i, powernv_freqs[i].frequency);
> 
> +	powernv_pstate_info.pstate_min_id = pstate_min;
> +	powernv_pstate_info.pstate_max_id = pstate_max;
> +	powernv_pstate_info.pstate_nominal_id = pstate_nominal;
> +	powernv_pstate_info.nr_pstates = nr_pstates;
> +
>  	return 0;
>  }
> 
> +/**
> + * Returns the cpu frequency corresponding to the pstate_id.
> + */
> +static unsigned int pstate_id_to_freq(int pstate_id)
> +{
> +	int i;
> +
> +	i = powernv_pstate_info.pstate_max_id - pstate_id;
> +
> +	BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
> +	WARN_ON(powernv_pstate_ids[i] != pstate_id);
> +	return powernv_freqs[i].frequency;
> +}
> +
>  static struct freq_attr *powernv_cpu_freq_attr[] = {
>  	&cpufreq_freq_attr_scaling_available_freqs,
>  	NULL,
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v3 2/5] powernv, cpufreq:Add per-core locking to serialize frequency transitions
From: Gautham R Shenoy @ 2014-03-21  6:24 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: linux-pm, linuxppc-dev, Srivatsa S. Bhat, Preeti U Murthy
In-Reply-To: <1395317460-14811-3-git-send-email-ego@linux.vnet.ibm.com>

On Thu, Mar 20, 2014 at 05:40:57PM +0530, Gautham R. Shenoy wrote:
> From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
> 
> On POWER systems, the CPU frequency is controlled at a core-level and
> hence we need to serialize so that only one of the threads in the core
> switches the core's frequency at a time.
> 
> Using a global mutex lock would needlessly serialize _all_ frequency
> transitions in the system (across all cores). So introduce per-core
> locking to enable finer-grained synchronization and thereby enhance
> the speed and responsiveness of the cpufreq driver to varying workload
> demands.
> 
> The design of per-core locking is very simple and straight-forward: we
> first define a Per-CPU lock and use the ones that belongs to the first
> thread sibling of the core.
> 
> cpu_first_thread_sibling() macro is used to find the *common* lock for
> all thread siblings belonging to a core.
> 

Forgot to add the following line.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index ab1551f..66dae0d 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -24,8 +24,15 @@
>  #include <linux/of.h>
>  #include <asm/cputhreads.h>
> 
> -/* FIXME: Make this per-core */
> -static DEFINE_MUTEX(freq_switch_mutex);
> +/* Per-Core locking for frequency transitions */
> +static DEFINE_PER_CPU(struct mutex, freq_switch_lock);
> +
> +#define lock_core_freq(cpu)				\
> +			mutex_lock(&per_cpu(freq_switch_lock,\
> +				cpu_first_thread_sibling(cpu)));
> +#define unlock_core_freq(cpu)				\
> +			mutex_unlock(&per_cpu(freq_switch_lock,\
> +				cpu_first_thread_sibling(cpu)));
> 
>  #define POWERNV_MAX_PSTATES	256
> 
> @@ -221,7 +228,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
>  	freqs.new = powernv_freqs[new_index].frequency;
>  	freqs.cpu = policy->cpu;
> 
> -	mutex_lock(&freq_switch_mutex);
> +	lock_core_freq(policy->cpu);
>  	cpufreq_notify_transition(policy, &freqs, CPUFREQ_PRECHANGE);
> 
>  	pr_debug("setting frequency for cpu %d to %d kHz index %d pstate %d",
> @@ -233,7 +240,7 @@ static int powernv_cpufreq_target(struct cpufreq_policy *policy,
>  	rc = powernv_set_freq(policy->cpus, new_index);
> 
>  	cpufreq_notify_transition(policy, &freqs, CPUFREQ_POSTCHANGE);
> -	mutex_unlock(&freq_switch_mutex);
> +	unlock_core_freq(policy->cpu);
> 
>  	return rc;
>  }
> @@ -250,7 +257,7 @@ static struct cpufreq_driver powernv_cpufreq_driver = {
> 
>  static int __init powernv_cpufreq_init(void)
>  {
> -	int rc = 0;
> +	int cpu, rc = 0;
> 
>  	/* Discover pstates from device tree and init */
> 
> @@ -260,6 +267,10 @@ static int __init powernv_cpufreq_init(void)
>  		pr_info("powernv-cpufreq disabled\n");
>  		return rc;
>  	}
> +	/* Init per-core mutex */
> +	for_each_possible_cpu(cpu) {
> +		mutex_init(&per_cpu(freq_switch_lock, cpu));
> +	}
> 
>  	rc = cpufreq_register_driver(&powernv_cpufreq_driver);
>  	return rc;
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-21  5:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, LKML, Davidlohr Bueso, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <20140321045501.GD30295@linux.vnet.ibm.com>

On Thu, Mar 20, 2014 at 9:55 PM, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> I reverted commits 99b60ce6 and b0c29f79. Then applied the patches in
> the above url. The last one had a reject but it was pretty
> straightforward to resolve it. After this, specjbb completes.
>
> So reverting and applying v3 3/4 and 4/4 patches works for me.

Ok, I verified that the above endds up resulting in the same tree as
the minimal patch I sent out, modulo (a) some comments and (b) an
#ifdef CONFIG_SMP in futex_get_mm() that doesn't really matter.

So I committed the minimal patch with your tested-by.

             Linus

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Srikar Dronamraju @ 2014-03-21  4:55 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Peter Zijlstra, Linus Torvalds, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395346808.14694.62.camel@buesod1.americas.hpqcorp.net>

> 
> Ok, so a big reason why this patch doesn't apply cleanly after reverting
> is because *most* of the changes were done at the top of the file with
> regards to documenting the ordering guarantees, the actual code changes
> are quite minimal.
> 
> I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
> commit), and then I cleanly applied the equivalent ones from v3 of the
> series (which was already *tested* and ready for upstream until you
> suggested looking into the alternative spinlock approach):
> 
> https://lkml.org/lkml/2013/12/19/624
> https://lkml.org/lkml/2013/12/19/630

I reverted commits 99b60ce6 and b0c29f79. Then applied the patches in
the above url. The last one had a reject but it was pretty
straightforward to resolve it. After this, specjbb completes. 

So reverting and applying v3 3/4 and 4/4 patches works for me.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH RFC/RFT v3 6/9] powerpc: move cacheinfo sysfs to generic cacheinfo infrastructure
From: Anshuman Khandual @ 2014-03-21  3:44 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: linuxppc-dev@lists.ozlabs.org, Paul Mackerras,
	linux-kernel@vger.kernel.org
In-Reply-To: <531D9E3A.7040109@arm.com>

On 03/10/2014 04:42 PM, Sudeep Holla wrote:
> Hi Anshuman,
> 
> On 07/03/14 06:14, Anshuman Khandual wrote:
>> On 03/07/2014 09:36 AM, Anshuman Khandual wrote:
>>> On 02/19/2014 09:36 PM, Sudeep Holla wrote:
>>>> From: Sudeep Holla <sudeep.holla@arm.com>
>>>>
>>>> This patch removes the redundant sysfs cacheinfo code by making use of
>>>> the newly introduced generic cacheinfo infrastructure.
>>>>
>>>> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
>>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>>> Cc: Paul Mackerras <paulus@samba.org>
>>>> Cc: linuxppc-dev@lists.ozlabs.org
>>>> ---
>>>>   arch/powerpc/kernel/cacheinfo.c | 831
>>>> ++++++----------------------------------
>>>>   arch/powerpc/kernel/cacheinfo.h |   8 -
>>>>   arch/powerpc/kernel/sysfs.c     |   4 -
>>>>   3 files changed, 109 insertions(+), 734 deletions(-)
>>>>   delete mode 100644 arch/powerpc/kernel/cacheinfo.h
>>>>
>>>> diff --git a/arch/powerpc/kernel/cacheinfo.c
>>>> b/arch/powerpc/kernel/cacheinfo.c
>>>> index 2912b87..05b7580 100644
>>>> --- a/arch/powerpc/kernel/cacheinfo.c
>>>> +++ b/arch/powerpc/kernel/cacheinfo.c
>>>> @@ -10,38 +10,10 @@
>>>>    * 2 as published by the Free Software Foundation.
>>>>    */
>>>>
>>>> +#include <linux/cacheinfo.h>
>>>>   #include <linux/cpu.h>
>>>> -#include <linux/cpumask.h>
>>>>   #include <linux/kernel.h>
>>>> -#include <linux/kobject.h>
>>>> -#include <linux/list.h>
>>>> -#include <linux/notifier.h>
>>>>   #include <linux/of.h>
>>>> -#include <linux/percpu.h>
>>>> -#include <linux/slab.h>
>>>> -#include <asm/prom.h>
>>>> -
>>>> -#include "cacheinfo.h"
>>>> -
>>>> -/* per-cpu object for tracking:
>>>> - * - a "cache" kobject for the top-level directory
>>>> - * - a list of "index" objects representing the cpu's local cache
>>>> hierarchy
>>>> - */
>>>> -struct cache_dir {
>>>> -    struct kobject *kobj; /* bare (not embedded) kobject for cache
>>>> -                   * directory */
>>>> -    struct cache_index_dir *index; /* list of index objects */
>>>> -};
>>>> -
>>>> -/* "index" object: each cpu's cache directory has an index
>>>> - * subdirectory corresponding to a cache object associated with the
>>>> - * cpu.  This object's lifetime is managed via the embedded kobject.
>>>> - */
>>>> -struct cache_index_dir {
>>>> -    struct kobject kobj;
>>>> -    struct cache_index_dir *next; /* next index in parent directory */
>>>> -    struct cache *cache;
>>>> -};
>>>>
>>>>   /* Template for determining which OF properties to query for a given
>>>>    * cache type */
>>>> @@ -60,11 +32,6 @@ struct cache_type_info {
>>>>       const char *nr_sets_prop;
>>>>   };
>>>>
>>>> -/* These are used to index the cache_type_info array. */
>>>> -#define CACHE_TYPE_UNIFIED     0
>>>> -#define CACHE_TYPE_INSTRUCTION 1
>>>> -#define CACHE_TYPE_DATA        2
>>>> -
>>>>   static const struct cache_type_info cache_type_info[] = {
>>>>       {
>>>>           /* PowerPC Processor binding says the [di]-cache-*
>>>> @@ -77,246 +44,115 @@ static const struct cache_type_info
>>>> cache_type_info[] = {
>>>>           .nr_sets_prop    = "d-cache-sets",
>>>>       },
>>>>       {
>>>> -        .name            = "Instruction",
>>>> -        .size_prop       = "i-cache-size",
>>>> -        .line_size_props = { "i-cache-line-size",
>>>> -                     "i-cache-block-size", },
>>>> -        .nr_sets_prop    = "i-cache-sets",
>>>> -    },
>>>> -    {
>>>>           .name            = "Data",
>>>>           .size_prop       = "d-cache-size",
>>>>           .line_size_props = { "d-cache-line-size",
>>>>                        "d-cache-block-size", },
>>>>           .nr_sets_prop    = "d-cache-sets",
>>>>       },
>>>> +    {
>>>> +        .name            = "Instruction",
>>>> +        .size_prop       = "i-cache-size",
>>>> +        .line_size_props = { "i-cache-line-size",
>>>> +                     "i-cache-block-size", },
>>>> +        .nr_sets_prop    = "i-cache-sets",
>>>> +    },
>>>>   };
>>>
>>>
>>> Hey Sudeep,
>>>
>>> After applying this patch, the cache_type_info array looks like this.
>>>
>>> static const struct cache_type_info cache_type_info[] = {
>>>          {
>>>                  /*
>>>                   * PowerPC Processor binding says the [di]-cache-*
>>>                   * must be equal on unified caches, so just use
>>>                   * d-cache properties.
>>>                   */
>>>                  .name            = "Unified",
>>>                  .size_prop       = "d-cache-size",
>>>                  .line_size_props = { "d-cache-line-size",
>>>                                       "d-cache-block-size", },
>>>                  .nr_sets_prop    = "d-cache-sets",
>>>          },
>>>          {
>>>                  .name            = "Data",
>>>                  .size_prop       = "d-cache-size",
>>>                  .line_size_props = { "d-cache-line-size",
>>>                                       "d-cache-block-size", },
>>>                  .nr_sets_prop    = "d-cache-sets",
>>>          },
>>>          {
>>>                  .name            = "Instruction",
>>>                  .size_prop       = "i-cache-size",
>>>                  .line_size_props = { "i-cache-line-size",
>>>                                       "i-cache-block-size", },
>>>                  .nr_sets_prop    = "i-cache-sets",
>>>          },
>>> };
>>>
>>> and this function computes the the array index for any given cache type
>>> define for PowerPC.
>>>
>>> static inline int get_cacheinfo_idx(enum cache_type type)
>>> {
>>>          if (type == CACHE_TYPE_UNIFIED)
>>>                  return 0;
>>>          else
>>>                  return type;
>>> }
>>>
>>> These types are define in include/linux/cacheinfo.h as
>>>
>>> enum cache_type {
>>>          CACHE_TYPE_NOCACHE = 0,
>>>          CACHE_TYPE_INST = BIT(0),        ---> 1
>>>          CACHE_TYPE_DATA = BIT(1),        ---> 2
>>>          CACHE_TYPE_SEPARATE = CACHE_TYPE_INST | CACHE_TYPE_DATA,
>>>          CACHE_TYPE_UNIFIED = BIT(2),
>>> };
>>>
>>> When it is UNIFIED we return index 0, which is correct. But the index
>>> for instruction and data cache seems to be swapped which wrong. This
>>> will fetch invalid properties for any given cache type.
>>>
> 
> Ah, that's silly mistake on my side, will fix it.
> 
>>> I have done some initial review and testing for this patch's impact on
>>> PowerPC (ppc64 POWER specifically). I am trying to do some code clean-up
>>> and re-arrangements. Will post out soon. Thanks !
> 
> Thanks for taking time for testing and reviewing these patches.

Now that you got some of the problems to work on and resend the patches, I will
hold on to the clean up patches I had.

^ permalink raw reply

* Re: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040
From: Scott Wood @ 2014-03-20 23:33 UTC (permalink / raw)
  To: Chenhui Zhao; +Cc: linuxppc-dev, linux-kernel, Jason.Jin
In-Reply-To: <20140319005607.GB9735@localhost.localdomain>

On Wed, 2014-03-19 at 08:56 +0800, Chenhui Zhao wrote:
> On Tue, Mar 18, 2014 at 05:42:09PM -0500, Scott Wood wrote:
> > On Mon, 2014-03-17 at 19:19 +0800, Chenhui Zhao wrote:
> > > On Fri, Mar 14, 2014 at 06:18:27PM -0500, Scott Wood wrote:
> > > > Why do you need the entry mapping on 32-bit but not 64-bit?
> > > 
> > > fsl_booke_entry_mapping.S is for 32-bit. 64-bit calls
> > > initial_tlb_book3e() in exceptions-64e.S.
> > 
> > The answer I was looking for is that __entry_deep_sleep calls
> > start_initialization_book3e which calls the code to handle it.
> > 
> > But why is it driven from sleep.S on 64-bit but not on 32-bit?  Why
> > can't you make it so that the 32-bit TLB setup can be called into in a
> > similar manner?
> 
> Yes. I also wish to do like this. As I mentioned, the problem in 32-bit
> is that the TLB setup code in fsl_booke_entry_mapping.S only setups a temp
> mapping of 4KB, so these code only can run in this 4KB address space. But the
> code in sleep.S is outside of the 4KB space. So can't put the TLB setup
> code in sleep.S. There are two method to solve it.
> 1) The current method is running the TLB setup code of fsl_booke_entry_mapping.S in the 4KB
> space, then jump to the code of sleep.S.
> 2) extend the temp mapping space in the TLB setup code to cover kernel, say 4MB or 8MB. But
> not sure if there are any side effects.

fsl_booke_entry_mapping.S creates a 64M entry.  The 4K entry is only
temporary while in AS1 -- it shouldn't matter if the address you return
to when leaving fsl_booke_entry_mapping.S is outside the 4K, as long as
it's within the 64M entry.

Or am I missing something?

> > > > > > > +#define FSLDELAY(count)		\
> > > > > > > +	li	r3, (count)@l;	\
> > > > > > > +	slwi	r3, r3, 10;	\
> > > > > > > +	mtctr	r3;		\
> > > > > > > +101:	nop;			\
> > > > > > > +	bdnz	101b;
> > > > > > 
> > > > > > You don't need a namespace prefix on local macros in a non-header file.
> > > > > > 
> > > > > > Is the timebase stopped where you're calling this from?
> > > > > 
> > > > > No. My purpose is to avoid jump in the last stage of entering deep sleep.
> > > > > Jump may cause problem at that time.
> > > > 
> > > > "bdnz" is a jump.
> > > > 
> > > > What problems do you think a jump will cause?
> > > 
> > > I mean a far jump which can jump to an address which has not been prefetched in
> > > advance. I wish the code is executed in a restricted environment (predictable code
> > > and address).
> > 
> > Why would a timebase loop require a "far" jump?
> 
> I mean there is far jump in udely().
> 
> Do you mean using a timebase loop in the macro FSLDELAY? If so, I agree.

Yes, I meant a timebase loop, not udelay().

> > > > > > You also probably want to do a "sync, readback, data dependency, isync"
> > > > > > sequence to make sure that the store has hit CCSR before you begin your
> > > > > > delay (or is a delay required at all if you do that?).
> > > > > 
> > > > > Yes. It is safer with a sync sequence.
> > > > > 
> > > > > The DDR controller need some time to signal the external DDR modules to
> > > > > enter self refresh mode.
> > > > 
> > > > Is it documented how much time it requires?
> > > > 
> > > > -Scott
> > > 
> > > No.
> > 
> > How do you know the current delay is adequate in all circumstances (e.g
> > clock speeds), much less on future chips?  Is it documented that a delay
> > is needed at all, or is this just something that appeared to make it
> > work?  If the latter, what happens if you put the synchronization in,
> > but leave out the delay?
> > 
> > -Scott
> 
> The code controls external parts (FPGA/CPLD, DDR module) to act, and
> the sequent code must wait until external parts complete. We can't get
> an ack from external parts, so use delay to make sure the sequence and
> insert enough time to wait.

It would be good if you could get the hardware designers to provide an
upper bound for how long we need to wait.

-Scott

^ permalink raw reply

* Re: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040
From: Scott Wood @ 2014-03-20 22:17 UTC (permalink / raw)
  To: Kevin Hao; +Cc: linuxppc-dev, Chenhui Zhao, Jason.Jin, linux-kernel
In-Reply-To: <20140320114735.GD10182@pek-khao-d1.corp.ad.wrs.com>

On Thu, 2014-03-20 at 19:47 +0800, Kevin Hao wrote:
> OK, so the intention of 'twi, isync' following the load is not to order the
> following storage access, but order the following delay loop instructions,
> right? But according to the e6500 manual, the instructions complete in order.
> The following is the definition of 'complete':
>   Complete—An instruction is eligible to complete after it finishes executing
>   and makes its results available for subsequent instructions. Instructions
>   must complete in order from the bottom two entries of the
>   completion queue (CQ). The completion unit coordinates how instructions (which
>   may have executed out of order) affect architected registers to ensure the
>   appearance of serial execution. This guarantees that the completed instruction
>   and all previous instructions can cause no exceptions. An instruction completes
>   when it is retired, that is, deleted from the CQ.
> 
> So the following delay loop instructions should never complete before the
> complete of the load instruction.

We don't want the delay loop instructions to *start* until the load has
completed.  Completion of the loop only matters when ordering the loop
versus post-loop actions (and again, there we'd want the loop to
complete before subsequent actions start).

> > > > > So if we want to order all the storage access as well
> > > > > as execution synchronization, we should choose sync here.
> > > > 
> > > > Do we need execution synchronization or context synchronization?
> > > 
> > > There is no context-altering instruction here, so I think an execution
> > > synchronizing instruction should be enough here.
> > 
> > Is the ISA ever explicit about what constitutes "context"?
> 
> The following is the definition of context-altering instruction:
>   An instruction that alters the context in which data
>   addresses or instruction addresses are interpreted, or
>   in which instructions are executed or data accesses are
>   performed, is called a context-altering instruction.
> 
> So the context should be:
>   - in which data addresses or instruction addresses are interpreted
>   - in which instructions are executed
>   - in which data accesses are performed

By that definition, a store to CCSR could easily change the context in
which data accesses are performed, by changing a mapping.

But still, nothing in the above defines "context" -- or rather, it does
so circularly.  While it makes intuitive sense that it would be limited
to context that lives within the core, rather than the rest of the
system, I don't think the ISA generally distinguishes between the two.

-Scott

^ permalink raw reply

* Re: [PATCH 9/9] powerpc/pm: support deep sleep feature on T1040
From: Scott Wood @ 2014-03-20 22:07 UTC (permalink / raw)
  To: David Laight
  Cc: linuxppc-dev@lists.ozlabs.org, 'Kevin Hao', Chenhui Zhao,
	Jason.Jin@freescale.com, linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D0F6E33AE@AcuExch.aculab.com>

On Thu, 2014-03-20 at 11:59 +0000, David Laight wrote:
> I tried to work out what the 'twi, isync' instructions were for (in in_le32()).
> The best I could come up with was to ensure a synchronous bus-fault.
> But bus faults are probably only expected during device probing - not
> normal operation, and the instructions will have a significant cost.
> 
> Additionally in_le32() and out_le32() both start with a 'sync' instruction.
> In many cases that isn't needed either - an explicit iosync() can be
> used after groups of instructions.

The idea is that it's better to be maximally safe by default, and let
performance critical sections be optimized using raw accessors and
explicit synchronization if needed, than to have hard-to-debug bugs due
to missing/wrong sync.  A lot of I/O is slow enough that the performance
impact doesn't really matter, but the brain-time cost of getting the 
sync right is still there.

-Scott

^ permalink raw reply

* Re: [v2,2/2] powerpc/mpc85xx: add support for Keymile's kmcoge4 board
From: Scott Wood @ 2014-03-20 21:30 UTC (permalink / raw)
  To: Valentin Longchamp; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <532AA9F8.3060406@keymile.com>

On Thu, 2014-03-20 at 09:42 +0100, Valentin Longchamp wrote:
> On 03/20/2014 12:08 AM, Scott Wood wrote:
> > On Tue, Feb 11, 2014 at 12:50:07PM +0100, Valentin Longchamp wrote:
> >> +		reset_cpld@1,0 {
> >> +			interrupt-controller;
> >> +			#interrupt-cells = <2>;
> >> +			reg = <1 0 0x80>;
> >> +			interrupt-parent = <&mpic>;
> >> +			interrupts = <
> >> +				4 1 0 0
> >> +				5 1 0 0>;
> >> +		};
> >> +
> >> +		chassis_mgmt@3,0 {
> >> +			interrupt-controller;
> >> +			#interrupt-cells = <2>;
> >> +			reg = <3 0 0x100>;
> >> +			interrupt-parent = <&mpic>;
> >> +			interrupts = <6 1 0 0>;
> >> +		};
> > 
> > Dashes are preferred to underscores in device trees.
> 
> OK.
> 
> > 
> > More importantly, these nodes need proper compatibles and bindings.  Once
> > that's done, the name for the nodes should probably be
> > "board_control@whatever" for both.
> > 
> 
> The first one can be board-ctrl.

board-control (rather than board-ctrl) is already used in several device
trees, so it would be good to be consistent.

> The second however manages things that are
> beyond this board and important for other boards in the chassis, so I think
> chassis-mgmt is correct.

OK.

> For the binding/compatbiles issues: in the first discussion I had omitted these
> nodes because these are not available (and honestly for such FPGAs I doubt they
> will ever be mainlined). We discussed it and concluded that the DTS should
> describe the HW and not the drivers available in the kernel so I have now added
> them. Do you want me to add the compatible strings we use in our tree even
> though there are no bindings ? Leave them as is ? Or drop them ?

Include the compatibles, and add bindings.  The bindings should state
the compatible string used, what the interrupt specifier format is, and
what reg/interrupts represents -- especially in the case of the reset
cpld node, which has two interrupts.  Make it clear which interrupt goes
first and which goes second.

It doesn't matter whether the driver will ever be mainlined.

-Scott

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-20 20:36 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395346808.14694.62.camel@buesod1.americas.hpqcorp.net>

On Thu, Mar 20, 2014 at 1:20 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
> commit), and then I cleanly applied the equivalent ones from v3 of the
> series (which was already *tested* and ready for upstream until you
> suggested looking into the alternative spinlock approach):
>
> https://lkml.org/lkml/2013/12/19/624
> https://lkml.org/lkml/2013/12/19/630
>
> Assuming the atomics solves the issue, would you be willing to take this
> path? Any pending documentation fixes can be added afterwards. The
> important thing is that the actual code is well tested.

So my preference would be to do that "tested code" thing, but then
edit out the comment changes and boil it down to just the minimal code
changes. So that you can see what the patch actually *does*, without
it being hidden by the bulk of the patch just being the reverts of the
comment fixups.

In fact, I hope that if you do that, you end up with the patch I just
created by hand, and then we'd have come to the same situation two
different ways independently, and I'd be doubly happy for that extra
cross-checking of what went on.

And I would *not* want to do this as "two reverts and one patch to
re-do things like we used to", because that just makes the actual
change even harder to see. And that's partly because if we eventually
do decide that "hey, if we can do this using the ticket lock as a
counter, let's do it that way", then this *small* fixup patch ends up
showing the actual real differences between the two approaches.

Of course, right now we don't even have confirmation from Srikar that
the explicit "waiters" counter even fixes things on powerpc, so.. All
the testing that orginal patch had was also on x86, so if it's some
subtle memory ordering issue that we haven't figured out now, rather
than the ticket lock thing, all this discussion about which way to go
turns out to be entirely premature.

            Linus

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Benjamin Herrenschmidt @ 2014-03-20 20:23 UTC (permalink / raw)
  To: torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, paulus, Davidlohr Bueso,
	tglx, Paul McKenney, linuxppc-dev, mingo
In-Reply-To: <1395333093.14694.3.camel@buesod1.americas.hpqcorp.net>

On Thu, 2014-03-20 at 09:31 -0700, Davidlohr Bueso wrote:
> hmmm looking at ppc spinlock code, it seems that it doesn't have ticket
> spinlocks -- in fact Torsten Duwe has been trying to get them upstream
> very recently. Since we rely on the counter for detecting waiters, this
> might explain the issue. Could someone confirm this spinlock
> implementation difference? 

Indeed. I haven't merged ticket locks because they break lockref :-(

We have a problem here because we need to store the lock holder so we
can yield to the lock holder partition on contention and we are running
out of space in the spinlock.

The lock holder doesn't have to be atomic, so in theory we could have
the tickets and the lockref in the same 64-bit and the holder separately
but the way the layers are stacked at the moment that's not workable,
at least not without duplicating the whole lockref implementation and
breaking the spinlock in two, a "base" lock without older and the separate
variant with holder field. A mess...

I want to try sorting that out at some stage but haven't had a chance yet.

Cheers,
Ben.

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Davidlohr Bueso @ 2014-03-20 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <CA+55aFw8wSe-DUtZ8=rjFJzXYmOSsZ0JzS-_jsZKrDai8LZMXw@mail.gmail.com>

On Thu, 2014-03-20 at 12:25 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 12:08 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > Oh, it does. This atomics technique was tested at a customer's site and
> > ready for upstream.
> 
> I'm not worried about the *original* patch. I'm worried about the
> incremental one.
> 
> Your original patch never applied to my tree - I think it was based on
> -mm or something. So I couldn't verify my "let's go back to the
> explicit 'waiters'" incremental patch against reverting and
> re-applying the original patch.

Ok, so a big reason why this patch doesn't apply cleanly after reverting
is because *most* of the changes were done at the top of the file with
regards to documenting the ordering guarantees, the actual code changes
are quite minimal.

I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
commit), and then I cleanly applied the equivalent ones from v3 of the
series (which was already *tested* and ready for upstream until you
suggested looking into the alternative spinlock approach):

https://lkml.org/lkml/2013/12/19/624
https://lkml.org/lkml/2013/12/19/630

Assuming the atomics solves the issue, would you be willing to take this
path? Any pending documentation fixes can be added afterwards. The
important thing is that the actual code is well tested.

Thanks,
Davidlohr

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-20 19:25 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395342510.14694.43.camel@buesod1.americas.hpqcorp.net>

On Thu, Mar 20, 2014 at 12:08 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> Oh, it does. This atomics technique was tested at a customer's site and
> ready for upstream.

I'm not worried about the *original* patch. I'm worried about the
incremental one.

Your original patch never applied to my tree - I think it was based on
-mm or something. So I couldn't verify my "let's go back to the
explicit 'waiters'" incremental patch against reverting and
re-applying the original patch.

So I'd like you to re-verify that that incremental patch really is
solid, and does what your original one did.

               Linus

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Davidlohr Bueso @ 2014-03-20 19:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <CA+55aFwH2sJKCH6hUf8H-7nMPo8YwFCR6oP8McZ5VpMcGDrYSw@mail.gmail.com>

On Thu, 2014-03-20 at 11:36 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > Comparing with the patch I sent earlier this morning, looks equivalent,
> > and fwiw, passes my initial qemu bootup, which is the first way of
> > detecting anything stupid going on.
> >
> > So, Srikar, please try this patch out, as opposed to mine, you don't
> > have to first revert the commit in question.
> 
> Ok, so it boots for me too, so hopefully it isn't totally broken.
> 
> However, since it's just closing a race, and since getting the counts
> wrong should easily result in it *working* but always taking the slow
> path (for example), I'd really like people to also verify that it
> fixes the actual performance issue (ie assuming it fixes powerpc
> behavior for Srikar, I'd like to get it double-checked that it also
> avoids the spinlock in the common case). 

Oh, it does. This atomics technique was tested at a customer's site and
ready for upstream. To refresh, we were originally seeing massive
contention on the hb->lock and an enormous amounts of 0 returns from
futex_wake, indicating that spinners where piling up just to realize
that the plist was empty! While I don't have any official numbers, I can
confirm that perf showed that this issue was addressed with the atomics
variant. Yes, such pathological behavior shows problems in the userspace
locking primitives design/implementation, but allowing the kernel not to
be affected by suboptimal uses of futexes is definitely a plus. 

As tglx suggested at the time, I also made sure that adding the barriers
when doing the key refcounting didn't impose any serious restrictions to
performance either.

Now, what at the time required re-testing everything was when you
suggested replacing this approach with a more elegant spin is locked
test. Both approaches showed pretty much identical performance (and
correctness, at least on x86). And to this day shows *significant* less
time spent in kernel space dealing with futexes.


> Because if the
> increment/decrement pairings end up being wrong, we could have a
> situation where the waiter count just ends up bogus, and it all works
> from a correctness standpoint but not from the intended performance
> optimization.
> 
> No way I can test that sanely on my single-socket machine. Davidlohr?

Not this patch, no :( -- we could never blindly reproduce the customer's
workload. The only patch that I was able to create test cases for is the
larger hash table one, which simply alleviates collisions. This is now
part of perf-bench.

Thanks,
Davidlohr

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-20 18:36 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395335909.14694.15.camel@buesod1.americas.hpqcorp.net>

On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> Comparing with the patch I sent earlier this morning, looks equivalent,
> and fwiw, passes my initial qemu bootup, which is the first way of
> detecting anything stupid going on.
>
> So, Srikar, please try this patch out, as opposed to mine, you don't
> have to first revert the commit in question.

Ok, so it boots for me too, so hopefully it isn't totally broken.

However, since it's just closing a race, and since getting the counts
wrong should easily result in it *working* but always taking the slow
path (for example), I'd really like people to also verify that it
fixes the actual performance issue (ie assuming it fixes powerpc
behavior for Srikar, I'd like to get it double-checked that it also
avoids the spinlock in the common case). Because if the
increment/decrement pairings end up being wrong, we could have a
situation where the waiter count just ends up bogus, and it all works
from a correctness standpoint but not from the intended performance
optimization.

No way I can test that sanely on my single-socket machine. Davidlohr?

                  Linus

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-20 18:16 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395338607.14694.23.camel@buesod1.americas.hpqcorp.net>

On Thu, Mar 20, 2014 at 11:03 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> I still wonder about ppc and spinlocks (no ticketing!!) ... sure the
> "waiters" patch might fix the problem just because we explicitly count
> the members of the plist. And I guess if we cannot rely on all archs
> having an equivalent spinlock implementation, we simply cannot use this
> technique for futexes.

So the ticketing part means that on x86 we see pending waiters even
when a previous one does "spin_unlock()". I agree that that is a
fundamental difference between x86 and powerpc, and it does seem to be
the most likely culprit.

And dammit, I *liked* my "don't use an explicit waiter count"
approach, so I'd love to be able to do it. But I we've never really
guaranteed that "is_spin_locked()" shows whether there are spinners,
so I don't know how to do that.

I guess we could expose some interface for the spinlock code to say
whether it supports that or not, and then switch between the two
algorithms. But that just feels very very ugly to me.

But let's see if the explicit waiter count version even solves the
thing on powerpc. Maybe it's something else, and we'll have to revert
entirely for now.

                 Linus

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Davidlohr Bueso @ 2014-03-20 18:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <CA+55aFw-gGQ5hTdVPovpRgcEHGyhf3YDzoSLpavzP4SefkkBDg@mail.gmail.com>

On Thu, 2014-03-20 at 10:42 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >> It strikes me that the "spin_is_locked()" test has no barriers wrt the
> >> writing of the new futex value on the wake path. And the read barrier
> >> obviously does nothing wrt the write either. Or am I missing
> >> something? So the write that actually released the futex might be
> >> almost arbitrarily delayed on the waking side. So the waiting side may
> >> not see the new value, even though the waker assumes it does due to
> >> the ordering of it doing the write first.
> >
> > Aha, that would certainly violate the ordering guarantees. I feared
> > _something_ like that when we originally discussed your suggestion as
> > opposed to the atomics one, but didn't have any case for it either.
> 
> Actually, looking closer, we have the memory barrier in
> get_futex_key_refs() (called by "get_futex_key()") so that's not it.
> In fact, your "atomic_read(&hb->waiters)" doesn't have any more
> serialization than the spin_is_locked() test had.
> 
> But the spin_is_locked() and queue-empty tests are two separate memory
> reads, and maybe there is some ordering wrt those two that we missed,
> so the "waiters" patch is worth trying anyway.

Well, imho we would have seen something wrong much much earlier. This
patch has been very heavily tested (including with the java workload
used by Shrikar). 

I still wonder about ppc and spinlocks (no ticketing!!) ... sure the
"waiters" patch might fix the problem just because we explicitly count
the members of the plist. And I guess if we cannot rely on all archs
having an equivalent spinlock implementation, we simply cannot use this
technique for futexes.

Thanks,
Davidlohr

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-20 17:42 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395335909.14694.15.camel@buesod1.americas.hpqcorp.net>

On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>> It strikes me that the "spin_is_locked()" test has no barriers wrt the
>> writing of the new futex value on the wake path. And the read barrier
>> obviously does nothing wrt the write either. Or am I missing
>> something? So the write that actually released the futex might be
>> almost arbitrarily delayed on the waking side. So the waiting side may
>> not see the new value, even though the waker assumes it does due to
>> the ordering of it doing the write first.
>
> Aha, that would certainly violate the ordering guarantees. I feared
> _something_ like that when we originally discussed your suggestion as
> opposed to the atomics one, but didn't have any case for it either.

Actually, looking closer, we have the memory barrier in
get_futex_key_refs() (called by "get_futex_key()") so that's not it.
In fact, your "atomic_read(&hb->waiters)" doesn't have any more
serialization than the spin_is_locked() test had.

But the spin_is_locked() and queue-empty tests are two separate memory
reads, and maybe there is some ordering wrt those two that we missed,
so the "waiters" patch is worth trying anyway.

I do still dislike how the "waiters" thing adds an atomic update, but whatever..

          Linus

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Davidlohr Bueso @ 2014-03-20 17:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <CA+55aFzP60Ucn0snFchHNw2tOnS6HYp-6n+f+9nxrbTPw8fTfQ@mail.gmail.com>

On Thu, 2014-03-20 at 09:41 -0700, Linus Torvalds wrote:
> On Wed, Mar 19, 2014 at 10:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > This problem suggests that we missed a wakeup for a task that was adding
> > itself to the queue in a wait path. And the only place that can happen
> > is with the hb spinlock check for any pending waiters.
> 
> Ok, so thinking about hb_waiters_pending():
> 
>  - spin_is_locked() test
>  - read barrier
>  - plist_head_empty() test
> 
> seems to be broken after all.
> 
> The race is against futex_wait() that does
> 
>  - futex_wait_setup():
>    - queue_lock()
>    - get_futex_value_locked()
>  - futex_wait_queue_me()
>    - queue_me()
>      - plist_add()
> 
> right?

Yep.

> 
> It strikes me that the "spin_is_locked()" test has no barriers wrt the
> writing of the new futex value on the wake path. And the read barrier
> obviously does nothing wrt the write either. Or am I missing
> something? So the write that actually released the futex might be
> almost arbitrarily delayed on the waking side. So the waiting side may
> not see the new value, even though the waker assumes it does due to
> the ordering of it doing the write first.

Aha, that would certainly violate the ordering guarantees. I feared
_something_ like that when we originally discussed your suggestion as
opposed to the atomics one, but didn't have any case for it either.

> So maybe we need a memory barrier in hb_waiters_pending() just to make
> sure that the new value is written.
> 
> But at that point, I suspect that Davidlohrs original patch that just
> had explicit waiting counts is just as well. The whole point with the
> head empty test was to emulate that "do we have waiters" without
> having to actually add the atomics, but a memory barrier is really no
> worse.
> 
> The attached is a TOTALLY UNTESTED interdiff that adds back Davidlohrs
> atomic count. It may be terminally broken, I literally didn't test it
> at all.

Comparing with the patch I sent earlier this morning, looks equivalent,
and fwiw, passes my initial qemu bootup, which is the first way of
detecting anything stupid going on.

So, Srikar, please try this patch out, as opposed to mine, you don't
have to first revert the commit in question.

^ permalink raw reply

* Re: [PATCH 00/18] powerpc/boot: 64bit little endian wrapper
From: Geoff Levand @ 2014-03-20 17:03 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: linuxppc-dev
In-Reply-To: <1395328213-19206-1-git-send-email-clg@fr.ibm.com>

Hi Cédric,

On Thu, 2014-03-20 at 16:09 +0100, Cédric Le Goater wrote:
>  16 files changed, 465 insertions(+), 100 deletions(-)

Do you have these in a git branch somewhere so I can merge and test
them?

Thanks.

-Geoff

^ permalink raw reply

* Re: Tasks stuck in futex code (in 3.14-rc6)
From: Linus Torvalds @ 2014-03-20 16:41 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Srikar Dronamraju, Peter Zijlstra, LKML, Paul Mackerras,
	Thomas Gleixner, Paul McKenney, ppc-dev, Ingo Molnar
In-Reply-To: <1395295019.2612.11.camel@buesod1.americas.hpqcorp.net>

[-- Attachment #1: Type: text/plain, Size: 1678 bytes --]

On Wed, Mar 19, 2014 at 10:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> This problem suggests that we missed a wakeup for a task that was adding
> itself to the queue in a wait path. And the only place that can happen
> is with the hb spinlock check for any pending waiters.

Ok, so thinking about hb_waiters_pending():

 - spin_is_locked() test
 - read barrier
 - plist_head_empty() test

seems to be broken after all.

The race is against futex_wait() that does

 - futex_wait_setup():
   - queue_lock()
   - get_futex_value_locked()
 - futex_wait_queue_me()
   - queue_me()
     - plist_add()

right?

It strikes me that the "spin_is_locked()" test has no barriers wrt the
writing of the new futex value on the wake path. And the read barrier
obviously does nothing wrt the write either. Or am I missing
something? So the write that actually released the futex might be
almost arbitrarily delayed on the waking side. So the waiting side may
not see the new value, even though the waker assumes it does due to
the ordering of it doing the write first.

So maybe we need a memory barrier in hb_waiters_pending() just to make
sure that the new value is written.

But at that point, I suspect that Davidlohrs original patch that just
had explicit waiting counts is just as well. The whole point with the
head empty test was to emulate that "do we have waiters" without
having to actually add the atomics, but a memory barrier is really no
worse.

The attached is a TOTALLY UNTESTED interdiff that adds back Davidlohrs
atomic count. It may be terminally broken, I literally didn't test it
at all.

Davidlohr, mind checking and correcting this?

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 3576 bytes --]

 kernel/futex.c | 53 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 10 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 44a1261cb9ff..08ec814ad9d2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -234,6 +234,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -253,22 +254,37 @@ static inline void futex_get_mm(union futex_key *key)
 	smp_mb__after_atomic_inc();
 }
 
-static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
 {
 #ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
 	/*
-	 * Tasks trying to enter the critical region are most likely
-	 * potential waiters that will be added to the plist. Ensure
-	 * that wakers won't miss to-be-slept tasks in the window between
-	 * the wait call and the actual plist_add.
+	 * Full barrier (A), see the ordering comment above.
 	 */
-	if (spin_is_locked(&hb->lock))
-		return true;
-	smp_rmb(); /* Make sure we check the lock state first */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+#endif
+}
 
-	return !plist_head_empty(&hb->chain);
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
 #else
-	return true;
+	return 1;
 #endif
 }
 
@@ -954,6 +970,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }
 
 /*
@@ -1257,7 +1274,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1600,6 +1619,17 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that is
+	 * waiting for the spinlock. This is safe as all queue_lock()
+	 * users end up calling queue_me(). Similarly, for housekeeping,
+	 * decrement the counter at queue_unlock() when some error has
+	 * occurred and we don't end up adding the task to the list.
+	 */
+	hb_waiters_inc(hb);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock); /* implies MB (A) */
@@ -1611,6 +1641,7 @@ queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	hb_waiters_dec(hb);
 }
 
 /**
@@ -2342,6 +2373,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 		 * Unqueue the futex_q and determine which it was.
 		 */
 		plist_del(&q->list, &hb->chain);
+		hb_waiters_dec(hb);
 
 		/* Handle spurious wakeups gracefully */
 		ret = -EWOULDBLOCK;
@@ -2875,6 +2907,7 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < futex_hashsize; i++) {
+		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox