LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCHv2 2/2] drivers/base: reorder consumer and its children behind suppliers
From: Greg Kroah-Hartman @ 2018-06-25 10:45 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-kernel, Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas,
	Dave Young, linux-pci, linuxppc-dev
In-Reply-To: <1529912859-10475-3-git-send-email-kernelfans@gmail.com>

On Mon, Jun 25, 2018 at 03:47:39PM +0800, Pingfan Liu wrote:
> commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> introduces supplier<-consumer order in devices_kset. The commit tries
> to cleverly maintain both parent<-child and supplier<-consumer order by
> reordering a device when probing. This method makes things simple and
> clean, but unfortunately, breaks parent<-child order in some case,
> which is described in next patch in this series.

There is no "next patch in this series" :(

> Here this patch tries to resolve supplier<-consumer by only reordering a
> device when it has suppliers, and takes care of the following scenario:
>     [consumer, children] [ ... potential ... ] supplier
>                          ^                   ^
> After moving the consumer and its children after the supplier, the
> potentail section may contain consumers whose supplier is inside
> children, and this poses the requirement to dry out all consumpers in
> the section recursively.
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Bjorn Helgaas <helgaas@kernel.org>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: linux-pci@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> ---
> note: there is lock issue in this patch, should be fixed in next version

Please send patches that you know are correct, why would I want to
review this if you know it is not correct?

And if the original commit is causing problems for you, why not just
revert that instead of adding this much-increased complexity?



> 
> ---
>  drivers/base/core.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 129 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 66f06ff..db30e86 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -123,12 +123,138 @@ static int device_is_dependent(struct device *dev, void *target)
>  	return ret;
>  }
>  
> -/* a temporary place holder to mark out the root cause of the bug.
> - * The proposal algorithm will come in next patch
> +struct pos_info {
> +	struct device *pos;
> +	struct device *tail;
> +};
> +
> +/* caller takes the devices_kset->list_lock */
> +static int descendants_reorder_after_pos(struct device *dev,
> +	void *data)

Why are you wrapping lines that do not need to be wrapped?

What does this function do?

> +{
> +	struct device *pos;
> +	struct pos_info *p = data;
> +
> +	pos = p->pos;
> +	pr_debug("devices_kset: Moving %s after %s\n",
> +		 dev_name(dev), dev_name(pos));

You have a device, use it for debugging, i.e. dev_dbg().

> +	device_for_each_child(dev, p, descendants_reorder_after_pos);

Recursive?

> +	/* children at the tail */
> +	list_move(&dev->kobj.entry, &pos->kobj.entry);
> +	/* record the right boundary of the section */
> +	if (p->tail == NULL)
> +		p->tail = dev;
> +	return 0;
> +}

I really do not understand what the above code is supposed to be doing :(

> +
> +/* iterate over an open section */
> +#define list_opensect_for_each_reverse(cur, left, right)	\
> +	for (cur = right->prev; cur == left; cur = cur->prev)
> +
> +static bool is_consumer(struct device *query, struct device *supplier)
> +{
> +	struct device_link *link;
> +	/* todo, lock protection */

Always run checkpatch.pl on patches so you do not get grumpy maintainers
telling you to run checkpatch.pl :(

> +	list_for_each_entry(link, &supplier->links.consumers, s_node)
> +		if (link->consumer == query)
> +			return true;
> +	return false;
> +}
> +
> +/* recursively move the potential consumers in open section (left, right)
> + * after the barrier

What barrier?

I'm stopping here as I have no idea what is going on, and this needs a
lot more work at the basic level of "it handles locking correctly"...

If you are working on this for power9, I'm guessing you work for IBM?
If so, please run this through your internal patch review process before
sending it out again...

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v8 3/6] module: use relative references for __ksymtab entries
From: Martijn Coenen @ 2018-06-25 10:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, Arnd Bergmann, Kees Cook, Will Deacon,
	Michael Ellerman, Thomas Garnier, Thomas Gleixner,
	Serge E. Hallyn, Bjorn Helgaas, Benjamin Herrenschmidt,
	Russell King, Paul Mackerras, Catalin Marinas, Petr Mladek,
	Ingo Molnar, James Morris, Andrew Morton, Nicolas Pitre,
	Josh Poimboeuf, Steven Rostedt, Sergey Senozhatsky,
	Linus Torvalds, Jessica Yu, LKML, linuxppc-dev,
	the arch/x86 maintainers, Ingo Molnar
In-Reply-To: <CAKv+Gu-M4vJfppfQ8PYv3hFR6a1FuOAEKpvyisYSZBe-ubMvHQ@mail.gmail.com>

On Mon, Jun 25, 2018 at 11:14 AM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> Because struct kernel_symbol is only 8 bytes in size after this
> change, and aligning a 8 byte quantity to 16 bytes wastes 8 bytes.

I get that, but then that means the 16-byte alignment wasn't actually
necessary in the first place.

> The x86 ABI may require it, but we don't actually adhere to it in the
> kernel. Also, these structures never occur on the stack anyway.

Ok, makes sense.

Thanks,
Martijn

>
>
>>> -#include <asm-generic/export.h>
>>> diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
>>> index 719db1968d81..97ce606459ae 100644
>>> --- a/include/asm-generic/export.h
>>> +++ b/include/asm-generic/export.h
>>> @@ -5,12 +5,10 @@
>>>  #define KSYM_FUNC(x) x
>>>  #endif
>>>  #ifdef CONFIG_64BIT
>>> -#define __put .quad
>>>  #ifndef KSYM_ALIGN
>>>  #define KSYM_ALIGN 8
>>>  #endif
>>>  #else
>>> -#define __put .long
>>>  #ifndef KSYM_ALIGN
>>>  #define KSYM_ALIGN 4
>>>  #endif
>>> @@ -25,6 +23,16 @@
>>>  #define KSYM(name) name
>>>  #endif
>>>
>>> +.macro __put, val, name
>>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>>> +       .long   \val - ., \name - .
>>> +#elif defined(CONFIG_64BIT)
>>> +       .quad   \val, \name
>>> +#else
>>> +       .long   \val, \name
>>> +#endif
>>> +.endm
>>> +
>>>  /*
>>>   * note on .section use: @progbits vs %progbits nastiness doesn't matter,
>>>   * since we immediately emit into those sections anyway.
>>> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
>>> index ab4711c63601..0a9328ea9dbd 100644
>>> --- a/include/linux/compiler.h
>>> +++ b/include/linux/compiler.h
>>> @@ -280,6 +280,25 @@ unsigned long read_word_at_a_time(const void *addr)
>>>
>>>  #endif /* __KERNEL__ */
>>>
>>> +/*
>>> + * Force the compiler to emit 'sym' as a symbol, so that we can reference
>>> + * it from inline assembler. Necessary in case 'sym' could be inlined
>>> + * otherwise, or eliminated entirely due to lack of references that are
>>> + * visible to the compiler.
>>> + */
>>> +#define __ADDRESSABLE(sym) \
>>> +       static void * const __attribute__((section(".discard"), used))  \
>>> +               __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
>>> +
>>> +/**
>>> + * offset_to_ptr - convert a relative memory offset to an absolute pointer
>>> + * @off:       the address of the 32-bit offset value
>>> + */
>>> +static inline void *offset_to_ptr(const int *off)
>>> +{
>>> +       return (void *)((unsigned long)off + *off);
>>> +}
>>> +
>>>  #endif /* __ASSEMBLY__ */
>>>
>>>  #ifndef __optimize
>>> diff --git a/include/linux/export.h b/include/linux/export.h
>>> index 25005b55b079..04c78e6bfec9 100644
>>> --- a/include/linux/export.h
>>> +++ b/include/linux/export.h
>>> @@ -24,12 +24,6 @@
>>>  #define VMLINUX_SYMBOL_STR(x) __VMLINUX_SYMBOL_STR(x)
>>>
>>>  #ifndef __ASSEMBLY__
>>> -struct kernel_symbol
>>> -{
>>> -       unsigned long value;
>>> -       const char *name;
>>> -};
>>> -
>>>  #ifdef MODULE
>>>  extern struct module __this_module;
>>>  #define THIS_MODULE (&__this_module)
>>> @@ -60,17 +54,47 @@ extern struct module __this_module;
>>>  #define __CRC_SYMBOL(sym, sec)
>>>  #endif
>>>
>>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>>> +#include <linux/compiler.h>
>>> +/*
>>> + * Emit the ksymtab entry as a pair of relative references: this reduces
>>> + * the size by half on 64-bit architectures, and eliminates the need for
>>> + * absolute relocations that require runtime processing on relocatable
>>> + * kernels.
>>> + */
>>> +#define __KSYMTAB_ENTRY(sym, sec)                                      \
>>> +       __ADDRESSABLE(sym)                                              \
>>> +       asm("   .section \"___ksymtab" sec "+" #sym "\", \"a\"  \n"     \
>>> +           "   .balign 8                                       \n"     \
>>> +           VMLINUX_SYMBOL_STR(__ksymtab_##sym) ":              \n"     \
>>> +           "   .long " VMLINUX_SYMBOL_STR(sym) "- .            \n"     \
>>> +           "   .long " VMLINUX_SYMBOL_STR(__kstrtab_##sym) "- .\n"     \
>>> +           "   .previous                                       \n")
>>> +
>>> +struct kernel_symbol {
>>> +       int value_offset;
>>> +       int name_offset;
>>> +};
>>> +#else
>>> +#define __KSYMTAB_ENTRY(sym, sec)                                      \
>>> +       static const struct kernel_symbol __ksymtab_##sym               \
>>> +       __attribute__((section("___ksymtab" sec "+" #sym), used))       \
>>> +       = { (unsigned long)&sym, __kstrtab_##sym }
>>> +
>>> +struct kernel_symbol {
>>> +       unsigned long value;
>>> +       const char *name;
>>> +};
>>> +#endif
>>> +
>>>  /* For every exported symbol, place a struct in the __ksymtab section */
>>>  #define ___EXPORT_SYMBOL(sym, sec)                                     \
>>>         extern typeof(sym) sym;                                         \
>>>         __CRC_SYMBOL(sym, sec)                                          \
>>>         static const char __kstrtab_##sym[]                             \
>>> -       __attribute__((section("__ksymtab_strings"), aligned(1)))       \
>>> +       __attribute__((section("__ksymtab_strings"), used, aligned(1))) \
>>>         = VMLINUX_SYMBOL_STR(sym);                                      \
>>> -       static const struct kernel_symbol __ksymtab_##sym               \
>>> -       __used                                                          \
>>> -       __attribute__((section("___ksymtab" sec "+" #sym), used))       \
>>> -       = { (unsigned long)&sym, __kstrtab_##sym }
>>> +       __KSYMTAB_ENTRY(sym, sec)
>>>
>>>  #if defined(__DISABLE_EXPORTS)
>>>
>>> diff --git a/kernel/module.c b/kernel/module.c
>>> index ad2d420024f6..b4782cfbb79b 100644
>>> --- a/kernel/module.c
>>> +++ b/kernel/module.c
>>> @@ -549,12 +549,30 @@ static bool check_symbol(const struct symsearch *syms,
>>>         return true;
>>>  }
>>>
>>> +static unsigned long kernel_symbol_value(const struct kernel_symbol *sym)
>>> +{
>>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>>> +       return (unsigned long)offset_to_ptr(&sym->value_offset);
>>> +#else
>>> +       return sym->value;
>>> +#endif
>>> +}
>>> +
>>> +static const char *kernel_symbol_name(const struct kernel_symbol *sym)
>>> +{
>>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>>> +       return offset_to_ptr(&sym->name_offset);
>>> +#else
>>> +       return sym->name;
>>> +#endif
>>> +}
>>> +
>>>  static int cmp_name(const void *va, const void *vb)
>>>  {
>>>         const char *a;
>>>         const struct kernel_symbol *b;
>>>         a = va; b = vb;
>>> -       return strcmp(a, b->name);
>>> +       return strcmp(a, kernel_symbol_name(b));
>>>  }
>>>
>>>  static bool find_symbol_in_section(const struct symsearch *syms,
>>> @@ -2198,7 +2216,7 @@ void *__symbol_get(const char *symbol)
>>>                 sym = NULL;
>>>         preempt_enable();
>>>
>>> -       return sym ? (void *)sym->value : NULL;
>>> +       return sym ? (void *)kernel_symbol_value(sym) : NULL;
>>>  }
>>>  EXPORT_SYMBOL_GPL(__symbol_get);
>>>
>>> @@ -2228,10 +2246,12 @@ static int verify_export_symbols(struct module *mod)
>>>
>>>         for (i = 0; i < ARRAY_SIZE(arr); i++) {
>>>                 for (s = arr[i].sym; s < arr[i].sym + arr[i].num; s++) {
>>> -                       if (find_symbol(s->name, &owner, NULL, true, false)) {
>>> +                       if (find_symbol(kernel_symbol_name(s), &owner, NULL,
>>> +                                       true, false)) {
>>>                                 pr_err("%s: exports duplicate symbol %s"
>>>                                        " (owned by %s)\n",
>>> -                                      mod->name, s->name, module_name(owner));
>>> +                                      mod->name, kernel_symbol_name(s),
>>> +                                      module_name(owner));
>>>                                 return -ENOEXEC;
>>>                         }
>>>                 }
>>> @@ -2280,7 +2300,7 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
>>>                         ksym = resolve_symbol_wait(mod, info, name);
>>>                         /* Ok if resolved.  */
>>>                         if (ksym && !IS_ERR(ksym)) {
>>> -                               sym[i].st_value = ksym->value;
>>> +                               sym[i].st_value = kernel_symbol_value(ksym);
>>>                                 break;
>>>                         }
>>>
>>> @@ -2540,7 +2560,7 @@ static int is_exported(const char *name, unsigned long value,
>>>                 ks = lookup_symbol(name, __start___ksymtab, __stop___ksymtab);
>>>         else
>>>                 ks = lookup_symbol(name, mod->syms, mod->syms + mod->num_syms);
>>> -       return ks != NULL && ks->value == value;
>>> +       return ks != NULL && kernel_symbol_value(ks) == value;
>>>  }
>>>
>>>  /* As per nm */
>>> --
>>> 2.15.1
>>>

^ permalink raw reply

* Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple
From: Joakim Tjernlund @ 2018-06-25 10:04 UTC (permalink / raw)
  To: linuxppc-dev@lists.ozlabs.org, qiang.zhao@nxp.com,
	york.sun@nxp.com
In-Reply-To: <VI1PR04MB1200BC4D8A8B362FE62B1D7591760@VI1PR04MB1200.eurprd04.prod.outlook.com>

T24gVGh1LCAyMDE4LTA2LTIxIGF0IDAyOjM4ICswMDAwLCBRaWFuZyBaaGFvIHdyb3RlOg0KPiAN
Cj4gT24gMDYvMTkvMjAxOCAwOToyMiBBTSwgSm9ha2ltIFRqZXJubHVuZCB3cm90ZToNCj4gLS0t
LS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogTGludXhwcGMtZGV2IFttYWlsdG86bGlu
dXhwcGMtZGV2LWJvdW5jZXMrcWlhbmcuemhhbz1ueHAuY29tQGxpc3RzLm96bGFicy5vcmddIE9u
IEJlaGFsZiBPZiBKb2FraW0gVGplcm5sdW5kDQo+IFNlbnQ6IDIwMTjlubQ25pyIMjDml6UgMDoy
Mg0KPiBUbzogWW9yayBTdW4gPHlvcmsuc3VuQG54cC5jb20+OyBsaW51eHBwYy1kZXYgPGxpbnV4
cHBjLWRldkBsaXN0cy5vemxhYnMub3JnPg0KPiBTdWJqZWN0OiBbUEFUQ0hdIFFFIEdQSU86IEFk
ZCBxZV9ncGlvX3NldF9tdWx0aXBsZQ0KPiANCj4gVGhpcyBjb3VzaW4gdG8gZ3Bpby1tcGM4eHh4
IHdhcyBsYWNraW5nIGEgbXVsdGlwbGUgcGlucyBtZXRob2QsIGFkZCBvbmUuDQo+IA0KPiBTaWdu
ZWQtb2ZmLWJ5OiBKb2FraW0gVGplcm5sdW5kIDxqb2FraW0udGplcm5sdW5kQGluZmluZXJhLmNv
bT4NCj4gLS0tDQo+ICBkcml2ZXJzL3NvYy9mc2wvcWUvZ3Bpby5jIHwgMjggKysrKysrKysrKysr
KysrKysrKysrKysrKysrKw0KPiAgMSBmaWxlIGNoYW5nZWQsIDI4IGluc2VydGlvbnMoKykNCj4g
DQo+IGRpZmYgLS1naXQgYS9kcml2ZXJzL3NvYy9mc2wvcWUvZ3Bpby5jIGIvZHJpdmVycy9zb2Mv
ZnNsL3FlL2dwaW8uYyBpbmRleCAzYjI3MDc1YzIxYTcuLjgxOWJlZDBmNTY2NyAxMDA2NDQNCj4g
LS0tIGEvZHJpdmVycy9zb2MvZnNsL3FlL2dwaW8uYw0KPiArKysgYi9kcml2ZXJzL3NvYy9mc2wv
cWUvZ3Bpby5jDQo+IEBAIC04Myw2ICs4MywzMyBAQCBzdGF0aWMgdm9pZCBxZV9ncGlvX3NldChz
dHJ1Y3QgZ3Bpb19jaGlwICpnYywgdW5zaWduZWQgaW50IGdwaW8sIGludCB2YWwpDQo+ICAgICAg
ICAgc3Bpbl91bmxvY2tfaXJxcmVzdG9yZSgmcWVfZ2MtPmxvY2ssIGZsYWdzKTsgIH0NCj4gDQo+
ICtzdGF0aWMgdm9pZCBxZV9ncGlvX3NldF9tdWx0aXBsZShzdHJ1Y3QgZ3Bpb19jaGlwICpnYywN
Cj4gKyAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgdW5zaWduZWQgbG9uZyAqbWFzaywg
dW5zaWduZWQgbG9uZyAqYml0cykgew0KPiArICAgICAgIHN0cnVjdCBvZl9tbV9ncGlvX2NoaXAg
Km1tX2djID0gdG9fb2ZfbW1fZ3Bpb19jaGlwKGdjKTsNCj4gKyAgICAgICBzdHJ1Y3QgcWVfZ3Bp
b19jaGlwICpxZV9nYyA9IGdwaW9jaGlwX2dldF9kYXRhKGdjKTsNCj4gKyAgICAgICBzdHJ1Y3Qg
cWVfcGlvX3JlZ3MgX19pb21lbSAqcmVncyA9IG1tX2djLT5yZWdzOw0KPiArICAgICAgIHVuc2ln
bmVkIGxvbmcgZmxhZ3M7DQo+ICsgICAgICAgaW50IGk7DQo+ICsNCj4gKyAgICAgICBzcGluX2xv
Y2tfaXJxc2F2ZSgmcWVfZ2MtPmxvY2ssIGZsYWdzKTsNCj4gKw0KPiArICAgICAgIGZvciAoaSA9
IDA7IGkgPCBnYy0+bmdwaW87IGkrKykgew0KPiArICAgICAgICAgICAgICAgaWYgKCptYXNrID09
IDApDQo+ICsgICAgICAgICAgICAgICAgICAgICAgIGJyZWFrOw0KPiArICAgICAgICAgICAgICAg
aWYgKF9fdGVzdF9hbmRfY2xlYXJfYml0KGksIG1hc2spKSB7DQo+ICsgICAgICAgICAgICAgICAg
ICAgICAgIGlmICh0ZXN0X2JpdChpLCBiaXRzKSkNCj4gKyAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICBxZV9nYy0+Y3BkYXRhIHw9ICgxVSA8PCAoUUVfUElPX1BJTlMgLSAxIC0gaSkpOw0K
PiArICAgICAgICAgICAgICAgICAgICAgICBlbHNlDQo+ICsgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgcWVfZ2MtPmNwZGF0YSAmPSB+KDFVIDw8IChRRV9QSU9fUElOUyAtIDEgLSBpKSk7
DQo+ICsgICAgICAgICAgICAgICB9DQo+ICsgICAgICAgfQ0KPiArDQo+ICsgICAgICAgb3V0X2Jl
MzIoJnJlZ3MtPmNwZGF0YSwgcWVfZ2MtPmNwZGF0YSk7DQo+ICsNCj4gKyAgICAgICBzcGluX3Vu
bG9ja19pcnFyZXN0b3JlKCZxZV9nYy0+bG9jaywgZmxhZ3MpOyB9DQo+ICsNCj4gIHN0YXRpYyBp
bnQgcWVfZ3Bpb19kaXJfaW4oc3RydWN0IGdwaW9fY2hpcCAqZ2MsIHVuc2lnbmVkIGludCBncGlv
KSAgew0KPiAgICAgICAgIHN0cnVjdCBvZl9tbV9ncGlvX2NoaXAgKm1tX2djID0gdG9fb2ZfbW1f
Z3Bpb19jaGlwKGdjKTsgQEAgLTI5OCw2ICszMjUsNyBAQCBzdGF0aWMgaW50IF9faW5pdCBxZV9h
ZGRfZ3Bpb2NoaXBzKHZvaWQpDQo+ICAgICAgICAgICAgICAgICBnYy0+ZGlyZWN0aW9uX291dHB1
dCA9IHFlX2dwaW9fZGlyX291dDsNCj4gICAgICAgICAgICAgICAgIGdjLT5nZXQgPSBxZV9ncGlv
X2dldDsNCj4gICAgICAgICAgICAgICAgIGdjLT5zZXQgPSBxZV9ncGlvX3NldDsNCj4gKyAgICAg
ICAgICAgICAgIGdjLT5zZXRfbXVsdGlwbGUgPSBxZV9ncGlvX3NldF9tdWx0aXBsZTsNCj4gDQo+
ICAgICAgICAgICAgICAgICByZXQgPSBvZl9tbV9ncGlvY2hpcF9hZGRfZGF0YShucCwgbW1fZ2Ms
IHFlX2djKTsNCj4gICAgICAgICAgICAgICAgIGlmIChyZXQpDQo+IA0KPiBSZXZpZXdlZC1ieTog
UWlhbmcgWmhhbyA8cWlhbmcuemhhb0BueHAuY29tPg0KPiANCg0KV2hvIHBpY2tzIHVwIHRoaXMg
cGF0Y2ggPyBJcyBpdCBxdWV1ZWQgc29tZXdoZXJlIGFscmVhZHk/DQoNCiAgSm9ja2U=

^ permalink raw reply

* Re: [PATCH v8 3/6] module: use relative references for __ksymtab entries
From: Ard Biesheuvel @ 2018-06-25  9:14 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: linux-arm-kernel, Arnd Bergmann, Kees Cook, Will Deacon,
	Michael Ellerman, Thomas Garnier, Thomas Gleixner,
	Serge E. Hallyn, Bjorn Helgaas, Benjamin Herrenschmidt,
	Russell King, Paul Mackerras, Catalin Marinas, Petr Mladek,
	Ingo Molnar, James Morris, Andrew Morton, Nicolas Pitre,
	Josh Poimboeuf, Steven Rostedt, Sergey Senozhatsky,
	Linus Torvalds, Jessica Yu, LKML, linuxppc-dev,
	the arch/x86 maintainers, Ingo Molnar
In-Reply-To: <CAB0TPYGd1jQzgc2q+fJZif4LVQTPnOrD4b6w0YSAzgAB0oL86g@mail.gmail.com>

On 25 June 2018 at 10:56, Martijn Coenen <maco@android.com> wrote:
> Hi Ard,
>
>> --- a/arch/x86/include/asm/export.h
>> +++ /dev/null
>> @@ -1,5 +0,0 @@
>> -/* SPDX-License-Identifier: GPL-2.0 */
>> -#ifdef CONFIG_64BIT
>> -#define KSYM_ALIGN 16
>> -#endif
>
> Why remove the 16-byte alignment here?

Because struct kernel_symbol is only 8 bytes in size after this
change, and aligning a 8 byte quantity to 16 bytes wastes 8 bytes.

> I was working on some other
> code and ran into this 16-byte alignment, but I'm not sure why it's
> needed on x86_64 in the first place. Possibly because the ABI requires
> the stack to be 16-byte aligned when using sse instructions?
>

The x86 ABI may require it, but we don't actually adhere to it in the
kernel. Also, these structures never occur on the stack anyway.


>> -#include <asm-generic/export.h>
>> diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
>> index 719db1968d81..97ce606459ae 100644
>> --- a/include/asm-generic/export.h
>> +++ b/include/asm-generic/export.h
>> @@ -5,12 +5,10 @@
>>  #define KSYM_FUNC(x) x
>>  #endif
>>  #ifdef CONFIG_64BIT
>> -#define __put .quad
>>  #ifndef KSYM_ALIGN
>>  #define KSYM_ALIGN 8
>>  #endif
>>  #else
>> -#define __put .long
>>  #ifndef KSYM_ALIGN
>>  #define KSYM_ALIGN 4
>>  #endif
>> @@ -25,6 +23,16 @@
>>  #define KSYM(name) name
>>  #endif
>>
>> +.macro __put, val, name
>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>> +       .long   \val - ., \name - .
>> +#elif defined(CONFIG_64BIT)
>> +       .quad   \val, \name
>> +#else
>> +       .long   \val, \name
>> +#endif
>> +.endm
>> +
>>  /*
>>   * note on .section use: @progbits vs %progbits nastiness doesn't matter,
>>   * since we immediately emit into those sections anyway.
>> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
>> index ab4711c63601..0a9328ea9dbd 100644
>> --- a/include/linux/compiler.h
>> +++ b/include/linux/compiler.h
>> @@ -280,6 +280,25 @@ unsigned long read_word_at_a_time(const void *addr)
>>
>>  #endif /* __KERNEL__ */
>>
>> +/*
>> + * Force the compiler to emit 'sym' as a symbol, so that we can reference
>> + * it from inline assembler. Necessary in case 'sym' could be inlined
>> + * otherwise, or eliminated entirely due to lack of references that are
>> + * visible to the compiler.
>> + */
>> +#define __ADDRESSABLE(sym) \
>> +       static void * const __attribute__((section(".discard"), used))  \
>> +               __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
>> +
>> +/**
>> + * offset_to_ptr - convert a relative memory offset to an absolute pointer
>> + * @off:       the address of the 32-bit offset value
>> + */
>> +static inline void *offset_to_ptr(const int *off)
>> +{
>> +       return (void *)((unsigned long)off + *off);
>> +}
>> +
>>  #endif /* __ASSEMBLY__ */
>>
>>  #ifndef __optimize
>> diff --git a/include/linux/export.h b/include/linux/export.h
>> index 25005b55b079..04c78e6bfec9 100644
>> --- a/include/linux/export.h
>> +++ b/include/linux/export.h
>> @@ -24,12 +24,6 @@
>>  #define VMLINUX_SYMBOL_STR(x) __VMLINUX_SYMBOL_STR(x)
>>
>>  #ifndef __ASSEMBLY__
>> -struct kernel_symbol
>> -{
>> -       unsigned long value;
>> -       const char *name;
>> -};
>> -
>>  #ifdef MODULE
>>  extern struct module __this_module;
>>  #define THIS_MODULE (&__this_module)
>> @@ -60,17 +54,47 @@ extern struct module __this_module;
>>  #define __CRC_SYMBOL(sym, sec)
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>> +#include <linux/compiler.h>
>> +/*
>> + * Emit the ksymtab entry as a pair of relative references: this reduces
>> + * the size by half on 64-bit architectures, and eliminates the need for
>> + * absolute relocations that require runtime processing on relocatable
>> + * kernels.
>> + */
>> +#define __KSYMTAB_ENTRY(sym, sec)                                      \
>> +       __ADDRESSABLE(sym)                                              \
>> +       asm("   .section \"___ksymtab" sec "+" #sym "\", \"a\"  \n"     \
>> +           "   .balign 8                                       \n"     \
>> +           VMLINUX_SYMBOL_STR(__ksymtab_##sym) ":              \n"     \
>> +           "   .long " VMLINUX_SYMBOL_STR(sym) "- .            \n"     \
>> +           "   .long " VMLINUX_SYMBOL_STR(__kstrtab_##sym) "- .\n"     \
>> +           "   .previous                                       \n")
>> +
>> +struct kernel_symbol {
>> +       int value_offset;
>> +       int name_offset;
>> +};
>> +#else
>> +#define __KSYMTAB_ENTRY(sym, sec)                                      \
>> +       static const struct kernel_symbol __ksymtab_##sym               \
>> +       __attribute__((section("___ksymtab" sec "+" #sym), used))       \
>> +       = { (unsigned long)&sym, __kstrtab_##sym }
>> +
>> +struct kernel_symbol {
>> +       unsigned long value;
>> +       const char *name;
>> +};
>> +#endif
>> +
>>  /* For every exported symbol, place a struct in the __ksymtab section */
>>  #define ___EXPORT_SYMBOL(sym, sec)                                     \
>>         extern typeof(sym) sym;                                         \
>>         __CRC_SYMBOL(sym, sec)                                          \
>>         static const char __kstrtab_##sym[]                             \
>> -       __attribute__((section("__ksymtab_strings"), aligned(1)))       \
>> +       __attribute__((section("__ksymtab_strings"), used, aligned(1))) \
>>         = VMLINUX_SYMBOL_STR(sym);                                      \
>> -       static const struct kernel_symbol __ksymtab_##sym               \
>> -       __used                                                          \
>> -       __attribute__((section("___ksymtab" sec "+" #sym), used))       \
>> -       = { (unsigned long)&sym, __kstrtab_##sym }
>> +       __KSYMTAB_ENTRY(sym, sec)
>>
>>  #if defined(__DISABLE_EXPORTS)
>>
>> diff --git a/kernel/module.c b/kernel/module.c
>> index ad2d420024f6..b4782cfbb79b 100644
>> --- a/kernel/module.c
>> +++ b/kernel/module.c
>> @@ -549,12 +549,30 @@ static bool check_symbol(const struct symsearch *syms,
>>         return true;
>>  }
>>
>> +static unsigned long kernel_symbol_value(const struct kernel_symbol *sym)
>> +{
>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>> +       return (unsigned long)offset_to_ptr(&sym->value_offset);
>> +#else
>> +       return sym->value;
>> +#endif
>> +}
>> +
>> +static const char *kernel_symbol_name(const struct kernel_symbol *sym)
>> +{
>> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
>> +       return offset_to_ptr(&sym->name_offset);
>> +#else
>> +       return sym->name;
>> +#endif
>> +}
>> +
>>  static int cmp_name(const void *va, const void *vb)
>>  {
>>         const char *a;
>>         const struct kernel_symbol *b;
>>         a = va; b = vb;
>> -       return strcmp(a, b->name);
>> +       return strcmp(a, kernel_symbol_name(b));
>>  }
>>
>>  static bool find_symbol_in_section(const struct symsearch *syms,
>> @@ -2198,7 +2216,7 @@ void *__symbol_get(const char *symbol)
>>                 sym = NULL;
>>         preempt_enable();
>>
>> -       return sym ? (void *)sym->value : NULL;
>> +       return sym ? (void *)kernel_symbol_value(sym) : NULL;
>>  }
>>  EXPORT_SYMBOL_GPL(__symbol_get);
>>
>> @@ -2228,10 +2246,12 @@ static int verify_export_symbols(struct module *mod)
>>
>>         for (i = 0; i < ARRAY_SIZE(arr); i++) {
>>                 for (s = arr[i].sym; s < arr[i].sym + arr[i].num; s++) {
>> -                       if (find_symbol(s->name, &owner, NULL, true, false)) {
>> +                       if (find_symbol(kernel_symbol_name(s), &owner, NULL,
>> +                                       true, false)) {
>>                                 pr_err("%s: exports duplicate symbol %s"
>>                                        " (owned by %s)\n",
>> -                                      mod->name, s->name, module_name(owner));
>> +                                      mod->name, kernel_symbol_name(s),
>> +                                      module_name(owner));
>>                                 return -ENOEXEC;
>>                         }
>>                 }
>> @@ -2280,7 +2300,7 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
>>                         ksym = resolve_symbol_wait(mod, info, name);
>>                         /* Ok if resolved.  */
>>                         if (ksym && !IS_ERR(ksym)) {
>> -                               sym[i].st_value = ksym->value;
>> +                               sym[i].st_value = kernel_symbol_value(ksym);
>>                                 break;
>>                         }
>>
>> @@ -2540,7 +2560,7 @@ static int is_exported(const char *name, unsigned long value,
>>                 ks = lookup_symbol(name, __start___ksymtab, __stop___ksymtab);
>>         else
>>                 ks = lookup_symbol(name, mod->syms, mod->syms + mod->num_syms);
>> -       return ks != NULL && ks->value == value;
>> +       return ks != NULL && kernel_symbol_value(ks) == value;
>>  }
>>
>>  /* As per nm */
>> --
>> 2.15.1
>>

^ permalink raw reply

* Re: [PATCH v8 3/6] module: use relative references for __ksymtab entries
From: Martijn Coenen @ 2018-06-25  8:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-arm-kernel, Arnd Bergmann, Kees Cook, Will Deacon,
	Michael Ellerman, Thomas Garnier, Thomas Gleixner,
	Serge E. Hallyn, Bjorn Helgaas, Benjamin Herrenschmidt,
	Russell King, Paul Mackerras, Catalin Marinas, Petr Mladek,
	Ingo Molnar, James Morris, Andrew Morton, Nicolas Pitre,
	Josh Poimboeuf, Steven Rostedt, Sergey Senozhatsky,
	Linus Torvalds, Jessica Yu, LKML, linuxppc-dev, x86, Ingo Molnar
In-Reply-To: <20180311123815.17916-4-ard.biesheuvel@linaro.org>

Hi Ard,

> --- a/arch/x86/include/asm/export.h
> +++ /dev/null
> @@ -1,5 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifdef CONFIG_64BIT
> -#define KSYM_ALIGN 16
> -#endif

Why remove the 16-byte alignment here? I was working on some other
code and ran into this 16-byte alignment, but I'm not sure why it's
needed on x86_64 in the first place. Possibly because the ABI requires
the stack to be 16-byte aligned when using sse instructions?

Thanks,
Martijn

> -#include <asm-generic/export.h>
> diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
> index 719db1968d81..97ce606459ae 100644
> --- a/include/asm-generic/export.h
> +++ b/include/asm-generic/export.h
> @@ -5,12 +5,10 @@
>  #define KSYM_FUNC(x) x
>  #endif
>  #ifdef CONFIG_64BIT
> -#define __put .quad
>  #ifndef KSYM_ALIGN
>  #define KSYM_ALIGN 8
>  #endif
>  #else
> -#define __put .long
>  #ifndef KSYM_ALIGN
>  #define KSYM_ALIGN 4
>  #endif
> @@ -25,6 +23,16 @@
>  #define KSYM(name) name
>  #endif
>
> +.macro __put, val, name
> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> +       .long   \val - ., \name - .
> +#elif defined(CONFIG_64BIT)
> +       .quad   \val, \name
> +#else
> +       .long   \val, \name
> +#endif
> +.endm
> +
>  /*
>   * note on .section use: @progbits vs %progbits nastiness doesn't matter,
>   * since we immediately emit into those sections anyway.
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index ab4711c63601..0a9328ea9dbd 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -280,6 +280,25 @@ unsigned long read_word_at_a_time(const void *addr)
>
>  #endif /* __KERNEL__ */
>
> +/*
> + * Force the compiler to emit 'sym' as a symbol, so that we can reference
> + * it from inline assembler. Necessary in case 'sym' could be inlined
> + * otherwise, or eliminated entirely due to lack of references that are
> + * visible to the compiler.
> + */
> +#define __ADDRESSABLE(sym) \
> +       static void * const __attribute__((section(".discard"), used))  \
> +               __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
> +
> +/**
> + * offset_to_ptr - convert a relative memory offset to an absolute pointer
> + * @off:       the address of the 32-bit offset value
> + */
> +static inline void *offset_to_ptr(const int *off)
> +{
> +       return (void *)((unsigned long)off + *off);
> +}
> +
>  #endif /* __ASSEMBLY__ */
>
>  #ifndef __optimize
> diff --git a/include/linux/export.h b/include/linux/export.h
> index 25005b55b079..04c78e6bfec9 100644
> --- a/include/linux/export.h
> +++ b/include/linux/export.h
> @@ -24,12 +24,6 @@
>  #define VMLINUX_SYMBOL_STR(x) __VMLINUX_SYMBOL_STR(x)
>
>  #ifndef __ASSEMBLY__
> -struct kernel_symbol
> -{
> -       unsigned long value;
> -       const char *name;
> -};
> -
>  #ifdef MODULE
>  extern struct module __this_module;
>  #define THIS_MODULE (&__this_module)
> @@ -60,17 +54,47 @@ extern struct module __this_module;
>  #define __CRC_SYMBOL(sym, sec)
>  #endif
>
> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> +#include <linux/compiler.h>
> +/*
> + * Emit the ksymtab entry as a pair of relative references: this reduces
> + * the size by half on 64-bit architectures, and eliminates the need for
> + * absolute relocations that require runtime processing on relocatable
> + * kernels.
> + */
> +#define __KSYMTAB_ENTRY(sym, sec)                                      \
> +       __ADDRESSABLE(sym)                                              \
> +       asm("   .section \"___ksymtab" sec "+" #sym "\", \"a\"  \n"     \
> +           "   .balign 8                                       \n"     \
> +           VMLINUX_SYMBOL_STR(__ksymtab_##sym) ":              \n"     \
> +           "   .long " VMLINUX_SYMBOL_STR(sym) "- .            \n"     \
> +           "   .long " VMLINUX_SYMBOL_STR(__kstrtab_##sym) "- .\n"     \
> +           "   .previous                                       \n")
> +
> +struct kernel_symbol {
> +       int value_offset;
> +       int name_offset;
> +};
> +#else
> +#define __KSYMTAB_ENTRY(sym, sec)                                      \
> +       static const struct kernel_symbol __ksymtab_##sym               \
> +       __attribute__((section("___ksymtab" sec "+" #sym), used))       \
> +       = { (unsigned long)&sym, __kstrtab_##sym }
> +
> +struct kernel_symbol {
> +       unsigned long value;
> +       const char *name;
> +};
> +#endif
> +
>  /* For every exported symbol, place a struct in the __ksymtab section */
>  #define ___EXPORT_SYMBOL(sym, sec)                                     \
>         extern typeof(sym) sym;                                         \
>         __CRC_SYMBOL(sym, sec)                                          \
>         static const char __kstrtab_##sym[]                             \
> -       __attribute__((section("__ksymtab_strings"), aligned(1)))       \
> +       __attribute__((section("__ksymtab_strings"), used, aligned(1))) \
>         = VMLINUX_SYMBOL_STR(sym);                                      \
> -       static const struct kernel_symbol __ksymtab_##sym               \
> -       __used                                                          \
> -       __attribute__((section("___ksymtab" sec "+" #sym), used))       \
> -       = { (unsigned long)&sym, __kstrtab_##sym }
> +       __KSYMTAB_ENTRY(sym, sec)
>
>  #if defined(__DISABLE_EXPORTS)
>
> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..b4782cfbb79b 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> @@ -549,12 +549,30 @@ static bool check_symbol(const struct symsearch *syms,
>         return true;
>  }
>
> +static unsigned long kernel_symbol_value(const struct kernel_symbol *sym)
> +{
> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> +       return (unsigned long)offset_to_ptr(&sym->value_offset);
> +#else
> +       return sym->value;
> +#endif
> +}
> +
> +static const char *kernel_symbol_name(const struct kernel_symbol *sym)
> +{
> +#ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
> +       return offset_to_ptr(&sym->name_offset);
> +#else
> +       return sym->name;
> +#endif
> +}
> +
>  static int cmp_name(const void *va, const void *vb)
>  {
>         const char *a;
>         const struct kernel_symbol *b;
>         a = va; b = vb;
> -       return strcmp(a, b->name);
> +       return strcmp(a, kernel_symbol_name(b));
>  }
>
>  static bool find_symbol_in_section(const struct symsearch *syms,
> @@ -2198,7 +2216,7 @@ void *__symbol_get(const char *symbol)
>                 sym = NULL;
>         preempt_enable();
>
> -       return sym ? (void *)sym->value : NULL;
> +       return sym ? (void *)kernel_symbol_value(sym) : NULL;
>  }
>  EXPORT_SYMBOL_GPL(__symbol_get);
>
> @@ -2228,10 +2246,12 @@ static int verify_export_symbols(struct module *mod)
>
>         for (i = 0; i < ARRAY_SIZE(arr); i++) {
>                 for (s = arr[i].sym; s < arr[i].sym + arr[i].num; s++) {
> -                       if (find_symbol(s->name, &owner, NULL, true, false)) {
> +                       if (find_symbol(kernel_symbol_name(s), &owner, NULL,
> +                                       true, false)) {
>                                 pr_err("%s: exports duplicate symbol %s"
>                                        " (owned by %s)\n",
> -                                      mod->name, s->name, module_name(owner));
> +                                      mod->name, kernel_symbol_name(s),
> +                                      module_name(owner));
>                                 return -ENOEXEC;
>                         }
>                 }
> @@ -2280,7 +2300,7 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
>                         ksym = resolve_symbol_wait(mod, info, name);
>                         /* Ok if resolved.  */
>                         if (ksym && !IS_ERR(ksym)) {
> -                               sym[i].st_value = ksym->value;
> +                               sym[i].st_value = kernel_symbol_value(ksym);
>                                 break;
>                         }
>
> @@ -2540,7 +2560,7 @@ static int is_exported(const char *name, unsigned long value,
>                 ks = lookup_symbol(name, __start___ksymtab, __stop___ksymtab);
>         else
>                 ks = lookup_symbol(name, mod->syms, mod->syms + mod->num_syms);
> -       return ks != NULL && ks->value == value;
> +       return ks != NULL && kernel_symbol_value(ks) == value;
>  }
>
>  /* As per nm */
> --
> 2.15.1
>

^ permalink raw reply

* [PATCH] powerpc/mm/32: Fix pgtable_page_dtor call
From: Aneesh Kumar K.V @ 2018-06-25  8:15 UTC (permalink / raw)
  To: npiggin, benh, paulus, mpe, Christophe Leroy
  Cc: linuxppc-dev, Aneesh Kumar K.V

Commit 667416f38554 ("powerpc/mm: Fix kernel crash on page table free")
added a call for pgtable_page_dtor in the rcu page table free routine. We missed
the fact that for 32 bit platforms we did call the 'dtor' early. Drop the extra
call for pgtable_page_dtor. We remove the call from __pte_free_tlb so that we
do the page table free and 'dtor' call together. This should help when we
switch these platforms to pte fragments.

Fixes: 667416f38554 ("powerpc/mm: Fix kernel crash on page table free")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 1 -
 arch/powerpc/include/asm/nohash/32/pgalloc.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index e4633803fe43..82e44b1a00ae 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -138,7 +138,6 @@ static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_BOOK3S_32_PGALLOC_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index 9de40eb614da..8825953c225b 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -140,7 +140,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
 {
 	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(table);
 	pgtable_free_tlb(tlb, page_address(table), 0);
 }
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
-- 
2.17.1

^ permalink raw reply related

* [PATCHv2 2/2] drivers/base: reorder consumer and its children behind suppliers
From: Pingfan Liu @ 2018-06-25  7:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev
In-Reply-To: <1529912859-10475-1-git-send-email-kernelfans@gmail.com>

commit 52cdbdd49853 ("driver core: correct device's shutdown order")
introduces supplier<-consumer order in devices_kset. The commit tries
to cleverly maintain both parent<-child and supplier<-consumer order by
reordering a device when probing. This method makes things simple and
clean, but unfortunately, breaks parent<-child order in some case,
which is described in next patch in this series.
Here this patch tries to resolve supplier<-consumer by only reordering a
device when it has suppliers, and takes care of the following scenario:
    [consumer, children] [ ... potential ... ] supplier
                         ^                   ^
After moving the consumer and its children after the supplier, the
potentail section may contain consumers whose supplier is inside
children, and this poses the requirement to dry out all consumpers in
the section recursively.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
note: there is lock issue in this patch, should be fixed in next version

---
 drivers/base/core.c | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 129 insertions(+), 3 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 66f06ff..db30e86 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -123,12 +123,138 @@ static int device_is_dependent(struct device *dev, void *target)
 	return ret;
 }
 
-/* a temporary place holder to mark out the root cause of the bug.
- * The proposal algorithm will come in next patch
+struct pos_info {
+	struct device *pos;
+	struct device *tail;
+};
+
+/* caller takes the devices_kset->list_lock */
+static int descendants_reorder_after_pos(struct device *dev,
+	void *data)
+{
+	struct device *pos;
+	struct pos_info *p = data;
+
+	pos = p->pos;
+	pr_debug("devices_kset: Moving %s after %s\n",
+		 dev_name(dev), dev_name(pos));
+	device_for_each_child(dev, p, descendants_reorder_after_pos);
+	/* children at the tail */
+	list_move(&dev->kobj.entry, &pos->kobj.entry);
+	/* record the right boundary of the section */
+	if (p->tail == NULL)
+		p->tail = dev;
+	return 0;
+}
+
+/* iterate over an open section */
+#define list_opensect_for_each_reverse(cur, left, right)	\
+	for (cur = right->prev; cur == left; cur = cur->prev)
+
+static bool is_consumer(struct device *query, struct device *supplier)
+{
+	struct device_link *link;
+	/* todo, lock protection */
+	list_for_each_entry(link, &supplier->links.consumers, s_node)
+		if (link->consumer == query)
+			return true;
+	return false;
+}
+
+/* recursively move the potential consumers in open section (left, right)
+ * after the barrier
+ */
+static int __device_reorder_consumer(struct device *consumer,
+	struct list_head *left, struct list_head *right,
+	struct pos_info *p)
+{
+	struct list_head *iter;
+	struct device *c_dev, *s_dev, *tail_dev;
+
+	descendants_reorder_after_pos(consumer, p);
+	tail_dev = p->tail;
+	/* (left, right) may contain consumers, hence checking if any moved
+	 * child serving as supplier. The reversing order help us to meet
+	 * the last supplier of a consumer.
+	 */
+	list_opensect_for_each_reverse(iter, left, right) {
+		struct list_head *l_iter, *moved_left, *moved_right;
+
+		moved_left = (&consumer->kobj.entry)->prev;
+		moved_right = tail_dev->kobj.entry.next;
+		/* the moved section may contain potential suppliers */
+		list_opensect_for_each_reverse(l_iter, moved_left,
+			moved_right) {
+			s_dev = list_entry(l_iter, struct device, kobj.entry);
+			c_dev = list_entry(iter, struct device, kobj.entry);
+			/* to fix: this poses extra effort for locking */
+			if (is_consumer(c_dev, s_dev)) {
+				p->tail = NULL;
+				/* to fix: lock issue */
+				p->pos =  s_dev;
+				/* reorder after the last supplier */
+				__device_reorder_consumer(c_dev,
+					l_iter, right, p);
+			}
+		}
+	}
+	return 0;
+}
+
+static int find_last_supplier(struct device *dev, struct device *supplier)
+{
+	struct device_link *link;
+
+	list_for_each_entry_reverse(link, &dev->links.suppliers, c_node) {
+		if (link->supplier == supplier)
+			return 1;
+	}
+	if (dev == supplier)
+		return -1;
+	return 0;
+}
+
+/* When reodering, take care of the range of (old_pos(dev), new_pos(dev)),
+ * there may be requirement to recursively move item.
  */
 int device_reorder_consumer(struct device *dev)
 {
-	devices_kset_move_last(dev);
+	struct list_head *iter, *left, *right;
+	struct device *cur_dev;
+	struct pos_info info;
+	int ret, idx;
+
+	idx = device_links_read_lock();
+	if (list_empty(&dev->links.suppliers)) {
+		device_links_read_unlock(idx);
+		return 0;
+	}
+	spin_lock(&devices_kset->list_lock);
+	list_for_each_prev(iter, &devices_kset->list) {
+		cur_dev = list_entry(iter, struct device, kobj.entry);
+		ret = find_last_supplier(dev, cur_dev);
+		switch (ret) {
+		case -1:
+			goto unlock;
+		case 1:
+			break;
+		case 0:
+			continue;
+		}
+	}
+	BUG_ON(!ret);
+
+	/* record the affected open section */
+	left = dev->kobj.entry.prev;
+	right = iter;
+	info.pos = list_entry(iter, struct device, kobj.entry);
+	info.tail = NULL;
+	/* dry out the consumers in (left,right) */
+	__device_reorder_consumer(dev, left, right, &info);
+
+unlock:
+	spin_unlock(&devices_kset->list_lock);
+	device_links_read_unlock(idx);
 	return 0;
 }
 
-- 
2.7.4

^ permalink raw reply related

* [PATCHv2 1/2] drivers/base: only reordering consumer device when probing
From: Pingfan Liu @ 2018-06-25  7:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev
In-Reply-To: <1529912859-10475-1-git-send-email-kernelfans@gmail.com>

commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge. This will break the
parent<-children order and cause failure when "kexec -e" in some scenario.

The detailed description of the scenario:
An IBM Power9 machine on which, two drivers portdrv_pci and shpchp(a mod)
match the PCI_CLASS_BRIDGE_PCI, but neither of them success to probe due
to some issue. For this case, the bridge is moved after its children in
devices_kset. Then, when "kexec -e", a ata-disk behind the bridge can not
write back buffer in flight due to the former shutdown of the bridge which
clears the BusMaster bit.

To fix this issue, only reordering a device and all of its children
if it is a consumer.

Note, the bridge involved:
0004:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        NUMA node: 0
        Bus: primary=00, secondary=01, subordinate=12, sec-latency=0
        I/O behind bridge: 00000000-00000fff
        Memory behind bridge: 80000000-ffefffff
        Prefetchable memory behind bridge: 0006024000000000-0006027f7fffffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd+
                DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis+, LTR-, OBFF Disabled ARIFwd+
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP+ FCP- CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq- ACSViol-
                UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [148 v1] #19

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
note: this patch points out the code where the bug is introduced.
and I hope it sketches out the scene. Will fold the series in one.

---
 drivers/base/base.h | 1 +
 drivers/base/core.c | 9 +++++++++
 drivers/base/dd.c   | 9 ++-------
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index a75c302..37f86ca 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -135,6 +135,7 @@ extern void device_unblock_probing(void);
 
 /* /sys/devices directory */
 extern struct kset *devices_kset;
+extern int device_reorder_consumer(struct device *dev);
 extern void devices_kset_move_last(struct device *dev);
 
 #if defined(CONFIG_MODULES) && defined(CONFIG_SYSFS)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 36622b5..66f06ff 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -123,6 +123,15 @@ static int device_is_dependent(struct device *dev, void *target)
 	return ret;
 }
 
+/* a temporary place holder to mark out the root cause of the bug.
+ * The proposal algorithm will come in next patch
+ */
+int device_reorder_consumer(struct device *dev)
+{
+	devices_kset_move_last(dev);
+	return 0;
+}
+
 static int device_reorder_to_tail(struct device *dev, void *not_used)
 {
 	struct device_link *link;
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 1435d72..c74f23c 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -434,13 +434,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 			goto probe_failed;
 	}
 
-	/*
-	 * Ensure devices are listed in devices_kset in correct order
-	 * It's important to move Dev to the end of devices_kset before
-	 * calling .probe, because it could be recursive and parent Dev
-	 * should always go first
-	 */
-	devices_kset_move_last(dev);
+	/* only reoder consumer and its children after suppliers.*/
+	device_reorder_consumer(dev);
 
 	if (dev->bus->probe) {
 		ret = dev->bus->probe(dev);
-- 
2.7.4

^ permalink raw reply related

* [PATCHv2 0/2] drivers/base: bugfix for supplier<-consumer ordering in device_kset
From: Pingfan Liu @ 2018-06-25  7:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev

commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge.
This will break the parent<-children order and cause failure when
"kexec -e" in some scenario.

I tried to fix this issue in pci subsystem, and it turns out to be wrong.
Thanks to Christoph Hellwig, he enlightens me that it should be a bug in
driver core.

note: This series has some lock issue, should be fixed in next version

v1 -> v2:
  refragment

Pingfan Liu (2):
  drivers/base: only reordering consumer device when probing
  drivers/base: reorder consumer and its children behind suppliers

 drivers/base/base.h |   1 +
 drivers/base/core.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/base/dd.c   |   9 +---
 3 files changed, 138 insertions(+), 7 deletions(-)

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 1/3] drivers/base: introduce some help routines for reordering a group of dev
From: Pingfan Liu @ 2018-06-25  7:08 UTC (permalink / raw)
  To: gregkh
  Cc: linux-kernel, grygorii.strashko, Christoph Hellwig, helgaas,
	Dave Young, linux-pci, linuxppc-dev
In-Reply-To: <20180625064110.GA4233@kroah.com>

On Mon, Jun 25, 2018 at 2:41 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Mon, Jun 25, 2018 at 01:23:05PM +0800, Pingfan Liu wrote:
> > This patch introduce some help routines used by next patch. It aims to
> > ease reviewing, while the next patch will concentrate on algorithm.
> >
> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Bjorn Helgaas <helgaas@kernel.org>
> > Cc: Dave Young <dyoung@redhat.com>
> > Cc: linux-pci@vger.kernel.org
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > ---
> >  drivers/base/core.c | 38 ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 38 insertions(+)
> >
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 36622b5..8113d2c 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -123,6 +123,44 @@ static int device_is_dependent(struct device *dev, void *target)
> >       return ret;
> >  }
> >
> > +struct pos_info {
> > +     struct device *pos;
> > +     struct device *tail;
> > +};
> > +
> > +/* caller takes the devices_kset->list_lock */
> > +static int descendants_reorder_after_pos(struct device *dev,
> > +     void *data)
> > +{
> > +     struct device *pos;
> > +     struct pos_info *p = data;
> > +
> > +     pos = p->pos;
> > +     pr_debug("devices_kset: Moving %s after %s\n",
> > +              dev_name(dev), dev_name(pos));
> > +     device_for_each_child(dev, p, descendants_reorder_after_pos);
> > +     /* children at the tail */
> > +     list_move(&dev->kobj.entry, &pos->kobj.entry);
> > +     /* record the right boundary of the section */
> > +     if (p->tail == NULL)
> > +             p->tail = dev;
> > +     return 0;
> > +}
> > +
> > +/* iterate over an open section */
> > +#define list_opensect_for_each_reverse(cur, left, right)     \
> > +     for (cur = right->prev; cur == left; cur = cur->prev)
> > +
> > +static bool is_consumer(struct device *query, struct device *supplier)
> > +{
> > +     struct device_link *link;
> > +     /* todo, lock protection */
> > +     list_for_each_entry(link, &supplier->links.consumers, s_node)
> > +             if (link->consumer == query)
> > +                     return true;
> > +     return false;
> > +}
>
> You are adding code that no one uses yet, making this impossible to
> review as I don't know what to expect.  I shouldn't have to read the
> second patch and have to flip back and forth to try to figure it out :(
>
OK, I had thought the less code in [2/3] will ease the reviewing

> sorry, please break this series up in a better way to make it simpler to
> review.
>
OK.

Regards,
Pingfan

^ permalink raw reply

* Re: [PATCH 1/3] drivers/base: introduce some help routines for reordering a group of dev
From: Greg Kroah-Hartman @ 2018-06-25  6:41 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: linux-kernel, Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas,
	Dave Young, linux-pci, linuxppc-dev
In-Reply-To: <1529904187-18673-2-git-send-email-kernelfans@gmail.com>

On Mon, Jun 25, 2018 at 01:23:05PM +0800, Pingfan Liu wrote:
> This patch introduce some help routines used by next patch. It aims to
> ease reviewing, while the next patch will concentrate on algorithm.
> 
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Bjorn Helgaas <helgaas@kernel.org>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: linux-pci@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> ---
>  drivers/base/core.c | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 36622b5..8113d2c 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -123,6 +123,44 @@ static int device_is_dependent(struct device *dev, void *target)
>  	return ret;
>  }
>  
> +struct pos_info {
> +	struct device *pos;
> +	struct device *tail;
> +};
> +
> +/* caller takes the devices_kset->list_lock */
> +static int descendants_reorder_after_pos(struct device *dev,
> +	void *data)
> +{
> +	struct device *pos;
> +	struct pos_info *p = data;
> +
> +	pos = p->pos;
> +	pr_debug("devices_kset: Moving %s after %s\n",
> +		 dev_name(dev), dev_name(pos));
> +	device_for_each_child(dev, p, descendants_reorder_after_pos);
> +	/* children at the tail */
> +	list_move(&dev->kobj.entry, &pos->kobj.entry);
> +	/* record the right boundary of the section */
> +	if (p->tail == NULL)
> +		p->tail = dev;
> +	return 0;
> +}
> +
> +/* iterate over an open section */
> +#define list_opensect_for_each_reverse(cur, left, right)	\
> +	for (cur = right->prev; cur == left; cur = cur->prev)
> +
> +static bool is_consumer(struct device *query, struct device *supplier)
> +{
> +	struct device_link *link;
> +	/* todo, lock protection */
> +	list_for_each_entry(link, &supplier->links.consumers, s_node)
> +		if (link->consumer == query)
> +			return true;
> +	return false;
> +}

You are adding code that no one uses yet, making this impossible to
review as I don't know what to expect.  I shouldn't have to read the
second patch and have to flip back and forth to try to figure it out :(

sorry, please break this series up in a better way to make it simpler to
review.

greg k-h

^ permalink raw reply

* Re: [PATCH v2 2/2] powerpc: Document issues with TM on POWER9
From: Stewart Smith @ 2018-06-25  6:04 UTC (permalink / raw)
  To: Michael Neuling, mpe; +Cc: mikey, linuxppc-dev
In-Reply-To: <20180625013456.16157-2-mikey@neuling.org>

Michael Neuling <mikey@neuling.org> writes:
> Signed-off-by: Michael Neuling <mikey@neuling.org>

Acked-by: Stewart Smith <stewart@linux.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply

* [PATCH 3/3] drivers/base: only reordering consumer device when probing
From: Pingfan Liu @ 2018-06-25  5:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev
In-Reply-To: <1529904187-18673-1-git-send-email-kernelfans@gmail.com>

commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge. This will break the
parent<-children order and cause failure when "kexec -e" in some scenario.

The detailed description of the scenario:
An IBM Power9 machine on which, two drivers portdrv_pci and shpchp(a mod)
match the PCI_CLASS_BRIDGE_PCI, but neither of them success to probe due
to some issue. For this case, the bridge is moved after its children in
devices_kset. Then, when "kexec -e", a ata-disk behind the bridge can not
write back buffer in flight due to the former shutdown of the bridge which
clears the BusMaster bit.

To fix this issue, only reordering a device and all of its children
if it is a consumer.

Note, the bridge involved:
0004:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        NUMA node: 0
        Bus: primary=00, secondary=01, subordinate=12, sec-latency=0
        I/O behind bridge: 00000000-00000fff
        Memory behind bridge: 80000000-ffefffff
        Prefetchable memory behind bridge: 0006024000000000-0006027f7fffffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
        BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [48] Express (v2) Root Port (Slot-), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
                RootCap: CRSVisible-
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd+
                DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis+, LTR-, OBFF Disabled ARIFwd+
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP+ FCP- CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq- ACSViol-
                UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
        Capabilities: [148 v1] #19

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
 drivers/base/dd.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 1435d72..c74f23c 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -434,13 +434,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 			goto probe_failed;
 	}
 
-	/*
-	 * Ensure devices are listed in devices_kset in correct order
-	 * It's important to move Dev to the end of devices_kset before
-	 * calling .probe, because it could be recursive and parent Dev
-	 * should always go first
-	 */
-	devices_kset_move_last(dev);
+	/* only reoder consumer and its children after suppliers.*/
+	device_reorder_consumer(dev);
 
 	if (dev->bus->probe) {
 		ret = dev->bus->probe(dev);
-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/3] drivers/base: reorder consumer and its children behind suppliers
From: Pingfan Liu @ 2018-06-25  5:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev
In-Reply-To: <1529904187-18673-1-git-send-email-kernelfans@gmail.com>

commit 52cdbdd49853 ("driver core: correct device's shutdown order")
introduces supplier<-consumer order in devices_kset. The commit tries
to cleverly maintain both parent<-child and supplier<-consumer order by
reordering a device when probing. This method makes things simple and
clean, but unfortunately, breaks parent<-child order in some case,
which is described in next patch in this series.
Here this patch tries to resolve supplier<-consumer by only reordering a
device when it has suppliers, and takes care of the following scenario:
    [consumer, children] [ ... potential ... ] supplier
                         ^                   ^
After moving the consumer and its children after the supplier, the
potentail section may contain consumers whose supplier is inside
children, and this poses the requirement to dry out all consumpers in
the section recursively.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
 drivers/base/base.h |  1 +
 drivers/base/core.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index a75c302..37f86ca 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -135,6 +135,7 @@ extern void device_unblock_probing(void);
 
 /* /sys/devices directory */
 extern struct kset *devices_kset;
+extern int device_reorder_consumer(struct device *dev);
 extern void devices_kset_move_last(struct device *dev);
 
 #if defined(CONFIG_MODULES) && defined(CONFIG_SYSFS)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 8113d2c..db30e86 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -161,6 +161,103 @@ static bool is_consumer(struct device *query, struct device *supplier)
 	return false;
 }
 
+/* recursively move the potential consumers in open section (left, right)
+ * after the barrier
+ */
+static int __device_reorder_consumer(struct device *consumer,
+	struct list_head *left, struct list_head *right,
+	struct pos_info *p)
+{
+	struct list_head *iter;
+	struct device *c_dev, *s_dev, *tail_dev;
+
+	descendants_reorder_after_pos(consumer, p);
+	tail_dev = p->tail;
+	/* (left, right) may contain consumers, hence checking if any moved
+	 * child serving as supplier. The reversing order help us to meet
+	 * the last supplier of a consumer.
+	 */
+	list_opensect_for_each_reverse(iter, left, right) {
+		struct list_head *l_iter, *moved_left, *moved_right;
+
+		moved_left = (&consumer->kobj.entry)->prev;
+		moved_right = tail_dev->kobj.entry.next;
+		/* the moved section may contain potential suppliers */
+		list_opensect_for_each_reverse(l_iter, moved_left,
+			moved_right) {
+			s_dev = list_entry(l_iter, struct device, kobj.entry);
+			c_dev = list_entry(iter, struct device, kobj.entry);
+			/* to fix: this poses extra effort for locking */
+			if (is_consumer(c_dev, s_dev)) {
+				p->tail = NULL;
+				/* to fix: lock issue */
+				p->pos =  s_dev;
+				/* reorder after the last supplier */
+				__device_reorder_consumer(c_dev,
+					l_iter, right, p);
+			}
+		}
+	}
+	return 0;
+}
+
+static int find_last_supplier(struct device *dev, struct device *supplier)
+{
+	struct device_link *link;
+
+	list_for_each_entry_reverse(link, &dev->links.suppliers, c_node) {
+		if (link->supplier == supplier)
+			return 1;
+	}
+	if (dev == supplier)
+		return -1;
+	return 0;
+}
+
+/* When reodering, take care of the range of (old_pos(dev), new_pos(dev)),
+ * there may be requirement to recursively move item.
+ */
+int device_reorder_consumer(struct device *dev)
+{
+	struct list_head *iter, *left, *right;
+	struct device *cur_dev;
+	struct pos_info info;
+	int ret, idx;
+
+	idx = device_links_read_lock();
+	if (list_empty(&dev->links.suppliers)) {
+		device_links_read_unlock(idx);
+		return 0;
+	}
+	spin_lock(&devices_kset->list_lock);
+	list_for_each_prev(iter, &devices_kset->list) {
+		cur_dev = list_entry(iter, struct device, kobj.entry);
+		ret = find_last_supplier(dev, cur_dev);
+		switch (ret) {
+		case -1:
+			goto unlock;
+		case 1:
+			break;
+		case 0:
+			continue;
+		}
+	}
+	BUG_ON(!ret);
+
+	/* record the affected open section */
+	left = dev->kobj.entry.prev;
+	right = iter;
+	info.pos = list_entry(iter, struct device, kobj.entry);
+	info.tail = NULL;
+	/* dry out the consumers in (left,right) */
+	__device_reorder_consumer(dev, left, right, &info);
+
+unlock:
+	spin_unlock(&devices_kset->list_lock);
+	device_links_read_unlock(idx);
+	return 0;
+}
+
 static int device_reorder_to_tail(struct device *dev, void *not_used)
 {
 	struct device_link *link;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 1/3] drivers/base: introduce some help routines for reordering a group of dev
From: Pingfan Liu @ 2018-06-25  5:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev
In-Reply-To: <1529904187-18673-1-git-send-email-kernelfans@gmail.com>

This patch introduce some help routines used by next patch. It aims to
ease reviewing, while the next patch will concentrate on algorithm.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
---
 drivers/base/core.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 36622b5..8113d2c 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -123,6 +123,44 @@ static int device_is_dependent(struct device *dev, void *target)
 	return ret;
 }
 
+struct pos_info {
+	struct device *pos;
+	struct device *tail;
+};
+
+/* caller takes the devices_kset->list_lock */
+static int descendants_reorder_after_pos(struct device *dev,
+	void *data)
+{
+	struct device *pos;
+	struct pos_info *p = data;
+
+	pos = p->pos;
+	pr_debug("devices_kset: Moving %s after %s\n",
+		 dev_name(dev), dev_name(pos));
+	device_for_each_child(dev, p, descendants_reorder_after_pos);
+	/* children at the tail */
+	list_move(&dev->kobj.entry, &pos->kobj.entry);
+	/* record the right boundary of the section */
+	if (p->tail == NULL)
+		p->tail = dev;
+	return 0;
+}
+
+/* iterate over an open section */
+#define list_opensect_for_each_reverse(cur, left, right)	\
+	for (cur = right->prev; cur == left; cur = cur->prev)
+
+static bool is_consumer(struct device *query, struct device *supplier)
+{
+	struct device_link *link;
+	/* todo, lock protection */
+	list_for_each_entry(link, &supplier->links.consumers, s_node)
+		if (link->consumer == query)
+			return true;
+	return false;
+}
+
 static int device_reorder_to_tail(struct device *dev, void *not_used)
 {
 	struct device_link *link;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 0/3] drivers/base: bugfix for supplier<-consumer ordering in device_kset
From: Pingfan Liu @ 2018-06-25  5:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pingfan Liu, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev

commit 52cdbdd49853 ("driver core: correct device's shutdown order")
places an assumption of supplier<-consumer order on the process of probe.
But it turns out to break down the parent <- child order in some scene.
E.g in pci, a bridge is enabled by pci core, and behind it, the devices
have been probed. Then comes the bridge's module, which enables extra
feature(such as hotplug) on this bridge.
This will break the parent<-children order and cause failure when
"kexec -e" in some scenario.

I tried to fix this issue in pci subsystem, and it turns out to be wrong.
Thanks to Christoph Hellwig, he enlightens me that it should be a bug in
driver core.

To ease the review, I organize the patch as the following
[3/3] reflects the root cause of this bug. if [2/3] is not acceptable,
we still need some way to fix it.
[2/3] introduce a algorithm to reorder device
[1/3] some trivial help routine

Pingfan Liu (3):
  drivers/base: introduce some help routines for reordering a group of
    dev
  drivers/base: reorder consumer and its children behind suppliers
  drivers/base: only reordering consumer device when probing

 drivers/base/base.h |   1 +
 drivers/base/core.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/base/dd.c   |   9 +---
 3 files changed, 138 insertions(+), 7 deletions(-)

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
-- 
2.7.4

^ permalink raw reply

* Re: [PATCH 7/7] powerpc/powernv/pci: Don't use the lower 4G TCEs in
From: Alexey Kardashevskiy @ 2018-06-25  3:51 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <698112983.2569254.1529798098017.JavaMail.zimbra@raptorengineeringinc.com>

On Sat, 23 Jun 2018 18:54:58 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

>  pseudo-DMA mode
> 
> Four TCEs are reserved for legacy 32-bit DMA mappings in psuedo DMA
> mode.  Mark these with an invalid address to avoid their use by
> the TCE cache mapper.


Can we still have 32bit DMA in the case when this new DMA is enabled?

Are the TCEs in the actual table marked invalid?

Anyway, this should be merged into the patch introducing
pnv_pci_pseudo_bypass_setup().


> 
> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index a6097dd323f8..e8a1333f6b3e 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1783,7 +1783,7 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
>  
>  static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
>  {
> -	u64 tce_count, table_size, window_size;
> +	u64 i, tce_count, table_size, window_size;
>  	struct pnv_phb *p = pe->phb;
>  	struct page *table_pages;
>  	__be64 *tces;
> @@ -1835,6 +1835,12 @@ static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
>  	/* mark the first 4GB as reserved so this can still be used for 32bit */
>  	bitmap_set(pe->tce_bitmap, 0, 1ULL << (32 - p->ioda.max_tce_order));
>  
> +	/* make sure reserved first 4GB TCEs are not used by the mapper
> +	 * set each address to -1, which will never match an incoming request
> +	 */
> +	for (i = 0; i < 4; i++)
> +		pe->tce_tracker[i * 2] = -1;
> +
>  	pe_info(pe, "pseudo-bypass sizes: tracker %d bitmap %d TCEs %lld\n",
>  		tracker_entries, bitmap_size, tce_count);
>  
> -- 
> 2.17.1



--
Alexey

^ permalink raw reply

* Re: [PATCH 6/7] powerpc/powernv/pci: Invalidate TCE cache after DMA map
From: Alexey Kardashevskiy @ 2018-06-25  3:49 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <1298010887.2569253.1529798076885.JavaMail.zimbra@raptorengineeringinc.com>

On Sat, 23 Jun 2018 18:54:36 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

>  setup
> 
> Per the IODA2, TCEs must be invalidated after their settings
> have been changed.  Invalidate the cache after the address
> is changed during TCE allocation when using pseudo DMA.
> 
> Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
> ---
>  arch/powerpc/platforms/powernv/pci-dma.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
> index 237940a2a052..060dbc168401 100644
> --- a/arch/powerpc/platforms/powernv/pci-dma.c
> +++ b/arch/powerpc/platforms/powernv/pci-dma.c
> @@ -42,8 +42,7 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
>  	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
>  	pe->tces[tce] = new;
>  	mb();
> -	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
> -		tce, new, old);
> +	pnv_pci_ioda2_tce_invalidate_pe(pe);

This should be merged into "powerpc/powernv: DMA operations for
discontiguous" and "[PATCH 5/7] powerpc/powernv/pci: Export" should be
first in the series.


>  	spin_unlock_irqrestore(&pe->tce_alloc_lock, flags);
>  
>  	return tce;
> -- 
> 2.17.1



--
Alexey

^ permalink raw reply

* Re: [PATCH 4/7] powerpc/powernv/pci: Safety fixes for pseudobypass TCE
From: Alexey Kardashevskiy @ 2018-06-25  3:46 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <955790001.2569251.1529798034743.JavaMail.zimbra@raptorengineeringinc.com>

On Sat, 23 Jun 2018 18:53:54 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

>  allocation
> 
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> ---
>  arch/powerpc/platforms/powernv/pci-dma.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
> index 1d5409be343e..237940a2a052 100644
> --- a/arch/powerpc/platforms/powernv/pci-dma.c
> +++ b/arch/powerpc/platforms/powernv/pci-dma.c
> @@ -29,8 +29,9 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
>  {
>  	int tce;
>  	__be64 old, new;
> +	unsigned long flags;
>  
> -	;
> +	spin_lock_irqsave(&pe->tce_alloc_lock, flags);


No commit log and this actually should be merged into the patch
introducing spin_lock(&pe->tce_alloc_lock).


>  	tce = bitmap_find_next_zero_area(pe->tce_bitmap,
>  					 pe->tce_count,
>  					 0,
> @@ -40,9 +41,10 @@ static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr
>  	old = pe->tces[tce];
>  	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
>  	pe->tces[tce] = new;
> +	mb();
>  	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
>  		tce, new, old);
> -	spin_unlock(&pe->tce_alloc_lock);
> +	spin_unlock_irqrestore(&pe->tce_alloc_lock, flags);
>  
>  	return tce;
>  }
> -- 
> 2.17.1



--
Alexey

^ permalink raw reply

* Re: [PATCH 2/7] powerpc/powernv: DMA operations for discontiguous
From: Alexey Kardashevskiy @ 2018-06-25  3:35 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <813333447.2569248.1529797982244.JavaMail.zimbra@raptorengineeringinc.com>

On Sat, 23 Jun 2018 18:53:02 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

>  allocation
> 
> Cognitive DMA is a new set of DMA operations that solve some issues for
> devices that want to address more than 32 bits but can't address the 59
> bits required to enable direct DMA.
> 
> The previous implementation for POWER8/PHB3 worked around this by
> configuring a bypass from the default 32-bit address space into 64-bit
> address space.  This approach does not work for POWER9/PHB4 because
> regions of memory are discontiguous and many devices will be unable to
> address memory beyond the first node.

Why does not it work precisely? If we use 1GB pages, the table will be
able to cover all the memory.

> Instead, implement a new set of DMA operations that allocate TCEs as DMA
> mappings are requested so that all memory is addressable even when a
> one-to-one mapping between real addresses and DMA addresses isn't
> possible. 

Why does not dma_iommu_ops in this case? It is not limited by table
size or page size and should just work for this case too.

> These TCEs are the maximum size available on the platform,
> which is 256M on PHB3 and 1G on PHB4.


Do we have PHB3 systems with sparse memory to test this or it is dead
code?


> Devices can now map any region of memory up to the maximum amount they can
> address according to the DMA mask set, in chunks of the largest available
> TCE size.
> 
> This implementation replaces the need for the existing PHB3 solution and
> should be compatible with future PHB versions.
> 
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> ---
>  arch/powerpc/include/asm/dma-mapping.h    |   1 +
>  arch/powerpc/platforms/powernv/Makefile   |   2 +-
>  arch/powerpc/platforms/powernv/pci-dma.c  | 319 ++++++++++++++++++++++
>  arch/powerpc/platforms/powernv/pci-ioda.c | 102 +++----
>  arch/powerpc/platforms/powernv/pci.h      |   7 +
>  5 files changed, 381 insertions(+), 50 deletions(-)
>  create mode 100644 arch/powerpc/platforms/powernv/pci-dma.c
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..354f435160f3 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -74,6 +74,7 @@ static inline unsigned long device_to_mask(struct device *dev)
>  extern struct dma_map_ops dma_iommu_ops;
>  #endif
>  extern const struct dma_map_ops dma_nommu_ops;
> +extern const struct dma_map_ops dma_pseudo_bypass_ops;
>  
>  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
>  {
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index 703a350a7f4e..2467bdab3c13 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -6,7 +6,7 @@ obj-y			+= opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
>  obj-y			+= opal-kmsg.o opal-powercap.o opal-psr.o opal-sensor-groups.o
>  
>  obj-$(CONFIG_SMP)	+= smp.o subcore.o subcore-asm.o
> -obj-$(CONFIG_PCI)	+= pci.o pci-ioda.o npu-dma.o
> +obj-$(CONFIG_PCI)	+= pci.o pci-ioda.o npu-dma.o pci-dma.o
>  obj-$(CONFIG_CXL_BASE)	+= pci-cxl.o
>  obj-$(CONFIG_EEH)	+= eeh-powernv.o
>  obj-$(CONFIG_PPC_SCOM)	+= opal-xscom.o
> diff --git a/arch/powerpc/platforms/powernv/pci-dma.c b/arch/powerpc/platforms/powernv/pci-dma.c
> new file mode 100644
> index 000000000000..1d5409be343e
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/pci-dma.c
> @@ -0,0 +1,319 @@
> +/*
> + * DMA operations supporting pseudo-bypass for PHB3+

License header is missing, run scripts/checkpatch.pl before posting.


> + *
> + * Author: Russell Currey <ruscur@russell.cc>
> + *
> + * Copyright 2018 IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; either version 2 of the License, or (at your
> + * option) any later version.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/memblock.h>
> +#include <linux/device.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/hash.h>
> +
> +#include <asm/pci-bridge.h>
> +#include <asm/ppc-pci.h>
> +#include <asm/pnv-pci.h>
> +#include <asm/tce.h>
> +
> +#include "pci.h"
> +
> +/* select and allocate a TCE using the bitmap */
> +static int dma_pseudo_bypass_select_tce(struct pnv_ioda_pe *pe, phys_addr_t addr)
> +{
> +	int tce;
> +	__be64 old, new;
> +
> +	spin_lock(&pe->tce_alloc_lock);
> +	tce = bitmap_find_next_zero_area(pe->tce_bitmap,
> +					 pe->tce_count,
> +					 0,
> +					 1,
> +					 0);
> +	bitmap_set(pe->tce_bitmap, tce, 1);
> +	old = pe->tces[tce];
> +	new = cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
> +	pe->tces[tce] = new;
> +	pe_info(pe, "allocating TCE %i 0x%016llx (old 0x%016llx)\n",
> +		tce, new, old);
> +	spin_unlock(&pe->tce_alloc_lock);
> +
> +	return tce;
> +}
> +
> +/*
> + * The tracking table for assigning TCEs has two entries per TCE.
> + * - @entry1 contains the physical address and the smallest bit indicates
> + *     if it's currently valid.
> + * - @entry2 contains the DMA address returned in the upper 34 bits, and a
> + *     refcount in the lower 30 bits.
> + */
> +static dma_addr_t dma_pseudo_bypass_get_address(struct device *dev,
> +					    phys_addr_t addr)
> +{
> +	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_ioda_pe *pe;
> +        u64 i, entry1, entry2, dma_prefix, tce, ret;
> +	u64 offset = addr & ((1 << phb->ioda.max_tce_order) - 1);
> +
> +	pe = &phb->ioda.pe_array[pci_get_pdn(pdev)->pe_number];
> +
> +	/* look through the tracking table for a free entry */
> +	for (i = 0; i < pe->tce_count; i++) {
> +		entry1 = pe->tce_tracker[i * 2];
> +		entry2 = pe->tce_tracker[i * 2 + 1];
> +		dma_prefix = entry2 >> 34;


Magic value of 34?


> +
> +		/* if the address is the same and the entry is valid */
> +		if (entry1 == ((addr - offset) | 1)) {
> +			/* all we need to do here is increment the refcount */
> +			ret = cmpxchg(&pe->tce_tracker[i * 2 + 1],
> +				      entry2, entry2 + 1);
> +			if (ret != entry2) {
> +				/* conflict, start looking again just in case */
> +				i--;
> +				continue;
> +			}
> +			return (dma_prefix << phb->ioda.max_tce_order) | offset;
> +		/* if the entry is invalid then we want to replace it */
> +		} else if (!(entry1 & 1)) {
> +			/* set the real address, note that it isn't valid yet */
> +			ret = cmpxchg(&pe->tce_tracker[i * 2],
> +				      entry1, (addr - offset));
> +			if (ret != entry1) {
> +				/* conflict, start looking again */
> +				i--;
> +				continue;
> +			}
> +
> +			/* now we can allocate a TCE */
> +			tce = dma_pseudo_bypass_select_tce(pe, addr - offset);
> +
> +			/* set new value, including TCE index and new refcount */
> +			ret = cmpxchg(&pe->tce_tracker[i * 2 + 1],
> +				      entry2, tce << 34 | 1);
> +			if (ret != entry2) {
> +				/*
> +				 * XXX In this case we need to throw out
> +				 * everything, including the TCE we just
> +				 * allocated.  For now, just leave it.
> +				 */
> +				i--;
> +				continue;
> +			}
> +
> +			/* now set the valid bit */
> +			ret = cmpxchg(&pe->tce_tracker[i * 2],
> +				      (addr - offset), (addr - offset) | 1);
> +			if (ret != (addr - offset)) {
> +				/*
> +				 * XXX Same situation as above.  We'd probably
> +				 * want to null out entry2 as well.
> +				 */
> +				i--;
> +				continue;
> +			}
> +			return (tce << phb->ioda.max_tce_order) | offset;
> +		/* it's a valid entry but not ours, keep looking */
> +		} else {
> +			continue;
> +		}
> +	}
> +	/* If we get here, the table must be full, so error out. */
> +	return -1ULL;
> +}
> +
> +/*
> + * For the moment, unmapping just decrements the refcount and doesn't actually
> + * remove the TCE.  This is because it's very likely that a previously allocated
> + * TCE will be used again, and this saves having to invalidate it.
> + *
> + * TODO implement some kind of garbage collection that clears unused TCE entries
> + * once the table reaches a certain size.
> + */
> +static void dma_pseudo_bypass_unmap_address(struct device *dev, dma_addr_t dma_addr)
> +{
> +	struct pci_dev *pdev = container_of(dev, struct pci_dev, dev);
> +	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> +	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_ioda_pe *pe;
> +	u64 i, entry1, entry2, dma_prefix, refcount;
> +
> +	pe = &phb->ioda.pe_array[pci_get_pdn(pdev)->pe_number];
> +
> +	for (i = 0; i < pe->tce_count; i++) {
> +		entry1 = pe->tce_tracker[i * 2];
> +		entry2 = pe->tce_tracker[i * 2 + 1];
> +		dma_prefix = entry2 >> 34;
> +		refcount = entry2 & ((1 << 30) - 1);
> +
> +		/* look through entry2 until we find our address */
> +		if (dma_prefix == (dma_addr >> phb->ioda.max_tce_order)) {
> +			refcount--;
> +			cmpxchg(&pe->tce_tracker[i * 2 + 1], entry2, (dma_prefix << 34) | refcount);
> +			if (!refcount) {
> +				/*
> +				 * Here is where we would remove the valid bit
> +				 * from entry1, clear the entry in the TCE table
> +				 * and invalidate the TCE - but we want to leave
> +				 * them until the table fills up (for now).
> +				 */
> +			}
> +			break;
> +		}
> +	}
> +}
> +
> +static int dma_pseudo_bypass_dma_supported(struct device *dev, u64 mask)
> +{
> +	/*
> +	 * Normally dma_supported() checks if the mask is capable of addressing
> +	 * all of memory.  Since we map physical memory in chunks that the
> +	 * device can address, the device will be able to address whatever it
> +	 * wants - just not all at once.
> +	 */
> +	return 1;
> +}
> +
> +static void *dma_pseudo_bypass_alloc_coherent(struct device *dev,
> +					  size_t size,
> +					  dma_addr_t *dma_handle,
> +					  gfp_t flag,
> +					  unsigned long attrs)
> +{
> +	void *ret;
> +	struct page *page;
> +	int node = dev_to_node(dev);
> +
> +	/* ignore region specifiers */
> +	flag &= ~(__GFP_HIGHMEM);
> +
> +	page = alloc_pages_node(node, flag, get_order(size));
> +	if (page == NULL)
> +		return NULL;
> +	ret = page_address(page);
> +	memset(ret, 0, size);
> +	*dma_handle = dma_pseudo_bypass_get_address(dev, __pa(ret));
> +
> +	return ret;
> +}
> +
> +static void dma_pseudo_bypass_free_coherent(struct device *dev,
> +					 size_t size,
> +					 void *vaddr,
> +					 dma_addr_t dma_handle,
> +					 unsigned long attrs)
> +{
> +	free_pages((unsigned long)vaddr, get_order(size));
> +}
> +
> +static int dma_pseudo_bypass_mmap_coherent(struct device *dev,
> +				       struct vm_area_struct *vma,
> +				       void *cpu_addr,
> +				       dma_addr_t handle,
> +				       size_t size,
> +				       unsigned long attrs)
> +{
> +	unsigned long pfn = page_to_pfn(virt_to_page(cpu_addr));
> +
> +	return remap_pfn_range(vma, vma->vm_start,
> +			       pfn + vma->vm_pgoff,
> +			       vma->vm_end - vma->vm_start,
> +			       vma->vm_page_prot);
> +}
> +
> +static inline dma_addr_t dma_pseudo_bypass_map_page(struct device *dev,
> +						struct page *page,
> +						unsigned long offset,
> +						size_t size,
> +						enum dma_data_direction dir,
> +						unsigned long attrs)
> +{
> +	BUG_ON(dir == DMA_NONE);
> +
> +	/* XXX I don't know if this is necessary (or even desired) */
> +	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +		__dma_sync_page(page, offset, size, dir);
> +
> +	return dma_pseudo_bypass_get_address(dev, page_to_phys(page) + offset);
> +}
> +
> +static inline void dma_pseudo_bypass_unmap_page(struct device *dev,
> +					 dma_addr_t dma_address,
> +					 size_t size,
> +					 enum dma_data_direction direction,
> +					 unsigned long attrs)
> +{
> +	dma_pseudo_bypass_unmap_address(dev, dma_address);
> +}
> +
> +
> +static int dma_pseudo_bypass_map_sg(struct device *dev, struct scatterlist *sgl,
> +			     int nents, enum dma_data_direction direction,
> +			     unsigned long attrs)
> +{
> +	struct scatterlist *sg;
> +	int i;
> +
> +
> +	for_each_sg(sgl, sg, nents, i) {
> +		sg->dma_address = dma_pseudo_bypass_get_address(dev, sg_phys(sg));
> +		sg->dma_length = sg->length;
> +
> +		if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
> +			continue;
> +
> +		__dma_sync_page(sg_page(sg), sg->offset, sg->length, direction);
> +	}
> +
> +	return nents;
> +}
> +
> +static void dma_pseudo_bypass_unmap_sg(struct device *dev, struct scatterlist *sgl,
> +				int nents, enum dma_data_direction direction,
> +				unsigned long attrs)
> +{
> +	struct scatterlist *sg;
> +	int i;
> +
> +	for_each_sg(sgl, sg, nents, i) {
> +		dma_pseudo_bypass_unmap_address(dev, sg->dma_address);
> +	}


No need in curly braces.

> +}
> +
> +static u64 dma_pseudo_bypass_get_required_mask(struct device *dev)
> +{
> +	/*
> +	 * there's no limitation on our end, the driver should just call
> +	 * set_mask() with as many bits as the device can address.
> +	 */
> +	return -1ULL;
> +}
> +
> +static int dma_pseudo_bypass_mapping_error(struct device *dev, dma_addr_t dma_addr)
> +{
> +	return dma_addr == -1ULL;
> +}
> +
> +
> +const struct dma_map_ops dma_pseudo_bypass_ops = {
> +	.alloc				= dma_pseudo_bypass_alloc_coherent,
> +	.free				= dma_pseudo_bypass_free_coherent,
> +	.mmap				= dma_pseudo_bypass_mmap_coherent,
> +	.map_sg				= dma_pseudo_bypass_map_sg,
> +	.unmap_sg			= dma_pseudo_bypass_unmap_sg,
> +	.dma_supported			= dma_pseudo_bypass_dma_supported,
> +	.map_page			= dma_pseudo_bypass_map_page,
> +	.unmap_page			= dma_pseudo_bypass_unmap_page,
> +	.get_required_mask		= dma_pseudo_bypass_get_required_mask,
> +	.mapping_error			= dma_pseudo_bypass_mapping_error,
> +};
> +EXPORT_SYMBOL(dma_pseudo_bypass_ops);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index bcb3bfce072a..7ecc186493ca 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -25,6 +25,7 @@
>  #include <linux/iommu.h>
>  #include <linux/rculist.h>
>  #include <linux/sizes.h>
> +#include <linux/vmalloc.h>
>  
>  #include <asm/sections.h>
>  #include <asm/io.h>
> @@ -1088,6 +1089,9 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>  	pe->pbus = NULL;
>  	pe->mve_number = -1;
>  	pe->rid = dev->bus->number << 8 | pdn->devfn;
> +	pe->tces = NULL;
> +	pe->tce_tracker = NULL;
> +	pe->tce_bitmap = NULL;
>  
>  	pe_info(pe, "Associated device to PE\n");
>  
> @@ -1569,6 +1573,9 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  		pe->mve_number = -1;
>  		pe->rid = (pci_iov_virtfn_bus(pdev, vf_index) << 8) |
>  			   pci_iov_virtfn_devfn(pdev, vf_index);
> +		pe->tces = NULL;
> +		pe->tce_tracker = NULL;
> +		pe->tce_bitmap = NULL;
>  
>  		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
>  			hose->global_number, pdev->bus->number,
> @@ -1774,43 +1781,40 @@ static bool pnv_pci_ioda_pe_single_vendor(struct pnv_ioda_pe *pe)
>  	return true;
>  }
>  
> -/*
> - * Reconfigure TVE#0 to be usable as 64-bit DMA space.
> - *
> - * The first 4GB of virtual memory for a PE is reserved for 32-bit accesses.
> - * Devices can only access more than that if bit 59 of the PCI address is set
> - * by hardware, which indicates TVE#1 should be used instead of TVE#0.
> - * Many PCI devices are not capable of addressing that many bits, and as a
> - * result are limited to the 4GB of virtual memory made available to 32-bit
> - * devices in TVE#0.
> - *
> - * In order to work around this, reconfigure TVE#0 to be suitable for 64-bit
> - * devices by configuring the virtual memory past the first 4GB inaccessible
> - * by 64-bit DMAs.  This should only be used by devices that want more than
> - * 4GB, and only on PEs that have no 32-bit devices.
> - *
> - * Currently this will only work on PHB3 (POWER8).
> - */
> -static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
> +static int pnv_pci_pseudo_bypass_setup(struct pnv_ioda_pe *pe)
>  {
> -	u64 window_size, table_size, tce_count, addr;
> +	u64 tce_count, table_size, window_size;
> +	struct pnv_phb *p = pe->phb;
>  	struct page *table_pages;
> -	u64 tce_order = 28; /* 256MB TCEs */
>  	__be64 *tces;
> -	s64 rc;
> +	int rc = -ENOMEM;
> +	int bitmap_size, tracker_entries;
> +
> +	/*
> +	 * XXX These are factors for scaling the size of the TCE table, and
> +	 * the table that tracks these allocations.  These should eventually
> +	 * be kernel command line options with defaults above 1, for situations
> +	 * where your memory expands after the machine has booted.
> +	 */
> +	int tce_size_factor = 1;
> +	int tracking_table_factor = 1;

I'd drop these for now, add them later.

>  
>  	/*
> -	 * Window size needs to be a power of two, but needs to account for
> -	 * shifting memory by the 4GB offset required to skip 32bit space.
> +	 * The window size covers all of memory (and optionally more), with
> +	 * enough tracker entries to cover them all being allocated.  So we
> +	 * create enough TCEs to cover all of memory at once.
>  	 */
> -	window_size = roundup_pow_of_two(memory_hotplug_max() + (1ULL << 32));
> -	tce_count = window_size >> tce_order;
> +	window_size = roundup_pow_of_two(tce_size_factor * memory_hotplug_max());
> +	tracker_entries = (tracking_table_factor * memory_hotplug_max()) >>
> +		p->ioda.max_tce_order;
> +	tce_count = window_size >> p->ioda.max_tce_order;
> +	bitmap_size = BITS_TO_LONGS(tce_count) * sizeof(unsigned long);
>  	table_size = tce_count << 3;
>  
>  	if (table_size < PAGE_SIZE)
>  		table_size = PAGE_SIZE;
>  
> -	table_pages = alloc_pages_node(pe->phb->hose->node, GFP_KERNEL,
> +	table_pages = alloc_pages_node(p->hose->node, GFP_KERNEL,
>  				       get_order(table_size));


table_pages memory leaks if the device is used by VFIO.


>  	if (!table_pages)
>  		goto err;
> @@ -1821,26 +1825,33 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
>  
>  	memset(tces, 0, table_size);
>  
> -	for (addr = 0; addr < memory_hotplug_max(); addr += (1 << tce_order)) {
> -		tces[(addr + (1ULL << 32)) >> tce_order] =
> -			cpu_to_be64(addr | TCE_PCI_READ | TCE_PCI_WRITE);
> -	}
> +	pe->tces = tces;
> +	pe->tce_count = tce_count;
> +	pe->tce_bitmap = kzalloc(bitmap_size, GFP_KERNEL);
> +	/* The tracking table has two u64s per TCE */
> +	pe->tce_tracker = vzalloc(sizeof(u64) * 2 * tracker_entries);
> +	spin_lock_init(&pe->tce_alloc_lock);
> +
> +	/* mark the first 4GB as reserved so this can still be used for 32bit */
> +	bitmap_set(pe->tce_bitmap, 0, 1ULL << (32 - p->ioda.max_tce_order));
> +
> +	pe_info(pe, "pseudo-bypass sizes: tracker %d bitmap %d TCEs %lld\n",
> +		tracker_entries, bitmap_size, tce_count);
>  
>  	rc = opal_pci_map_pe_dma_window(pe->phb->opal_id,
>  					pe->pe_number,
> -					/* reconfigure window 0 */
>  					(pe->pe_number << 1) + 0,
>  					1,
>  					__pa(tces),
>  					table_size,
> -					1 << tce_order);
> +					1 << p->ioda.max_tce_order);

Is there any reason not to use the existing iommu_table_group_ops API
for tracking whatever was programmed into TVT?

I'd really love see this be based on top of 
https://patchwork.ozlabs.org/patch/923868/


>  	if (rc == OPAL_SUCCESS) {
> -		pe_info(pe, "Using 64-bit DMA iommu bypass (through TVE#0)\n");
> +		pe_info(pe, "TCE tables configured for pseudo-bypass\n");
>  		return 0;
>  	}
>  err:
> -	pe_err(pe, "Error configuring 64-bit DMA bypass\n");
> -	return -EIO;
> +	pe_err(pe, "error configuring pseudo-bypass\n");
> +	return rc;
>  }
>  
>  static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
> @@ -1851,7 +1862,6 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
>  	struct pnv_ioda_pe *pe;
>  	uint64_t top;
>  	bool bypass = false;
> -	s64 rc;
>  
>  	if (WARN_ON(!pdn || pdn->pe_number == IODA_INVALID_PE))
>  		return -ENODEV;
> @@ -1868,21 +1878,15 @@ static int pnv_pci_ioda_dma_set_mask(struct pci_dev *pdev, u64 dma_mask)
>  	} else {
>  		/*
>  		 * If the device can't set the TCE bypass bit but still wants
> -		 * to access 4GB or more, on PHB3 we can reconfigure TVE#0 to
> -		 * bypass the 32-bit region and be usable for 64-bit DMAs.
> -		 * The device needs to be able to address all of this space.
> +		 * to access 4GB or more, we need to use a different set of DMA
> +		 * operations with an indirect mapping.
>  		 */
>  		if (dma_mask >> 32 &&
> -		    dma_mask > (memory_hotplug_max() + (1ULL << 32)) &&
> -		    pnv_pci_ioda_pe_single_vendor(pe) &&
> -		    phb->model == PNV_PHB_MODEL_PHB3) {
> -			/* Configure the bypass mode */
> -			rc = pnv_pci_ioda_dma_64bit_bypass(pe);
> -			if (rc)
> -				return rc;
> -			/* 4GB offset bypasses 32-bit space */
> -			set_dma_offset(&pdev->dev, (1ULL << 32));
> -			set_dma_ops(&pdev->dev, &dma_nommu_ops);
> +		    phb->model != PNV_PHB_MODEL_P7IOC &&
> +		    pnv_pci_ioda_pe_single_vendor(pe)) {
> +			if (!pe->tces)
> +				pnv_pci_pseudo_bypass_setup(pe);
> +			set_dma_ops(&pdev->dev, &dma_pseudo_bypass_ops);
>  		} else if (dma_mask >> 32 && dma_mask != DMA_BIT_MASK(64)) {
>  			/*
>  			 * Fail the request if a DMA mask between 32 and 64 bits
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index c9952def5e93..83492aba90f1 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -70,6 +70,13 @@ struct pnv_ioda_pe {
>  	bool			tce_bypass_enabled;
>  	uint64_t		tce_bypass_base;
>  
> +	/* TCE tables for DMA pseudo-bypass */
> +	__be64			*tces;
> +	u64			tce_count;
> +	unsigned long		*tce_bitmap;
> +	u64			*tce_tracker; // 2 u64s per TCE
> +	spinlock_t		tce_alloc_lock;


Can we please not duplicate pe->table_group here? That thing has array
of iommu_table's with locks and everything.


> +
>  	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
>  	 * and -1 if not supported. (It's actually identical to the
>  	 * PE number)
> -- 
> 2.17.1



--
Alexey

^ permalink raw reply

* Re: [PATCH 1/7] powerpc/powernv/pci: Track largest available TCE order
From: Alexey Kardashevskiy @ 2018-06-25  3:10 UTC (permalink / raw)
  To: Timothy Pearson; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <77952537.2569246.1529797950974.JavaMail.zimbra@raptorengineeringinc.com>

On Sat, 23 Jun 2018 18:52:30 -0500 (CDT)
Timothy Pearson <tpearson@raptorengineering.com> wrote:

>  per PHB
> 
> Knowing the largest possible TCE size of a PHB is useful, so get it out
> of the device tree.  This relies on the property being added in OPAL.
> 
> It is assumed that any PHB4 or later machine would be running firmware
> that implemented this property, and otherwise assumed to be PHB3, which
> has a maximum TCE order of 28 bits or 256MB TCEs.
> 
> This is used later in the series.
> 
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++++++++++++++++
>  arch/powerpc/platforms/powernv/pci.h      |  3 +++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 5bd0eb6681bc..bcb3bfce072a 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3873,11 +3873,13 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	struct resource r;
>  	const __be64 *prop64;
>  	const __be32 *prop32;
> +	struct property *prop;
>  	int len;
>  	unsigned int segno;
>  	u64 phb_id;
>  	void *aux;
>  	long rc;
> +	u32 val;
>  
>  	if (!of_device_is_available(np))
>  		return;
> @@ -4016,6 +4018,20 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
>  	}
>  	phb->ioda.pe_array = aux + pemap_off;
>  
> +	phb->ioda.max_tce_order = 0;
> +	// Get TCE order from the DT.  If it's not present, assume P8
> +	if (!of_get_property(np, "ibm,supported-tce-sizes", NULL)) {
> +		phb->ioda.max_tce_order = 28; // assume P8 256mb TCEs
> +	} else {
> +		of_property_for_each_u32(np, "ibm,supported-tce-sizes", prop,
> +					 prop32, val) {
> +			if (val > phb->ioda.max_tce_order)
> +				phb->ioda.max_tce_order = val;
> +		}
> +		pr_debug("PHB%llx Found max TCE order of %d bits\n",
> +			 phb->opal_id, phb->ioda.max_tce_order);
> +	}
> +


7ef73cd39b4e28 stores this in pe->table_group.pgsizes, why duplicate
this?



>  	/*
>  	 * Choose PE number for root bus, which shouldn't have
>  	 * M64 resources consumed by its child devices. To pick
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index eada4b6068cb..c9952def5e93 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -173,6 +173,9 @@ struct pnv_phb {
>  		struct list_head	pe_list;
>  		struct mutex            pe_list_mutex;
>  
> +		/* Largest supported TCE order bits */
> +		uint8_t			max_tce_order;
> +
>  		/* Reverse map of PEs, indexed by {bus, devfn} */
>  		unsigned int		pe_rmap[0x10000];
>  	} ioda;
> -- 
> 2.17.1



--
Alexey

^ permalink raw reply

* Re: [patchV2 2/2] pci/shpchp: no claim on pcie port device
From: Pingfan Liu @ 2018-06-25  2:59 UTC (permalink / raw)
  To: helgaas; +Cc: linux-pci, Bjorn Helgaas, Christoph Hellwig, linuxppc-dev
In-Reply-To: <20180613131317.GA201807@bhelgaas-glaptop.roam.corp.google.com>

On Wed, Jun 13, 2018 at 9:13 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jun 13, 2018 at 02:29:57PM +0800, Pingfan Liu wrote:
> > The Linux Device Driver Model allows a physical device to be handled
> > by only a single driver. But at present, both shpchp and portdrv_pci
> > claim PCI_CLASS_BRIDGE_PCI, and touch devices_kset when the drivers are
> > loaded. This causes a few problems, one is the wrong shutdown seq of
> > devices, owing to the broken devices_kset. This patch keeps shpchp away
> > from pcie port device, using the extra matching method.
>
> As Christoph pointed out, something seems wrong if we add to
> devices_kset even when the probe fails.  We will call shpc_probe() for
> the device below, but there's no SHPC capability, so the probe should
> fail when shpc_init() fails.
>
> I do think the current structure where portdrv and shpchp are both
> "drivers" that claim bridges is broken.  This has caused problems
> before, e.g., some PCIe switches have performance counters, and a
> driver for those switches ought to be able to claim the device and use
> the counters, but portdrv is in the way.
>
> I think it would be better if portdrv (and maybe shpchp; not sure
> about that) were integrated into the PCIe core and did not register as
> a driver.  But this is a big project, far beyond the scope of this
> current issue.
>
Yes, I agree. But just leaving this for later fix, and try to bring
out a fix in drivers/core firstly.

Regards,
Pingfan

> > Note:
> > I hit this bug on a Power9 machine, when "kexec -e", and see a ata-disk
> > behind a bridge can not write back buffer in flight due to the former
> > shutdown of the bridge which clears the BusMaster bit.
> >
> > the device involved:
> > 0004:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
> >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
> >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >         Latency: 0
> >         NUMA node: 0
> >         Bus: primary=00, secondary=01, subordinate=12, sec-latency=0
> >         I/O behind bridge: 00000000-00000fff
> >         Memory behind bridge: 80000000-ffefffff
> >         Prefetchable memory behind bridge: 0006024000000000-0006027f7fffffff
> >         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> >         BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
> >                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> >         Capabilities: [40] Power Management version 3
> >                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> >                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> >         Capabilities: [48] Express (v2) Root Port (Slot-), MSI 00
> >                 DevCap: MaxPayload 512 bytes, PhantFunc 0
> >                         ExtTag- RBE+
> >                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> >                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >                         MaxPayload 256 bytes, MaxReadReq 128 bytes
> >                 DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> >                 LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
> >                         ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp-
> >                 LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- CommClk-
> >                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >                 LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt+
> >                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> >                 RootCap: CRSVisible-
> >                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> >                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported ARIFwd+
> >                 DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis+, LTR-, OBFF Disabled ARIFwd+
> >                 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
> >                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> >                          Compliance De-emphasis: -6dB
> >                 LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
> >                          EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
> >         Capabilities: [100 v1] Advanced Error Reporting
> >                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >                 UEMsk:  DLP- SDES- TLP+ FCP- CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF- MalfTLP- ECRC+ UnsupReq- ACSViol-
> >                 UESvrt: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> >                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> >                 AERCap: First Error Pointer: 00, GenCap+ CGenEn+ ChkCap+ ChkEn+
> >         Capabilities: [148 v1] #19
> >
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > ---
> >  drivers/pci/hotplug/shpchp_core.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/drivers/pci/hotplug/shpchp_core.c b/drivers/pci/hotplug/shpchp_core.c
> > index 1f0f969..85b4665 100644
> > --- a/drivers/pci/hotplug/shpchp_core.c
> > +++ b/drivers/pci/hotplug/shpchp_core.c
> > @@ -336,6 +336,18 @@ static void shpc_remove(struct pci_dev *dev)
> >       kfree(ctrl);
> >  }
> >
> > +static int shpc_extra_match(struct pci_dev *pdev)
> > +{
> > +     /* do not claim pcie port device */
> > +     if (pci_is_pcie(pdev) &&
> > +         (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT ||
> > +          pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM ||
> > +          pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
> > +             return -ENODEV;
> > +
> > +     return 0;
> > +}
> > +
> >  static const struct pci_device_id shpcd_pci_tbl[] = {
> >       {PCI_DEVICE_CLASS(((PCI_CLASS_BRIDGE_PCI << 8) | 0x00), ~0)},
> >       { /* end: all zeroes */ }
> > @@ -345,6 +357,7 @@ MODULE_DEVICE_TABLE(pci, shpcd_pci_tbl);
> >  static struct pci_driver shpc_driver = {
> >       .name =         SHPC_MODULE_NAME,
> >       .id_table =     shpcd_pci_tbl,
> > +     .extra_match =  shpc_extra_match,
> >       .probe =        shpc_probe,
> >       .remove =       shpc_remove,
> >  };
> > --
> > 2.7.4
> >

^ permalink raw reply

* [PATCH v2 1/2] powerpc: Document issues with the DAWR on POWER9
From: Michael Neuling @ 2018-06-25  1:34 UTC (permalink / raw)
  To: mpe; +Cc: linuxppc-dev, mikey

Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Stewart Smith <stewart@linux.ibm.com>
---
v2:
  Spelling mistakes :-/
---
 Documentation/powerpc/DAWR-POWER9.txt | 58 +++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)
 create mode 100644 Documentation/powerpc/DAWR-POWER9.txt

diff --git a/Documentation/powerpc/DAWR-POWER9.txt b/Documentation/powerpc/DAWR-POWER9.txt
new file mode 100644
index 0000000000..2feaa66196
--- /dev/null
+++ b/Documentation/powerpc/DAWR-POWER9.txt
@@ -0,0 +1,58 @@
+DAWR issues on POWER9
+============================
+
+On POWER9 the DAWR can cause a checkstop if it points to cache
+inhibited (CI) memory. Currently Linux has no way to disinguish CI
+memory when configuring the DAWR, so (for now) the DAWR is disabled by
+this commit:
+
+    commit 9654153158d3e0684a1bdb76dbababdb7111d5a0
+    Author: Michael Neuling <mikey@neuling.org>
+    Date:   Tue Mar 27 15:37:24 2018 +1100
+    powerpc: Disable DAWR in the base POWER9 CPU features
+
+Technical Details:
+============================
+
+DAWR has 6 different ways of being set.
+1) ptrace
+2) h_set_mode(DAWR)
+3) h_set_dabr()
+4) kvmppc_set_one_reg()
+5) xmon
+
+For ptrace, we now advertise zero breakpoints on POWER9 via the
+PPC_PTRACE_GETHWDBGINFO call. This results in GDB falling back to
+software emulation of the watchpoint (which is slow).
+
+h_set_mode(DAWR) and h_set_dabr() will now return an error to the
+guest on a POWER9 host. Current Linux guests ignore this error, so
+they will silently not get the DAWR.
+
+kvmppc_set_one_reg() will store the value in the vcpu but won't
+actually set it on POWER9 hardware. This is done so we don't break
+migration from POWER8 to POWER9, at the cost of silently losing the
+DAWR on the migration.
+
+For xmon, the 'bd' command will return an error on P9.
+
+Consequences for users
+============================
+
+For GDB watchpoints (ie 'watch' command) on POWER9 bare metal , GDB
+will accept the command. Unfortunately since there is no hardware
+support for the watchpoint, GDB will software emulate the watchpoint
+making it run very slowly.
+
+The same will also be true for any guests started on a POWER9
+host. The watchpoint will fail and GDB will fall back to software
+emulation.
+
+If a guest is started on a POWER8 host, GDB will accept the watchpoint
+and configure the hardware to use the DAWR. This will run at full
+speed since it can use the hardware emulation. Unfortunately if this
+guest is migrated to a POWER9 host, the watchpoint will be lost on the
+POWER9. Loads and stores to the watchpoint locations will not be
+trapped in GDB. The watchpoint is remembered, so if the guest is
+migrated back to the POWER8 host, it will start working again.
+
-- 
2.17.1

^ permalink raw reply related

* [PATCH v2 2/2] powerpc: Document issues with TM on POWER9
From: Michael Neuling @ 2018-06-25  1:34 UTC (permalink / raw)
  To: mpe; +Cc: linuxppc-dev, mikey
In-Reply-To: <20180625013456.16157-1-mikey@neuling.org>

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
v2:
  Spelling mistakes :-/
---
 .../powerpc/transactional_memory.txt          | 44 +++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/Documentation/powerpc/transactional_memory.txt b/Documentation/powerpc/transactional_memory.txt
index e32fdbb4c9..52c023e14f 100644
--- a/Documentation/powerpc/transactional_memory.txt
+++ b/Documentation/powerpc/transactional_memory.txt
@@ -198,3 +198,47 @@ presented).  The transaction cannot then be continued and will take the failure
 handler route.  Furthermore, the transactional 2nd register state will be
 inaccessible.  GDB can currently be used on programs using TM, but not sensibly
 in parts within transactions.
+
+POWER9
+======
+
+TM on POWER9 has issues with storing the complete register state. This
+is described in this commit:
+
+    commit 4bb3c7a0208fc13ca70598efd109901a7cd45ae7
+    Author: Paul Mackerras <paulus@ozlabs.org>
+    Date:   Wed Mar 21 21:32:01 2018 +1100
+    KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
+
+To account for this different POWER9 chips have TM enabled in
+different ways.
+
+On POWER9N DD2.01 and below, TM is disabled. ie
+HWCAP2[PPC_FEATURE2_HTM] is not set.
+
+On POWER9N DD2.1 TM is configured by firmware to always abort a
+transaction when tm suspend occurs. So tsuspend will cause a
+transaction to be aborted and rolled back. Kernel exceptions will also
+cause the transaction to be aborted and rolled back and the exception
+will not occur. If userspace constructs a sigcontext that enables TM
+suspend, the sigcontext will be rejected by the kernel. This mode is
+advertised to users with HWCAP2[PPC_FEATURE2_HTM_NO_SUSPEND] set.
+HWCAP2[PPC_FEATURE2_HTM] is not set in this mode.
+
+On POWER9N DD2.2 and above, KVM and POWERVM emulate TM for guests (as
+described in commit 4bb3c7a0208f), hence TM is enabled for guests
+ie. HWCAP2[PPC_FEATURE2_HTM] is set for guest userspace. Guests that
+makes heavy use of TM suspend (tsuspend or kernel suspend) will result
+in traps into the hypervisor and hence will suffer a performance
+degradation. Host userspace has TM disabled
+ie. HWCAP2[PPC_FEATURE2_HTM] is not set. (although we make enable it
+at some point in the future if we bring the emulation into host
+userspace context switching).
+
+POWER9C DD1.2 and above are only available with POWERVM and hence
+Linux only runs as a guest. On these systems TM is emulated like on
+POWER9N DD2.2.
+
+Guest migration from POWER8 to POWER9 will work with POWER9N DD2.2 and
+POWER9C DD1.2. Since earlier POWER9 processors don't support TM
+emulation, migration from POWER8 to POWER9 is not supported there.
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH 1/2] powerpc: Document issues with the DAWR on POWER9
From: Michael Neuling @ 2018-06-25  1:30 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: mpe, linuxppc-dev
In-Reply-To: <20180622172308.GQ16221@gate.crashing.org>

On Fri, 2018-06-22 at 12:23 -0500, Segher Boessenkool wrote:
> On Fri, Jun 22, 2018 at 04:14:51PM +1000, Michael Neuling wrote:
> > +will accept the command. Unfortunatley since there is no hardware
>=20
> "unfortunately".
>=20
> > +speed since it can use the hardware emualation. Unfortnatley if this
>=20
> It is not your favourite word to type ;-)

Or emulation apparently :-/

Thanks.
Mikey

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox