Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Yu-cheng Yu @ 2018-06-12 17:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: bsingharora, LKML, linux-doc, Linux-MM, linux-arch, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H. J. Lu,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CALCETrVOyZz72RuoRB=z_EjFTqqctSLfX30GM+MSEVtbcd=PeQ@mail.gmail.com>

On Tue, 2018-06-12 at 09:31 -0700, Andy Lutomirski wrote:
> On Tue, Jun 12, 2018 at 9:24 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> >
> > On Tue, 2018-06-12 at 09:00 -0700, Andy Lutomirski wrote:
> > > On Tue, Jun 12, 2018 at 8:06 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > >
> > > > On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> > > > >
> > > > > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > > > > This series introduces CET - Shadow stack
> > > > > >
> > > > > > At the high level, shadow stack is:
> > > > > >
> > > > > >     Allocated from a task's address space with vm_flags VM_SHSTK;
> > > > > >     Its PTEs must be read-only and dirty;
> > > > > >     Fixed sized, but the default size can be changed by sys admin.
> > > > > >
> > > > > > For a forked child, the shadow stack is duplicated when the next
> > > > > > shadow stack access takes place.
> > > > > >
> > > > > > For a pthread child, a new shadow stack is allocated.
> > > > > >
> > > > > > The signal handler uses the same shadow stack as the main program.
> > > > > >
> > > > >
> > > > > Even with sigaltstack()?
> > > > >
> > > > >
> > > > > Balbir Singh.
> > > >
> > > > Yes.
> > > >
> > >
> > > I think we're going to need some provision to add an alternate signal
> > > stack to handle the case where the shadow stack overflows.
> >
> > The shadow stack stores only return addresses; its consumption will not
> > exceed a percentage of (program stack size + sigaltstack size) before
> > those overflow.  When that happens, there is usually very little we can
> > do.  So we set a default shadow stack size that supports certain nested
> > calls and allow sys admin to adjust it.
> >
> 
> Of course there's something you can do: add a sigaltstack-like stack
> switching mechanism.  Have a reserve shadow stack and, when a signal
> is delivered (possibly guarded by other conditions like "did the
> shadow stack overflow"), switch to a new shadow stack and maybe write
> a special token to the new shadow stack that says "signal delivery
> jumped here and will restore to the previous shadow stack and
> such-and-such address on return".

If (shstk size == (stack size + sigaltstack size)), then shstk will not
overflow before program stack overflows and sigaltstack also overflows.

Let me think about this.

> Also, I have a couple of other questions after reading the
> documentation some more:
> 
> 1. Why on Earth does INCSSP only take an 8-bit number of frames to
> skip?  It seems to me that code that calls setjmp() and then calls
> longjmp() while nested more than 256 function call levels will crash.

GLIBC takes care of more than 256 functions calls.

> 2. The mnemonic RSTORSSP makes no sense to me.  RSTORSSP is a stack
> *switch* operation not a stack *restore* operation, unless I'm
> seriously misunderstanding.

The intention is to switch shadow stacks with tokens.  RSTORSSP restores
to a previous shadow stack address from a restore token.

> 3. Is there anything resembling clear documentation of the format of
> the shadow stack?  That is, what types of values might be found on the
> shadow stack and what do they all mean?

Only return addresses and restore tokens can be on a user-mode shadow
stack.  The restore token has the incoming shadow stack address plus one
bit indicating 64/32-bit mode.

I will put this into Documentation/x86/intel_cet.txt.

> 
> 4. Usually Intel doesn't submit upstream Linux patches for ISA
> extensions until the ISA is documented for real.  CET does not appear
> to be documented for real.  Could Intel kindly release something that
> at least claims to be authoritative documentation?
> 
> --Andy


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: Thomas Gleixner @ 2018-06-12 18:59 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Andy Lutomirski, Yu-cheng Yu, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <CAMe9rOpzcCdje=bUVs+C1WrY6GuwA-8AUFVLOG325LGz7KHJxw@mail.gmail.com>

On Tue, 12 Jun 2018, H.J. Lu wrote:
> On Tue, Jun 12, 2018 at 9:34 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > On Tue, Jun 12, 2018 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> On Tue, Jun 12, 2018 at 9:01 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >> > On Tue, Jun 12, 2018 at 4:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >> >> On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> >> > That works for stuff which loads all libraries at start time, but what
> >> >> > happens if the program uses dlopen() later on? If CET is force locked and
> >> >> > the library is not CET enabled, it will fail.
> >> >>
> >> >> That is to prevent disabling CET by dlopening a legacy shared library.
> >> >>
> >> >> > I don't see the point of trying to support CET by magic. It adds complexity
> >> >> > and you'll never be able to handle all corner cases correctly. dlopen() is
> >> >> > not even a corner case.
> >> >>
> >> >> That is a price we pay for security.  To enable CET, especially shadow
> >> >> shack, the program and all of shared libraries it uses should be CET
> >> >> enabled.  Most of programs can be enabled with CET by compiling them
> >> >> with -fcf-protection.
> >> >
> >> > If you charge too high a price for security, people may turn it off.
> >> > I think we're going to need a mode where a program says "I want to use
> >> > the CET, but turn it off if I dlopen an unsupported library".  There
> >> > are programs that load binary-only plugins.
> >>
> >> You can do
> >>
> >> # export GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK
> >>
> >> which turns off shadow stack.
> >>
> >
> > Which exactly illustrates my point.  By making your security story too
> > absolute, you'll force people to turn it off when they don't need to.
> > If I'm using a fully CET-ified distro and I'm using a CET-aware
> > program that loads binary plugins, and I may or may not have an old
> > (binary-only, perhaps) plugin that doesn't support CET, then the
> > behavior I want is for CET to be on until I dlopen() a program that
> > doesn't support it.  Unless there's some ABI reason why that can't be
> > done, but I don't think there is.
> 
> We can make it opt-in via GLIBC_TUNABLES.  But by default, the legacy
> shared object is disallowed when CET is enabled.

That's a bad idea. Stuff has launchers which people might not be able to
change. So they will simply turn of CET completely or it makes them hack
horrible crap into init, e.g. the above export.

Give them sane kernel options:

     cet = off, relaxed, forced

where relaxed allows to run binary plugins. Then let dlopen() call into the
kernel with the filepath of the library to check for CET and that will tell
you whether its ok or or not and do the necessary magic in the kernel when
CET has to be disabled due to a !CET library/application.

That's also making the whole thing independent of magic glibc environment
options and allows it to be used all over the place in the same way.

Thanks,

	tglx





--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 06/10] x86/cet: Add arch_prctl functions for shadow stack
From: H.J. Lu @ 2018-06-12 19:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Yu-cheng Yu, LKML, linux-doc, Linux-MM,
	linux-arch, X86 ML, H. Peter Anvin, Ingo Molnar,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <alpine.DEB.2.21.1806122046520.1592@nanos.tec.linutronix.de>

On Tue, Jun 12, 2018 at 11:59 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 12 Jun 2018, H.J. Lu wrote:
>> On Tue, Jun 12, 2018 at 9:34 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> > On Tue, Jun 12, 2018 at 9:05 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> >> On Tue, Jun 12, 2018 at 9:01 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> >> > On Tue, Jun 12, 2018 at 4:43 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>> >> >> On Tue, Jun 12, 2018 at 3:03 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> >> > That works for stuff which loads all libraries at start time, but what
>> >> >> > happens if the program uses dlopen() later on? If CET is force locked and
>> >> >> > the library is not CET enabled, it will fail.
>> >> >>
>> >> >> That is to prevent disabling CET by dlopening a legacy shared library.
>> >> >>
>> >> >> > I don't see the point of trying to support CET by magic. It adds complexity
>> >> >> > and you'll never be able to handle all corner cases correctly. dlopen() is
>> >> >> > not even a corner case.
>> >> >>
>> >> >> That is a price we pay for security.  To enable CET, especially shadow
>> >> >> shack, the program and all of shared libraries it uses should be CET
>> >> >> enabled.  Most of programs can be enabled with CET by compiling them
>> >> >> with -fcf-protection.
>> >> >
>> >> > If you charge too high a price for security, people may turn it off.
>> >> > I think we're going to need a mode where a program says "I want to use
>> >> > the CET, but turn it off if I dlopen an unsupported library".  There
>> >> > are programs that load binary-only plugins.
>> >>
>> >> You can do
>> >>
>> >> # export GLIBC_TUNABLES=glibc.tune.hwcaps=-SHSTK
>> >>
>> >> which turns off shadow stack.
>> >>
>> >
>> > Which exactly illustrates my point.  By making your security story too
>> > absolute, you'll force people to turn it off when they don't need to.
>> > If I'm using a fully CET-ified distro and I'm using a CET-aware
>> > program that loads binary plugins, and I may or may not have an old
>> > (binary-only, perhaps) plugin that doesn't support CET, then the
>> > behavior I want is for CET to be on until I dlopen() a program that
>> > doesn't support it.  Unless there's some ABI reason why that can't be
>> > done, but I don't think there is.
>>
>> We can make it opt-in via GLIBC_TUNABLES.  But by default, the legacy
>> shared object is disallowed when CET is enabled.
>
> That's a bad idea. Stuff has launchers which people might not be able to
> change. So they will simply turn of CET completely or it makes them hack
> horrible crap into init, e.g. the above export.
>
> Give them sane kernel options:
>
>      cet = off, relaxed, forced
>
> where relaxed allows to run binary plugins. Then let dlopen() call into the
> kernel with the filepath of the library to check for CET and that will tell
> you whether its ok or or not and do the necessary magic in the kernel when
> CET has to be disabled due to a !CET library/application.
>
> That's also making the whole thing independent of magic glibc environment
> options and allows it to be used all over the place in the same way.

This is very similar to our ARCH_CET_EXEC proposal which controls how
CET should be enforced.   But Andy thinks it is a bad idea.


-- 
H.J.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 6/6] Documentation: networking: cpsw: add MQPRIO & CBS offload examples
From: Grygorii Strashko @ 2018-06-12 19:55 UTC (permalink / raw)
  To: Ivan Khoronzhuk, davem
  Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
	vinicius.gomes, henrik, jesus.sanchez-palencia, ilias.apalodimas,
	p-varis, spatton, francois.ozog, yogeshs, nsekhar
In-Reply-To: <20180611133047.4818-7-ivan.khoronzhuk@linaro.org>



On 06/11/2018 08:30 AM, Ivan Khoronzhuk wrote:
> This document describes MQPRIO and CBS Qdisc offload configuration
> for cpsw driver based on examples. It potentially can be used in
> audio video bridging (AVB) and time sensitive networking (TSN).
> 
> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
> ---
>   Documentation/networking/cpsw.txt | 540 ++++++++++++++++++++++++++++++
>   1 file changed, 540 insertions(+)
>   create mode 100644 Documentation/networking/cpsw.txt
> 
> diff --git a/Documentation/networking/cpsw.txt b/Documentation/networking/cpsw.txt
> new file mode 100644
> index 000000000000..f5d58f502e52
> --- /dev/null
> +++ b/Documentation/networking/cpsw.txt

Could you name it with "ti" prefix, pls?

Like "ti-cpsw.txt" or "ti,cpsw.txt"

> @@ -0,0 +1,540 @@
> +* Texas Instruments CPSW ethernet driver
> +
> +Multiqueue & CBS & MQPRIO
> +=====================================================================
> +=====================================================================
[...]

-- 
regards,
-grygorii
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Yu-cheng Yu @ 2018-06-12 20:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: bsingharora, LKML, linux-doc, Linux-MM, linux-arch, X86 ML,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H. J. Lu,
	Shanbhogue, Vedvyas, Ravi V. Shankar, Dave Hansen,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, mike.kravetz
In-Reply-To: <1528824280.9447.30.camel@2b52.sc.intel.com>

On Tue, 2018-06-12 at 10:24 -0700, Yu-cheng Yu wrote:
> On Tue, 2018-06-12 at 09:31 -0700, Andy Lutomirski wrote:
> > On Tue, Jun 12, 2018 at 9:24 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > >
> > > On Tue, 2018-06-12 at 09:00 -0700, Andy Lutomirski wrote:
> > > > On Tue, Jun 12, 2018 at 8:06 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
> > > > >
> > > > > On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> > > > > >
> > > > > > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > > > > > This series introduces CET - Shadow stack
> > > > > > >
> > > > > > > At the high level, shadow stack is:
> > > > > > >
> > > > > > >     Allocated from a task's address space with vm_flags VM_SHSTK;
> > > > > > >     Its PTEs must be read-only and dirty;
> > > > > > >     Fixed sized, but the default size can be changed by sys admin.
> > > > > > >
> > > > > > > For a forked child, the shadow stack is duplicated when the next
> > > > > > > shadow stack access takes place.
> > > > > > >
> > > > > > > For a pthread child, a new shadow stack is allocated.
> > > > > > >
> > > > > > > The signal handler uses the same shadow stack as the main program.
> > > > > > >
> > > > > >
> > > > > > Even with sigaltstack()?
> > > > > >
> > > > > >
> > > > > > Balbir Singh.
> > > > >
> > > > > Yes.
> > > > >
> > > >
> > > > I think we're going to need some provision to add an alternate signal
> > > > stack to handle the case where the shadow stack overflows.
> > >
> > > The shadow stack stores only return addresses; its consumption will not
> > > exceed a percentage of (program stack size + sigaltstack size) before
> > > those overflow.  When that happens, there is usually very little we can
> > > do.  So we set a default shadow stack size that supports certain nested
> > > calls and allow sys admin to adjust it.
> > >
> > 
> > Of course there's something you can do: add a sigaltstack-like stack
> > switching mechanism.  Have a reserve shadow stack and, when a signal
> > is delivered (possibly guarded by other conditions like "did the
> > shadow stack overflow"), switch to a new shadow stack and maybe write
> > a special token to the new shadow stack that says "signal delivery
> > jumped here and will restore to the previous shadow stack and
> > such-and-such address on return".
> 
> If (shstk size == (stack size + sigaltstack size)), then shstk will not
> overflow before program stack overflows and sigaltstack also overflows.
> 
> Let me think about this.

The reserve shadow stack will help only when the shstk overflows but
signal stack/sigaltstack still has room and we can deliver a signal.  If
the shstk is large enough to cover any nested calls that will overflow
both the program stack and sigaltstack then we don't need a reserve
shstk.

We can estimate how big the shstk needs to be; in the worst case it
should not be greater than (program stack size + sigaltstack size).  The
default shstk size we choose pass all signal tests in GLIBC.  In case
there is a need to increase it for a very large RLIMIT_STACK or very
large sigaltstack, the sys admin can increase the default shstk size.

Yu-cheng


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: ethernet: ti: cpsw: add MQPRIO Qdisc offload
From: Ivan Khoronzhuk @ 2018-06-12 20:35 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: grygorii.strashko, davem, corbet, akpm, netdev, linux-doc,
	linux-kernel, linux-omap, vinicius.gomes, henrik,
	jesus.sanchez-palencia, ilias.apalodimas, p-varis, spatton,
	francois.ozog, yogeshs, nsekhar
In-Reply-To: <20180612163658.GC12251@lunn.ch>

On Tue, Jun 12, 2018 at 06:36:58PM +0200, Andrew Lunn wrote:
>On Mon, Jun 11, 2018 at 04:30:44PM +0300, Ivan Khoronzhuk wrote:
>> That's possible to offload vlan to tc priority mapping with
>> assumption sk_prio == L2 prio.
>>
>> Example:
>> $ ethtool -L eth0 rx 1 tx 4
>>
>> $ qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
>> map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
>>
>> $ tc -g class show dev eth0
>> +---(100:ffe2) mqprio
>> |    +---(100:3) mqprio
>> |    +---(100:4) mqprio
>> |    
>> +---(100:ffe1) mqprio
>> |    +---(100:2) mqprio
>> |    
>> +---(100:ffe0) mqprio
>>      +---(100:1) mqprio
>>
>> Here, 100:1 is txq0, 100:2 is txq1, 100:3 is txq2, 100:4 is txq3
>> txq0 belongs to tc0, txq1 to tc1, txq2 and txq3 to tc2
>> The offload part only maps L2 prio to classes of traffic, but not
>> to transmit queues, so to direct traffic to traffic class vlan has
>> to be created with appropriate egress map.
>>
>> Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
>> ---
>>  drivers/net/ethernet/ti/cpsw.c | 82 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 82 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
>> index 406537d74ec1..fd967d2bce5d 100644
>> --- a/drivers/net/ethernet/ti/cpsw.c
>> +++ b/drivers/net/ethernet/ti/cpsw.c
>> @@ -39,6 +39,7 @@
[...]

>>
>> +static int cpsw_set_tc(struct net_device *ndev, void *type_data)
>> +{
>
>Hi Ivan
>
>Maybe this is not the best of names. What if you add support for
>another TC qdisc? So you have another case in the switch statement
>below?
>
>Maybe call it cpsw_set_mqprio?

Yes, proposed name is more suitable.

-- 
Regards,
Ivan Khoronzhuk
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 6/6] Documentation: networking: cpsw: add MQPRIO & CBS offload examples
From: Ivan Khoronzhuk @ 2018-06-12 20:36 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: davem, corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
	vinicius.gomes, henrik, jesus.sanchez-palencia, ilias.apalodimas,
	p-varis, spatton, francois.ozog, yogeshs, nsekhar
In-Reply-To: <16c97ea6-9f28-1dd6-136a-2473423bf94a@ti.com>

On Tue, Jun 12, 2018 at 02:55:18PM -0500, Grygorii Strashko wrote:
>
>
>On 06/11/2018 08:30 AM, Ivan Khoronzhuk wrote:
>>This document describes MQPRIO and CBS Qdisc offload configuration
>>for cpsw driver based on examples. It potentially can be used in
>>audio video bridging (AVB) and time sensitive networking (TSN).
>>
>>Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
>>---
>>  Documentation/networking/cpsw.txt | 540 ++++++++++++++++++++++++++++++
>>  1 file changed, 540 insertions(+)
>>  create mode 100644 Documentation/networking/cpsw.txt
>>
>>diff --git a/Documentation/networking/cpsw.txt b/Documentation/networking/cpsw.txt
>>new file mode 100644
>>index 000000000000..f5d58f502e52
>>--- /dev/null
>>+++ b/Documentation/networking/cpsw.txt
>
>Could you name it with "ti" prefix, pls?
>
>Like "ti-cpsw.txt" or "ti,cpsw.txt"
will name it ti-cpsw.txt in v2

>
>>@@ -0,0 +1,540 @@
>>+* Texas Instruments CPSW ethernet driver
>>+
>>+Multiqueue & CBS & MQPRIO
>>+=====================================================================
>>+=====================================================================
>[...]
>
>-- 
>regards,
>-grygorii

-- 
Regards,
Ivan Khoronzhuk
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 0/1] USB Audio Device Class 3.0 Gadget support
From: Ruslan Bilovol @ 2018-06-13  0:28 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Greg Kroah-Hartman, linux-usb@vger.kernel.org, linux-doc,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAB=otbSidge7CYyR21744_2xy400tUw51nbaZy0HPmjADt3oTw@mail.gmail.com>

On Thu, Dec 7, 2017 at 1:18 PM, Ruslan Bilovol <ruslan.bilovol@gmail.com> wrote:
> Hi Felipe,
>
> On Mon, Dec 4, 2017 at 1:36 PM, Felipe Balbi <balbi@kernel.org> wrote:
>>
>> Hi,
>>
>> Ruslan Bilovol <ruslan.bilovol@gmail.com> writes:
>>> On Tue, Nov 7, 2017 at 3:52 AM, Ruslan Bilovol <ruslan.bilovol@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> This patch adds USB Audio Device Class 3.0 [1] function
>>>> support to gadget subsystem.
>>>> I didn't add UAC3 support to legacy gadget as it will
>>>> make preprocessor configuration too complex (UAC3 device
>>>> must have two configurations for backward compatibility,
>>>> first is UAC1/2 and second is UAC3), yet also I'm too lazy
>>>> to do that and verify all possible configurations.
>>>>
>>>> For modern ConfigFS interface I'll provide my configuration
>>>> for testing below; testing was done on a BeagleBone Black
>>>> board.
>>>>
>>>> This patch depends on uac3 header files from include dir
>>>> which I'll post as part of ALSA host UAC3 patch and will
>>>> provide the link to it here.
>>>
>>> http://www.spinics.net/lists/alsa-devel/msg69071.html
>>
>> Once that patch hits upstream, then we can queue this for merge window
>> otherwise we will just have issues and create unbisectable points in the
>> tree.
>
> Takashi promised to create an immutable branch for that purpose.
>
> However, I'm currently reworking configfs part of UAC3 for channels
> configuration handling, which is now more clear after sharing missing
> parts of UAC3 spec by Pierre-Louis Bossart during host side patches
> review; so I will send v2 soon.

OK, so now we have both UAC3 initial support patches [1] and
also UAC3 BADD profiles support [2] in Torvalds tree, and I'm going
to refresh this patch series and send v2 soon (perhaps in next few weeks)

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9a2fe9b801f585baccf8352d82839dcd54b300cf
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=17156f23e93c0f59e06dd2aaffd06221341caaee

Thanks,
Ruslan
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Thomas Hellstrom @ 2018-06-13  7:47 UTC (permalink / raw)
  To: dri-devel, linux-kernel
  Cc: Thomas Hellstrom, Peter Zijlstra, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180613074745.14750-1-thellstrom@vmware.com>

The current Wound-Wait mutex algorithm is actually not Wound-Wait but
Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait
is, contrary to Wait-Die a preemptive algorithm and is known to generate
fewer backoffs. Testing reveals that this is true if the
number of simultaneous contending transactions is small.
As the number of simultaneous contending threads increases, Wait-Wound
becomes inferior to Wait-Die in terms of elapsed time.
Possibly due to the larger number of held locks of sleeping transactions.

Update documentation and callers.

Timings using git://people.freedesktop.org/~thomash/ww_mutex_test
tag patch-18-06-04

Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly
chosen out of 100000. Four core Intel x86_64:

Algorithm    #threads       Rollbacks  time
Wound-Wait   4              ~100       ~17s.
Wait-Die     4              ~150000    ~19s.
Wound-Wait   16             ~360000    ~109s.
Wait-Die     16             ~450000    ~82s.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 Documentation/locking/ww-mutex-design.txt | 57 ++++++++++++++----
 drivers/dma-buf/reservation.c             |  2 +-
 drivers/gpu/drm/drm_modeset_lock.c        |  2 +-
 include/linux/ww_mutex.h                  | 19 ++++--
 kernel/locking/locktorture.c              |  2 +-
 kernel/locking/mutex.c                    | 98 ++++++++++++++++++++++++++++---
 kernel/locking/test-ww_mutex.c            |  2 +-
 lib/locking-selftest.c                    |  2 +-
 8 files changed, 152 insertions(+), 32 deletions(-)

diff --git a/Documentation/locking/ww-mutex-design.txt b/Documentation/locking/ww-mutex-design.txt
index 34c3a1b50b9a..29c85623b551 100644
--- a/Documentation/locking/ww-mutex-design.txt
+++ b/Documentation/locking/ww-mutex-design.txt
@@ -1,4 +1,4 @@
-Wait/Wound Deadlock-Proof Mutex Design
+Wound/Wait Deadlock-Proof Mutex Design
 ======================================
 
 Please read mutex-design.txt first, as it applies to wait/wound mutexes too.
@@ -32,10 +32,23 @@ the oldest task) wins, and the one with the higher reservation id (i.e. the
 younger task) unlocks all of the buffers that it has already locked, and then
 tries again.
 
-In the RDBMS literature this deadlock handling approach is called wait/wound:
-The older tasks waits until it can acquire the contended lock. The younger tasks
-needs to back off and drop all the locks it is currently holding, i.e. the
-younger task is wounded.
+In the RDBMS literature, a reservation ticket is associated with a transaction.
+and the deadlock handling approach is called Wait-Die. The name is based on
+the actions of a locking thread when it encounters an already locked mutex.
+If the transaction holding the lock is younger, the locking transaction waits.
+If the transaction holding the lock is older, the locking transaction backs off
+and dies. Hence Wait-Die.
+There is also another algorithm called Wound-Wait:
+If the transaction holding the lock is younger, the locking transaction
+preempts the transaction holding the lock, requiring it to back off. It
+Wounds the other transaction.
+If the transaction holding the lock is older, it waits for the other
+transaction. Hence Wound-Wait.
+The two algorithms are both fair in that a transaction will eventually succeed.
+However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
+compared to Wait-Die, but is, on the other hand, associated with more work than
+Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
+algorithm which requires a reliable way to preempt another transaction.
 
 Concepts
 --------
@@ -47,10 +60,12 @@ Acquire context: To ensure eventual forward progress it is important the a task
 trying to acquire locks doesn't grab a new reservation id, but keeps the one it
 acquired when starting the lock acquisition. This ticket is stored in the
 acquire context. Furthermore the acquire context keeps track of debugging state
-to catch w/w mutex interface abuse.
+to catch w/w mutex interface abuse. An acquire context is representing a
+transaction.
 
 W/w class: In contrast to normal mutexes the lock class needs to be explicit for
-w/w mutexes, since it is required to initialize the acquire context.
+w/w mutexes, since it is required to initialize the acquire context. The lock
+class also specifies what algorithm to use, Wound-Wait or Wait-Die.
 
 Furthermore there are three different class of w/w lock acquire functions:
 
@@ -90,10 +105,15 @@ provided.
 Usage
 -----
 
+The algorithm (Wait-Die vs Wound-Wait) is chosen using the _is_wait_die
+argument to DEFINE_WW_CLASS(). As a rough rule of thumb, use Wound-Wait iff you
+typically expect the number of simultaneous competing transactions to be small,
+and the rollback cost can be substantial.
+
 Three different ways to acquire locks within the same w/w class. Common
 definitions for methods #1 and #2:
 
-static DEFINE_WW_CLASS(ww_class);
+static DEFINE_WW_CLASS(ww_class, false);
 
 struct obj {
 	struct ww_mutex lock;
@@ -243,7 +263,7 @@ struct obj {
 	struct list_head locked_list;
 };
 
-static DEFINE_WW_CLASS(ww_class);
+static DEFINE_WW_CLASS(ww_class, false);
 
 void __unlock_objs(struct list_head *list)
 {
@@ -312,12 +332,23 @@ Design:
   We maintain the following invariants for the wait list:
   (1) Waiters with an acquire context are sorted by stamp order; waiters
       without an acquire context are interspersed in FIFO order.
-  (2) Among waiters with contexts, only the first one can have other locks
-      acquired already (ctx->acquired > 0). Note that this waiter may come
-      after other waiters without contexts in the list.
+  (2) For Wait-Die, among waiters with contexts, only the first one can have
+      other locks acquired already (ctx->acquired > 0). Note that this waiter
+      may come after other waiters without contexts in the list.
+
+  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
+  The wounded status of the transaction is checked only when there is
+  contention for a new lock and hence a true chance of deadlock. In that
+  situation, if the transaction is wounded, it backs off, clears the
+  wounded status and retries. A great benefit of implementing preemption in
+  this way is that the wounded transaction can identify a contending lock to
+  wait for before restarting the transaction. Just blindly restarting the
+  transaction would likely make the transaction end up in a situation where
+  it would have to back off again.
 
   In general, not much contention is expected. The locks are typically used to
-  serialize access to resources for devices.
+  serialize access to resources for devices, and optimization focus should
+  therefore be directed towards the uncontended cases.
 
 Lockdep:
   Special care has been taken to warn for as many cases of api abuse
diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index 314eb1071cce..039571b9fea1 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -46,7 +46,7 @@
  * write-side updates.
  */
 
-DEFINE_WW_CLASS(reservation_ww_class);
+DEFINE_WW_CLASS(reservation_ww_class, true);
 EXPORT_SYMBOL(reservation_ww_class);
 
 struct lock_class_key reservation_seqcount_class;
diff --git a/drivers/gpu/drm/drm_modeset_lock.c b/drivers/gpu/drm/drm_modeset_lock.c
index 8a5100685875..f22a7ef41de1 100644
--- a/drivers/gpu/drm/drm_modeset_lock.c
+++ b/drivers/gpu/drm/drm_modeset_lock.c
@@ -70,7 +70,7 @@
  * lists and lookup data structures.
  */
 
-static DEFINE_WW_CLASS(crtc_ww_class);
+static DEFINE_WW_CLASS(crtc_ww_class, true);
 
 /**
  * drm_modeset_lock_all - take all modeset locks
diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
index 39fda195bf78..6278077f288b 100644
--- a/include/linux/ww_mutex.h
+++ b/include/linux/ww_mutex.h
@@ -8,6 +8,8 @@
  *
  * Wound/wait implementation:
  *  Copyright (C) 2013 Canonical Ltd.
+ * Choice of algorithm:
+ *  Copyright (C) 2018 WMWare Inc.
  *
  * This file contains the main data structure and API definitions.
  */
@@ -23,15 +25,17 @@ struct ww_class {
 	struct lock_class_key mutex_key;
 	const char *acquire_name;
 	const char *mutex_name;
+	bool is_wait_die;
 };
 
 struct ww_acquire_ctx {
 	struct task_struct *task;
 	unsigned long stamp;
 	unsigned acquired;
+	bool wounded;
+	struct ww_class *ww_class;
 #ifdef CONFIG_DEBUG_MUTEXES
 	unsigned done_acquire;
-	struct ww_class *ww_class;
 	struct ww_mutex *contending_lock;
 #endif
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -58,17 +62,19 @@ struct ww_mutex {
 # define __WW_CLASS_MUTEX_INITIALIZER(lockname, class)
 #endif
 
-#define __WW_CLASS_INITIALIZER(ww_class) \
+#define __WW_CLASS_INITIALIZER(ww_class, _is_wait_die)	    \
 		{ .stamp = ATOMIC_LONG_INIT(0) \
 		, .acquire_name = #ww_class "_acquire" \
-		, .mutex_name = #ww_class "_mutex" }
+		, .mutex_name = #ww_class "_mutex" \
+		, .is_wait_die = _is_wait_die }
 
 #define __WW_MUTEX_INITIALIZER(lockname, class) \
 		{ .base =  __MUTEX_INITIALIZER(lockname.base) \
 		__WW_CLASS_MUTEX_INITIALIZER(lockname, class) }
 
-#define DEFINE_WW_CLASS(classname) \
-	struct ww_class classname = __WW_CLASS_INITIALIZER(classname)
+#define DEFINE_WW_CLASS(classname, _is_wait_die)			\
+	struct ww_class classname = __WW_CLASS_INITIALIZER(classname, \
+							   _is_wait_die)
 
 #define DEFINE_WW_MUTEX(mutexname, ww_class) \
 	struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class)
@@ -123,8 +129,9 @@ static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
 	ctx->task = current;
 	ctx->stamp = atomic_long_inc_return_relaxed(&ww_class->stamp);
 	ctx->acquired = 0;
-#ifdef CONFIG_DEBUG_MUTEXES
 	ctx->ww_class = ww_class;
+	ctx->wounded = false;
+#ifdef CONFIG_DEBUG_MUTEXES
 	ctx->done_acquire = 0;
 	ctx->contending_lock = NULL;
 #endif
diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 6850ffd69125..778ed026382f 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -365,7 +365,7 @@ static struct lock_torture_ops mutex_lock_ops = {
 };
 
 #include <linux/ww_mutex.h>
-static DEFINE_WW_CLASS(torture_ww_class);
+static DEFINE_WW_CLASS(torture_ww_class, true);
 static DEFINE_WW_MUTEX(torture_ww_mutex_0, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_1, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_2, &torture_ww_class);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2048359f33d2..b449a012c6f9 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -290,12 +290,47 @@ __ww_ctx_stamp_after(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
 	       (a->stamp != b->stamp || a > b);
 }
 
+/*
+ * Wound the lock holder transaction if it's younger than the contending
+ * transaction, and there is a possibility of a deadlock.
+ * Also if the lock holder transaction isn't the current transaction,
+ * Make sure it's woken up in case it's sleeping on another ww mutex.
+ */
+static bool __ww_mutex_wound(struct mutex *lock,
+			     struct ww_acquire_ctx *ww_ctx,
+			     struct ww_acquire_ctx *hold_ctx)
+{
+	struct task_struct *owner =
+		__owner_task(atomic_long_read(&lock->owner));
+
+	lockdep_assert_held(&lock->wait_lock);
+
+	if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) &&
+	    ww_ctx->acquired > 0) {
+		WRITE_ONCE(hold_ctx->wounded, true);
+		if (owner != current) {
+			/*
+			 * wake_up_process() inserts a write memory barrier to
+			 * make sure owner sees it is wounded before
+			 * TASK_RUNNING in case it's sleeping on another
+			 * ww_mutex. Note that owner points to a valid
+			 * task_struct as long as we hold the wait_lock.
+			 */
+			wake_up_process(owner);
+		}
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Wake up any waiters that may have to back off when the lock is held by the
  * given context.
  *
  * Due to the invariants on the wait list, this can only affect the first
- * waiter with a context.
+ * waiter with a context, unless the Wound-Wait algorithm is used where
+ * also subsequent waiters with a context main wound the lock holder.
  *
  * The current task must not be on the wait list.
  */
@@ -303,6 +338,7 @@ static void __sched
 __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
 {
 	struct mutex_waiter *cur;
+	bool is_wait_die = ww_ctx->ww_class->is_wait_die;
 
 	lockdep_assert_held(&lock->wait_lock);
 
@@ -310,13 +346,14 @@ __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
 		if (!cur->ww_ctx)
 			continue;
 
-		if (cur->ww_ctx->acquired > 0 &&
+		if (is_wait_die && cur->ww_ctx->acquired > 0 &&
 		    __ww_ctx_stamp_after(cur->ww_ctx, ww_ctx)) {
 			debug_mutex_wake_waiter(lock, cur);
 			wake_up_process(cur->task);
 		}
 
-		break;
+		if (is_wait_die || __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+			break;
 	}
 }
 
@@ -338,12 +375,17 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 	 * and keep spinning, or it will acquire wait_lock, add itself
 	 * to waiter list and sleep.
 	 */
-	smp_mb(); /* ^^^ */
+	smp_mb(); /* See comments above and below. */
 
 	/*
-	 * Check if lock is contended, if not there is nobody to wake up
+	 * Check if lock is contended, if not there is nobody to wake up.
+	 * Checking MUTEX_FLAG_WAITERS is not enough here, since we need to
+	 * order against the lock->ctx check in __ww_mutex_wound called from
+	 * __ww_mutex_add_waiter. We can use list_empty without taking the
+	 * wait_lock, given the memory barrier above and the list_empty
+	 * documentation.
 	 */
-	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
+	if (likely(list_empty(&lock->base.wait_list)))
 		return;
 
 	/*
@@ -653,6 +695,17 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter,
 	struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx);
 	struct mutex_waiter *cur;
 
+	/*
+	 * If we miss a wounded == true here, we will have a pending
+	 * TASK_RUNNING and pick it up on the next schedule fall-through.
+	 */
+	if (!ctx->ww_class->is_wait_die) {
+		if (READ_ONCE(ctx->wounded))
+			goto deadlock;
+		else
+			return 0;
+	}
+
 	if (hold_ctx && __ww_ctx_stamp_after(ctx, hold_ctx))
 		goto deadlock;
 
@@ -683,12 +736,15 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 {
 	struct mutex_waiter *cur;
 	struct list_head *pos;
+	bool is_wait_die;
 
 	if (!ww_ctx) {
 		list_add_tail(&waiter->list, &lock->wait_list);
 		return 0;
 	}
 
+	is_wait_die = ww_ctx->ww_class->is_wait_die;
+
 	/*
 	 * Add the waiter before the first waiter with a higher stamp.
 	 * Waiters without a context are skipped to avoid starving
@@ -701,7 +757,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 
 		if (__ww_ctx_stamp_after(ww_ctx, cur->ww_ctx)) {
 			/* Back off immediately if necessary. */
-			if (ww_ctx->acquired > 0) {
+			if (is_wait_die && ww_ctx->acquired > 0) {
 #ifdef CONFIG_DEBUG_MUTEXES
 				struct ww_mutex *ww;
 
@@ -721,13 +777,26 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 		 * Wake up the waiter so that it gets a chance to back
 		 * off.
 		 */
-		if (cur->ww_ctx->acquired > 0) {
+		if (is_wait_die && cur->ww_ctx->acquired > 0) {
 			debug_mutex_wake_waiter(lock, cur);
 			wake_up_process(cur->task);
 		}
 	}
 
 	list_add_tail(&waiter->list, pos);
+	if (!is_wait_die) {
+		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+
+		/*
+		 * Make sure a racing lock taker sees a non-empty waiting list
+		 * before we read ww->ctx, so that if we miss ww->ctx, the
+		 * racing lock taker will call __ww_mutex_wake_up_for_backoff()
+		 * and wound itself.
+		 */
+		smp_mb();
+		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
+	}
+
 	return 0;
 }
 
@@ -750,6 +819,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	if (use_ww_ctx && ww_ctx) {
 		if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
 			return -EALREADY;
+
+		/*
+		 * Reset the wounded flag after a backoff.
+		 * No other process can race and wound us here since they
+		 * can't have a valid owner pointer at this time
+		 */
+		if (ww_ctx->acquired == 0)
+			ww_ctx->wounded = false;
 	}
 
 	preempt_disable();
@@ -858,6 +935,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 acquired:
 	__set_current_state(TASK_RUNNING);
 
+	/* We stole the lock. Need to check wounded status. */
+	if (use_ww_ctx && ww_ctx && !ww_ctx->ww_class->is_wait_die &&
+	    !__mutex_waiter_is_first(lock, &waiter))
+		__ww_mutex_wakeup_for_backoff(lock, ww_ctx);
+
 	mutex_remove_waiter(lock, &waiter, current);
 	if (likely(list_empty(&lock->wait_list)))
 		__mutex_clear_flag(lock, MUTEX_FLAGS);
diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c
index 0e4cd64ad2c0..c7fc112d691d 100644
--- a/kernel/locking/test-ww_mutex.c
+++ b/kernel/locking/test-ww_mutex.c
@@ -26,7 +26,7 @@
 #include <linux/slab.h>
 #include <linux/ww_mutex.h>
 
-static DEFINE_WW_CLASS(ww_class);
+static DEFINE_WW_CLASS(ww_class, true);
 struct workqueue_struct *wq;
 
 struct test_mutex {
diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index b5c1293ce147..e52065f2acbf 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -29,7 +29,7 @@
  */
 static unsigned int debug_locks_verbose;
 
-static DEFINE_WW_CLASS(ww_lockdep);
+static DEFINE_WW_CLASS(ww_lockdep, true);
 
 static int __init setup_debug_locks_verbose(char *str)
 {
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Greg Kroah-Hartman @ 2018-06-13  7:54 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: dri-devel, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
	Thomas Gleixner, Kate Stewart, Philippe Ombredanne, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180613074745.14750-2-thellstrom@vmware.com>

On Wed, Jun 13, 2018 at 09:47:44AM +0200, Thomas Hellstrom wrote:
>  -----
>  
> +The algorithm (Wait-Die vs Wound-Wait) is chosen using the _is_wait_die
> +argument to DEFINE_WW_CLASS(). As a rough rule of thumb, use Wound-Wait iff you
> +typically expect the number of simultaneous competing transactions to be small,
> +and the rollback cost can be substantial.
> +
>  Three different ways to acquire locks within the same w/w class. Common
>  definitions for methods #1 and #2:
>  
> -static DEFINE_WW_CLASS(ww_class);
> +static DEFINE_WW_CLASS(ww_class, false);

Minor nit on the api here.  Having a "flag" is a royal pain.  You have
to go and look up exactly what that "true/false" means every time you
run across it in code to figure out what it means.  Don't do that if at
all possible.

Make a new api:
	DEFINE_WW_CLASS_DIE(ww_class);
instead that then wraps that boolean internally to switch between the
different types.  That way the api is "self-documenting" and we all know
what is going on without having to dig through a header file.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Thomas Hellstrom @ 2018-06-13  8:34 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: dri-devel, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet, Gustavo Padovan, Maarten Lankhorst, Sean Paul,
	David Airlie, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
	Thomas Gleixner, Kate Stewart, Philippe Ombredanne, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180613075411.GA17681@kroah.com>

On 06/13/2018 09:54 AM, Greg Kroah-Hartman wrote:
> On Wed, Jun 13, 2018 at 09:47:44AM +0200, Thomas Hellstrom wrote:
>>   -----
>>   
>> +The algorithm (Wait-Die vs Wound-Wait) is chosen using the _is_wait_die
>> +argument to DEFINE_WW_CLASS(). As a rough rule of thumb, use Wound-Wait iff you
>> +typically expect the number of simultaneous competing transactions to be small,
>> +and the rollback cost can be substantial.
>> +
>>   Three different ways to acquire locks within the same w/w class. Common
>>   definitions for methods #1 and #2:
>>   
>> -static DEFINE_WW_CLASS(ww_class);
>> +static DEFINE_WW_CLASS(ww_class, false);
> Minor nit on the api here.  Having a "flag" is a royal pain.  You have
> to go and look up exactly what that "true/false" means every time you
> run across it in code to figure out what it means.  Don't do that if at
> all possible.
>
> Make a new api:
> 	DEFINE_WW_CLASS_DIE(ww_class);
> instead that then wraps that boolean internally to switch between the
> different types.  That way the api is "self-documenting" and we all know
> what is going on without having to dig through a header file.
>
> thanks,
>
> greg k-h

Good point. I'll update in a v2.

Thanks,

Thomas


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Peter Zijlstra @ 2018-06-13  9:50 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: dri-devel, linux-kernel, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180613074745.14750-2-thellstrom@vmware.com>


/me wonders what's up with partial Cc's today..

On Wed, Jun 13, 2018 at 09:47:44AM +0200, Thomas Hellstrom wrote:
> The current Wound-Wait mutex algorithm is actually not Wound-Wait but
> Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait
> is, contrary to Wait-Die a preemptive algorithm and is known to generate
> fewer backoffs. Testing reveals that this is true if the
> number of simultaneous contending transactions is small.
> As the number of simultaneous contending threads increases, Wait-Wound
> becomes inferior to Wait-Die in terms of elapsed time.
> Possibly due to the larger number of held locks of sleeping transactions.
> 
> Update documentation and callers.
> 
> Timings using git://people.freedesktop.org/~thomash/ww_mutex_test
> tag patch-18-06-04
> 
> Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly
> chosen out of 100000. Four core Intel x86_64:
> 
> Algorithm    #threads       Rollbacks  time
> Wound-Wait   4              ~100       ~17s.
> Wait-Die     4              ~150000    ~19s.
> Wound-Wait   16             ~360000    ~109s.
> Wait-Die     16             ~450000    ~82s.

> diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
> index 39fda195bf78..6278077f288b 100644
> --- a/include/linux/ww_mutex.h
> +++ b/include/linux/ww_mutex.h
> @@ -8,6 +8,8 @@
>   *
>   * Wound/wait implementation:
>   *  Copyright (C) 2013 Canonical Ltd.
> + * Choice of algorithm:
> + *  Copyright (C) 2018 WMWare Inc.
>   *
>   * This file contains the main data structure and API definitions.
>   */
> @@ -23,15 +25,17 @@ struct ww_class {
>  	struct lock_class_key mutex_key;
>  	const char *acquire_name;
>  	const char *mutex_name;
> +	bool is_wait_die;
>  };

No _Bool in composites please.

>  struct ww_acquire_ctx {
>  	struct task_struct *task;
>  	unsigned long stamp;
>  	unsigned acquired;
> +	bool wounded;

Again.

> +	struct ww_class *ww_class;
>  #ifdef CONFIG_DEBUG_MUTEXES
>  	unsigned done_acquire;
> -	struct ww_class *ww_class;
>  	struct ww_mutex *contending_lock;
>  #endif
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC

> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 2048359f33d2..b449a012c6f9 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -290,12 +290,47 @@ __ww_ctx_stamp_after(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
>  	       (a->stamp != b->stamp || a > b);
>  }
>  
> +/*
> + * Wound the lock holder transaction if it's younger than the contending
> + * transaction, and there is a possibility of a deadlock.
> + * Also if the lock holder transaction isn't the current transaction,

Comma followed by a capital?

> + * Make sure it's woken up in case it's sleeping on another ww mutex.

> + */
> +static bool __ww_mutex_wound(struct mutex *lock,
> +			     struct ww_acquire_ctx *ww_ctx,
> +			     struct ww_acquire_ctx *hold_ctx)
> +{
> +	struct task_struct *owner =
> +		__owner_task(atomic_long_read(&lock->owner));

Did you just spell __mutex_owner() wrong?

> +
> +	lockdep_assert_held(&lock->wait_lock);
> +
> +	if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) &&
> +	    ww_ctx->acquired > 0) {
> +		WRITE_ONCE(hold_ctx->wounded, true);
> +		if (owner != current) {
> +			/*
> +			 * wake_up_process() inserts a write memory barrier to

It does no such thing. But yes, it does ensure the wakee sees all prior
stores IFF the wakeup happened.

> +			 * make sure owner sees it is wounded before
> +			 * TASK_RUNNING in case it's sleeping on another
> +			 * ww_mutex. Note that owner points to a valid
> +			 * task_struct as long as we hold the wait_lock.
> +			 */

What exactly are you trying to say here ?

I'm thinking this is the pairing barrier to the smp_mb() below, with
your list_empty() thing? Might make sense to write a single coherent
comment and refer to the other location.

> +			wake_up_process(owner);
> +		}
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  /*
>   * Wake up any waiters that may have to back off when the lock is held by the
>   * given context.
>   *
>   * Due to the invariants on the wait list, this can only affect the first
> - * waiter with a context.
> + * waiter with a context, unless the Wound-Wait algorithm is used where
> + * also subsequent waiters with a context main wound the lock holder.
>   *
>   * The current task must not be on the wait list.
>   */
> @@ -303,6 +338,7 @@ static void __sched
>  __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
>  {
>  	struct mutex_waiter *cur;
> +	bool is_wait_die = ww_ctx->ww_class->is_wait_die;
>  
>  	lockdep_assert_held(&lock->wait_lock);
>  
> @@ -310,13 +346,14 @@ __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
>  		if (!cur->ww_ctx)
>  			continue;
>  
> -		if (cur->ww_ctx->acquired > 0 &&
> +		if (is_wait_die && cur->ww_ctx->acquired > 0 &&
>  		    __ww_ctx_stamp_after(cur->ww_ctx, ww_ctx)) {
>  			debug_mutex_wake_waiter(lock, cur);
>  			wake_up_process(cur->task);
>  		}
>  
> -		break;
> +		if (is_wait_die || __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
> +			break;
>  	}
>  }
>  
> @@ -338,12 +375,17 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
>  	 * and keep spinning, or it will acquire wait_lock, add itself
>  	 * to waiter list and sleep.
>  	 */
> -	smp_mb(); /* ^^^ */
> +	smp_mb(); /* See comments above and below. */
>  
>  	/*
> -	 * Check if lock is contended, if not there is nobody to wake up
> +	 * Check if lock is contended, if not there is nobody to wake up.
> +	 * Checking MUTEX_FLAG_WAITERS is not enough here, 

That seems like a superfluous thing to say. It makes sense in the
context of this patch because we change the FLAG check into a list
check, but the resulting comment/code looks odd.

>							   since we need to
> +	 * order against the lock->ctx check in __ww_mutex_wound called from
> +	 * __ww_mutex_add_waiter. We can use list_empty without taking the
> +	 * wait_lock, given the memory barrier above and the list_empty
> +	 * documentation.

I don't trust documentation. Please reason about implementation.

>  	 */
> -	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
> +	if (likely(list_empty(&lock->base.wait_list)))
>  		return;
>  
>  	/*
> @@ -653,6 +695,17 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter,
>  	struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx);
>  	struct mutex_waiter *cur;
>  
> +	/*
> +	 * If we miss a wounded == true here, we will have a pending

Explain how we can miss that.

> +	 * TASK_RUNNING and pick it up on the next schedule fall-through.
> +	 */
> +	if (!ctx->ww_class->is_wait_die) {
> +		if (READ_ONCE(ctx->wounded))
> +			goto deadlock;
> +		else
> +			return 0;
> +	}
> +
>  	if (hold_ctx && __ww_ctx_stamp_after(ctx, hold_ctx))
>  		goto deadlock;
>  
> @@ -683,12 +736,15 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>  {
>  	struct mutex_waiter *cur;
>  	struct list_head *pos;
> +	bool is_wait_die;
>  
>  	if (!ww_ctx) {
>  		list_add_tail(&waiter->list, &lock->wait_list);
>  		return 0;
>  	}
>  
> +	is_wait_die = ww_ctx->ww_class->is_wait_die;
> +
>  	/*
>  	 * Add the waiter before the first waiter with a higher stamp.
>  	 * Waiters without a context are skipped to avoid starving
> @@ -701,7 +757,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>  
>  		if (__ww_ctx_stamp_after(ww_ctx, cur->ww_ctx)) {
>  			/* Back off immediately if necessary. */
> -			if (ww_ctx->acquired > 0) {
> +			if (is_wait_die && ww_ctx->acquired > 0) {
>  #ifdef CONFIG_DEBUG_MUTEXES
>  				struct ww_mutex *ww;
>  
> @@ -721,13 +777,26 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>  		 * Wake up the waiter so that it gets a chance to back
>  		 * off.
>  		 */
> -		if (cur->ww_ctx->acquired > 0) {
> +		if (is_wait_die && cur->ww_ctx->acquired > 0) {
>  			debug_mutex_wake_waiter(lock, cur);
>  			wake_up_process(cur->task);
>  		}
>  	}
>  
>  	list_add_tail(&waiter->list, pos);
> +	if (!is_wait_die) {
> +		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
> +
> +		/*
> +		 * Make sure a racing lock taker sees a non-empty waiting list
> +		 * before we read ww->ctx, so that if we miss ww->ctx, the
> +		 * racing lock taker will call __ww_mutex_wake_up_for_backoff()
> +		 * and wound itself.
> +		 */
> +		smp_mb();
> +		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -750,6 +819,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  	if (use_ww_ctx && ww_ctx) {
>  		if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
>  			return -EALREADY;
> +
> +		/*
> +		 * Reset the wounded flag after a backoff.
> +		 * No other process can race and wound us here since they
> +		 * can't have a valid owner pointer at this time
> +		 */
> +		if (ww_ctx->acquired == 0)
> +			ww_ctx->wounded = false;
>  	}
>  
>  	preempt_disable();
> @@ -858,6 +935,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  acquired:
>  	__set_current_state(TASK_RUNNING);
>  
> +	/* We stole the lock. Need to check wounded status. */
> +	if (use_ww_ctx && ww_ctx && !ww_ctx->ww_class->is_wait_die &&
> +	    !__mutex_waiter_is_first(lock, &waiter))
> +		__ww_mutex_wakeup_for_backoff(lock, ww_ctx);
> +
>  	mutex_remove_waiter(lock, &waiter, current);
>  	if (likely(list_empty(&lock->wait_list)))
>  		__mutex_clear_flag(lock, MUTEX_FLAGS);

I can't say I'm a fan. I'm already cursing the ww_mutex stuff every time
I have to look at it, and you just made it worse spagethi.


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Thomas Hellstrom @ 2018-06-13 10:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dri-devel, linux-kernel, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180613095012.GW12198@hirez.programming.kicks-ass.net>

On 06/13/2018 11:50 AM, Peter Zijlstra wrote:
>
>> +
>> +	lockdep_assert_held(&lock->wait_lock);
>> +
>> +	if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) &&
>> +	    ww_ctx->acquired > 0) {
>> +		WRITE_ONCE(hold_ctx->wounded, true);
>> +		if (owner != current) {
>> +			/*
>> +			 * wake_up_process() inserts a write memory barrier to
> It does no such thing. But yes, it does ensure the wakee sees all prior
> stores IFF the wakeup happened.
>
>> +			 * make sure owner sees it is wounded before
>> +			 * TASK_RUNNING in case it's sleeping on another
>> +			 * ww_mutex. Note that owner points to a valid
>> +			 * task_struct as long as we hold the wait_lock.
>> +			 */
> What exactly are you trying to say here ?
>
> I'm thinking this is the pairing barrier to the smp_mb() below, with
> your list_empty() thing? Might make sense to write a single coherent
> comment and refer to the other location.

So what I'm trying to say here is that wake_up_process() ensures that 
the owner, if in !TASK_RUNNING, sees the write to hold_ctx->wounded 
before the transition to TASK_RUNNING. This was how I interpreted "woken 
up" in the wake up process documentation.

>
>> +			wake_up_process(owner);
>> +		}
>> +		return true;
>> +	}
>> +
>> +	return false;
>> +}
>> +
>>   /*
>>    * Wake up any waiters that may have to back off when the lock is held by the
>>    * given context.
>>    *
>>    * Due to the invariants on the wait list, this can only affect the first
>> - * waiter with a context.
>> + * waiter with a context, unless the Wound-Wait algorithm is used where
>> + * also subsequent waiters with a context main wound the lock holder.
>>    *
>>    * The current task must not be on the wait list.
>>    */
>> @@ -303,6 +338,7 @@ static void __sched
>>   __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
>>   {
>>   	struct mutex_waiter *cur;
>> +	bool is_wait_die = ww_ctx->ww_class->is_wait_die;
>>   
>>   	lockdep_assert_held(&lock->wait_lock);
>>   
>> @@ -310,13 +346,14 @@ __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
>>   		if (!cur->ww_ctx)
>>   			continue;
>>   
>> -		if (cur->ww_ctx->acquired > 0 &&
>> +		if (is_wait_die && cur->ww_ctx->acquired > 0 &&
>>   		    __ww_ctx_stamp_after(cur->ww_ctx, ww_ctx)) {
>>   			debug_mutex_wake_waiter(lock, cur);
>>   			wake_up_process(cur->task);
>>   		}
>>   
>> -		break;
>> +		if (is_wait_die || __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
>> +			break;
>>   	}
>>   }
>>   
>> @@ -338,12 +375,17 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
>>   	 * and keep spinning, or it will acquire wait_lock, add itself
>>   	 * to waiter list and sleep.
>>   	 */
>> -	smp_mb(); /* ^^^ */
>> +	smp_mb(); /* See comments above and below. */
>>   
>>   	/*
>> -	 * Check if lock is contended, if not there is nobody to wake up
>> +	 * Check if lock is contended, if not there is nobody to wake up.
>> +	 * Checking MUTEX_FLAG_WAITERS is not enough here,
> That seems like a superfluous thing to say. It makes sense in the
> context of this patch because we change the FLAG check into a list
> check, but the resulting comment/code looks odd.
>
>> 							   since we need to
>> +	 * order against the lock->ctx check in __ww_mutex_wound called from
>> +	 * __ww_mutex_add_waiter. We can use list_empty without taking the
>> +	 * wait_lock, given the memory barrier above and the list_empty
>> +	 * documentation.
> I don't trust documentation. Please reason about implementation.

Will do.

>>   	 */
>> -	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
>> +	if (likely(list_empty(&lock->base.wait_list)))
>>   		return;
>>   
>>   	/*
>> @@ -653,6 +695,17 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter,
>>   	struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx);
>>   	struct mutex_waiter *cur;
>>   
>> +	/*
>> +	 * If we miss a wounded == true here, we will have a pending
> Explain how we can miss that.

This is actually the pairing location of the wake_up_process() comment / 
code discussed above. Here we should have !TASK_RUNNING, and let's say 
ctx->wounded is set by another process immediately after we've read it 
(we "miss" it). At that point there must be a pending wake-up-process() 
for us and we'll pick up the set value of wounded on the next iteration 
after returning from schedule().

>
>> +	 * TASK_RUNNING and pick it up on the next schedule fall-through.
>> +	 */
>> +	if (!ctx->ww_class->is_wait_die) {
>> +		if (READ_ONCE(ctx->wounded))
>> +			goto deadlock;
>> +		else
>> +			return 0;
>> +	}
>> +
>>   	if (hold_ctx && __ww_ctx_stamp_after(ctx, hold_ctx))
>>   		goto deadlock;
>>   
>> @@ -683,12 +736,15 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>>   {
>>   	struct mutex_waiter *cur;
>>   	struct list_head *pos;
>> +	bool is_wait_die;
>>   
>>   	if (!ww_ctx) {
>>   		list_add_tail(&waiter->list, &lock->wait_list);
>>   		return 0;
>>   	}
>>   
>> +	is_wait_die = ww_ctx->ww_class->is_wait_die;
>> +
>>   	/*
>>   	 * Add the waiter before the first waiter with a higher stamp.
>>   	 * Waiters without a context are skipped to avoid starving
>> @@ -701,7 +757,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>>   
>>   		if (__ww_ctx_stamp_after(ww_ctx, cur->ww_ctx)) {
>>   			/* Back off immediately if necessary. */
>> -			if (ww_ctx->acquired > 0) {
>> +			if (is_wait_die && ww_ctx->acquired > 0) {
>>   #ifdef CONFIG_DEBUG_MUTEXES
>>   				struct ww_mutex *ww;
>>   
>> @@ -721,13 +777,26 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
>>   		 * Wake up the waiter so that it gets a chance to back
>>   		 * off.
>>   		 */
>> -		if (cur->ww_ctx->acquired > 0) {
>> +		if (is_wait_die && cur->ww_ctx->acquired > 0) {
>>   			debug_mutex_wake_waiter(lock, cur);
>>   			wake_up_process(cur->task);
>>   		}
>>   	}
>>   
>>   	list_add_tail(&waiter->list, pos);
>> +	if (!is_wait_die) {
>> +		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
>> +
>> +		/*
>> +		 * Make sure a racing lock taker sees a non-empty waiting list
>> +		 * before we read ww->ctx, so that if we miss ww->ctx, the
>> +		 * racing lock taker will call __ww_mutex_wake_up_for_backoff()
>> +		 * and wound itself.
>> +		 */
>> +		smp_mb();
>> +		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
>> +	}
>> +
>>   	return 0;
>>   }
>>   
>> @@ -750,6 +819,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>>   	if (use_ww_ctx && ww_ctx) {
>>   		if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
>>   			return -EALREADY;
>> +
>> +		/*
>> +		 * Reset the wounded flag after a backoff.
>> +		 * No other process can race and wound us here since they
>> +		 * can't have a valid owner pointer at this time
>> +		 */
>> +		if (ww_ctx->acquired == 0)
>> +			ww_ctx->wounded = false;
>>   	}
>>   
>>   	preempt_disable();
>> @@ -858,6 +935,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>>   acquired:
>>   	__set_current_state(TASK_RUNNING);
>>   
>> +	/* We stole the lock. Need to check wounded status. */
>> +	if (use_ww_ctx && ww_ctx && !ww_ctx->ww_class->is_wait_die &&
>> +	    !__mutex_waiter_is_first(lock, &waiter))
>> +		__ww_mutex_wakeup_for_backoff(lock, ww_ctx);
>> +
>>   	mutex_remove_waiter(lock, &waiter, current);
>>   	if (likely(list_empty(&lock->wait_list)))
>>   		__mutex_clear_flag(lock, MUTEX_FLAGS);
> I can't say I'm a fan. I'm already cursing the ww_mutex stuff every time
> I have to look at it, and you just made it worse spagethi.
>
>

Thanks for the review.

Well, I can't speak for the current ww implementation except I didn't 
think it was too hard to understand for a first time reader.

Admittedly the Wound-Wait path makes it worse since it's a preemptive 
algorithm and we need to touch other processes a acquire contexts and 
worry about ordering.

So, assuming your review comments are fixed up, is that a solid NAK or 
do you have any suggestion that would make you more comfortable with the 
code? like splitting out ww-stuff to a separate file?

/Thomas


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Peter Zijlstra @ 2018-06-13 13:10 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: dri-devel, linux-kernel, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <69f3dee9-4782-bc90-3ee2-813ac6835c4a@vmware.com>

On Wed, Jun 13, 2018 at 12:40:29PM +0200, Thomas Hellstrom wrote:
> On 06/13/2018 11:50 AM, Peter Zijlstra wrote:
> > 
> > > +
> > > +	lockdep_assert_held(&lock->wait_lock);
> > > +
> > > +	if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) &&
> > > +	    ww_ctx->acquired > 0) {
> > > +		WRITE_ONCE(hold_ctx->wounded, true);
> > > +		if (owner != current) {
> > > +			/*
> > > +			 * wake_up_process() inserts a write memory barrier to
> > It does no such thing. But yes, it does ensure the wakee sees all prior
> > stores IFF the wakeup happened.
> > 
> > > +			 * make sure owner sees it is wounded before
> > > +			 * TASK_RUNNING in case it's sleeping on another
> > > +			 * ww_mutex. Note that owner points to a valid
> > > +			 * task_struct as long as we hold the wait_lock.
> > > +			 */
> > What exactly are you trying to say here ?
> > 
> > I'm thinking this is the pairing barrier to the smp_mb() below, with
> > your list_empty() thing? Might make sense to write a single coherent
> > comment and refer to the other location.
> 
> So what I'm trying to say here is that wake_up_process() ensures that the
> owner, if in !TASK_RUNNING, sees the write to hold_ctx->wounded before the
> transition to TASK_RUNNING. This was how I interpreted "woken up" in the
> wake up process documentation.

There is documentation!? :-) Aaah, you mean that kerneldoc comment with
wake_up_process() ? Yeah, that needs fixing. /me puts on endless todo
list.

Anyway, wakeup providing that ordering isn't something that needs a
comment of that size; and I think the only comment here is that we care
about the ordering and a reference to the site(s) that pairs with it.

Maybe something like:

	/*
	 * __ww_mutex_lock_check_stamp() will observe our wounded store.
	 */

> > > -	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
> > > +	if (likely(list_empty(&lock->base.wait_list)))
> > >   		return;
> > >   	/*
> > > @@ -653,6 +695,17 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter,
> > >   	struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx);
> > >   	struct mutex_waiter *cur;
> > > +	/*
> > > +	 * If we miss a wounded == true here, we will have a pending
> > Explain how we can miss that.
> 
> This is actually the pairing location of the wake_up_process() comment /
> code discussed above. Here we should have !TASK_RUNNING, and let's say
> ctx->wounded is set by another process immediately after we've read it (we
> "miss" it). At that point there must be a pending wake-up-process() for us
> and we'll pick up the set value of wounded on the next iteration after
> returning from schedule().

Right, so that's when the above wakeup isn't the one waking us.


> > I can't say I'm a fan. I'm already cursing the ww_mutex stuff every time
> > I have to look at it, and you just made it worse spagethi.

> Well, I can't speak for the current ww implementation except I didn't think
> it was too hard to understand for a first time reader.
> 
> Admittedly the Wound-Wait path makes it worse since it's a preemptive
> algorithm and we need to touch other processes a acquire contexts and worry
> about ordering.
> 
> So, assuming your review comments are fixed up, is that a solid NAK or do
> you have any suggestion that would make you more comfortable with the code?
> like splitting out ww-stuff to a separate file?

Nah, not a NAK, but we should look at whan can be done to improve code.
Maybe add a few more comments that explain why. Part of the problem with
ww_mutex is always that I forget exactly how they work and mutex.c
doesn't have much useful comments in (most of those are in ww_mutex.h
and I always forget to look there).

Also; I'm not at all sure about the exact difference between what we
have and what you propose. I did read the documentation part (I really
should not have to) but it just doesn't jive.

I suspect you're using preemption entirely different from what we
usually call a preemption.



Also, __ww_ctx_stamp_after() is crap; did we want to write:

	return (signed long)(a->stamp - b->stamp) > 0;

or something?


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Thomas Hellstrom @ 2018-06-13 14:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dri-devel, linux-kernel, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180613131000.GX12198@hirez.programming.kicks-ass.net>

On 06/13/2018 03:10 PM, Peter Zijlstra wrote:
> On Wed, Jun 13, 2018 at 12:40:29PM +0200, Thomas Hellstrom wrote:
>> On 06/13/2018 11:50 AM, Peter Zijlstra wrote:
>>>> +
>>>> +	lockdep_assert_held(&lock->wait_lock);
>>>> +
>>>> +	if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) &&
>>>> +	    ww_ctx->acquired > 0) {
>>>> +		WRITE_ONCE(hold_ctx->wounded, true);
>>>> +		if (owner != current) {
>>>> +			/*
>>>> +			 * wake_up_process() inserts a write memory barrier to
>>> It does no such thing. But yes, it does ensure the wakee sees all prior
>>> stores IFF the wakeup happened.
>>>
>>>> +			 * make sure owner sees it is wounded before
>>>> +			 * TASK_RUNNING in case it's sleeping on another
>>>> +			 * ww_mutex. Note that owner points to a valid
>>>> +			 * task_struct as long as we hold the wait_lock.
>>>> +			 */
>>> What exactly are you trying to say here ?
>>>
>>> I'm thinking this is the pairing barrier to the smp_mb() below, with
>>> your list_empty() thing? Might make sense to write a single coherent
>>> comment and refer to the other location.
>> So what I'm trying to say here is that wake_up_process() ensures that the
>> owner, if in !TASK_RUNNING, sees the write to hold_ctx->wounded before the
>> transition to TASK_RUNNING. This was how I interpreted "woken up" in the
>> wake up process documentation.
> There is documentation!? :-) Aaah, you mean that kerneldoc comment with
> wake_up_process() ? Yeah, that needs fixing. /me puts on endless todo
> list.
>
> Anyway, wakeup providing that ordering isn't something that needs a
> comment of that size; and I think the only comment here is that we care
> about the ordering and a reference to the site(s) that pairs with it.
>
> Maybe something like:
>
> 	/*
> 	 * __ww_mutex_lock_check_stamp() will observe our wounded store.
> 	 */

Yes.

Actually, I just found the set_current_state() kerneldoc which explains 
the built-in barrier pairing with wake_up_xxx. Perhaps I also should 
mention that as well. Looks like the use WRITE_ONCE() and READ_ONCE() 
can be dropped as well.

>>>> -	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
>>>> +	if (likely(list_empty(&lock->base.wait_list)))
>>>>    		return;
>>>>    	/*
>>>> @@ -653,6 +695,17 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter,
>>>>    	struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx);
>>>>    	struct mutex_waiter *cur;
>>>> +	/*
>>>> +	 * If we miss a wounded == true here, we will have a pending
>>> Explain how we can miss that.
>> This is actually the pairing location of the wake_up_process() comment /
>> code discussed above. Here we should have !TASK_RUNNING, and let's say
>> ctx->wounded is set by another process immediately after we've read it (we
>> "miss" it). At that point there must be a pending wake-up-process() for us
>> and we'll pick up the set value of wounded on the next iteration after
>> returning from schedule().
> Right, so that's when the above wakeup isn't the one waking us.
>
>
>>> I can't say I'm a fan. I'm already cursing the ww_mutex stuff every time
>>> I have to look at it, and you just made it worse spagethi.
>> Well, I can't speak for the current ww implementation except I didn't think
>> it was too hard to understand for a first time reader.
>>
>> Admittedly the Wound-Wait path makes it worse since it's a preemptive
>> algorithm and we need to touch other processes a acquire contexts and worry
>> about ordering.
>>
>> So, assuming your review comments are fixed up, is that a solid NAK or do
>> you have any suggestion that would make you more comfortable with the code?
>> like splitting out ww-stuff to a separate file?
> Nah, not a NAK, but we should look at whan can be done to improve code.
> Maybe add a few more comments that explain why. Part of the problem with
> ww_mutex is always that I forget exactly how they work and mutex.c
> doesn't have much useful comments in (most of those are in ww_mutex.h
> and I always forget to look there).

Understood.

>
> Also; I'm not at all sure about the exact difference between what we
> have and what you propose. I did read the documentation part (I really
> should not have to) but it just doesn't jive.
>
> I suspect you're using preemption entirely different from what we
> usually call a preemption.

I think that perhaps requires a good understanding of the difference of 
the algorithms in question before looking at the implementation. I put a 
short explanation and some URLs to CS websites describing the two 
algorithms and their pros and cons in the patch series introductory 
message. I'll forward that.

In short, with Wait-Die (before the patch) it's the process _taking_ the 
contended lock that backs off if necessary. No preemption required. With 
Wound-Wait, it's the process _holding_ the contended lock that gets 
wounded (preempted), and it needs to back off at its own discretion but 
no later than when it's going to sleep on another ww mutex. That point 
is where we intercept the preemption request. We're preempting the 
transaction rather than the process.


>
>
> Also, __ww_ctx_stamp_after() is crap; did we want to write:
>
> 	return (signed long)(a->stamp - b->stamp) > 0;
>
> or something?
>
>
Hmm. Yes it indeed looks odd. Seems like the above code should do the trick.

/Thomas


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 00/10] Control Flow Enforcement - Part (3)
From: Balbir Singh @ 2018-06-14  1:07 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: linux-kernel, linux-doc, linux-mm, linux-arch, x86,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H.J. Lu,
	Vedvyas Shanbhogue, Ravi V. Shankar, Dave Hansen, Andy Lutomirski,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, Mike Kravetz
In-Reply-To: <1528815820.8271.16.camel@2b52.sc.intel.com>

On Tue, 2018-06-12 at 08:03 -0700, Yu-cheng Yu wrote:
> On Tue, 2018-06-12 at 20:56 +1000, Balbir Singh wrote:
> > 
> > On 08/06/18 00:37, Yu-cheng Yu wrote:
> > > This series introduces CET - Shadow stack
> > > 
> > > At the high level, shadow stack is:
> > > 
> > > 	Allocated from a task's address space with vm_flags VM_SHSTK;
> > > 	Its PTEs must be read-only and dirty;
> > > 	Fixed sized, but the default size can be changed by sys admin.
> > > 
> > > For a forked child, the shadow stack is duplicated when the next
> > > shadow stack access takes place.
> > > 
> > > For a pthread child, a new shadow stack is allocated.
> > > 
> > > The signal handler uses the same shadow stack as the main program.
> > > 
> > 
> > Even with sigaltstack()?
> > 
> Yes.

I am not convinced that it would work, as we switch stacks, oveflow might
be an issue. I also forgot to bring up setcontext(2), I presume those
will get new shadow stacks

Balbir Singh.

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 02/10] x86/cet: Introduce WRUSS instruction
From: Balbir Singh @ 2018-06-14  1:30 UTC (permalink / raw)
  To: Yu-cheng Yu, linux-kernel, linux-doc, linux-mm, linux-arch, x86,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar, H.J. Lu,
	Vedvyas Shanbhogue, Ravi V. Shankar, Dave Hansen, Andy Lutomirski,
	Jonathan Corbet, Oleg Nesterov, Arnd Bergmann, Mike Kravetz
In-Reply-To: <20180607143807.3611-3-yu-cheng.yu@intel.com>

On Thu, 2018-06-07 at 07:37 -0700, Yu-cheng Yu wrote:
> WRUSS is a new kernel-mode instruction but writes directly
> to user shadow stack memory.  This is used to construct
> a return address on the shadow stack for the signal
> handler.
> 
> This instruction can fault if the user shadow stack is
> invalid shadow stack memory.  In that case, the kernel does
> fixup.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> ---
>  arch/x86/include/asm/special_insns.h          | 44 +++++++++++++++++++++++++++
>  arch/x86/lib/x86-opcode-map.txt               |  2 +-
>  arch/x86/mm/fault.c                           | 13 +++++++-
>  tools/objtool/arch/x86/lib/x86-opcode-map.txt |  2 +-
>  4 files changed, 58 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
> index 317fc59b512c..8ce532fcc171 100644
> --- a/arch/x86/include/asm/special_insns.h
> +++ b/arch/x86/include/asm/special_insns.h
> @@ -237,6 +237,50 @@ static inline void clwb(volatile void *__p)
>  		: [pax] "a" (p));
>  }
>  
> +#ifdef CONFIG_X86_INTEL_CET
> +
> +#if defined(CONFIG_IA32_EMULATION) || defined(CONFIG_X86_X32)
> +static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
> +{
> +	int err;
> +
> +	asm volatile("1:.byte 0x66, 0x0f, 0x38, 0xf5, 0x37\n"

It would nice to use something like ASM_WRUSS/Q like ASM_CLAC/ASM_STAC.
Is the 0x37 spurious? I don't see addr/val being used in the instructions
either.

> +		     "xor %[err],%[err]\n"
> +		     "2:\n"
> +		     ".section .fixup,\"ax\"\n"
> +		     "3: mov $-1,%[err]; jmp 2b\n"
> +		     ".previous\n"
> +		     _ASM_EXTABLE(1b, 3b)
> +		: [err] "=a" (err)
> +		: [val] "S" (val), [addr] "D" (addr)
> +		: "memory");
> +	return err;
> +}
> +#else
> +static inline int write_user_shstk_32(unsigned long addr, unsigned int val)
> +{
> +	return 0;
> +}
> +#endif
> +
> +static inline int write_user_shstk_64(unsigned long addr, unsigned long val)
> +{
> +	int err;
> +
> +	asm volatile("1:.byte 0x66, 0x48, 0x0f, 0x38, 0xf5, 0x37\n"
> +		     "xor %[err],%[err]\n"
> +		     "2:\n"
> +		     ".section .fixup,\"ax\"\n"
> +		     "3: mov $-1,%[err]; jmp 2b\n"
> +		     ".previous\n"
> +		     _ASM_EXTABLE(1b, 3b)
> +		: [err] "=a" (err)
> +		: [val] "S" (val), [addr] "D" (addr)
> +		: "memory");
> +	return err;
> +}
> +#endif /* CONFIG_X86_INTEL_CET */
> +
>  #define nop() asm volatile ("nop")
>  
>  
> diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
> index e0b85930dd77..72bb7c48a7df 100644
> --- a/arch/x86/lib/x86-opcode-map.txt
> +++ b/arch/x86/lib/x86-opcode-map.txt
> @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
>  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
>  f2: ANDN Gy,By,Ey (v)
>  f3: Grp17 (1A)
> -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
> +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
>  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
>  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
>  EndTable
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 2b3b9170109c..f157338862f8 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -640,6 +640,17 @@ static int is_f00f_bug(struct pt_regs *regs, unsigned long address)
>  	return 0;
>  }
>  
> +/*
> + * WRUSS is a kernel instrcution and but writes to user
> + * shadow stack memory.  When a fault occurs, both
> + * X86_PF_USER and X86_PF_SHSTK are set.
> + */
> +static int is_wruss(struct pt_regs *regs, unsigned long error_code)
> +{
> +	return (((error_code & (X86_PF_USER | X86_PF_SHSTK)) ==
> +		(X86_PF_USER | X86_PF_SHSTK)) && !user_mode(regs));
> +}
> +
>  static const char nx_warning[] = KERN_CRIT
>  "kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n";
>  static const char smep_warning[] = KERN_CRIT
> @@ -851,7 +862,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
>  	struct task_struct *tsk = current;
>  
>  	/* User mode accesses just cause a SIGSEGV */
> -	if (error_code & X86_PF_USER) {
> +	if ((error_code & X86_PF_USER) && !is_wruss(regs, error_code)) {
>  		/*
>  		 * It's possible to have interrupts off here:
>  		 */
> diff --git a/tools/objtool/arch/x86/lib/x86-opcode-map.txt b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
> index e0b85930dd77..72bb7c48a7df 100644
> --- a/tools/objtool/arch/x86/lib/x86-opcode-map.txt
> +++ b/tools/objtool/arch/x86/lib/x86-opcode-map.txt
> @@ -789,7 +789,7 @@ f0: MOVBE Gy,My | MOVBE Gw,Mw (66) | CRC32 Gd,Eb (F2) | CRC32 Gd,Eb (66&F2)
>  f1: MOVBE My,Gy | MOVBE Mw,Gw (66) | CRC32 Gd,Ey (F2) | CRC32 Gd,Ew (66&F2)
>  f2: ANDN Gy,By,Ey (v)
>  f3: Grp17 (1A)
> -f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v)
> +f5: BZHI Gy,Ey,By (v) | PEXT Gy,By,Ey (F3),(v) | PDEP Gy,By,Ey (F2),(v) | WRUSS Pq,Qq (66),REX.W
>  f6: ADCX Gy,Ey (66) | ADOX Gy,Ey (F3) | MULX By,Gy,rDX,Ey (F2),(v)
>  f7: BEXTR Gy,Ey,By (v) | SHLX Gy,Ey,By (66),(v) | SARX Gy,Ey,By (F3),(v) | SHRX Gy,Ey,By (F2),(v)
>  EndTable

Balbir Singh.

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v4 01/10] i3c: Add core I3C infrastructure
From: Wolfram Sang @ 2018-06-14  4:19 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: linux-i2c, Jonathan Corbet, linux-doc, Greg Kroah-Hartman,
	Arnd Bergmann, Przemyslaw Sroka, Arkadiusz Golec, Alan Douglas,
	Bartosz Folta, Damian Kos, Alicja Jurasik-Urbaniak,
	Cyprian Wronka, Suresh Punnoose, Rafal Ciepiela, Thomas Petazzoni,
	Nishanth Menon, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree, linux-kernel, Vitor Soares,
	Geert Uytterhoeven, Linus Walleij, Xiang Lin, linux-gpio
In-Reply-To: <20180330074751.25987-2-boris.brezillon@bootlin.com>

[-- Attachment #1: Type: text/plain, Size: 1073 bytes --]

Hi Boris,

> +/**
> + * struct i3c_priv_xfer - I3C SDR private transfer
> + * @rnw: encodes the transfer direction. true for a read, false for a write
> + * @len: transfer length in bytes of the transfer
> + * @data: input/output buffer
> + */
> +struct i3c_priv_xfer {
> +	bool rnw;
> +	u16 len;
> +	union {
> +		void *in;
> +		const void *out;
> +	} data;

So, this is probably where most payloads end up?

I didn't notice any sign of DMA in these patches, but given my
experiences, DMA will come sooner or later. And in I2C, this was
problematic because then a lot of drivers were in the wild getting their
buffers from everywhere (stack!). We now have an opt-in approach to mark
buffers as DMA-safe.

I don't know if typical I3C transfers will be similar to I2C with
usually small payloads where DMA usually makes not much sense. Yet, I
think, that it might be a good idea to think about how this shall be
handled with I3C right away. Maybe simply enforcing buffers to be
DMA-safe. Or whatever.

A clear rule on that might save you hazzle later.

Regards,

   Wolfram

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v4 01/10] i3c: Add core I3C infrastructure
From: Boris Brezillon @ 2018-06-14  7:07 UTC (permalink / raw)
  To: Wolfram Sang
  Cc: linux-i2c, Jonathan Corbet, linux-doc, Greg Kroah-Hartman,
	Arnd Bergmann, Przemyslaw Sroka, Arkadiusz Golec, Alan Douglas,
	Bartosz Folta, Damian Kos, Alicja Jurasik-Urbaniak,
	Cyprian Wronka, Suresh Punnoose, Rafal Ciepiela, Thomas Petazzoni,
	Nishanth Menon, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree, linux-kernel, Vitor Soares,
	Geert Uytterhoeven, Linus Walleij, Xiang Lin, linux-gpio
In-Reply-To: <20180614041941.4i2cadzoevujmzha@ninjato>

On Thu, 14 Jun 2018 13:19:42 +0900
Wolfram Sang <wsa@the-dreams.de> wrote:

> Hi Boris,
> 
> > +/**
> > + * struct i3c_priv_xfer - I3C SDR private transfer
> > + * @rnw: encodes the transfer direction. true for a read, false for a write
> > + * @len: transfer length in bytes of the transfer
> > + * @data: input/output buffer
> > + */
> > +struct i3c_priv_xfer {
> > +	bool rnw;
> > +	u16 len;
> > +	union {
> > +		void *in;
> > +		const void *out;
> > +	} data;  
> 
> So, this is probably where most payloads end up?
> 
> I didn't notice any sign of DMA in these patches, but given my
> experiences, DMA will come sooner or later. And in I2C, this was
> problematic because then a lot of drivers were in the wild getting their
> buffers from everywhere (stack!). We now have an opt-in approach to mark
> buffers as DMA-safe.
> 
> I don't know if typical I3C transfers will be similar to I2C with
> usually small payloads where DMA usually makes not much sense. Yet, I
> think, that it might be a good idea to think about how this shall be
> handled with I3C right away. Maybe simply enforcing buffers to be
> DMA-safe. Or whatever.
> 
> A clear rule on that might save you hazzle later.

I completely agree. I'll clarify that and for people to pass DMA-able
buffers to this struct (just as the SPI framework does). Note that we
don't really have a way to ensure that the buffer is actually
DMA-safe from the core, because this notion is architecture/SoC
dependent.

> 
> Regards,
> 
>    Wolfram
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2 0/2] locking,drm: Fix ww mutex naming / algorithm inconsistency
From: Thomas Hellstrom @ 2018-06-14  7:29 UTC (permalink / raw)
  To: dri-devel, linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Jonathan Corbet, Gustavo Padovan,
	Maarten Lankhorst, Sean Paul, David Airlie, Davidlohr Bueso,
	Paul E. McKenney, Josh Triplett, Thomas Gleixner, Kate Stewart,
	Philippe Ombredanne, Greg Kroah-Hartman, linux-doc, linux-media,
	linaro-mm-sig

This is a small fallout from a work to allow batching WW mutex locks and
unlocks.

Our Wound-Wait mutexes actually don't use the Wound-Wait algorithm but
the Wait-Die algorithm. One could perhaps rename those mutexes tree-wide to
"Wait-Die mutexes" or "Deadlock Avoidance mutexes". Another approach suggested
here is to implement also the "Wound-Wait" algorithm as a per-WW-class
choice, as it has advantages in some cases. See for example

http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html

Now Wound-Wait is a preemptive algorithm, and the preemption is implemented
using a lazy scheme: If a wounded transaction is about to go to sleep on
a contended WW mutex, we return -EDEADLK. That is sufficient for deadlock
prevention. Since with WW mutexes we also require the aborted transaction to
sleep waiting to lock the WW mutex it was aborted on, this choice also provides
a suitable WW mutex to sleep on. If we were to return -EDEADLK on the first
WW mutex lock after the transaction was wounded whether the WW mutex was
contended or not, the transaction might frequently be restarted without a wait,
which is far from optimal. Note also that with the lazy preemption scheme,
contrary to Wait-Die there will be no rollbacks on lock contention of locks
held by a transaction that has completed its locking sequence.

The modeset locks are then changed from Wait-Die to Wound-Wait since the
typical locking pattern of those locks very well matches the criterion for
a substantial reduction in the number of rollbacks. For reservation objects,
the benefit is more unclear at this point and they remain using Wait-Die.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org

v2:
  Updated DEFINE_WW_CLASS() API, minor code- and comment fixes as
  detailed in each patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2 1/2] locking: Implement an algorithm choice for Wound-Wait mutexes
From: Thomas Hellstrom @ 2018-06-14  7:29 UTC (permalink / raw)
  To: dri-devel, linux-kernel
  Cc: Thomas Hellstrom, Peter Zijlstra, Ingo Molnar, Jonathan Corbet,
	Gustavo Padovan, Maarten Lankhorst, Sean Paul, David Airlie,
	Davidlohr Bueso, Paul E. McKenney, Josh Triplett, Thomas Gleixner,
	Kate Stewart, Philippe Ombredanne, Greg Kroah-Hartman, linux-doc,
	linux-media, linaro-mm-sig
In-Reply-To: <20180614072922.8114-1-thellstrom@vmware.com>

The current Wound-Wait mutex algorithm is actually not Wound-Wait but
Wait-Die. Implement also Wound-Wait as a per-ww-class choice. Wound-Wait
is, contrary to Wait-Die a preemptive algorithm and is known to generate
fewer backoffs. Testing reveals that this is true if the
number of simultaneous contending transactions is small.
As the number of simultaneous contending threads increases, Wait-Wound
becomes inferior to Wait-Die in terms of elapsed time.
Possibly due to the larger number of held locks of sleeping transactions.

Update documentation and callers.

Timings using git://people.freedesktop.org/~thomash/ww_mutex_test
tag patch-18-06-14

Each thread runs 100000 batches of lock / unlock 800 ww mutexes randomly
chosen out of 100000. Four core Intel x86_64:

Algorithm    #threads       Rollbacks  time
Wound-Wait   4              ~100       ~17s.
Wait-Die     4              ~150000    ~19s.
Wound-Wait   16             ~360000    ~109s.
Wait-Die     16             ~450000    ~82s.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Sean Paul <seanpaul@chromium.org>
Cc: David Airlie <airlied@linux.ie>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>

---
v2:
* Update API according to review comment by Greg Kroah-Hartman.
* Address review comments by Peter Zijlstra:
  - Avoid _Bool in composites
  - Fix typo
  - Use __mutex_owner() where applicable
  - Rely on built-in barriers for the main loop exit condition,
    struct ww_acquire_ctx::wounded. Update code comments.
  - Explain unlocked use of list_empty().
---
 Documentation/locking/ww-mutex-design.txt |  54 ++++++++++++----
 drivers/dma-buf/reservation.c             |   2 +-
 drivers/gpu/drm/drm_modeset_lock.c        |   2 +-
 include/linux/ww_mutex.h                  |  19 ++++--
 kernel/locking/locktorture.c              |   2 +-
 kernel/locking/mutex.c                    | 103 +++++++++++++++++++++++++++---
 kernel/locking/test-ww_mutex.c            |   2 +-
 lib/locking-selftest.c                    |   2 +-
 8 files changed, 156 insertions(+), 30 deletions(-)

diff --git a/Documentation/locking/ww-mutex-design.txt b/Documentation/locking/ww-mutex-design.txt
index 34c3a1b50b9a..b9597def9581 100644
--- a/Documentation/locking/ww-mutex-design.txt
+++ b/Documentation/locking/ww-mutex-design.txt
@@ -1,4 +1,4 @@
-Wait/Wound Deadlock-Proof Mutex Design
+Wound/Wait Deadlock-Proof Mutex Design
 ======================================
 
 Please read mutex-design.txt first, as it applies to wait/wound mutexes too.
@@ -32,10 +32,23 @@ the oldest task) wins, and the one with the higher reservation id (i.e. the
 younger task) unlocks all of the buffers that it has already locked, and then
 tries again.
 
-In the RDBMS literature this deadlock handling approach is called wait/wound:
-The older tasks waits until it can acquire the contended lock. The younger tasks
-needs to back off and drop all the locks it is currently holding, i.e. the
-younger task is wounded.
+In the RDBMS literature, a reservation ticket is associated with a transaction.
+and the deadlock handling approach is called Wait-Die. The name is based on
+the actions of a locking thread when it encounters an already locked mutex.
+If the transaction holding the lock is younger, the locking transaction waits.
+If the transaction holding the lock is older, the locking transaction backs off
+and dies. Hence Wait-Die.
+There is also another algorithm called Wound-Wait:
+If the transaction holding the lock is younger, the locking transaction
+preempts the transaction holding the lock, requiring it to back off. It
+Wounds the other transaction.
+If the transaction holding the lock is older, it waits for the other
+transaction. Hence Wound-Wait.
+The two algorithms are both fair in that a transaction will eventually succeed.
+However, the Wound-Wait algorithm is typically stated to generate fewer backoffs
+compared to Wait-Die, but is, on the other hand, associated with more work than
+Wait-Die when recovering from a backoff. Wound-Wait is also a preemptive
+algorithm which requires a reliable way to preempt another transaction.
 
 Concepts
 --------
@@ -47,10 +60,12 @@ Acquire context: To ensure eventual forward progress it is important the a task
 trying to acquire locks doesn't grab a new reservation id, but keeps the one it
 acquired when starting the lock acquisition. This ticket is stored in the
 acquire context. Furthermore the acquire context keeps track of debugging state
-to catch w/w mutex interface abuse.
+to catch w/w mutex interface abuse. An acquire context is representing a
+transaction.
 
 W/w class: In contrast to normal mutexes the lock class needs to be explicit for
-w/w mutexes, since it is required to initialize the acquire context.
+w/w mutexes, since it is required to initialize the acquire context. The lock
+class also specifies what algorithm to use, Wound-Wait or Wait-Die.
 
 Furthermore there are three different class of w/w lock acquire functions:
 
@@ -90,6 +105,12 @@ provided.
 Usage
 -----
 
+The algorithm (Wait-Die vs Wound-Wait) is chosen by using either
+DEFINE_WW_CLASS_WDIE() for Wait-Die or DEFINE_WW_CLASS() for Wound-Wait.
+As a rough rule of thumb, use Wound-Wait iff you typically expect the number
+of simultaneous competing transactions to be small, and the rollback cost can
+be substantial.
+
 Three different ways to acquire locks within the same w/w class. Common
 definitions for methods #1 and #2:
 
@@ -312,12 +333,23 @@ Design:
   We maintain the following invariants for the wait list:
   (1) Waiters with an acquire context are sorted by stamp order; waiters
       without an acquire context are interspersed in FIFO order.
-  (2) Among waiters with contexts, only the first one can have other locks
-      acquired already (ctx->acquired > 0). Note that this waiter may come
-      after other waiters without contexts in the list.
+  (2) For Wait-Die, among waiters with contexts, only the first one can have
+      other locks acquired already (ctx->acquired > 0). Note that this waiter
+      may come after other waiters without contexts in the list.
+
+  The Wound-Wait preemption is implemented with a lazy-preemption scheme:
+  The wounded status of the transaction is checked only when there is
+  contention for a new lock and hence a true chance of deadlock. In that
+  situation, if the transaction is wounded, it backs off, clears the
+  wounded status and retries. A great benefit of implementing preemption in
+  this way is that the wounded transaction can identify a contending lock to
+  wait for before restarting the transaction. Just blindly restarting the
+  transaction would likely make the transaction end up in a situation where
+  it would have to back off again.
 
   In general, not much contention is expected. The locks are typically used to
-  serialize access to resources for devices.
+  serialize access to resources for devices, and optimization focus should
+  therefore be directed towards the uncontended cases.
 
 Lockdep:
   Special care has been taken to warn for as many cases of api abuse
diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index 314eb1071cce..b94a4bab2ecd 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -46,7 +46,7 @@
  * write-side updates.
  */
 
-DEFINE_WW_CLASS(reservation_ww_class);
+DEFINE_WW_CLASS_WDIE(reservation_ww_class);
 EXPORT_SYMBOL(reservation_ww_class);
 
 struct lock_class_key reservation_seqcount_class;
diff --git a/drivers/gpu/drm/drm_modeset_lock.c b/drivers/gpu/drm/drm_modeset_lock.c
index 8a5100685875..ff00a814f617 100644
--- a/drivers/gpu/drm/drm_modeset_lock.c
+++ b/drivers/gpu/drm/drm_modeset_lock.c
@@ -70,7 +70,7 @@
  * lists and lookup data structures.
  */
 
-static DEFINE_WW_CLASS(crtc_ww_class);
+static DEFINE_WW_CLASS_WDIE(crtc_ww_class);
 
 /**
  * drm_modeset_lock_all - take all modeset locks
diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
index 39fda195bf78..3880813b7db5 100644
--- a/include/linux/ww_mutex.h
+++ b/include/linux/ww_mutex.h
@@ -8,6 +8,8 @@
  *
  * Wound/wait implementation:
  *  Copyright (C) 2013 Canonical Ltd.
+ * Choice of algorithm:
+ *  Copyright (C) 2018 WMWare Inc.
  *
  * This file contains the main data structure and API definitions.
  */
@@ -23,15 +25,17 @@ struct ww_class {
 	struct lock_class_key mutex_key;
 	const char *acquire_name;
 	const char *mutex_name;
+	unsigned int is_wait_die;
 };
 
 struct ww_acquire_ctx {
 	struct task_struct *task;
 	unsigned long stamp;
 	unsigned acquired;
+	unsigned int wounded;
+	struct ww_class *ww_class;
 #ifdef CONFIG_DEBUG_MUTEXES
 	unsigned done_acquire;
-	struct ww_class *ww_class;
 	struct ww_mutex *contending_lock;
 #endif
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
@@ -58,17 +62,21 @@ struct ww_mutex {
 # define __WW_CLASS_MUTEX_INITIALIZER(lockname, class)
 #endif
 
-#define __WW_CLASS_INITIALIZER(ww_class) \
+#define __WW_CLASS_INITIALIZER(ww_class, _is_wait_die)	    \
 		{ .stamp = ATOMIC_LONG_INIT(0) \
 		, .acquire_name = #ww_class "_acquire" \
-		, .mutex_name = #ww_class "_mutex" }
+		, .mutex_name = #ww_class "_mutex" \
+		, .is_wait_die = _is_wait_die }
 
 #define __WW_MUTEX_INITIALIZER(lockname, class) \
 		{ .base =  __MUTEX_INITIALIZER(lockname.base) \
 		__WW_CLASS_MUTEX_INITIALIZER(lockname, class) }
 
 #define DEFINE_WW_CLASS(classname) \
-	struct ww_class classname = __WW_CLASS_INITIALIZER(classname)
+	struct ww_class classname = __WW_CLASS_INITIALIZER(classname, 0)
+
+#define DEFINE_WW_CLASS_WDIE(classname)	\
+	struct ww_class classname = __WW_CLASS_INITIALIZER(classname, 1)
 
 #define DEFINE_WW_MUTEX(mutexname, ww_class) \
 	struct ww_mutex mutexname = __WW_MUTEX_INITIALIZER(mutexname, ww_class)
@@ -123,8 +131,9 @@ static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
 	ctx->task = current;
 	ctx->stamp = atomic_long_inc_return_relaxed(&ww_class->stamp);
 	ctx->acquired = 0;
-#ifdef CONFIG_DEBUG_MUTEXES
 	ctx->ww_class = ww_class;
+	ctx->wounded = false;
+#ifdef CONFIG_DEBUG_MUTEXES
 	ctx->done_acquire = 0;
 	ctx->contending_lock = NULL;
 #endif
diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c
index 6850ffd69125..e861c1bf0e1e 100644
--- a/kernel/locking/locktorture.c
+++ b/kernel/locking/locktorture.c
@@ -365,7 +365,7 @@ static struct lock_torture_ops mutex_lock_ops = {
 };
 
 #include <linux/ww_mutex.h>
-static DEFINE_WW_CLASS(torture_ww_class);
+static DEFINE_WW_CLASS_WDIE(torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_0, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_1, &torture_ww_class);
 static DEFINE_WW_MUTEX(torture_ww_mutex_2, &torture_ww_class);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2048359f33d2..ffa00b5aaf03 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -290,12 +290,49 @@ __ww_ctx_stamp_after(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
 	       (a->stamp != b->stamp || a > b);
 }
 
+/*
+ * Wound the lock holder transaction if it's younger than the contending
+ * transaction, and there is a possibility of a deadlock.
+ * Also if the lock holder transaction isn't the current transaction,
+ * make sure it's woken up in case it's sleeping on another ww mutex.
+ */
+static bool __ww_mutex_wound(struct mutex *lock,
+			     struct ww_acquire_ctx *ww_ctx,
+			     struct ww_acquire_ctx *hold_ctx)
+{
+	struct task_struct *owner = __mutex_owner(lock);
+
+	lockdep_assert_held(&lock->wait_lock);
+
+	if (owner && hold_ctx && __ww_ctx_stamp_after(hold_ctx, ww_ctx) &&
+	    ww_ctx->acquired > 0) {
+		hold_ctx->wounded = 1;
+
+		/*
+		 * wake_up_process() paired with set_current_state() inserts
+		 * sufficient barriers to make sure @owner either sees it's
+		 * wounded or has a wakeup pending to re-read the wounded
+		 * state.
+		 *
+		 * The value of hold_ctx->wounded in
+		 * __ww_mutex_lock_check_stamp();
+		 */
+		if (owner != current)
+			wake_up_process(owner);
+
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Wake up any waiters that may have to back off when the lock is held by the
  * given context.
  *
  * Due to the invariants on the wait list, this can only affect the first
- * waiter with a context.
+ * waiter with a context, unless the Wound-Wait algorithm is used where
+ * also subsequent waiters with a context main wound the lock holder.
  *
  * The current task must not be on the wait list.
  */
@@ -303,6 +340,7 @@ static void __sched
 __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
 {
 	struct mutex_waiter *cur;
+	unsigned int is_wait_die = ww_ctx->ww_class->is_wait_die;
 
 	lockdep_assert_held(&lock->wait_lock);
 
@@ -310,13 +348,14 @@ __ww_mutex_wakeup_for_backoff(struct mutex *lock, struct ww_acquire_ctx *ww_ctx)
 		if (!cur->ww_ctx)
 			continue;
 
-		if (cur->ww_ctx->acquired > 0 &&
+		if (is_wait_die && cur->ww_ctx->acquired > 0 &&
 		    __ww_ctx_stamp_after(cur->ww_ctx, ww_ctx)) {
 			debug_mutex_wake_waiter(lock, cur);
 			wake_up_process(cur->task);
 		}
 
-		break;
+		if (is_wait_die || __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx))
+			break;
 	}
 }
 
@@ -338,12 +377,18 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 	 * and keep spinning, or it will acquire wait_lock, add itself
 	 * to waiter list and sleep.
 	 */
-	smp_mb(); /* ^^^ */
+	smp_mb(); /* See comments above and below. */
 
 	/*
-	 * Check if lock is contended, if not there is nobody to wake up
+	 * Check if lock is contended, if not there is nobody to wake up.
+	 * We can use list_empty() unlocked here since it only compares a
+	 * list_head field pointer to the address of the list head
+	 * itself, similarly to how list_empty() can be considered RCU-safe.
+	 * The memory barrier above pairs with the memory barrier in
+	 * __ww_mutex_add_waiter and makes sure lock->ctx is visible before
+	 * we check for waiters.
 	 */
-	if (likely(!(atomic_long_read(&lock->base.owner) & MUTEX_FLAG_WAITERS)))
+	if (likely(list_empty(&lock->base.wait_list)))
 		return;
 
 	/*
@@ -653,6 +698,13 @@ __ww_mutex_lock_check_stamp(struct mutex *lock, struct mutex_waiter *waiter,
 	struct ww_acquire_ctx *hold_ctx = READ_ONCE(ww->ctx);
 	struct mutex_waiter *cur;
 
+	if (!ctx->ww_class->is_wait_die) {
+		if (ctx->wounded)
+			goto deadlock;
+		else
+			return 0;
+	}
+
 	if (hold_ctx && __ww_ctx_stamp_after(ctx, hold_ctx))
 		goto deadlock;
 
@@ -683,16 +735,21 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 {
 	struct mutex_waiter *cur;
 	struct list_head *pos;
+	unsigned int is_wait_die;
 
 	if (!ww_ctx) {
 		list_add_tail(&waiter->list, &lock->wait_list);
 		return 0;
 	}
 
+	is_wait_die = ww_ctx->ww_class->is_wait_die;
+
 	/*
 	 * Add the waiter before the first waiter with a higher stamp.
 	 * Waiters without a context are skipped to avoid starving
-	 * them.
+	 * them. Wait-Die waiters may back off here. Wound-Wait waiters
+	 * never back off here, but they are sorted in stamp order and
+	 * may wound the lock holder.
 	 */
 	pos = &lock->wait_list;
 	list_for_each_entry_reverse(cur, &lock->wait_list, list) {
@@ -701,7 +758,7 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 
 		if (__ww_ctx_stamp_after(ww_ctx, cur->ww_ctx)) {
 			/* Back off immediately if necessary. */
-			if (ww_ctx->acquired > 0) {
+			if (is_wait_die && ww_ctx->acquired > 0) {
 #ifdef CONFIG_DEBUG_MUTEXES
 				struct ww_mutex *ww;
 
@@ -721,13 +778,28 @@ __ww_mutex_add_waiter(struct mutex_waiter *waiter,
 		 * Wake up the waiter so that it gets a chance to back
 		 * off.
 		 */
-		if (cur->ww_ctx->acquired > 0) {
+		if (is_wait_die && cur->ww_ctx->acquired > 0) {
 			debug_mutex_wake_waiter(lock, cur);
 			wake_up_process(cur->task);
 		}
 	}
 
 	list_add_tail(&waiter->list, pos);
+	if (!is_wait_die) {
+		struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);
+
+		/*
+		 * Make sure a racing lock taker sees a non-empty waiting list
+		 * before we read ww->ctx, so that if we miss ww->ctx, the
+		 * racing lock taker will see a non-empty list and call
+		 * __ww_mutex_wake_up_for_backoff() and wound itself. The
+		 * memory barrier pairs with the one in
+		 * ww_mutex_set_context_fastpath().
+		 */
+		smp_mb();
+		__ww_mutex_wound(lock, ww_ctx, ww->ctx);
+	}
+
 	return 0;
 }
 
@@ -750,6 +822,14 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	if (use_ww_ctx && ww_ctx) {
 		if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
 			return -EALREADY;
+
+		/*
+		 * Reset the wounded flag after a backoff.
+		 * No other process can race and wound us here since they
+		 * can't have a valid owner pointer at this time
+		 */
+		if (ww_ctx->acquired == 0)
+			ww_ctx->wounded = 0;
 	}
 
 	preempt_disable();
@@ -858,6 +938,11 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 acquired:
 	__set_current_state(TASK_RUNNING);
 
+	/* We stole the lock. Need to check wounded status. */
+	if (use_ww_ctx && ww_ctx && !ww_ctx->ww_class->is_wait_die &&
+	    !__mutex_waiter_is_first(lock, &waiter))
+		__ww_mutex_wakeup_for_backoff(lock, ww_ctx);
+
 	mutex_remove_waiter(lock, &waiter, current);
 	if (likely(list_empty(&lock->wait_list)))
 		__mutex_clear_flag(lock, MUTEX_FLAGS);
diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c
index 0e4cd64ad2c0..3413430611d8 100644
--- a/kernel/locking/test-ww_mutex.c
+++ b/kernel/locking/test-ww_mutex.c
@@ -26,7 +26,7 @@
 #include <linux/slab.h>
 #include <linux/ww_mutex.h>
 
-static DEFINE_WW_CLASS(ww_class);
+static DEFINE_WW_CLASS_WDIE(ww_class);
 struct workqueue_struct *wq;
 
 struct test_mutex {
diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c
index b5c1293ce147..d0abf65ba9ad 100644
--- a/lib/locking-selftest.c
+++ b/lib/locking-selftest.c
@@ -29,7 +29,7 @@
  */
 static unsigned int debug_locks_verbose;
 
-static DEFINE_WW_CLASS(ww_lockdep);
+static DEFINE_WW_CLASS_WDIE(ww_lockdep);
 
 static int __init setup_debug_locks_verbose(char *str)
 {
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 net-next 5/6] net: ethernet: ti: cpsw: restore shaper configuration while down/up
From: Ivan Khoronzhuk @ 2018-06-14  7:36 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
	vinicius.gomes, henrik, jesus.sanchez-palencia, ilias.apalodimas,
	p-varis, spatton, francois.ozog, yogeshs, nsekhar, andrew,
	Ivan Khoronzhuk
In-Reply-To: <20180614073650.29659-1-ivan.khoronzhuk@linaro.org>

Need to restore shapers configuration after interface was down/up.
This is needed as appropriate configuration is still replicated in
kernel settings. This only shapers context restore, so vlan
configuration should be restored by user if needed, especially for
devices with one port where vlan frames are sent via ALE.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 47 ++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 3623c2994ddf..b2dfa7053b40 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1807,6 +1807,51 @@ static int cpsw_set_cbs(struct net_device *ndev,
 	return ret;
 }
 
+static void cpsw_cbs_resume(struct cpsw_slave *slave, struct cpsw_priv *priv)
+{
+	int fifo, bw;
+
+	for (fifo = CPSW_FIFO_SHAPERS_NUM; fifo > 0; fifo--) {
+		bw = priv->fifo_bw[fifo];
+		if (!bw)
+			continue;
+
+		cpsw_set_fifo_rlimit(priv, fifo, bw);
+	}
+}
+
+static void cpsw_mqprio_resume(struct cpsw_slave *slave, struct cpsw_priv *priv)
+{
+	struct cpsw_common *cpsw = priv->cpsw;
+	u32 tx_prio_map = 0;
+	int i, tc, fifo;
+	u32 tx_prio_rg;
+
+	if (!priv->mqprio_hw)
+		return;
+
+	for (i = 0; i < 8; i++) {
+		tc = netdev_get_prio_tc_map(priv->ndev, i);
+		fifo = CPSW_FIFO_SHAPERS_NUM - tc;
+		tx_prio_map |= fifo << (4 * i);
+	}
+
+	tx_prio_rg = cpsw->version == CPSW_VERSION_1 ?
+		     CPSW1_TX_PRI_MAP : CPSW2_TX_PRI_MAP;
+
+	slave_write(slave, tx_prio_map, tx_prio_rg);
+}
+
+/* restore resources after port reset */
+static void cpsw_restore(struct cpsw_priv *priv)
+{
+	/* restore MQPRIO offload */
+	for_each_slave(priv, cpsw_mqprio_resume, priv);
+
+	/* restore CBS offload */
+	for_each_slave(priv, cpsw_cbs_resume, priv);
+}
+
 static int cpsw_ndo_open(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
@@ -1886,6 +1931,8 @@ static int cpsw_ndo_open(struct net_device *ndev)
 
 	}
 
+	cpsw_restore(priv);
+
 	/* Enable Interrupt pacing if configured */
 	if (cpsw->coal_intvl != 0) {
 		struct ethtool_coalesce coal;
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 net-next 4/6] net: ethernet: ti: cpsw: add CBS Qdisc offload
From: Ivan Khoronzhuk @ 2018-06-14  7:36 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
	vinicius.gomes, henrik, jesus.sanchez-palencia, ilias.apalodimas,
	p-varis, spatton, francois.ozog, yogeshs, nsekhar, andrew,
	Ivan Khoronzhuk
In-Reply-To: <20180614073650.29659-1-ivan.khoronzhuk@linaro.org>

The cpsw has up to 4 FIFOs per port and upper 3 FIFOs can feed rate
limited queue with shaping. In order to set and enable shaping for
those 3 FIFOs queues the network device with CBS qdisc attached is
needed. The CBS configuration is added for dual-emac/single port mode
only, but potentially can be used in switch mode also, based on
switchdev for instance.

Despite the FIFO shapers can work w/o cpdma level shapers the base
usage must be in combine with cpdma level shapers as described in TRM,
that are set as maximum rates for interface queues with sysfs.

One of the possible configuration with txq shapers and CBS shapers:

                      Configured with echo RATE >
                  /sys/class/net/eth0/queues/tx-0/tx_maxrate
             /---------------------------------------------------
            /
           /            cpdma level shapers
        +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
        | c7 | | c6 | | c5 | | c4 | | c3 | | c2 | | c1 | | c0 |
        \    / \    / \    / \    / \    / \    / \    / \    /
         \  /   \  /   \  /   \  /   \  /   \  /   \  /   \  /
          \/     \/     \/     \/     \/     \/     \/     \/
+---------|------|------|------|-------------------------------------+
|    +----+      |      |  +---+                                     |
|    |      +----+      |  |                                         |
|    v      v           v  v                                         |
| +----+ +----+ +----+ +----+ p        p+----+ +----+ +----+ +----+  |
| |    | |    | |    | |    | o        o|    | |    | |    | |    |  |
| | f3 | | f2 | | f1 | | f0 | r  CPSW  r| f3 | | f2 | | f1 | | f0 |  |
| |    | |    | |    | |    | t        t|    | |    | |    | |    |  |
| \    / \    / \    / \    / 0        1\    / \    / \    / \    /  |
|  \  X   \  /   \  /   \  /             \  /   \  /   \  /   \  /   |
|   \/ \   \/     \/     \/               \/     \/     \/     \/    |
+-------\------------------------------------------------------------+
         \
          \ FIFO shaper, set with CBS offload added in this patch,
           \ FIFO0 cannot be rate limited
            ------------------------------------------------------

CBS shaper configuration is supposed to be used with root MQPRIO Qdisc
offload allowing to add sk_prio->tc->txq maps that direct traffic to
appropriate tx queue and maps L2 priority to FIFO shaper.

The CBS shaper is intended to be used for AVB where L2 priority
(pcp field) is used to differentiate class of traffic. So additionally
vlan needs to be created with appropriate egress sk_prio->l2 prio map.

If CBS has several tx queues assigned to it, the sum of their
bandwidth has not overlap bandwidth set for CBS. It's recomended the
CBS bandwidth to be a little bit more.

The CBS shaper is configured with CBS qdisc offload interface using tc
tool from iproute2 packet.

For instance:

$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1

$ tc -g class show dev eth0
+---(100:ffe2) mqprio
|    +---(100:3) mqprio
|    +---(100:4) mqprio
|    
+---(100:ffe1) mqprio
|    +---(100:2) mqprio
|    
+---(100:ffe0) mqprio
     +---(100:1) mqprio

$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1440 \
hicredit 60 sendslope -960000 idleslope 40000 offload 1

$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
hicredit 62 sendslope -980000 idleslope 20000 offload 1

The above code set CBS shapers for tc0 and tc1, for that txq0 and
txq1 is used. Pay attention, the real set bandwidth can differ a bit
due to discreteness of configuration parameters.

Here parameters like locredit, hicredit and sendslope are ignored
internally and are supposed to be set with assumption that maximum
frame size for frame - 1500.

It's supposed that interface speed is not changed while reconnection,
not always is true, so inform user in case speed of interface was
changed, as it can impact on dependent shapers configuration.

For more examples see Documentation.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 221 +++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index edd14def98df..3623c2994ddf 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -46,6 +46,8 @@
 #include "cpts.h"
 #include "davinci_cpdma.h"
 
+#include <net/pkt_sched.h>
+
 #define CPSW_DEBUG	(NETIF_MSG_HW		| NETIF_MSG_WOL		| \
 			 NETIF_MSG_DRV		| NETIF_MSG_LINK	| \
 			 NETIF_MSG_IFUP		| NETIF_MSG_INTR	| \
@@ -154,8 +156,12 @@ do {								\
 #define IRQ_NUM			2
 #define CPSW_MAX_QUEUES		8
 #define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
+#define CPSW_FIFO_QUEUE_TYPE_SHIFT	16
+#define CPSW_FIFO_SHAPE_EN_SHIFT	16
+#define CPSW_FIFO_RATE_EN_SHIFT		20
 #define CPSW_TC_NUM			4
 #define CPSW_FIFO_SHAPERS_NUM		(CPSW_TC_NUM - 1)
+#define CPSW_PCT_MASK			0x7f
 
 #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_SHIFT	29
 #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_MSK		GENMASK(2, 0)
@@ -457,6 +463,8 @@ struct cpsw_priv {
 	bool				rx_pause;
 	bool				tx_pause;
 	bool				mqprio_hw;
+	int				fifo_bw[CPSW_TC_NUM];
+	int				shp_cfg_speed;
 	u32 emac_port;
 	struct cpsw_common *cpsw;
 };
@@ -1081,6 +1089,38 @@ static void cpsw_set_slave_mac(struct cpsw_slave *slave,
 	slave_write(slave, mac_lo(priv->mac_addr), SA_LO);
 }
 
+static bool cpsw_shp_is_off(struct cpsw_priv *priv)
+{
+	struct cpsw_common *cpsw = priv->cpsw;
+	struct cpsw_slave *slave;
+	u32 shift, mask, val;
+
+	val = readl_relaxed(&cpsw->regs->ptype);
+
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	shift = CPSW_FIFO_SHAPE_EN_SHIFT + 3 * slave->slave_num;
+	mask = 7 << shift;
+	val = val & mask;
+
+	return !val;
+}
+
+static void cpsw_fifo_shp_on(struct cpsw_priv *priv, int fifo, int on)
+{
+	struct cpsw_common *cpsw = priv->cpsw;
+	struct cpsw_slave *slave;
+	u32 shift, mask, val;
+
+	val = readl_relaxed(&cpsw->regs->ptype);
+
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	shift = CPSW_FIFO_SHAPE_EN_SHIFT + 3 * slave->slave_num;
+	mask = (1 << --fifo) << shift;
+	val = on ? val | mask : val & ~mask;
+
+	writel_relaxed(val, &cpsw->regs->ptype);
+}
+
 static void _cpsw_adjust_link(struct cpsw_slave *slave,
 			      struct cpsw_priv *priv, bool *link)
 {
@@ -1120,6 +1160,12 @@ static void _cpsw_adjust_link(struct cpsw_slave *slave,
 			mac_control |= BIT(4);
 
 		*link = true;
+
+		if (priv->shp_cfg_speed &&
+		    priv->shp_cfg_speed != slave->phy->speed &&
+		    !cpsw_shp_is_off(priv))
+			dev_warn(priv->dev,
+				 "Speed was changed, CBS sahper speeds are changed!");
 	} else {
 		mac_control = 0;
 		/* disable forwarding */
@@ -1589,6 +1635,178 @@ static int cpsw_tc_to_fifo(int tc, int num_tc)
 	return CPSW_FIFO_SHAPERS_NUM - tc;
 }
 
+static int cpsw_set_fifo_bw(struct cpsw_priv *priv, int fifo, int bw)
+{
+	struct cpsw_common *cpsw = priv->cpsw;
+	u32 val = 0, send_pct, shift;
+	struct cpsw_slave *slave;
+	int pct = 0, i;
+
+	if (bw > priv->shp_cfg_speed * 1000)
+		goto err;
+
+	/* shaping has to stay enabled for highest fifos linearly
+	 * and fifo bw no more then interface can allow
+	 */
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	send_pct = slave_read(slave, SEND_PERCENT);
+	for (i = CPSW_FIFO_SHAPERS_NUM; i > 0; i--) {
+		if (!bw) {
+			if (i >= fifo || !priv->fifo_bw[i])
+				continue;
+
+			dev_warn(priv->dev, "Prev FIFO%d is shaped", i);
+			continue;
+		}
+
+		if (!priv->fifo_bw[i] && i > fifo) {
+			dev_err(priv->dev, "Upper FIFO%d is not shaped", i);
+			return -EINVAL;
+		}
+
+		shift = (i - 1) * 8;
+		if (i == fifo) {
+			send_pct &= ~(CPSW_PCT_MASK << shift);
+			val = DIV_ROUND_UP(bw, priv->shp_cfg_speed * 10);
+			if (!val)
+				val = 1;
+
+			send_pct |= val << shift;
+			pct += val;
+			continue;
+		}
+
+		if (priv->fifo_bw[i])
+			pct += (send_pct >> shift) & CPSW_PCT_MASK;
+	}
+
+	if (pct >= 100)
+		goto err;
+
+	slave_write(slave, send_pct, SEND_PERCENT);
+	priv->fifo_bw[fifo] = bw;
+
+	dev_warn(priv->dev, "set FIFO%d bw = %d\n", fifo,
+		 DIV_ROUND_CLOSEST(val * priv->shp_cfg_speed, 100));
+
+	return 0;
+err:
+	dev_err(priv->dev, "Bandwidth doesn't fit in tc configuration");
+	return -EINVAL;
+}
+
+static int cpsw_set_fifo_rlimit(struct cpsw_priv *priv, int fifo, int bw)
+{
+	struct cpsw_common *cpsw = priv->cpsw;
+	struct cpsw_slave *slave;
+	u32 tx_in_ctl_rg, val;
+	int ret;
+
+	ret = cpsw_set_fifo_bw(priv, fifo, bw);
+	if (ret)
+		return ret;
+
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	tx_in_ctl_rg = cpsw->version == CPSW_VERSION_1 ?
+		       CPSW1_TX_IN_CTL : CPSW2_TX_IN_CTL;
+
+	if (!bw)
+		cpsw_fifo_shp_on(priv, fifo, bw);
+
+	val = slave_read(slave, tx_in_ctl_rg);
+	if (cpsw_shp_is_off(priv)) {
+		/* disable FIFOs rate limited queues */
+		val &= ~(0xf << CPSW_FIFO_RATE_EN_SHIFT);
+
+		/* set type of FIFO queues to normal priority mode */
+		val &= ~(3 << CPSW_FIFO_QUEUE_TYPE_SHIFT);
+
+		/* set type of FIFO queues to be rate limited */
+		if (bw)
+			val |= 2 << CPSW_FIFO_QUEUE_TYPE_SHIFT;
+		else
+			priv->shp_cfg_speed = 0;
+	}
+
+	/* toggle a FIFO rate limited queue */
+	if (bw)
+		val |= BIT(fifo + CPSW_FIFO_RATE_EN_SHIFT);
+	else
+		val &= ~BIT(fifo + CPSW_FIFO_RATE_EN_SHIFT);
+	slave_write(slave, val, tx_in_ctl_rg);
+
+	/* FIFO transmit shape enable */
+	cpsw_fifo_shp_on(priv, fifo, bw);
+	return 0;
+}
+
+/* Defaults:
+ * class A - prio 3
+ * class B - prio 2
+ * shaping for class A should be set first
+ */
+static int cpsw_set_cbs(struct net_device *ndev,
+			struct tc_cbs_qopt_offload *qopt)
+{
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+	struct cpsw_slave *slave;
+	int prev_speed = 0;
+	int tc, ret, fifo;
+	u32 bw = 0;
+
+	tc = netdev_txq_to_tc(priv->ndev, qopt->queue);
+
+	/* enable channels in backward order, as highest FIFOs must be rate
+	 * limited first and for compliance with CPDMA rate limited channels
+	 * that also used in bacward order. FIFO0 cannot be rate limited.
+	 */
+	fifo = cpsw_tc_to_fifo(tc, ndev->num_tc);
+	if (!fifo) {
+		dev_err(priv->dev, "Last tc%d can't be rate limited", tc);
+		return -EINVAL;
+	}
+
+	/* do nothing, it's disabled anyway */
+	if (!qopt->enable && !priv->fifo_bw[fifo])
+		return 0;
+
+	/* shapers can be set if link speed is known */
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	if (slave->phy && slave->phy->link) {
+		if (priv->shp_cfg_speed &&
+		    priv->shp_cfg_speed != slave->phy->speed)
+			prev_speed = priv->shp_cfg_speed;
+
+		priv->shp_cfg_speed = slave->phy->speed;
+	}
+
+	if (!priv->shp_cfg_speed) {
+		dev_err(priv->dev, "Link speed is not known");
+		return -1;
+	}
+
+	ret = pm_runtime_get_sync(cpsw->dev);
+	if (ret < 0) {
+		pm_runtime_put_noidle(cpsw->dev);
+		return ret;
+	}
+
+	bw = qopt->enable ? qopt->idleslope : 0;
+	ret = cpsw_set_fifo_rlimit(priv, fifo, bw);
+	if (ret) {
+		priv->shp_cfg_speed = prev_speed;
+		prev_speed = 0;
+	}
+
+	if (bw && prev_speed)
+		dev_warn(priv->dev,
+			 "Speed was changed, CBS sahper speeds are changed!");
+
+	pm_runtime_put_sync(cpsw->dev);
+	return ret;
+}
+
 static int cpsw_ndo_open(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
@@ -2263,6 +2481,9 @@ static int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
 			     void *type_data)
 {
 	switch (type) {
+	case TC_SETUP_QDISC_CBS:
+		return cpsw_set_cbs(ndev, type_data);
+
 	case TC_SETUP_QDISC_MQPRIO:
 		return cpsw_set_mqprio(ndev, type_data);
 
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 net-next 6/6] Documentation: networking: cpsw: add MQPRIO & CBS offload examples
From: Ivan Khoronzhuk @ 2018-06-14  7:36 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
	vinicius.gomes, henrik, jesus.sanchez-palencia, ilias.apalodimas,
	p-varis, spatton, francois.ozog, yogeshs, nsekhar, andrew,
	Ivan Khoronzhuk
In-Reply-To: <20180614073650.29659-1-ivan.khoronzhuk@linaro.org>

This document describes MQPRIO and CBS Qdisc offload configuration
for cpsw driver based on examples. It potentially can be used in
audio video bridging (AVB) and time sensitive networking (TSN).

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 Documentation/networking/ti-cpsw.txt | 540 +++++++++++++++++++++++++++
 1 file changed, 540 insertions(+)
 create mode 100644 Documentation/networking/ti-cpsw.txt

diff --git a/Documentation/networking/ti-cpsw.txt b/Documentation/networking/ti-cpsw.txt
new file mode 100644
index 000000000000..f5d58f502e52
--- /dev/null
+++ b/Documentation/networking/ti-cpsw.txt
@@ -0,0 +1,540 @@
+* Texas Instruments CPSW ethernet driver
+
+Multiqueue & CBS & MQPRIO
+=====================================================================
+=====================================================================
+
+The cpsw has 3 CBS shapers for each external ports. This document
+describes MQPRIO and CBS Qdisc offload configuration for cpsw driver
+based on examples. It potentially can be used in audio video bridging
+(AVB) and time sensitive networking (TSN).
+
+The following examples was tested on AM572x EVM and BBB boards.
+
+Test setup
+==========
+
+Under consideration two examples with AM52xx EVM running cpsw driver
+in dual_emac mode.
+
+Several prerequisites:
+- TX queues must be rated starting from txq0 that has highest priority
+- Traffic classes are used starting from 0, that has highest priority
+- CBS shapers should be used with rated queues
+- The bandwidth for CBS shapers has to be set a little bit more then
+  potential incoming rate, thus, rate of all incoming tx queues has
+  to be a little less
+- Real rates can differ, due to discreetness
+- Map skb-priority to txq is not enough, also skb-priority to l2 prio
+  map has to be created with ip or vconfig tool
+- Any l2/socket prio (0 - 7) for classes can be used, but for
+  simplicity default values are used: 3 and 2
+- only 2 classes tested: A and B, but checked and can work with more,
+  maximum allowed 4, but only for 3 rate can be set.
+
+Test setup for examples
+=======================
+                                    +-------------------------------+
+                                    |--+                            |
+                                    |  |      Workstation0          |
+                                    |E |  MAC 18:03:73:66:87:42     |
++-----------------------------+  +--|t |                            |
+|                    | 1  | E |  |  |h |./tsn_listener -d \         |
+|  Target board:     | 0  | t |--+  |0 | 18:03:73:66:87:42 -i eth0 \|
+|  AM572x EVM        | 0  | h |     |  | -s 1500                    |
+|                    | 0  | 0 |     |--+                            |
+|  Only 2 classes:   |Mb  +---|     +-------------------------------+
+|  class A, class B  |        |
+|                    |    +---|     +-------------------------------+
+|                    | 1  | E |     |--+                            |
+|                    | 0  | t |     |  |      Workstation1          |
+|                    | 0  | h |--+  |E |  MAC 20:cf:30:85:7d:fd     |
+|                    |Mb  | 1 |  +--|t |                            |
++-----------------------------+     |h |./tsn_listener -d \         |
+                                    |0 | 20:cf:30:85:7d:fd -i eth0 \|
+                                    |  | -s 1500                    |
+                                    |--+                            |
+                                    +-------------------------------+
+
+*********************************************************************
+*********************************************************************
+*********************************************************************
+Example 1: One port tx AVB configuration scheme for target board
+----------------------------------------------------------------------
+(prints and scheme for AM52xx evm, applicable for single port boards)
+
+tc - traffic class
+txq - transmit queue
+p - priority
+f - fifo (cpsw fifo)
+S - shaper configured
+
++------------------------------------------------------------------+ u
+| +---------------+  +---------------+  +------+ +------+          | s
+| |               |  |               |  |      | |      |          | e
+| | App 1         |  | App 2         |  | Apps | | Apps |          | r
+| | Class A       |  | Class B       |  | Rest | | Rest |          |
+| | Eth0          |  | Eth0          |  | Eth0 | | Eth1 |          | s
+| | VLAN100       |  | VLAN100       |  |   |  | |   |  |          | p
+| | 40 Mb/s       |  | 20 Mb/s       |  |   |  | |   |  |          | a
+| | SO_PRIORITY=3 |  | SO_PRIORITY=2 |  |   |  | |   |  |          | c
+| |   |           |  |   |           |  |   |  | |   |  |          | e
+| +---|-----------+  +---|-----------+  +---|--+ +---|--+          |
++-----|------------------|------------------|--------|-------------+
+    +-+     +------------+                  |        |
+    |       |             +-----------------+     +--+
+    |       |             |                       |
++---|-------|-------------|-----------------------|----------------+
+| +----+ +----+ +----+ +----+                   +----+             |
+| | p3 | | p2 | | p1 | | p0 |                   | p0 |             | k
+| \    / \    / \    / \    /                   \    /             | e
+|  \  /   \  /   \  /   \  /                     \  /              | r
+|   \/     \/     \/     \/                       \/               | n
+|    |     |             |                        |                | e
+|    |     |       +-----+                        |                | l
+|    |     |       |                              |                |
+| +----+ +----+ +----+                          +----+             | s
+| |tc0 | |tc1 | |tc2 |                          |tc0 |             | p
+| \    / \    / \    /                          \    /             | a
+|  \  /   \  /   \  /                            \  /              | c
+|   \/     \/     \/                              \/               | e
+|   |      |       +-----+                        |                |
+|   |      |       |     |                        |                |
+|   |      |       |     |                        |                |
+|   |      |       |     |                        |                |
+| +----+ +----+ +----+ +----+                   +----+             |
+| |txq0| |txq1| |txq2| |txq3|                   |txq4|             |
+| \    / \    / \    / \    /                   \    /             |
+|  \  /   \  /   \  /   \  /                     \  /              |
+|   \/     \/     \/     \/                       \/               |
+| +-|------|------|------|--+                  +--|--------------+ |
+| | |      |      |      |  | Eth0.100         |  |     Eth1     | |
++---|------|------|------|------------------------|----------------+
+    |      |      |      |                        |
+    p      p      p      p                        |
+    3      2      0-1, 4-7  <- L2 priority        |
+    |      |      |      |                        |
+    |      |      |      |                        |
++---|------|------|------|------------------------|----------------+
+|   |      |      |      |             |----------+                |
+| +----+ +----+ +----+ +----+       +----+                         |
+| |dma7| |dma6| |dma5| |dma4|       |dma3|                         |
+| \    / \    / \    / \    /       \    /                         | c
+|  \S /   \S /   \  /   \  /         \  /                          | p
+|   \/     \/     \/     \/           \/                           | s
+|   |      |      | +-----            |                            | w
+|   |      |      | |                 |                            |
+|   |      |      | |                 |                            | d
+| +----+ +----+ +----+p            p+----+                         | r
+| |    | |    | |    |o            o|    |                         | i
+| | f3 | | f2 | | f0 |r            r| f0 |                         | v
+| |tc0 | |tc1 | |tc2 |t            t|tc0 |                         | e
+| \CBS / \CBS / \CBS /1            2\CBS /                         | r
+|  \S /   \S /   \  /                \  /                          |
+|   \/     \/     \/                  \/                           |
++------------------------------------------------------------------+
+========================================Eth==========================>
+
+1)
+// Add 4 tx queues, for interface Eth0, and 1 tx queue for Eth1
+$ ethtool -L eth0 rx 1 tx 5
+rx unmodified, ignoring
+
+2)
+// Check if num of queues is set correctly:
+$ ethtool -l eth0
+Channel parameters for eth0:
+Pre-set maximums:
+RX:             8
+TX:             8
+Other:          0
+Combined:       0
+Current hardware settings:
+RX:             1
+TX:             5
+Other:          0
+Combined:       0
+
+3)
+// TX queues must be rated starting from 0, so set bws for tx0 and tx1
+// Set rates 40 and 20 Mb/s appropriately.
+// Pay attention, real speed can differ a bit due to discreetness.
+// Leave last 2 tx queues not rated.
+$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+
+4)
+// Check maximum rate of tx (cpdma) queues:
+$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+40
+20
+0
+0
+0
+
+5)
+// Map skb->priority to traffic class:
+// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+// Map traffic class to transmit queue:
+// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq2, txq3)
+$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
+
+5a)
+// As two interface sharing same set of tx queues, assign all traffic
+// coming to interface Eth1 to separate queue in order to not mix it
+// with traffic from interface Eth0, so use separate txq to send
+// packets to Eth1, so all prio -> tc0 and tc0 -> txq4
+// Here hw 0, so here still default configuration for eth1 in hw
+$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 1 \
+map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@4 hw 0
+
+6)
+// Check classes settings
+$ tc -g class show dev eth0
++---(100:ffe2) mqprio
+|    +---(100:3) mqprio
+|    +---(100:4) mqprio
+|
++---(100:ffe1) mqprio
+|    +---(100:2) mqprio
+|
++---(100:ffe0) mqprio
+     +---(100:1) mqprio
+
+$ tc -g class show dev eth1
++---(100:ffe0) mqprio
+     +---(100:5) mqprio
+
+7)
+// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc
+// Set it +1 Mb for reserve (important!)
+// here only idle slope is important, others arg are ignored
+// Pay attention, real speed can differ a bit due to discreetness
+$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1438 \
+hicredit 62 sendslope -959000 idleslope 41000 offload 1
+net eth0: set FIFO3 bw = 50
+
+8)
+// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc:
+// Set it +1 Mb for reserve (important!)
+$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1468 \
+hicredit 65 sendslope -979000 idleslope 21000 offload 1
+net eth0: set FIFO2 bw = 30
+
+9)
+// Create vlan 100 to map sk->priority to vlan qos
+$ ip link add link eth0 name eth0.100 type vlan id 100
+8021q: 802.1Q VLAN Support v1.8
+8021q: adding VLAN 0 to HW filter on device eth0
+8021q: adding VLAN 0 to HW filter on device eth1
+net eth0: Adding vlanid 100 to vlan filter
+
+10)
+// Map skb->priority to L2 prio, 1 to 1
+$ ip link set eth0.100 type vlan \
+egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11)
+// Check egress map for vlan 100
+$ cat /proc/net/vlan/eth0.100
+[...]
+INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
+EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12)
+// Run your appropriate tools with socket option "SO_PRIORITY"
+// to 3 for class A and/or to 2 for class B
+// (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+
+13)
+// run your listener on workstation
+// (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39000 kbps
+
+14)
+// Restore default configuration if needed
+$ ip link del eth0.100
+$ tc qdisc del dev eth1 root
+$ tc qdisc del dev eth0 root
+net eth0: Prev FIFO2 is shaped
+net eth0: set FIFO3 bw = 0
+net eth0: set FIFO2 bw = 0
+$ ethtool -L eth0 rx 1 tx 1
+
+*********************************************************************
+*********************************************************************
+*********************************************************************
+Example 2: Two port tx AVB configuration scheme for target board
+----------------------------------------------------------------------
+(prints and scheme for AM52xx evm, for dual emac boards only)
+
++------------------------------------------------------------------+ u
+| +----------+  +----------+  +------+  +----------+  +----------+ | s
+| |          |  |          |  |      |  |          |  |          | | e
+| | App 1    |  | App 2    |  | Apps |  | App 3    |  | App 4    | | r
+| | Class A  |  | Class B  |  | Rest |  | Class B  |  | Class A  | |
+| | Eth0     |  | Eth0     |  |   |  |  | Eth1     |  | Eth1     | | s
+| | VLAN100  |  | VLAN100  |  |   |  |  | VLAN100  |  | VLAN100  | | p
+| | 40 Mb/s  |  | 20 Mb/s  |  |   |  |  | 10 Mb/s  |  | 30 Mb/s  | | a
+| | SO_PRI=3 |  | SO_PRI=2 |  |   |  |  | SO_PRI=3 |  | SO_PRI=2 | | c
+| |   |      |  |   |      |  |   |  |  |   |      |  |   |      | | e
+| +---|------+  +---|------+  +---|--+  +---|------+  +---|------+ |
++-----|-------------|-------------|---------|-------------|--------+
+    +-+     +-------+             |         +----------+  +----+
+    |       |             +-------+------+             |       |
+    |       |             |              |             |       |
++---|-------|-------------|--------------|-------------|-------|---+
+| +----+ +----+ +----+ +----+          +----+ +----+ +----+ +----+ |
+| | p3 | | p2 | | p1 | | p0 |          | p0 | | p1 | | p2 | | p3 | | k
+| \    / \    / \    / \    /          \    / \    / \    / \    / | e
+|  \  /   \  /   \  /   \  /            \  /   \  /   \  /   \  /  | r
+|   \/     \/     \/     \/              \/     \/     \/     \/   | n
+|   |      |             |                |             |      |   | e
+|   |      |        +----+                +----+        |      |   | l
+|   |      |        |                          |        |      |   |
+| +----+ +----+ +----+                        +----+ +----+ +----+ | s
+| |tc0 | |tc1 | |tc2 |                        |tc2 | |tc1 | |tc0 | | p
+| \    / \    / \    /                        \    / \    / \    / | a
+|  \  /   \  /   \  /                          \  /   \  /   \  /  | c
+|   \/     \/     \/                            \/     \/     \/   | e
+|   |      |       +-----+                +-----+      |       |   |
+|   |      |       |     |                |     |      |       |   |
+|   |      |       |     |                |     |      |       |   |
+|   |      |       |     |    E      E    |     |      |       |   |
+| +----+ +----+ +----+ +----+ t      t +----+ +----+ +----+ +----+ |
+| |txq0| |txq1| |txq4| |txq5| h      h |txq6| |txq7| |txq3| |txq2| |
+| \    / \    / \    / \    / 0      1 \    / \    / \    / \    / |
+|  \  /   \  /   \  /   \  /  .      .  \  /   \  /   \  /   \  /  |
+|   \/     \/     \/     \/   1      1   \/     \/     \/     \/   |
+| +-|------|------|------|--+ 0      0 +-|------|------|------|--+ |
+| | |      |      |      |  | 0      0 | |      |      |      |  | |
++---|------|------|------|---------------|------|------|------|----+
+    |      |      |      |               |      |      |      |
+    p      p      p      p               p      p      p      p
+    3      2      0-1, 4-7   <-L2 pri->  0-1, 4-7      2      3
+    |      |      |      |               |      |      |      |
+    |      |      |      |               |      |      |      |
++---|------|------|------|---------------|------|------|------|----+
+|   |      |      |      |               |      |      |      |    |
+| +----+ +----+ +----+ +----+          +----+ +----+ +----+ +----+ |
+| |dma7| |dma6| |dma3| |dma2|          |dma1| |dma0| |dma4| |dma5| |
+| \    / \    / \    / \    /          \    / \    / \    / \    / | c
+|  \S /   \S /   \  /   \  /            \  /   \  /   \S /   \S /  | p
+|   \/     \/     \/     \/              \/     \/     \/     \/   | s
+|   |      |      | +-----                |      |      |      |   | w
+|   |      |      | |                     +----+ |      |      |   |
+|   |      |      | |                          | |      |      |   | d
+| +----+ +----+ +----+p                      p+----+ +----+ +----+ | r
+| |    | |    | |    |o                      o|    | |    | |    | | i
+| | f3 | | f2 | | f0 |r        CPSW          r| f3 | | f2 | | f0 | | v
+| |tc0 | |tc1 | |tc2 |t                      t|tc0 | |tc1 | |tc2 | | e
+| \CBS / \CBS / \CBS /1                      2\CBS / \CBS / \CBS / | r
+|  \S /   \S /   \  /                          \S /   \S /   \  /  |
+|   \/     \/     \/                            \/     \/     \/   |
++------------------------------------------------------------------+
+========================================Eth==========================>
+
+1)
+// Add 8 tx queues, for interface Eth0, but they are common, so are accessed
+// by two interfaces Eth0 and Eth1.
+$ ethtool -L eth1 rx 1 tx 8
+rx unmodified, ignoring
+
+2)
+// Check if num of queues is set correctly:
+$ ethtool -l eth0
+Channel parameters for eth0:
+Pre-set maximums:
+RX:             8
+TX:             8
+Other:          0
+Combined:       0
+Current hardware settings:
+RX:             1
+TX:             8
+Other:          0
+Combined:       0
+
+3)
+// TX queues must be rated starting from 0, so set bws for tx0 and tx1 for Eth0
+// and for tx2 and tx3 for Eth1. That is, rates 40 and 20 Mb/s appropriately
+// for Eth0 and 30 and 10 Mb/s for Eth1.
+// Real speed can differ a bit due to discreetness
+// Leave last 4 tx queues as not rated
+$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+$ echo 30 > /sys/class/net/eth1/queues/tx-2/tx_maxrate
+$ echo 10 > /sys/class/net/eth1/queues/tx-3/tx_maxrate
+
+4)
+// Check maximum rate of tx (cpdma) queues:
+$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+40
+20
+30
+10
+0
+0
+0
+0
+
+5)
+// Map skb->priority to traffic class for Eth0:
+// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+// Map traffic class to transmit queue:
+// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq4, txq5)
+$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@4 hw 1
+
+6)
+// Check classes settings
+$ tc -g class show dev eth0
++---(100:ffe2) mqprio
+|    +---(100:5) mqprio
+|    +---(100:6) mqprio
+|
++---(100:ffe1) mqprio
+|    +---(100:2) mqprio
+|
++---(100:ffe0) mqprio
+     +---(100:1) mqprio
+
+7)
+// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc for Eth0
+// here only idle slope is important, others ignored
+// Real speed can differ a bit due to discreetness
+$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1470 \
+hicredit 62 sendslope -959000 idleslope 41000 offload 1
+net eth0: set FIFO3 bw = 50
+
+8)
+// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc for Eth0
+$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
+hicredit 65 sendslope -979000 idleslope 21000 offload 1
+net eth0: set FIFO2 bw = 30
+
+9)
+// Create vlan 100 to map sk->priority to vlan qos for Eth0
+$ ip link add link eth0 name eth0.100 type vlan id 100
+net eth0: Adding vlanid 100 to vlan filter
+
+10)
+// Map skb->priority to L2 prio for Eth0.100, one to one
+$ ip link set eth0.100 type vlan \
+egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11)
+// Check egress map for vlan 100
+$ cat /proc/net/vlan/eth0.100
+[...]
+INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
+EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12)
+// Map skb->priority to traffic class for Eth1:
+// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+// Map traffic class to transmit queue:
+// tc0 -> txq2, tc1 -> txq3, tc2 -> (txq6, txq7)
+$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 3 \
+map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@2 1@3 2@6 hw 1
+
+13)
+// Check classes settings
+$ tc -g class show dev eth1
++---(100:ffe2) mqprio
+|    +---(100:7) mqprio
+|    +---(100:8) mqprio
+|
++---(100:ffe1) mqprio
+|    +---(100:4) mqprio
+|
++---(100:ffe0) mqprio
+     +---(100:3) mqprio
+
+14)
+// Set rate for class A - 31 Mbit (tc0, txq2) using CBS Qdisc for Eth1
+// here only idle slope is important, others ignored
+// Set it +1 Mb for reserve (important!)
+$ tc qdisc add dev eth1 parent 100:3 cbs locredit -1453 \
+hicredit 47 sendslope -969000 idleslope 31000 offload 1
+net eth1: set FIFO3 bw = 31
+
+15)
+// Set rate for class B - 11 Mbit (tc1, txq3) using CBS Qdisc for Eth1
+// Set it +1 Mb for reserve (important!)
+$ tc qdisc add dev eth1 parent 100:4 cbs locredit -1483 \
+hicredit 34 sendslope -989000 idleslope 11000 offload 1
+net eth1: set FIFO2 bw = 11
+
+16)
+// Create vlan 100 to map sk->priority to vlan qos for Eth1
+$ ip link add link eth1 name eth1.100 type vlan id 100
+net eth1: Adding vlanid 100 to vlan filter
+
+17)
+// Map skb->priority to L2 prio for Eth1.100, one to one
+$ ip link set eth1.100 type vlan \
+egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+18)
+// Check egress map for vlan 100
+$ cat /proc/net/vlan/eth1.100
+[...]
+INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
+EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+19)
+// Run appropriate tools with socket option "SO_PRIORITY" to 3
+// for class A and to 2 for class B. For both interfaces
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p2 -s 1500&
+./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p3 -s 1500&
+
+20)
+// run your listeners on workstations
+// (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39000 kbps
+
+21)
+// Restore default configuration if needed
+$ ip link del eth1.100
+$ ip link del eth0.100
+$ tc qdisc del dev eth1 root
+net eth1: Prev FIFO2 is shaped
+net eth1: set FIFO3 bw = 0
+net eth1: set FIFO2 bw = 0
+$ tc qdisc del dev eth0 root
+net eth0: Prev FIFO2 is shaped
+net eth0: set FIFO3 bw = 0
+net eth0: set FIFO2 bw = 0
+$ ethtool -L eth0 rx 1 tx 1
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 net-next 3/6] net: ethernet: ti: cpsw: add MQPRIO Qdisc offload
From: Ivan Khoronzhuk @ 2018-06-14  7:36 UTC (permalink / raw)
  To: grygorii.strashko, davem
  Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
	vinicius.gomes, henrik, jesus.sanchez-palencia, ilias.apalodimas,
	p-varis, spatton, francois.ozog, yogeshs, nsekhar, andrew,
	Ivan Khoronzhuk
In-Reply-To: <20180614073650.29659-1-ivan.khoronzhuk@linaro.org>

That's possible to offload vlan to tc priority mapping with
assumption sk_prio == L2 prio.

Example:
$ ethtool -L eth0 rx 1 tx 4

$ qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1

$ tc -g class show dev eth0
+---(100:ffe2) mqprio
|    +---(100:3) mqprio
|    +---(100:4) mqprio
|    
+---(100:ffe1) mqprio
|    +---(100:2) mqprio
|    
+---(100:ffe0) mqprio
     +---(100:1) mqprio

Here, 100:1 is txq0, 100:2 is txq1, 100:3 is txq2, 100:4 is txq3
txq0 belongs to tc0, txq1 to tc1, txq2 and txq3 to tc2
The offload part only maps L2 prio to classes of traffic, but not
to transmit queues, so to direct traffic to traffic class vlan has
to be created with appropriate egress map.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
 drivers/net/ethernet/ti/cpsw.c | 82 ++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 406537d74ec1..edd14def98df 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -39,6 +39,7 @@
 #include <linux/sys_soc.h>
 
 #include <linux/pinctrl/consumer.h>
+#include <net/pkt_cls.h>
 
 #include "cpsw.h"
 #include "cpsw_ale.h"
@@ -153,6 +154,8 @@ do {								\
 #define IRQ_NUM			2
 #define CPSW_MAX_QUEUES		8
 #define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
+#define CPSW_TC_NUM			4
+#define CPSW_FIFO_SHAPERS_NUM		(CPSW_TC_NUM - 1)
 
 #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_SHIFT	29
 #define CPSW_RX_VLAN_ENCAP_HDR_PRIO_MSK		GENMASK(2, 0)
@@ -453,6 +456,7 @@ struct cpsw_priv {
 	u8				mac_addr[ETH_ALEN];
 	bool				rx_pause;
 	bool				tx_pause;
+	bool				mqprio_hw;
 	u32 emac_port;
 	struct cpsw_common *cpsw;
 };
@@ -1577,6 +1581,14 @@ static void cpsw_slave_stop(struct cpsw_slave *slave, struct cpsw_common *cpsw)
 	soft_reset_slave(slave);
 }
 
+static int cpsw_tc_to_fifo(int tc, int num_tc)
+{
+	if (tc == num_tc - 1)
+		return 0;
+
+	return CPSW_FIFO_SHAPERS_NUM - tc;
+}
+
 static int cpsw_ndo_open(struct net_device *ndev)
 {
 	struct cpsw_priv *priv = netdev_priv(ndev);
@@ -2190,6 +2202,75 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
 	return ret;
 }
 
+static int cpsw_set_mqprio(struct net_device *ndev, void *type_data)
+{
+	struct tc_mqprio_qopt_offload *mqprio = type_data;
+	struct cpsw_priv *priv = netdev_priv(ndev);
+	struct cpsw_common *cpsw = priv->cpsw;
+	int fifo, num_tc, count, offset;
+	struct cpsw_slave *slave;
+	u32 tx_prio_map = 0;
+	int i, tc, ret;
+
+	num_tc = mqprio->qopt.num_tc;
+	if (num_tc > CPSW_TC_NUM)
+		return -EINVAL;
+
+	if (mqprio->mode != TC_MQPRIO_MODE_DCB)
+		return -EINVAL;
+
+	ret = pm_runtime_get_sync(cpsw->dev);
+	if (ret < 0) {
+		pm_runtime_put_noidle(cpsw->dev);
+		return ret;
+	}
+
+	if (num_tc) {
+		for (i = 0; i < 8; i++) {
+			tc = mqprio->qopt.prio_tc_map[i];
+			fifo = cpsw_tc_to_fifo(tc, num_tc);
+			tx_prio_map |= fifo << (4 * i);
+		}
+
+		netdev_set_num_tc(ndev, num_tc);
+		for (i = 0; i < num_tc; i++) {
+			count = mqprio->qopt.count[i];
+			offset = mqprio->qopt.offset[i];
+			netdev_set_tc_queue(ndev, i, count, offset);
+		}
+	}
+
+	if (!mqprio->qopt.hw) {
+		/* restore default configuration */
+		netdev_reset_tc(ndev);
+		tx_prio_map = TX_PRIORITY_MAPPING;
+	}
+
+	priv->mqprio_hw = mqprio->qopt.hw;
+
+	offset = cpsw->version == CPSW_VERSION_1 ?
+		 CPSW1_TX_PRI_MAP : CPSW2_TX_PRI_MAP;
+
+	slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+	slave_write(slave, tx_prio_map, offset);
+
+	pm_runtime_put_sync(cpsw->dev);
+
+	return 0;
+}
+
+static int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+			     void *type_data)
+{
+	switch (type) {
+	case TC_SETUP_QDISC_MQPRIO:
+		return cpsw_set_mqprio(ndev, type_data);
+
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static const struct net_device_ops cpsw_netdev_ops = {
 	.ndo_open		= cpsw_ndo_open,
 	.ndo_stop		= cpsw_ndo_stop,
@@ -2205,6 +2286,7 @@ static const struct net_device_ops cpsw_netdev_ops = {
 #endif
 	.ndo_vlan_rx_add_vid	= cpsw_ndo_vlan_rx_add_vid,
 	.ndo_vlan_rx_kill_vid	= cpsw_ndo_vlan_rx_kill_vid,
+	.ndo_setup_tc           = cpsw_ndo_setup_tc,
 };
 
 static int cpsw_get_regs_len(struct net_device *ndev)
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox