* Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Ganapatrao Kulkarni @ 2018-05-21 12:42 UTC
To: Mark Rutland
Cc: Ganapatrao Kulkarni, linux-doc, LKML, linux-arm-kernel,
Will Deacon, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber
In-Reply-To: <20180521104008.z6ei5zjve7u5iwho@lakrids.cambridge.arm.com>
On Mon, May 21, 2018 at 4:10 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Mon, May 21, 2018 at 11:37:12AM +0100, Mark Rutland wrote:
>> Hi Ganapat,
>>
>>
>> Sorry for the delay in replying; I was away most of last week.
>>
>> On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
>> > On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote:
>> > > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutland@arm.com> wrote:
>> > >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>>
>> > >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>> > >>> +{
>> > >>> + int counter;
>> > >>> +
>> > >>> + raw_spin_lock(&pmu_uncore->lock);
>> > >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>> > >>> + pmu_uncore->uncore_dev->max_counters);
>> > >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> + return -ENOSPC;
>> > >>> + }
>> > >>> + set_bit(counter, pmu_uncore->counter_mask);
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> + return counter;
>> > >>> +}
>> > >>> +
>> > >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>> > >>> + int counter)
>> > >>> +{
>> > >>> + raw_spin_lock(&pmu_uncore->lock);
>> > >>> + clear_bit(counter, pmu_uncore->counter_mask);
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> +}
>> > >>
>> > >> I don't believe that locking is required in either of these, as the perf
>> > >> core serializes pmu::add() and pmu::del(), where these get called.
>> >
>> > without this locking, i am seeing "BUG: scheduling while atomic" when
>> > i run perf with more events together than the maximum counters
>> > supported
>>
>> Did you manage to get to the bottom of this?
>>
>> Do you have a backtrace?
>>
>> It looks like in your latest posting you reserve counters through the
>> userspace ABI, which doesn't seem right to me, and I'd like to
>> understand the problem.
>
> Looks like I misunderstood -- those are still allocated kernel-side.
>
> I'll follow that up in the v5 posting.
please review v5.
>
> Thanks,
> Mark.
thanks
Ganapat
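
A minimal userspace sketch of the counter allocator discussed above, for illustration only. It assumes, as Mark suggests, that callers are already serialized, so it takes no lock. The names struct uncore_counters and MAX_COUNTERS are hypothetical stand-ins and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_COUNTERS 4	/* hypothetical; the driver reads this from uncore_dev */

/* Hypothetical stand-in for the per-channel state in the patch. */
struct uncore_counters {
	uint32_t mask;	/* bit n set => counter n is in use */
};

/* Return the lowest free counter index, or -ENOSPC if all are taken. */
static int alloc_counter(struct uncore_counters *c)
{
	for (int i = 0; i < MAX_COUNTERS; i++) {
		if (!(c->mask & (UINT32_C(1) << i))) {
			c->mask |= UINT32_C(1) << i;
			return i;
		}
	}
	return -ENOSPC;
}

/* Release a counter previously returned by alloc_counter(). */
static void free_counter(struct uncore_counters *c, int counter)
{
	assert(counter >= 0 && counter < MAX_COUNTERS);
	c->mask &= ~(UINT32_C(1) << counter);
}
```

Requesting more events than there are counters simply yields -ENOSPC from the allocator; the sketch says nothing about where the "scheduling while atomic" report comes from.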
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Ganapatrao Kulkarni @ 2018-05-21 12:34 UTC
To: Mark Rutland
Cc: Ganapatrao Kulkarni, linux-doc, LKML, linux-arm-kernel,
Will Deacon, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber
In-Reply-To: <20180521105511.6ztjk5conf7lfaiz@lakrids.cambridge.arm.com>
Hi Mark,
On Mon, May 21, 2018 at 4:25 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Sat, May 05, 2018 at 12:16:13AM +0530, Ganapatrao Kulkarni wrote:
>> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutland@arm.com> wrote:
>> > On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>
>> >> + *
>> >> + * L3 Tile and DMC channel selection is through SMC call
>> >> + * SMC call arguments,
>> >> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
>> >> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
>> >> + * x2 = Node id
>> >
>> > How do we map Linux node IDs to the firmware's view of node IDs?
>> >
>> > I don't believe the two are necessarily the same -- Linux's node IDs are
>> > a Linux-specific construct.
>>
>> both are same, it is numa node id from ACPI/firmware.
>
> I am very wary about assuming that the Linux nid will always be the same
> as the ACPI node id.
>
> For that to *potentially* be true, this driver should depend on
> CONFIG_NUMA, NUMA must not be disabled on the command line, etc, or the
> node id will always be NUMA_NO_NODE.
OK, I can check the node ID that the ACPI helpers return in probe.
If it is NUMA_NO_NODE, should I initialize only the first socket's uncore
and always pass zero as the nid parameter to the firmware?
>
> I would be *much* happier if we had an explicit mapping somewhere to the
> ID the FW expects.
>
>> > It would be much nicer if we could pass something based on the MPIDR,
>> > which is a known HW construct, or if this implicitly affected the
>> > current node.
>>
>> IMO, node id is sufficient.
>
> I agree that *a* node ID is sufficient, I just don't think that we're
> guaranteed to have the specific node ID the FW wants.
For ThunderX2, which is a two-socket-only platform, the PXM and nid should
be the same (either 0 or 1).
However, I can send the PXM ID (node_to_pxm) to the firmware to make it more sane.
>
>> > It would be vastly more sane for this to not be muxed at all. :/
>>
>> i am helpless due to crappy hw design!
>
> I'm certainly not blaming you for this! :)
>
> I hope the HW designers don't make the same mistake in future, though...
>
> Thanks,
> Mark.
thanks
Ganapat
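
The explicit mapping Mark asks for could look roughly like the userspace sketch below. It is an illustration under stated assumptions: the table nid_to_fw_id and the helper fw_node_id() are hypothetical, the table merely stands in for a lookup such as node_to_pxm(), and the NUMA_NO_NODE fallback to socket 0 is only the option raised as a question above, not a settled design.

```c
#include <stddef.h>

#define NUMA_NO_NODE	(-1)	/* same value the kernel uses */
#define MAX_NODES	2	/* ThunderX2 is described above as two-socket only */

/* Hypothetical table: Linux node ID -> ID the firmware expects. */
static const int nid_to_fw_id[MAX_NODES] = { 0, 1 };

/*
 * Return the ID to pass to the firmware, or -1 if nid is out of range.
 * With NUMA disabled every device reports NUMA_NO_NODE, so this sketch
 * falls back to socket 0 in that case.
 */
static int fw_node_id(int nid)
{
	if (nid == NUMA_NO_NODE)
		return 0;
	if (nid < 0 || nid >= MAX_NODES)
		return -1;
	return nid_to_fw_id[nid];
}
```

Keeping the translation in one helper means the SMC call site never passes a raw Linux node ID, which is the property Mark is asking for.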
* [PATCH 3/3] bpf: add ability to configure BPF JIT kallsyms export at the boot time
From: Eugene Syromiatnikov @ 2018-05-21 12:30 UTC
To: netdev
Cc: linux-kernel, linux-doc, Kees Cook, Kai-Heng Feng,
Daniel Borkmann, Alexei Starovoitov, Jonathan Corbet, Jiri Olsa,
Jesper Dangaard Brouer
This patch introduces two configuration options,
BPF_JIT_KALLSYMS_BOOTPARAM and BPF_JIT_KALLSYMS_BOOTPARAM_VALUE, that
allow configuring the initial value of net.core.bpf_jit_kallsyms sysctl
knob. This enables export of addresses of JIT'ed BPF programs that are
created during early boot.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
---
Documentation/admin-guide/kernel-parameters.txt | 10 +++++++++
init/Kconfig | 30 +++++++++++++++++++++++++
kernel/bpf/core.c | 14 ++++++++++++
3 files changed, 54 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5adc6d0..10e7502 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -452,6 +452,16 @@
2 - JIT hardening is enabled for all users.
Default value is set via kernel config option.
+ bpf_jit_kallsyms=
+ Format: { "0" | "1" }
+ Sets initial value of net.core.bpf_jit_kallsyms
+ sysctl knob.
+ 0 - Addresses of JIT'ed BPF programs are not exported
+ to kallsyms.
+ 1 - Export of addresses of JIT'ed BPF programs is
+ enabled for privileged users.
+ Default value is set via kernel config option.
+
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
diff --git a/init/Kconfig b/init/Kconfig
index b661497..b5405ca 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1464,6 +1464,36 @@ config BPF_JIT_HARDEN_BOOTPARAM_VALUE
If you are unsure how to answer this question, answer 0.
+config BPF_JIT_KALLSYMS_BOOTPARAM
+ bool "BPF JIT kallsyms export boot parameter"
+ default n
+ help
+ This option adds a kernel parameter 'bpf_jit_kallsyms' that allows
+ configuring default state of the net.core.bpf_jit_kallsyms sysctl
+ knob. If this option is selected, the default value of the
+ net.core.bpf_jit_kallsyms sysctl knob can be set on the kernel command
+ line. The purpose of this option is to allow enabling BPF JIT
+ kallsyms export for the BPF programs created during the early boot,
+ so they can be traced later.
+
+ If you are unsure how to answer this question, answer N.
+
+config BPF_JIT_KALLSYMS_BOOTPARAM_VALUE
+ int "BPF JIT kallsyms export boot parameter default value"
+ depends on BPF_JIT_KALLSYMS_BOOTPARAM
+ range 0 1
+ default 0
+ help
+ This option sets the default value for the kernel parameter
+ 'bpf_jit_kallsyms' that configures default value of the
+ net.core.bpf_jit_kallsyms sysctl knob at boot. If this option is set
+ to 0 (zero), the net.core.bpf_jit_kallsyms will default to 0, which
+ will lead to disabling of exporting of addresses of JIT'ed BPF
+ programs. If this option is set to 1 (one), addresses of JIT'ed BPF
+ programs are exported to kallsyms for privileged users.
+
+ If you are unsure how to answer this question, answer 0.
+
config USERFAULTFD
bool "Enable userfaultfd() system call"
select ANON_INODES
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9edb7a8..003d708 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -321,7 +321,21 @@ __setup("bpf_jit_harden=", bpf_jit_harden_setup);
int bpf_jit_harden __read_mostly;
#endif /* CONFIG_BPF_JIT_HARDEN_BOOTPARAM */
+#ifdef CONFIG_BPF_JIT_KALLSYMS_BOOTPARAM
+int bpf_jit_kallsyms __read_mostly = CONFIG_BPF_JIT_KALLSYMS_BOOTPARAM_VALUE;
+
+static int __init bpf_jit_kallsyms_setup(char *str)
+{
+ unsigned long enabled;
+
+ if (!kstrtoul(str, 0, &enabled))
+ bpf_jit_kallsyms = !!enabled;
+ return 1;
+}
+__setup("bpf_jit_kallsyms=", bpf_jit_kallsyms_setup);
+#else /* !CONFIG_BPF_JIT_KALLSYMS_BOOTPARAM */
int bpf_jit_kallsyms __read_mostly;
+#endif /* CONFIG_BPF_JIT_KALLSYMS_BOOTPARAM */
static __always_inline void
bpf_get_prog_addr_region(const struct bpf_prog *prog,
--
2.1.4
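
The parsing behaviour of bpf_jit_kallsyms_setup() can be sketched in userspace C as follows. This is a rough illustration, not kernel code: parse_ulong() is a hypothetical stand-in for kstrtoul() built on strtoul(), and kallsyms_from_cmdline() is a hypothetical wrapper that returns the resulting value instead of writing a global.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for kstrtoul(): 0 on success, -EINVAL
 * otherwise. Unlike kstrtoul(), strtoul() tolerates leading whitespace
 * and a minus sign, so this is only an approximation.
 */
static int parse_ulong(const char *s, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, 0);	/* base 0: accepts decimal, 0x..., 0... */
	if (errno || end == s || *end != '\0')
		return -EINVAL;
	*out = v;
	return 0;
}

/*
 * Mirrors the logic of the setup handler: on a parse error the current
 * (Kconfig-provided) value is kept, otherwise any non-zero number
 * collapses to 1 via the "!!" idiom.
 */
static int kallsyms_from_cmdline(const char *str, int current)
{
	unsigned long enabled;

	if (!parse_ulong(str, &enabled))
		return !!enabled;
	return current;
}
```

Under this reading, a malformed value such as bpf_jit_kallsyms=abc leaves the build-time default untouched rather than disabling the export.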
* [PATCH 2/3] bpf: add ability to configure BPF JIT hardening via boot-time parameter
From: Eugene Syromiatnikov @ 2018-05-21 12:30 UTC
To: netdev
Cc: linux-kernel, linux-doc, Kees Cook, Kai-Heng Feng,
Daniel Borkmann, Alexei Starovoitov, Jonathan Corbet, Jiri Olsa,
Jesper Dangaard Brouer
This patch introduces two configuration options,
BPF_JIT_HARDEN_BOOTPARAM and BPF_JIT_HARDEN_BOOTPARAM_VALUE, that allow
configuring the initial value of net.core.bpf_jit_harden sysctl knob,
which is useful for enforcing JIT hardening during the early boot.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
---
Documentation/admin-guide/kernel-parameters.txt | 10 +++++++++
init/Kconfig | 29 +++++++++++++++++++++++++
kernel/bpf/core.c | 17 +++++++++++++++
3 files changed, 56 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index aa8e831..5adc6d0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -442,6 +442,16 @@
bert_disable [ACPI]
Disable BERT OS support on buggy BIOSes.
+ bpf_jit_harden=
+ Format: { "0" | "1" | "2" }
+ Sets initial value of net.core.bpf_jit_harden
+ sysctl knob.
+ 0 - JIT hardening is disabled.
+ 1 - JIT hardening is enabled for unprivileged users
+ only.
+ 2 - JIT hardening is enabled for all users.
+ Default value is set via kernel config option.
+
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
diff --git a/init/Kconfig b/init/Kconfig
index 1403a3e..b661497 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1435,6 +1435,35 @@ config UNPRIVILEGED_BPF_BOOTPARAM_VALUE
If you are unsure how to answer this question, answer 0.
+config BPF_JIT_HARDEN_BOOTPARAM
+ bool "BPF JIT harden boot parameter"
+ default n
+ help
+ This option adds a kernel parameter 'bpf_jit_harden' that allows
+ configuring default state of the net.core.bpf_jit_harden sysctl knob.
+ If this option is selected, the default value of the
+ net.core.bpf_jit_harden sysctl knob can be set on the kernel command
+ line. The purpose of this option is to allow enabling BPF JIT
+ hardening for the BPF programs created during the early boot.
+
+ If you are unsure how to answer this question, answer N.
+
+config BPF_JIT_HARDEN_BOOTPARAM_VALUE
+ int "BPF JIT harden boot parameter default value"
+ depends on BPF_JIT_HARDEN_BOOTPARAM
+ range 0 2
+ default 0
+ help
+ This option sets the default value for the kernel parameter
+ 'bpf_jit_harden' that configures default value of the
+ net.core.bpf_jit_harden sysctl knob at boot. If this option is set to
+ 0 (zero), the net.core.bpf_jit_harden will default to 0, which will
+ lead to no hardening at bootup. If this option is set to 1 (one),
+ hardening will be applied to unprivileged users only. If this
+ option is set to 2 (two), JIT hardening will be enabled for all users.
+
+ If you are unsure how to answer this question, answer 0.
+
config USERFAULTFD
bool "Enable userfaultfd() system call"
select ANON_INODES
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 2194c6a..9edb7a8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -32,6 +32,7 @@
#include <linux/kallsyms.h>
#include <linux/rcupdate.h>
#include <linux/perf_event.h>
+#include <linux/init.h>
#include <asm/unaligned.h>
@@ -303,7 +304,23 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
#ifdef CONFIG_BPF_JIT
/* All BPF JIT sysctl knobs here. */
int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
+
+#ifdef CONFIG_BPF_JIT_HARDEN_BOOTPARAM
+int bpf_jit_harden __read_mostly = CONFIG_BPF_JIT_HARDEN_BOOTPARAM_VALUE;
+
+static int __init bpf_jit_harden_setup(char *str)
+{
+ unsigned long value;
+
+ if (!kstrtoul(str, 0, &value))
+ bpf_jit_harden = min(value, 2UL);
+ return 1;
+}
+__setup("bpf_jit_harden=", bpf_jit_harden_setup);
+#else /* !CONFIG_BPF_JIT_HARDEN_BOOTPARAM */
int bpf_jit_harden __read_mostly;
+#endif /* CONFIG_BPF_JIT_HARDEN_BOOTPARAM */
+
int bpf_jit_kallsyms __read_mostly;
static __always_inline void
--
2.1.4
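
The clamping step in bpf_jit_harden_setup() can be looked at in isolation. The sketch below is a userspace illustration with a hypothetical helper name, clamp_harden_level(); it spells out min(value, 2UL) by hand and shows why the clamp happens while the value is still an unsigned long, before it is narrowed to int.

```c
/*
 * Equivalent of min(value, 2UL) from the patch, then narrowed to int.
 * Clamping first means an oversized command-line value can never be
 * truncated into an unexpected level when stored in the int knob.
 */
static int clamp_harden_level(unsigned long value)
{
	return (int)(value < 2UL ? value : 2UL);
}
```

With this logic bpf_jit_harden=3 and larger values are treated as level 2 rather than rejected.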
* [PATCH 1/3] bpf: add ability to configure unprivileged BPF via boot-time parameter
From: Eugene Syromiatnikov @ 2018-05-21 12:29 UTC
To: netdev
Cc: linux-kernel, linux-doc, Kees Cook, Kai-Heng Feng,
Daniel Borkmann, Alexei Starovoitov, Jonathan Corbet, Jiri Olsa,
Jesper Dangaard Brouer
This patch introduces two configuration options,
UNPRIVILEGED_BPF_BOOTPARAM and UNPRIVILEGED_BPF_BOOTPARAM_VALUE, that
allow configuring the initial value of kernel.unprivileged_bpf_disabled
sysctl knob, which is useful for the cases when disabling unprivileged
bpf() access during the early boot is desirable.
Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
---
Documentation/admin-guide/kernel-parameters.txt | 8 +++++++
init/Kconfig | 31 +++++++++++++++++++++++++
kernel/bpf/syscall.c | 16 +++++++++++++
3 files changed, 55 insertions(+)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28e..aa8e831 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4355,6 +4355,14 @@
unknown_nmi_panic
[X86] Cause panic on unknown NMI.
+ unprivileged_bpf_disabled=
+ Format: { "0" | "1" }
+ Sets initial value of kernel.unprivileged_bpf_disabled
+ sysctl knob.
+ 0 - unprivileged bpf() syscall access enabled.
+ 1 - unprivileged bpf() syscall access disabled.
+ Default value is set via kernel config option.
+
usbcore.authorized_default=
[USB] Default USB device authorization:
(default -1 = authorized except for wireless USB,
diff --git a/init/Kconfig b/init/Kconfig
index 480a4f2..1403a3e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1404,6 +1404,37 @@ config BPF_JIT_ALWAYS_ON
Enables BPF JIT and removes BPF interpreter to avoid
speculative execution of BPF instructions by the interpreter
+config UNPRIVILEGED_BPF_BOOTPARAM
+ bool "Unprivileged bpf() boot parameter"
+ depends on BPF_SYSCALL
+ default n
+ help
+ This option adds a kernel parameter 'unprivileged_bpf_disabled'
+ that allows configuring default state of the
+ kernel.unprivileged_bpf_disabled sysctl knob.
+ If this option is selected, unprivileged access to the bpf() syscall
+ can be disabled with unprivileged_bpf_disabled=1 on the kernel command
+ line. The purpose of this option is to allow disabling unprivileged
+ bpf() syscall access during the early boot.
+
+ If you are unsure how to answer this question, answer N.
+
+config UNPRIVILEGED_BPF_BOOTPARAM_VALUE
+ int "Unprivileged bpf() boot parameter default value"
+ depends on UNPRIVILEGED_BPF_BOOTPARAM
+ range 0 1
+ default 0
+ help
+ This option sets the default value for the kernel parameter
+ 'unprivileged_bpf_disabled', which allows disabling unprivileged bpf()
+ syscall access at boot. If this option is set to 0 (zero), the
+ unprivileged bpf() boot kernel parameter will default to 0, allowing
+ unprivileged bpf() syscall access at bootup. If this option is
+ set to 1 (one), the unprivileged bpf() kernel parameter will default
+ to 1, disabling unprivileged bpf() syscall access at bootup.
+
+ If you are unsure how to answer this question, answer 0.
+
config USERFAULTFD
bool "Enable userfaultfd() system call"
select ANON_INODES
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index bfcde94..fdc5fd9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -29,6 +29,7 @@
#include <linux/ctype.h>
#include <linux/btf.h>
#include <linux/nospec.h>
+#include <linux/init.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PROG_ARRAY || \
(map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
@@ -45,7 +46,22 @@ static DEFINE_SPINLOCK(prog_idr_lock);
static DEFINE_IDR(map_idr);
static DEFINE_SPINLOCK(map_idr_lock);
+#ifdef CONFIG_UNPRIVILEGED_BPF_BOOTPARAM
+int sysctl_unprivileged_bpf_disabled __read_mostly =
+ CONFIG_UNPRIVILEGED_BPF_BOOTPARAM_VALUE;
+
+static int __init unprivileged_bpf_setup(char *str)
+{
+ unsigned long disabled;
+
+ if (!kstrtoul(str, 0, &disabled))
+ sysctl_unprivileged_bpf_disabled = !!disabled;
+ return 1;
+}
+__setup("unprivileged_bpf_disabled=", unprivileged_bpf_setup);
+#else /* !CONFIG_UNPRIVILEGED_BPF_BOOTPARAM */
int sysctl_unprivileged_bpf_disabled __read_mostly;
+#endif /* CONFIG_UNPRIVILEGED_BPF_BOOTPARAM */
static const struct bpf_map_ops * const bpf_map_types[] = {
#define BPF_PROG_TYPE(_id, _ops)
--
2.1.4
* [PATCH 0/3] bpf: add boot parameters for sysctl knobs
From: Eugene Syromiatnikov @ 2018-05-21 12:29 UTC
To: netdev
Cc: linux-kernel, linux-doc, Kees Cook, Kai-Heng Feng,
Daniel Borkmann, Alexei Starovoitov, Jonathan Corbet, Jiri Olsa,
Jesper Dangaard Brouer
Hello.
This patch set adds the ability to set default values for the
kernel.unprivileged_bpf_disabled, net.core.bpf_jit_harden, and
net.core.bpf_jit_kallsyms sysctl knobs, as well as an option to override
them via boot-time kernel parameters.
Eugene Syromiatnikov (3):
bpf: add ability to configure unprivileged BPF via boot-time parameter
bpf: add ability to configure BPF JIT hardening via boot-time
parameter
bpf: add ability to configure BPF JIT kallsyms export at the boot time
Documentation/admin-guide/kernel-parameters.txt | 28 ++++++++
init/Kconfig | 90 +++++++++++++++++++++++++
kernel/bpf/core.c | 31 +++++++++
kernel/bpf/syscall.c | 16 +++++
4 files changed, 165 insertions(+)
--
2.1.4
* Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy
From: Patrick Bellasi @ 2018-05-21 11:55 UTC
To: Waiman Long
Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
cgroups, linux-kernel, linux-doc, kernel-team, pjt, luto,
Mike Galbraith, torvalds, Roman Gushchin, Juri Lelli
In-Reply-To: <1526590545-3350-2-git-send-email-longman@redhat.com>
Hi Waiman!
I've started looking at the possibility of moving Android to cgroup v2,
and the availability of the cpuset controller makes this even more
promising.
I'll try to give this series a run on Android; meanwhile, I have
some (hopefully not too dumb) questions below.
On 17-May 16:55, Waiman Long wrote:
> Given the fact that thread mode had been merged into 4.14, it is now
> time to enable cpuset to be used in the default hierarchy (cgroup v2)
> as it is clearly threaded.
>
> The cpuset controller had experienced feature creep since its
> introduction more than a decade ago. Besides the core cpus and mems
> control files to limit cpus and memory nodes, there are a bunch of
> additional features that can be controlled from the userspace. Some of
> the features are of doubtful usefulness and may not be actively used.
>
> This patch enables cpuset controller in the default hierarchy with
> a minimal set of features, namely just the cpus and mems and their
> effective_* counterparts. We can certainly add more features to the
> default hierarchy in the future if there is a real user need for them
> later on.
>
> Alternatively, with the unified hierarchy, it may make more sense
> to move some of those additional cpuset features, if desired, to
> memory controller or may be to the cpu controller instead of staying
> with cpuset.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
> Documentation/cgroup-v2.txt | 90 ++++++++++++++++++++++++++++++++++++++++++---
> kernel/cgroup/cpuset.c | 48 ++++++++++++++++++++++--
> 2 files changed, 130 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 74cdeae..cf7bac6 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -53,11 +53,13 @@ v1 is available under Documentation/cgroup-v1/.
> 5-3-2. Writeback
> 5-4. PID
> 5-4-1. PID Interface Files
> - 5-5. Device
> - 5-6. RDMA
> - 5-6-1. RDMA Interface Files
> - 5-7. Misc
> - 5-7-1. perf_event
> + 5-5. Cpuset
> + 5-5-1. Cpuset Interface Files
> + 5-6. Device
> + 5-7. RDMA
> + 5-7-1. RDMA Interface Files
> + 5-8. Misc
> + 5-8-1. perf_event
> 5-N. Non-normative information
> 5-N-1. CPU controller root cgroup process behaviour
> 5-N-2. IO controller root cgroup process behaviour
> @@ -1435,6 +1437,84 @@ through fork() or clone(). These will return -EAGAIN if the creation
> of a new process would cause a cgroup policy to be violated.
>
>
> +Cpuset
> +------
> +
> +The "cpuset" controller provides a mechanism for constraining
> +the CPU and memory node placement of tasks to only the resources
> +specified in the cpuset interface files in a task's current cgroup.
> +This is especially valuable on large NUMA systems where placing jobs
> +on properly sized subsets of the systems with careful processor and
> +memory placement to reduce cross-node memory access and contention
> +can improve overall system performance.
Another quite important use case for cpusets is Android, where they are
actively used for both power saving and performance tuning.
For example, depending on the status of an application, its threads
can be allowed to run on all available CPUs (e.g. foreground apps) or
be restricted to a few energy-efficient CPUs (e.g. background apps).
Since we are "rewriting" cpusets for v2 here, I think it's important
to take this mobile scenario into consideration.
For example, in this context, we are looking at the possibility of
updating/tuning cpuset.cpus at a relatively high rate, i.e. tens of
times per second. I'm not sure that's the same update rate usually
required for the large NUMA systems you cite above. However, in this
case it's quite important to have really small overheads for these
operations.
> +
> +The "cpuset" controller is hierarchical. That means the controller
> +cannot use CPUs or memory nodes not allowed in its parent.
> +
> +
> +Cpuset Interface Files
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> + cpuset.cpus
> + A read-write multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the CPUs allowed to be used by tasks within this
> + cgroup. The CPU numbers are comma-separated numbers or
> + ranges. For example:
> +
> + # cat cpuset.cpus
> + 0-4,6,8-10
> +
> + An empty value indicates that the cgroup is using the same
> + setting as the nearest cgroup ancestor with a non-empty
> + "cpuset.cpus" or all the available CPUs if none is found.
Does that mean that we can move tasks into a newly created group for
which we have not yet configured this value?
AFAIK, that's a different behavior wrt v1... and I like it better.
> +
> + The value of "cpuset.cpus" stays constant until the next update
> + and won't be affected by any CPU hotplug events.
This also sounds interesting; does it mean that we use the
cpuset.cpus mask to restrict online CPUs, whatever they are?
I'll have a better look at the code, but my understanding of v1 is
that we spend a lot of effort keeping task CPU-affinity masks aligned
with the cpuset in which they live, and we do something similar at each
hotplug (HP) event, which ultimately generates a lot of overhead in
systems with many HP events and/or frequent cpuset.cpus changes.
I hope to find some better behavior in this series.
> +
> + cpuset.cpus.effective
> + A read-only multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the onlined CPUs that are actually allowed to be
> + used by tasks within the current cgroup. If "cpuset.cpus"
> + is empty, it shows all the CPUs from the parent cgroup that
> + will be available to be used by this cgroup. Otherwise, it is
> + a subset of "cpuset.cpus". Its value will be affected by CPU
> + hotplug events.
This looks similar to v1, doesn't it?
> +
> + cpuset.mems
> + A read-write multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the memory nodes allowed to be used by tasks within
> + this cgroup. The memory node numbers are comma-separated
> + numbers or ranges. For example:
> +
> + # cat cpuset.mems
> + 0-1,3
> +
> + An empty value indicates that the cgroup is using the same
> + setting as the nearest cgroup ancestor with a non-empty
> + "cpuset.mems" or all the available memory nodes if none
> + is found.
> +
> + The value of "cpuset.mems" stays constant until the next update
> + and won't be affected by any memory nodes hotplug events.
> +
> + cpuset.mems.effective
> + A read-only multiple values file which exists on non-root
> + cpuset-enabled cgroups.
> +
> + It lists the onlined memory nodes that are actually allowed to
> + be used by tasks within the current cgroup. If "cpuset.mems"
> + is empty, it shows all the memory nodes from the parent cgroup
> + that will be available to be used by this cgroup. Otherwise,
> + it is a subset of "cpuset.mems". Its value will be affected
> + by memory nodes hotplug events.
> +
> +
> Device controller
> -----------------
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index b42037e..419b758 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1823,12 +1823,11 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
> return 0;
> }
>
> -
> /*
> * for the common functions, 'private' gives the type of file
> */
>
> -static struct cftype files[] = {
> +static struct cftype legacy_files[] = {
> {
> .name = "cpus",
> .seq_show = cpuset_common_seq_show,
> @@ -1931,6 +1930,47 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
> };
>
> /*
> + * This is currently a minimal set for the default hierarchy. It can be
> + * expanded later on by migrating more features and control files from v1.
> + */
> +static struct cftype dfl_files[] = {
> + {
> + .name = "cpus",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * NR_CPUS),
> + .private = FILE_CPULIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + {
> + .name = "mems",
> + .seq_show = cpuset_common_seq_show,
> + .write = cpuset_write_resmask,
> + .max_write_len = (100U + 6 * MAX_NUMNODES),
> + .private = FILE_MEMLIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + {
> + .name = "cpus.effective",
> + .seq_show = cpuset_common_seq_show,
> + .private = FILE_EFFECTIVE_CPULIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + {
> + .name = "mems.effective",
> + .seq_show = cpuset_common_seq_show,
> + .private = FILE_EFFECTIVE_MEMLIST,
> + .flags = CFTYPE_NOT_ON_ROOT,
> + },
> +
> + { } /* terminate */
> +};
> +
> +
> +/*
> * cpuset_css_alloc - allocate a cpuset css
> * cgrp: control group that the new cpuset will be part of
> */
> @@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
> .post_attach = cpuset_post_attach,
> .bind = cpuset_bind,
> .fork = cpuset_fork,
> - .legacy_cftypes = files,
> + .legacy_cftypes = legacy_files,
> + .dfl_cftypes = dfl_files,
> .early_init = true,
> + .threaded = true,
Which means that by default we can attach tasks instead of only
processes, right?
> };
>
> /**
> --
> 1.8.3.1
>
--
#include <best/regards.h>
Patrick Bellasi
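
To make the relationship between "cpuset.cpus" and "cpuset.cpus.effective" concrete, here is a small userspace sketch. It is a simplified reading of the quoted documentation text only, not of the cpuset code: CPU sets are modelled as 64-bit masks, and parse_cpulist() and effective_cpus() are hypothetical helper names.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a list such as "0-4,6,8-10" into a bitmask (CPUs 0..63 only).
 * Returns 0 on success, -1 on malformed input. An empty string is valid
 * and yields an empty mask, matching the "empty value" case above.
 */
static int parse_cpulist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s) {
		char *end;
		unsigned long lo, hi;

		if (!isdigit((unsigned char)*s))
			return -1;
		lo = strtoul(s, &end, 10);
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (!isdigit((unsigned char)*s))
				return -1;
			hi = strtoul(s, &end, 10);
		}
		if (hi < lo || hi >= 64)
			return -1;
		for (unsigned long cpu = lo; cpu <= hi; cpu++)
			m |= UINT64_C(1) << cpu;
		if (*end == ',') {
			end++;
			if (*end == '\0')	/* trailing comma */
				return -1;
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	*mask = m;
	return 0;
}

/*
 * An empty "cpuset.cpus" inherits the parent's effective set; otherwise
 * the result is a subset of "cpuset.cpus". Only online CPUs are shown.
 */
static uint64_t effective_cpus(uint64_t cpus, uint64_t parent_effective,
			       uint64_t online)
{
	uint64_t allowed = cpus ? (cpus & parent_effective) : parent_effective;

	return allowed & online;
}
```

In this model a hotplug event changes only the online argument, so the configured mask stays constant while the effective mask follows the hardware, which is the split the documentation describes.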
* Re: [RFT v2 1/4] perf cs-etm: Generate sample for missed packets
From: Robert Walker @ 2018-05-21 11:27 UTC
To: Leo Yan, Arnaldo Carvalho de Melo, Mathieu Poirier,
Jonathan Corbet, Peter Zijlstra, Ingo Molnar, Alexander Shishkin,
Jiri Olsa, Namhyung Kim, linux-arm-kernel, linux-doc,
linux-kernel, Tor Jeremiassen, mike.leach, kim.phillips,
coresight
Cc: Mike Leach
In-Reply-To: <1526892748-326-2-git-send-email-leo.yan@linaro.org>
Hi Leo,
On 21/05/18 09:52, Leo Yan wrote:
> Commit e573e978fb12 ("perf cs-etm: Inject capabilitity for CoreSight
> traces") reworks the samples generation flow from CoreSight trace to
> match the correct format so Perf report tool can display the samples
> properly. But the change has side effect for packet handling, it only
> generate samples when 'prev_packet->last_instr_taken_branch' is true,
> this results in the start tracing packet and exception packets are
> dropped.
>
> This patch checks extra two conditions for complete samples:
>
> - If 'prev_packet->sample_type' is zero we can use this condition to
> get to know this is the start tracing packet; for this case, the start
> packet's end_addr is zero as well so we need to handle it in the
> function cs_etm__last_executed_instr();
>
I think you also need to add something in to handle discontinuities in
trace - for example it is possible to configure the ETM to only trace
execution in specific code regions or to trace a few cycles every so
often. In these cases, prev_packet->sample_type will not be zero, but
whatever the previous packet was. You will get a CS_ETM_TRACE_ON packet
in such cases, generated by an I_TRACE_ON element in the trace stream.
You also get this on exception return.
However, you should also keep the test for prev_packet->sample_type == 0
as you may not see a CS_ETM_TRACE_ON when decoding a buffer that has
wrapped.
Regards
Rob
> - If 'prev_packet->exc' is true, we can know the previous packet is
> exception handling packet so need to generate sample for exception
> flow.
>
> Fixes: e573e978fb12 ("perf cs-etm: Inject capabilitity for CoreSight traces")
> Cc: Mike Leach <mike.leach@arm.com>
> Cc: Robert Walker <robert.walker@arm.com>
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
> tools/perf/util/cs-etm.c | 35 ++++++++++++++++++++++++++++-------
> 1 file changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
> index 822ba91..378953b 100644
> --- a/tools/perf/util/cs-etm.c
> +++ b/tools/perf/util/cs-etm.c
> @@ -495,6 +495,13 @@ static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq)
> static inline u64 cs_etm__last_executed_instr(struct cs_etm_packet *packet)
> {
> /*
> + * The packet is the start tracing packet if the end_addr is zero,
> + * returns 0 for this case.
> + */
> + if (!packet->end_addr)
> + return 0;
> +
> + /*
> * The packet records the execution range with an exclusive end address
> *
> * A64 instructions are constant size, so the last executed
> @@ -897,13 +904,27 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
> etmq->period_instructions = instrs_over;
> }
>
> - if (etm->sample_branches &&
> - etmq->prev_packet &&
> - etmq->prev_packet->sample_type == CS_ETM_RANGE &&
> - etmq->prev_packet->last_instr_taken_branch) {
> - ret = cs_etm__synth_branch_sample(etmq);
> - if (ret)
> - return ret;
> + if (etm->sample_branches && etmq->prev_packet) {
> + bool generate_sample = false;
> +
> + /* Generate sample for start tracing packet */
> + if (etmq->prev_packet->sample_type == 0)
> + generate_sample = true;
> +
> + /* Generate sample for exception packet */
> + if (etmq->prev_packet->exc == true)
> + generate_sample = true;
> +
> + /* Generate sample for normal branch packet */
> + if (etmq->prev_packet->sample_type == CS_ETM_RANGE &&
> + etmq->prev_packet->last_instr_taken_branch)
> + generate_sample = true;
> +
> + if (generate_sample) {
> + ret = cs_etm__synth_branch_sample(etmq);
> + if (ret)
> + return ret;
> + }
> }
>
> if (etm->sample_branches || etm->synth_opts.last_branch) {
>
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Mark Rutland @ 2018-05-21 10:55 UTC (permalink / raw)
To: Ganapatrao Kulkarni
Cc: Ganapatrao Kulkarni, linux-doc, linux-kernel, linux-arm-kernel,
Will Deacon, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber
In-Reply-To: <CAKTKpr7Q7T9ZCTBi=LQR=XaAoihbBA3OKCO7yFobzNmR8EfyjQ@mail.gmail.com>
On Sat, May 05, 2018 at 12:16:13AM +0530, Ganapatrao Kulkarni wrote:
> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
> >> + *
> >> + * L3 Tile and DMC channel selection is through SMC call
> >> + * SMC call arguments,
> >> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
> >> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
> >> + * x2 = Node id
> >
> > How do we map Linux node IDs to the firmware's view of node IDs?
> >
> > I don't believe the two are necessarily the same -- Linux's node IDs are
> > a Linux-specific construct.
>
> both are same, it is numa node id from ACPI/firmware.
I am very wary about assuming that the Linux nid will always be the same
as the ACPI node id.
For that to *potentially* be true, this driver should depend on
CONFIG_NUMA, NUMA must not be disabled on the command line, etc, or the
node id will always be NUMA_NO_NODE.
I would be *much* happier if we had an explicit mapping somewhere to the
ID the FW expects.
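As a rough illustration of what an explicit mapping could look like, here is a small Python model. It is a sketch under stated assumptions: the table contents and the names `LINUX_NID_TO_FW_NODE` and `fw_node_id` are hypothetical, and where the table would really come from (ACPI, MPIDR, something else) is exactly the open question.

```python
NUMA_NO_NODE = -1  # value Linux uses when no node is known

# Hypothetical table that would be filled in at probe time from
# firmware-described data; the entries here are invented.
LINUX_NID_TO_FW_NODE = {0: 0, 1: 1}

def fw_node_id(linux_nid):
    """Translate a Linux node id into the id the firmware expects."""
    if linux_nid == NUMA_NO_NODE:
        # e.g. !CONFIG_NUMA or NUMA disabled on the command line
        raise ValueError("no NUMA node information; cannot select channel")
    try:
        return LINUX_NID_TO_FW_NODE[linux_nid]
    except KeyError:
        raise ValueError("no firmware node id known for nid %d" % linux_nid)
```

The point of the model is only that a missing or unknown nid becomes an explicit error instead of being passed to the firmware unchanged.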
> > It would be much nicer if we could pass something based on the MPIDR,
> > which is a known HW construct, or if this implicitly affected the
> > current node.
>
> IMO, node id is sufficient.
I agree that *a* node ID is sufficient, I just don't think that we're
guaranteed to have the specific node ID the FW wants.
> > It would be vastly more sane for this to not be muxed at all. :/
>
> i am helpless due to crappy hw design!
I'm certainly not blaming you for this! :)
I hope the HW designers don't make the same mistake in future, though...
Thanks,
Mark.
* Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Mark Rutland @ 2018-05-21 10:40 UTC (permalink / raw)
To: Ganapatrao Kulkarni
Cc: Ganapatrao Kulkarni, linux-doc, LKML, linux-arm-kernel,
Will Deacon, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber
In-Reply-To: <20180521103712.gofbrjdtghfwolmd@lakrids.cambridge.arm.com>
On Mon, May 21, 2018 at 11:37:12AM +0100, Mark Rutland wrote:
> Hi Ganapat,
>
>
> Sorry for the delay in replying; I was away most of last week.
>
> On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
> > On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote:
> > > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> > >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>
> > >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
> > >>> +{
> > >>> + int counter;
> > >>> +
> > >>> + raw_spin_lock(&pmu_uncore->lock);
> > >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
> > >>> + pmu_uncore->uncore_dev->max_counters);
> > >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
> > >>> + raw_spin_unlock(&pmu_uncore->lock);
> > >>> + return -ENOSPC;
> > >>> + }
> > >>> + set_bit(counter, pmu_uncore->counter_mask);
> > >>> + raw_spin_unlock(&pmu_uncore->lock);
> > >>> + return counter;
> > >>> +}
> > >>> +
> > >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
> > >>> + int counter)
> > >>> +{
> > >>> + raw_spin_lock(&pmu_uncore->lock);
> > >>> + clear_bit(counter, pmu_uncore->counter_mask);
> > >>> + raw_spin_unlock(&pmu_uncore->lock);
> > >>> +}
> > >>
> > >> I don't believe that locking is required in either of these, as the perf
> > >> core serializes pmu::add() and pmu::del(), where these get called.
> >
> > without this locking, i am seeing "BUG: scheduling while atomic" when
> > i run perf with more events together than the maximum counters
> > supported
>
> Did you manage to get to the bottom of this?
>
> Do you have a backtrace?
>
> It looks like in your latest posting you reserve counters through the
> userspace ABI, which doesn't seem right to me, and I'd like to
> understand the problem.
Looks like I misunderstood -- those are still allocated kernel-side.
I'll follow that up in the v5 posting.
Thanks,
Mark.
* Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver
From: Mark Rutland @ 2018-05-21 10:37 UTC (permalink / raw)
To: Ganapatrao Kulkarni
Cc: Ganapatrao Kulkarni, linux-doc, LKML, linux-arm-kernel,
Will Deacon, jnair, Robert Richter, Vadim.Lomovtsev, Jan.Glauber
In-Reply-To: <CAKTKpr61zBW_D6v_Ck1Lcp0NJ4wPFOpgbiM5aU_EtuiFU-qp4Q@mail.gmail.com>
Hi Ganapat,
Sorry for the delay in replying; I was away most of last week.
On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
> On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <gklkml16@gmail.com> wrote:
> > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
> >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
> >>> +{
> >>> + int counter;
> >>> +
> >>> + raw_spin_lock(&pmu_uncore->lock);
> >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
> >>> + pmu_uncore->uncore_dev->max_counters);
> >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
> >>> + raw_spin_unlock(&pmu_uncore->lock);
> >>> + return -ENOSPC;
> >>> + }
> >>> + set_bit(counter, pmu_uncore->counter_mask);
> >>> + raw_spin_unlock(&pmu_uncore->lock);
> >>> + return counter;
> >>> +}
> >>> +
> >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
> >>> + int counter)
> >>> +{
> >>> + raw_spin_lock(&pmu_uncore->lock);
> >>> + clear_bit(counter, pmu_uncore->counter_mask);
> >>> + raw_spin_unlock(&pmu_uncore->lock);
> >>> +}
> >>
> >> I don't believe that locking is required in either of these, as the perf
> >> core serializes pmu::add() and pmu::del(), where these get called.
>
> without this locking, i am seeing "BUG: scheduling while atomic" when
> i run perf with more events together than the maximum counters
> supported
Did you manage to get to the bottom of this?
Do you have a backtrace?
It looks like in your latest posting you reserve counters through the
userspace ABI, which doesn't seem right to me, and I'd like to
understand the problem.
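For reference, the bookkeeping done by the quoted alloc_counter()/free_counter() can be modelled in a few lines of Python. This is a sketch of the bitmap logic only, with the mask passed around as a plain integer; it says nothing about the locking question or about the "scheduling while atomic" report.

```python
import errno

def alloc_counter(counter_mask, max_counters):
    """Return (index, new_mask), or (-ENOSPC, mask) if all are in use."""
    # Equivalent of find_first_zero_bit() followed by set_bit().
    for counter in range(max_counters):
        if not counter_mask & (1 << counter):
            return counter, counter_mask | (1 << counter)
    return -errno.ENOSPC, counter_mask

def free_counter(counter_mask, counter):
    """Return the mask with 'counter' cleared (clear_bit())."""
    return counter_mask & ~(1 << counter)
```

Requesting more events than max_counters simply ends in the -ENOSPC path in this model.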
Thanks,
Mark.
* [RFT v2 0/4] Perf script: Add python script for CoreSight trace disassembler
From: Leo Yan @ 2018-05-21 8:52 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Mathieu Poirier, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, linux-arm-kernel, linux-doc, linux-kernel,
Tor Jeremiassen, mike.leach, kim.phillips, Robert Walker,
coresight
Cc: Leo Yan
This patch series adds support for using 'perf script' as a CoreSight
trace disassembler. To that end it adds a new python script that parses
CoreSight tracing events and uses 'objdump' to produce the disassembled
lines, so that tracing data can be reviewed as a readable program
execution flow.
Patch 0001 is a fix to generate samples for the start packet and
exception packets.
Patch 0002 is a prerequisite that adds addr into the sample dict, so the
python script can use this value to analyze the instruction range.
Patch 0003 adds the python script for the trace disassembler.
Patch 0004 adds documentation that explains the python script usage and
gives an example.
This patch series has been rebased on acme git tree [1] with the last
commit 19422a9f2a3b ("perf tools: Fix kernel_start for PTI on x86") and
tested on Hikey (ARM64 octa CA53 cores).
In this version the script has no dependency on the ARM64 platform and is
expected to support ARM32 as well, but I lack an ARM32 platform to test
on, so this is being upstreamed for ARM64 first.
This patch series first targets tracing data recorded in 'per-thread'
mode, but we also need to verify that the script can disassemble CPU
wide tracing data and kernel panic kdump tracing data. I have verified
that this series works with kernel panic kdump tracing data. Mathieu is
working on CPU wide tracing, so once that work is done we need to retest
both CPU wide tracing and kdump tracing to ensure the python script
handles all cases.
You are very welcome to test the script in this patch series; your test
results and suggestions are very valuable for improving the script to
cover more cases.
Changes from v1:
* Following Mike's and Rob's suggestion, add the fix to generate samples
for the start packet and exception packets.
* Simplify the python script by removing the exception prediction
algorithm; we can rely on sane exception packets for the disassembler.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git
Leo Yan (4):
perf cs-etm: Generate sample for missed packets
perf script python: Add addr into perf sample dict
perf script python: Add script for CoreSight trace disassembler
coresight: Document for CoreSight trace disassembler
Documentation/trace/coresight.txt | 52 +++++
tools/perf/scripts/python/arm-cs-trace-disasm.py | 234 +++++++++++++++++++++
tools/perf/util/cs-etm.c | 35 ++-
.../util/scripting-engines/trace-event-python.c | 2 +
4 files changed, 316 insertions(+), 7 deletions(-)
create mode 100644 tools/perf/scripts/python/arm-cs-trace-disasm.py
--
2.7.4
* [RFT v2 1/4] perf cs-etm: Generate sample for missed packets
From: Leo Yan @ 2018-05-21 8:52 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Mathieu Poirier, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, linux-arm-kernel, linux-doc, linux-kernel,
Tor Jeremiassen, mike.leach, kim.phillips, Robert Walker,
coresight
Cc: Leo Yan, Mike Leach, Robert Walker
In-Reply-To: <1526892748-326-1-git-send-email-leo.yan@linaro.org>
Commit e573e978fb12 ("perf cs-etm: Inject capabilitity for CoreSight
traces") reworks the sample generation flow for CoreSight trace to
match the correct format so that the perf report tool can display the
samples properly. But the change has a side effect on packet handling: it
only generates samples when 'prev_packet->last_instr_taken_branch' is
true, which results in the start tracing packet and exception packets
being dropped.
This patch checks two extra conditions to generate complete samples:
- If 'prev_packet->sample_type' is zero, the previous packet is the start
tracing packet; in this case the start packet's end_addr is zero as
well, so we need to handle it in the function
cs_etm__last_executed_instr();
- If 'prev_packet->exc' is true, the previous packet is an exception
handling packet, so we need to generate a sample for the exception
flow.
Fixes: e573e978fb12 ("perf cs-etm: Inject capabilitity for CoreSight traces")
Cc: Mike Leach <mike.leach@arm.com>
Cc: Robert Walker <robert.walker@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
tools/perf/util/cs-etm.c | 35 ++++++++++++++++++++++++++++-------
1 file changed, 28 insertions(+), 7 deletions(-)
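As an aside for readers, the effect of the first hunk below can be modelled in Python roughly as follows. This is a sketch that assumes the unchanged tail of cs_etm__last_executed_instr() subtracts one A64 instruction (4 bytes) from the exclusive end address; the hunk context suggests this but does not show it.

```python
A64_INSTR_SIZE = 4  # A64 instructions have a fixed size of 4 bytes

def last_executed_instr(end_addr):
    """Model of cs_etm__last_executed_instr() for an A64 range packet."""
    # A start tracing packet has end_addr == 0: report address 0 instead
    # of wrapping around to a huge address by subtracting from zero.
    if end_addr == 0:
        return 0
    # end_addr is exclusive, so step back one instruction (assumed).
    return end_addr - A64_INSTR_SIZE
```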
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 822ba91..378953b 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -495,6 +495,13 @@ static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq)
 static inline u64 cs_etm__last_executed_instr(struct cs_etm_packet *packet)
 {
 	/*
+	 * The packet is the start tracing packet if the end_addr is zero,
+	 * returns 0 for this case.
+	 */
+	if (!packet->end_addr)
+		return 0;
+
+	/*
 	 * The packet records the execution range with an exclusive end address
 	 *
 	 * A64 instructions are constant size, so the last executed
@@ -897,13 +904,27 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 		etmq->period_instructions = instrs_over;
 	}
 
-	if (etm->sample_branches &&
-	    etmq->prev_packet &&
-	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
-	    etmq->prev_packet->last_instr_taken_branch) {
-		ret = cs_etm__synth_branch_sample(etmq);
-		if (ret)
-			return ret;
+	if (etm->sample_branches && etmq->prev_packet) {
+		bool generate_sample = false;
+
+		/* Generate sample for start tracing packet */
+		if (etmq->prev_packet->sample_type == 0)
+			generate_sample = true;
+
+		/* Generate sample for exception packet */
+		if (etmq->prev_packet->exc == true)
+			generate_sample = true;
+
+		/* Generate sample for normal branch packet */
+		if (etmq->prev_packet->sample_type == CS_ETM_RANGE &&
+		    etmq->prev_packet->last_instr_taken_branch)
+			generate_sample = true;
+
+		if (generate_sample) {
+			ret = cs_etm__synth_branch_sample(etmq);
+			if (ret)
+				return ret;
+		}
 	}
 
 	if (etm->sample_branches || etm->synth_opts.last_branch) {
--
2.7.4
* [RFT v2 2/4] perf script python: Add addr into perf sample dict
From: Leo Yan @ 2018-05-21 8:52 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Mathieu Poirier, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, linux-arm-kernel, linux-doc, linux-kernel,
Tor Jeremiassen, mike.leach, kim.phillips, Robert Walker,
coresight
Cc: Leo Yan
In-Reply-To: <1526892748-326-1-git-send-email-leo.yan@linaro.org>
ARM CoreSight auxtrace uses 'sample->addr' to record the target address
for branch instructions, so the data of 'sample->addr' is required for
tracing data analysis.
This commit adds 'sample->addr' to the perf sample dict so that python
scripts can use it when parsing events.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
tools/perf/util/scripting-engines/trace-event-python.c | 2 ++
1 file changed, 2 insertions(+)
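As an aside, with this change a python handler can read the branch target next to the branch source. A minimal sketch, where `describe_branch` is a hypothetical helper and not part of this patch:

```python
def describe_branch(param_dict):
    """Format one CoreSight branch sample as 'source -> target'."""
    sample = param_dict["sample"]
    # 'ip' holds the address of the branch instruction; 'addr' is the
    # field added by this patch and holds the branch target.
    return "%#x -> %#x" % (sample["ip"], sample["addr"])
```

A script's process_event(param_dict) callback could call such a helper for every sample it receives.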
diff --git a/tools/perf/util/scripting-engines/trace-event-python.c b/tools/perf/util/scripting-engines/trace-event-python.c
index 10dd5fc..7f8afac 100644
--- a/tools/perf/util/scripting-engines/trace-event-python.c
+++ b/tools/perf/util/scripting-engines/trace-event-python.c
@@ -531,6 +531,8 @@ static PyObject *get_perf_sample_dict(struct perf_sample *sample,
 			PyLong_FromUnsignedLongLong(sample->period));
 	pydict_set_item_string_decref(dict_sample, "phys_addr",
 			PyLong_FromUnsignedLongLong(sample->phys_addr));
+	pydict_set_item_string_decref(dict_sample, "addr",
+			PyLong_FromUnsignedLongLong(sample->addr));
 	set_sample_read_in_dict(dict_sample, sample, evsel);
 
 	pydict_set_item_string_decref(dict, "sample", dict_sample);
--
2.7.4
* [RFT v2 4/4] coresight: Document for CoreSight trace disassembler
From: Leo Yan @ 2018-05-21 8:52 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Mathieu Poirier, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, linux-arm-kernel, linux-doc, linux-kernel,
Tor Jeremiassen, mike.leach, kim.phillips, Robert Walker,
coresight
Cc: Leo Yan
In-Reply-To: <1526892748-326-1-git-send-email-leo.yan@linaro.org>
This commit documents the usage of the CoreSight trace disassembler and
gives an example for it.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
Documentation/trace/coresight.txt | 52 +++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt
index 6f0120c..b8f2359 100644
--- a/Documentation/trace/coresight.txt
+++ b/Documentation/trace/coresight.txt
@@ -381,3 +381,55 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
$ taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
5806 ms
+
+
+Tracing data disassembler
+-------------------------
+
+'perf script' can use a script to parse tracing packets and rely on
+'objdump' for the disassembled lines; this converts tracing data into a
+readable program execution flow that is easy to review.
+
+The CoreSight trace disassembler script is located at:
+tools/perf/scripts/python/arm-cs-trace-disasm.py. This script supports the
+options below:
+
+	-d, --objdump: Set path to objdump executable, this option is
+		       mandatory.
+	-k, --vmlinux: Set path to vmlinux file.
+	-v, --verbose: Enable debugging log; with this option enabled the
+		       script dumps the data of every event.
+
+Below is an example of using the python script to dump the CoreSight trace
+disassembly:
+
+	$ perf script -s arm-cs-trace-disasm.py -i perf.data \
+	  -F cpu,event,ip,addr,sym -- -d objdump -k ./vmlinux > cs-disasm.log
+
+Below is an example of the disassembler log:
+
+ARM CoreSight Trace Data Assembler Dump
+ ffff000008a5f2dc <etm4_enable_hw+0x344>:
+ ffff000008a5f2dc: 340000a0 cbz w0, ffff000008a5f2f0 <etm4_enable_hw+0x358>
+ ffff000008a5f2f0 <etm4_enable_hw+0x358>:
+ ffff000008a5f2f0: f9400260 ldr x0, [x19]
+ ffff000008a5f2f4: d5033f9f dsb sy
+ ffff000008a5f2f8: 913ec000 add x0, x0, #0xfb0
+ ffff000008a5f2fc: b900001f str wzr, [x0]
+ ffff000008a5f300: f9400bf3 ldr x19, [sp, #16]
+ ffff000008a5f304: a8c27bfd ldp x29, x30, [sp], #32
+ ffff000008a5f308: d65f03c0 ret
+ ffff000008a5fa18 <etm4_enable+0x1b0>:
+ ffff000008a5fa18: 14000025 b ffff000008a5faac <etm4_enable+0x244>
+ ffff000008a5faac <etm4_enable+0x244>:
+ ffff000008a5faac: b9406261 ldr w1, [x19, #96]
+ ffff000008a5fab0: 52800015 mov w21, #0x0 // #0
+ ffff000008a5fab4: f901ca61 str x1, [x19, #912]
+ ffff000008a5fab8: 2a1503e0 mov w0, w21
+ ffff000008a5fabc: 3940e261 ldrb w1, [x19, #56]
+ ffff000008a5fac0: f901ce61 str x1, [x19, #920]
+ ffff000008a5fac4: a94153f3 ldp x19, x20, [sp, #16]
+ ffff000008a5fac8: a9425bf5 ldp x21, x22, [sp, #32]
+ ffff000008a5facc: a94363f7 ldp x23, x24, [sp, #48]
+ ffff000008a5fad0: a8c47bfd ldp x29, x30, [sp], #64
+ ffff000008a5fad4: d65f03c0 ret
--
2.7.4
* [RFT v2 3/4] perf script python: Add script for CoreSight trace disassembler
From: Leo Yan @ 2018-05-21 8:52 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Mathieu Poirier, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, linux-arm-kernel, linux-doc, linux-kernel,
Tor Jeremiassen, mike.leach, kim.phillips, Robert Walker,
coresight
Cc: Leo Yan
In-Reply-To: <1526892748-326-1-git-send-email-leo.yan@linaro.org>
This commit adds a python script that parses CoreSight tracing events and
uses 'objdump' to produce the disassembled lines, so that we can generate
a readable program execution flow for reviewing tracing data.
The script receives CoreSight tracing packets in the format below:
               +------------+------------+------------+
  packet(n):   |    addr    |     ip     |    cpu     |
               +------------+------------+------------+
  packet(n+1): |    addr    |     ip     |    cpu     |
               +------------+------------+------------+
packet::ip is the address of the branch instruction that ends the current
range, and packet::addr is the branch target, i.e. the start address of
the next range. So the instruction range that begins in packet(n) starts
at packet(n)::addr and stops at packet(n+1)::ip. As a result we need to
combine two consecutive packets to generate the instruction range; this
is the rationale for the script implementation:
  [ sample(n)::addr .. sample(n+1)::ip ]
Credits to Tor Jeremiassen, who wrote the script skeleton and provided
the ideas for reading the symbol file according to build-id, creating the
memory map for dsos and basic packet handling. Mathieu Poirier
contributed fixes for build-id and memory map bugs. The detailed
development history for this script can be found in [1]. Based on Tor's
and Mathieu's work, the sample handling in the script is updated for the
corrected sample format. Another minor enhancement is support for the
case without build-id: the script can parse kernel symbols with option
'-k' for the vmlinux file path.
[1] https://github.com/Linaro/perf-opencsd/commits/perf-opencsd-v4.15/tools/perf/scripts/python/cs-trace-disasm.py
Co-authored-by: Tor Jeremiassen <tor@ti.com>
Co-authored-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
tools/perf/scripts/python/arm-cs-trace-disasm.py | 234 +++++++++++++++++++++++
1 file changed, 234 insertions(+)
create mode 100644 tools/perf/scripts/python/arm-cs-trace-disasm.py
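As an aside, the pairing rule from the commit message can be sketched in a few lines of Python. This is a simplified model and not the script below: it keeps one pending start address per CPU, `instruction_ranges` is a hypothetical name, and the extra 4 bytes assume fixed-size A64 instructions so that the stop address covers the last instruction.

```python
def instruction_ranges(samples, instr_size=4):
    """Combine consecutive samples into (cpu, start, stop) ranges.

    The range that begins at sample(n)['addr'] ends at sample(n+1)['ip'];
    'stop' is exclusive, hence the extra instr_size.
    """
    pending_start = {}  # cpu -> 'addr' of the previous sample on that cpu
    ranges = []
    for sample in samples:
        cpu = sample["cpu"]
        if cpu in pending_start:
            ranges.append((cpu, pending_start[cpu],
                           sample["ip"] + instr_size))
        # The branch target starts the next range on this cpu.
        pending_start[cpu] = sample["addr"]
    return ranges
```

The first sample on a CPU only records a start address, which matches the script returning early on the first tracing event.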
diff --git a/tools/perf/scripts/python/arm-cs-trace-disasm.py b/tools/perf/scripts/python/arm-cs-trace-disasm.py
new file mode 100644
index 0000000..58de36f
--- /dev/null
+++ b/tools/perf/scripts/python/arm-cs-trace-disasm.py
@@ -0,0 +1,234 @@
+# arm-cs-trace-disasm.py: ARM CoreSight Trace Dump With Disassembler
+# SPDX-License-Identifier: GPL-2.0
+#
+# Tor Jeremiassen <tor@ti.com> is original author who wrote script
+# skeleton, Mathieu Poirier <mathieu.poirier@linaro.org> contributed
+# fixes for build-id and memory map; Leo Yan <leo.yan@linaro.org>
+# updated the packet parsing with new samples format.
+
+import os
+import sys
+import re
+from subprocess import *
+from optparse import OptionParser, make_option
+
+# Command line parsing
+
+option_list = [
+	# formatting options for the bottom entry of the stack
+	make_option("-k", "--vmlinux", dest="vmlinux_name",
+		    help="Set path to vmlinux file"),
+	make_option("-d", "--objdump", dest="objdump_name",
+		    help="Set path to objdump executable file"),
+	make_option("-v", "--verbose", dest="verbose",
+		    action="store_true", default=False,
+		    help="Enable debugging log")
+]
+
+parser = OptionParser(option_list=option_list)
+(options, args) = parser.parse_args()
+
+if (options.objdump_name == None):
+	sys.exit("No objdump executable file specified - use -d or --objdump option")
+
+# Initialize global dicts and regular expression
+
+build_ids = dict()
+mmaps = dict()
+disasm_cache = dict()
+cpu_data = dict()
+disasm_re = re.compile("^\s*([0-9a-fA-F]+):")
+disasm_func_re = re.compile("^\s*([0-9a-fA-F]+)\s\<.*\>:")
+cache_size = 32*1024
+prev_cpu = -1
+
+def parse_buildid():
+	global build_ids
+
+	buildid_regex = "([a-fA-F0-9]+)[ \t]([^ \n]+)"
+	buildid_re = re.compile(buildid_regex)
+
+	results = check_output(["perf", "buildid-list"]).split('\n');
+	for line in results:
+		m = buildid_re.search(line)
+		if (m == None):
+			continue;
+
+		id_name = m.group(2)
+		id_num = m.group(1)
+
+		if (id_name == "[kernel.kallsyms]") :
+			append = "/kallsyms"
+		elif (id_name == "[vdso]") :
+			append = "/vdso"
+		else:
+			append = "/elf"
+
+		build_ids[id_name] = os.environ['PERF_BUILDID_DIR'] + \
+			"/" + id_name + "/" + id_num + append;
+		# Replace the first duplicated slash with a single slash
+		build_ids[id_name] = build_ids[id_name].replace('//', '/', 1)
+
+	if ((options.vmlinux_name == None) and ("[kernel.kallsyms]" in build_ids)):
+		print "kallsyms cannot be used to dump assembler"
+
+	# Set vmlinux path to replace kallsyms file, if without buildid we still
+	# can use vmlinux to parse kernel symbols
+	if ((options.vmlinux_name != None)):
+		build_ids['[kernel.kallsyms]'] = options.vmlinux_name;
+
+def parse_mmap():
+	global mmaps
+
+	# Check mmap for PERF_RECORD_MMAP and PERF_RECORD_MMAP2
+	mmap_regex = "PERF_RECORD_MMAP.* -?[0-9]+/[0-9]+: \[(0x[0-9a-fA-F]+)\((0x[0-9a-fA-F]+)\).*:\s.*\s(\S*)"
+	mmap_re = re.compile(mmap_regex)
+
+	results = check_output("perf script --show-mmap-events | fgrep PERF_RECORD_MMAP", shell=True).split('\n')
+	for line in results:
+		m = mmap_re.search(line)
+		if (m != None):
+			if (m.group(3) == '[kernel.kallsyms]_text'):
+				dso = '[kernel.kallsyms]'
+			else:
+				dso = m.group(3)
+
+			start = int(m.group(1),0)
+			end = int(m.group(1),0) + int(m.group(2),0)
+			mmaps[dso] = [start, end]
+
+def find_dso_mmap(addr):
+	global mmaps
+
+	for key, value in mmaps.items():
+		if (addr >= value[0] and addr < value[1]):
+			return key
+
+	return None
+
+def read_disam(dso, start_addr, stop_addr):
+	global mmaps
+	global build_ids
+
+	addr_range = start_addr + ":" + stop_addr;
+
+	# Don't let the cache get too big, clear it when it hits max size
+	if (len(disasm_cache) > cache_size):
+		disasm_cache.clear();
+
+	try:
+		disasm_output = disasm_cache[addr_range];
+	except:
+		try:
+			fname = build_ids[dso];
+		except KeyError:
+			sys.exit("cannot find symbol file for " + dso)
+
+		disasm = [ options.objdump_name, "-d", "-z",
+			   "--start-address="+start_addr,
+			   "--stop-address="+stop_addr, fname ]
+
+		disasm_output = check_output(disasm).split('\n')
+		disasm_cache[addr_range] = disasm_output;
+
+	return disasm_output
+
+def dump_disam(dso, start_addr, stop_addr, check_svc):
+	for line in read_disam(dso, start_addr, stop_addr):
+		m = disasm_func_re.search(line)
+		if (m != None):
+			print "\t",line
+			continue
+
+		m = disasm_re.search(line)
+		if (m == None):
+			continue;
+
+		print "\t",line
+
+		if ((check_svc == True) and "svc" in line):
+			return
+
+def dump_packet(sample):
+	print "Packet = { cpu: %d addr: 0x%x phys_addr: 0x%x ip: 0x%x " \
+	      "pid: %d tid: %d period: %d time: %d }" % \
+	      (sample['cpu'], sample['addr'], sample['phys_addr'], \
+	       sample['ip'], sample['pid'], sample['tid'], \
+	       sample['period'], sample['time'])
+
+def trace_begin():
+	print 'ARM CoreSight Trace Data Assembler Dump'
+	parse_buildid()
+	parse_mmap()
+
+def trace_end():
+	print 'End'
+
+def trace_unhandled(event_name, context, event_fields_dict):
+	print ' '.join(['%s=%s'%(k,str(v))for k,v in sorted(event_fields_dict.items())])
+
+def process_event(param_dict):
+	global cache_size
+	global options
+	global prev_cpu
+
+	sample = param_dict["sample"]
+
+	if (options.verbose == True):
+		dump_packet(sample)
+
+	# If period isn't equal to 1, this packet is an instruction sample
+	# packet; we need to drop this synthetic packet.
+	if (sample['period'] != 1):
+		print "Skip synthetic instruction sample"
+		return
+
+	cpu = format(sample['cpu'], "d");
+
+	# Initialize CPU data if it's empty, and directly return back
+	# if this is the first tracing event for this CPU.
+	if (cpu_data.get(str(cpu) + 'addr') == None):
+		cpu_data[str(cpu) + 'addr'] = format(sample['addr'], "#x")
+		prev_cpu = cpu
+		return
+
+	# The format for packet is:
+	#
+	#                +------------+------------+------------+
+	#  sample_prev:  |    addr    |     ip     |    cpu     |
+	#                +------------+------------+------------+
+	#  sample_next:  |    addr    |     ip     |    cpu     |
+	#                +------------+------------+------------+
+	#
+	# We need to combine the two continuous packets to get the instruction
+	# range for sample_prev::cpu:
+	#
+	#     [ sample_prev::addr .. sample_next::ip ]
+	#
+	# For this purpose, sample_prev::addr is stored into cpu_data structure
+	# and read back for 'start_addr' when the new packet comes, and we need
+	# to use sample_next::ip to calculate 'stop_addr'; the extra 4 added to
+	# 'stop_addr' is for the sake of objdump so the final assembler dump can
+	# include the last instruction for sample_next::ip.
+
+	start_addr = cpu_data[str(prev_cpu) + 'addr']
+	stop_addr = format(sample['ip'] + 4, "#x")
+
+	# Sanity checking dso for start_addr and stop_addr
+	prev_dso = find_dso_mmap(int(start_addr, 0))
+	next_dso = find_dso_mmap(int(stop_addr, 0))
+
+	# If the dso cannot be found we cannot dump assembler, so bail out
+	if (prev_dso == None or next_dso == None):
+		print "Address range [ %s .. %s ]: failed to find dso" % (start_addr, stop_addr)
+		prev_cpu = cpu
+		return
+	elif (prev_dso != next_dso):
+		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
+		prev_cpu = cpu
+		return
+
+	dump_disam(prev_dso, start_addr, stop_addr, False)
+
+	cpu_data[str(cpu) + 'addr'] = format(sample['addr'], "#x")
+	prev_cpu = cpu
--
2.7.4
* Re: [PATCH v2 3/7] memcg: use compound_order rather than hpage_nr_pages
From: TSUKADA Koutaro @ 2018-05-21 3:48 UTC (permalink / raw)
To: Punit Agrawal
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Jonathan Corbet,
Luis R. Rodriguez, Kees Cook, Andrew Morton, Roman Gushchin,
David Rientjes, Mike Kravetz, Aneesh Kumar K.V, Naoya Horiguchi,
Anshuman Khandual, Marc-Andre Lureau, Dan Williams,
Vlastimil Babka, linux-doc, linux-kernel, linux-fsdevel, linux-mm,
cgroups
In-Reply-To: <87sh6ozwc4.fsf@e105922-lin.cambridge.arm.com>
On 2018/05/19 2:51, Punit Agrawal wrote:
> Punit Agrawal <punit.agrawal@arm.com> writes:
>
>> Tsukada-san,
>>
>> I am not familiar with memcg so can't comment about whether the patchset
>> is the right way to solve the problem outlined in the cover letter but
>> had a couple of comments about this patch.
>>
>> TSUKADA Koutaro <tsukada@ascade.co.jp> writes:
>>
>>> The current memcg implementation assumes that the compound page is THP.
>>> In order to be able to charge surplus hugepage, we use compound_order.
>>>
>>> Signed-off-by: TSUKADA Koutaro <tsukada@ascade.co.jp>
>>
>> Please move this before Patch 1/7. This is to prevent wrong accounting
>> of pages to memcg for size != PMD_SIZE.
>
> I just noticed that the default state is off so the change isn't enabled
> until the sysfs node is exposed in the next patch. Please ignore this
> comment.
>
> One below still applies.
>
>>
>>> ---
>>> memcontrol.c | 10 +++++-----
>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 2bd3df3..a8f1ff8 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4483,7 +4483,7 @@ static int mem_cgroup_move_account(struct page *page,
>>> struct mem_cgroup *to)
>>> {
>>> unsigned long flags;
>>> - unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>>> + unsigned int nr_pages = compound ? (1 << compound_order(page)) : 1;
>>
>> Instead of replacing calls to hpage_nr_pages(), is it possible to modify
>> it to do the calculation?
Thank you for reviewing my code, and please just call me Tsukada.
I think it is possible to modify hpage_nr_pages() itself rather than
replacing the calls to it.
Judging from what hpage_nr_pages() is meant to do, I thought that its
definition could be moved outside of CONFIG_TRANSPARENT_HUGEPAGE. THP and
HugeTLBfs should both be handled correctly, because compound_order()
checks whether the page is PageHead or not.
I would also like to use compound_order() inside hpage_nr_pages(), but
huge_mm.h is included before the point in mm.h where compound_order() is
defined, so I moved hpage_nr_pages() to mm.h.
Instead of patch 3/7, does the following patch implement what you
intended?
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a1262..1186ab7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -204,12 +204,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
else
return NULL;
}
-static inline int hpage_nr_pages(struct page *page)
-{
- if (unlikely(PageTransHuge(page)))
- return HPAGE_PMD_NR;
- return 1;
-}
struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags);
@@ -254,8 +248,6 @@ static inline bool thp_migration_supported(void)
#define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; })
-#define hpage_nr_pages(x) 1
-
static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
{
return false;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ac1f06..082f2ee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -673,6 +673,12 @@ static inline unsigned int compound_order(struct page *page)
return page[1].compound_order;
}
+static inline int hpage_nr_pages(struct page *page)
+{
+ VM_BUG_ON_PAGE(PageTail(page), page);
+ return (1 << compound_order(page));
+}
+
static inline void set_compound_order(struct page *page, unsigned int order)
{
page[1].compound_order = order;
--
Thanks,
Tsukada
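As a rough illustration of the proposal above, the calculation can be modelled in userspace C. This is only a sketch: struct mock_page and the mock_* helpers are hypothetical stand-ins, not kernel definitions, and the assert() merely plays the role of VM_BUG_ON_PAGE().

```c
#include <assert.h>

/*
 * Hypothetical stand-in for struct page: only the fields needed to
 * illustrate the calculation are modelled.
 */
struct mock_page {
	unsigned int order;	/* compound order; 0 for a base page */
	int is_tail;		/* non-zero for a tail page */
};

/* Mirrors the idea of compound_order(): a base page has order 0. */
static unsigned int mock_compound_order(const struct mock_page *page)
{
	return page->order;
}

/* Number of base pages covered by a head (or base) page. */
static int mock_hpage_nr_pages(const struct mock_page *page)
{
	assert(!page->is_tail);	/* stands in for VM_BUG_ON_PAGE() */
	return 1 << mock_compound_order(page);
}
```

With this model a base page yields 1 and an order-9 compound page yields 512, without a THP-specific constant such as HPAGE_PMD_NR.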
* Re: [PATCH 4/5] acpi/processor: Fix the return value of acpi_processor_ids_walk()
From: Thomas Gleixner @ 2018-05-19 15:06 UTC (permalink / raw)
To: Dou Liyang
Cc: LKML, x86, linux-acpi, linux-doc, Ingo Molnar, Jonathan Corbet,
Rafael J. Wysocki, Len Brown, H. Peter Anvin, Peter Zijlstra
In-Reply-To: <20180320110432.28127-5-douly.fnst@cn.fujitsu.com>
On Tue, 20 Mar 2018, Dou Liyang wrote:
> The ACPI driver should make sure that all processor IDs in the ACPI
> namespace are unique for CPU hotplug. The driver performs a depth-first
> walk of the namespace tree and calls acpi_processor_ids_walk().
>
> However, acpi_processor_ids_walk() returns true once a processor has
> been checked, which causes the walk to break after passing the first
> processor.
>
> Replace the value with AE_OK, which is the standard acpi_status value.
>
> Fixes: 8c8cb30f49b8 ("acpi/processor: Implement DEVICE operator for processor enumeration")
>
> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
> ---
> drivers/acpi/acpi_processor.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
> index 449d86d39965..db5bdb59639c 100644
> --- a/drivers/acpi/acpi_processor.c
> +++ b/drivers/acpi/acpi_processor.c
> @@ -663,11 +663,11 @@ static acpi_status __init acpi_processor_ids_walk(acpi_handle handle,
> }
>
> processor_validated_ids_update(uid);
> - return true;
> + return AE_OK;
>
> err:
> acpi_handle_info(handle, "Invalid processor object\n");
> - return false;
> + return AE_OK;
I'm not sure whether this is the right return value here. Rafael?
Thanks,
tglx
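The failure mode described in the quoted commit message can be sketched with a toy walker. This assumes only what the commit message states, namely that the walk stops as soon as the callback returns something other than the "OK" status. MOCK_OK, mock_walk() and the callbacks are hypothetical names for illustration, not ACPICA interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status code for this sketch; 0 means "keep walking". */
#define MOCK_OK 0

typedef int (*mock_walk_cb)(int object, void *context);

/* Visit objects in order; stop at the first non-zero status. */
static int mock_walk(const int *objects, size_t n, mock_walk_cb cb,
		     void *context)
{
	for (size_t i = 0; i < n; i++) {
		int status = cb(objects[i], context);

		if (status != MOCK_OK)
			return status;
	}
	return MOCK_OK;
}

/* Old behaviour: "true" (1) is a non-zero status, so the walk stops. */
static int count_and_return_true(int object, void *context)
{
	(void)object;
	++*(int *)context;
	return 1;
}

/* Patched behaviour: report success so that the walk continues. */
static int count_and_return_ok(int object, void *context)
{
	(void)object;
	++*(int *)context;
	return MOCK_OK;
}

/* Convenience helper: how many of n (at most 8) objects a callback sees. */
static int mock_visited(mock_walk_cb cb, size_t n)
{
	int objects[8] = { 0 };
	int visited = 0;

	mock_walk(objects, n > 8 ? 8 : n, cb, &visited);
	return visited;
}
```

In this model a callback returning true sees only the first of four objects, while one returning MOCK_OK sees all four. Whether the error path should also report success is the open question raised above.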
* [RFC PATCH 1/6] net: ethernet: ti: cpsw: use cpdma channels in backward order for txq
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>
The cpdma channel priority runs from the highest channel number to the
lowest.
The driver has a limited number of descriptors that are shared between
the cpdma channels. The number of queues can be tuned with ethtool,
which avoids spending descriptors on cpdma channels that are not needed.
For AVB, 2 tx queues with rate limitation are usually enough.
Rate limitation can be used only for the high priority queues. Thus, to
use only 2 rate limited queues, all 8 have to be created, which is
wasteful.
So, in order to use only the needed number of rate limited tx queues,
save resources, and still be able to set rate limitation for them,
assign tx cpdma channels to queues in backward order.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
drivers/net/ethernet/ti/cpsw.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index a7285dddfd29..9bd615da04d3 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -967,8 +967,8 @@ static int cpsw_tx_mq_poll(struct napi_struct *napi_tx, int budget)
/* process every unprocessed channel */
ch_map = cpdma_ctrl_txchs_state(cpsw->dma);
- for (ch = 0, num_tx = 0; ch_map; ch_map >>= 1, ch++) {
- if (!(ch_map & 0x01))
+ for (ch = 0, num_tx = 0; ch_map & 0xff; ch_map <<= 1, ch++) {
+ if (!(ch_map & 0x80))
continue;
txv = &cpsw->txv[ch];
@@ -2431,7 +2431,7 @@ static int cpsw_update_channels_res(struct cpsw_priv *priv, int ch_num, int rx)
void (*handler)(void *, int, int);
struct netdev_queue *queue;
struct cpsw_vector *vec;
- int ret, *ch;
+ int ret, *ch, vch;
if (rx) {
ch = &cpsw->rx_ch_num;
@@ -2444,7 +2444,8 @@ static int cpsw_update_channels_res(struct cpsw_priv *priv, int ch_num, int rx)
}
while (*ch < ch_num) {
- vec[*ch].ch = cpdma_chan_create(cpsw->dma, *ch, handler, rx);
+ vch = rx ? *ch : 7 - *ch;
+ vec[*ch].ch = cpdma_chan_create(cpsw->dma, vch, handler, rx);
queue = netdev_get_tx_queue(priv->ndev, *ch);
queue->tx_maxrate = 0;
@@ -2980,7 +2981,7 @@ static int cpsw_probe(struct platform_device *pdev)
u32 slave_offset, sliver_offset, slave_size;
const struct soc_device_attribute *soc;
struct cpsw_common *cpsw;
- int ret = 0, i;
+ int ret = 0, i, ch;
int irq;
cpsw = devm_kzalloc(&pdev->dev, sizeof(struct cpsw_common), GFP_KERNEL);
@@ -3155,7 +3156,8 @@ static int cpsw_probe(struct platform_device *pdev)
if (soc)
cpsw->quirk_irq = 1;
- cpsw->txv[0].ch = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0);
+ ch = cpsw->quirk_irq ? 0 : 7;
+ cpsw->txv[0].ch = cpdma_chan_create(cpsw->dma, ch, cpsw_tx_handler, 0);
if (IS_ERR(cpsw->txv[0].ch)) {
dev_err(priv->dev, "error initializing tx dma channel\n");
ret = PTR_ERR(cpsw->txv[0].ch);
--
2.17.0
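The reversed mapping in the patch above can be sketched in plain C. This is a model of my reading of the diff, not driver code: txq_to_cpdma_ch() and pending_tx_vectors() are hypothetical helper names, and the bitmap layout (bit n = cpdma channel n) is an assumption.

```c
#include <assert.h>

#define MOCK_MAX_TX_CH 8

/* tx queue index -> cpdma channel number, mirroring "7 - *ch" in the patch. */
static int txq_to_cpdma_ch(int txq)
{
	return (MOCK_MAX_TX_CH - 1) - txq;
}

/*
 * Walk a channel state bitmap (assumed: bit n = cpdma channel n) from
 * bit 7 downwards and record the tx vector indexes that need processing.
 * Returns the number of indexes written to out[].
 */
static int pending_tx_vectors(unsigned int ch_map, int out[MOCK_MAX_TX_CH])
{
	int ch, n = 0;

	for (ch = 0; ch_map & 0xff; ch_map <<= 1, ch++) {
		if (!(ch_map & 0x80))
			continue;
		out[n++] = ch;
	}
	return n;
}
```

Shifting left and testing bit 7 means that vector index 0 corresponds to channel 7, index 1 to channel 6, and so on, which matches the creation order used in cpsw_update_channels_res().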
* [RFC PATCH 4/6] net: ethernet: ti: cpsw: add CBS Qdisc offload
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>
The cpsw has up to 4 FIFOs per port, and the upper 3 FIFOs can feed rate
limited queues with shaping. In order to set and enable shaping for
those 3 FIFO queues, a network device with a CBS qdisc attached is
needed. The CBS configuration is added for dual-emac/single port mode
only, but can potentially be used in switch mode as well, based on
switchdev for instance.
Although the FIFO shapers can work without cpdma level shapers, the basic
usage must be in combination with the cpdma level shapers as described in
the TRM, which are set as maximum rates for interface queues via sysfs.
One possible configuration with txq shapers and CBS shapers:
Configured with echo RATE >
/sys/class/net/eth0/queues/tx-0/tx_maxrate
/---------------------------------------------------
/
/ cpdma level shapers
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
| c7 | | c6 | | c5 | | c4 | | c3 | | c2 | | c1 | | c0 |
\ / \ / \ / \ / \ / \ / \ / \ /
\ / \ / \ / \ / \ / \ / \ / \ /
\/ \/ \/ \/ \/ \/ \/ \/
+---------|------|------|------|-------------------------------------+
| +----+ | | +---+ |
| | +----+ | | |
| v v v v |
| +----+ +----+ +----+ +----+ p p+----+ +----+ +----+ +----+ |
| | | | | | | | | o o| | | | | | | | |
| | f3 | | f2 | | f1 | | f0 | r CPSW r| f3 | | f2 | | f1 | | f0 | |
| | | | | | | | | t t| | | | | | | | |
| \ / \ / \ / \ / 0 1\ / \ / \ / \ / |
| \ X \ / \ / \ / \ / \ / \ / \ / |
| \/ \ \/ \/ \/ \/ \/ \/ \/ |
+-------\------------------------------------------------------------+
\
\ FIFO shaper, set with CBS offload added in this patch,
\ FIFO0 cannot be rate limited
------------------------------------------------------
The CBS shaper configuration is supposed to be used with the root MQPRIO
Qdisc offload, which allows adding sk_prio->tc->txq maps that direct
traffic to the appropriate tx queue and map L2 priority to a FIFO shaper.
The CBS shaper is intended to be used for AVB, where the L2 priority
(pcp field) is used to differentiate classes of traffic. So, additionally,
a vlan needs to be created with an appropriate egress sk_prio->l2 prio map.
If CBS has several tx queues assigned to it, the sum of their
bandwidths must not exceed the bandwidth set for CBS. It's recommended
that the CBS bandwidth be a little bit more.
The CBS shaper is configured through the CBS qdisc offload interface
using the tc tool from the iproute2 package.
For instance:
$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
$ tc -g class show dev eth0
+---(100:ffe2) mqprio
| +---(100:3) mqprio
| +---(100:4) mqprio
|
+---(100:ffe1) mqprio
| +---(100:2) mqprio
|
+---(100:ffe0) mqprio
+---(100:1) mqprio
$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1440 \
hicredit 60 sendslope -960000 idleslope 40000 offload 1
$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
hicredit 62 sendslope -980000 idleslope 20000 offload 1
The commands above set CBS shapers for tc0 and tc1, for which txq0 and
txq1 are used. Note that the bandwidth actually set can differ a bit
due to the discreteness of the configuration parameters.
Here, parameters like locredit, hicredit and sendslope are ignored
internally and are supposed to be set assuming a maximum frame size
of 1500.
The interface speed is assumed not to change across reconnection,
which is not always true, so inform the user if the speed of the
interface has changed, as it can impact the configuration of the
dependent shapers.
For more examples see Documentation.
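The sendslope and locredit values in the tc commands above appear to follow the usual CBS relations (as I understand them from the tc-cbs man page), assuming a 1 Gbit/s link (1000000 kbit/s) and a 1500 byte maximum frame. The sketch below only reproduces that arithmetic; the function names are hypothetical, and hicredit is deliberately left out because it depends on the assumed interference size.

```c
#include <assert.h>

/* All rates are in kbit/s, frame sizes in bytes. */

/* sendslope = idleslope - port transmit rate */
static long long cbs_sendslope(long long idleslope, long long port_rate)
{
	return idleslope - port_rate;
}

/* locredit = max frame size * sendslope / port transmit rate */
static long long cbs_locredit(long long idleslope, long long port_rate,
			      long long max_frame)
{
	return max_frame * cbs_sendslope(idleslope, port_rate) / port_rate;
}
```

For idleslope 40000 this gives sendslope -960000 and locredit -1440, and for idleslope 20000 it gives -980000 and -1470, matching the two example commands.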
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
drivers/net/ethernet/ti/cpsw.c | 221 +++++++++++++++++++++++++++++++++
1 file changed, 221 insertions(+)
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 4b232cda5436..c7710b0e1c17 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -46,6 +46,8 @@
#include "cpts.h"
#include "davinci_cpdma.h"
+#include <net/pkt_sched.h>
+
#define CPSW_DEBUG (NETIF_MSG_HW | NETIF_MSG_WOL | \
NETIF_MSG_DRV | NETIF_MSG_LINK | \
NETIF_MSG_IFUP | NETIF_MSG_INTR | \
@@ -154,8 +156,12 @@ do { \
#define IRQ_NUM 2
#define CPSW_MAX_QUEUES 8
#define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
+#define CPSW_FIFO_QUEUE_TYPE_SHIFT 16
+#define CPSW_FIFO_SHAPE_EN_SHIFT 16
+#define CPSW_FIFO_RATE_EN_SHIFT 20
#define CPSW_TC_NUM 4
#define CPSW_FIFO_SHAPERS_NUM (CPSW_TC_NUM - 1)
+#define CPSW_PCT_MASK 0x7f
#define CPSW_RX_VLAN_ENCAP_HDR_PRIO_SHIFT 29
#define CPSW_RX_VLAN_ENCAP_HDR_PRIO_MSK GENMASK(2, 0)
@@ -457,6 +463,8 @@ struct cpsw_priv {
bool rx_pause;
bool tx_pause;
bool mqprio_hw;
+ int fifo_bw[CPSW_TC_NUM];
+ int shp_cfg_speed;
u32 emac_port;
struct cpsw_common *cpsw;
};
@@ -1081,6 +1089,38 @@ static void cpsw_set_slave_mac(struct cpsw_slave *slave,
slave_write(slave, mac_lo(priv->mac_addr), SA_LO);
}
+static bool cpsw_shp_is_off(struct cpsw_priv *priv)
+{
+ struct cpsw_common *cpsw = priv->cpsw;
+ struct cpsw_slave *slave;
+ u32 shift, mask, val;
+
+ val = readl_relaxed(&cpsw->regs->ptype);
+
+ slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+ shift = CPSW_FIFO_SHAPE_EN_SHIFT + 3 * slave->slave_num;
+ mask = 7 << shift;
+ val = val & mask;
+
+ return !val;
+}
+
+static void cpsw_fifo_shp_on(struct cpsw_priv *priv, int fifo, int on)
+{
+ struct cpsw_common *cpsw = priv->cpsw;
+ struct cpsw_slave *slave;
+ u32 shift, mask, val;
+
+ val = readl_relaxed(&cpsw->regs->ptype);
+
+ slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+ shift = CPSW_FIFO_SHAPE_EN_SHIFT + 3 * slave->slave_num;
+ mask = (1 << --fifo) << shift;
+ val = on ? val | mask : val & ~mask;
+
+ writel_relaxed(val, &cpsw->regs->ptype);
+}
+
static void _cpsw_adjust_link(struct cpsw_slave *slave,
struct cpsw_priv *priv, bool *link)
{
@@ -1120,6 +1160,12 @@ static void _cpsw_adjust_link(struct cpsw_slave *slave,
mac_control |= BIT(4);
*link = true;
+
+ if (priv->shp_cfg_speed &&
+ priv->shp_cfg_speed != slave->phy->speed &&
+ !cpsw_shp_is_off(priv))
+ dev_warn(priv->dev,
+ "Speed was changed, CBS shaper speeds are changed!");
} else {
mac_control = 0;
/* disable forwarding */
@@ -1589,6 +1635,178 @@ static int cpsw_tc_to_fifo(int tc, int num_tc)
return CPSW_FIFO_SHAPERS_NUM - tc;
}
+static int cpsw_set_fifo_bw(struct cpsw_priv *priv, int fifo, int bw)
+{
+ struct cpsw_common *cpsw = priv->cpsw;
+ u32 val = 0, send_pct, shift;
+ struct cpsw_slave *slave;
+ int pct = 0, i;
+
+ if (bw > priv->shp_cfg_speed * 1000)
+ goto err;
+
+ /* shaping has to stay enabled for the highest fifos linearly
+ * and fifo bw must be no more than the interface can allow
+ */
+ slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+ send_pct = slave_read(slave, SEND_PERCENT);
+ for (i = CPSW_FIFO_SHAPERS_NUM; i > 0; i--) {
+ if (!bw) {
+ if (i >= fifo || !priv->fifo_bw[i])
+ continue;
+
+ dev_warn(priv->dev, "Prev FIFO%d is shaped", i);
+ continue;
+ }
+
+ if (!priv->fifo_bw[i] && i > fifo) {
+ dev_err(priv->dev, "Upper FIFO%d is not shaped", i);
+ return -EINVAL;
+ }
+
+ shift = (i - 1) * 8;
+ if (i == fifo) {
+ send_pct &= ~(CPSW_PCT_MASK << shift);
+ val = DIV_ROUND_UP(bw, priv->shp_cfg_speed * 10);
+ if (!val)
+ val = 1;
+
+ send_pct |= val << shift;
+ pct += val;
+ continue;
+ }
+
+ if (priv->fifo_bw[i])
+ pct += (send_pct >> shift) & CPSW_PCT_MASK;
+ }
+
+ if (pct >= 100)
+ goto err;
+
+ slave_write(slave, send_pct, SEND_PERCENT);
+ priv->fifo_bw[fifo] = bw;
+
+ dev_warn(priv->dev, "set FIFO%d bw = %d\n", fifo,
+ DIV_ROUND_CLOSEST(val * priv->shp_cfg_speed, 100));
+
+ return 0;
+err:
+ dev_err(priv->dev, "Bandwidth doesn't fit in tc configuration");
+ return -EINVAL;
+}
+
+static int cpsw_set_fifo_rlimit(struct cpsw_priv *priv, int fifo, int bw)
+{
+ struct cpsw_common *cpsw = priv->cpsw;
+ struct cpsw_slave *slave;
+ u32 tx_in_ctl_rg, val;
+ int ret;
+
+ ret = cpsw_set_fifo_bw(priv, fifo, bw);
+ if (ret)
+ return ret;
+
+ slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+ tx_in_ctl_rg = cpsw->version == CPSW_VERSION_1 ?
+ CPSW1_TX_IN_CTL : CPSW2_TX_IN_CTL;
+
+ if (!bw)
+ cpsw_fifo_shp_on(priv, fifo, bw);
+
+ val = slave_read(slave, tx_in_ctl_rg);
+ if (cpsw_shp_is_off(priv)) {
+ /* disable FIFOs rate limited queues */
+ val &= ~(0xf << CPSW_FIFO_RATE_EN_SHIFT);
+
+ /* set type of FIFO queues to normal priority mode */
+ val &= ~(3 << CPSW_FIFO_QUEUE_TYPE_SHIFT);
+
+ /* set type of FIFO queues to be rate limited */
+ if (bw)
+ val |= 2 << CPSW_FIFO_QUEUE_TYPE_SHIFT;
+ else
+ priv->shp_cfg_speed = 0;
+ }
+
+ /* toggle a FIFO rate limited queue */
+ if (bw)
+ val |= BIT(fifo + CPSW_FIFO_RATE_EN_SHIFT);
+ else
+ val &= ~BIT(fifo + CPSW_FIFO_RATE_EN_SHIFT);
+ slave_write(slave, val, tx_in_ctl_rg);
+
+ /* FIFO transmit shape enable */
+ cpsw_fifo_shp_on(priv, fifo, bw);
+ return 0;
+}
+
+/* Defaults:
+ * class A - prio 3
+ * class B - prio 2
+ * shaping for class A should be set first
+ */
+static int cpsw_set_cbs(struct net_device *ndev,
+ struct tc_cbs_qopt_offload *qopt)
+{
+ struct cpsw_priv *priv = netdev_priv(ndev);
+ struct cpsw_common *cpsw = priv->cpsw;
+ struct cpsw_slave *slave;
+ int prev_speed = 0;
+ int tc, ret, fifo;
+ u32 bw = 0;
+
+ tc = netdev_txq_to_tc(priv->ndev, qopt->queue);
+
+ /* enable channels in backward order, as highest FIFOs must be rate
+ * limited first and for compliance with CPDMA rate limited channels
+ * that are also used in backward order. FIFO0 cannot be rate limited.
+ */
+ fifo = cpsw_tc_to_fifo(tc, ndev->num_tc);
+ if (!fifo) {
+ dev_err(priv->dev, "Last tc%d can't be rate limited", tc);
+ return -EINVAL;
+ }
+
+ /* do nothing, it's disabled anyway */
+ if (!qopt->enable && !priv->fifo_bw[fifo])
+ return 0;
+
+ /* shapers can be set if link speed is known */
+ slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+ if (slave->phy && slave->phy->link) {
+ if (priv->shp_cfg_speed &&
+ priv->shp_cfg_speed != slave->phy->speed)
+ prev_speed = priv->shp_cfg_speed;
+
+ priv->shp_cfg_speed = slave->phy->speed;
+ }
+
+ if (!priv->shp_cfg_speed) {
+ dev_err(priv->dev, "Link speed is not known");
+ return -EINVAL;
+ }
+
+ ret = pm_runtime_get_sync(cpsw->dev);
+ if (ret < 0) {
+ pm_runtime_put_noidle(cpsw->dev);
+ return ret;
+ }
+
+ bw = qopt->enable ? qopt->idleslope : 0;
+ ret = cpsw_set_fifo_rlimit(priv, fifo, bw);
+ if (ret) {
+ priv->shp_cfg_speed = prev_speed;
+ prev_speed = 0;
+ }
+
+ if (bw && prev_speed)
+ dev_warn(priv->dev,
+ "Speed was changed, CBS shaper speeds are changed!");
+
+ pm_runtime_put_sync(cpsw->dev);
+ return ret;
+}
+
static int cpsw_ndo_open(struct net_device *ndev)
{
struct cpsw_priv *priv = netdev_priv(ndev);
@@ -2263,6 +2481,9 @@ static int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
void *type_data)
{
switch (type) {
+ case TC_SETUP_QDISC_CBS:
+ return cpsw_set_cbs(ndev, type_data);
+
case TC_SETUP_QDISC_MQPRIO:
return cpsw_set_tc(ndev, type_data);
--
2.17.0
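The bandwidth-to-percentage conversion done in cpsw_set_fifo_bw() can be modelled as follows. This is a userspace sketch of my reading of the patch, with hypothetical helper names; it assumes the bandwidth is given in kbit/s (as idleslope is) and the link speed in Mbit/s, and it covers only the non-zero bandwidth case.

```c
#include <assert.h>

/*
 * bw_kbps:    requested bandwidth in kbit/s, assumed non-zero
 * speed_mbps: link speed in Mbit/s
 * Returns the send percentage, rounded up, with a minimum of 1.
 */
static unsigned int fifo_bw_to_percent(unsigned int bw_kbps,
				       unsigned int speed_mbps)
{
	unsigned int div = speed_mbps * 10;	/* kbit/s per 1% of link speed */
	unsigned int pct = (bw_kbps + div - 1) / div;

	return pct ? pct : 1;
}

/* Bandwidth actually configured, in Mbit/s, rounded to the nearest value. */
static unsigned int percent_to_mbps(unsigned int pct, unsigned int speed_mbps)
{
	return (pct * speed_mbps + 50) / 100;
}
```

Because the hardware field is a whole percentage, a request of 45000 kbit/s on a 1000 Mbit/s link rounds up to 5%, that is 50 Mbit/s. This is the discreteness the commit message warns about.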
* [RFC PATCH 3/6] net: ethernet: ti: cpsw: add MQPRIO Qdisc offload
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>
It is possible to offload the vlan to tc priority mapping under the
assumption that sk_prio == L2 prio.
Example:
$ ethtool -L eth0 rx 1 tx 4
$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
$ tc -g class show dev eth0
+---(100:ffe2) mqprio
| +---(100:3) mqprio
| +---(100:4) mqprio
|
+---(100:ffe1) mqprio
| +---(100:2) mqprio
|
+---(100:ffe0) mqprio
+---(100:1) mqprio
Here, 100:1 is txq0, 100:2 is txq1, 100:3 is txq2, and 100:4 is txq3.
txq0 belongs to tc0, txq1 to tc1, and txq2 and txq3 to tc2.
The offload part only maps L2 prio to classes of traffic, not to
transmit queues, so to direct traffic to a traffic class, a vlan has
to be created with an appropriate egress map.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
drivers/net/ethernet/ti/cpsw.c | 82 ++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 9bd615da04d3..4b232cda5436 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -39,6 +39,7 @@
#include <linux/sys_soc.h>
#include <linux/pinctrl/consumer.h>
+#include <net/pkt_cls.h>
#include "cpsw.h"
#include "cpsw_ale.h"
@@ -153,6 +154,8 @@ do { \
#define IRQ_NUM 2
#define CPSW_MAX_QUEUES 8
#define CPSW_CPDMA_DESCS_POOL_SIZE_DEFAULT 256
+#define CPSW_TC_NUM 4
+#define CPSW_FIFO_SHAPERS_NUM (CPSW_TC_NUM - 1)
#define CPSW_RX_VLAN_ENCAP_HDR_PRIO_SHIFT 29
#define CPSW_RX_VLAN_ENCAP_HDR_PRIO_MSK GENMASK(2, 0)
@@ -453,6 +456,7 @@ struct cpsw_priv {
u8 mac_addr[ETH_ALEN];
bool rx_pause;
bool tx_pause;
+ bool mqprio_hw;
u32 emac_port;
struct cpsw_common *cpsw;
};
@@ -1577,6 +1581,14 @@ static void cpsw_slave_stop(struct cpsw_slave *slave, struct cpsw_common *cpsw)
soft_reset_slave(slave);
}
+static int cpsw_tc_to_fifo(int tc, int num_tc)
+{
+ if (tc == num_tc - 1)
+ return 0;
+
+ return CPSW_FIFO_SHAPERS_NUM - tc;
+}
+
static int cpsw_ndo_open(struct net_device *ndev)
{
struct cpsw_priv *priv = netdev_priv(ndev);
@@ -2190,6 +2202,75 @@ static int cpsw_ndo_set_tx_maxrate(struct net_device *ndev, int queue, u32 rate)
return ret;
}
+static int cpsw_set_tc(struct net_device *ndev, void *type_data)
+{
+ struct tc_mqprio_qopt_offload *mqprio = type_data;
+ struct cpsw_priv *priv = netdev_priv(ndev);
+ struct cpsw_common *cpsw = priv->cpsw;
+ int fifo, num_tc, count, offset;
+ struct cpsw_slave *slave;
+ u32 tx_prio_map = 0;
+ int i, tc, ret;
+
+ num_tc = mqprio->qopt.num_tc;
+ if (num_tc > CPSW_TC_NUM)
+ return -EINVAL;
+
+ if (mqprio->mode != TC_MQPRIO_MODE_DCB)
+ return -EINVAL;
+
+ ret = pm_runtime_get_sync(cpsw->dev);
+ if (ret < 0) {
+ pm_runtime_put_noidle(cpsw->dev);
+ return ret;
+ }
+
+ if (num_tc) {
+ for (i = 0; i < 8; i++) {
+ tc = mqprio->qopt.prio_tc_map[i];
+ fifo = cpsw_tc_to_fifo(tc, num_tc);
+ tx_prio_map |= fifo << (4 * i);
+ }
+
+ netdev_set_num_tc(ndev, num_tc);
+ for (i = 0; i < num_tc; i++) {
+ count = mqprio->qopt.count[i];
+ offset = mqprio->qopt.offset[i];
+ netdev_set_tc_queue(ndev, i, count, offset);
+ }
+ }
+
+ if (!mqprio->qopt.hw) {
+ /* restore default configuration */
+ netdev_reset_tc(ndev);
+ tx_prio_map = TX_PRIORITY_MAPPING;
+ }
+
+ priv->mqprio_hw = mqprio->qopt.hw;
+
+ offset = cpsw->version == CPSW_VERSION_1 ?
+ CPSW1_TX_PRI_MAP : CPSW2_TX_PRI_MAP;
+
+ slave = &cpsw->slaves[cpsw_slave_index(cpsw, priv)];
+ slave_write(slave, tx_prio_map, offset);
+
+ pm_runtime_put_sync(cpsw->dev);
+
+ return 0;
+}
+
+static int cpsw_ndo_setup_tc(struct net_device *ndev, enum tc_setup_type type,
+ void *type_data)
+{
+ switch (type) {
+ case TC_SETUP_QDISC_MQPRIO:
+ return cpsw_set_tc(ndev, type_data);
+
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
static const struct net_device_ops cpsw_netdev_ops = {
.ndo_open = cpsw_ndo_open,
.ndo_stop = cpsw_ndo_stop,
@@ -2205,6 +2286,7 @@ static const struct net_device_ops cpsw_netdev_ops = {
#endif
.ndo_vlan_rx_add_vid = cpsw_ndo_vlan_rx_add_vid,
.ndo_vlan_rx_kill_vid = cpsw_ndo_vlan_rx_kill_vid,
+ .ndo_setup_tc = cpsw_ndo_setup_tc,
};
static int cpsw_get_regs_len(struct net_device *ndev)
--
2.17.0
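How cpsw_set_tc() packs the priority map into a register value can be sketched in userspace C. The mock_* names are hypothetical, the constant 3 stands for CPSW_FIFO_SHAPERS_NUM from the patch, and the example array holds the first eight entries of the "map 2 2 1 0 2 2 2 2 ..." argument from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_FIFO_SHAPERS_NUM 3

/* The last traffic class goes to FIFO0; the others count down from FIFO3. */
static int mock_tc_to_fifo(int tc, int num_tc)
{
	if (tc == num_tc - 1)
		return 0;

	return MOCK_FIFO_SHAPERS_NUM - tc;
}

/* Pack one 4-bit FIFO number per L2 priority, priority 0 in the low nibble. */
static uint32_t mock_tx_prio_map(const uint8_t prio_tc_map[8], int num_tc)
{
	uint32_t map = 0;

	for (int i = 0; i < 8; i++) {
		int fifo = mock_tc_to_fifo(prio_tc_map[i], num_tc);

		map |= (uint32_t)fifo << (4 * i);
	}
	return map;
}

/* First eight entries of the mqprio "map" argument in the example. */
static const uint8_t example_prio_tc_map[8] = { 2, 2, 1, 0, 2, 2, 2, 2 };
```

With num_tc 3, priority 3 (tc0) lands in FIFO3, priority 2 (tc1) in FIFO2, and everything else (tc2) in FIFO0, so the packed value is 0x3200.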
* [RFC PATCH 2/6] net: ethernet: ti: cpdma: fit rated channels in backward order
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>
According to the TRM, tx rated channels should be in 7..0 order,
so correct this.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
drivers/net/ethernet/ti/davinci_cpdma.c | 31 ++++++++++++-------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 31ae04117f0a..37fbdc668cc7 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -406,37 +406,36 @@ static int cpdma_chan_fit_rate(struct cpdma_chan *ch, u32 rate,
struct cpdma_chan *chan;
u32 old_rate = ch->rate;
u32 new_rmask = 0;
- int rlim = 1;
+ int rlim = 0;
int i;
- *prio_mode = 0;
for (i = tx_chan_num(0); i < tx_chan_num(CPDMA_MAX_CHANNELS); i++) {
chan = ctlr->channels[i];
- if (!chan) {
- rlim = 0;
+ if (!chan)
continue;
- }
if (chan == ch)
chan->rate = rate;
if (chan->rate) {
- if (rlim) {
- new_rmask |= chan->mask;
- } else {
- ch->rate = old_rate;
- dev_err(ctlr->dev, "Prev channel of %dch is not rate limited\n",
- chan->chan_num);
- return -EINVAL;
- }
- } else {
- *prio_mode = 1;
- rlim = 0;
+ rlim = 1;
+ new_rmask |= chan->mask;
+ continue;
}
+
+ if (rlim)
+ goto err;
}
*rmask = new_rmask;
+ *prio_mode = rlim;
return 0;
+
+err:
+ ch->rate = old_rate;
+ dev_err(ctlr->dev, "Upper cpdma ch%d is not rate limited\n",
+ chan->chan_num);
+ return -EINVAL;
}
static u32 cpdma_chan_set_factors(struct cpdma_ctlr *ctlr,
--
2.17.0
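The ordering rule enforced by the reworked cpdma_chan_fit_rate() can be modelled with two bitmasks. This is a sketch of the check as I read the diff, not the driver code: rated_order_ok() is a hypothetical name, and bit n in each mask stands for tx channel n.

```c
#include <assert.h>

#define MOCK_MAX_TX_CH 8

/*
 * present_mask: channels that exist
 * rated_mask:   channels that have a non-zero rate
 * Returns 1 if every existing channel above the lowest rated one is also
 * rated (rated channels fill from channel 7 downwards), 0 otherwise.
 */
static int rated_order_ok(unsigned int present_mask, unsigned int rated_mask)
{
	int seen_rated = 0;

	for (int i = 0; i < MOCK_MAX_TX_CH; i++) {
		if (!(present_mask & (1u << i)))
			continue;	/* channel not created: skip it */

		if (rated_mask & (1u << i)) {
			seen_rated = 1;
			continue;
		}

		if (seen_rated)
			return 0;	/* an upper channel is not rate limited */
	}
	return 1;
}
```

For example, with all eight channels present, rating channels 7 and 6 passes, while rating channel 6 alone fails because channel 7 above it is left unrated.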
* [RFC PATCH 5/6] net: ethernet: ti: cpsw: restore shaper configuration while down/up
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>
The shaper configuration needs to be restored after the interface goes
down and up. This is needed because the corresponding configuration is
still reflected in the kernel settings. Only the shaper context is
restored, so the vlan configuration should be restored by the user if
needed, especially for devices with one port, where vlan frames are sent
via ALE.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
drivers/net/ethernet/ti/cpsw.c | 47 ++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c7710b0e1c17..c3e88be36c1b 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -1807,6 +1807,51 @@ static int cpsw_set_cbs(struct net_device *ndev,
return ret;
}
+static void cpsw_cbs_resume(struct cpsw_slave *slave, struct cpsw_priv *priv)
+{
+ int fifo, bw;
+
+ for (fifo = CPSW_FIFO_SHAPERS_NUM; fifo > 0; fifo--) {
+ bw = priv->fifo_bw[fifo];
+ if (!bw)
+ continue;
+
+ cpsw_set_fifo_rlimit(priv, fifo, bw);
+ }
+}
+
+static void cpsw_mqprio_resume(struct cpsw_slave *slave, struct cpsw_priv *priv)
+{
+ struct cpsw_common *cpsw = priv->cpsw;
+ u32 tx_prio_map = 0;
+ int i, tc, fifo;
+ u32 tx_prio_rg;
+
+ if (!priv->mqprio_hw)
+ return;
+
+ for (i = 0; i < 8; i++) {
+ tc = netdev_get_prio_tc_map(priv->ndev, i);
+ fifo = CPSW_FIFO_SHAPERS_NUM - tc;
+ tx_prio_map |= fifo << (4 * i);
+ }
+
+ tx_prio_rg = cpsw->version == CPSW_VERSION_1 ?
+ CPSW1_TX_PRI_MAP : CPSW2_TX_PRI_MAP;
+
+ slave_write(slave, tx_prio_map, tx_prio_rg);
+}
+
+/* restore resources after port reset */
+static void cpsw_restore(struct cpsw_priv *priv)
+{
+ /* restore MQPRIO offload */
+ for_each_slave(priv, cpsw_mqprio_resume, priv);
+
+ /* restore CBS offload */
+ for_each_slave(priv, cpsw_cbs_resume, priv);
+}
+
static int cpsw_ndo_open(struct net_device *ndev)
{
struct cpsw_priv *priv = netdev_priv(ndev);
@@ -1886,6 +1931,8 @@ static int cpsw_ndo_open(struct net_device *ndev)
}
+ cpsw_restore(priv);
+
/* Enable Interrupt pacing if configured */
if (cpsw->coal_intvl != 0) {
struct ethtool_coalesce coal;
--
2.17.0
* [RFC PATCH 6/6] Documentation: networking: cpsw: add MQPRIO & CBS offload examples
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
In-Reply-To: <20180518211510.13341-1-ivan.khoronzhuk@linaro.org>
This document describes MQPRIO and CBS Qdisc offload configuration
for the cpsw driver, based on examples. It can potentially be used in
audio video bridging (AVB) and time sensitive networking (TSN).
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
---
Documentation/networking/cpsw.txt | 540 ++++++++++++++++++++++++++++++
1 file changed, 540 insertions(+)
create mode 100644 Documentation/networking/cpsw.txt
diff --git a/Documentation/networking/cpsw.txt b/Documentation/networking/cpsw.txt
new file mode 100644
index 000000000000..28c64896d59d
--- /dev/null
+++ b/Documentation/networking/cpsw.txt
@@ -0,0 +1,540 @@
+* Texas Instruments CPSW ethernet driver
+
+Multiqueue & CBS & MQPRIO
+=====================================================================
+=====================================================================
+
+The cpsw has 3 CBS shapers for each external port. This document
+describes MQPRIO and CBS Qdisc offload configuration for the cpsw driver
+based on examples. It can potentially be used in audio video bridging
+(AVB) and time sensitive networking (TSN).
+
+The following examples were tested on AM572x EVM and BBB boards.
+
+Test setup
+==========
+
+Two examples are considered, with an AM572x EVM running the cpsw driver
+in dual_emac mode.
+
+Several prerequisites:
+- TX queues must be rated starting from txq0, which has the highest priority
+- Traffic classes are used starting from 0, which has the highest priority
+- CBS shapers should be used with rated queues
+- The bandwidth for CBS shapers has to be set a little bit more than
+ the potential incoming rate; thus, the rate of all incoming tx queues
+ has to be a little less
+- Real rates can differ, due to discreteness
+- Mapping skb-priority to txq is not enough; an skb-priority to l2 prio
+ map also has to be created with the ip or vconfig tool
+- Any l2/socket prio (0 - 7) can be used for the classes, but for
+ simplicity the default values are used: 3 and 2
+- Only 2 classes were tested, A and B, but more were checked and can
+ work; at most 4 are allowed, and a rate can be set for only 3 of them.
+
+Test setup for examples
+=======================
+ +-------------------------------+
+ |--+ |
+ | | Workstation0 |
+ |E | MAC 18:03:73:66:87:42 |
++-----------------------------+ +--|t | |
+| | 1 | E | | |h |./tsn_listener -d \ |
+| Target board: | 0 | t |--+ |0 | 18:03:73:66:87:42 -i eth0 \|
+| AM572x EVM | 0 | h | | | -s 1500 |
+| | 0 | 0 | |--+ |
+| Only 2 classes: |Mb +---| +-------------------------------+
+| class A, class B | |
+| | +---| +-------------------------------+
+| | 1 | E | |--+ |
+| | 0 | t | | | Workstation1 |
+| | 0 | h |--+ |E | MAC 20:cf:30:85:7d:fd |
+| |Mb | 1 | +--|t | |
++-----------------------------+ |h |./tsn_listener -d \ |
+ |0 | 20:cf:30:85:7d:fd -i eth0 \|
+ | | -s 1500 |
+ |--+ |
+ +-------------------------------+
+
+*********************************************************************
+*********************************************************************
+*********************************************************************
+Example 1: One port tx AVB configuration scheme for target board
+----------------------------------------------------------------------
+(prints and scheme for AM572x EVM, applicable for single port boards)
+
+tc - traffic class
+txq - transmit queue
+p - priority
+f - fifo (cpsw fifo)
+S - shaper configured
+
++------------------------------------------------------------------+ u
+| +---------------+ +---------------+ +------+ +------+ | s
+| | | | | | | | | | e
+| | App 1 | | App 2 | | Apps | | Apps | | r
+| | Class A | | Class B | | Rest | | Rest | |
+| | Eth0 | | Eth0 | | Eth0 | | Eth1 | | s
+| | VLAN100 | | VLAN100 | | | | | | | | p
+| | 40 Mb/s | | 20 Mb/s | | | | | | | | a
+| | SO_PRIORITY=3 | | SO_PRIORITY=2 | | | | | | | | c
+| | | | | | | | | | | | | | e
+| +---|-----------+ +---|-----------+ +---|--+ +---|--+ |
++-----|------------------|------------------|--------|-------------+
+ +-+ +------------+ | |
+ | | +-----------------+ +--+
+ | | | |
++---|-------|-------------|-----------------------|----------------+
+| +----+ +----+ +----+ +----+ +----+ |
+| | p3 | | p2 | | p1 | | p0 | | p0 | | k
+| \ / \ / \ / \ / \ / | e
+| \ / \ / \ / \ / \ / | r
+| \/ \/ \/ \/ \/ | n
+| | | | | | e
+| | | +-----+ | | l
+| | | | | |
+| +----+ +----+ +----+ +----+ | s
+| |tc0 | |tc1 | |tc2 | |tc0 | | p
+| \ / \ / \ / \ / | a
+| \ / \ / \ / \ / | c
+| \/ \/ \/ \/ | e
+| | | +-----+ | |
+| | | | | | |
+| | | | | | |
+| | | | | | |
+| +----+ +----+ +----+ +----+ +----+ |
+| |txq0| |txq1| |txq2| |txq3| |txq4| |
+| \ / \ / \ / \ / \ / |
+| \ / \ / \ / \ / \ / |
+| \/ \/ \/ \/ \/ |
+| +-|------|------|------|--+ +--|--------------+ |
+| | | | | | | Eth0.100 | | Eth1 | |
++---|------|------|------|------------------------|----------------+
+ | | | | |
+ p p p p |
+ 3 2 0-1, 4-7 <- L2 priority |
+ | | | | |
+ | | | | |
++---|------|------|------|------------------------|----------------+
+| | | | | |----------+ |
+| +----+ +----+ +----+ +----+ +----+ |
+| |dma7| |dma6| |dma5| |dma4| |dma4| |
+| \ / \ / \ / \ / \ / | c
+| \S / \S / \ / \ / \ / | p
+| \/ \/ \/ \/ \/ | s
+| | | | +----- | | w
+| | | | | | |
+| | | | | | | d
+| +----+ +----+ +----+p p+----+ | r
+| | | | | | |o o| | | i
+| | f3 | | f2 | | f0 |r r| f0 | | v
+| |tc0 | |tc1 | |tc2 |t t|tc0 | | e
+| \CBS / \CBS / \CBS /1 2\CBS / | r
+| \S / \S / \ / \ / |
+| \/ \/ \/ \/ |
++------------------------------------------------------------------+
+========================================Eth==========================>
+
+1)
+// Add 4 tx queues, for interface Eth0, and 1 tx queue for Eth1
+$ ethtool -L eth0 rx 1 tx 5
+rx unmodified, ignoring
+
+2)
+// Check if num of queues is set correctly:
+$ ethtool -l eth0
+Channel parameters for eth0:
+Pre-set maximums:
+RX: 8
+TX: 8
+Other: 0
+Combined: 0
+Current hardware settings:
+RX: 1
+TX: 5
+Other: 0
+Combined: 0
+
+3)
+// TX queues must be rated starting from 0, so set bws for tx0 and tx1
+// Set rates 40 and 20 Mb/s respectively.
+// Pay attention, real speed can differ a bit due to discreteness.
+// Leave last 2 tx queues not rated.
+$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+
+4)
+// Check maximum rate of tx (cpdma) queues:
+$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+40
+20
+0
+0
+0
+
+5)
+// Map skb->priority to traffic class:
+// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+// Map traffic class to transmit queue:
+// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq2, txq3)
+$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
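The map argument in step 5 is positional: the value at position N is the
traffic class used for skb->priority N. As a minimal sketch for checking a
map before applying it, the values can be decoded in plain shell. The
helper name mqprio_map_show is hypothetical (it is not part of tc or of
the cpsw driver) and it only prints the mapping; it does not touch any
interface.

```shell
# Hypothetical helper: print which traffic class each skb->priority selects.
# Arguments are the positional values of the mqprio "map" option.
mqprio_map_show() {
    prio=0
    for tc in "$@"; do
        echo "prio $prio -> tc$tc"
        prio=$((prio + 1))
    done
}

# Map used in step 5: prio 3 -> tc0, prio 2 -> tc1, everything else -> tc2
mqprio_map_show 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2
```

Only priorities 2 and 3 leave the default class tc2, which matches the
comment in step 5.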
+
+5a)
+// As the two interfaces share the same set of tx queues, assign all
+// traffic going to interface Eth1 to a separate queue so that it does
+// not mix with traffic from interface Eth0: use a separate txq to send
+// packets to Eth1, so all prio -> tc0 and tc0 -> txq4
+// Here hw 0, so eth1 still has the default configuration in hw
+$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 1 \
+map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@4 hw 0
+
+6)
+// Check classes settings
+$ tc -g class show dev eth0
++---(100:ffe2) mqprio
+| +---(100:3) mqprio
+| +---(100:4) mqprio
+|
++---(100:ffe1) mqprio
+| +---(100:2) mqprio
+|
++---(100:ffe0) mqprio
+ +---(100:1) mqprio
+
+$ tc -g class show dev eth1
++---(100:ffe0) mqprio
+ +---(100:5) mqprio
+
+7)
+// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc
+// Set it +1 Mb for reserve (important!)
+// here only idle slope is important, other args are ignored
+// Pay attention, real speed can differ a bit due to discreteness
+$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1438 \
+hicredit 62 sendslope -959000 idleslope 41000 offload 1
+net eth0: set FIFO3 bw = 50
+
+8)
+// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc:
+// Set it +1 Mb for reserve (important!)
+$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1468 \
+hicredit 65 sendslope -979000 idleslope 21000 offload 1
+net eth0: set FIFO2 bw = 30
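The idleslope and sendslope values in steps 7 and 8 follow from the queue
rate and the port speed. The sketch below assumes a 1000 Mb port (as drawn
in the test setup diagram), the +1 Mb reserve described above, and the
usual CBS relation sendslope = idleslope - port rate, all in kbit/s. The
helper name cbs_slopes is hypothetical. hicredit and locredit are not
computed here, since only the idle slope is used by the offload.

```shell
# Hypothetical helper: cbs_slopes RATE_MBIT PORT_MBIT
# Prints idleslope and sendslope in kbit/s for a CBS qdisc.
cbs_slopes() {
    rate_mbit=$1
    port_mbit=$2
    # tx_maxrate plus 1 Mb of reserve, converted to kbit/s
    idleslope=$(( (rate_mbit + 1) * 1000 ))
    # sendslope = idleslope - port transmit rate (always negative)
    sendslope=$(( idleslope - port_mbit * 1000 ))
    echo "idleslope $idleslope sendslope $sendslope"
}

cbs_slopes 40 1000   # class A: idleslope 41000 sendslope -959000
cbs_slopes 20 1000   # class B: idleslope 21000 sendslope -979000
```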
+
+9)
+// Create vlan 100 to map sk->priority to vlan qos
+$ ip link add link eth0 name eth0.100 type vlan id 100
+8021q: 802.1Q VLAN Support v1.8
+8021q: adding VLAN 0 to HW filter on device eth0
+8021q: adding VLAN 0 to HW filter on device eth1
+net eth0: Adding vlanid 100 to vlan filter
+
+10)
+// Map skb->priority to L2 prio, 1 to 1
+$ ip link set eth0.100 type vlan \
+egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11)
+// Check egress map for vlan 100
+$ cat /proc/net/vlan/eth0.100
+[...]
+INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12)
+// Run your appropriate tools with socket option "SO_PRIORITY"
+// set to 3 for class A and/or to 2 for class B
+// (tools taken from https://www.spinics.net/lists/netdev/msg460869.html)
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+
+13)
+// Run your listener on the workstation
+// (tool taken from https://www.spinics.net/lists/netdev/msg460869.html)
+./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39000 kbps
+
+14)
+// Restore default configuration if needed
+$ ip link del eth0.100
+$ tc qdisc del dev eth1 root
+$ tc qdisc del dev eth0 root
+net eth0: Prev FIFO2 is shaped
+net eth0: set FIFO3 bw = 0
+net eth0: set FIFO2 bw = 0
+$ ethtool -L eth0 rx 1 tx 1
+
+*********************************************************************
+*********************************************************************
+*********************************************************************
+Example 2: Two port tx AVB configuration scheme for target board
+----------------------------------------------------------------------
+(prints and scheme for AM572x EVM, for dual emac boards only)
+
++------------------------------------------------------------------+ u
+| +----------+ +----------+ +------+ +----------+ +----------+ | s
+| | | | | | | | | | | | e
+| | App 1 | | App 2 | | Apps | | App 3 | | App 4 | | r
+| | Class A | | Class B | | Rest | | Class B | | Class A | |
+| | Eth0 | | Eth0 | | | | | Eth1 | | Eth1 | | s
+| | VLAN100 | | VLAN100 | | | | | VLAN100 | | VLAN100 | | p
+| | 40 Mb/s | | 20 Mb/s | | | | | 10 Mb/s | | 30 Mb/s | | a
+| | SO_PRI=3 | | SO_PRI=2 | | | | | SO_PRI=3 | | SO_PRI=2 | | c
+| | | | | | | | | | | | | | | | | e
+| +---|------+ +---|------+ +---|--+ +---|------+ +---|------+ |
++-----|-------------|-------------|---------|-------------|--------+
+ +-+ +-------+ | +----------+ +----+
+ | | +-------+------+ | |
+ | | | | | |
++---|-------|-------------|--------------|-------------|-------|---+
+| +----+ +----+ +----+ +----+ +-+--+ +----+ +----+ +----+ |
+| | p3 | | p2 | | p1 | | p0 | | p0 | | p1 | | p2 | | p3 | | k
+| \ / \ / \ / \ / \ / \ / \ / \ / | e
+| \ / \ / \ / \ / \ / \ / \ / \ / | r
+| \/ \/ \/ \/ \/ \/ \/ \/ | n
+| | | | | | | | e
+| | | +----+ +----+ | | | l
+| | | | | | | |
+| +----+ +----+ +----+ +----+ +----+ +----+ | s
+| |tc0 | |tc1 | |tc2 | |tc2 | |tc1 | |tc0 | | p
+| \ / \ / \ / \ / \ / \ / | a
+| \ / \ / \ / \ / \ / \ / | c
+| \/ \/ \/ \/ \/ \/ | e
+| | | +-----+ +-----+ | | |
+| | | | | | | | | |
+| | | | | | | | | |
+| | | | | E E | | | | |
+| +----+ +----+ +----+ +----+ t t +----+ +----+ +----+ +----+ |
+| |txq0| |txq1| |txq4| |txq5| h h |txq6| |txq7| |txq3| |txq2| |
+| \ / \ / \ / \ / 0 1 \ / \ / \ / \ / |
+| \ / \ / \ / \ / . . \ / \ / \ / \ / |
+| \/ \/ \/ \/ 1 1 \/ \/ \/ \/ |
+| +-|------|------|------|--+ 0 0 +-|------|------|------|--+ |
+| | | | | | | 0 0 | | | | | | |
++---|------|------|------|---------------|------|------|------|----+
+ | | | | | | | |
+ p p p p p p p p
+ 3 2 0-1, 4-7 <-L2 pri-> 0-1, 4-7 2 3
+ | | | | | | | |
+ | | | | | | | |
++---|------|------|------|---------------|------|------|------|----+
+| | | | | | | | | |
+| +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
+| |dma7| |dma6| |dma3| |dma2| |dma1| |dma0| |dma4| |dma5| |
+| \ / \ / \ / \ / \ / \ / \ / \ / | c
+| \S / \S / \ / \ / \ / \ / \S / \S / | p
+| \/ \/ \/ \/ \/ \/ \/ \/ | s
+| | | | +----- | | | | | w
+| | | | | +----+ | | | |
+| | | | | | | | | | d
+| +----+ +----+ +----+p p+----+ +----+ +----+ | r
+| | | | | | |o o| | | | | | | i
+| | f3 | | f2 | | f0 |r CPSW r| f3 | | f2 | | f0 | | v
+| |tc0 | |tc1 | |tc2 |t t|tc0 | |tc1 | |tc2 | | e
+| \CBS / \CBS / \CBS /1 2\CBS / \CBS / \CBS / | r
+| \S / \S / \ / \S / \S / \ / |
+| \/ \/ \/ \/ \/ \/ |
++------------------------------------------------------------------+
+========================================Eth==========================>
+
+1)
+// Add 8 tx queues. They are common, so they are accessed by both
+// interfaces, Eth0 and Eth1.
+$ ethtool -L eth1 rx 1 tx 8
+rx unmodified, ignoring
+
+2)
+// Check if num of queues is set correctly:
+$ ethtool -l eth0
+Channel parameters for eth0:
+Pre-set maximums:
+RX: 8
+TX: 8
+Other: 0
+Combined: 0
+Current hardware settings:
+RX: 1
+TX: 8
+Other: 0
+Combined: 0
+
+3)
+// TX queues must be rated starting from 0, so set bws for tx0 and tx1 for Eth0
+// and for tx2 and tx3 for Eth1. That is, rates 40 and 20 Mb/s respectively
+// for Eth0 and 30 and 10 Mb/s for Eth1.
+// Real speed can differ a bit due to discreteness
+// Leave last 4 tx queues as not rated
+$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+$ echo 30 > /sys/class/net/eth1/queues/tx-2/tx_maxrate
+$ echo 10 > /sys/class/net/eth1/queues/tx-3/tx_maxrate
+
+4)
+// Check maximum rate of tx (cpdma) queues:
+$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+40
+20
+30
+10
+0
+0
+0
+0
+
+5)
+// Map skb->priority to traffic class for Eth0:
+// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+// Map traffic class to transmit queue:
+// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq4, txq5)
+$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@4 hw 1
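Each entry of the queues argument in step 5 has the form count@offset and
assigns a contiguous range of tx queues to one traffic class, in class
order. A minimal sketch for decoding such a list is shown here; the helper
name mqprio_queues_show is hypothetical and it only prints text.

```shell
# Hypothetical helper: print the tx queue range assigned to each traffic
# class by an mqprio "queues" list of count@offset entries.
mqprio_queues_show() {
    tc=0
    for spec in "$@"; do
        count=${spec%@*}     # part before '@'
        offset=${spec#*@}    # part after '@'
        last=$((offset + count - 1))
        echo "tc$tc -> txq$offset..txq$last"
        tc=$((tc + 1))
    done
}

# Eth0 in this example: tc0 -> txq0, tc1 -> txq1, tc2 -> txq4..txq5
mqprio_queues_show 1@0 1@1 2@4
```

Running it with the Eth0 and Eth1 queue lists side by side makes it easy to
confirm that the two interfaces do not share a tx queue.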
+
+6)
+// Check classes settings
+$ tc -g class show dev eth0
++---(100:ffe2) mqprio
+| +---(100:5) mqprio
+| +---(100:6) mqprio
+|
++---(100:ffe1) mqprio
+| +---(100:2) mqprio
+|
++---(100:ffe0) mqprio
+ +---(100:1) mqprio
+
+7)
+// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc for Eth0
+// here only idle slope is important, other args are ignored
+// Real speed can differ a bit due to discreteness
+$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1470 \
+hicredit 62 sendslope -959000 idleslope 41000 offload 1
+net eth0: set FIFO3 bw = 50
+
+8)
+// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc for Eth0
+$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
+hicredit 65 sendslope -979000 idleslope 21000 offload 1
+net eth0: set FIFO2 bw = 30
+
+9)
+// Create vlan 100 to map sk->priority to vlan qos for Eth0
+$ ip link add link eth0 name eth0.100 type vlan id 100
+net eth0: Adding vlanid 100 to vlan filter
+
+10)
+// Map skb->priority to L2 prio for Eth0.100, one to one
+$ ip link set eth0.100 type vlan \
+egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11)
+// Check egress map for vlan 100
+$ cat /proc/net/vlan/eth0.100
+[...]
+INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12)
+// Map skb->priority to traffic class for Eth1:
+// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+// Map traffic class to transmit queue:
+// tc0 -> txq2, tc1 -> txq3, tc2 -> (txq6, txq7)
+$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 3 \
+map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@2 1@3 2@6 hw 1
+
+13)
+// Check classes settings
+$ tc -g class show dev eth1
++---(100:ffe2) mqprio
+| +---(100:7) mqprio
+| +---(100:8) mqprio
+|
++---(100:ffe1) mqprio
+| +---(100:4) mqprio
+|
++---(100:ffe0) mqprio
+ +---(100:3) mqprio
+
+14)
+// Set rate for class A - 31 Mbit (tc0, txq2) using CBS Qdisc for Eth1
+// here only idle slope is important, other args are ignored
+// Set it +1 Mb for reserve (important!)
+$ tc qdisc add dev eth1 parent 100:3 cbs locredit -1453 \
+hicredit 47 sendslope -969000 idleslope 31000 offload 1
+net eth1: set FIFO3 bw = 40
+
+15)
+// Set rate for class B - 11 Mbit (tc1, txq3) using CBS Qdisc for Eth1
+// Set it +1 Mb for reserve (important!)
+$ tc qdisc add dev eth1 parent 100:4 cbs locredit -1483 \
+hicredit 34 sendslope -989000 idleslope 11000 offload 1
+net eth1: set FIFO2 bw = 20
+
+16)
+// Create vlan 100 to map sk->priority to vlan qos for Eth1
+$ ip link add link eth1 name eth1.100 type vlan id 100
+net eth1: Adding vlanid 100 to vlan filter
+
+17)
+// Map skb->priority to L2 prio for Eth1.100, one to one
+$ ip link set eth1.100 type vlan \
+egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+18)
+// Check egress map for vlan 100
+$ cat /proc/net/vlan/eth1.100
+[...]
+INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+19)
+// Run appropriate tools with socket option "SO_PRIORITY" set to 3
+// for class A and to 2 for class B, for both interfaces
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p2 -s 1500&
+./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p3 -s 1500&
+
+20)
+// Run your listeners on the workstations
+// (tool taken from https://www.spinics.net/lists/netdev/msg460869.html)
+./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39012 kbps
+Receiving data rate: 39000 kbps
+
+21)
+// Restore default configuration if needed
+$ ip link del eth1.100
+$ ip link del eth0.100
+$ tc qdisc del dev eth1 root
+net eth1: Prev FIFO2 is shaped
+net eth1: set FIFO3 bw = 0
+net eth1: set FIFO2 bw = 0
+$ tc qdisc del dev eth0 root
+net eth0: Prev FIFO2 is shaped
+net eth0: set FIFO3 bw = 0
+net eth0: set FIFO2 bw = 0
+$ ethtool -L eth0 rx 1 tx 1
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* [RFC PATCH 0/6] net: ethernet: ti: cpsw: add MQPRIO and CBS Qdisc offload
From: Ivan Khoronzhuk @ 2018-05-18 21:15 UTC (permalink / raw)
To: grygorii.strashko, davem
Cc: corbet, akpm, netdev, linux-doc, linux-kernel, linux-omap,
vinicius.gomes, henrik, jesus.sanchez-palencia, Ivan Khoronzhuk
This series adds MQPRIO and CBS Qdisc offload for the TI cpsw driver.
It can potentially be used in audio video bridging (AVB) and time
sensitive networking (TSN).
The patchset was tested on AM572x EVM and BBB boards. The last patch of
this series adds a detailed description of the configuration with
examples. For consistency, the tools from the patchset "TSN: Add qdisc
based config interface for CBS" were used in the roles of talker and
listener; they can be seen here: https://www.spinics.net/lists/netdev/msg460869.html
Based on net-next/master
Ivan Khoronzhuk (6):
net: ethernet: ti: cpsw: use cpdma channels in backward order for txq
net: ethernet: ti: cpdma: fit rated channels in backward order
net: ethernet: ti: cpsw: add MQPRIO Qdisc offload
net: ethernet: ti: cpsw: add CBS Qdisc offload
net: ethernet: ti: cpsw: restore shaper configuration while down/up
Documentation: networking: cpsw: add MQPRIO & CBS offload examples
Documentation/networking/cpsw.txt | 540 ++++++++++++++++++++++++
drivers/net/ethernet/ti/cpsw.c | 364 +++++++++++++++-
drivers/net/ethernet/ti/davinci_cpdma.c | 31 +-
3 files changed, 913 insertions(+), 22 deletions(-)
create mode 100644 Documentation/networking/cpsw.txt
--
2.17.0