Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

From: Yuri Norov <ynorov@marvell.com>

For nohz_full CPUs the desirable behavior is to receive interrupts
generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
obviously not desirable because it breaks isolation.

This patch adds check for it.

Signed-off-by: Yuri Norov <ynorov@marvell.com>
[abelits@marvell.com: updated, only exclude CPUs running isolated tasks]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 kernel/time/tick-sched.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a213952541db..6c8679e200f0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -20,6 +20,7 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
 #include <linux/sched/nohz.h>
+#include <linux/isolation.h>
 #include <linux/module.h>
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
@@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
  */
 void tick_nohz_full_kick_cpu(int cpu)
 {
-	if (!tick_nohz_full_cpu(cpu))
+	smp_rmb();
+	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
 		return;
 
 	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
-- 
2.20.1


^ permalink raw reply related

* [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
From: Alex Belits @ 2020-11-23 17:58 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

In do_notify_resume(), call task_isolation_before_pending_work_check()
first, to report isolation breaking, then after handling all pending
work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.

Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we
don't clear _TIF_TASK_ISOLATION in the loop.

Early kernel entry code calls task_isolation_kernel_enter(). In
particular:

Vectors:
el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
el1_irq -> asm_nmi_enter(), handle_arch_irq()
el1_error -> do_serror()
el0_sync -> el0_sync_handler()
el0_irq -> handle_arch_irq()
el0_error -> do_serror()
el0_sync_compat -> el0_sync_compat_handler()
el0_irq_compat -> handle_arch_irq()
el0_error_compat -> do_serror()

SDEI entry:
__sdei_asm_handler -> __sdei_handler() -> nmi_enter()

Functions called from there:
asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()

Handlers:
do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
  or task_isolation_kernel_enter()
el1_sync_handler() -> task_isolation_kernel_enter()
el0_sync_handler() -> task_isolation_kernel_enter()
el0_sync_compat_handler() -> task_isolation_kernel_enter()

handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
There is a separate patch for irqchips that do not follow this rule.

handle_domain_irq() -> task_isolation_kernel_enter()
do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
nmi_enter() -> task_isolation_kernel_enter()

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: simplified to match kernel 5.10]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/barrier.h     |  1 +
 arch/arm64/include/asm/thread_info.h |  7 +++++--
 arch/arm64/kernel/entry-common.c     |  7 +++++++
 arch/arm64/kernel/ptrace.c           | 10 ++++++++++
 arch/arm64/kernel/signal.c           | 13 ++++++++++++-
 arch/arm64/kernel/smp.c              |  3 +++
 7 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1515f6f153a0..fc958d8d8945 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -141,6 +141,7 @@ config ARM64
 	select HAVE_ARCH_PREL32_RELOCATIONS
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_TASK_ISOLATION
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index c3009b0e5239..ad5a6dd380cf 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -49,6 +49,7 @@
 #define dma_rmb()	dmb(oshld)
 #define dma_wmb()	dmb(oshst)
 
+#define instr_sync()	isb()
 /*
  * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz
  * and 0 otherwise.
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1fbab854a51b..3321c69c46fe 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -68,6 +68,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define TIF_UPROBE		4	/* uprobe breakpoint or singlestep */
 #define TIF_FSCHECK		5	/* Check FS is USER_DS on return */
 #define TIF_MTE_ASYNC_FAULT	6	/* MTE Asynchronous Tag Check Fault */
+#define TIF_TASK_ISOLATION	7	/* task isolation enabled for task */
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	9	/* syscall auditing */
 #define TIF_SYSCALL_TRACEPOINT	10	/* syscall tracepoint for ftrace */
@@ -87,6 +88,7 @@ void arch_release_task_struct(struct task_struct *tsk);
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_FOREIGN_FPSTATE	(1 << TIF_FOREIGN_FPSTATE)
+#define _TIF_TASK_ISOLATION	(1 << TIF_TASK_ISOLATION)
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
@@ -101,11 +103,12 @@ void arch_release_task_struct(struct task_struct *tsk);
 
 #define _TIF_WORK_MASK		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | \
 				 _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \
-				 _TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT)
+				 _TIF_UPROBE | _TIF_FSCHECK | \
+				 _TIF_MTE_ASYNC_FAULT | _TIF_TASK_ISOLATION)
 
 #define _TIF_SYSCALL_WORK	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SYSCALL_TRACEPOINT | _TIF_SECCOMP | \
-				 _TIF_SYSCALL_EMU)
+				 _TIF_SYSCALL_EMU | _TIF_TASK_ISOLATION)
 
 #ifdef CONFIG_SHADOW_CALL_STACK
 #define INIT_SCS							\
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 43d4c329775f..8152760de683 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -8,6 +8,7 @@
 #include <linux/context_tracking.h>
 #include <linux/ptrace.h>
 #include <linux/thread_info.h>
+#include <linux/isolation.h>
 
 #include <asm/cpufeature.h>
 #include <asm/daifflags.h>
@@ -77,6 +78,8 @@ asmlinkage void notrace el1_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_DABT_CUR:
 	case ESR_ELx_EC_IABT_CUR:
@@ -249,6 +252,8 @@ asmlinkage void notrace el0_sync_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC64:
 		el0_svc(regs);
@@ -321,6 +326,8 @@ asmlinkage void notrace el0_sync_compat_handler(struct pt_regs *regs)
 {
 	unsigned long esr = read_sysreg(esr_el1);
 
+	task_isolation_kernel_enter();
+
 	switch (ESR_ELx_EC(esr)) {
 	case ESR_ELx_EC_SVC32:
 		el0_svc_compat(regs);
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index f49b349e16a3..2941f2b16796 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -29,6 +29,7 @@
 #include <linux/regset.h>
 #include <linux/tracehook.h>
 #include <linux/elf.h>
+#include <linux/isolation.h>
 
 #include <asm/compat.h>
 #include <asm/cpufeature.h>
@@ -1803,6 +1804,15 @@ int syscall_trace_enter(struct pt_regs *regs)
 			return NO_SYSCALL;
 	}
 
+	/*
+	 * In task isolation mode, we may prevent the syscall from
+	 * running, and if so we also deliver a signal to the process.
+	 */
+	if (test_thread_flag(TIF_TASK_ISOLATION)) {
+		if (task_isolation_syscall(regs->syscallno) == -1)
+			return NO_SYSCALL;
+	}
+
 	/* Do the secure computing after ptrace; failures should be fast. */
 	if (secure_computing() == -1)
 		return NO_SYSCALL;
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index a8184cad8890..e3a82b75e39d 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/tracehook.h>
 #include <linux/ratelimit.h>
 #include <linux/syscalls.h>
+#include <linux/isolation.h>
 
 #include <asm/daifflags.h>
 #include <asm/debug-monitors.h>
@@ -911,6 +912,11 @@ static void do_signal(struct pt_regs *regs)
 	restore_saved_sigmask();
 }
 
+#define NOTIFY_RESUME_LOOP_FLAGS \
+	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_RESUME | \
+	 _TIF_FOREIGN_FPSTATE | _TIF_UPROBE | _TIF_FSCHECK | \
+	 _TIF_MTE_ASYNC_FAULT)
+
 asmlinkage void do_notify_resume(struct pt_regs *regs,
 				 unsigned long thread_flags)
 {
@@ -921,6 +927,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 	 */
 	trace_hardirqs_off();
 
+	task_isolation_before_pending_work_check();
+
 	do {
 		/* Check valid user FS if needed */
 		addr_limit_user_check();
@@ -956,7 +964,10 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
 
 		local_daif_mask();
 		thread_flags = READ_ONCE(current_thread_info()->flags);
-	} while (thread_flags & _TIF_WORK_MASK);
+	} while (thread_flags & NOTIFY_RESUME_LOOP_FLAGS);
+
+	if (thread_flags & _TIF_TASK_ISOLATION)
+		task_isolation_start();
 }
 
 unsigned long __ro_after_init signal_minsigstksz;
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 18e9727d3f64..4401eac4710c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -33,6 +33,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/kexec.h>
 #include <linux/kvm_host.h>
+#include <linux/isolation.h>
 
 #include <asm/alternative.h>
 #include <asm/atomic.h>
@@ -890,6 +891,8 @@ static void do_handle_IPI(int ipinr)
 {
 	unsigned int cpu = smp_processor_id();
 
+	task_isolation_kernel_enter();
+
 	if ((unsigned)ipinr < NR_IPI)
 		trace_ipi_entry_rcuidle(ipi_types[ipinr]);
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH v5 5/9] task_isolation: Add driver-specific hooks
From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

Some drivers don't call functions that call
task_isolation_kernel_enter() in interrupt handlers. Call it
directly.

Signed-off-by: Alex Belits <abelits@marvell.com>
---
 drivers/irqchip/irq-armada-370-xp.c | 6 ++++++
 drivers/irqchip/irq-gic-v3.c        | 3 +++
 drivers/irqchip/irq-gic.c           | 3 +++
 drivers/s390/cio/cio.c              | 3 +++
 4 files changed, 15 insertions(+)

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index d7eb2e93db8f..4ac7babe1abe 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -29,6 +29,7 @@
 #include <linux/slab.h>
 #include <linux/syscore_ops.h>
 #include <linux/msi.h>
+#include <linux/isolation.h>
 #include <asm/mach/arch.h>
 #include <asm/exception.h>
 #include <asm/smp_plat.h>
@@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = {
 static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 {
 	u32 msimask, msinr;
+	int isol_entered = 0;
 
 	msimask = readl_relaxed(per_cpu_int_base +
 				ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
@@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained)
 			continue;
 
 		if (is_chained) {
+			if (!isol_entered) {
+				task_isolation_kernel_enter();
+				isol_entered = 1;
+			}
 			irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
 					       msinr - PCI_MSI_DOORBELL_START);
 			generic_handle_irq(irq);
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 16fecc0febe8..ded26dd4da0f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -18,6 +18,7 @@
 #include <linux/percpu.h>
 #include <linux/refcount.h>
 #include <linux/slab.h>
+#include <linux/isolation.h>
 
 #include <linux/irqchip.h>
 #include <linux/irqchip/arm-gic-common.h>
@@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs
 {
 	u32 irqnr;
 
+	task_isolation_kernel_enter();
+
 	irqnr = gic_read_iar();
 
 	if (gic_supports_nmi() &&
diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
index 6053245a4754..bb482b4ae218 100644
--- a/drivers/irqchip/irq-gic.c
+++ b/drivers/irqchip/irq-gic.c
@@ -35,6 +35,7 @@
 #include <linux/interrupt.h>
 #include <linux/percpu.h>
 #include <linux/slab.h>
+#include <linux/isolation.h>
 #include <linux/irqchip.h>
 #include <linux/irqchip/chained_irq.h>
 #include <linux/irqchip/arm-gic.h>
@@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs)
 	struct gic_chip_data *gic = &gic_data[0];
 	void __iomem *cpu_base = gic_data_cpu_base(gic);
 
+	task_isolation_kernel_enter();
+
 	do {
 		irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
 		irqnr = irqstat & GICC_IAR_INT_ID_MASK;
diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
index 6d716db2a46a..beab88881b6d 100644
--- a/drivers/s390/cio/cio.c
+++ b/drivers/s390/cio/cio.c
@@ -20,6 +20,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/interrupt.h>
 #include <linux/irq.h>
+#include <linux/isolation.h>
 #include <asm/cio.h>
 #include <asm/delay.h>
 #include <asm/irq.h>
@@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
 	struct irb *irb;
 	int irq_context;
 
+	task_isolation_kernel_enter();
+
 	irb = this_cpu_ptr(&cio_irb);
 	/* Store interrupt response block to lowcore. */
 	if (tsch(sch->schid, irb) != 0)
-- 
2.20.1


^ permalink raw reply related

* [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code
From: Alex Belits @ 2020-11-23 17:57 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

Kernel entry and exit functions for task isolation are added to context
tracking and common entry points. Common handling of pending work on exit
to userspace now processes isolation breaking, cleanup and start.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
[abelits@marvell.com: adapted for kernel 5.10]
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/hardirq.h   |  2 ++
 include/linux/sched.h     |  2 ++
 kernel/context_tracking.c |  5 +++++
 kernel/entry/common.c     | 10 +++++++++-
 kernel/irq/irqdesc.c      |  5 +++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 754f67ac4326..b9e604ae6a0d 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -7,6 +7,7 @@
 #include <linux/lockdep.h>
 #include <linux/ftrace_irq.h>
 #include <linux/vtime.h>
+#include <linux/isolation.h>
 #include <asm/hardirq.h>
 
 extern void synchronize_irq(unsigned int irq);
@@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void);
 	do {							\
 		lockdep_off();					\
 		arch_nmi_enter();				\
+		task_isolation_kernel_enter();			\
 		printk_nmi_enter();				\
 		BUG_ON(in_nmi() == NMI_MASK);			\
 		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5d8b17aa544b..51c2d774250b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
 #include <linux/rseq.h>
 #include <linux/seqlock.h>
 #include <linux/kcsan.h>
+#include <linux/isolation.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk);
 #ifdef CONFIG_SMP
 static __always_inline void scheduler_ipi(void)
 {
+	task_isolation_kernel_enter();
 	/*
 	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 36a98c48aedc..379a48fd0e65 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -21,6 +21,7 @@
 #include <linux/hardirq.h>
 #include <linux/export.h>
 #include <linux/kprobes.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/context_tracking.h>
@@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state)
 		__this_cpu_write(context_tracking.state, state);
 	}
 	context_tracking_recursion_exit();
+
+	task_isolation_exit_to_user_mode();
 }
 EXPORT_SYMBOL_GPL(__context_tracking_enter);
 
@@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state)
 	if (!context_tracking_recursion_enter())
 		return;
 
+	task_isolation_kernel_enter();
+
 	if (__this_cpu_read(context_tracking.state) == state) {
 		if (__this_cpu_read(context_tracking.active)) {
 			/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e9e2df3f3f9e..10a520894105 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -4,6 +4,7 @@
 #include <linux/entry-common.h>
 #include <linux/livepatch.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -183,13 +184,20 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 static void exit_to_user_mode_prepare(struct pt_regs *regs)
 {
-	unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
+	unsigned long ti_work;
 
 	lockdep_assert_irqs_disabled();
 
+	task_isolation_before_pending_work_check();
+
+	ti_work = READ_ONCE(current_thread_info()->flags);
+
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
+	if (unlikely(ti_work & _TIF_TASK_ISOLATION))
+		task_isolation_start();
+
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
 	/* Ensure that the address limit is intact and no locks are held */
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..b8f0a7574f55 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -16,6 +16,7 @@
 #include <linux/bitmap.h>
 #include <linux/irqdomain.h>
 #include <linux/sysfs.h>
+#include <linux/isolation.h>
 
 #include "internals.h"
 
@@ -669,6 +670,8 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq,
 	unsigned int irq = hwirq;
 	int ret = 0;
 
+	task_isolation_kernel_enter();
+
 	irq_enter();
 
 #ifdef CONFIG_IRQ_DOMAIN
@@ -710,6 +713,8 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq,
 	unsigned int irq;
 	int ret = 0;
 
+	task_isolation_kernel_enter();
+
 	/*
 	 * NMI context needs to be setup earlier in order to deal with tracing.
 	 */
-- 
2.20.1


^ permalink raw reply related

* [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
From: Alex Belits @ 2020-11-23 17:56 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) to do
so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"isolcpus=nohz,domain,CPULIST" boot argument to enable
nohz_full and isolcpus. The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags. When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced. If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired. In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

The last stage of enabling task isolation happens in
task_isolation_exit_to_user_mode() that runs last before returning
to userspace and changes ll_isol_flags (see later) to prevent other
CPUs from interfering with isolated task.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will send it a signal to indicate
isolation loss. In addition to sending a signal, the code supports a
kernel command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting. Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.

Isolation also disables CPU state synchronization mechanisms that
are. normally done by IPI. In the future, more synchronization
mechanisms, such as TLB flushes, may be disabled for isolated tasks.
This requires careful handling of kernel entry from isolated task --
remote synchronization requests must be re-enabled and
synchronization procedure triggered, before anything other than
low-level kernel entry code is called. Same applies to exiting from
kernel to userspace after isolation is enabled.

For this purpose, per-CPU low-level flags ll_isol_flags are used to
indicate isolation state, and task_isolation_kernel_enter() is used
to safely clear them early in kernel entry. CPU mask corresponding
to isolation bit in ll_isol_flags is visible to userspace as
/sys/devices/system/cpu/isolation_running, and can be used for
monitoring.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 drivers/base/cpu.c                            |  23 +
 include/linux/hrtimer.h                       |   4 +
 include/linux/isolation.h                     | 326 ++++++++
 include/linux/sched.h                         |   5 +
 include/linux/tick.h                          |   3 +
 include/uapi/linux/prctl.h                    |   6 +
 init/Kconfig                                  |  27 +
 kernel/Makefile                               |   2 +
 kernel/isolation.c                            | 714 ++++++++++++++++++
 kernel/signal.c                               |   2 +
 kernel/sys.c                                  |   6 +
 kernel/time/hrtimer.c                         |  27 +
 kernel/time/tick-sched.c                      |  18 +
 14 files changed, 1169 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 44fde25bb221..4b28f2022c98 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5250,6 +5250,12 @@
 			neutralize any effect of /proc/sys/kernel/sysrq.
 			Useful for debugging.
 
+	task_isolation_debug	[KNL]
+			In kernels built with CONFIG_TASK_ISOLATION, this
+			setting will generate console backtraces to
+			accompany the diagnostics generated about
+			interrupting tasks running with task isolation.
+
 	tcpmhash_entries= [KNL,NET]
 			Set the number of tcp_metrics_hash slots.
 			Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 8f1d6569564c..afa6163203c7 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -284,6 +284,26 @@ static ssize_t print_cpus_isolated(struct device *dev,
 }
 static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t isolation_running_show(struct device *dev,
+				      struct device_attribute *attr, char *buf)
+{
+	int n;
+	cpumask_var_t isolation_running;
+
+	if (!zalloc_cpumask_var(&isolation_running, GFP_KERNEL))
+		return -ENOMEM;
+
+	task_isolation_cpumask(isolation_running);
+	n = sprintf(buf, "%*pbl\n", cpumask_pr_args(isolation_running));
+
+	free_cpumask_var(isolation_running);
+
+	return n;
+}
+static DEVICE_ATTR_RO(isolation_running);
+#endif
+
 #ifdef CONFIG_NO_HZ_FULL
 static ssize_t print_cpus_nohz_full(struct device *dev,
 				    struct device_attribute *attr, char *buf)
@@ -471,6 +491,9 @@ static struct attribute *cpu_root_attrs[] = {
 #ifdef CONFIG_NO_HZ_FULL
 	&dev_attr_nohz_full.attr,
 #endif
+#ifdef CONFIG_TASK_ISOLATION
+	&dev_attr_isolation_running.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
 	&dev_attr_modalias.attr,
 #endif
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 107cedd7019a..9a6c9ce95080 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -529,6 +529,10 @@ extern void __init hrtimers_init(void);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);
 
+#ifdef CONFIG_TASK_ISOLATION
+extern void kick_hrtimer(void);
+#endif
+
 int hrtimers_prepare_cpu(unsigned int cpu);
 #ifdef CONFIG_HOTPLUG_CPU
 int hrtimers_dead_cpu(unsigned int cpu);
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
new file mode 100644
index 000000000000..eb6c99bd2db4
--- /dev/null
+++ b/include/linux/isolation.h
@@ -0,0 +1,326 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Task isolation support
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+#ifndef _LINUX_ISOLATION_H
+#define _LINUX_ISOLATION_H
+
+#include <stdarg.h>
+#include <linux/errno.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+#include <linux/prctl.h>
+#include <linux/types.h>
+
+struct task_struct;
+
+#ifdef CONFIG_TASK_ISOLATION
+
+/*
+ * Logging
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...);
+
+#define pr_task_isol_emerg(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, false, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, false, fmt, ##__VA_ARGS__)
+
+#define pr_task_isol_emerg_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_EMERG, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_alert_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_ALERT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_crit_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_CRIT, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_err_supp(cpu, fmt, ...)				\
+	task_isolation_message(cpu, LOGLEVEL_ERR, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_warn_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_WARNING, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_notice_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_NOTICE, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_info_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_INFO, true, fmt, ##__VA_ARGS__)
+#define pr_task_isol_debug_supp(cpu, fmt, ...)			\
+	task_isolation_message(cpu, LOGLEVEL_DEBUG, true, fmt, ##__VA_ARGS__)
+
+#define BIT_LL_TASK_ISOLATION		(0)
+#define BIT_LL_TASK_ISOLATION_BROKEN	(1)
+#define BIT_LL_TASK_ISOLATION_REQUEST	(2)
+#define FLAG_LL_TASK_ISOLATION		(1 << BIT_LL_TASK_ISOLATION)
+#define FLAG_LL_TASK_ISOLATION_BROKEN	(1 << BIT_LL_TASK_ISOLATION_BROKEN)
+#define FLAG_LL_TASK_ISOLATION_REQUEST	(1 << BIT_LL_TASK_ISOLATION_REQUEST)
+
+DECLARE_PER_CPU(unsigned long, ll_isol_flags);
+extern cpumask_var_t task_isolation_map;
+
+/**
+ * task_isolation_request() - prctl hook to request task isolation
+ * @flags:	Flags from <linux/prctl.h> PR_TASK_ISOLATION_xxx.
+ *
+ * This is called from the generic prctl() code for PR_TASK_ISOLATION.
+ *
+ * Return: Returns 0 when task isolation enabled, otherwise a negative
+ * errno.
+ */
+extern int task_isolation_request(unsigned int flags);
+
+/**
+ * task_isolation_kernel_enter() - clear low-level task isolation flag
+ *
+ * This should be called immediately after entering kernel. It must
+ * be inline, and suitable for running after leaving isolated
+ * userspace in a "stale" state when synchronization is required
+ * before the CPU can safely enter the rest of the kernel.
+ */
+static __always_inline void task_isolation_kernel_enter(void)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated.  When
+	 * CPU is running isolated task until this point anything that
+	 * causes an interrupt on this CPU must end up calling this
+	 * before touching the rest of kernel. That is, this function
+	 * or fast_task_isolation_cpu_cleanup() or stop_isolation()
+	 * calling it. If any interrupt, including scheduling timer,
+	 * arrives, it will still end up here early after entering
+	 * kernel.  From this point interrupts are disabled until all
+	 * CPUs will see that this CPU is no longer running isolated
+	 * task.
+	 *
+	 * See also fast_task_isolation_cpu_cleanup().
+	 */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == 0)
+		return;
+
+	raw_local_irq_save(flags);
+
+	/* Change low-level flags to indicate broken isolation */
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_BROKEN);
+
+	/*
+	 * If something happened that requires a barrier that would
+	 * otherwise be called from remote CPUs by CPU kick procedure,
+	 * this barrier runs instead of it. After this barrier, CPU
+	 * kick procedure would see the updated ll_isol_flags, so it
+	 * will run its own IPI to trigger a barrier.
+	 */
+	smp_mb();
+	/*
+	 * Synchronize instructions -- this CPU was not kicked while
+	 * in isolated mode, so it might require synchronization.
+	 * There might be an IPI if kick procedure happened and
+	 * ll_isol_flags was already updated while it assembled a CPU
+	 * mask. However if this did not happen, synchronize everything
+	 * here.
+	 */
+	instr_sync();
+	raw_local_irq_restore(flags);
+}
+
+/**
+ * task_isolation_exit_to_user_mode() - set low-level task isolation flag
+ * if task isolation is requested
+ *
+ * This should be called immediately before exiting kernel. It must
+ * be inline, and the state of CPI may become "stale" between setting
+ * the flag and returning to the userspace.
+ */
+static __always_inline void task_isolation_exit_to_user_mode(void)
+{
+	unsigned long flags;
+
+	/* Check if this task is entering isolation */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_REQUEST)
+	    == 0)
+		return;
+	raw_local_irq_save(flags);
+
+	/* Set low-level flags */
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION);
+	/*
+	 * After this barrier the rest of the system stops using IPIs
+	 * to synchronize this CPU state. Since only exit to userspace
+	 * follows, this is safe. Synchronization will happen again in
+	 * task_isolation_enter() when this CPU will enter kernel.
+	 */
+	smp_mb();
+	/*
+	 * From this point this is recognized as isolated by
+	 * other CPUs
+	 */
+	raw_local_irq_restore(flags);
+}
+
+extern void task_isolation_cpu_cleanup(void);
+/**
+ * task_isolation_start() - attempt to actually start task isolation
+ *
+ * This function should be invoked as the last thing prior to returning to
+ * user space if TIF_TASK_ISOLATION is set in the thread_info flags.  It
+ * will attempt to quiesce the core and enter task-isolation mode.  If it
+ * fails, it will reset the system call return value to an error code that
+ * indicates the failure mode.
+ */
+extern void task_isolation_start(void);
+
+/**
+ * is_isolation_cpu() - check if CPU is intended for running isolated tasks.
+ * @cpu:	CPU to check.
+ */
+static inline bool is_isolation_cpu(int cpu)
+{
+	return task_isolation_map != NULL &&
+		cpumask_test_cpu(cpu, task_isolation_map);
+}
+
+/**
+ * task_isolation_on_cpu() - check if the cpu is running isolated task
+ * @cpu:	CPU to check.
+ */
+static inline int task_isolation_on_cpu(int cpu)
+{
+	return test_bit(BIT_LL_TASK_ISOLATION, &per_cpu(ll_isol_flags, cpu));
+}
+
+/**
+ * task_isolation_cpumask() - set CPUs currently running isolated tasks
+ * @mask:	Mask to modify.
+ */
+extern void task_isolation_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_clear_cpumask() - clear CPUs currently running isolated tasks
+ * @mask:      Mask to modify.
+ */
+extern void task_isolation_clear_cpumask(struct cpumask *mask);
+
+/**
+ * task_isolation_syscall() - report a syscall from an isolated task
+ * @nr:		The syscall number.
+ *
+ * This routine should be invoked at syscall entry if TIF_TASK_ISOLATION is
+ * set in the thread_info flags.  It checks for valid syscalls,
+ * specifically prctl() with PR_TASK_ISOLATION, exit(), and exit_group().
+ * For any other syscall it will raise a signal and return failure.
+ *
+ * Return: 0 for acceptable syscalls, -1 for all others.
+ */
+extern int task_isolation_syscall(int nr);
+
+/**
+ * task_isolation_before_pending_work_check() - check for isolation breaking
+ *
+ * This routine is called from the code responsible for exiting to user mode,
+ * before the point when thread flags are checked for pending work.
+ * That function must be called if the current task is isolated, because
+ * TIF_TASK_ISOLATION must trigger a call to it.
+ */
+void task_isolation_before_pending_work_check(void);
+
+/**
+ * _task_isolation_interrupt() - report an interrupt of an isolated task
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked at any exception or IRQ if
+ * TIF_TASK_ISOLATION is set in the thread_info flags.  It is not necessary
+ * to invoke it if the exception will generate a signal anyway (e.g. a bad
+ * page fault), and in that case it is preferable not to invoke it but just
+ * rely on the standard Linux signal.  The macro task_isolation_syscall()
+ * wraps the TIF_TASK_ISOLATION flag test to simplify the caller code.
+ */
+extern void _task_isolation_interrupt(const char *fmt, ...);
+#define task_isolation_interrupt(fmt, ...)				\
+	do {								\
+		if (current_thread_info()->flags & _TIF_TASK_ISOLATION)	\
+			_task_isolation_interrupt(fmt, ## __VA_ARGS__);	\
+	} while (0)
+
+/**
+ * task_isolation_remote() - report a remote interrupt of an isolated task
+ * @cpu:	The remote cpu that is about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This routine should be invoked any time a remote IPI or other type of
+ * interrupt is being delivered to another cpu. The function will check to
+ * see if the target core is running a task-isolation task, and generate a
+ * diagnostic on the console if so; in addition, we tag the task so it
+ * doesn't generate another diagnostic when the interrupt actually arrives.
+ * Generating a diagnostic remotely yields a clearer indication of what
+ * happened then just reporting only when the remote core is interrupted.
+ *
+ */
+extern void task_isolation_remote(int cpu, const char *fmt, ...);
+
+/**
+ * task_isolation_remote_cpumask() - report interruption of multiple cpus
+ * @mask:	The set of remotes cpus that are about to be interrupted.
+ * @fmt:	A format string describing the interrupt
+ * @...:	Format arguments, if any.
+ *
+ * This is the cpumask variant of _task_isolation_remote().  We
+ * generate a single-line diagnostic message even if multiple remote
+ * task-isolation cpus are being interrupted.
+ */
+extern void task_isolation_remote_cpumask(const struct cpumask *mask,
+					  const char *fmt, ...);
+
+/**
+ * _task_isolation_signal() - disable task isolation when signal is pending
+ * @task:	The task for which to disable isolation.
+ *
+ * This function generates a diagnostic and disables task isolation;
+ * it should be called if TIF_TASK_ISOLATION is set when notifying a
+ * task of a pending signal. The task_isolation_interrupt() function
+ * normally generates a diagnostic for events that just interrupt a
+ * task without generating a signal; here we need to hook the paths
+ * that correspond to interrupts that do generate a signal.  The macro
+ * task_isolation_signal() wraps the TIF_TASK_ISOLATION flag test to
+ * simplify the caller code.
+ */
+extern void _task_isolation_signal(struct task_struct *task);
+#define task_isolation_signal(task)					\
+	do {								\
+		if (task_thread_info(task)->flags & _TIF_TASK_ISOLATION) \
+			_task_isolation_signal(task);			\
+	} while (0)
+
+#else /* !CONFIG_TASK_ISOLATION */
+static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
+static inline void task_isolation_kernel_enter(void) {}
+static inline void task_isolation_exit_to_user_mode(void) {}
+static inline void task_isolation_start(void) { }
+static inline bool is_isolation_cpu(int cpu) { return 0; }
+static inline int task_isolation_on_cpu(int cpu) { return 0; }
+static inline void task_isolation_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_clear_cpumask(struct cpumask *mask) { }
+static inline void task_isolation_cpu_cleanup(void) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_before_pending_work_check(void) { }
+static inline void task_isolation_signal(struct task_struct *task) { }
+#endif
+
+#endif /* _LINUX_ISOLATION_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..5d8b17aa544b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1316,6 +1316,11 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+	unsigned short			task_isolation_flags; /* prctl */
+	unsigned short			task_isolation_state;
+#endif
+
 #ifdef CONFIG_X86_MCE
 	void __user			*mce_vaddr;
 	__u64				mce_kflags;
diff --git a/include/linux/tick.h b/include/linux/tick.h
index 7340613c7eff..27c7c033d5a8 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -268,6 +268,9 @@ static inline void tick_dep_clear_signal(struct signal_struct *signal,
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
+#ifdef CONFIG_TASK_ISOLATION
+extern int try_stop_full_tick(void);
+#endif
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
 static inline bool tick_nohz_full_cpu(int cpu) { return false; }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..bf72afb0364f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -247,4 +247,10 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Enable task_isolation mode for TASK_ISOLATION kernels. */
+#define PR_TASK_ISOLATION		48
+# define PR_TASK_ISOLATION_ENABLE	(1 << 0)
+# define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits)	(((bits) >> 8) & 0x7f)
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index c9446911cf41..f708580fb56d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -648,6 +648,33 @@ config CPU_ISOLATION
 
 source "kernel/rcu/Kconfig"
 
+config HAVE_ARCH_TASK_ISOLATION
+	bool
+
+config TASK_ISOLATION
+	bool "Provide hard CPU isolation from the kernel on demand"
+	depends on NO_HZ_FULL && HAVE_ARCH_TASK_ISOLATION
+	help
+	  Allow userspace processes that place themselves on cores with
+	  nohz_full and isolcpus enabled, and run prctl(PR_TASK_ISOLATION),
+	  to "isolate" themselves from the kernel.  Prior to returning to
+	  userspace, isolated tasks will arrange that no future kernel
+	  activity will interrupt the task while the task is running in
+	  userspace.  Attempting to re-enter the kernel while in this mode
+	  will cause the task to be terminated with a signal; you must
+	  explicitly use prctl() to disable task isolation before resuming
+	  normal use of the kernel.
+
+	  This "hard" isolation from the kernel is required for userspace
+	  tasks that are running hard real-time tasks in userspace, such as
+	  a high-speed network driver in userspace.  Without this option, but
+	  with NO_HZ_FULL enabled, the kernel will make a best-faith, "soft"
+	  effort to shield a single userspace process from interrupts, but
+	  makes no guarantees.
+
+	  You should say "N" unless you are intending to run a
+	  high-performance userspace driver or similar task.
+
 config BUILD_BIN2C
 	bool
 	default n
diff --git a/kernel/Makefile b/kernel/Makefile
index af601b9bda0e..8a8a361dc1a5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -133,6 +133,8 @@ KCOV_INSTRUMENT_stackleak.o := n
 
 obj-$(CONFIG_SCF_TORTURE_TEST) += scftorture.o
 
+obj-$(CONFIG_TASK_ISOLATION) += isolation.o
+
 $(obj)/configs.o: $(obj)/config_data.gz
 
 targets += config_data.gz
diff --git a/kernel/isolation.c b/kernel/isolation.c
new file mode 100644
index 000000000000..127b2a1cb0cb
--- /dev/null
+++ b/kernel/isolation.c
@@ -0,0 +1,714 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/vmstat.h>
+#include <linux/sched.h>
+#include <linux/isolation.h>
+#include <linux/syscalls.h>
+#include <linux/smp.h>
+#include <linux/tick.h>
+#include <asm/unistd.h>
+#include <asm/syscall.h>
+#include <linux/hrtimer.h>
+
+/*
+ * These values are stored in task_isolation_state.
+ * Note that STATE_NORMAL + TIF_TASK_ISOLATION means we are still
+ * returning from sys_prctl() to userspace.
+ */
+enum {
+	STATE_NORMAL = 0,	/* Not isolated */
+	STATE_ISOLATED = 1	/* In userspace, isolated */
+};
+
+/*
+ * Low-level isolation flags.
+ * Those flags are used by low-level isolation set/clear/check routines.
+ * Those flags should be set last before return to userspace and cleared
+ * first upon kernel entry, and synchronized to allow isolation breaking
+ * detection before touching potentially unsynchronized parts of kernel.
+ * Isolated task does not receive synchronization events of any kind, so
+ * at the time of the first entry into kernel it might not be ready to
+ * run most of the kernel code. However to perform synchronization
+ * properly, kernel entry code should also enable synchronization events
+ * at the same time. This presents a problem because more kernel code
+ * should run to determine the cause of isolation breaking, signals may
+ * have to be generated, etc. So some flag clearing and synchronization
+ * should happen in "low-level" entry code but processing of isolation
+ * breaking should happen in "high-level" code. Low-level isolation flags
+ * should be set in that low-level code, possibly long before the cause
+ * of isolation breaking is known. Symmetrically, entering isolation
+ * should disable synchronization events before returning to userspace
+ * but after all potentially volatile code is finished.
+ */
+DEFINE_PER_CPU(unsigned long, ll_isol_flags);
+
+/*
+ * Description of the last two tasks that ran isolated on a given CPU.
+ * This is intended only for messages about isolation breaking. We
+ * don't want any references to actual task while accessing this from
+ * CPU that caused isolation breaking -- we know nothing about timing
+ * and don't want to use locking or RCU.
+ */
+struct isol_task_desc {
+	atomic_t curr_index;
+	atomic_t curr_index_wr;
+	bool	warned[2];
+	pid_t	pid[2];
+	pid_t	tgid[2];
+	char	comm[2][TASK_COMM_LEN];
+};
+static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs);
+
+/*
+ * Counter for isolation exiting procedures (from request to the start of
+ * cleanup) being attempted at once on a CPU. Normally incrementing of
+ * this counter is performed from the CPU that caused isolation breaking,
+ * however decrementing is done from the cleanup procedure, delegated to
+ * the CPU that is exiting isolation, not from the CPU that caused isolation
+ * breaking.
+ *
+ * If incrementing this counter while starting isolation exit procedure
+ * results in a value greater than 0, isolation exiting is already in
+ * progress, and cleanup did not start yet. This means, counter should be
+ * decremented back, and isolation exit that is already in progress, should
+ * be allowed to complete. Otherwise, a new isolation exit procedure should
+ * be started.
+ */
+DEFINE_PER_CPU(atomic_t, isol_exit_counter);
+
+/*
+ * Descriptor for isolation-breaking SMP calls
+ */
+DEFINE_PER_CPU(call_single_data_t, isol_break_csd);
+
+cpumask_var_t task_isolation_map;
+cpumask_var_t task_isolation_cleanup_map;
+static DEFINE_SPINLOCK(task_isolation_cleanup_lock);
+
+/* We can run on cpus that are isolated from the scheduler and are nohz_full. */
+static int __init task_isolation_init(void)
+{
+	alloc_bootmem_cpumask_var(&task_isolation_cleanup_map);
+	if (alloc_cpumask_var(&task_isolation_map, GFP_KERNEL))
+		/*
+		 * At this point task isolation should match
+		 * nohz_full. This may change in the future.
+		 */
+		cpumask_copy(task_isolation_map, tick_nohz_full_mask);
+	return 0;
+}
+core_initcall(task_isolation_init)
+
+/* Enable stack backtraces of any interrupts of task_isolation cores. */
+static bool task_isolation_debug;
+static int __init task_isolation_debug_func(char *str)
+{
+	task_isolation_debug = true;
+	return 1;
+}
+__setup("task_isolation_debug", task_isolation_debug_func);
+
+/*
+ * Record name, pid and group pid of the task entering isolation on
+ * the current CPU.
+ */
+static void record_curr_isolated_task(void)
+{
+	int ind;
+	int cpu = smp_processor_id();
+	struct isol_task_desc *desc = &per_cpu(isol_task_descs, cpu);
+	struct task_struct *task = current;
+
+	/* Finish everything before recording current task */
+	smp_mb();
+	ind = atomic_inc_return(&desc->curr_index_wr) & 1;
+	desc->comm[ind][sizeof(task->comm) - 1] = '\0';
+	memcpy(desc->comm[ind], task->comm, sizeof(task->comm) - 1);
+	desc->pid[ind] = task->pid;
+	desc->tgid[ind] = task->tgid;
+	desc->warned[ind] = false;
+	/* Write everything, to be seen by other CPUs */
+	smp_mb();
+	atomic_inc(&desc->curr_index);
+	/* Everyone will see the new record from this point */
+	smp_mb();
+}
+
+/*
+ * Print message prefixed with the description of the current (or
+ * last) isolated task on a given CPU. Intended for isolation breaking
+ * messages that include target task for the user's convenience.
+ *
+ * Messages produced with this function may have obsolete task
+ * information if isolated tasks managed to exit, start and enter
+ * isolation multiple times, or multiple tasks tried to enter
+ * isolation on the same CPU at once. For those unusual cases it would
+ * contain a valid description of the cause for isolation breaking and
+ * target CPU number, just not the correct description of which task
+ * ended up losing isolation.
+ */
+int task_isolation_message(int cpu, int level, bool supp, const char *fmt, ...)
+{
+	struct isol_task_desc *desc;
+	struct task_struct *task;
+	va_list args;
+	char buf_prefix[TASK_COMM_LEN + 20 + 3 * 20];
+	char buf[200];
+	int curr_cpu, ind_counter, ind_counter_old, ind;
+
+	curr_cpu = get_cpu();
+	/* Barrier to synchronize with recording isolated task information */
+	smp_rmb();
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind_counter = atomic_read(&desc->curr_index);
+
+	if (curr_cpu == cpu) {
+		/*
+		 * Message is for the current CPU so current
+		 * task_struct should be used instead of cached
+		 * information.
+		 *
+		 * Like in other diagnostic messages, if issued from
+		 * interrupt context, current will be the interrupted
+		 * task. Unlike other diagnostic messages, this is
+		 * always relevant because the message is about
+		 * interrupting a task.
+		 */
+		ind = ind_counter & 1;
+		if (supp && desc->warned[ind]) {
+			/*
+			 * If supp is true, skip the message if the
+			 * same task was mentioned in the message
+			 * originated on remote CPU, and it did not
+			 * re-enter isolated state since then (warned
+			 * is true). Only local messages following
+			 * remote messages, likely about the same
+			 * isolation breaking event, are skipped to
+			 * avoid duplication. If remote cause is
+			 * immediately followed by a local one before
+			 * isolation is broken, local cause is skipped
+			 * from messages.
+			 */
+			put_cpu();
+			return 0;
+		}
+		task = current;
+		snprintf(buf_prefix, sizeof(buf_prefix),
+			 "isolation %s/%d/%d (cpu %d)",
+			 task->comm, task->tgid, task->pid, cpu);
+		put_cpu();
+	} else {
+		/*
+		 * Message is for remote CPU, use cached information.
+		 */
+		put_cpu();
+		/*
+		 * Make sure, index remained unchanged while data was
+		 * copied. If it changed, data that was copied may be
+		 * inconsistent because two updates in a sequence could
+		 * overwrite the data while it was being read.
+		 */
+		do {
+			/* Make sure we are reading up to date values */
+			smp_mb();
+			ind = ind_counter & 1;
+			snprintf(buf_prefix, sizeof(buf_prefix),
+				 "isolation %s/%d/%d (cpu %d)",
+				 desc->comm[ind], desc->tgid[ind],
+				 desc->pid[ind], cpu);
+			desc->warned[ind] = true;
+			ind_counter_old = ind_counter;
+			/* Record the warned flag, then re-read descriptor */
+			smp_mb();
+			ind_counter = atomic_read(&desc->curr_index);
+			/*
+			 * If the counter changed, something was updated, so
+			 * repeat everything to get the current data
+			 */
+		} while (ind_counter != ind_counter_old);
+	}
+
+	va_start(args, fmt);
+	vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	switch (level) {
+	case LOGLEVEL_EMERG:
+		pr_emerg("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ALERT:
+		pr_alert("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_CRIT:
+		pr_crit("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_ERR:
+		pr_err("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_WARNING:
+		pr_warn("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_NOTICE:
+		pr_notice("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_INFO:
+		pr_info("%s: %s", buf_prefix, buf);
+		break;
+	case LOGLEVEL_DEBUG:
+		pr_debug("%s: %s", buf_prefix, buf);
+		break;
+	default:
+		/* No message without a valid level */
+		return 0;
+	}
+	return 1;
+}
+
+/*
+ * Dump stack if need be. This can be helpful even from the final exit
+ * to usermode code since stack traces sometimes carry information about
+ * what put you into the kernel, e.g. an interrupt number encoded in
+ * the initial entry stack frame that is still visible at exit time.
+ */
+static void debug_dump_stack(void)
+{
+	if (task_isolation_debug)
+		dump_stack();
+}
+
+/*
+ * Set the flags word but don't try to actually start task isolation yet.
+ * We will start it when entering user space in task_isolation_start().
+ */
+int task_isolation_request(unsigned int flags)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * The task isolation flags should always be cleared just by
+	 * virtue of having entered the kernel.
+	 */
+	WARN_ON_ONCE(test_tsk_thread_flag(task, TIF_TASK_ISOLATION));
+	WARN_ON_ONCE(task->task_isolation_flags != 0);
+	WARN_ON_ONCE(task->task_isolation_state != STATE_NORMAL);
+
+	task->task_isolation_flags = flags;
+	if (!(task->task_isolation_flags & PR_TASK_ISOLATION_ENABLE))
+		return 0;
+
+	/* We are trying to enable task isolation. */
+	set_tsk_thread_flag(task, TIF_TASK_ISOLATION);
+
+	/*
+	 * Shut down the vmstat worker so we're not interrupted later.
+	 * We have to try to do this here (with interrupts enabled) since
+	 * we are canceling delayed work and will call flush_work()
+	 * (which enables interrupts) and possibly schedule().
+	 */
+	quiet_vmstat_sync();
+
+	/* We return 0 here but we may change that in task_isolation_start(). */
+	return 0;
+}
+
+/*
+ * Perform actions that should be done immediately on exit from isolation.
+ */
+static void fast_task_isolation_cpu_cleanup(void *info)
+{
+	unsigned long flags;
+
+	/*
+	 * This function runs on a CPU that ran isolated task.
+	 * It should be called either directly when isolation breaking is
+	 * being processed, or using IPI from another CPU when it intends
+	 * to break isolation on the given CPU.
+	 *
+	 * We don't want this CPU running code from the rest of kernel
+	 * until other CPUs know that it is no longer isolated. Any
+	 * entry into kernel will call task_isolation_kernel_enter()
+	 * before calling this, so this will be already done by
+	 * setting per-cpu flags and synchronizing in that function.
+	 *
+	 * For development purposes it makes sense to check if it was
+	 * done, because there is a possibility that some entry points
+	 * were left unguarded. That would be clearly a bug because it
+	 * will mean that regular kernel code is running with no
+	 * synchronization.
+	 */
+	local_irq_save(flags);
+	atomic_dec(&per_cpu(isol_exit_counter, smp_processor_id()));
+	/* Barrier to sync with requesting a task isolation breaking */
+	smp_mb__after_atomic();
+	/*
+	 * At this point breaking isolation will be treated as a
+	 * separate isolation-breaking event, however interrupts won't
+	 * arrive until local_irq_restore()
+	 */
+
+	/*
+	 * Check for the above mentioned entry without a call to
+	 * task_isolation_kernel_enter()
+	 */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) {
+		/*
+		 * If it did happen, call the function here, to
+		 * prevent further problems from running in
+		 * un-synchronized state
+		 */
+		task_isolation_kernel_enter();
+		/* Report the problem */
+		pr_task_isol_emerg(smp_processor_id(),
+		"Isolation breaking was not detected on kernel entry\n");
+	}
+	/*
+	 * This task is no longer isolated (and if by any chance this
+	 * is the wrong task, it's already not isolated)
+	 */
+	current->task_isolation_flags = 0;
+	clear_tsk_thread_flag(current, TIF_TASK_ISOLATION);
+
+	/* Run the rest of cleanup later */
+	set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+
+	local_irq_restore(flags);
+}
+
+/* Disable task isolation for the specified task. */
+static void stop_isolation(struct task_struct *p)
+{
+	int cpu, this_cpu;
+	unsigned long flags;
+
+	this_cpu = get_cpu();
+	cpu = task_cpu(p);
+	if (atomic_inc_return(&per_cpu(isol_exit_counter, cpu)) > 1) {
+		/* Already exiting isolation */
+		atomic_dec(&per_cpu(isol_exit_counter, cpu));
+		put_cpu();
+		return;
+	}
+
+	if (p == current) {
+		p->task_isolation_state = STATE_NORMAL;
+		fast_task_isolation_cpu_cleanup(NULL);
+		task_isolation_cpu_cleanup();
+		put_cpu();
+	} else {
+		/*
+		 * Schedule "slow" cleanup. This relies on
+		 * TIF_NOTIFY_RESUME being set
+		 */
+		spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+		cpumask_set_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		/*
+		 * Setting flags is delegated to the CPU where
+		 * isolated task is running
+		 * isol_exit_counter will be decremented from there as well.
+		 */
+		per_cpu(isol_break_csd, cpu).func =
+		    fast_task_isolation_cpu_cleanup;
+		per_cpu(isol_break_csd, cpu).info = NULL;
+		per_cpu(isol_break_csd, cpu).flags = 0;
+		smp_call_function_single_async(cpu,
+					       &per_cpu(isol_break_csd, cpu));
+		put_cpu();
+	}
+}
+
+/*
+ * This code runs with interrupts disabled just before the return to
+ * userspace, after a prctl() has requested enabling task isolation.
+ * We take whatever steps are needed to avoid being interrupted later:
+ * drain the lru pages, stop the scheduler tick, etc. More
+ * functionality may be added here later to avoid other types of
+ * interrupts from other kernel subsystems. This, however, may still not
+ * have the intended result, so the rest of the system takes into account
+ * the possibility of receiving an interrupt and isolation breaking later.
+ *
+ * If we can't enable task isolation, we update the syscall return
+ * value with an appropriate error.
+ *
+ * This, however, does not enable isolation yet, as far as low-level
+ * flags are concerned. So if interrupts will be enabled, it's still
+ * possible for the task to be interrupted. The call to
+ * task_isolation_exit_to_user_mode() should finally enable task
+ * isolation after this function set FLAG_LL_TASK_ISOLATION_REQUEST.
+ */
+void task_isolation_start(void)
+{
+	int error;
+	unsigned long flags;
+
+	/*
+	 * We should only be called in STATE_NORMAL (isolation
+	 * disabled), on our way out of the kernel from the prctl()
+	 * that turned it on.  If we are exiting from the kernel in
+	 * another state, it means we made it back into the kernel
+	 * without disabling task isolation, and we should investigate
+	 * how (and in any case disable task isolation at this
+	 * point). We are clearly not on the path back from the
+	 * prctl() so we don't touch the syscall return value.
+	 */
+	if (WARN_ON_ONCE(current->task_isolation_state != STATE_NORMAL)) {
+		stop_isolation(current);
+		/* Report the problem */
+		pr_task_isol_emerg(smp_processor_id(),
+		"Isolation start requested while not in the normal state\n");
+		return;
+	}
+
+	/*
+	 * Must be affinitized to a single core with task isolation possible.
+	 * In principle this could be remotely modified between the prctl()
+	 * and the return to userspace, so we have to check it here.
+	 */
+	if (current->nr_cpus_allowed != 1 ||
+	    !is_isolation_cpu(smp_processor_id())) {
+		error = -EINVAL;
+		goto error;
+	}
+
+	/* If the vmstat delayed work is not canceled, we have to try again. */
+	if (!vmstat_idle()) {
+		error = -EAGAIN;
+		goto error;
+	}
+
+	/* Try to stop the dynamic tick. */
+	error = try_stop_full_tick();
+	if (error)
+		goto error;
+
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
+
+	local_irq_save(flags);
+
+	/* Record isolated task IDs and name */
+	record_curr_isolated_task();
+
+	current->task_isolation_state = STATE_ISOLATED;
+	this_cpu_write(ll_isol_flags, FLAG_LL_TASK_ISOLATION_REQUEST);
+	/* Barrier to synchronize with reading of flags */
+	smp_mb();
+	local_irq_restore(flags);
+	return;
+
+error:
+	stop_isolation(current);
+	syscall_set_return_value(current, current_pt_regs(), error, 0);
+}
+
+/* Stop task isolation on the remote task and send it a signal. */
+static void send_isolation_signal(struct task_struct *task)
+{
+	int flags = task->task_isolation_flags;
+	kernel_siginfo_t info = {
+		.si_signo = PR_TASK_ISOLATION_GET_SIG(flags) ?: SIGKILL,
+	};
+
+	if ((flags & PR_TASK_ISOLATION_ENABLE) == 0)
+		return;
+
+	stop_isolation(task);
+	send_sig_info(info.si_signo, &info, task);
+}
+
+/* Only a few syscalls are valid once we are in task isolation mode. */
+static bool is_acceptable_syscall(int syscall)
+{
+	/* No need to incur an isolation signal if we are just exiting. */
+	if (syscall == __NR_exit || syscall == __NR_exit_group)
+		return true;
+
+	/* Check to see if it's the prctl for isolation. */
+	if (syscall == __NR_prctl) {
+		unsigned long arg[SYSCALL_MAX_ARGS];
+
+		syscall_get_arguments(current, current_pt_regs(), arg);
+		if (arg[0] == PR_TASK_ISOLATION)
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called from syscall entry, prevents most syscalls
+ * from executing, and if needed raises a signal to notify the process.
+ *
+ * Note that we have to stop isolation before we even print a message
+ * here, since otherwise we might end up reporting an interrupt due to
+ * kicking the printk handling code, rather than reporting the true
+ * cause of interrupt here.
+ *
+ * The message is not suppressed by previous remotely triggered
+ * messages.
+ */
+int task_isolation_syscall(int syscall)
+{
+	struct task_struct *task = current;
+
+	/*
+	 * Check if by any chance syscall is being processed from
+	 * isolated state without a call to
+	 * task_isolation_kernel_enter() happening on entry
+	 */
+	if ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION)) {
+		/*
+		 * If it did happen, call the function here, to
+		 * prevent further problems from running in
+		 * un-synchronized state
+		 */
+		task_isolation_kernel_enter();
+		/* Report the problem */
+		pr_task_isol_emerg(smp_processor_id(),
+			"Isolation breaking was not detected on syscall\n");
+		}
+	/*
+	 * Clear low-level isolation flags to avoid triggering
+	 * a signal on return to userspace
+	 */
+	this_cpu_write(ll_isol_flags, 0);
+
+	if (is_acceptable_syscall(syscall)) {
+		stop_isolation(task);
+		return 0;
+	}
+
+	send_isolation_signal(task);
+
+	pr_task_isol_warn(smp_processor_id(),
+			  "task_isolation lost due to syscall %d\n",
+			  syscall);
+	debug_dump_stack();
+
+	syscall_set_return_value(task, current_pt_regs(), -ERESTARTNOINTR, -1);
+	return -1;
+}
+
+/*
+ * This routine is called from the code responsible for exiting to user mode,
+ * before the point when thread flags are checked for pending work.
+ * That function must be called if the current task is isolated, because
+ * TIF_TASK_ISOLATION must trigger a call to it.
+ */
+void task_isolation_before_pending_work_check(void)
+{
+	int cpu;
+	unsigned long flags;
+
+	/* Handle isolation breaking */
+	if ((current->task_isolation_state != STATE_NORMAL)
+	    && ((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION_BROKEN)
+	        != 0)) {
+		/*
+		 * Clear low-level isolation flags to avoid triggering
+		 * a signal again
+		 */
+		this_cpu_write(ll_isol_flags, 0);
+		/* Send signal to notify about isolation breaking */
+		send_isolation_signal(current);
+		/* Produce generic message about lost isolation */
+		pr_task_isol_warn(smp_processor_id(), "task_isolation lost\n");
+		debug_dump_stack();
+	}
+
+	/*
+	 * If this CPU is in the map of CPUs with cleanup pending,
+	 * remove it from the map and call cleanup
+	 */
+	spin_lock_irqsave(&task_isolation_cleanup_lock, flags);
+
+	cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(cpu, task_isolation_cleanup_map)) {
+		cpumask_clear_cpu(cpu, task_isolation_cleanup_map);
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+		task_isolation_cpu_cleanup();
+	} else
+		spin_unlock_irqrestore(&task_isolation_cleanup_lock, flags);
+}
+
+/*
+ * Called before we wake up a task that has a signal to process.
+ * Needs to be done to handle interrupts that trigger signals, which
+ * we don't catch with task_isolation_interrupt() hooks.
+ *
+ * This message is also suppressed if there was already a remotely
+ * caused message about the same isolation breaking event.
+ */
+void _task_isolation_signal(struct task_struct *task)
+{
+	struct isol_task_desc *desc;
+	int ind, cpu;
+	bool do_warn = (task->task_isolation_state == STATE_ISOLATED);
+
+	cpu = task_cpu(task);
+	desc = &per_cpu(isol_task_descs, cpu);
+	ind = atomic_read(&desc->curr_index) & 1;
+	if (desc->warned[ind])
+		do_warn = false;
+
+	stop_isolation(task);
+
+	if (do_warn) {
+		pr_warn("isolation: %s/%d/%d (cpu %d): task_isolation lost due to signal\n",
+			task->comm, task->tgid, task->pid, cpu);
+		debug_dump_stack();
+	}
+}
+
+/*
+ * Set CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	/* Barrier to synchronize with writing task isolation flags */
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_set_cpu(cpu, mask);
+}
+
+/*
+ * Clear CPUs currently running isolated tasks in CPU mask.
+ */
+void task_isolation_clear_cpumask(struct cpumask *mask)
+{
+	int cpu;
+
+	if (task_isolation_map == NULL)
+		return;
+
+	/* Barrier to synchronize with writing task isolation flags */
+	smp_rmb();
+	for_each_cpu(cpu, task_isolation_map)
+		if (task_isolation_on_cpu(cpu))
+			cpumask_clear_cpu(cpu, mask);
+}
+
+/*
+ * Cleanup procedure. The call to this procedure may be delayed.
+ */
+void task_isolation_cpu_cleanup(void)
+{
+	kick_hrtimer();
+}
diff --git a/kernel/signal.c b/kernel/signal.c
index ef8f2a28d37c..252886e6aa76 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -46,6 +46,7 @@
 #include <linux/livepatch.h>
 #include <linux/cgroup.h>
 #include <linux/audit.h>
+#include <linux/isolation.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -759,6 +760,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	task_isolation_signal(t);
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..0da007ce8666 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
 #include <linux/syscore_ops.h>
 #include <linux/version.h>
 #include <linux/ctype.h>
+#include <linux/isolation.h>
 
 #include <linux/compat.h>
 #include <linux/syscalls.h>
@@ -2530,6 +2531,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = task_isolation_request(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 387b4bef7dd1..27fedddc2729 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -30,6 +30,7 @@
 #include <linux/syscalls.h>
 #include <linux/interrupt.h>
 #include <linux/tick.h>
+#include <linux/isolation.h>
 #include <linux/err.h>
 #include <linux/debugobjects.h>
 #include <linux/sched/signal.h>
@@ -720,6 +721,19 @@ static void retrigger_next_event(void *arg)
 	raw_spin_unlock(&base->lock);
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+void kick_hrtimer(void)
+{
+	unsigned long flags;
+
+	preempt_disable();
+	local_irq_save(flags);
+	retrigger_next_event(NULL);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+#endif
+
 /*
  * Switch to high resolution mode
  */
@@ -867,8 +881,21 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_TASK_ISOLATION
+	struct cpumask mask;
+
+	cpumask_clear(&mask);
+	task_isolation_cpumask(&mask);
+	cpumask_complement(&mask, &mask);
+	/*
+	 * Retrigger the CPU local events everywhere except CPUs
+	 * running isolated tasks.
+	 */
+	on_each_cpu_mask(&mask, retrigger_next_event, NULL, 1);
+#else
 	/* Retrigger the CPU local events everywhere */
 	on_each_cpu(retrigger_next_event, NULL, 1);
+#endif
 #endif
 	timerfd_clock_was_set();
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 81632cd5e3b7..a213952541db 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -896,6 +896,24 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 #endif
 }
 
+#ifdef CONFIG_TASK_ISOLATION
+int try_stop_full_tick(void)
+{
+	int cpu = smp_processor_id();
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	/* For an unstable clock, we should return a permanent error code. */
+	if (atomic_read(&tick_dep_mask) & TICK_DEP_MASK_CLOCK_UNSTABLE)
+		return -EINVAL;
+
+	if (!can_stop_full_tick(cpu, ts))
+		return -EAGAIN;
+
+	tick_nohz_stop_sched_tick(ts, cpu);
+	return 0;
+}
+#endif
+
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
 	/*
-- 
2.20.1


^ permalink raw reply related

* [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
From: Alex Belits @ 2020-11-23 17:56 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

From: Chris Metcalf <cmetcalf@mellanox.com>

This function checks to see if a vmstat worker is not running,
and the vmstat diffs don't require an update.  The function is
called from the task-isolation code to see if we need to
actually do some work to quiet vmstat.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h |  2 ++
 mm/vmstat.c            | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 300ce6648923..24392a957cfc 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -285,6 +285,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
 void quiet_vmstat_sync(void);
+bool vmstat_idle(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -393,6 +394,7 @@ static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
 static inline void quiet_vmstat_sync(void) { }
+static inline bool vmstat_idle(void) { return true; }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 43999caf47a4..5b0ad7ed65f7 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1945,6 +1945,16 @@ void quiet_vmstat_sync(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Report on whether vmstat processing is quiesced on the core currently:
+ * no vmstat worker running and no vmstat updates to perform.
+ */
+bool vmstat_idle(void)
+{
+	return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) &&
+		!need_update(smp_processor_id());
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply related

* [PATCH v2] net: mvpp2: divide fifo for dts-active ports only
From: stefanc @ 2020-11-23 17:54 UTC (permalink / raw)
  To: netdev
  Cc: thomas.petazzoni, davem, nadavh, ymarkman, linux-kernel, stefanc,
	kuba, linux, mw, andrew, rmk+kernel

From: Stefan Chulski <stefanc@marvell.com>

Tx/Rx FIFO is a HW resource limited by total size, but shared
by all ports of same CP110 and impacting port-performance.
Do not divide the FIFO for ports which are not enabled in DTS,
so active ports could have more FIFO.
No change in FIFO allocation if all 3 ports on the communication
processor enabled in DTS.

The active port mapping should be done in probe before FIFO-init.

Signed-off-by: Stefan Chulski <stefanc@marvell.com>
---
 drivers/net/ethernet/marvell/mvpp2/mvpp2.h      |  23 +++--
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 129 +++++++++++++++++-------
 2 files changed, 108 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
index 8347758..6bd7e40 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
@@ -695,6 +695,9 @@
 /* Maximum number of supported ports */
 #define MVPP2_MAX_PORTS			4
 
+/* Loopback port index */
+#define MVPP2_LOOPBACK_PORT_INDEX	3
+
 /* Maximum number of TXQs used by single port */
 #define MVPP2_MAX_TXQ			8
 
@@ -729,22 +732,21 @@
 #define MVPP2_TX_DESC_ALIGN		(MVPP2_DESC_ALIGNED_SIZE - 1)
 
 /* RX FIFO constants */
+#define MVPP2_RX_FIFO_PORT_DATA_SIZE_44KB	0xb000
 #define MVPP2_RX_FIFO_PORT_DATA_SIZE_32KB	0x8000
 #define MVPP2_RX_FIFO_PORT_DATA_SIZE_8KB	0x2000
 #define MVPP2_RX_FIFO_PORT_DATA_SIZE_4KB	0x1000
-#define MVPP2_RX_FIFO_PORT_ATTR_SIZE_32KB	0x200
-#define MVPP2_RX_FIFO_PORT_ATTR_SIZE_8KB	0x80
+#define MVPP2_RX_FIFO_PORT_ATTR_SIZE(data_size)	((data_size) >> 6)
 #define MVPP2_RX_FIFO_PORT_ATTR_SIZE_4KB	0x40
 #define MVPP2_RX_FIFO_PORT_MIN_PKT		0x80
 
 /* TX FIFO constants */
-#define MVPP22_TX_FIFO_DATA_SIZE_10KB		0xa
-#define MVPP22_TX_FIFO_DATA_SIZE_3KB		0x3
-#define MVPP2_TX_FIFO_THRESHOLD_MIN		256
-#define MVPP2_TX_FIFO_THRESHOLD_10KB	\
-	(MVPP22_TX_FIFO_DATA_SIZE_10KB * 1024 - MVPP2_TX_FIFO_THRESHOLD_MIN)
-#define MVPP2_TX_FIFO_THRESHOLD_3KB	\
-	(MVPP22_TX_FIFO_DATA_SIZE_3KB * 1024 - MVPP2_TX_FIFO_THRESHOLD_MIN)
+#define MVPP22_TX_FIFO_DATA_SIZE_16KB		16
+#define MVPP22_TX_FIFO_DATA_SIZE_10KB		10
+#define MVPP22_TX_FIFO_DATA_SIZE_3KB		3
+#define MVPP2_TX_FIFO_THRESHOLD_MIN		256 /* Bytes */
+#define MVPP2_TX_FIFO_THRESHOLD(kb)	\
+		((kb) * 1024 - MVPP2_TX_FIFO_THRESHOLD_MIN)
 
 /* RX buffer constants */
 #define MVPP2_SKB_SHINFO_SIZE \
@@ -946,6 +948,9 @@ struct mvpp2 {
 	/* List of pointers to port structures */
 	int port_count;
 	struct mvpp2_port *port_list[MVPP2_MAX_PORTS];
+	/* Map of enabled ports */
+	unsigned long port_map;
+
 	struct mvpp2_tai *tai;
 
 	/* Number of Tx threads used */
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index f6616c8..08c237a 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -6601,32 +6601,56 @@ static void mvpp2_rx_fifo_init(struct mvpp2 *priv)
 	mvpp2_write(priv, MVPP2_RX_FIFO_INIT_REG, 0x1);
 }
 
-static void mvpp22_rx_fifo_init(struct mvpp2 *priv)
+static void mvpp22_rx_fifo_set_hw(struct mvpp2 *priv, int port, int data_size)
 {
-	int port;
+	int attr_size = MVPP2_RX_FIFO_PORT_ATTR_SIZE(data_size);
 
-	/* The FIFO size parameters are set depending on the maximum speed a
-	 * given port can handle:
-	 * - Port 0: 10Gbps
-	 * - Port 1: 2.5Gbps
-	 * - Ports 2 and 3: 1Gbps
-	 */
+	mvpp2_write(priv, MVPP2_RX_DATA_FIFO_SIZE_REG(port), data_size);
+	mvpp2_write(priv, MVPP2_RX_ATTR_FIFO_SIZE_REG(port), attr_size);
+}
 
-	mvpp2_write(priv, MVPP2_RX_DATA_FIFO_SIZE_REG(0),
-		    MVPP2_RX_FIFO_PORT_DATA_SIZE_32KB);
-	mvpp2_write(priv, MVPP2_RX_ATTR_FIFO_SIZE_REG(0),
-		    MVPP2_RX_FIFO_PORT_ATTR_SIZE_32KB);
+/* Initialize TX FIFO's: the total FIFO size is 48kB on PPv2.2.
+ * 4kB fixed space must be assigned for the loopback port.
+ * Redistribute remaining avialable 44kB space among all active ports.
+ * Guarantee minimum 32kB for 10G port and 8kB for port 1, capable of 2.5G
+ * SGMII link.
+ */
+static void mvpp22_rx_fifo_init(struct mvpp2 *priv)
+{
+	int remaining_ports_count;
+	unsigned long port_map;
+	int size_remainder;
+	int port, size;
+
+	/* The loopback requires fixed 4kB of the FIFO space assignment. */
+	mvpp22_rx_fifo_set_hw(priv, MVPP2_LOOPBACK_PORT_INDEX,
+			      MVPP2_RX_FIFO_PORT_DATA_SIZE_4KB);
+	port_map = priv->port_map & ~BIT(MVPP2_LOOPBACK_PORT_INDEX);
+
+	/* Set RX FIFO size to 0 for inactive ports. */
+	for_each_clear_bit(port, &port_map, MVPP2_LOOPBACK_PORT_INDEX)
+		mvpp22_rx_fifo_set_hw(priv, port, 0);
+
+	/* Assign remaining RX FIFO space among all active ports. */
+	size_remainder = MVPP2_RX_FIFO_PORT_DATA_SIZE_44KB;
+	remaining_ports_count = hweight_long(port_map);
+
+	for_each_set_bit(port, &port_map, MVPP2_LOOPBACK_PORT_INDEX) {
+		if (remaining_ports_count == 1)
+			size = size_remainder;
+		else if (port == 0)
+			size = max(size_remainder / remaining_ports_count,
+				   MVPP2_RX_FIFO_PORT_DATA_SIZE_32KB);
+		else if (port == 1)
+			size = max(size_remainder / remaining_ports_count,
+				   MVPP2_RX_FIFO_PORT_DATA_SIZE_8KB);
+		else
+			size = size_remainder / remaining_ports_count;
 
-	mvpp2_write(priv, MVPP2_RX_DATA_FIFO_SIZE_REG(1),
-		    MVPP2_RX_FIFO_PORT_DATA_SIZE_8KB);
-	mvpp2_write(priv, MVPP2_RX_ATTR_FIFO_SIZE_REG(1),
-		    MVPP2_RX_FIFO_PORT_ATTR_SIZE_8KB);
+		size_remainder -= size;
+		remaining_ports_count--;
 
-	for (port = 2; port < MVPP2_MAX_PORTS; port++) {
-		mvpp2_write(priv, MVPP2_RX_DATA_FIFO_SIZE_REG(port),
-			    MVPP2_RX_FIFO_PORT_DATA_SIZE_4KB);
-		mvpp2_write(priv, MVPP2_RX_ATTR_FIFO_SIZE_REG(port),
-			    MVPP2_RX_FIFO_PORT_ATTR_SIZE_4KB);
+		mvpp22_rx_fifo_set_hw(priv, port, size);
 	}
 
 	mvpp2_write(priv, MVPP2_RX_MIN_PKT_SIZE_REG,
@@ -6634,24 +6658,53 @@ static void mvpp22_rx_fifo_init(struct mvpp2 *priv)
 	mvpp2_write(priv, MVPP2_RX_FIFO_INIT_REG, 0x1);
 }
 
-/* Initialize Tx FIFO's: the total FIFO size is 19kB on PPv2.2 and 10G
- * interfaces must have a Tx FIFO size of 10kB. As only port 0 can do 10G,
- * configure its Tx FIFO size to 10kB and the others ports Tx FIFO size to 3kB.
+static void mvpp22_tx_fifo_set_hw(struct mvpp2 *priv, int port, int size)
+{
+	int threshold = MVPP2_TX_FIFO_THRESHOLD(size);
+
+	mvpp2_write(priv, MVPP22_TX_FIFO_SIZE_REG(port), size);
+	mvpp2_write(priv, MVPP22_TX_FIFO_THRESH_REG(port), threshold);
+}
+
+/* Initialize TX FIFO's: the total FIFO size is 19kB on PPv2.2.
+ * 3kB fixed space must be assigned for the loopback port.
+ * Redistribute remaining avialable 16kB space among all active ports.
+ * The 10G interface should use 10kB (which is maximum possible size
+ * per single port).
  */
 static void mvpp22_tx_fifo_init(struct mvpp2 *priv)
 {
-	int port, size, thrs;
-
-	for (port = 0; port < MVPP2_MAX_PORTS; port++) {
-		if (port == 0) {
+	int remaining_ports_count;
+	unsigned long port_map;
+	int size_remainder;
+	int port, size;
+
+	/* The loopback requires fixed 3kB of the FIFO space assignment. */
+	mvpp22_tx_fifo_set_hw(priv, MVPP2_LOOPBACK_PORT_INDEX,
+			      MVPP22_TX_FIFO_DATA_SIZE_3KB);
+	port_map = priv->port_map & ~BIT(MVPP2_LOOPBACK_PORT_INDEX);
+
+	/* Set TX FIFO size to 0 for inactive ports. */
+	for_each_clear_bit(port, &port_map, MVPP2_LOOPBACK_PORT_INDEX)
+		mvpp22_tx_fifo_set_hw(priv, port, 0);
+
+	/* Assign remaining TX FIFO space among all active ports. */
+	size_remainder = MVPP22_TX_FIFO_DATA_SIZE_16KB;
+	remaining_ports_count = hweight_long(port_map);
+
+	for_each_set_bit(port, &port_map, MVPP2_LOOPBACK_PORT_INDEX) {
+		if (remaining_ports_count == 1)
+			size = min(size_remainder,
+				   MVPP22_TX_FIFO_DATA_SIZE_10KB);
+		else if (port == 0)
 			size = MVPP22_TX_FIFO_DATA_SIZE_10KB;
-			thrs = MVPP2_TX_FIFO_THRESHOLD_10KB;
-		} else {
-			size = MVPP22_TX_FIFO_DATA_SIZE_3KB;
-			thrs = MVPP2_TX_FIFO_THRESHOLD_3KB;
-		}
-		mvpp2_write(priv, MVPP22_TX_FIFO_SIZE_REG(port), size);
-		mvpp2_write(priv, MVPP22_TX_FIFO_THRESH_REG(port), thrs);
+		else
+			size = size_remainder / remaining_ports_count;
+
+		size_remainder -= size;
+		remaining_ports_count--;
+
+		mvpp22_tx_fifo_set_hw(priv, port, size);
 	}
 }
 
@@ -6952,6 +7005,12 @@ static int mvpp2_probe(struct platform_device *pdev)
 			goto err_axi_clk;
 	}
 
+	/* Map DTS-active ports. Should be done before FIFO mvpp2_init */
+	fwnode_for_each_available_child_node(fwnode, port_fwnode) {
+		if (!fwnode_property_read_u32(port_fwnode, "port-id", &i))
+			priv->port_map |= BIT(i);
+	}
+
 	/* Initialize network controller */
 	err = mvpp2_init(pdev, priv);
 	if (err < 0) {
-- 
1.9.1


^ permalink raw reply related

* [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
From: Alex Belits @ 2020-11-23 17:56 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org
In-Reply-To: <8d887e59ca713726f4fcb25a316e1e932b02823e.camel@marvell.com>

From: Chris Metcalf <cmetcalf@mellanox.com>

In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter")
the quiet_vmstat() function became asynchronous, in the sense that
the vmstat work was still scheduled to run on the core when the
function returned.  For task isolation, we need a synchronous
version of the function that guarantees that the vmstat worker
will not run on the core on return from the function.  Add a
quiet_vmstat_sync() function with that semantic.

Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Signed-off-by: Alex Belits <abelits@marvell.com>
---
 include/linux/vmstat.h | 2 ++
 mm/vmstat.c            | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 322dcbfcc933..300ce6648923 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -284,6 +284,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
+void quiet_vmstat_sync(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -391,6 +392,7 @@ static inline void __dec_node_page_state(struct page *page,
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
 static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat_sync(void) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 698bc0bc18d1..43999caf47a4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1936,6 +1936,15 @@ void quiet_vmstat(void)
 	refresh_cpu_vm_stats(false);
 }
 
+/*
+ * Synchronously quiet vmstat so the work is guaranteed not to run on return.
+ */
+void quiet_vmstat_sync(void)
+{
+	cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));
+	refresh_cpu_vm_stats(false);
+}
+
 /*
  * Shepherd worker thread that checks the
  * differentials of processors that have their worker
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf v2] net, xsk: Avoid taking multiple skbuff references
From: Björn Töpel @ 2020-11-23 17:56 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, jonathan.lemon, yhs, weqaar.janjua,
	magnus.karlsson, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

Commit 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
addressed the problem that packets were discarded from the Tx AF_XDP
ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
bumping the skbuff reference count, so that the buffer would not be
freed by dev_direct_xmit(). A reference count larger than one means
that the skbuff is "shared", which is not the case.

If the "shared" skbuff is sent to the generic XDP receive path,
netif_receive_generic_xdp(), and pskb_expand_head() is entered the
BUG_ON(skb_shared(skb)) will trigger.

This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
where a user can select the skbuff free policy. This allows AF_XDP to
avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
behavior.

Reported-by: Yonghong Song <yhs@fb.com>
Fixes: 642e450b6b59 ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
v2:
  Removed the bool parameter, and inlined dev_direct_xmit(). (Daniel)
---
 include/linux/netdevice.h | 11 ++++++++++-
 net/core/dev.c            |  8 ++------
 net/xdp/xsk.c             |  8 +-------
 3 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 964b494b0e8d..bd841f31dfab 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2815,7 +2815,16 @@ u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb,
 		       struct net_device *sb_dev);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
-int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
+int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
+static inline int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
+{
+	int ret;
+
+	ret = __dev_direct_xmit(skb, queue_id);
+	if (!dev_xmit_complete(ret))
+		kfree_skb(skb);
+	return ret;
+}
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
 void unregister_netdevice_many(struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index 82dc6b48e45f..8588ade790cb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4180,7 +4180,7 @@ int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev)
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 
-int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
+int __dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
 {
 	struct net_device *dev = skb->dev;
 	struct sk_buff *orig_skb = skb;
@@ -4210,17 +4210,13 @@ int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
 	dev_xmit_recursion_dec();
 
 	local_bh_enable();
-
-	if (!dev_xmit_complete(ret))
-		kfree_skb(skb);
-
 	return ret;
 drop:
 	atomic_long_inc(&dev->tx_dropped);
 	kfree_skb_list(skb);
 	return NET_XMIT_DROP;
 }
-EXPORT_SYMBOL(dev_direct_xmit);
+EXPORT_SYMBOL(__dev_direct_xmit);
 
 /*************************************************************************
  *			Receiver routines
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5a6cdf7b320d..b7b039bd9d03 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -411,11 +411,7 @@ static int xsk_generic_xmit(struct sock *sk)
 		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
 		skb->destructor = xsk_destruct_skb;
 
-		/* Hinder dev_direct_xmit from freeing the packet and
-		 * therefore completing it in the destructor
-		 */
-		refcount_inc(&skb->users);
-		err = dev_direct_xmit(skb, xs->queue_id);
+		err = __dev_direct_xmit(skb, xs->queue_id);
 		if  (err == NETDEV_TX_BUSY) {
 			/* Tell user-space to retry the send */
 			skb->destructor = sock_wfree;
@@ -429,12 +425,10 @@ static int xsk_generic_xmit(struct sock *sk)
 		/* Ignore NET_XMIT_CN as packet might have been sent */
 		if (err == NET_XMIT_DROP) {
 			/* SKB completed but not sent */
-			kfree_skb(skb);
 			err = -EBUSY;
 			goto out;
 		}
 
-		consume_skb(skb);
 		sent_frame = true;
 	}
 

base-commit: 178648916e73e00de83150eb0c90c0d3a977a46a
-- 
2.27.0


^ permalink raw reply related

* RE: [EXT] Re: [PATCH v1] net: mvpp2: divide fifo for dts-active ports only
From: Stefan Chulski @ 2020-11-23 17:48 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: netdev@vger.kernel.org, thomas.petazzoni@bootlin.com,
	davem@davemloft.net, Nadav Haklai, Yan Markman,
	linux-kernel@vger.kernel.org, kuba@kernel.org, mw@semihalf.com,
	andrew@lunn.ch
In-Reply-To: <20201123173005.GY1551@shell.armlinux.org.uk>



> -----Original Message-----
> From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> Sent: Monday, November 23, 2020 7:30 PM
> To: Stefan Chulski <stefanc@marvell.com>
> Cc: netdev@vger.kernel.org; thomas.petazzoni@bootlin.com;
> davem@davemloft.net; Nadav Haklai <nadavh@marvell.com>; Yan Markman
> <ymarkman@marvell.com>; linux-kernel@vger.kernel.org; kuba@kernel.org;
> mw@semihalf.com; andrew@lunn.ch
> Subject: Re: [EXT] Re: [PATCH v1] net: mvpp2: divide fifo for dts-active ports
> only
> 
> On Mon, Nov 23, 2020 at 04:03:00PM +0000, Stefan Chulski wrote:
> > I agree with you. We can use "max-speed" for better FIFO allocations.
> > I plan to upstream more fixes from the "Marvell" devel branch then I can
> prepare this patch.
> > So you OK with this patch and then follow-on improvement?
> 
> Yes - but I would like to see the commit description say that this results in no
> change the situation where all three ports are in use.

Ok, I would repost patch.

Regards.

^ permalink raw reply

* [PATCH net-next] netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal
From: Antoine Tenart @ 2020-11-23 17:49 UTC (permalink / raw)
  To: kuba, pablo, kadlec, fw, roopa, nikolay
  Cc: Antoine Tenart, netdev, netfilter-devel, coreteam, sbrivio

Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the
hooks as, while it's an expected value for a bridge, routing expects
PACKET_HOST. The change is undone later on after hook traversal. This
can be seen with pairs of functions updating skb>pkt_type and then
reverting it to its original value:

For hook NF_INET_PRE_ROUTING:
  setup_pre_routing / br_nf_pre_routing_finish

For hook NF_INET_FORWARD:
  br_nf_forward_ip / br_nf_forward_finish

But the third case where netfilter does this, for hook
NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing
but never reverted. A comment says:

  /* We assume any code from br_dev_queue_push_xmit onwards doesn't care
   * about the value of skb->pkt_type. */

But when having a tunnel (say vxlan) attached to a bridge we have the
following call trace:

  br_nf_pre_routing
  br_nf_pre_routing_ipv6
     br_nf_pre_routing_finish
  br_nf_forward_ip
     br_nf_forward_finish
  br_nf_post_routing           <- pkt_type is updated to PACKET_HOST
     br_nf_dev_queue_xmit      <- but not reverted to its original value
  vxlan_xmit
     vxlan_xmit_one
        skb_tunnel_check_pmtu  <- a check on pkt_type is performed

In this specific case, this creates issues such as when an ICMPv6 PTB
should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB
isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST
and returns early).

If the comment is right and no one cares about the value of
skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting
it to its original value should be safe.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/bridge/br_netfilter_hooks.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 04c3f9a82650..8edfb98ae1d5 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -735,6 +735,11 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
 	mtu_reserved = nf_bridge_mtu_reduction(skb);
 	mtu = skb->dev->mtu;

+	if (nf_bridge->pkt_otherhost) {
+		skb->pkt_type = PACKET_OTHERHOST;
+		nf_bridge->pkt_otherhost = false;
+	}
+
 	if (nf_bridge->frag_max_size && nf_bridge->frag_max_size < mtu)
 		mtu = nf_bridge->frag_max_size;

@@ -835,8 +840,6 @@ static unsigned int br_nf_post_routing(void *priv,
 	else
 		return NF_ACCEPT;

-	/* We assume any code from br_dev_queue_push_xmit onwards doesn't care
-	 * about the value of skb->pkt_type. */
 	if (skb->pkt_type == PACKET_OTHERHOST) {
 		skb->pkt_type = PACKET_HOST;
 		nf_bridge->pkt_otherhost = true;
-- 
2.28.0

^ permalink raw reply related

* [PATCH v5 0/9] "Task_isolation" mode
From: Alex Belits @ 2020-11-23 17:42 UTC (permalink / raw)
  To: nitesh@redhat.com, frederic@kernel.org
  Cc: Prasun Kapoor, linux-api@vger.kernel.org, davem@davemloft.net,
	trix@redhat.com, mingo@kernel.org, catalin.marinas@arm.com,
	rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, tglx@linutronix.de, linux-arch@vger.kernel.org,
	mtosatti@redhat.com, will@kernel.org, peterz@infradead.org,
	leon@sidebranch.com, linux-arm-kernel@lists.infradead.org,
	pauld@redhat.com, netdev@vger.kernel.org

This is an update of task isolation work that was originally done by
Chris Metcalf <cmetcalf@mellanox.com> and maintained by him until
November 2017. It is adapted to the current kernel and cleaned up to
implement its functionality in a more complete and cleaner manner.

Previous version is at
https://lore.kernel.org/netdev/04be044c1bcd76b7438b7563edc35383417f12c8.camel@marvell.com/

The last version by Chris Metcalf (now obsolete but may be relevant
for comparison and understanding the origin of the changes) is at
https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetcalf@mellanox.com

Supported architectures

This version includes only architecture-independent code and arm64
support. x86 and arm support, and everything related to virtualization
will be re-added later when new kernel entry/exit implementation will
be accommodated. Support for other architectures can be added in a
somewhat modular manner, however it heavily depends on the details of
a kernel entry/exit support on any particular architecture.
Development of common entry/exit and conversion to it should simplify
that task. For now, this is the version that is currently being
developed on arm64.

Major changes since v4

The goal was to make isolation-breaking detection as generic as
possible, and remove everything related to determining, _why_
isolation was broken. Originally reporting isolation breaking was done
with a large number of of hooks in specific code (hardware interrupts,
syscalls, IPIs, page faults, etc.), and it was necessary to cover all
possible such events to have a reliable notification of a task about
its isolation being broken. To avoid such a fragile mechanism, this
version relies on mere fact of kernel being entered in isolation
mode. As a result, reporting happens later in kernel code, however it
covers everything.

This means that now there is no specific reporting, in kernel log or
elsewhere, about the reasons for breaking isolation. Information about
that may be valuable at runtime, so a separate mechanism for generic
reporting "why did CPU enter kernel" (with isolation or under other
conditions) may be a good thing. That can be done later, however at
this point it's important that task isolation does not require it, and
such mechanism will not be developed with the limited purpose of
supporting isolation alone.

General description

This is the result of development and maintenance of task isolation
functionality that originally started based on task isolation patch
v15 and was later updated to include v16. It provided predictable
environment for userspace tasks running on arm64 processors alongside
with full-featured Linux environment. It is intended to provide
reliable interruption-free environment from the point when a userspace
task enters isolation and until the moment it leaves isolation or
receives a signal intentionally sent to it, and was successfully used
for this purpose. While CPU isolation with nohz provides an
environment that is close to this requirement, the remaining IPIs and
other disturbances keep it from being usable for tasks that require
complete predictability of CPU timing.

This set of patches only covers the implementation of task isolation,
however additional functionality, such as selective TLB flushes, may
be implemented to avoid other kinds of disturbances that affect
latency and performance of isolated tasks.

The userspace support and test program is now at
https://github.com/abelits/libtmc . It was originally developed for
earlier implementation, so it has some checks that may be redundant
now but kept for compatibility.

My thanks to Chris Metcalf for design and maintenance of the original
task isolation patch, Francis Giraldeau <francis.giraldeau@gmail.com>
and Yuri Norov <ynorov@marvell.com> for various contributions to this
work, Frederic Weisbecker <frederic@kernel.org> for his work on CPU
isolation and housekeeping that made possible to remove some less
elegant solutions that I had to devise for earlier, <4.17 kernels, and
Nitesh Narayan Lal <nitesh@redhat.com> for adapting earlier patches
related to interrupt and work distribution in presence of CPU
isolation.

-- 
Alex

^ permalink raw reply

* [PATCH net-next v4 5/7] dpaa_eth: add XDP_REDIRECT support
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

After transmission, the frame is returned on confirmation queues for
cleanup. For this, store a backpointer to the xdp_frame in the private
reserved area at the start of the TX buffer.

No TX batching support is implemented at this time.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 48 +++++++++++++++++++++++++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |  1 +
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 0deffcc..a96c994 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2304,8 +2304,11 @@ static int dpaa_eth_poll(struct napi_struct *napi, int budget)
 {
 	struct dpaa_napi_portal *np =
 			container_of(napi, struct dpaa_napi_portal, napi);
+	int cleaned;
 
-	int cleaned = qman_p_poll_dqrr(np->p, budget);
+	np->xdp_act = 0;
+
+	cleaned = qman_p_poll_dqrr(np->p, budget);
 
 	if (cleaned < budget) {
 		napi_complete_done(napi, cleaned);
@@ -2314,6 +2317,9 @@ static int dpaa_eth_poll(struct napi_struct *napi, int budget)
 		qman_p_irqsource_add(np->p, QM_PIRQ_DQRI);
 	}
 
+	if (np->xdp_act & XDP_REDIRECT)
+		xdp_do_flush();
+
 	return cleaned;
 }
 
@@ -2456,6 +2462,7 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 	struct xdp_frame *xdpf;
 	struct xdp_buff xdp;
 	u32 xdp_act;
+	int err;
 
 	rcu_read_lock();
 
@@ -2497,6 +2504,17 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 			xdp_return_frame_rx_napi(xdpf);
 
 		break;
+	case XDP_REDIRECT:
+		/* Allow redirect to use the full headroom */
+		xdp.data_hard_start = vaddr;
+		xdp.frame_sz = DPAA_BP_RAW_SIZE;
+
+		err = xdp_do_redirect(priv->net_dev, &xdp, xdp_prog);
+		if (err) {
+			trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act);
+			free_pages((unsigned long)vaddr, 0);
+		}
+		break;
 	default:
 		bpf_warn_invalid_xdp_action(xdp_act);
 		fallthrough;
@@ -2526,6 +2544,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	struct dpaa_percpu_priv *percpu_priv;
 	const struct qm_fd *fd = &dq->fd;
 	dma_addr_t addr = qm_fd_addr(fd);
+	struct dpaa_napi_portal *np;
 	enum qm_fd_format fd_format;
 	struct net_device *net_dev;
 	u32 fd_status, hash_offset;
@@ -2540,6 +2559,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	u32 hash;
 	u64 ns;
 
+	np = container_of(&portal, struct dpaa_napi_portal, p);
 	dpaa_fq = container_of(fq, struct dpaa_fq, fq_base);
 	fd_status = be32_to_cpu(fd->status);
 	fd_format = qm_fd_get_format(fd);
@@ -2613,6 +2633,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	if (likely(fd_format == qm_fd_contig)) {
 		xdp_act = dpaa_run_xdp(priv, (struct qm_fd *)fd, vaddr,
 				       dpaa_fq, &xdp_meta_len);
+		np->xdp_act |= xdp_act;
 		if (xdp_act != XDP_PASS) {
 			percpu_stats->rx_packets++;
 			percpu_stats->rx_bytes += qm_fd_get_length(fd);
@@ -2943,6 +2964,30 @@ static int dpaa_xdp(struct net_device *net_dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int dpaa_xdp_xmit(struct net_device *net_dev, int n,
+			 struct xdp_frame **frames, u32 flags)
+{
+	struct xdp_frame *xdpf;
+	int i, err, drops = 0;
+
+	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+		return -EINVAL;
+
+	if (!netif_running(net_dev))
+		return -ENETDOWN;
+
+	for (i = 0; i < n; i++) {
+		xdpf = frames[i];
+		err = dpaa_xdp_xmit_frame(net_dev, xdpf);
+		if (err) {
+			xdp_return_frame_rx_napi(xdpf);
+			drops++;
+		}
+	}
+
+	return n - drops;
+}
+
 static int dpaa_ts_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
 	struct dpaa_priv *priv = netdev_priv(dev);
@@ -3011,6 +3056,7 @@ static int dpaa_ioctl(struct net_device *net_dev, struct ifreq *rq, int cmd)
 	.ndo_setup_tc = dpaa_setup_tc,
 	.ndo_change_mtu = dpaa_change_mtu,
 	.ndo_bpf = dpaa_xdp,
+	.ndo_xdp_xmit = dpaa_xdp_xmit,
 };
 
 static int dpaa_napi_add(struct net_device *net_dev)
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
index 5c8d52a..daf894a 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
@@ -127,6 +127,7 @@ struct dpaa_napi_portal {
 	struct napi_struct napi;
 	struct qman_portal *p;
 	bool down;
+	int xdp_act;
 };
 
 struct dpaa_percpu_priv {
-- 
1.9.1


^ permalink raw reply related

* [PATCH net-next v4 7/7] dpaa_eth: implement the A050385 erratum workaround for XDP
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

For XDP TX, even tough we start out with correctly aligned buffers, the
XDP program might change the data's alignment. For REDIRECT, we have no
control over the alignment either.

Create a new workaround for xdp_frame structures to verify the erratum
conditions and move the data to a fresh buffer if necessary. Create a new
xdp_frame for managing the new buffer and free the old one using the XDP
API.

Due to alignment constraints, all frames have a 256 byte headroom that
is offered fully to XDP under the erratum. If the XDP program uses all
of it, the data needs to be move to make room for the xdpf backpointer.

Disable the metadata support since the information can be lost.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 82 +++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 149b549..d8fc19d 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2170,6 +2170,52 @@ static int dpaa_a050385_wa_skb(struct net_device *net_dev, struct sk_buff **s)
 
 	return 0;
 }
+
+static int dpaa_a050385_wa_xdpf(struct dpaa_priv *priv,
+				struct xdp_frame **init_xdpf)
+{
+	struct xdp_frame *new_xdpf, *xdpf = *init_xdpf;
+	void *new_buff;
+	struct page *p;
+
+	/* Check the data alignment and make sure the headroom is large
+	 * enough to store the xdpf backpointer. Use an aligned headroom
+	 * value.
+	 *
+	 * Due to alignment constraints, we give XDP access to the full 256
+	 * byte frame headroom. If the XDP program uses all of it, copy the
+	 * data to a new buffer and make room for storing the backpointer.
+	 */
+	if (PTR_IS_ALIGNED(xdpf->data, DPAA_A050385_ALIGN) &&
+	    xdpf->headroom >= priv->tx_headroom) {
+		xdpf->headroom = priv->tx_headroom;
+		return 0;
+	}
+
+	p = dev_alloc_pages(0);
+	if (unlikely(!p))
+		return -ENOMEM;
+
+	/* Copy the data to the new buffer at a properly aligned offset */
+	new_buff = page_address(p);
+	memcpy(new_buff + priv->tx_headroom, xdpf->data, xdpf->len);
+
+	/* Create an XDP frame around the new buffer in a similar fashion
+	 * to xdp_convert_buff_to_frame.
+	 */
+	new_xdpf = new_buff;
+	new_xdpf->data = new_buff + priv->tx_headroom;
+	new_xdpf->len = xdpf->len;
+	new_xdpf->headroom = priv->tx_headroom;
+	new_xdpf->frame_sz = DPAA_BP_RAW_SIZE;
+	new_xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+
+	/* Release the initial buffer */
+	xdp_return_frame_rx_napi(xdpf);
+
+	*init_xdpf = new_xdpf;
+	return 0;
+}
 #endif
 
 static netdev_tx_t
@@ -2406,6 +2452,15 @@ static int dpaa_xdp_xmit_frame(struct net_device *net_dev,
 	percpu_priv = this_cpu_ptr(priv->percpu_priv);
 	percpu_stats = &percpu_priv->stats;
 
+#ifdef CONFIG_DPAA_ERRATUM_A050385
+	if (unlikely(fman_has_errata_a050385())) {
+		if (dpaa_a050385_wa_xdpf(priv, &xdpf)) {
+			err = -ENOMEM;
+			goto out_error;
+		}
+	}
+#endif
+
 	if (xdpf->headroom < DPAA_TX_PRIV_DATA_SIZE) {
 		err = -EINVAL;
 		goto out_error;
@@ -2479,6 +2534,20 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 	xdp.frame_sz = DPAA_BP_RAW_SIZE - DPAA_TX_PRIV_DATA_SIZE;
 	xdp.rxq = &dpaa_fq->xdp_rxq;
 
+	/* We reserve a fixed headroom of 256 bytes under the erratum and we
+	 * offer it all to XDP programs to use. If no room is left for the
+	 * xdpf backpointer on TX, we will need to copy the data.
+	 * Disable metadata support since data realignments might be required
+	 * and the information can be lost.
+	 */
+#ifdef CONFIG_DPAA_ERRATUM_A050385
+	if (unlikely(fman_has_errata_a050385())) {
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = vaddr;
+		xdp.frame_sz = DPAA_BP_RAW_SIZE;
+	}
+#endif
+
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
 	/* Update the length and the offset of the FD */
@@ -2486,7 +2555,8 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 
 	switch (xdp_act) {
 	case XDP_PASS:
-		*xdp_meta_len = xdp.data - xdp.data_meta;
+		*xdp_meta_len = xdp_data_meta_unsupported(&xdp) ? 0 :
+				xdp.data - xdp.data_meta;
 		break;
 	case XDP_TX:
 		/* We can access the full headroom when sending the frame
@@ -3188,10 +3258,16 @@ static u16 dpaa_get_headroom(struct dpaa_buffer_layout *bl,
 	 */
 	headroom = (u16)(bl[port].priv_data_size + DPAA_HWA_SIZE);
 
-	if (port == RX)
+	if (port == RX) {
+#ifdef CONFIG_DPAA_ERRATUM_A050385
+		if (unlikely(fman_has_errata_a050385()))
+			headroom = XDP_PACKET_HEADROOM;
+#endif
+
 		return ALIGN(headroom, DPAA_FD_RX_DATA_ALIGNMENT);
-	else
+	} else {
 		return ALIGN(headroom, DPAA_FD_DATA_ALIGNMENT);
+	}
 }
 
 static int dpaa_eth_probe(struct platform_device *pdev)
-- 
1.9.1


^ permalink raw reply related

* [PATCH net-next v4 6/7] dpaa_eth: rename current skb A050385 erratum workaround
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

Explicitly point that the current workaround addresses skbs. This change is
in preparation for adding a workaround for XDP scenarios.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index a96c994..149b549 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2104,7 +2104,7 @@ static inline int dpaa_xmit(struct dpaa_priv *priv,
 }
 
 #ifdef CONFIG_DPAA_ERRATUM_A050385
-static int dpaa_a050385_wa(struct net_device *net_dev, struct sk_buff **s)
+static int dpaa_a050385_wa_skb(struct net_device *net_dev, struct sk_buff **s)
 {
 	struct dpaa_priv *priv = netdev_priv(net_dev);
 	struct sk_buff *new_skb, *skb = *s;
@@ -2220,7 +2220,7 @@ static int dpaa_a050385_wa(struct net_device *net_dev, struct sk_buff **s)
 
 #ifdef CONFIG_DPAA_ERRATUM_A050385
 	if (unlikely(fman_has_errata_a050385())) {
-		if (dpaa_a050385_wa(net_dev, &skb))
+		if (dpaa_a050385_wa_skb(net_dev, &skb))
 			goto enomem;
 		nonlinear = skb_is_nonlinear(skb);
 	}
-- 
1.9.1


^ permalink raw reply related

* [PATCH net-next v4 4/7] dpaa_eth: add XDP_TX support
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

Use an xdp_frame structure for managing the frame. Store a backpointer to
the structure at the start of the buffer before enqueueing for cleanup
on TX confirmation. Reserve DPAA_TX_PRIV_DATA_SIZE bytes from the frame
size shared with the XDP program for this purpose. Use the XDP
API for freeing the buffer when it returns to the driver on the TX
confirmation path.

The frame queues are shared with the netstack.

This approach will be reused for XDP REDIRECT.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
Changes in v4:
- call xdp_rxq_info_is_reg() before unregistering
- minor cleanups (remove unneeded variable, print error code)
- add more details in the commit message
- did not call qman_destroy_fq() in case of xdp_rxq_info_reg() failure
since it would lead to a double free of the fq resources

 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 128 ++++++++++++++++++++++++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |   2 +
 2 files changed, 125 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index ee076f4..0deffcc 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -1130,6 +1130,24 @@ static int dpaa_fq_init(struct dpaa_fq *dpaa_fq, bool td_enable)

 	dpaa_fq->fqid = qman_fq_fqid(fq);

+	if (dpaa_fq->fq_type == FQ_TYPE_RX_DEFAULT ||
+	    dpaa_fq->fq_type == FQ_TYPE_RX_PCD) {
+		err = xdp_rxq_info_reg(&dpaa_fq->xdp_rxq, dpaa_fq->net_dev,
+				       dpaa_fq->fqid);
+		if (err) {
+			dev_err(dev, "xdp_rxq_info_reg() = %d\n", err);
+			return err;
+		}
+
+		err = xdp_rxq_info_reg_mem_model(&dpaa_fq->xdp_rxq,
+						 MEM_TYPE_PAGE_ORDER0, NULL);
+		if (err) {
+			dev_err(dev, "xdp_rxq_info_reg_mem_model() = %d\n", err);
+			xdp_rxq_info_unreg(&dpaa_fq->xdp_rxq);
+			return err;
+		}
+	}
+
 	return 0;
 }

@@ -1159,6 +1177,11 @@ static int dpaa_fq_free_entry(struct device *dev, struct qman_fq *fq)
 		}
 	}

+	if ((dpaa_fq->fq_type == FQ_TYPE_RX_DEFAULT ||
+	     dpaa_fq->fq_type == FQ_TYPE_RX_PCD) &&
+	    xdp_rxq_info_is_reg(&dpaa_fq->xdp_rxq))
+		xdp_rxq_info_unreg(&dpaa_fq->xdp_rxq);
+
 	qman_destroy_fq(fq);
 	list_del(&dpaa_fq->list);

@@ -1625,6 +1648,9 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv *priv)
  *
  * Return the skb backpointer, since for S/G frames the buffer containing it
  * gets freed here.
+ *
+ * No skb backpointer is set when transmitting XDP frames. Cleanup the buffer
+ * and return NULL in this case.
  */
 static struct sk_buff *dpaa_cleanup_tx_fd(const struct dpaa_priv *priv,
 					  const struct qm_fd *fd, bool ts)
@@ -1664,13 +1690,21 @@ static struct sk_buff *dpaa_cleanup_tx_fd(const struct dpaa_priv *priv,
 		}
 	} else {
 		dma_unmap_single(priv->tx_dma_dev, addr,
-				 priv->tx_headroom + qm_fd_get_length(fd),
+				 qm_fd_get_offset(fd) + qm_fd_get_length(fd),
 				 dma_dir);
 	}

 	swbp = (struct dpaa_eth_swbp *)vaddr;
 	skb = swbp->skb;

+	/* No skb backpointer is set when running XDP. An xdp_frame
+	 * backpointer is saved instead.
+	 */
+	if (!skb) {
+		xdp_return_frame(swbp->xdpf);
+		return NULL;
+	}
+
 	/* DMA unmapping is required before accessing the HW provided info */
 	if (ts && priv->tx_tstamp &&
 	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
@@ -2350,11 +2384,76 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
 	return qman_cb_dqrr_consume;
 }

+static int dpaa_xdp_xmit_frame(struct net_device *net_dev,
+			       struct xdp_frame *xdpf)
+{
+	struct dpaa_priv *priv = netdev_priv(net_dev);
+	struct rtnl_link_stats64 *percpu_stats;
+	struct dpaa_percpu_priv *percpu_priv;
+	struct dpaa_eth_swbp *swbp;
+	struct netdev_queue *txq;
+	void *buff_start;
+	struct qm_fd fd;
+	dma_addr_t addr;
+	int err;
+
+	percpu_priv = this_cpu_ptr(priv->percpu_priv);
+	percpu_stats = &percpu_priv->stats;
+
+	if (xdpf->headroom < DPAA_TX_PRIV_DATA_SIZE) {
+		err = -EINVAL;
+		goto out_error;
+	}
+
+	buff_start = xdpf->data - xdpf->headroom;
+
+	/* Leave empty the skb backpointer at the start of the buffer.
+	 * Save the XDP frame for easy cleanup on confirmation.
+	 */
+	swbp = (struct dpaa_eth_swbp *)buff_start;
+	swbp->skb = NULL;
+	swbp->xdpf = xdpf;
+
+	qm_fd_clear_fd(&fd);
+	fd.bpid = FSL_DPAA_BPID_INV;
+	fd.cmd |= cpu_to_be32(FM_FD_CMD_FCO);
+	qm_fd_set_contig(&fd, xdpf->headroom, xdpf->len);
+
+	addr = dma_map_single(priv->tx_dma_dev, buff_start,
+			      xdpf->headroom + xdpf->len,
+			      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(priv->tx_dma_dev, addr))) {
+		err = -EINVAL;
+		goto out_error;
+	}
+
+	qm_fd_addr_set64(&fd, addr);
+
+	/* Bump the trans_start */
+	txq = netdev_get_tx_queue(net_dev, smp_processor_id());
+	txq->trans_start = jiffies;
+
+	err = dpaa_xmit(priv, percpu_stats, smp_processor_id(), &fd);
+	if (err) {
+		dma_unmap_single(priv->tx_dma_dev, addr,
+				 qm_fd_get_offset(&fd) + qm_fd_get_length(&fd),
+				 DMA_TO_DEVICE);
+		goto out_error;
+	}
+
+	return 0;
+
+out_error:
+	percpu_stats->tx_errors++;
+	return err;
+}
+
 static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
-			unsigned int *xdp_meta_len)
+			struct dpaa_fq *dpaa_fq, unsigned int *xdp_meta_len)
 {
 	ssize_t fd_off = qm_fd_get_offset(fd);
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	struct xdp_buff xdp;
 	u32 xdp_act;

@@ -2370,7 +2469,8 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 	xdp.data_meta = xdp.data;
 	xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
 	xdp.data_end = xdp.data + qm_fd_get_length(fd);
-	xdp.frame_sz = DPAA_BP_RAW_SIZE;
+	xdp.frame_sz = DPAA_BP_RAW_SIZE - DPAA_TX_PRIV_DATA_SIZE;
+	xdp.rxq = &dpaa_fq->xdp_rxq;

 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);

@@ -2381,6 +2481,22 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
 	case XDP_PASS:
 		*xdp_meta_len = xdp.data - xdp.data_meta;
 		break;
+	case XDP_TX:
+		/* We can access the full headroom when sending the frame
+		 * back out
+		 */
+		xdp.data_hard_start = vaddr;
+		xdp.frame_sz = DPAA_BP_RAW_SIZE;
+		xdpf = xdp_convert_buff_to_frame(&xdp);
+		if (unlikely(!xdpf)) {
+			free_pages((unsigned long)vaddr, 0);
+			break;
+		}
+
+		if (dpaa_xdp_xmit_frame(priv->net_dev, xdpf))
+			xdp_return_frame_rx_napi(xdpf);
+
+		break;
 	default:
 		bpf_warn_invalid_xdp_action(xdp_act);
 		fallthrough;
@@ -2415,6 +2531,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	u32 fd_status, hash_offset;
 	struct qm_sg_entry *sgt;
 	struct dpaa_bp *dpaa_bp;
+	struct dpaa_fq *dpaa_fq;
 	struct dpaa_priv *priv;
 	struct sk_buff *skb;
 	int *count_ptr;
@@ -2423,9 +2540,10 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	u32 hash;
 	u64 ns;

+	dpaa_fq = container_of(fq, struct dpaa_fq, fq_base);
 	fd_status = be32_to_cpu(fd->status);
 	fd_format = qm_fd_get_format(fd);
-	net_dev = ((struct dpaa_fq *)fq)->net_dev;
+	net_dev = dpaa_fq->net_dev;
 	priv = netdev_priv(net_dev);
 	dpaa_bp = dpaa_bpid2pool(dq->fd.bpid);
 	if (!dpaa_bp)
@@ -2494,7 +2612,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,

 	if (likely(fd_format == qm_fd_contig)) {
 		xdp_act = dpaa_run_xdp(priv, (struct qm_fd *)fd, vaddr,
-				       &xdp_meta_len);
+				       dpaa_fq, &xdp_meta_len);
 		if (xdp_act != XDP_PASS) {
 			percpu_stats->rx_packets++;
 			percpu_stats->rx_bytes += qm_fd_get_length(fd);
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
index 94e8613..5c8d52a 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
@@ -68,6 +68,7 @@ struct dpaa_fq {
 	u16 channel;
 	u8 wq;
 	enum dpaa_fq_type fq_type;
+	struct xdp_rxq_info xdp_rxq;
 };

 struct dpaa_fq_cbs {
@@ -150,6 +151,7 @@ struct dpaa_buffer_layout {
  */
 struct dpaa_eth_swbp {
 	struct sk_buff *skb;
+	struct xdp_frame *xdpf;
 };

 struct dpaa_priv {
--
1.9.1


^ permalink raw reply related

* [PATCH net-next v4 2/7] dpaa_eth: add basic XDP support
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

Implement the XDP_DROP and XDP_PASS actions.

Avoid draining and reconfiguring the buffer pool at each XDP
setup/teardown by increasing the frame headroom and reserving
XDP_PACKET_HEADROOM bytes from the start. Since we always reserve an
entire page per buffer, this change only impacts Jumbo frame scenarios
where the maximum linear frame size is reduced by 256 bytes. Multi
buffer Scatter/Gather frames are now used instead in these scenarios.

Allow XDP programs to access the entire buffer.

The data in the received frame's headroom can be overwritten by the XDP
program. Extract the relevant fields from the headroom while they are
still available, before running the XDP program.

Since the headroom might be resized before the frame is passed up to the
stack, remove the check for a fixed headroom value when building an skb.

Allow the meta data to be updated and pass the information up the stack.

Scatter/Gather frames are dropped when XDP is enabled.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
Changes in v2:
- warn only once if extracting the timestamp from a received frame fails

Changes in v3:
- drop received S/G frames when XDP is enabled

Changes in v4:
- report a warning if the MTU is too hight for running XDP
- report an error if opening the device fails in the XDP setup

 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 166 ++++++++++++++++++++++---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |   2 +
 2 files changed, 152 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 88533a2..8acce62 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -53,6 +53,8 @@
 #include <linux/dma-mapping.h>
 #include <linux/sort.h>
 #include <linux/phy_fixed.h>
+#include <linux/bpf.h>
+#include <linux/bpf_trace.h>
 #include <soc/fsl/bman.h>
 #include <soc/fsl/qman.h>
 #include "fman.h"
@@ -177,7 +179,7 @@
 #define DPAA_HWA_SIZE (DPAA_PARSE_RESULTS_SIZE + DPAA_TIME_STAMP_SIZE \
 		       + DPAA_HASH_RESULTS_SIZE)
 #define DPAA_RX_PRIV_DATA_DEFAULT_SIZE (DPAA_TX_PRIV_DATA_SIZE + \
-					dpaa_rx_extra_headroom)
+					XDP_PACKET_HEADROOM - DPAA_HWA_SIZE)
 #ifdef CONFIG_DPAA_ERRATUM_A050385
 #define DPAA_RX_PRIV_DATA_A050385_SIZE (DPAA_A050385_ALIGN - DPAA_HWA_SIZE)
 #define DPAA_RX_PRIV_DATA_SIZE (fman_has_errata_a050385() ? \
@@ -1733,7 +1735,6 @@ static struct sk_buff *contig_fd_to_skb(const struct dpaa_priv *priv,
 			SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
 	if (WARN_ONCE(!skb, "Build skb failure on Rx\n"))
 		goto free_buffer;
-	WARN_ON(fd_off != priv->rx_headroom);
 	skb_reserve(skb, fd_off);
 	skb_put(skb, qm_fd_get_length(fd));

@@ -2349,12 +2350,62 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
 	return qman_cb_dqrr_consume;
 }

+static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
+			unsigned int *xdp_meta_len)
+{
+	ssize_t fd_off = qm_fd_get_offset(fd);
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	u32 xdp_act;
+
+	rcu_read_lock();
+
+	xdp_prog = READ_ONCE(priv->xdp_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		return XDP_PASS;
+	}
+
+	xdp.data = vaddr + fd_off;
+	xdp.data_meta = xdp.data;
+	xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
+	xdp.data_end = xdp.data + qm_fd_get_length(fd);
+	xdp.frame_sz = DPAA_BP_RAW_SIZE;
+
+	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	/* Update the length and the offset of the FD */
+	qm_fd_set_contig(fd, xdp.data - vaddr, xdp.data_end - xdp.data);
+
+	switch (xdp_act) {
+	case XDP_PASS:
+		*xdp_meta_len = xdp.data - xdp.data_meta;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(xdp_act);
+		fallthrough;
+	case XDP_ABORTED:
+		trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act);
+		fallthrough;
+	case XDP_DROP:
+		/* Free the buffer */
+		free_pages((unsigned long)vaddr, 0);
+		break;
+	}
+
+	rcu_read_unlock();
+
+	return xdp_act;
+}
+
 static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 						struct qman_fq *fq,
 						const struct qm_dqrr_entry *dq,
 						bool sched_napi)
 {
+	bool ts_valid = false, hash_valid = false;
 	struct skb_shared_hwtstamps *shhwtstamps;
+	unsigned int skb_len, xdp_meta_len = 0;
 	struct rtnl_link_stats64 *percpu_stats;
 	struct dpaa_percpu_priv *percpu_priv;
 	const struct qm_fd *fd = &dq->fd;
@@ -2362,12 +2413,14 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	enum qm_fd_format fd_format;
 	struct net_device *net_dev;
 	u32 fd_status, hash_offset;
+	struct qm_sg_entry *sgt;
 	struct dpaa_bp *dpaa_bp;
 	struct dpaa_priv *priv;
-	unsigned int skb_len;
 	struct sk_buff *skb;
 	int *count_ptr;
+	u32 xdp_act;
 	void *vaddr;
+	u32 hash;
 	u64 ns;

 	fd_status = be32_to_cpu(fd->status);
@@ -2423,35 +2476,67 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
 	count_ptr = this_cpu_ptr(dpaa_bp->percpu_count);
 	(*count_ptr)--;

-	if (likely(fd_format == qm_fd_contig))
+	/* Extract the timestamp stored in the headroom before running XDP */
+	if (priv->rx_tstamp) {
+		if (!fman_port_get_tstamp(priv->mac_dev->port[RX], vaddr, &ns))
+			ts_valid = true;
+		else
+			WARN_ONCE(1, "fman_port_get_tstamp failed!\n");
+	}
+
+	/* Extract the hash stored in the headroom before running XDP */
+	if (net_dev->features & NETIF_F_RXHASH && priv->keygen_in_use &&
+	    !fman_port_get_hash_result_offset(priv->mac_dev->port[RX],
+					      &hash_offset)) {
+		hash = be32_to_cpu(*(u32 *)(vaddr + hash_offset));
+		hash_valid = true;
+	}
+
+	if (likely(fd_format == qm_fd_contig)) {
+		xdp_act = dpaa_run_xdp(priv, (struct qm_fd *)fd, vaddr,
+				       &xdp_meta_len);
+		if (xdp_act != XDP_PASS) {
+			percpu_stats->rx_packets++;
+			percpu_stats->rx_bytes += qm_fd_get_length(fd);
+			return qman_cb_dqrr_consume;
+		}
 		skb = contig_fd_to_skb(priv, fd);
-	else
+	} else {
+		/* XDP doesn't support S/G frames. Return the fragments to the
+		 * buffer pool and release the SGT.
+		 */
+		if (READ_ONCE(priv->xdp_prog)) {
+			WARN_ONCE(1, "S/G frames not supported under XDP\n");
+			sgt = vaddr + qm_fd_get_offset(fd);
+			dpaa_release_sgt_members(sgt);
+			free_pages((unsigned long)vaddr, 0);
+			return qman_cb_dqrr_consume;
+		}
 		skb = sg_fd_to_skb(priv, fd);
+	}
 	if (!skb)
 		return qman_cb_dqrr_consume;

-	if (priv->rx_tstamp) {
+	if (xdp_meta_len)
+		skb_metadata_set(skb, xdp_meta_len);
+
+	/* Set the previously extracted timestamp */
+	if (ts_valid) {
 		shhwtstamps = skb_hwtstamps(skb);
 		memset(shhwtstamps, 0, sizeof(*shhwtstamps));
-
-		if (!fman_port_get_tstamp(priv->mac_dev->port[RX], vaddr, &ns))
-			shhwtstamps->hwtstamp = ns_to_ktime(ns);
-		else
-			dev_warn(net_dev->dev.parent, "fman_port_get_tstamp failed!\n");
+		shhwtstamps->hwtstamp = ns_to_ktime(ns);
 	}

 	skb->protocol = eth_type_trans(skb, net_dev);

-	if (net_dev->features & NETIF_F_RXHASH && priv->keygen_in_use &&
-	    !fman_port_get_hash_result_offset(priv->mac_dev->port[RX],
-					      &hash_offset)) {
+	/* Set the previously extracted hash */
+	if (hash_valid) {
 		enum pkt_hash_types type;

 		/* if L4 exists, it was used in the hash generation */
 		type = be32_to_cpu(fd->status) & FM_FD_STAT_L4CV ?
 			PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3;
-		skb_set_hash(skb, be32_to_cpu(*(u32 *)(vaddr + hash_offset)),
-			     type);
+		skb_set_hash(skb, hash, type);
 	}

 	skb_len = skb->len;
@@ -2671,6 +2756,54 @@ static int dpaa_eth_stop(struct net_device *net_dev)
 	return err;
 }

+static int dpaa_setup_xdp(struct net_device *net_dev, struct bpf_prog *prog)
+{
+	struct dpaa_priv *priv = netdev_priv(net_dev);
+	struct bpf_prog *old_prog;
+	int err, max_contig_data;
+	bool up;
+
+	max_contig_data = priv->dpaa_bp->size - priv->rx_headroom;
+
+	/* S/G fragments are not supported in XDP-mode */
+	if (prog &&
+	    (net_dev->mtu + VLAN_ETH_HLEN + ETH_FCS_LEN > max_contig_data)) {
+		dev_warn(net_dev->dev.parent,
+			 "The maximum MTU for XDP is %d\n",
+			 max_contig_data - VLAN_ETH_HLEN - ETH_FCS_LEN);
+		return -EINVAL;
+	}
+
+	up = netif_running(net_dev);
+
+	if (up)
+		dpaa_eth_stop(net_dev);
+
+	old_prog = xchg(&priv->xdp_prog, prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	if (up) {
+		err = dpaa_open(net_dev);
+		if (err) {
+			dev_err(net_dev->dev.parent, "dpaa_open() failed\n");
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+static int dpaa_xdp(struct net_device *net_dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return dpaa_setup_xdp(net_dev, xdp->prog);
+	default:
+		return -EINVAL;
+	}
+}
+
 static int dpaa_ts_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
 	struct dpaa_priv *priv = netdev_priv(dev);
@@ -2737,6 +2870,7 @@ static int dpaa_ioctl(struct net_device *net_dev, struct ifreq *rq, int cmd)
 	.ndo_set_rx_mode = dpaa_set_rx_mode,
 	.ndo_do_ioctl = dpaa_ioctl,
 	.ndo_setup_tc = dpaa_setup_tc,
+	.ndo_bpf = dpaa_xdp,
 };

 static int dpaa_napi_add(struct net_device *net_dev)
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
index da30e5d..94e8613 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
@@ -196,6 +196,8 @@ struct dpaa_priv {

 	bool tx_tstamp; /* Tx timestamping enabled */
 	bool rx_tstamp; /* Rx timestamping enabled */
+
+	struct bpf_prog *xdp_prog;
 };

 /* from dpaa_ethtool.c */
--
1.9.1


^ permalink raw reply related

* [PATCH net-next v4 3/7] dpaa_eth: limit the possible MTU range when XDP is enabled
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

Implement the ndo_change_mtu callback to prevent users from setting an
MTU that would permit processing of S/G frames. The maximum MTU size
is dependent on the buffer size.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 40 ++++++++++++++++++++------
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 8acce62..ee076f4 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2756,23 +2756,44 @@ static int dpaa_eth_stop(struct net_device *net_dev)
 	return err;
 }
 
+static bool xdp_validate_mtu(struct dpaa_priv *priv, int mtu)
+{
+	int max_contig_data = priv->dpaa_bp->size - priv->rx_headroom;
+
+	/* We do not support S/G fragments when XDP is enabled.
+	 * Limit the MTU in relation to the buffer size.
+	 */
+	if (mtu + VLAN_ETH_HLEN + ETH_FCS_LEN > max_contig_data) {
+		dev_warn(priv->net_dev->dev.parent,
+			 "The maximum MTU for XDP is %d\n",
+			 max_contig_data - VLAN_ETH_HLEN - ETH_FCS_LEN);
+		return false;
+	}
+
+	return true;
+}
+
+static int dpaa_change_mtu(struct net_device *net_dev, int new_mtu)
+{
+	struct dpaa_priv *priv = netdev_priv(net_dev);
+
+	if (priv->xdp_prog && !xdp_validate_mtu(priv, new_mtu))
+		return -EINVAL;
+
+	net_dev->mtu = new_mtu;
+	return 0;
+}
+
 static int dpaa_setup_xdp(struct net_device *net_dev, struct bpf_prog *prog)
 {
 	struct dpaa_priv *priv = netdev_priv(net_dev);
 	struct bpf_prog *old_prog;
-	int err, max_contig_data;
+	int err;
 	bool up;
 
-	max_contig_data = priv->dpaa_bp->size - priv->rx_headroom;
-
 	/* S/G fragments are not supported in XDP-mode */
-	if (prog &&
-	    (net_dev->mtu + VLAN_ETH_HLEN + ETH_FCS_LEN > max_contig_data)) {
-		dev_warn(net_dev->dev.parent,
-			 "The maximum MTU for XDP is %d\n",
-			 max_contig_data - VLAN_ETH_HLEN - ETH_FCS_LEN);
+	if (prog && !xdp_validate_mtu(priv, net_dev->mtu))
 		return -EINVAL;
-	}
 
 	up = netif_running(net_dev);
 
@@ -2870,6 +2891,7 @@ static int dpaa_ioctl(struct net_device *net_dev, struct ifreq *rq, int cmd)
 	.ndo_set_rx_mode = dpaa_set_rx_mode,
 	.ndo_do_ioctl = dpaa_ioctl,
 	.ndo_setup_tc = dpaa_setup_tc,
+	.ndo_change_mtu = dpaa_change_mtu,
 	.ndo_bpf = dpaa_xdp,
 };
 
-- 
1.9.1


^ permalink raw reply related

* [PATCH net-next v4 0/7] dpaa_eth: add XDP support
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza

Enable XDP support for the QorIQ DPAA1 platforms.

Implement all the current actions (DROP, ABORTED, PASS, TX, REDIRECT). No
Tx batching is added at this time.

Additional XDP_PACKET_HEADROOM bytes are reserved in each frame's headroom.

After transmit, a reference to the xdp_frame is saved in the buffer for
clean-up on confirmation in a newly created structure for software
annotations. DPAA_TX_PRIV_DATA_SIZE bytes are reserved in the buffer for
storing this structure and the XDP program is restricted from accessing
them.

The driver shares the egress frame queues used for XDP with the network
stack. The DPAA driver is a LLTX driver so no explicit locking is required
on transmission.

Changes in v2:
- warn only once if extracting the timestamp from a received frame fails
  in 2/7

Changes in v3:
- drop received S/G frames when XDP is enabled in 2/7

Changes in v4:
- report a warning if the MTU is too hight for running XDP in 2/7
- report an error if opening the device fails in the XDP setup in 2/7
- call xdp_rxq_info_is_reg() before unregistering in 4/7
- minor cleanups (remove unneeded variable, print error code) in 4/7
- add more details in the commit message in 4/7
- did not call qman_destroy_fq() in case of xdp_rxq_info_reg() failure
since it would lead to a double free of the fq resources in 4/7

Camelia Groza (7):
  dpaa_eth: add struct for software backpointers
  dpaa_eth: add basic XDP support
  dpaa_eth: limit the possible MTU range when XDP is enabled
  dpaa_eth: add XDP_TX support
  dpaa_eth: add XDP_REDIRECT support
  dpaa_eth: rename current skb A050385 erratum workaround
  dpaa_eth: implement the A050385 erratum workaround for XDP

 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 458 +++++++++++++++++++++++--
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |  13 +
 2 files changed, 441 insertions(+), 30 deletions(-)

--
1.9.1

^ permalink raw reply

* [PATCH net-next v4 1/7] dpaa_eth: add struct for software backpointers
From: Camelia Groza @ 2020-11-23 17:36 UTC (permalink / raw)
  To: kuba, maciej.fijalkowski, brouer, saeed, davem
  Cc: madalin.bucur, ioana.ciornei, netdev, Camelia Groza
In-Reply-To: <cover.1606150838.git.camelia.groza@nxp.com>

We maintain an skb backpointer in the software annotations area of Tx
frames. Introduce a structure for explicit handling.

Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 16 +++++++++-------
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |  8 ++++++++
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 8867693..88533a2 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -1633,6 +1633,7 @@ static struct sk_buff *dpaa_cleanup_tx_fd(const struct dpaa_priv *priv,
 	dma_addr_t addr = qm_fd_addr(fd);
 	void *vaddr = phys_to_virt(addr);
 	const struct qm_sg_entry *sgt;
+	struct dpaa_eth_swbp *swbp;
 	struct sk_buff *skb;
 	u64 ns;
 	int i;
@@ -1665,7 +1666,8 @@ static struct sk_buff *dpaa_cleanup_tx_fd(const struct dpaa_priv *priv,
 				 dma_dir);
 	}
 
-	skb = *(struct sk_buff **)vaddr;
+	swbp = (struct dpaa_eth_swbp *)vaddr;
+	skb = swbp->skb;
 
 	/* DMA unmapping is required before accessing the HW provided info */
 	if (ts && priv->tx_tstamp &&
@@ -1879,8 +1881,8 @@ static int skb_to_contig_fd(struct dpaa_priv *priv,
 {
 	struct net_device *net_dev = priv->net_dev;
 	enum dma_data_direction dma_dir;
+	struct dpaa_eth_swbp *swbp;
 	unsigned char *buff_start;
-	struct sk_buff **skbh;
 	dma_addr_t addr;
 	int err;
 
@@ -1891,8 +1893,8 @@ static int skb_to_contig_fd(struct dpaa_priv *priv,
 	buff_start = skb->data - priv->tx_headroom;
 	dma_dir = DMA_TO_DEVICE;
 
-	skbh = (struct sk_buff **)buff_start;
-	*skbh = skb;
+	swbp = (struct dpaa_eth_swbp *)buff_start;
+	swbp->skb = skb;
 
 	/* Enable L3/L4 hardware checksum computation.
 	 *
@@ -1931,8 +1933,8 @@ static int skb_to_sg_fd(struct dpaa_priv *priv,
 	const enum dma_data_direction dma_dir = DMA_TO_DEVICE;
 	const int nr_frags = skb_shinfo(skb)->nr_frags;
 	struct net_device *net_dev = priv->net_dev;
+	struct dpaa_eth_swbp *swbp;
 	struct qm_sg_entry *sgt;
-	struct sk_buff **skbh;
 	void *buff_start;
 	skb_frag_t *frag;
 	dma_addr_t addr;
@@ -2005,8 +2007,8 @@ static int skb_to_sg_fd(struct dpaa_priv *priv,
 	qm_fd_set_sg(fd, priv->tx_headroom, skb->len);
 
 	/* DMA map the SGT page */
-	skbh = (struct sk_buff **)buff_start;
-	*skbh = skb;
+	swbp = (struct dpaa_eth_swbp *)buff_start;
+	swbp->skb = skb;
 
 	addr = dma_map_page(priv->tx_dma_dev, p, 0,
 			    priv->tx_headroom + DPAA_SGT_SIZE, dma_dir);
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
index fc2cc4c..da30e5d 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
@@ -144,6 +144,14 @@ struct dpaa_buffer_layout {
 	u16 priv_data_size;
 };
 
+/* Information to be used on the Tx confirmation path. Stored just
+ * before the start of the transmit buffer. Maximum size allowed
+ * is DPAA_TX_PRIV_DATA_SIZE bytes.
+ */
+struct dpaa_eth_swbp {
+	struct sk_buff *skb;
+};
+
 struct dpaa_priv {
 	struct dpaa_percpu_priv __percpu *percpu_priv;
 	struct dpaa_bp *dpaa_bp;
-- 
1.9.1


^ permalink raw reply related

* Re: memory leak in inet_create (2)
From: syzbot @ 2020-11-23 17:31 UTC (permalink / raw)
  To: andrii, andriin, ast, bpf, daniel, davem, john.fastabend, kafai,
	kpsingh, kuba, kuznet, linux-kernel, netdev, songliubraving,
	syzkaller-bugs, yhs, yoshfuji
In-Reply-To: <0000000000007bf88805a445f729@google.com>

syzbot has found a reproducer for the following issue on:

HEAD commit:    418baf2c Linux 5.10-rc5
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=161c84ed500000
kernel config:  https://syzkaller.appspot.com/x/.config?x=5524c10373633a9c
dashboard link: https://syzkaller.appspot.com/bug?extid=bb7ba8dd62c3cb6e3c78
compiler:       gcc (GCC) 10.1.0-syz 20200507
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1514cfa3500000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11a52fc1500000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+bb7ba8dd62c3cb6e3c78@syzkaller.appspotmail.com

executing program
executing program
executing program
BUG: memory leak
unreferenced object 0xffff88810e85adc0 (size 1728):
  comm "syz-executor376", pid 8506, jiffies 4294946899 (age 13.430s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<00000000cb2829d9>] sk_prot_alloc+0x3e/0x1c0 net/core/sock.c:1660
    [<0000000023bd8ef8>] sk_alloc+0x30/0x3f0 net/core/sock.c:1720
    [<00000000a4a7ed0a>] inet_create net/ipv4/af_inet.c:322 [inline]
    [<00000000a4a7ed0a>] inet_create+0x16a/0x560 net/ipv4/af_inet.c:248
    [<000000003b729101>] __sock_create+0x1ab/0x2b0 net/socket.c:1427
    [<00000000ebee6fd5>] sock_create net/socket.c:1478 [inline]
    [<00000000ebee6fd5>] __sys_socket+0x6f/0x140 net/socket.c:1520
    [<00000000bcf20e68>] __do_sys_socket net/socket.c:1529 [inline]
    [<00000000bcf20e68>] __se_sys_socket net/socket.c:1527 [inline]
    [<00000000bcf20e68>] __x64_sys_socket+0x1a/0x20 net/socket.c:1527
    [<00000000732fe45a>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    [<0000000091e76b15>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88810fec3c80 (size 768):
  comm "syz-executor376", pid 8506, jiffies 4294946899 (age 13.430s)
  hex dump (first 32 bytes):
    01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c0 72 a0 0e 81 88 ff ff  .........r......
  backtrace:
    [<00000000681cd6ae>] sock_alloc_inode+0x18/0x90 net/socket.c:253
    [<00000000fa9d2004>] alloc_inode+0x27/0x100 fs/inode.c:234
    [<00000000f3a018c7>] new_inode_pseudo+0x13/0x70 fs/inode.c:930
    [<00000000549f715a>] sock_alloc+0x18/0x90 net/socket.c:573
    [<00000000a044e0d4>] __sock_create+0xb8/0x2b0 net/socket.c:1391
    [<00000000973ca39c>] mptcp_subflow_create_socket+0x57/0x280 net/mptcp/subflow.c:1152
    [<00000000a3724864>] __mptcp_socket_create net/mptcp/protocol.c:97 [inline]
    [<00000000a3724864>] mptcp_init_sock net/mptcp/protocol.c:1859 [inline]
    [<00000000a3724864>] mptcp_init_sock+0x12f/0x270 net/mptcp/protocol.c:1844
    [<00000000c97baf32>] inet_create net/ipv4/af_inet.c:380 [inline]
    [<00000000c97baf32>] inet_create+0x2ed/0x560 net/ipv4/af_inet.c:248
    [<000000003b729101>] __sock_create+0x1ab/0x2b0 net/socket.c:1427
    [<00000000ebee6fd5>] sock_create net/socket.c:1478 [inline]
    [<00000000ebee6fd5>] __sys_socket+0x6f/0x140 net/socket.c:1520
    [<00000000bcf20e68>] __do_sys_socket net/socket.c:1529 [inline]
    [<00000000bcf20e68>] __se_sys_socket net/socket.c:1527 [inline]
    [<00000000bcf20e68>] __x64_sys_socket+0x1a/0x20 net/socket.c:1527
    [<00000000732fe45a>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    [<0000000091e76b15>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88810de87bb8 (size 24):
  comm "syz-executor376", pid 8506, jiffies 4294946899 (age 13.430s)
  hex dump (first 24 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00                          ........
  backtrace:
    [<00000000bea9ec8c>] kmem_cache_zalloc include/linux/slab.h:654 [inline]
    [<00000000bea9ec8c>] lsm_inode_alloc security/security.c:589 [inline]
    [<00000000bea9ec8c>] security_inode_alloc+0x2a/0xb0 security/security.c:972
    [<00000000543365c5>] inode_init_always+0x10c/0x250 fs/inode.c:171
    [<000000004da5c777>] alloc_inode+0x44/0x100 fs/inode.c:241
    [<00000000f3a018c7>] new_inode_pseudo+0x13/0x70 fs/inode.c:930
    [<00000000549f715a>] sock_alloc+0x18/0x90 net/socket.c:573
    [<00000000a044e0d4>] __sock_create+0xb8/0x2b0 net/socket.c:1391
    [<00000000973ca39c>] mptcp_subflow_create_socket+0x57/0x280 net/mptcp/subflow.c:1152
    [<00000000a3724864>] __mptcp_socket_create net/mptcp/protocol.c:97 [inline]
    [<00000000a3724864>] mptcp_init_sock net/mptcp/protocol.c:1859 [inline]
    [<00000000a3724864>] mptcp_init_sock+0x12f/0x270 net/mptcp/protocol.c:1844
    [<00000000c97baf32>] inet_create net/ipv4/af_inet.c:380 [inline]
    [<00000000c97baf32>] inet_create+0x2ed/0x560 net/ipv4/af_inet.c:248
    [<000000003b729101>] __sock_create+0x1ab/0x2b0 net/socket.c:1427
    [<00000000ebee6fd5>] sock_create net/socket.c:1478 [inline]
    [<00000000ebee6fd5>] __sys_socket+0x6f/0x140 net/socket.c:1520
    [<00000000bcf20e68>] __do_sys_socket net/socket.c:1529 [inline]
    [<00000000bcf20e68>] __se_sys_socket net/socket.c:1527 [inline]
    [<00000000bcf20e68>] __x64_sys_socket+0x1a/0x20 net/socket.c:1527
    [<00000000732fe45a>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    [<0000000091e76b15>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88810ea072c0 (size 2208):
  comm "syz-executor376", pid 8506, jiffies 4294946899 (age 13.430s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<00000000cb2829d9>] sk_prot_alloc+0x3e/0x1c0 net/core/sock.c:1660
    [<0000000023bd8ef8>] sk_alloc+0x30/0x3f0 net/core/sock.c:1720
    [<00000000a4a7ed0a>] inet_create net/ipv4/af_inet.c:322 [inline]
    [<00000000a4a7ed0a>] inet_create+0x16a/0x560 net/ipv4/af_inet.c:248
    [<000000003b729101>] __sock_create+0x1ab/0x2b0 net/socket.c:1427
    [<00000000973ca39c>] mptcp_subflow_create_socket+0x57/0x280 net/mptcp/subflow.c:1152
    [<00000000a3724864>] __mptcp_socket_create net/mptcp/protocol.c:97 [inline]
    [<00000000a3724864>] mptcp_init_sock net/mptcp/protocol.c:1859 [inline]
    [<00000000a3724864>] mptcp_init_sock+0x12f/0x270 net/mptcp/protocol.c:1844
    [<00000000c97baf32>] inet_create net/ipv4/af_inet.c:380 [inline]
    [<00000000c97baf32>] inet_create+0x2ed/0x560 net/ipv4/af_inet.c:248
    [<000000003b729101>] __sock_create+0x1ab/0x2b0 net/socket.c:1427
    [<00000000ebee6fd5>] sock_create net/socket.c:1478 [inline]
    [<00000000ebee6fd5>] __sys_socket+0x6f/0x140 net/socket.c:1520
    [<00000000bcf20e68>] __do_sys_socket net/socket.c:1529 [inline]
    [<00000000bcf20e68>] __se_sys_socket net/socket.c:1527 [inline]
    [<00000000bcf20e68>] __x64_sys_socket+0x1a/0x20 net/socket.c:1527
    [<00000000732fe45a>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    [<0000000091e76b15>] entry_SYSCALL_64_after_hwframe+0x44/0xa9



^ permalink raw reply

* Re: [EXT] Re: [PATCH v1] net: mvpp2: divide fifo for dts-active ports only
From: Russell King - ARM Linux admin @ 2020-11-23 17:30 UTC (permalink / raw)
  To: Stefan Chulski
  Cc: netdev@vger.kernel.org, thomas.petazzoni@bootlin.com,
	davem@davemloft.net, Nadav Haklai, Yan Markman,
	linux-kernel@vger.kernel.org, kuba@kernel.org, mw@semihalf.com,
	andrew@lunn.ch
In-Reply-To: <CO6PR18MB3873FC445787E395CCB710E4B0FC0@CO6PR18MB3873.namprd18.prod.outlook.com>

On Mon, Nov 23, 2020 at 04:03:00PM +0000, Stefan Chulski wrote:
> > -----Original Message-----
> > From: Russell King - ARM Linux admin <linux@armlinux.org.uk>
> > Sent: Monday, November 23, 2020 5:52 PM
> > To: Stefan Chulski <stefanc@marvell.com>
> > Cc: netdev@vger.kernel.org; thomas.petazzoni@bootlin.com;
> > davem@davemloft.net; Nadav Haklai <nadavh@marvell.com>; Yan Markman
> > <ymarkman@marvell.com>; linux-kernel@vger.kernel.org; kuba@kernel.org;
> > mw@semihalf.com; andrew@lunn.ch
> > Subject: Re: [EXT] Re: [PATCH v1] net: mvpp2: divide fifo for dts-active ports
> > only
> > 
> > On Mon, Nov 23, 2020 at 03:44:05PM +0000, Stefan Chulski wrote:
> > > Yes, but this allocation exists also in current code.
> > > From HW point of view(MAC and PPv2) maximum supported speed in CP110:
> > > port 0 - 10G, port 1 - 2.5G, port 2 - 2.5G.
> > > in CP115: port 0 - 10G, port 1 - 5G, port 2 - 2.5G.
> > >
> > > So this allocation looks correct at least for CP115.
> > > Problem that we cannot reallocate FIFO during runtime, after specific speed
> > negotiation.
> > 
> > We could do much better. DT has a "max-speed" property for ethernet
> > controllers. If we have that property, then I think we should use that to
> > determine the initialisation time FIFO allocation.
> > 
> > As I say, on Macchiatobin, the allocations we end up with are just crazy when
> > you consider the port speeds that the hardware supports.
> > Maybe that should be done as a follow-on patch - but I think it needs to be
> > done.
> 
> I agree with you. We can use "max-speed" for better FIFO allocations.
> I plan to upstream more fixes from the "Marvell" devel branch then I can prepare this patch.
> So you OK with this patch and then follow-on improvement?

Yes - but I would like to see the commit description say that this
results in no change the situation where all three ports are in use.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [kbuild-all] Re: [net-next,v2,4/5] seg6: add support for the SRv6 End.DT4 behavior
From: Jakub Kicinski @ 2020-11-23 17:19 UTC (permalink / raw)
  To: Rong Chen
  Cc: kernel test robot, Andrea Mayer, David S. Miller, David Ahern,
	Alexey Kuznetsov, Hideaki YOSHIFUJI, Shuah Khan,
	Shrijeet Mukherjee, Alexei Starovoitov, Daniel Borkmann,
	kbuild-all, clang-built-linux, netdev
In-Reply-To: <9242894f-b831-067a-48d8-2f235963dedb@intel.com>

On Mon, 23 Nov 2020 09:13:49 +0800 Rong Chen wrote:
> Sorry for the inconvenience, we have optimized the build bot to avoid 
> this situation.

Great to hear, thank you!

^ permalink raw reply

* Re: [RFC] MAINTAINERS tag for cleanup robot
From: Tom Rix @ 2020-11-23 17:06 UTC (permalink / raw)
  To: Joe Perches, clang-built-linux
  Cc: linux-hyperv, linux-kernel, xen-devel, tboot-devel, kvm,
	linux-crypto, linux-acpi, devel, amd-gfx, dri-devel, intel-gfx,
	netdev, linux-media, MPT-FusionLinux.pdl, linux-scsi,
	linux-wireless, ibm-acpi-devel, platform-driver-x86, linux-usb,
	linux-omap, linux-fbdev, ecryptfs, linux-fsdevel, cluster-devel,
	linux-mtd, keyrings, netfilter-devel, coreteam, alsa-devel, bpf,
	linux-bluetooth, linux-nfs, patches
In-Reply-To: <859bae8ddae3238116824192f6ddf1c91a381913.camel@perches.com>


On 11/22/20 10:22 AM, Joe Perches wrote:
> On Sun, 2020-11-22 at 08:33 -0800, Tom Rix wrote:
>> On 11/21/20 9:10 AM, Joe Perches wrote:
>>> On Sat, 2020-11-21 at 08:50 -0800, trix@redhat.com wrote:
>>>> A difficult part of automating commits is composing the subsystem
>>>> preamble in the commit log.  For the ongoing effort of a fixer producing
>>>> one or two fixes a release the use of 'treewide:' does not seem appropriate.
>>>>
>>>> It would be better if the normal prefix was used.  Unfortunately normal is
>>>> not consistent across the tree.
>>>>
>>>> So I am looking for comments for adding a new tag to the MAINTAINERS file
>>>>
>>>> 	D: Commit subsystem prefix
>>>>
>>>> ex/ for FPGA DFL DRIVERS
>>>>
>>>> 	D: fpga: dfl:
>>> I'm all for it.  Good luck with the effort.  It's not completely trivial.
>>>
>>> From a decade ago:
>>>
>>> https://lore.kernel.org/lkml/1289919077.28741.50.camel@Joe-Laptop/
>>>
>>> (and that thread started with extra semicolon patches too)
>> Reading the history, how about this.
>>
>> get_maintainer.pl outputs a single prefix, if multiple files have the
>> same prefix it works, if they don't its an error.
>>
>> Another script 'commit_one_file.sh' does the call to get_mainainter.pl
>> to get the prefix and be called by run-clang-tools.py to get the fixer
>> specific message.
> It's not whether the script used is get_maintainer or any other script,
> the question is really if the MAINTAINERS file is the appropriate place
> to store per-subsystem patch specific prefixes.
>
> It is.
>
> Then the question should be how are the forms described and what is the
> inheritance priority.  My preference would be to have a default of
> inherit the parent base and add basename(subsystem dirname).
>
> Commit history seems to have standardized on using colons as the separator
> between the commit prefix and the subject.
>
> A good mechanism to explore how various subsystems have uses prefixes in
> the past might be something like:
>
> $ git log --no-merges --pretty='%s' -<commit_count> <subsystem_path> | \
>   perl -n -e 'print substr($_, 0, rindex($_, ":") + 1) . "\n";' | \
>   sort | uniq -c | sort -rn

Thanks, I have shamelessly stolen this line and limited the commits to the maintainer.

I will post something once the generation of the prefixes is done.

Tom


^ permalink raw reply

* Re: [PATCH] compiler_attribute: remove CONFIG_ENABLE_MUST_CHECK
From: Miguel Ojeda @ 2020-11-23 17:00 UTC (permalink / raw)
  To: Masahiro Yamada
  Cc: Jason A. Donenfeld, Shuah Khan, linux-kernel,
	open list:KERNEL SELFTEST FRAMEWORK, Network Development,
	wireguard, clang-built-linux, Nick Desaulniers
In-Reply-To: <CAK7LNARjU5HTcTjJG1-sQTJBFqohC1O8aAvFs3Hn_sXscH_pdg@mail.gmail.com>

On Mon, Nov 23, 2020 at 4:37 AM Masahiro Yamada <masahiroy@kernel.org> wrote:
>
> I can move it to compiler_attribute.h
>
> This attribute is supported by gcc, clang, and icc.
> https://godbolt.org/z/ehd6so
>
> I can send v2.
>
> I do not mind renaming, but it should be done in a separate patch.

Of course -- sorry, I didn't mean we had to do them now, i.e. we can
do the move and the rename later on, if you prefer.

Cheers,
Miguel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox