Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] tcp: make sure EPOLLOUT wont be missed
From: Jason Baron @ 2019-08-19 18:40 UTC (permalink / raw)
  To: Eric Dumazet, Eric Dumazet, David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Vladimir Rutsky
In-Reply-To: <c31d7c04-9d6f-aefb-500e-5b9b635ff221@gmail.com>



On 8/17/19 12:26 PM, Eric Dumazet wrote:
> 
> 
> On 8/17/19 4:19 PM, Jason Baron wrote:
>>
>>
>> On 8/17/19 12:26 AM, Eric Dumazet wrote:
>>> As Jason Baron explained in commit 790ba4566c1a ("tcp: set SOCK_NOSPACE
>>> under memory pressure"), it is crucial we properly set SOCK_NOSPACE
>>> when needed.
>>>
>>> However, Jason patch had a bug, because the 'nonblocking' status
>>> as far as sk_stream_wait_memory() is concerned is governed
>>> by MSG_DONTWAIT flag passed at sendmsg() time :
>>>
>>>     long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
>>>
>>> So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
>>> and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
>>> cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
>>> value.
>>
>> Is MSG_DONTWAIT not set in this case? The original patch was intended
>> only for the explicit non-blocking case. The epoll manpage says:
>> "EPOLLET flag should use nonblocking file descriptors". So the original
>> intention was not to impact the blocking case. This seems to me like
>> a different use-case.
>>
> 
> I guess the problem is how we define 'non-blocking' ...
> 
> SO_SNDTIMEO can be used by application to implement a variation of non-blocking,
> by waiting for a socket event with a short timeout, to maybe recover
> from memory pressure conditions in a more efficient way than simply looping.
> 
> Note that the man page for epoll() only _suggests_ to use nonblocking file descriptors.
> 
> <quote>
>        The  suggested  way  to use epoll as an edge-triggered (EPOLLET)
>        interface is as follows:
> 
>               i   with nonblocking file descriptors; and
> 
>               ii  by  waiting  for  an  event  only  after  read(2)  or
>                   write(2) return EAGAIN.
> </quote>
> 
> 

Ok, seems reasonable:
Acked-by: Jason Baron <jbaron@akamai.com>

I found a similar pattern in net/smc/smc_tx.c, which I also just sent a
patch for.

Thanks,

-Jason




^ permalink raw reply

* [net PATCH] net/smc: make sure EPOLLOUT is raised
From: Jason Baron @ 2019-08-19 18:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, Eric Dumazet, Ursula Braun, Karsten Graul

Currently, we are only explicitly setting SOCK_NOSPACE on a write timeout
for non-blocking sockets. Epoll() edge-trigger mode relies on SOCK_NOSPACE
being set when -EAGAIN is returned to ensure that EPOLLOUT is raised.
Expand the setting of SOCK_NOSPACE to non-blocking sockets as well that can
use SO_SNDTIMEO to adjust their write timeout. This mirrors the behavior
that Eric Dumazet introduced for tcp sockets.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
---
 net/smc/smc_tx.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index f0de323..6c8f09c 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -76,13 +76,11 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
 	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	struct smc_connection *conn = &smc->conn;
 	struct sock *sk = &smc->sk;
-	bool noblock;
 	long timeo;
 	int rc = 0;
 
 	/* similar to sk_stream_wait_memory */
 	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
-	noblock = timeo ? false : true;
 	add_wait_queue(sk_sleep(sk), &wait);
 	while (1) {
 		sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -97,8 +95,8 @@ static int smc_tx_wait(struct smc_sock *smc, int flags)
 			break;
 		}
 		if (!timeo) {
-			if (noblock)
-				set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+			/* ensure EPOLLOUT is subsequently generated */
+			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 			rc = -EAGAIN;
 			break;
 		}
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH 00/16] treewide: prefer __section from compiler_attributes.h
From: Sedat Dilek @ 2019-08-19 18:18 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm, jpoimboe, yhs, Miguel Ojeda, Clang-Built-Linux ML,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	netdev, bpf
In-Reply-To: <20190812215052.71840-17-ndesaulniers@google.com>

On Mon, Aug 12, 2019 at 11:53 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> GCC unescapes escaped string section names while Clang does not. Because
> __section uses the `#` stringification operator for the section name, it
> doesn't need to be escaped.
>
> This fixes an Oops observed in distro's that use systemd and not
> net.core.bpf_jit_enable=1, when their kernels are compiled with Clang.
>
> Instead, we should:
> 1. Prefer __section(.section_name_no_quotes).
> 2. Only use __attribute__((__section(".section"))) when creating the
> section name via C preprocessor (see the definition of __define_initcall
> in arch/um/include/shared/init.h).
>
> This antipattern was found with:
> $ grep -e __section\(\" -e __section__\(\" -r
>
> See the discussions in:
> https://bugs.llvm.org/show_bug.cgi?id=42950
> https://marc.info/?l=linux-netdev&m=156412960619946&w=2
>
> Nick Desaulniers (16):
>   s390/boot: fix section name escaping
>   arc: prefer __section from compiler_attributes.h
>   parisc: prefer __section from compiler_attributes.h
>   um: prefer __section from compiler_attributes.h
>   sh: prefer __section from compiler_attributes.h
>   ia64: prefer __section from compiler_attributes.h
>   arm: prefer __section from compiler_attributes.h
>   mips: prefer __section from compiler_attributes.h
>   sparc: prefer __section from compiler_attributes.h
>   powerpc: prefer __section and __printf from compiler_attributes.h
>   x86: prefer __section from compiler_attributes.h
>   arm64: prefer __section from compiler_attributes.h
>   include/asm-generic: prefer __section from compiler_attributes.h
>   include/linux: prefer __section from compiler_attributes.h
>   include/linux/compiler.h: remove unused KENTRY macro
>   compiler_attributes.h: add note about __section
>

Hi Nick,

thanks for this patchset and the nice section names cleanup and simplification.

I have tested 5 relevant patches for my x86-64 Debian/buster system.

Patchset "for-5.3/x86-section-name-escaping" (5 patches):

compiler_attributes.h: add note about __section
include/linux/compiler.h: remove unused KENTRY macro
include/linux: prefer __section from compiler_attributes.h
include/asm-generic: prefer __section from compiler_attributes.h
x86: prefer __section from compiler_attributes.h

Toolchain: LLVM/Clang compiler and LLD linker version 9.0.0-rc2 (from
Debian/experimental)

I can boot on bare metal.

$ cat /proc/version
Linux version 5.3.0-rc5-1-amd64-cbl-asmgoto
(sedat.dilek@gmail.com@iniza) (clang version 9.0.0-+rc2-1~exp1
(tags/RELEASE_900/rc2)) #1~buster+dileks1 SMP 2019-08-19

I have sent by Tested-by to the single patches.

Have a nice day,
- Sedat -

>  arch/arc/include/asm/linkage.h        |  8 +++----
>  arch/arc/include/asm/mach_desc.h      |  3 +--
>  arch/arm/include/asm/cache.h          |  2 +-
>  arch/arm/include/asm/mach/arch.h      |  4 ++--
>  arch/arm/include/asm/setup.h          |  2 +-
>  arch/arm64/include/asm/cache.h        |  2 +-
>  arch/arm64/kernel/smp_spin_table.c    |  2 +-
>  arch/ia64/include/asm/cache.h         |  2 +-
>  arch/mips/include/asm/cache.h         |  2 +-
>  arch/parisc/include/asm/cache.h       |  2 +-
>  arch/parisc/include/asm/ldcw.h        |  2 +-
>  arch/powerpc/boot/main.c              |  3 +--
>  arch/powerpc/boot/ps3.c               |  6 ++----
>  arch/powerpc/include/asm/cache.h      |  2 +-
>  arch/powerpc/kernel/btext.c           |  2 +-
>  arch/s390/boot/startup.c              |  2 +-
>  arch/sh/include/asm/cache.h           |  2 +-
>  arch/sparc/include/asm/cache.h        |  2 +-
>  arch/sparc/kernel/btext.c             |  2 +-
>  arch/um/kernel/um_arch.c              |  6 +++---
>  arch/x86/include/asm/cache.h          |  2 +-
>  arch/x86/include/asm/intel-mid.h      |  2 +-
>  arch/x86/include/asm/iommu_table.h    |  5 ++---
>  arch/x86/include/asm/irqflags.h       |  2 +-
>  arch/x86/include/asm/mem_encrypt.h    |  2 +-
>  arch/x86/kernel/cpu/cpu.h             |  3 +--
>  include/asm-generic/error-injection.h |  2 +-
>  include/asm-generic/kprobes.h         |  5 ++---
>  include/linux/cache.h                 |  6 +++---
>  include/linux/compiler.h              | 31 ++++-----------------------
>  include/linux/compiler_attributes.h   | 10 +++++++++
>  include/linux/cpu.h                   |  2 +-
>  include/linux/export.h                |  2 +-
>  include/linux/init_task.h             |  4 ++--
>  include/linux/interrupt.h             |  5 ++---
>  include/linux/sched/debug.h           |  2 +-
>  include/linux/srcutree.h              |  2 +-
>  37 files changed, 62 insertions(+), 83 deletions(-)
>
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

^ permalink raw reply

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Jakub Kicinski @ 2019-08-19 18:15 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <5e9bee13-a746-f148-00de-feb7cb7b1403@gmail.com>

On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> > On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:  
> >> On 2019/08/16 4:22, Jakub Kicinski wrote:  
> >>> There's a certain allure in bringing the in-kernel BPF translation
> >>> infrastructure forward. OTOH from system architecture perspective IMHO
> >>> it does seem like a task best handed in user space. bpfilter can replace
> >>> iptables completely, here we're looking at an acceleration relatively
> >>> loosely coupled with flower.  
> >>
> >> I don't think it's loosely coupled. Emulating TC behavior in userspace
> >> is not so easy.
> >>
> >> Think about recent multi-mask support in flower. Previously userspace could
> >> assume there is one mask and hash table for each preference in TC. After the
> >> change TC accepts different masks with the same pref. Such a change tends to
> >> break userspace emulation. It may ignore masks passed from flow insertion
> >> and use the mask remembered when the first flow of the pref is inserted. It
> >> may override the mask of all existing flows with the pref. It may fail to
> >> insert such flows. Any of them would result in unexpected wrong datapath
> >> handling which is critical.
> >> I think such an emulation layer needs to be updated in sync with TC.  
> > 
> > Oh, so you're saying that if xdp_flow is merged all patches to
> > cls_flower and netfilter which affect flow offload will be required
> > to update xdp_flow as well?  
> 
> Hmm... you are saying that we are allowed to break other in-kernel 
> subsystem by some change? Sounds strange...

No I'm not saying that, please don't put words in my mouth.
I'm asking you if that's your intention.

Having an implementation nor support a feature of another implementation
and degrade gracefully to the slower one is not necessarily breakage.
We need to make a concious decision here, hence the clarifying question.

> > That's a question of policy. Technically the implementation in user
> > space is equivalent.
> > 
> > The advantage of user space implementation is that you can add more
> > to it and explore use cases which do not fit in the flow offload API,
> > but are trivial for BPF. Not to mention the obvious advantage of
> > decoupling the upgrade path.  
> 
> I understand the advantage, but I can't trust such a third-party kernel 
> emulation solution for this kind of thing which handles critical data path.

That's a strange argument to make. All production data path BPF today
comes from user space.

> > Personally I'm not happy with the way this patch set messes with the
> > flow infrastructure. You should use the indirect callback
> > infrastructure instead, and that way you can build the whole thing
> > touching none of the flow offload core.  
> 
> I don't want to mess up the core flow infrastructure either. I'm all 
> ears about less invasive ways. Using indirect callback sounds like a 
> good idea. Will give it a try. Many thanks.

Excellent, thanks!

^ permalink raw reply

* Re: [PATCH 14/16] include/linux: prefer __section from compiler_attributes.h
From: Sedat Dilek @ 2019-08-19 18:05 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm, jpoimboe, yhs, Miguel Ojeda, Clang-Built-Linux ML,
	Luc Van Oostenryck, Lai Jiangshan, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra (Intel),
	Nicholas Piggin, Jiri Kosina, Will Deacon, Ard Biesheuvel,
	Michael Ellerman, Masahiro Yamada, Hans Liljestrand,
	Elena Reshetova, David Windsor, Marc Zyngier, Ming Lei,
	Dou Liyang, Julien Thierry, Mauro Carvalho Chehab, Jens Axboe,
	linux-kernel, linux-sparse, rcu, netdev, bpf
In-Reply-To: <20190812215052.71840-14-ndesaulniers@google.com>

On Mon, Aug 12, 2019 at 11:53 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> Link: https://github.com/ClangBuiltLinux/linux/issues/619
> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [ Linux v5.3-rc5 ]

Patchset "for-5.3/x86-section-name-escaping" (5 patches):

compiler_attributes.h: add note about __section
include/linux/compiler.h: remove unused KENTRY macro
include/linux: prefer __section from compiler_attributes.h
include/asm-generic: prefer __section from compiler_attributes.h
x86: prefer __section from compiler_attributes.h

Thanks.

- Sedat -

> ---
>  include/linux/cache.h       | 6 +++---
>  include/linux/compiler.h    | 8 ++++----
>  include/linux/cpu.h         | 2 +-
>  include/linux/export.h      | 2 +-
>  include/linux/init_task.h   | 4 ++--
>  include/linux/interrupt.h   | 5 ++---
>  include/linux/sched/debug.h | 2 +-
>  include/linux/srcutree.h    | 2 +-
>  8 files changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/cache.h b/include/linux/cache.h
> index 750621e41d1c..3f4df9eef1e1 100644
> --- a/include/linux/cache.h
> +++ b/include/linux/cache.h
> @@ -28,7 +28,7 @@
>   * but may get written to during init, so can't live in .rodata (via "const").
>   */
>  #ifndef __ro_after_init
> -#define __ro_after_init __attribute__((__section__(".data..ro_after_init")))
> +#define __ro_after_init __section(.data..ro_after_init)
>  #endif
>
>  #ifndef ____cacheline_aligned
> @@ -45,8 +45,8 @@
>
>  #ifndef __cacheline_aligned
>  #define __cacheline_aligned                                    \
> -  __attribute__((__aligned__(SMP_CACHE_BYTES),                 \
> -                __section__(".data..cacheline_aligned")))
> +       __aligned(SMP_CACHE_BYTES)                              \
> +       __section(.data..cacheline_aligned)
>  #endif /* __cacheline_aligned */
>
>  #ifndef __cacheline_aligned_in_smp
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index f0fd5636fddb..5e88e7e33abe 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -24,7 +24,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
>                         long ______r;                                   \
>                         static struct ftrace_likely_data                \
>                                 __aligned(4)                            \
> -                               __section("_ftrace_annotated_branch")   \
> +                               __section(_ftrace_annotated_branch)     \
>                                 ______f = {                             \
>                                 .data.func = __func__,                  \
>                                 .data.file = __FILE__,                  \
> @@ -60,7 +60,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
>  #define __trace_if_value(cond) ({                      \
>         static struct ftrace_branch_data                \
>                 __aligned(4)                            \
> -               __section("_ftrace_branch")             \
> +               __section(_ftrace_branch)               \
>                 __if_trace = {                          \
>                         .func = __func__,               \
>                         .file = __FILE__,               \
> @@ -118,7 +118,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
>         ".popsection\n\t"
>
>  /* Annotate a C jump table to allow objtool to follow the code flow */
> -#define __annotate_jump_table __section(".rodata..c_jump_table")
> +#define __annotate_jump_table __section(.rodata..c_jump_table)
>
>  #else
>  #define annotate_reachable()
> @@ -298,7 +298,7 @@ unsigned long read_word_at_a_time(const void *addr)
>   * visible to the compiler.
>   */
>  #define __ADDRESSABLE(sym) \
> -       static void * __section(".discard.addressable") __used \
> +       static void * __section(.discard.addressable) __used \
>                 __PASTE(__addressable_##sym, __LINE__) = (void *)&sym;
>
>  /**
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index fcb1386bb0d4..186bbd79d6ce 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -166,7 +166,7 @@ void cpu_startup_entry(enum cpuhp_state state);
>  void cpu_idle_poll_ctrl(bool enable);
>
>  /* Attach to any functions which should be considered cpuidle. */
> -#define __cpuidle      __attribute__((__section__(".cpuidle.text")))
> +#define __cpuidle      __section(.cpuidle.text)
>
>  bool cpu_in_idle(unsigned long pc);
>
> diff --git a/include/linux/export.h b/include/linux/export.h
> index fd8711ed9ac4..808c1a0c2ef9 100644
> --- a/include/linux/export.h
> +++ b/include/linux/export.h
> @@ -104,7 +104,7 @@ struct kernel_symbol {
>   * discarded in the final link stage.
>   */
>  #define __ksym_marker(sym)     \
> -       static int __ksym_marker_##sym[0] __section(".discard.ksym") __used
> +       static int __ksym_marker_##sym[0] __section(.discard.ksym) __used
>
>  #define __EXPORT_SYMBOL(sym, sec)                              \
>         __ksym_marker(sym);                                     \
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index 6049baa5b8bc..50139505da34 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -51,12 +51,12 @@ extern struct cred init_cred;
>
>  /* Attach to the init_task data structure for proper alignment */
>  #ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
> -#define __init_task_data __attribute__((__section__(".data..init_task")))
> +#define __init_task_data __section(.data..init_task)
>  #else
>  #define __init_task_data /**/
>  #endif
>
>  /* Attach to the thread_info data structure for proper alignment */
> -#define __init_thread_info __attribute__((__section__(".data..init_thread_info")))
> +#define __init_thread_info __section(.data..init_thread_info)
>
>  #endif
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 5b8328a99b2a..29debfe4dd0f 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -741,8 +741,7 @@ extern int arch_early_irq_init(void);
>  /*
>   * We want to know which function is an entrypoint of a hardirq or a softirq.
>   */
> -#define __irq_entry             __attribute__((__section__(".irqentry.text")))
> -#define __softirq_entry  \
> -       __attribute__((__section__(".softirqentry.text")))
> +#define __irq_entry    __section(.irqentry.text)
> +#define __softirq_entry        __section(.softirqentry.text)
>
>  #endif
> diff --git a/include/linux/sched/debug.h b/include/linux/sched/debug.h
> index 95fb9e025247..e17b66221fdd 100644
> --- a/include/linux/sched/debug.h
> +++ b/include/linux/sched/debug.h
> @@ -42,7 +42,7 @@ extern void proc_sched_set_task(struct task_struct *p);
>  #endif
>
>  /* Attach to any functions which should be ignored in wchan output. */
> -#define __sched                __attribute__((__section__(".sched.text")))
> +#define __sched                __section(.sched.text)
>
>  /* Linker adds these: start and end of __sched functions */
>  extern char __sched_text_start[], __sched_text_end[];
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index 9cfcc8a756ae..9de652f4e1bd 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -124,7 +124,7 @@ struct srcu_struct {
>  # define __DEFINE_SRCU(name, is_static)                                        \
>         is_static struct srcu_struct name;                              \
>         struct srcu_struct * const __srcu_struct_##name                 \
> -               __section("___srcu_struct_ptrs") = &name
> +               __section(___srcu_struct_ptrs) = &name
>  #else
>  # define __DEFINE_SRCU(name, is_static)                                        \
>         static DEFINE_PER_CPU(struct srcu_data, name##_srcu_data);      \
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

^ permalink raw reply

* Re: [PATCH 15/16] include/linux/compiler.h: remove unused KENTRY macro
From: Sedat Dilek @ 2019-08-19 18:03 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm, jpoimboe, yhs, Miguel Ojeda, Clang-Built-Linux ML,
	Luc Van Oostenryck, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, linux-sparse, linux-kernel, netdev,
	bpf
In-Reply-To: <20190812215052.71840-15-ndesaulniers@google.com>

On Mon, Aug 12, 2019 at 11:53 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> This macro is not used throughout the kernel. Delete it rather than
> update the __section to be a fully spelled out
> __attribute__((__section__())) to avoid
> https://bugs.llvm.org/show_bug.cgi?id=42950.
>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [ Linux v5.3-rc5 ]

Patchset "for-5.3/x86-section-name-escaping" (5 patches):

compiler_attributes.h: add note about __section
include/linux/compiler.h: remove unused KENTRY macro
include/linux: prefer __section from compiler_attributes.h
include/asm-generic: prefer __section from compiler_attributes.h
x86: prefer __section from compiler_attributes.h

Thanks.

- Sedat -

> ---
>  include/linux/compiler.h | 23 -----------------------
>  1 file changed, 23 deletions(-)
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 5e88e7e33abe..f01c1e527f85 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -136,29 +136,6 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
>  } while (0)
>  #endif
>
> -/*
> - * KENTRY - kernel entry point
> - * This can be used to annotate symbols (functions or data) that are used
> - * without their linker symbol being referenced explicitly. For example,
> - * interrupt vector handlers, or functions in the kernel image that are found
> - * programatically.
> - *
> - * Not required for symbols exported with EXPORT_SYMBOL, or initcalls. Those
> - * are handled in their own way (with KEEP() in linker scripts).
> - *
> - * KENTRY can be avoided if the symbols in question are marked as KEEP() in the
> - * linker script. For example an architecture could KEEP() its entire
> - * boot/exception vector code rather than annotate each function and data.
> - */
> -#ifndef KENTRY
> -# define KENTRY(sym)                                           \
> -       extern typeof(sym) sym;                                 \
> -       static const unsigned long __kentry_##sym               \
> -       __used                                                  \
> -       __section("___kentry" "+" #sym )                        \
> -       = (unsigned long)&sym;
> -#endif
> -
>  #ifndef RELOC_HIDE
>  # define RELOC_HIDE(ptr, off)                                  \
>    ({ unsigned long __ptr;                                      \
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: enable and disable all ports
From: Vivien Didelot @ 2019-08-19 18:03 UTC (permalink / raw)
  To: Marek Behun; +Cc: netdev, davem, f.fainelli, andrew
In-Reply-To: <20190819193246.0e40a1d4@nic.cz>

Hi Marek,

On Mon, 19 Aug 2019 19:32:46 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> > Call the .port_enable and .port_disable functions for all ports,
> > not only the user ports, so that drivers may optimize the power
> > consumption of all ports after a successful setup.
> > 
> > Unused ports are now disabled on setup. CPU and DSA ports are now
> > enabled on setup and disabled on teardown. User ports were already
> > enabled at slave creation and disabled at slave destruction.
> 
> My original reason for enabling CPU and DSA ports is that enabling
> serdes irq could not be done in .setup in mv88e6xxx, since the required
> phylink structures did not yet exists for those ports.
> 
> The case after this patch would be that .port_enable is called for
> CPU/DSA ports right after these required phylink structures are created
> for this port. A thought came to me while reading this that some driver
> in the future can expect, in their implementation of
> port_enable/port_disable, that phylink structures already exist for all
> ports, not just the one being currently enabled/disabled.
> 
> Wouldn't it be safer if CPU/DSA ports were enabled in setup after all
> ports are registered, and disabled in teardown before ports are
> unregistered?
> 
> Current:
>   ->setup()
>   for each port
>       dsa_port_link_register_of()
>       dsa_port_enable()
> 
> Proposed:
>   ->setup()
>   for each port
>       dsa_port_link_register_of()
>   for each port
>       dsa_port_enable()

I understand what you mean, but the scope of .port_enable really is the
port itself, so I do not expect a driver to configure something else. Also,
I prefer to keep it simple at the moment and not speculate about what a
driver may need in the future, as we may also wonder which switch must be
enabled first in a tree, etc. If that case happens in the future, we can
definitely isolate the ports enabling code after the tree is setup.

So if this series meets your requirement for now, I'd keep it like that.


Thanks,

	Vivien

^ permalink raw reply

* Re: [PATCH 16/16] compiler_attributes.h: add note about __section
From: Sedat Dilek @ 2019-08-19 18:01 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm, jpoimboe, yhs, Miguel Ojeda, Clang-Built-Linux ML,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	linux-kernel, netdev, bpf
In-Reply-To: <20190812215052.71840-16-ndesaulniers@google.com>

On Mon, Aug 12, 2019 at 11:53 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> The antipattern described can be found with:
> $ grep -e __section\(\" -r -e __section__\(\"
>
> Link: https://bugs.llvm.org/show_bug.cgi?id=42950
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [ Linux v5.3-rc5 ]

Patchset "for-5.3/x86-section-name-escaping":

include/linux/compiler.h: remove unused KENTRY macro
include/linux: prefer __section from compiler_attributes.h
include/asm-generic: prefer __section from compiler_attributes.h
x86: prefer __section from compiler_attributes.h

Thanks.

- Sedat -

> ---
>  include/linux/compiler_attributes.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
> index 6b318efd8a74..f8c008d7f616 100644
> --- a/include/linux/compiler_attributes.h
> +++ b/include/linux/compiler_attributes.h
> @@ -225,6 +225,16 @@
>  #define __pure                          __attribute__((__pure__))
>
>  /*
> + *  Note: Since this macro makes use of the "stringification operator" `#`, a
> + *        quoted string literal should not be passed to it. eg.
> + *        prefer:
> + *        __section(.foo)
> + *        to:
> + *        __section(".foo")
> + *        unless the section name is dynamically built up, in which case the
> + *        verbose __attribute__((__section__(".foo" x))) should be preferred.
> + *        See also: https://bugs.llvm.org/show_bug.cgi?id=42950
> + *
>   *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-section-function-attribute
>   *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-section-variable-attribute
>   * clang: https://clang.llvm.org/docs/AttributeReference.html#section-declspec-allocate
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

^ permalink raw reply

* Re: [PATCH 11/16] x86: prefer __section from compiler_attributes.h
From: Sedat Dilek @ 2019-08-19 17:59 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm, jpoimboe, yhs, Miguel Ojeda, Clang-Built-Linux ML,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	x86, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Song Liu, Armijn Hemel, Greg Kroah-Hartman, Allison Randal,
	Juergen Gross, Frederic Weisbecker, Brijesh Singh, Enrico Weigelt,
	Kate Stewart, Hannes Reinecke, Sean Christopherson,
	Rafael J. Wysocki, Pu Wen, linux-kernel, netdev, bpf
In-Reply-To: <20190812215052.71840-11-ndesaulniers@google.com>

On Mon, Aug 12, 2019 at 11:52 PM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [ Linux v5.3-rc5 ]

Patchset "for-5.3/x86-section-name-escaping":

include/linux/compiler.h: remove unused KENTRY macro
include/linux: prefer __section from compiler_attributes.h
include/asm-generic: prefer __section from compiler_attributes.h
x86: prefer __section from compiler_attributes.h

Thanks.

- Sedat -

> ---
>  arch/x86/include/asm/cache.h       | 2 +-
>  arch/x86/include/asm/intel-mid.h   | 2 +-
>  arch/x86/include/asm/iommu_table.h | 5 ++---
>  arch/x86/include/asm/irqflags.h    | 2 +-
>  arch/x86/include/asm/mem_encrypt.h | 2 +-
>  arch/x86/kernel/cpu/cpu.h          | 3 +--
>  6 files changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/cache.h b/arch/x86/include/asm/cache.h
> index abe08690a887..bb9f4bf4ec02 100644
> --- a/arch/x86/include/asm/cache.h
> +++ b/arch/x86/include/asm/cache.h
> @@ -8,7 +8,7 @@
>  #define L1_CACHE_SHIFT (CONFIG_X86_L1_CACHE_SHIFT)
>  #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>
> -#define __read_mostly __attribute__((__section__(".data..read_mostly")))
> +#define __read_mostly __section(.data..read_mostly)
>
>  #define INTERNODE_CACHE_SHIFT CONFIG_X86_INTERNODE_CACHE_SHIFT
>  #define INTERNODE_CACHE_BYTES (1 << INTERNODE_CACHE_SHIFT)
> diff --git a/arch/x86/include/asm/intel-mid.h b/arch/x86/include/asm/intel-mid.h
> index 8e5af119dc2d..f51f04aefe1b 100644
> --- a/arch/x86/include/asm/intel-mid.h
> +++ b/arch/x86/include/asm/intel-mid.h
> @@ -43,7 +43,7 @@ struct devs_id {
>
>  #define sfi_device(i)                                                          \
>         static const struct devs_id *const __intel_mid_sfi_##i##_dev __used     \
> -       __attribute__((__section__(".x86_intel_mid_dev.init"))) = &i
> +       __section(.x86_intel_mid_dev.init) = &i
>
>  /**
>  * struct mid_sd_board_info - template for SD device creation
> diff --git a/arch/x86/include/asm/iommu_table.h b/arch/x86/include/asm/iommu_table.h
> index 1fb3fd1a83c2..7d190710eb92 100644
> --- a/arch/x86/include/asm/iommu_table.h
> +++ b/arch/x86/include/asm/iommu_table.h
> @@ -50,9 +50,8 @@ struct iommu_table_entry {
>
>  #define __IOMMU_INIT(_detect, _depend, _early_init, _late_init, _finish)\
>         static const struct iommu_table_entry                           \
> -               __iommu_entry_##_detect __used                          \
> -       __attribute__ ((unused, __section__(".iommu_table"),            \
> -                       aligned((sizeof(void *)))))     \
> +               __iommu_entry_##_detect __used __section(.iommu_table)  \
> +               __aligned((sizeof(void *)))                             \
>         = {_detect, _depend, _early_init, _late_init,                   \
>            _finish ? IOMMU_FINISH_IF_DETECTED : 0}
>  /*
> diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
> index 8a0e56e1dcc9..68db90bca813 100644
> --- a/arch/x86/include/asm/irqflags.h
> +++ b/arch/x86/include/asm/irqflags.h
> @@ -9,7 +9,7 @@
>  #include <asm/nospec-branch.h>
>
>  /* Provide __cpuidle; we can't safely include <linux/cpu.h> */
> -#define __cpuidle __attribute__((__section__(".cpuidle.text")))
> +#define __cpuidle __section(.cpuidle.text)
>
>  /*
>   * Interrupt control:
> diff --git a/arch/x86/include/asm/mem_encrypt.h b/arch/x86/include/asm/mem_encrypt.h
> index 0c196c47d621..db2cd3709148 100644
> --- a/arch/x86/include/asm/mem_encrypt.h
> +++ b/arch/x86/include/asm/mem_encrypt.h
> @@ -50,7 +50,7 @@ void __init mem_encrypt_free_decrypted_mem(void);
>  bool sme_active(void);
>  bool sev_active(void);
>
> -#define __bss_decrypted __attribute__((__section__(".bss..decrypted")))
> +#define __bss_decrypted __section(.bss..decrypted)
>
>  #else  /* !CONFIG_AMD_MEM_ENCRYPT */
>
> diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
> index c0e2407abdd6..7ff9dc41a603 100644
> --- a/arch/x86/kernel/cpu/cpu.h
> +++ b/arch/x86/kernel/cpu/cpu.h
> @@ -38,8 +38,7 @@ struct _tlb_table {
>
>  #define cpu_dev_register(cpu_devX) \
>         static const struct cpu_dev *const __cpu_dev_##cpu_devX __used \
> -       __attribute__((__section__(".x86_cpu_dev.init"))) = \
> -       &cpu_devX;
> +       __section(.x86_cpu_dev.init) = &cpu_devX;
>
>  extern const struct cpu_dev *const __x86_cpu_dev_start[],
>                             *const __x86_cpu_dev_end[];
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

^ permalink raw reply

* Re: [PATCH net-next 0/8] sctp: support per endpoint auth and asconf flags
From: Marcelo Ricardo Leitner @ 2019-08-19 17:58 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, Neil Horman, davem
In-Reply-To: <cover.1566223325.git.lucien.xin@gmail.com>

On Mon, Aug 19, 2019 at 10:02:42PM +0800, Xin Long wrote:
> This patchset mostly does 3 things:
> 
>   1. add per endpint asconf flag and use asconf flag properly
>      and add SCTP_ASCONF_SUPPORTED sockopt.
>   2. use auth flag properly and add SCTP_AUTH_SUPPORTED sockopt.
>   3. remove the 'global feature switch' to discard chunks.

What about net->sctp.addip_noauth, why wasn't it included as well?
I'm thinking that's because it's more of a compatibility knob and
that there isn't much value on having it more fine-grained than
per-netns. Just feels a bit odd to have the other two but not this
one, which is directly related, but yeah.

Otherwise, changes LGTM.

^ permalink raw reply

* Re: [PATCH 13/16] include/asm-generic: prefer __section from compiler_attributes.h
From: Sedat Dilek @ 2019-08-19 17:56 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: akpm, Nick Desaulniers, Anil S Keshavamurthy, Arnd Bergmann,
	Alexei Starovoitov, bpf, Clang-Built-Linux ML, Daniel Borkmann,
	David S. Miller, jpoimboe, Martin KaFai Lau, linux-arch,
	linux-kernel, Masami Hiramatsu, Miguel Ojeda, netdev, Song Liu,
	yhs
In-Reply-To: <1566237106.8670clhnrk.naveen@linux.ibm.com>

On Mon, Aug 19, 2019 at 7:52 PM Naveen N. Rao
<naveen.n.rao@linux.ibm.com> wrote:
>
> Nick Desaulniers wrote:
> > Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> > Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> > Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
> > ---
>
> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> [ Linux v5.3-rc5 ]

Patchset "for-5.3/x86-section-name-escaping":

include/linux/compiler.h: remove unused KENTRY macro
include/linux: prefer __section from compiler_attributes.h
include/asm-generic: prefer __section from compiler_attributes.h
x86: prefer __section from compiler_attributes.h

- Sedat -

^ permalink raw reply

* [PATCH net-next 1/1] Add genphy_c45_config_aneg() function to phy-c45.c
From: Marco Hartmann @ 2019-08-19 17:52 UTC (permalink / raw)
  To: Marco Hartmann, Christian Herber, andrew@lunn.ch,
	f.fainelli@gmail.com, hkallweit1@gmail.com, davem@davemloft.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1566237157-9054-1-git-send-email-marco.hartmann@nxp.com>

and call it from phy_config_aneg().

commit 34786005eca3 ("net: phy: prevent PHYs w/o Clause 22 regs from
calling genphy_config_aneg") introduced a check that aborts
phy_config_aneg() if the phy is a C45 phy.
This causes phy_state_machine() to call phy_error() so that the phy
ends up in PHY_HALTED state.

Instead of returning -EOPNOTSUPP, call genphy_c45_config_aneg()
(analogous to the C22 case) so that the state machine can run
correctly.

genphy_c45_config_aneg() closely resembles mv3310_config_aneg()
in drivers/net/phy/marvell10g.c, excluding vendor specific
configurations for 1000BaseT.

Fixes: 34786005eca3 ("net: phy: prevent PHYs w/o Clause 22 regs from
calling genphy_config_aneg")

Signed-off-by: Marco Hartmann <marco.hartmann@nxp.com>
---
 drivers/net/phy/phy-c45.c | 26 ++++++++++++++++++++++++++
 drivers/net/phy/phy.c     |  2 +-
 include/linux/phy.h       |  1 +
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy-c45.c b/drivers/net/phy/phy-c45.c
index b9d4145781ca..fa9062fd9122 100644
--- a/drivers/net/phy/phy-c45.c
+++ b/drivers/net/phy/phy-c45.c
@@ -509,6 +509,32 @@ int genphy_c45_read_status(struct phy_device *phydev)
 }
 EXPORT_SYMBOL_GPL(genphy_c45_read_status);
 
+/**
+ * genphy_c45_config_aneg - restart auto-negotiation or forced setup
+ * @phydev: target phy_device struct
+ *
+ * Description: If auto-negotiation is enabled, we configure the
+ *   advertising, and then restart auto-negotiation.  If it is not
+ *   enabled, then we force a configuration.
+ */
+int genphy_c45_config_aneg(struct phy_device *phydev)
+{
+	int ret;
+	bool changed = false;
+
+	if (phydev->autoneg == AUTONEG_DISABLE)
+		return genphy_c45_pma_setup_forced(phydev);
+
+	ret = genphy_c45_an_config_aneg(phydev);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		changed = true;
+
+	return genphy_c45_check_and_restart_aneg(phydev, changed);
+}
+EXPORT_SYMBOL_GPL(genphy_c45_config_aneg);
+
 /* The gen10g_* functions are the old Clause 45 stub */
 
 int gen10g_config_aneg(struct phy_device *phydev)
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index f3adea9ef400..74c4e15ebe52 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -507,7 +507,7 @@ static int phy_config_aneg(struct phy_device *phydev)
 	 * allowed to call genphy_config_aneg()
 	 */
 	if (phydev->is_c45 && !(phydev->c45_ids.devices_in_package & BIT(0)))
-		return -EOPNOTSUPP;
+		return genphy_c45_config_aneg(phydev);
 
 	return genphy_config_aneg(phydev);
 }
diff --git a/include/linux/phy.h b/include/linux/phy.h
index d26779f1fb6b..a7ecbe0e55aa 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1117,6 +1117,7 @@ int genphy_c45_an_disable_aneg(struct phy_device *phydev);
 int genphy_c45_read_mdix(struct phy_device *phydev);
 int genphy_c45_pma_read_abilities(struct phy_device *phydev);
 int genphy_c45_read_status(struct phy_device *phydev);
+int genphy_c45_config_aneg(struct phy_device *phydev);
 
 /* The gen10g_* functions are the old Clause 45 stub */
 int gen10g_config_aneg(struct phy_device *phydev);
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 0/1] Add genphy_c45_config_aneg() function to phy-c45.c
From: Marco Hartmann @ 2019-08-19 17:52 UTC (permalink / raw)
  To: Marco Hartmann, Christian Herber, andrew@lunn.ch,
	f.fainelli@gmail.com, hkallweit1@gmail.com, davem@davemloft.net,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org

Currently, using a Clause 45 Phy with the "Generic Clause 45 PHY"
driver leads to a warning, similar to the one below, as soon as the interface
is brought up.


  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 146 at drivers/net/phy/phy.c:736 phy_error+0x2c/0x64
  Modules linked in: fec
  CPU: 2 PID: 146 Comm: kworker/2:1 Not tainted 5.3.0-rc3-NETNEXT-00816-g48e924c73178 #20
  Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
  Workqueue: events_power_efficient phy_state_machine
  Backtrace: 
  ...

This happens, because the Genphy driver does not provide a config_aneg() func,
so that phy_start_aneg() ultimately fails such that phy_error() is called,
producing the above warning.

This patch adds the function genphy_c45_config_aneg(), which allows
phy_start_aneg() to work correctly for C45 phys.


Marco Hartmann (1):
  Add genphy_c45_config_aneg() function to phy-c45.c

 drivers/net/phy/phy-c45.c | 26 ++++++++++++++++++++++++++
 drivers/net/phy/phy.c     |  2 +-
 include/linux/phy.h       |  1 +
 3 files changed, 28 insertions(+), 1 deletion(-)

-- 
2.7.4


^ permalink raw reply

* Re: [PATCH 13/16] include/asm-generic: prefer __section from compiler_attributes.h
From: Naveen N. Rao @ 2019-08-19 17:52 UTC (permalink / raw)
  To: akpm, Nick Desaulniers
  Cc: Anil S Keshavamurthy, Arnd Bergmann, Alexei Starovoitov, bpf,
	clang-built-linux, Daniel Borkmann, David S. Miller, jpoimboe,
	Martin KaFai Lau, linux-arch, linux-kernel, Masami Hiramatsu,
	miguel.ojeda.sandonis, netdev, sedat.dilek, Song Liu, yhs
In-Reply-To: <20190812215052.71840-13-ndesaulniers@google.com>

Nick Desaulniers wrote:
> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
> ---

Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

- Naveen


^ permalink raw reply

* Re: [GIT] Networking
From: pr-tracker-bot @ 2019-08-19 17:45 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, akpm, netdev, linux-kernel
In-Reply-To: <20190818.194615.2174476213333990592.davem@davemloft.net>

The pull request you sent on Sun, 18 Aug 2019 19:46:15 -0700 (PDT):

> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git refs/heads/master

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/06821504fd47a5e5b641aeeb638a0ae10a216ef8

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker

^ permalink raw reply

* Re: [PATCH net-next] net: openvswitch: Set OvS recirc_id from tc chain
From: Marcelo Ricardo Leitner @ 2019-08-19 17:42 UTC (permalink / raw)
  To: Paul Blakey
  Cc: Pravin B Shelar, netdev, David S. Miller, Justin Pettit,
	Simon Horman, Vlad Buslov, Jiri Pirko, Roi Dayan, Yossi Kuperman,
	Rony Efraim, Oz Shlomo
In-Reply-To: <1566144059-8247-1-git-send-email-paulb@mellanox.com>

On Sun, Aug 18, 2019 at 07:00:59PM +0300, Paul Blakey wrote:
> What do you guys say about the following diff on top of the last one?
> Use static key, and also have OVS_DP_CMD_SET command probe/enable the feature.
> 
> This will allow userspace to probe the feature, and selectivly enable it via the
> OVS_DP_CMD_SET command.

I'm not convinced yet that we need something like this. Been
wondering, skb_ext_find() below is not that expensive if not in use.
It's just a bit check and that's it, it returns NULL.

And drivers will only be setting this if they have tc-offloading
enabled (assuming they won't be seeing it for chain 0 all the time).
On which case, with tc offloading, we need this in order to work
properly.

Is the bit checking really that worrysome?

> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -853,8 +853,10 @@ int ovs_flow_key_extract(const struct ip_tunnel_info *tun_info,
>  	key->mac_proto = res;
>  
>  #if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
> -	tc_ext = skb_ext_find(skb, TC_SKB_EXT);
> -	key->recirc_id = tc_ext ? tc_ext->chain : 0;
> +	if (static_branch_unlikely(&tc_recirc_sharing_support)) {
> +		tc_ext = skb_ext_find(skb, TC_SKB_EXT);
> +		key->recirc_id = tc_ext ? tc_ext->chain : 0;
> +	}
>  #else
>  	key->recirc_id = 0;
>  #endif
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-19 17:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Gleixner, Jordan Glover, Andy Lutomirski,
	Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190819172718.jwnvvotssxwhc7m6@ast-mbp.dhcp.thefacebook.com>



> On Aug 19, 2019, at 10:27 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Mon, Aug 19, 2019 at 11:15:11AM +0200, Thomas Gleixner wrote:
>> Alexei,
>> 
>>> On Sat, 17 Aug 2019, Alexei Starovoitov wrote:
>>>> On Fri, Aug 16, 2019 at 10:28:29PM +0200, Thomas Gleixner wrote:
>>>> On Fri, 16 Aug 2019, Alexei Starovoitov wrote:
>>>> While real usecases are helpful to understand a design decision, the design
>>>> needs to be usecase independent.
>>>> 
>>>> The kernel provides mechanisms, not policies. My impression of this whole
>>>> discussion is that it is policy driven. That's the wrong approach.
>>> 
>>> not sure what you mean by 'policy driven'.
>>> Proposed CAP_BPF is a policy?
>> 
>> I was referring to the discussion as a whole.
>> 
>>> Can kernel.unprivileged_bpf_disabled=1 be used now?
>>> Yes, but it will weaken overall system security because things that
>>> use unpriv to load bpf and CAP_NET_ADMIN to attach bpf would need
>>> to move to stronger CAP_SYS_ADMIN.
>>> 
>>> With CAP_BPF both load and attach would happen under CAP_BPF
>>> instead of CAP_SYS_ADMIN.
>> 
>> I'm not arguing against that.
>> 
>>>> So let's look at the mechanisms which we have at hand:
>>>> 
>>>> 1) Capabilities
>>>> 
>>>> 2) SUID and dropping priviledges
>>>> 
>>>> 3) Seccomp and LSM
>>>> 
>>>> Now the real interesting questions are:
>>>> 
>>>> A) What kind of restrictions does BPF allow? Is it a binary on/off or is
>>>>    there a more finegrained control of BPF functionality?
>>>> 
>>>>    TBH, I can't tell.
>>>> 
>>>> B) Depending on the answer to #A what is the control possibility for
>>>>    #1/#2/#3 ?
>>> 
>>> Can any of the mechanisms 1/2/3 address the concern in mds.rst?
>> 
>> Well, that depends. As with any other security policy which is implemented
>> via these mechanisms, the policy can be strict enough to prevent it by not
>> allowing certain operations. The more fine-grained the control is, it
>> allows the administrator who implements the policy to remove the
>> 'dangerous' parts from an untrusted user.
>> 
>> So really question #A is important for this. Is BPF just providing a binary
>> ON/OFF knob or does it allow to disable/enable certain aspects of BPF
>> functionality in a more fine grained way? If the latter, then it might be
>> possible to control functionality which might be abused for exploits of
>> some sorts (including MDS) in a way which allows other parts of BBF to be
>> exposed to less priviledged contexts.
> 
> I see. So the kernel.unprivileged_bpf_disabled knob is binary and I think it's
> the right mechanism to expose to users.
> Having N knobs for every map/prog type won't decrease attack surface.
> In the other email Andy's quoting seccomp man page...
> Today seccomp cannot really look into bpf_attr syscall args, but even
> if it could it won't secure the system.
> Examples:
> 1.
> spectre v2 is using bpf in-kernel interpreter in speculative way.
> The mere presence of interpreter as part of kernel .text makes the exploit
> easier to do. That was the reason to do CONFIG_BPF_JIT_ALWAYS_ON.
> For this case even kernel.unprivileged_bpf_disabled=1 was hopeless.
> 
> 2.
> var4 doing store hazard. It doesn't matter which program type is used.
> load/store instructions are the same across program types.
> 
> 3.
> prog_array was used as part of var1. I guess it was simply more
> convenient for Jann to do it this way :) All other map types
> have the same out-of-bounds speculation issue.
> 
> In general side channels are cpu bugs that are exploited via sequences
> of cpu instructions. In that sense bpf infra provides these instructions.
> So all program types and all maps have the same level of 'side channel risk'.
> 
>>> I believe Andy wants to expand the attack surface when
>>> kernel.unprivileged_bpf_disabled=0
>>> Before that happens I'd like the community to work on addressing the text above.
>> 
>> Well, that text above can be removed when the BPF wizards are entirely sure
>> that BPF cannot be abused to exploit stuff. 
> 
> Myself and Daniel looked at it in detail. I think we understood
> MDS mechanism well enough. Right now we're fairly confident that
> combination of existing mechanisms we did for var4 and
> verifier speculative analysis protect us from MDS.
> The thing is that every new cpu bug is looked at through the bpf lenses.
> Can it be exploited through bpf? Complexity of side channels
> is growing. Can the most recent swapgs be exploited ?
> What if we kprobe+bpf somewhere ?
> I don't think there is an issue, but we will never be 'entirely sure'.
> Even if myself and Daniel are sure the concern will stay.
> Unprivileged bpf as a whole is the concern due to side channels.
> The number of them are not yet disclosed. Who is going to analyze them?
> imo the only answer to that is kernel.unprivileged_bpf_disabled=1
> which together with CONFIG_BPF_JIT_ALWAYS_ON is secure enough.
> The other option is to sprinkle every bpf load/store with lfence
> which will make execution so slow that it will be unusable.
> Which is effectively the same as unprivileged_bpf_disabled=1.
> 
> There are other things we can do. Like kasan-style shadow memory
> for bpf execution. Auto re-JITing the code after it's running.
> We can do lfences everywhere for some time then re-JIT
> when kasan-ed shadow memory shows only clean memory accesses.
> The beauty of BPF that it is analyze-able and JIT-able instruction set.
> The verifier speculative analysis is an example that the kernel
> can analyze the speculative execution path that cpu will
> take before the code starts executing.
> Unprivileged bpf can made absolutely secure. It can be
> made more secure than the rest of the kernel.
> But today we should just go with unprivileged_bpf_disabled=1

I’m still okay with this.

> and CAP_BPF.
> 

I think this needs more design work.  I’m halfway through writing up an actual proposal. I’ll send it soon.

^ permalink raw reply

* Re: Help needed - Kernel lockup while running ipsec
From: Florian Westphal @ 2019-08-19 17:38 UTC (permalink / raw)
  To: Vakul Garg; +Cc: netdev@vger.kernel.org
In-Reply-To: <DB7PR04MB4620CD9AFFAFF8678F803DCE8BA80@DB7PR04MB4620.eurprd04.prod.outlook.com>

Vakul Garg <vakul.garg@nxp.com> wrote:
> Hi
> 
> With kernel 4.14.122, I am getting a kernel softlockup while running single static ipsec tunnel.
> The problem reproduces mostly after running 8-10 hours of ipsec encap test (on my dual core arm board).
> 
> I found that in function xfrm_policy_lookup_bytype(), the policy in variable 'ret' shows refcnt=0 under problem situation.
> This creates an infinite loop in  xfrm_policy_lookup_bytype() and hence the lockup.
> 
> Can some body please provide me pointers about 'refcnt'?
> Is it legitimate for 'refcnt' to become '0'? Under what condition can it become '0'?

Yes, when policy is destroyed and the last user calls
xfrm_pol_put() which will invoke call_rcu to free the structure.

^ permalink raw reply

* Re: [PATCH net-next 2/3] net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
From: Andrew Lunn @ 2019-08-19 17:34 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: netdev, lkml, Richard Cochran, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, David S. Miller
In-Reply-To: <CAFfN3gUNrnjdOt8bW2EugzjSZMb_vvdEaLN0yOv_06=roqcJYQ@mail.gmail.com>

On Mon, Aug 19, 2019 at 07:14:25PM +0200, Hubert Feurstein wrote:
> Hi Andrew,
> 
> Am Mo., 19. Aug. 2019 um 15:27 Uhr schrieb Andrew Lunn <andrew@lunn.ch>:
> >
> > > @@ -45,7 +45,8 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
> > >  {
> > >       int ret;
> > >
> > > -     ret = mdiobus_write_nested(chip->bus, dev, reg, data);
> > > +     ret = mdiobus_write_sts_nested(chip->bus, dev, reg, data,
> > > +                                    chip->ptp_sts);
> > >       if (ret < 0)
> > >               return ret;
> > >
> >
> > Please also make a similar change to mv88e6xxx_smi_indirect_write().
> > The last write in that function should be timestamped.
> Since it is already the last write it should be already ok (The
> mv88e6xxx_smi_indirect_write
> calls the mv88e6xxx_smi_direct_write which initiates the timestamping).

Hi Hubert

But you are also time stamping the first write as well. And it seems
like it is not completely for free. So it would be nice to limit it to
the write which actually matters.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 3/6] net: dsa: enable and disable all ports
From: Marek Behun @ 2019-08-19 17:32 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, davem, f.fainelli, andrew
In-Reply-To: <20190818173548.19631-4-vivien.didelot@gmail.com>

On Sun, 18 Aug 2019 13:35:45 -0400
Vivien Didelot <vivien.didelot@gmail.com> wrote:

> Call the .port_enable and .port_disable functions for all ports,
> not only the user ports, so that drivers may optimize the power
> consumption of all ports after a successful setup.
> 
> Unused ports are now disabled on setup. CPU and DSA ports are now
> enabled on setup and disabled on teardown. User ports were already
> enabled at slave creation and disabled at slave destruction.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>

My original reason for enabling CPU and DSA ports is that enabling
serdes irq could not be done in .setup in mv88e6xxx, since the required
phylink structures did not yet exists for those ports.

The case after this patch would be that .port_enable is called for
CPU/DSA ports right after these required phylink structures are created
for this port. A thought came to me while reading this that some driver
in the future can expect, in their implementation of
port_enable/port_disable, that phylink structures already exist for all
ports, not just the one being currently enabled/disabled.

Wouldn't it be safer if CPU/DSA ports were enabled in setup after all
ports are registered, and disabled in teardown before ports are
unregistered?

Current:
  ->setup()
  for each port
      dsa_port_link_register_of()
      dsa_port_enable()

Proposed:
  ->setup()
  for each port
      dsa_port_link_register_of()
  for each port
      dsa_port_enable()

Marek

^ permalink raw reply

* [PATCH net-next v2 0/4] Improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Hubert Feurstein @ 2019-08-19 17:28 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Vivien Didelot,
	Florian Fainelli, Heiner Kallweit, Vladimir Oltean, Fugang Duan,
	David S. Miller

With this patchset the phc2sys synchronisation precision improved to +/-555ns on
an IMX6DL with an MV88E6220 switch attached.

This patchset takes into account the comments from the following discussions:
- https://lkml.org/lkml/2019/8/2/1364
- https://lkml.org/lkml/2019/8/5/169

Patch 01 adds the required infrastructure in the MDIO layer.
Patch 02 adds additional PTP offset compensation
Patch 03 adds support for the PTP_SYS_OFFSET_EXTENDED ioctl in the mv88e6xxx driver.
Patch 04 adds support for the PTP system timestamping in the imx-fec driver.

The following tests show the improvement caused by each patch. The system clock 
precision was set to 15ns instead of 333ns (as described in https://lkml.org/lkml/2019/8/2/1364).

Without this patchset applied, the phc2sys synchronisation performance was very 
poor:

  offset: min -27120 max 28840 mean 2.44 stddev 8040.78 count 1236
  delay:  min 282103 max 386385 mean 352568.03 stddev 27814.27 count 1236
  (test runtime 20 minutes)

Results after appling patch 01-03:

  offset: min -12316 max 13314 mean -9.38 stddev 4274.82 count 1022
  delay:  min 69977 max 96266 mean 87939.04 stddev 6466.17 count 1022
  (test runtime 16 minutes)

Results after appling patch 03:

  offset: min -788 max 528 mean -0.06 stddev 185.02 count 7171
  delay:  min 1773 max 2031 mean 1909.43 stddev 33.74 count 7171
  (test runtime 119 minutes)

Hubert Feurstein (4):
  net: mdio: add support for passing a PTP system timestamp to the
    mii_bus driver
  net: mdio: add PTP offset compensation to mdiobus_write_sts
  net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
  net: fec: add support for PTP system timestamping for MDIO devices

 drivers/net/dsa/mv88e6xxx/chip.h          |  2 +
 drivers/net/dsa/mv88e6xxx/ptp.c           | 11 +--
 drivers/net/dsa/mv88e6xxx/smi.c           |  3 +-
 drivers/net/ethernet/freescale/fec_main.c |  7 +-
 drivers/net/phy/mdio_bus.c                | 88 +++++++++++++++++++++++
 include/linux/mdio.h                      |  5 ++
 include/linux/phy.h                       | 42 +++++++++++
 7 files changed, 152 insertions(+), 6 deletions(-)

-- 
2.22.1


^ permalink raw reply

* [PATCH net-next v2 3/4] net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
From: Hubert Feurstein @ 2019-08-19 17:28 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Vivien Didelot,
	Florian Fainelli, Vladimir Oltean, David S. Miller
In-Reply-To: <20190819172827.9550-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
 drivers/net/dsa/mv88e6xxx/chip.h |  2 ++
 drivers/net/dsa/mv88e6xxx/ptp.c  | 11 +++++++----
 drivers/net/dsa/mv88e6xxx/smi.c  |  3 ++-
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index a406be2f5652..1bfde0d8a5a3 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -275,6 +275,8 @@ struct mv88e6xxx_chip {
 	struct ptp_clock_info	ptp_clock_info;
 	struct delayed_work	tai_event_work;
 	struct ptp_pin_desc	pin_config[MV88E6XXX_MAX_GPIO];
+	struct ptp_system_timestamp *ptp_sts;
+
 	u16 trig_config;
 	u16 evcap_config;
 	u16 enable_count;
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index 073cbd0bb91b..cf6e52ee9e0a 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
 	return 0;
 }
 
-static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
-				 struct timespec64 *ts)
+static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
+				  struct timespec64 *ts,
+				  struct ptp_system_timestamp *sts)
 {
 	struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
 	u64 ns;
 
 	mv88e6xxx_reg_lock(chip);
+	chip->ptp_sts = sts;
 	ns = timecounter_read(&chip->tstamp_tc);
+	chip->ptp_sts = NULL;
 	mv88e6xxx_reg_unlock(chip);
 
 	*ts = ns_to_timespec64(ns);
@@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
 	struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
 	struct timespec64 ts;
 
-	mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
+	mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
 
 	schedule_delayed_work(&chip->overflow_work,
 			      MV88E6XXX_TAI_OVERFLOW_PERIOD);
@@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
 	chip->ptp_clock_info.max_adj    = MV88E6XXX_MAX_ADJ_PPB;
 	chip->ptp_clock_info.adjfine	= mv88e6xxx_ptp_adjfine;
 	chip->ptp_clock_info.adjtime	= mv88e6xxx_ptp_adjtime;
-	chip->ptp_clock_info.gettime64	= mv88e6xxx_ptp_gettime;
+	chip->ptp_clock_info.gettimex64	= mv88e6xxx_ptp_gettimex;
 	chip->ptp_clock_info.settime64	= mv88e6xxx_ptp_settime;
 	chip->ptp_clock_info.enable	= ptp_ops->ptp_enable;
 	chip->ptp_clock_info.verify	= ptp_ops->ptp_verify;
diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
index 282fe08db050..abedd04ff2ae 100644
--- a/drivers/net/dsa/mv88e6xxx/smi.c
+++ b/drivers/net/dsa/mv88e6xxx/smi.c
@@ -45,7 +45,8 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
 {
 	int ret;
 
-	ret = mdiobus_write_nested(chip->bus, dev, reg, data);
+	ret = mdiobus_write_sts_nested(chip->bus, dev, reg, data,
+				       chip->ptp_sts);
 	if (ret < 0)
 		return ret;
 
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v2 1/4] net: mdio: add support for passing a PTP system timestamp to the mii_bus driver
From: Hubert Feurstein @ 2019-08-19 17:28 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <20190819172827.9550-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

In order to improve the synchronisation precision of phc2sys (from
the linuxptp project) for devices like switches which are attached
to the MDIO bus, it is necessary the get the system timestamps as
close as possible to the access which causes the PTP timestamp
register to be snapshotted in the switch hardware. Usually this is
triggered by an MDIO write access, the snapshotted timestamp is then
transferred by several MDIO reads.

This patch adds the required infrastructure to solve the problem described
above.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
Changes in v2:
 - Removed mdiobus_write_sts as there was no user
 - Removed ptp_sts_supported-boolean and introduced flags instead

 drivers/net/phy/mdio_bus.c | 76 ++++++++++++++++++++++++++++++++++++++
 include/linux/mdio.h       |  5 +++
 include/linux/phy.h        | 34 +++++++++++++++++
 3 files changed, 115 insertions(+)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index bd04fe762056..4dba2714495e 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -34,6 +34,7 @@
 #include <linux/phy.h>
 #include <linux/io.h>
 #include <linux/uaccess.h>
+#include <linux/ptp_clock_kernel.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/mdio.h>
@@ -697,6 +698,81 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
 }
 EXPORT_SYMBOL(mdiobus_write);
 
+/**
+ * __mdiobus_write_sts - Unlocked version of the mdiobus_write_sts function
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * Write a MDIO bus register and request the MDIO bus driver to take the
+ * system timestamps when sts-pointer is valid. When the bus driver doesn't
+ * support this, the timestamps are taken in this function instead.
+ *
+ * In order to improve the synchronisation precision of phc2sys (from
+ * the linuxptp project) for devices like switches which are attached
+ * to the MDIO bus, it is necessary the get the system timestamps as
+ * close as possible to the access which causes the PTP timestamp
+ * register to be snapshotted in the switch hardware. Usually this is
+ * triggered by an MDIO write access, the snapshotted timestamp is then
+ * transferred by several MDIO reads.
+ *
+ * Caller must hold the mdio bus lock.
+ *
+ * NOTE: MUST NOT be called from interrupt context.
+ */
+int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			struct ptp_system_timestamp *sts)
+{
+	int retval;
+
+	WARN_ON_ONCE(!mutex_is_locked(&bus->mdio_lock));
+
+	if (!(bus->flags & MII_BUS_F_PTP_STS_SUPPORTED))
+		ptp_read_system_prets(sts);
+
+	bus->ptp_sts = sts;
+	retval = __mdiobus_write(bus, addr, regnum, val);
+	bus->ptp_sts = NULL;
+
+	if (!(bus->flags & MII_BUS_F_PTP_STS_SUPPORTED))
+		ptp_read_system_postts(sts);
+
+	return retval;
+}
+EXPORT_SYMBOL(__mdiobus_write_sts);
+
+/**
+ * mdiobus_write_sts_nested - Nested version of the mdiobus_write_sts function
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * In case of nested MDIO bus access avoid lockdep false positives by
+ * using mutex_lock_nested().
+ *
+ * NOTE: MUST NOT be called from interrupt context,
+ * because the bus read/write functions may wait for an interrupt
+ * to conclude the operation.
+ */
+int mdiobus_write_sts_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			     struct ptp_system_timestamp *sts)
+{
+	int retval;
+
+	BUG_ON(in_interrupt());
+
+	mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
+	retval = __mdiobus_write_sts(bus, addr, regnum, val, sts);
+	mutex_unlock(&bus->mdio_lock);
+
+	return retval;
+}
+EXPORT_SYMBOL(mdiobus_write_sts_nested);
+
 /**
  * mdio_bus_match - determine if given MDIO driver supports the given
  *		    MDIO device
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index e8242ad88c81..482388341c7b 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -9,6 +9,7 @@
 #include <uapi/linux/mdio.h>
 #include <linux/mod_devicetable.h>
 
+struct ptp_system_timestamp;
 struct gpio_desc;
 struct mii_bus;
 
@@ -305,11 +306,15 @@ static inline void mii_10gbt_stat_mod_linkmode_lpa_t(unsigned long *advertising,
 
 int __mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
 int __mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			struct ptp_system_timestamp *sts);
 
 int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
 int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum);
 int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
 int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int mdiobus_write_sts_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+			     struct ptp_system_timestamp *sts);
 
 int mdiobus_register_device(struct mdio_device *mdiodev);
 int mdiobus_unregister_device(struct mdio_device *mdiodev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index d26779f1fb6b..0b33662e0320 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -205,6 +205,13 @@ struct device;
 struct phylink;
 struct sk_buff;
 
+/* MII-bus flags:
+ * @MII_BUS_F_PTP_STS_SUPPORTED: The driver supports PTP system timestamping
+ */
+enum mii_bus_flags {
+	MII_BUS_F_PTP_STS_SUPPORTED = BIT(0)
+};
+
 /*
  * The Bus class for PHYs.  Devices which provide access to
  * PHYs should register using this structure
@@ -252,7 +259,34 @@ struct mii_bus {
 	int reset_delay_us;
 	/* RESET GPIO descriptor pointer */
 	struct gpio_desc *reset_gpiod;
+
+	/* Feature flags */
+	u32 flags;
+
+	/* PTP system timestamping support
+	 *
+	 * In order to improve the synchronisation precision of phc2sys (from
+	 * the linuxptp project) for devices like switches which are attached
+	 * to the MDIO bus, it is necessary the get the system timestamps as
+	 * close as possible to the access which causes the PTP timestamp
+	 * register to be snapshotted in the switch hardware. Usually this is
+	 * triggered by an MDIO write access, the snapshotted timestamp is then
+	 * transferred by several MDIO reads.
+	 *
+	 * The switch driver can use mdio_write_sts*() to pass through the
+	 * system timestamp pointer @ptp_sts to the MDIO bus driver. The bus
+	 * driver simply has to do the following calls in its write handler:
+	 *	ptp_read_system_prets(bus->ptp_sts);
+	 *	writel(value, mdio-register)
+	 *	ptp_read_system_postts(bus->ptp_sts);
+	 *
+	 * The ptp_read_system_*ts functions already check the ptp_sts pointer.
+	 * The MII_BUS_F_PTP_STS_SUPPORTED-bit must be set in flags, when the
+	 * MDIO bus driver takes the timestamps as described above.
+	 */
+	struct ptp_system_timestamp *ptp_sts;
 };
+
 #define to_mii_bus(d) container_of(d, struct mii_bus, dev)
 
 struct mii_bus *mdiobus_alloc_size(size_t);
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v2 4/4] net: fec: add support for PTP system timestamping for MDIO devices
From: Hubert Feurstein @ 2019-08-19 17:28 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Fugang Duan,
	Vladimir Oltean, David S. Miller
In-Reply-To: <20190819172827.9550-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

In order to improve the synchronisation precision of phc2sys (from
the linuxptp project) for devices like switches which are attached
to the MDIO bus, it is necessary the get the system timestamps as
close as possible to the access which causes the PTP timestamp
register to be snapshotted in the switch hardware. Usually this is
triggered by an MDIO write access, the snapshotted timestamp is then
transferred by several MDIO reads.

The ptp_read_system_*ts functions already check the ptp_sts pointer.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
 drivers/net/ethernet/freescale/fec_main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index c01d3ec3e9af..dd1253683ac0 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1815,10 +1815,12 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
 	reinit_completion(&fep->mdio_done);
 
 	/* start a write op */
+	ptp_read_system_prets(bus->ptp_sts);
 	writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
 		FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
 		FEC_MMFR_TA | FEC_MMFR_DATA(value),
 		fep->hwp + FEC_MII_DATA);
+	ptp_read_system_postts(bus->ptp_sts);
 
 	/* wait for end of transfer */
 	time_left = wait_for_completion_timeout(&fep->mdio_done,
@@ -1956,7 +1958,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	struct fec_enet_private *fep = netdev_priv(ndev);
 	struct device_node *node;
 	int err = -ENXIO;
-	u32 mii_speed, holdtime;
+	u32 mii_speed, mii_period, holdtime;
 
 	/*
 	 * The i.MX28 dual fec interfaces are not equal.
@@ -1993,6 +1995,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 	 * document.
 	 */
 	mii_speed = DIV_ROUND_UP(clk_get_rate(fep->clk_ipg), 5000000);
+	mii_period = div_u64((u64)mii_speed * 2 * NSEC_PER_SEC, clk_get_rate(fep->clk_ipg));
 	if (fep->quirks & FEC_QUIRK_ENET_MAC)
 		mii_speed--;
 	if (mii_speed > 63) {
@@ -2034,6 +2037,8 @@ static int fec_enet_mii_init(struct platform_device *pdev)
 		pdev->name, fep->dev_id + 1);
 	fep->mii_bus->priv = fep;
 	fep->mii_bus->parent = &pdev->dev;
+	fep->mii_bus->flags = MII_BUS_F_PTP_STS_SUPPORTED;
+	fep->mii_bus->ptp_sts_offset = 32 * mii_period;
 
 	node = of_get_child_by_name(pdev->dev.of_node, "mdio");
 	err = of_mdiobus_register(fep->mii_bus, node);
-- 
2.22.1


^ permalink raw reply related

* [PATCH net-next v2 2/4] net: mdio: add PTP offset compensation to mdiobus_write_sts
From: Hubert Feurstein @ 2019-08-19 17:28 UTC (permalink / raw)
  To: netdev, linux-kernel
  Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Florian Fainelli,
	Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <20190819172827.9550-1-hubert.feurstein@vahle.at>

From: Hubert Feurstein <h.feurstein@gmail.com>

The slow MDIO access introduces quite a big offset (~13us) to the PTP
system time synchronisation. With this patch the driver has the possibility
to set the correct offset which can then be compensated.

Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
 drivers/net/phy/mdio_bus.c | 12 ++++++++++++
 include/linux/phy.h        |  8 ++++++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 4dba2714495e..50a37cf46f96 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -739,6 +739,18 @@ int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
 	if (!(bus->flags & MII_BUS_F_PTP_STS_SUPPORTED))
 		ptp_read_system_postts(sts);
 
+	/* PTP offset compensation:
+	 * After the MDIO access is completed (from the chip perspective), the
+	 * switch chip will snapshot the PHC timestamp. To make sure our system
+	 * timestamp corresponds to the PHC timestamp, we have to add the
+	 * duration of this MDIO access to sts->post_ts. Linuxptp's phc2sys
+	 * takes the average of pre_ts and post_ts to calculate the final
+	 * system timestamp. With this in mind, we have to add ptp_sts_offset
+	 * twice to post_ts, in order to not introduce an constant time offset.
+	 */
+	if (sts)
+		timespec64_add_ns(&sts->post_ts, 2 * bus->ptp_sts_offset);
+
 	return retval;
 }
 EXPORT_SYMBOL(__mdiobus_write_sts);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 0b33662e0320..615df9c7f2c3 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -283,8 +283,16 @@ struct mii_bus {
 	 * The ptp_read_system_*ts functions already check the ptp_sts pointer.
 	 * The MII_BUS_F_PTP_STS_SUPPORTED-bit must be set in flags, when the
 	 * MDIO bus driver takes the timestamps as described above.
+	 *
+	 * @ptp_sts_offset: This is the compensation offset for the system
+	 * timestamp which is introduced by the slow MDIO access duration. An
+	 * MDIO access consists of 32 clock cycles. Usually the MDIO bus runs
+	 * at ~2.5MHz, so we have to compensate ~12800ns offset.
+	 * Set the ptp_sts_offset to the exact duration of one MDIO frame
+	 * (= 32 * clock-period) in nano-seconds.
 	 */
 	struct ptp_system_timestamp *ptp_sts;
+	u32 ptp_sts_offset;
 };
 
 #define to_mii_bus(d) container_of(d, struct mii_bus, dev)
-- 
2.22.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox