Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v1 1/4] media: rc: introduce BPF_PROG_IR_DECODER
From: Sean Young @ 2018-05-15 10:30 UTC (permalink / raw)
  To: Y Song
  Cc: linux-media, linux-kernel, Alexei Starovoitov,
	Mauro Carvalho Chehab, Daniel Borkmann, netdev, Matthias Reichl,
	Devin Heitmueller
In-Reply-To: <CAH3MdRXkzpJZ=VLwJWpuwiNhcjJwjNqpaZXwk+1-HfL_7hjJew@mail.gmail.com>

On Mon, May 14, 2018 at 09:48:05PM -0700, Y Song wrote:
> On Mon, May 14, 2018 at 2:10 PM, Sean Young <sean@mess.org> wrote:
> > Add support for BPF_PROG_IR_DECODER. This type of BPF program can call
> > rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report
> > that the last key should be repeated.
> >
> > Signed-off-by: Sean Young <sean@mess.org>
> > ---
> >  drivers/media/rc/Kconfig          |  8 +++
> >  drivers/media/rc/Makefile         |  1 +
> >  drivers/media/rc/ir-bpf-decoder.c | 93 +++++++++++++++++++++++++++++++
> >  include/linux/bpf_types.h         |  3 +
> >  include/uapi/linux/bpf.h          | 16 +++++-
> >  5 files changed, 120 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/media/rc/ir-bpf-decoder.c
> >
> > diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig
> > index eb2c3b6eca7f..10ad6167d87c 100644
> > --- a/drivers/media/rc/Kconfig
> > +++ b/drivers/media/rc/Kconfig
> > @@ -120,6 +120,14 @@ config IR_IMON_DECODER
> >            remote control and you would like to use it with a raw IR
> >            receiver, or if you wish to use an encoder to transmit this IR.
> >
> > +config IR_BPF_DECODER
> > +       bool "Enable IR raw decoder using BPF"
> > +       depends on BPF_SYSCALL
> > +       depends on RC_CORE=y
> > +       help
> > +          Enable this option to make it possible to load custom IR
> > +          decoders written in BPF.
> > +
> >  endif #RC_DECODERS
> >
> >  menuconfig RC_DEVICES
> > diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile
> > index 2e1c87066f6c..12e1118430d0 100644
> > --- a/drivers/media/rc/Makefile
> > +++ b/drivers/media/rc/Makefile
> > @@ -5,6 +5,7 @@ obj-y += keymaps/
> >  obj-$(CONFIG_RC_CORE) += rc-core.o
> >  rc-core-y := rc-main.o rc-ir-raw.o
> >  rc-core-$(CONFIG_LIRC) += lirc_dev.o
> > +rc-core-$(CONFIG_IR_BPF_DECODER) += ir-bpf-decoder.o
> >  obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o
> >  obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o
> >  obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o
> > diff --git a/drivers/media/rc/ir-bpf-decoder.c b/drivers/media/rc/ir-bpf-decoder.c
> > new file mode 100644
> > index 000000000000..aaa5e208b1a5
> > --- /dev/null
> > +++ b/drivers/media/rc/ir-bpf-decoder.c
> > @@ -0,0 +1,93 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// ir-bpf-decoder.c - handles bpf decoders
> > +//
> > +// Copyright (C) 2018 Sean Young <sean@mess.org>
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/filter.h>
> > +#include "rc-core-priv.h"
> > +
> > +/*
> > + * BPF interface for raw IR decoder
> > + */
> > +const struct bpf_prog_ops ir_decoder_prog_ops = {
> > +};
> > +
> > +BPF_CALL_1(bpf_rc_repeat, struct ir_raw_event*, event)
> > +{
> > +       struct ir_raw_event_ctrl *ctrl;
> > +
> > +       ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev);
> > +
> > +       rc_repeat(ctrl->dev);
> > +       return 0;
> > +}
> > +
> > +static const struct bpf_func_proto rc_repeat_proto = {
> > +       .func      = bpf_rc_repeat,
> > +       .gpl_only  = true, // rc_repeat is EXPORT_SYMBOL_GPL
> > +       .ret_type  = RET_VOID,
> 
> I suggest using RET_INTEGER here since we do return an integer 0.
> RET_INTEGER will also make it easy to extend if you want to return
> a non-zero value for error code or other reasons.

Ok.

> > +       .arg1_type = ARG_PTR_TO_CTX,
> > +};
> > +
> > +BPF_CALL_4(bpf_rc_keydown, struct ir_raw_event*, event, u32, protocol,
> > +          u32, scancode, u32, toggle)
> > +{
> > +       struct ir_raw_event_ctrl *ctrl;
> > +
> > +       ctrl = container_of(event, struct ir_raw_event_ctrl, prev_ev);
> > +       rc_keydown(ctrl->dev, protocol, scancode, toggle != 0);
> > +       return 0;
> > +}
> > +
> > +static const struct bpf_func_proto rc_keydown_proto = {
> > +       .func      = bpf_rc_keydown,
> > +       .gpl_only  = true, // rc_keydown is EXPORT_SYMBOL_GPL
> > +       .ret_type  = RET_VOID,
> 
> ditto. RET_INTEGER is preferable.

Ok.

> > +       .arg1_type = ARG_PTR_TO_CTX,
> > +       .arg2_type = ARG_ANYTHING,
> > +       .arg3_type = ARG_ANYTHING,
> > +       .arg4_type = ARG_ANYTHING,
> > +};
> > +
> > +static const struct bpf_func_proto *ir_decoder_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +       switch (func_id) {
> > +       case BPF_FUNC_rc_repeat:
> > +               return &rc_repeat_proto;
> > +       case BPF_FUNC_rc_keydown:
> > +               return &rc_keydown_proto;
> > +       case BPF_FUNC_map_lookup_elem:
> > +               return &bpf_map_lookup_elem_proto;
> > +       case BPF_FUNC_map_update_elem:
> > +               return &bpf_map_update_elem_proto;
> > +       case BPF_FUNC_map_delete_elem:
> > +               return &bpf_map_delete_elem_proto;
> > +       case BPF_FUNC_ktime_get_ns:
> > +               return &bpf_ktime_get_ns_proto;
> > +       case BPF_FUNC_tail_call:
> > +               return &bpf_tail_call_proto;
> > +       case BPF_FUNC_get_prandom_u32:
> > +               return &bpf_get_prandom_u32_proto;
> > +       default:
> > +               return NULL;
> > +       }
> > +}
> > +
> > +static bool ir_decoder_is_valid_access(int off, int size,
> > +                                      enum bpf_access_type type,
> > +                                      const struct bpf_prog *prog,
> > +                                      struct bpf_insn_access_aux *info)
> > +{
> > +       if (type == BPF_WRITE)
> > +               return false;
> > +       if (off < 0 || off + size > sizeof(struct ir_raw_event))
> > +               return false;
> 
> You probably need more than just checking the boundary.
> >From patch #3, the structure is:
>  struct ir_raw_event {
>         union {
>                 __u32           duration;
>                 __u32           carrier;
>         };
>         __u8                    duty_cycle;
> 
>         unsigned                pulse:1;
>         unsigned                reset:1;
>         unsigned                timeout:1;
>        unsigned                carrier_report:1;
> };
> 
> You would like the memory access to be aligned,
> so accessing duration/carrier with 4-byte alignment, and
> pulse/reset/timeout/carrier_report 4-byte alignment as well.
> 
> You could only allow __u32 access to duration/carrier.
> But if you want bpf program to access duration/carrier with
> code like (__u16)(ctx->duration), then the compiler may
> translate the load to a 2-byte load. You may need to handle
> endianness here. You can check net/core/filter.c function
> bpf_skb_is_valid_access for some examples.

Thanks, yes that makes sense. Actually exposing a struct with bit fields
isn't great. I think I can rework into something simpler (two u32 fields).

> > +
> > +       return true;
> > +}
> > +
> > +const struct bpf_verifier_ops ir_decoder_verifier_ops = {
> > +       .get_func_proto  = ir_decoder_func_proto,
> > +       .is_valid_access = ir_decoder_is_valid_access
> > +};
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index 2b28fcf6f6ae..ee5355715ee0 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -25,6 +25,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_RAW_TRACEPOINT, raw_tracepoint)
> >  #ifdef CONFIG_CGROUP_BPF
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
> >  #endif
> > +#ifdef CONFIG_IR_BPF_DECODER
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_RAWIR_DECODER, ir_decoder)
> > +#endif
> >
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c5ec89732a8d..6ad053e831c0 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -137,6 +137,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_SK_MSG,
> >         BPF_PROG_TYPE_RAW_TRACEPOINT,
> >         BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> > +       BPF_PROG_TYPE_RAWIR_DECODER,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -755,6 +756,17 @@ union bpf_attr {
> >   *     @addr: pointer to struct sockaddr to bind socket to
> >   *     @addr_len: length of sockaddr structure
> >   *     Return: 0 on success or negative error code
> > + *
> > + * int bpf_rc_keydown(ctx, protocol, scancode, toggle)
> > + *     Report decoded scancode with toggle value
> > + *     @ctx: pointer to ctx
> > + *     @protocol: decoded protocol
> > + *     @scancode: decoded scancode
> > + *     @toggle: set to 1 if button was toggled, else 0
> > + *
> > + * int bpf_rc_repeat(ctx)
> > + *     Repeat the last decoded scancode
> > + *     @ctx: pointer to ctx
> 
> The comment format has changed dramatically for
> documentation reason. Could you rebase your change
> on top of bpf-next tree? You will need to rewrite the above
> helper description so tools can generate proper documentation
> for them.

Ah, I need to rebase on top of bpf-next.

Thanks!

> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -821,7 +833,9 @@ union bpf_attr {
> >         FN(msg_apply_bytes),            \
> >         FN(msg_cork_bytes),             \
> >         FN(msg_pull_data),              \
> > -       FN(bind),
> > +       FN(bind),                       \
> > +       FN(rc_repeat),                  \
> > +       FN(rc_keydown),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > --
> > 2.17.0
> >

^ permalink raw reply

* Re: [PATCH net-next v10 6/7] sch_cake: Add overhead compensation support to the rate shaper
From: Toke Høiland-Jørgensen @ 2018-05-15 10:22 UTC (permalink / raw)
  To: netdev; +Cc: cake
In-Reply-To: <152632442995.4861.6626480556330393864.stgit@alrua-kau>

Toke Høiland-Jørgensen <toke@toke.dk> writes:

> This commit adds configurable overhead compensation support to the rate
> shaper. With this feature, userspace can configure the actual bottleneck
> link overhead and encapsulation mode used, which will be used by the shaper
> to calculate the precise duration of each packet on the wire.
>
> This feature is needed because CAKE is often deployed one or two hops
> upstream of the actual bottleneck (which can be, e.g., inside a DSL or
> cable modem). In this case, the link layer characteristics and overhead
> reported by the kernel does not match the actual bottleneck. Being able to
> set the actual values in use makes it possible to configure the shaper rate
> much closer to the actual bottleneck rate (our experience shows it is
> possible to get with 0.1% of the actual physical bottleneck rate), thus
> keeping latency low without sacrificing bandwidth.
>
> The overhead compensation has three tunables: A fixed per-packet overhead
> size (which, if set, will be accounted from the IP packet header), a
> minimum packet size (MPU) and a framing mode supporting either ATM or PTM
> framing. We include a set of common keywords in TC to help users configure
> the right parameters. If no overhead value is set, the value reported by
> the kernel is used.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> ---
>  net/sched/sch_cake.c |  123 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 122 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index ccc6f26b306c..6314a089a204 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -275,6 +275,7 @@ enum {
>  
>  struct cobalt_skb_cb {
>  	cobalt_time_t enqueue_time;
> +	u32           adjusted_len;
>  };
>  
>  static cobalt_time_t cobalt_get_time(void)
> @@ -1130,6 +1131,87 @@ static cobalt_time_t cake_ewma(cobalt_time_t avg, cobalt_time_t sample,
>  	return avg;
>  }
>  
> +static u32 cake_overhead(struct cake_sched_data *q, struct sk_buff *skb)
> +{
> +	const struct skb_shared_info *shinfo = skb_shinfo(skb);
> +	u32 off = skb_network_offset(skb);
> +	u32 len = qdisc_pkt_len(skb);
> +	u16 segs = 1;
> +
> +	if (unlikely(shinfo->gso_size)) {
> +		/* borrowed from qdisc_pkt_len_init() */
> +		unsigned int hdr_len;
> +
> +		hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
> +
> +		/* + transport layer */
> +		if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 |
> +					       SKB_GSO_TCPV6))) {
> +			const struct tcphdr *th;
> +			struct tcphdr _tcphdr;
> +
> +			th = skb_header_pointer(skb, skb_transport_offset(skb),
> +						sizeof(_tcphdr), &_tcphdr);
> +			if (likely(th))
> +				hdr_len += __tcp_hdrlen(th);
> +		} else {
> +			struct udphdr _udphdr;
> +
> +			if (skb_header_pointer(skb, skb_transport_offset(skb),
> +					       sizeof(_udphdr), &_udphdr))
> +				hdr_len += sizeof(struct udphdr);
> +		}
> +
> +		if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
> +			segs = DIV_ROUND_UP(skb->len - hdr_len,
> +					    shinfo->gso_size);
> +		else
> +			segs = shinfo->gso_segs;
> +
> +		/* The last segment may be shorter; we ignore this, which means
> +		 * that we will over-estimate the size of the whole GSO segment
> +		 * by the difference in size. This is conservative, so we live
> +		 * with that to avoid the complexity of dealing with it.
> +		 */
> +		len = shinfo->gso_size + hdr_len;
> +	}
> +
> +	q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
> +
> +	if (q->rate_flags & CAKE_FLAG_OVERHEAD)
> +		len -= off;
> +
> +	if (q->max_netlen < len)
> +		q->max_netlen = len;
> +	if (q->min_netlen > len)
> +		q->min_netlen = len;
> +
> +	len += q->rate_overhead;
> +
> +	if (len < q->rate_mpu)
> +		len = q->rate_mpu;
> +
> +	if (q->atm_mode == CAKE_ATM_ATM) {
> +		len += 47;
> +		len /= 48;
> +		len *= 53;
> +	} else if (q->atm_mode == CAKE_ATM_PTM) {
> +		/* Add one byte per 64 bytes or part thereof.
> +		 * This is conservative and easier to calculate than the
> +		 * precise value.
> +		 */
> +		len += (len + 63) / 64;
> +	}
> +
> +	if (q->max_adjlen < len)
> +		q->max_adjlen = len;
> +	if (q->min_adjlen > len)
> +		q->min_adjlen = len;
> +
> +	get_cobalt_cb(skb)->adjusted_len = len * segs;
> +	return len;

Well, this is embarrassing; seems that I broke this somewhere along the
way. Will resend with a fix...

-Toke

^ permalink raw reply

* Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks
From: Tariq Toukan @ 2018-05-15  9:19 UTC (permalink / raw)
  To: Qing Huang, tariqt, davem, haakon.bugge, yanjun.zhu
  Cc: netdev, linux-rdma, linux-kernel
In-Reply-To: <ff9c44ca-eaae-cb3e-4c69-e9e3cbb72ca1@oracle.com>



On 14/05/2018 7:41 PM, Qing Huang wrote:
> 
> 
> On 5/13/2018 2:00 AM, Tariq Toukan wrote:
>>
>>
>> On 11/05/2018 10:23 PM, Qing Huang wrote:
>>> When a system is under memory presure (high usage with fragments),
>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>> memory management to enter slow path doing memory compact/migration
>>> ops in order to complete high order memory allocations.
>>>
>>> When that happens, user processes calling uverb APIs may get stuck
>>> for more than 120s easily even though there are a lot of free pages
>>> in smaller chunks available in the system.
>>>
>>> Syslog:
>>> ...
>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>> ...
>>>
>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>
>>> However in order to support smaller ICM chunk size, we need to fix
>>> another issue in large size kcalloc allocations.
>>>
>>> E.g.
>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
>>> entry). So we need a 16MB allocation for a table->icm pointer array to
>>> hold 2M pointers which can easily cause kcalloc to fail.
>>>
>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>> for contiguous memory pages for a driver meta data structure (no need
>>> of DMA ops).
>>>
>>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>>> Acked-by: Daniel Jurgens <danielj@mellanox.com>
>>> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
>>> ---
>>> v2 -> v1: adjusted chunk size to reflect different architectures.
>>>
>>>   drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
>>> b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>> index a822f7a..ccb62b8 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>> @@ -43,12 +43,12 @@
>>>   #include "fw.h"
>>>     /*
>>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>>> - * per chunk.
>>> + * We allocate in page size (default 4KB on many archs) chunks to 
>>> avoid high
>>> + * order memory allocations in fragmented/high usage memory situation.
>>>    */
>>>   enum {
>>> -    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
>>> -    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
>>> +    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
>>> +    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT
>>
>> Which is actually PAGE_SIZE.
> 
> Yes, we wanted to avoid high order memory allocations.
> 

Then please use PAGE_SIZE instead.

>> Also, please add a comma at the end of the last entry.
> 
> Hmm..., followed the existing code style and checkpatch.pl didn't 
> complain about the comma.
> 

I am in favor of having a comma also after the last element, so that 
when another enum element is added we do not modify this line again, 
which would falsely affect git blame.

I know it didn't exist before your patch, but once we're here, let's do it.

>>
>>>   };
>>>     static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
>>> mlx4_icm_chunk *chunk)
>>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
>>> struct mlx4_icm_table *table,
>>>       obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>>       num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>>   -    table->icm      = kcalloc(num_icm, sizeof(*table->icm), 
>>> GFP_KERNEL);
>>> +    table->icm      = vzalloc(num_icm * sizeof(*table->icm));
>>
>> Why not kvzalloc ?
> 
> I think table->icm pointer array doesn't really need physically 
> contiguous memory. Sometimes high order
> memory allocation by kmalloc variants may trigger slow path and cause 
> tasks to be blocked.
> 

This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzalloc so 
we maintain a similar behavior when contiguous memory is available, and 
a fallback for resiliency.

> Thanks,
> Qing
> 
>>
>>>       if (!table->icm)
>>>           return -ENOMEM;
>>>       table->virt     = virt;
>>> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
>>> struct mlx4_icm_table *table,
>>>               mlx4_free_icm(dev, table->icm[i], use_coherent);
>>>           }
>>>   -    kfree(table->icm);
>>> +    vfree(table->icm);
>>>         return -ENOMEM;
>>>   }
>>> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, 
>>> struct mlx4_icm_table *table)
>>>               mlx4_free_icm(dev, table->icm[i], table->coherent);
>>>           }
>>>   -    kfree(table->icm);
>>> +    vfree(table->icm);
>>>   }
>>>
>>
>> Thanks for your patch.
>>
>> I need to verify there is no dramatic performance degradation here.
>> You can prepare and send a v3 in the meanwhile.
>>
>> Thanks,
>> Tariq
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH bpf-next v2 0/5] samples: bpf: fix build after move to full libbpf
From: Jesper Dangaard Brouer @ 2018-05-15  9:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jakub Kicinski, daniel, oss-drivers, netdev,
	Björn Töpel, Y Song, brouer
In-Reply-To: <20180515081009.4cd33c8e@redhat.com>

SOLVED, how see below

On Tue, 15 May 2018 08:10:09 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Mon, 14 May 2018 22:57:53 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Mon, May 14, 2018 at 10:35:01PM -0700, Jakub Kicinski wrote:  
> > > Hi!
> > > 
> > > Following patches address build issues after recent move to libbpf.
> > > For out-of-tree builds we would see the following error:
> > > 
> > > gcc: error: samples/bpf/../../tools/lib/bpf/libbpf.a: No such file or directory
> > > 
> > > libbpf build system is now always invoked explicitly rather than
> > > relying on building single objects most of the time.  We need to
> > > resolve the friction between Kbuild and tools/ build system.
> > > 
> > > Mini-library called libbpf.h in samples is renamed to bpf_insn.h,
> > > using linux/filter.h seems not completely trivial since some samples
> > > get upset when order on include search path in changed.  We do have
> > > to rename libbpf.h, however, because otherwise it's hard to reliably
> > > get to libbpf's header in out-of-tree builds.
> > > 
> > > v2:
> > >  - fix the build error harder (patch 3);
> > >  - add patch 5 (make clang less noisy).    
> > 
> > Applied, Thanks  
> 
> I just tried it out, but the build fails.
> 
> $ make samples/bpf/
>   CHK     include/config/kernel.release
>   CHK     include/generated/uapi/linux/version.h
>   CHK     include/generated/utsrelease.h
>   CHK     include/generated/bounds.h
>   CHK     include/generated/timeconst.h
>   CHK     include/generated/asm-offsets.h
>   CALL    scripts/checksyscalls.sh
>   DESCEND  objtool
>   CHK     scripts/mod/devicetable-offsets.h
> make -C /home/jbrouer/git/kernel/bpf-next/samples/bpf/../../tools/lib/bpf/ RM='rm -rf' LDFLAGS= srctree=/home/jbrouer/git/kernel/bpf-next/samples/bpf/../../ O=
> 
> Auto-detecting system features:
> ...                        libelf: [ OFF ]
> ...                           bpf: [ OFF ]
> 
> No libelf found
> make[2]: *** [Makefile:212: elfdep] Error 1
> make[1]: *** [samples/bpf/Makefile:207: /home/jbrouer/git/kernel/bpf-next/samples/bpf/../../tools/lib/bpf/libbpf.a] Error 2
> make: *** [Makefile:1743: samples/bpf/] Error 2
>  
> 
> But it might not be related to this change, as the problem seems to the
> with the "Auto-detecting system features".  It is the build of libbpf
> that is the problem.
> 
> $ cd tools/lib/bpf/
> $ make
> 
> Auto-detecting system features:
> ...                        libelf: [ OFF ]
> ...                           bpf: [ OFF ]
> 
> No libelf found
> make: *** [Makefile:212: elfdep] Error 1
> 
> 
> And do have 'libelf' installed on this system... the same build/make
> commands works on net-next.
> 
> On net-next I see:
> 
> $ cd tools/lib/bpf/
> $ make
> 
> Auto-detecting system features:
> ...                        libelf: [ on  ]
> ...                           bpf: [ on  ]
> 
> Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h'
> Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from latest version at 'include/uapi/linux/if_link.h'
>   CC       libbpf.o
>   CC       bpf.o
>   CC       nlattr.o
>   CC       btf.o
>   LD       libbpf-in.o
>   LINK     libbpf.a
>   LINK     libbpf.so

SOLVED

It seems that the "Auto-detecting system" needed a 'make clean'.
My problem goes away when I did the following:

$ cd tools/
$ make clean

$ cd lib/bpf/
$ make

Auto-detecting system features:
...                        libelf: [ on  ]
...                           bpf: [ on  ]

  HOSTCC   fixdep.o
  HOSTLD   fixdep-in.o
  LINK     fixdep
Warning: Kernel ABI header at 'tools/include/uapi/linux/if_link.h' differs from latest version at 'include/uapi/linux/if_link.h'
  CC       libbpf.o
  CC       bpf.o
  CC       nlattr.o
  CC       btf.o
  LD       libbpf-in.o
  LINK     libbpf.a
  LINK     libbpf.so


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 06/14] net: sched: implement reference counted action release
From: Vlad Buslov @ 2018-05-15  9:16 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jhs, xiyou.wangcong, pablo, kadlec, fw, ast,
	daniel, edumazet, keescook, linux-kernel, netfilter-devel,
	coreteam, kliteyn
In-Reply-To: <20180515090357.GK2134@nanopsycho.orion>


On Tue 15 May 2018 at 09:03, Jiri Pirko <jiri@resnulli.us> wrote:
> Mon, May 14, 2018 at 09:07:06PM CEST, vladbu@mellanox.com wrote:
>>
>>On Mon 14 May 2018 at 16:47, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Mon, May 14, 2018 at 04:27:07PM CEST, vladbu@mellanox.com wrote:
>>>
>>> [...]
>>>
>>>
>>>>+static int tcf_action_del_1(struct net *net, char *kind, u32 index,
>>>>+			    struct netlink_ext_ack *extack)
>>>>+{
>>>>+	const struct tc_action_ops *ops;
>>>>+	int err = -EINVAL;
>>>>+
>>>>+	ops = tc_lookup_action_n(kind);
>>>>+	if (!ops) {
>>>
>>> How this can happen? Apparently you hold reference to the module since
>>> you put it couple lines below. In that case, I don't really understand
>>> why you need to lookup and just don't use "ops" pointer you have in
>>> tcf_action_delete().
>>
>>The problem is that I cant really delete action while holding reference
>
> Wait a sec. I was talking about a "module reference" (module_put())

I misunderstood your question. Yes, I guess I can just save return value
of tcf_action_put to variable, continue using ops pointer and only
call module_put after delete.

>
>
>>to it. I can try to decrement reference twice, however that might result
>>race condition if another task tries to delete that action at the same
>>time.
>>
>>Imagine situation:
>>1. Action is in actions idr, refcount==1
>>2. Task tries to delete action, takes reference while working with it,
>>refcount==2
>>3. Another task tries to delete same action, takes reference,
>>refcount==3
>>4. First task decrements refcount twice, refcount==1
>>5. At the same time second task decrements refcount twice, refcount==-1
>>
>>My solution is to save action index, but release the reference. Then try
>>to lookup action again and delete it if it is still in idr. (was not
>>concurrently deleted)
>>
>>Now same potential race condition with this patch:
>>1. Action is in actions idr, refcount==1
>>2. Task tries to delete action, takes reference while working with it,
>>refcount==2
>>3. Another task tries to delete same action, takes reference,
>>refcount==3
>>4. First task releases reference and deletes actions from idr, which
>>results another refcount decrement, refcount==1
>>5. At the same time second task releases reference to action,
>>refcount==0, action is deleted
>>6. When task tries to lookup-delete action from idr by index, action is
>>not found. This case is handled correctly by code and no negative
>>overflow of refcount happens.
>>
>>>
>>>
>>>>+		NL_SET_ERR_MSG(extack, "Specified TC action not found");
>>>>+		goto err_out;
>>>>+	}
>>>>+
>>>>+	if (ops->delete)
>>>>+		err = ops->delete(net, index);
>>>>+
>>>>+	module_put(ops->owner);
>>>>+err_out:
>>>>+	return err;
>>>>+}
>>>>+
>>>> static int tca_action_flush(struct net *net, struct nlattr *nla,
>>>> 			    struct nlmsghdr *n, u32 portid,
>>>> 			    struct netlink_ext_ack *extack)
>>>>@@ -1052,6 +1088,36 @@ static int tca_action_flush(struct net *net, struct nlattr *nla,
>>>> 	return err;
>>>> }
>>>> 
>>>>+static int tcf_action_delete(struct net *net, struct list_head *actions,
>>>>+			     struct netlink_ext_ack *extack)
>>>>+{
>>>>+	int ret;
>>>>+	struct tc_action *a, *tmp;
>>>>+	char kind[IFNAMSIZ];
>>>>+	u32 act_index;
>>>>+
>>>>+	list_for_each_entry_safe(a, tmp, actions, list) {
>>>>+		const struct tc_action_ops *ops = a->ops;
>>>>+
>>>>+		/* Actions can be deleted concurrently
>>>>+		 * so we must save their type and id to search again
>>>>+		 * after reference is released.
>>>>+		 */
>>>>+		strncpy(kind, a->ops->kind, sizeof(kind) - 1);
>>>>+		act_index = a->tcfa_index;
>>>>+
>>>>+		list_del(&a->list);
>>>>+		if (tcf_action_put(a))
>>>>+			module_put(ops->owner);
>>>>+
>>>>+		/* now do the delete */
>>>>+		ret = tcf_action_del_1(net, kind, act_index, extack);
>>>>+		if (ret < 0)
>>>>+			return ret;
>>>>+	}
>>>>+	return 0;
>>>>+}
>>>>+
>>>
>>> [...]
>>

^ permalink raw reply

* Re: [PATCH 06/14] net: sched: implement reference counted action release
From: Jiri Pirko @ 2018-05-15  9:03 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: netdev, davem, jhs, xiyou.wangcong, pablo, kadlec, fw, ast,
	daniel, edumazet, keescook, linux-kernel, netfilter-devel,
	coreteam, kliteyn
In-Reply-To: <vbfsh6udpk5.fsf@reg-r-vrt-018-180.mtr.labs.mlnx>

Mon, May 14, 2018 at 09:07:06PM CEST, vladbu@mellanox.com wrote:
>
>On Mon 14 May 2018 at 16:47, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, May 14, 2018 at 04:27:07PM CEST, vladbu@mellanox.com wrote:
>>
>> [...]
>>
>>
>>>+static int tcf_action_del_1(struct net *net, char *kind, u32 index,
>>>+			    struct netlink_ext_ack *extack)
>>>+{
>>>+	const struct tc_action_ops *ops;
>>>+	int err = -EINVAL;
>>>+
>>>+	ops = tc_lookup_action_n(kind);
>>>+	if (!ops) {
>>
>> How this can happen? Apparently you hold reference to the module since
>> you put it couple lines below. In that case, I don't really understand
>> why you need to lookup and just don't use "ops" pointer you have in
>> tcf_action_delete().
>
>The problem is that I cant really delete action while holding reference

Wait a sec. I was talking about a "module reference" (module_put())


>to it. I can try to decrement reference twice, however that might result
>race condition if another task tries to delete that action at the same
>time.
>
>Imagine situation:
>1. Action is in actions idr, refcount==1
>2. Task tries to delete action, takes reference while working with it,
>refcount==2
>3. Another task tries to delete same action, takes reference,
>refcount==3
>4. First task decrements refcount twice, refcount==1
>5. At the same time second task decrements refcount twice, refcount==-1
>
>My solution is to save action index, but release the reference. Then try
>to lookup action again and delete it if it is still in idr. (was not
>concurrently deleted)
>
>Now same potential race condition with this patch:
>1. Action is in actions idr, refcount==1
>2. Task tries to delete action, takes reference while working with it,
>refcount==2
>3. Another task tries to delete same action, takes reference,
>refcount==3
>4. First task releases reference and deletes actions from idr, which
>results another refcount decrement, refcount==1
>5. At the same time second task releases reference to action,
>refcount==0, action is deleted
>6. When task tries to lookup-delete action from idr by index, action is
>not found. This case is handled correctly by code and no negative
>overflow of refcount happens.
>
>>
>>
>>>+		NL_SET_ERR_MSG(extack, "Specified TC action not found");
>>>+		goto err_out;
>>>+	}
>>>+
>>>+	if (ops->delete)
>>>+		err = ops->delete(net, index);
>>>+
>>>+	module_put(ops->owner);
>>>+err_out:
>>>+	return err;
>>>+}
>>>+
>>> static int tca_action_flush(struct net *net, struct nlattr *nla,
>>> 			    struct nlmsghdr *n, u32 portid,
>>> 			    struct netlink_ext_ack *extack)
>>>@@ -1052,6 +1088,36 @@ static int tca_action_flush(struct net *net, struct nlattr *nla,
>>> 	return err;
>>> }
>>> 
>>>+static int tcf_action_delete(struct net *net, struct list_head *actions,
>>>+			     struct netlink_ext_ack *extack)
>>>+{
>>>+	int ret;
>>>+	struct tc_action *a, *tmp;
>>>+	char kind[IFNAMSIZ];
>>>+	u32 act_index;
>>>+
>>>+	list_for_each_entry_safe(a, tmp, actions, list) {
>>>+		const struct tc_action_ops *ops = a->ops;
>>>+
>>>+		/* Actions can be deleted concurrently
>>>+		 * so we must save their type and id to search again
>>>+		 * after reference is released.
>>>+		 */
>>>+		strncpy(kind, a->ops->kind, sizeof(kind) - 1);
>>>+		act_index = a->tcfa_index;
>>>+
>>>+		list_del(&a->list);
>>>+		if (tcf_action_put(a))
>>>+			module_put(ops->owner);
>>>+
>>>+		/* now do the delete */
>>>+		ret = tcf_action_del_1(net, kind, act_index, extack);
>>>+		if (ret < 0)
>>>+			return ret;
>>>+	}
>>>+	return 0;
>>>+}
>>>+
>>
>> [...]
>

^ permalink raw reply

* Re: [PATCH 05/14] net: sched: always take reference to action
From: Jiri Pirko @ 2018-05-15  8:58 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: netdev, davem, jhs, xiyou.wangcong, pablo, kadlec, fw, ast,
	daniel, edumazet, keescook, linux-kernel, netfilter-devel,
	coreteam, kliteyn
In-Reply-To: <vbftvradqe4.fsf@reg-r-vrt-018-180.mtr.labs.mlnx>

Mon, May 14, 2018 at 08:49:07PM CEST, vladbu@mellanox.com wrote:
>
>On Mon 14 May 2018 at 16:23, Jiri Pirko <jiri@resnulli.us> wrote:
>> Mon, May 14, 2018 at 04:27:06PM CEST, vladbu@mellanox.com wrote:
>>>Without rtnl lock protection it is no longer safe to use pointer to tc
>>>action without holding reference to it. (it can be destroyed concurrently)
>>>
>>>Remove unsafe action idr lookup function. Instead of it, implement safe tcf
>>>idr check function that atomically looks up action in idr and increments
>>>its reference and bind counters.
>>>
>>>Implement both action search and check using new safe function.
>>>
>>>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>>>---
>>> net/sched/act_api.c | 38 ++++++++++++++++----------------------
>>> 1 file changed, 16 insertions(+), 22 deletions(-)
>>>
>>>diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>>>index 1331beb..9459cce 100644
>>>--- a/net/sched/act_api.c
>>>+++ b/net/sched/act_api.c
>>>@@ -284,44 +284,38 @@ int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
>>> }
>>> EXPORT_SYMBOL(tcf_generic_walker);
>>> 
>>>-static struct tc_action *tcf_idr_lookup(u32 index, struct tcf_idrinfo *idrinfo)
>>>+bool __tcf_idr_check(struct tc_action_net *tn, u32 index, struct tc_action **a,
>>>+		     int bind)
>>> {
>>>-	struct tc_action *p = NULL;
>>>+	struct tcf_idrinfo *idrinfo = tn->idrinfo;
>>>+	struct tc_action *p;
>>> 
>>> 	spin_lock_bh(&idrinfo->lock);
>>
>> Why "_bh" variant is necessary here?
>
>It is not my code.

Yeah, yet still I wonder :)

^ permalink raw reply

* Re: [PATCH v1 iproute2-next 2/3] rdma: print driver resource attributes
From: Leon Romanovsky @ 2018-05-15  8:54 UTC (permalink / raw)
  To: Steve Wise; +Cc: Jason Gunthorpe, dsahern, stephen, netdev, linux-rdma
In-Reply-To: <a8493ed9-f16c-eb71-55db-f27bd1331521@opengridcomputing.com>

[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

On Mon, May 14, 2018 at 05:04:26PM -0500, Steve Wise wrote:
>
>
> On 5/14/2018 3:41 PM, Jason Gunthorpe wrote:
> > On Mon, May 07, 2018 at 08:53:16AM -0700, Steve Wise wrote:
> >> This enhancement allows printing rdma device-specific state, if provided
> >> by the kernel.  This is done in a generic manner, so rdma tool doesn't
> >> need to know about the details of every type of rdma device.
> >>
> >> Driver attributes for a rdma resource are in the form of <key,
> >> [print_type], value> tuples, where the key is a string and the value can
> >> be any supported driver attribute.  The print_type attribute, if present,
> >> provides a print format to use vs the standard print format for the type.
> >> For example, the default print type for a PROVIDER_S32 value is "%d ",
> >> but "0x%x " if the print_type of PRINT_TYPE_HEX is included inthe tuple.
> >>
> >> Driver resources are only printed when the -dd flag is present.
> >> If -p is present, then the output is formatted to not exceed 80 columns,
> >> otherwise it is printed as a single row to be grep/awk friendly.
> >>
> >> Example output:
> >>
> >> # rdma resource show qp lqpn 1028 -dd -p
> >> link cxgb4_0/- lqpn 1028 rqpn 0 type RC state RTS rq-psn 0 sq-psn 0 path-mig-state MIGRATED pid 0 comm [nvme_rdma]
> >>     sqid 1028 flushed 0 memsize 123968 cidx 85 pidx 85 wq_pidx 106 flush_cidx 85 in_use 0
> >>     size 386 flags 0x0 rqid 1029 memsize 16768 cidx 43 pidx 41 wq_pidx 171 msn 44 rqt_hwaddr 0x2a8a5d00
> >>     rqt_size 256 in_use 128 size 130 idx 43 wr_id 0xffff881057c03408 idx 40 wr_id 0xffff881057c033f0
> > Hey some of these look like kernel pointers.. That is a no-no.. What
> > is up there?
>
> Nothing is defined as a kernel pointer.  But wr_id is often a pointer to
> the kernel rdma application's context...
>
> > The wr_id often contains a pointer, right? So we cannot just pass it
> > to user space..
>
> Hmm.  It is useful for debugging kernel rdma applications.  Perhaps
> these attrs can be only be sent up by the kernel if the capabilities
> allow.  But previous review comments of the kernel series, which is now
> merged, forced me to remove passing the capabilities information to the
> driver resource fill functions. 
>
> So what's the right way to do this?

The reviewer asked do not pass to drivers whole CAP_.. bits, because
they anyway don't need such granularity.

>
> Steve.
>
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* [PATCH net-next] sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers
From: Paolo Abeni @ 2018-05-15  8:50 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	John Fastabend

Currently NOLOCK qdiscs pay a measurable overhead to atomically
manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per
packet in the uncontended scenario with packet rate below the
line rate: on packed dequeue and on the next, failing dequeue attempt.

This changeset moves the bit manipulation into the qdisc_run_{begin,end}
helpers, so that the bit is now flipped only once per packet, with
measurable performance improvement in the uncontended scenario.

This also allows simplifying the qdisc teardown code path - since
qdisc_is_running() is now effective for each qdisc type - and avoid a
possible race between qdisc_run() and dev_deactivate_many(), as now
the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy
dequeuing packets.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sch_generic.h | 10 +++++++++-
 net/core/dev.c            |  2 +-
 net/sched/sch_generic.c   | 31 +++++++++----------------------
 3 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 5154c8300262..4d2b37226e75 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -113,13 +113,19 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 
 static inline bool qdisc_is_running(const struct Qdisc *qdisc)
 {
+	if (qdisc->flags & TCQ_F_NOLOCK)
+		return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
 	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
 }
 
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
-	if (qdisc_is_running(qdisc))
+	if (qdisc->flags & TCQ_F_NOLOCK) {
+		if (test_and_set_bit(__QDISC_STATE_RUNNING, &qdisc->state))
+			return false;
+	} else if (qdisc_is_running(qdisc)) {
 		return false;
+	}
 	/* Variant of write_seqcount_begin() telling lockdep a trylock
 	 * was attempted.
 	 */
@@ -131,6 +137,8 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
+	if (qdisc->flags & TCQ_F_NOLOCK)
+		clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/core/dev.c b/net/core/dev.c
index 9f4390182384..7ca19f47a92a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3244,7 +3244,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 			rc = NET_XMIT_DROP;
 		} else {
 			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-			__qdisc_run(q);
+			qdisc_run(q);
 		}
 
 		if (unlikely(to_free))
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 39c144b6ff98..ff3ce71aec93 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -373,33 +373,24 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  */
 static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 {
-	bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
 	spinlock_t *root_lock = NULL;
 	struct netdev_queue *txq;
 	struct net_device *dev;
 	struct sk_buff *skb;
+	bool validate;
 
 	/* Dequeue packet */
-	if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
-		return false;
-
 	skb = dequeue_skb(q, &validate, packets);
-	if (unlikely(!skb)) {
-		if (nolock)
-			clear_bit(__QDISC_STATE_RUNNING, &q->state);
+	if (unlikely(!skb))
 		return false;
-	}
 
-	if (!nolock)
+	if (!(q->flags & TCQ_F_NOLOCK))
 		root_lock = qdisc_lock(q);
 
 	dev = qdisc_dev(q);
 	txq = skb_get_tx_queue(dev, skb);
 
-	more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
-	if (nolock)
-		clear_bit(__QDISC_STATE_RUNNING, &q->state);
-	return more;
+	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
 }
 
 void __qdisc_run(struct Qdisc *q)
@@ -1131,17 +1122,13 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 		dev_queue = netdev_get_tx_queue(dev, i);
 		q = dev_queue->qdisc_sleeping;
 
-		if (q->flags & TCQ_F_NOLOCK) {
-			val = test_bit(__QDISC_STATE_SCHED, &q->state);
-		} else {
-			root_lock = qdisc_lock(q);
-			spin_lock_bh(root_lock);
+		root_lock = qdisc_lock(q);
+		spin_lock_bh(root_lock);
 
-			val = (qdisc_is_running(q) ||
-			       test_bit(__QDISC_STATE_SCHED, &q->state));
+		val = (qdisc_is_running(q) ||
+		       test_bit(__QDISC_STATE_SCHED, &q->state));
 
-			spin_unlock_bh(root_lock);
-		}
+		spin_unlock_bh(root_lock);
 
 		if (val)
 			return true;
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH] dt-bindings: net: ravb: Add support for r8a77990 SoC
From: Sergei Shtylyov @ 2018-05-15  8:43 UTC (permalink / raw)
  To: Simon Horman, David Miller
  Cc: yoshihiro.shimoda.uh, netdev, linux-renesas-soc, robh+dt,
	mark.rutland, devicetree
In-Reply-To: <20180513075803.plrkbwmzksjnz45u@verge.net.au>

On 5/13/2018 10:58 AM, Simon Horman wrote:

>>> Add documentation for r8a77990 compatible string to renesas ravb device
>>> tree bindings documentation.
>>>
>>> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
>>
>> I'm assuming this isn't targetted at one of my trees.  Just FYI.
> 
> Hi Dave,
> 
> I think this is appropriate for net-next but if not I can take it.

    There's also Rob, and IIRC he has taken alike patch recently...

[...]

MBR, Sergei

^ permalink raw reply

* [RFC PATCH] net: qualcomm: rmnet: rmnet_ethtool_ops can be static
From: kbuild test robot @ 2018-05-15  8:41 UTC (permalink / raw)
  To: Subash Abhinov Kasiviswanathan
  Cc: kbuild-all, davem, netdev, Subash Abhinov Kasiviswanathan
In-Reply-To: <1526328527-20026-3-git-send-email-subashab@codeaurora.org>


Fixes: f85908d4636e ("net: qualcomm: rmnet: Add support for ethtool private stats")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 rmnet_vnd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 4c04482..cb02e1a 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -196,7 +196,7 @@ static void rmnet_get_ethtool_stats(struct net_device *dev,
 	memcpy(data, st, ARRAY_SIZE(rmnet_gstrings_stats) * sizeof(u64));
 }
 
-const struct ethtool_ops rmnet_ethtool_ops = {
+static const struct ethtool_ops rmnet_ethtool_ops = {
 	.get_ethtool_stats = rmnet_get_ethtool_stats,
 	.get_strings = rmnet_get_strings,
 	.get_sset_count = rmnet_get_sset_count,

^ permalink raw reply related

* Re: [PATCH net-next 2/3] net: qualcomm: rmnet: Add support for ethtool private stats
From: kbuild test robot @ 2018-05-15  8:41 UTC (permalink / raw)
  To: Subash Abhinov Kasiviswanathan
  Cc: kbuild-all, davem, netdev, Subash Abhinov Kasiviswanathan
In-Reply-To: <1526328527-20026-3-git-send-email-subashab@codeaurora.org>

Hi Subash,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Subash-Abhinov-Kasiviswanathan/net-qualcomm-rmnet-Updates-2018-05-14/20180515-115355
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c:199:26: sparse: symbol 'rmnet_ethtool_ops' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [RFC PATCH 5/5] net: macb: Add WOL support with ARP
From: Claudiu Beznea @ 2018-05-15  8:39 UTC (permalink / raw)
  To: Harini Katakam
  Cc: Nicolas Ferre, David Miller, netdev, linux-kernel, michals,
	appanad
In-Reply-To: <CAFcVECKA6J-c4KYPZS05Ooj4x5+AARcxCdJfV2uHOHNOVXZaOQ@mail.gmail.com>

Hi Harini,

On 10.05.2018 13:37, Harini Katakam wrote:
> Hi Claudiu,
> 
> On Fri, May 4, 2018 at 5:47 PM, Claudiu Beznea
> <Claudiu.Beznea@microchip.com> wrote:
>>
>>
>> On 22.03.2018 15:51, harinikatakamlinux@gmail.com wrote:
>>> From: Harini Katakam <harinik@xilinx.com>
>>>
>>> This patch enables ARP wake event support in GEM through the following:
>>>
>>> -> WOL capability can be selected based on the SoC/GEM IP version rather
>>> than a devictree property alone. Hence add a new capability property and
>>> set device as "wakeup capable" in probe in this case.
>>> -> Wake source selection can be done via ethtool or by enabling wakeup
>>> in /sys/devices/platform/..../ethx/power/
>>> This patch adds default wake source as ARP and the existing selection of
>>> WOL using magic packet remains unchanged.
>>> -> When GEM is the wake device with ARP as the wake event, the current
>>> IP address to match is written to WOL register along with other
>>> necessary confuguration required for MAC to recognize an ARP event.
>>> -> While RX needs to remain enabled, there is no need to process the
>>> actual wake packet - hence tie off all RX queues to avoid unnecessary
>>> processing by DMA in the background.
>>
>> Why is this different for magic packet vs ARP packet?
> 
> This should ideally be the same whether it is a magic packet or ARP on
> the version of the IP we use (more details in my comment below).
> I simply did not alter the magic packet code for now to avoid breaking
> others' flow.

I see your point. But in the end the code should be nice and clean.

>
> <snip>
>>> +#define MACB_CAPS_WOL                                0x00000080
>>
>> I think would be better to have this as part of bp->wol and use it properly
>> in suspend/resume hooks.
> 
> I think a capability flag as part of config structure is better
> because this is clearly an SoC related feature and there is no need
> to have a devicetree property.

Since there is no difference b/w ARP and magic packet in term of SoC,
I mean, there is no special treatment b/w them, I think we should treat
them both in the same way. This was my point.

> 
> <snip>
>> Wouldn't it work if you will change it in something like this:
>>
>>         u32 wolmask, arpipmask = 0;
>>
>>         if (bp->wol & MACB_WOL_ENABLED) {
>>                 macb_writel(bp, IER, MACB_BIT(WOL));
>>
>>                 if (bp->wol & MACB_WOL_HAS_ARP_PACKET) {
>>                         /* Enable broadcast. */
>>                         gem_writel(bp, NCFGR, gem_readl(bp, NCFGR) & ~MACB_BIT(NBC));
>>                         arpipmask = cpu_to_be32p(&bp->dev->ip_ptr->ifa_list->ifa_local) & 0xFFFF;
>>                         wolmask = arpipmask | MACB_BIT(ARP);
>>                 } else {
>>                         wolmask = MACB_BIT(MAG);
>>                 }
>>
>>                 macb_writel(bp, WOL, wolmask);
>>                 enable_irq_wake(bp->queues[0].irq);
> 
> The above would work. But I'd still have to add the RX BD changes
> and then stop the phy, disable interrupt etc., for most optimal power
> down state - the idea is to keep only is essential to detect a wake event.
> 

Also, with your approach the suspend/resume time will be longer.

>>                 netif_device_detach(netdev);
>>         }
>>
>> I cannot find anything particular for ARP WOL events in datasheet. Also,
>> I cannot find something related to DMA activity while WOL is active
> 
> Can you please let me know which version you are referring to?
> ZynqMP uses the IP version r1p06 or 07.

I have user guide revision 15
.

> 
> There is a clear set of rules for ARP wake event to be recognized.
> 
> About the DMA activity, it is not explicitly mentioned but we have
> observed that the DMA will continue to process incoming packets
> and try to write them to the memory and Cadence has confirmed
> the same. Later versions of the IP may have some provision to
> stop DMA activity on RX channel but unfortunately in this version,
> using a dummy RX buffer descriptor is the only way.>
> I'm looking into your other suggestions.
> Thanks, will get back to you.
> 
> Regards,
> Harini
> 

^ permalink raw reply

* Re: [PATCH 00/14] Modify action API for implementing lockless actions
From: Jiri Pirko @ 2018-05-15  8:20 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Vlad Buslov, netdev, davem, xiyou.wangcong, pablo, kadlec, fw,
	ast, daniel, edumazet, keescook, linux-kernel, netfilter-devel,
	coreteam, kliteyn
In-Reply-To: <23f37e5a-dd30-0ad3-a5ab-df22bf0ad7f9@mojatatu.com>

Mon, May 14, 2018 at 08:03:20PM CEST, jhs@mojatatu.com wrote:
>On 14/05/18 10:27 AM, Vlad Buslov wrote:
>> Currently, all netlink protocol handlers for updating rules, actions and
>> qdiscs are protected with single global rtnl lock which removes any
>> possibility for parallelism. This patch set is a first step to remove
>> rtnl lock dependency from TC rules update path. It updates act API to
>> use atomic operations, rcu and spinlocks for fine-grained locking. It
>> also extend API with functions that are needed to update existing
>> actions for parallel execution.
>> 
>> Outline of changes:
>> - Change tc action to use atomic reference and bind counters, rcu
>>    mechanism for cookie update.
>> - Extend action ops API with 'delete' function and 'unlocked' flag.
>> - Change action API to work with actions in lockless manner based on
>>    primitives implemented in previous patches.
>> - Extend action API with new functions necessary to implement unlocked
>>    actions.
>
>Please run all the tdc tests with these changes. This area has almost
>good test coverage at this point. If you need help just ping me.

Oh, that reminds me. Vlad, please run also:
tools/testing/selftests/net/forwarding/tc_*

Thanks!

^ permalink raw reply

* [RESEND PATCH v2 1/1] net: phy: micrel: add 125MHz reference clock workaround
From: Marco Felsch @ 2018-05-15  8:18 UTC (permalink / raw)
  To: robh+dt, mark.rutland, andrew, f.fainelli
  Cc: netdev, devicetree, kernel, niebelm
In-Reply-To: <20180515081856.23322-1-m.felsch@pengutronix.de>

From: Markus Niebel <Markus.Niebel@tqs.de>

The micrel KSZ9031 phy has a optional clock pin (CLK125_NDO) which can be
used as reference clock for the MAC unit. The clock signal must meet the
RGMII requirements to ensure the correct data transmission between the
MAC and the PHY. The KSZ9031 phy does not fulfill the duty cycle
requirement if the phy is configured as slave. For a complete
describtion look at the errata sheets: DS80000691D or DS80000692D.

The errata sheet recommends to force the phy into master mode whenever
there is a 1000Base-T link-up as work around. Only set the
"micrel,force-master" property if you use the phy reference clock provided
by CLK125_NDO pin as MAC reference clock in your application.

Attenation, this workaround is only usable if the link partner can
be configured to slave mode for 1000Base-T.

Signed-off-by: Markus Niebel <Markus.Niebel@tqs.de>
[m.felsch@pengutronix.de: fix dt-binding documentation]
[m.felsch@pengutronix.de: use already existing result var for read/write]
[m.felsch@pengutronix.de: add error handling]
[m.felsch@pengutronix.de: add more comments]
Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
---
 .../bindings/net/micrel-ksz90x1.txt           |  7 +++++
 drivers/net/phy/micrel.c                      | 31 +++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
index 42a248301615..e22d8cfea687 100644
--- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
+++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
@@ -57,6 +57,13 @@ KSZ9031:
       - txd2-skew-ps : Skew control of TX data 2 pad
       - txd3-skew-ps : Skew control of TX data 3 pad
 
+    - micrel,force-master:
+        Boolean, force phy to master mode. Only set this option if the phy
+        reference clock provided at CLK125_NDO pin is used as MAC reference
+        clock because the clock jitter in slave mode is to high (errata#2).
+        Attention: The link partner must be configurable as slave otherwise
+        no link will be established.
+
 Examples:
 
 	mdio {
diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index f41b224a9cdb..ab195f0916d6 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -573,9 +573,40 @@ static int ksz9031_config_init(struct phy_device *phydev)
 		ksz9031_of_load_skew_values(phydev, of_node,
 				MII_KSZ9031RN_TX_DATA_PAD_SKEW, 4,
 				tx_data_skews, 4);
+
+		/* Silicon Errata Sheet (DS80000691D or DS80000692D):
+		 * When the device links in the 1000BASE-T slave mode only,
+		 * the optional 125MHz reference output clock (CLK125_NDO)
+		 * has wide duty cycle variation.
+		 *
+		 * The optional CLK125_NDO clock does not meet the RGMII
+		 * 45/55 percent (min/max) duty cycle requirement and therefore
+		 * cannot be used directly by the MAC side for clocking
+		 * applications that have setup/hold time requirements on
+		 * rising and falling clock edges.
+		 *
+		 * Workaround:
+		 * Force the phy to be the master to receive a stable clock
+		 * which meets the duty cycle requirement.
+		 */
+		if (of_property_read_bool(of_node, "micrel,force-master")) {
+			result = phy_read(phydev, MII_CTRL1000);
+			if (result < 0)
+				goto err_force_master;
+
+			/* enable master mode, config & prefer master */
+			result |= CTL1000_ENABLE_MASTER | CTL1000_AS_MASTER;
+			result = phy_write(phydev, MII_CTRL1000, result);
+			if (result < 0)
+				goto err_force_master;
+		}
 	}
 
 	return ksz9031_center_flp_timing(phydev);
+
+err_force_master:
+	phydev_err(phydev, "failed to force the phy to master mode\n");
+	return result;
 }
 
 #define KSZ8873MLL_GLOBAL_CONTROL_4	0x06
-- 
2.17.0

^ permalink raw reply related

* [RESEND PATCH v2 0/1] net: phy: micrel: add 125MHz reference clock workaround
From: Marco Felsch @ 2018-05-15  8:18 UTC (permalink / raw)
  To: robh+dt, mark.rutland, andrew, f.fainelli
  Cc: netdev, devicetree, kernel, niebelm

This patch adds the workaround for the errata#2. The KSZ9031 phy
125MHz reference clock jitter is to high if the phy is configured as
slave. This will cause errors if a application uses the clock as
MAC reference clock. The workaround is to configure the phy as master.

v2:
- add more code and dt-binding comments
- delete unnecessary local variables
- add error handling

v1:
- initial commit

Markus Niebel (1):
  net: phy: micrel: add 125MHz reference clock workaround

 .../bindings/net/micrel-ksz90x1.txt           |  7 ++++
 drivers/net/phy/micrel.c                      | 32 +++++++++++++++++++
 2 files changed, 39 insertions(+)

-- 
2.17.0

^ permalink raw reply

* Re: [PATCH bpf-next] selftests/bpf: make sure build-id is on
From: Daniel Borkmann @ 2018-05-15  8:10 UTC (permalink / raw)
  To: Alexei Starovoitov, David S . Miller; +Cc: songliubraving, netdev
In-Reply-To: <20180515001129.3557608-1-ast@kernel.org>

On 05/15/2018 02:11 AM, Alexei Starovoitov wrote:
> --build-id may not be a default linker config.
> Make sure it's used when linking urandom_read test program.
> Otherwise test_stacktrace_build_id[_nmi] tests will be failling.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Applied to bpf-next, thanks Alexei!

^ permalink raw reply

* Re: [PATCH v2 0/1] net: phy: micrel: add 125MHz reference clock workaround
From: Marco Felsch @ 2018-05-15  8:06 UTC (permalink / raw)
  To: robh+dt, mark.rutland, andrew, f.fainelli
  Cc: netdev, niebelm, kernel, devicetree
In-Reply-To: <20180515074527.22546-1-m.felsch@pengutronix.de>

I forgot to set the thread option. I will resend with thread option 
enabled.

On 05/15/2018 09:45 AM, Marco Felsch wrote:

> This patch adds the workaround for the errata#2. The KSZ9031 phy
> 125MHz reference clock jitter is to high if the phy is configured as
> slave. This will cause errors if a application uses the clock as
> MAC reference clock. The workaround is to configure the phy as master.
>
> v2:
> - add more code and dt-binding comments
> - delete unnecessary local variables
> - add error handling
>
> v1:
> - initial commit
>
> Markus Niebel (1):
>    net: phy: micrel: add 125MHz reference clock workaround
>
>   .../bindings/net/micrel-ksz90x1.txt           |  7 ++++
>   drivers/net/phy/micrel.c                      | 32 +++++++++++++++++++
>   2 files changed, 39 insertions(+)
>
>

^ permalink raw reply

* [PATCH v2 1/1] net: phy: micrel: add 125MHz reference clock workaround
From: Marco Felsch @ 2018-05-15  7:47 UTC (permalink / raw)
  To: robh+dt, mark.rutland, andrew, f.fainelli
  Cc: netdev, devicetree, kernel, niebelm

From: Markus Niebel <Markus.Niebel@tqs.de>

The micrel KSZ9031 phy has a optional clock pin (CLK125_NDO) which can be
used as reference clock for the MAC unit. The clock signal must meet the
RGMII requirements to ensure the correct data transmission between the
MAC and the PHY. The KSZ9031 phy does not fulfill the duty cycle
requirement if the phy is configured as slave. For a complete
describtion look at the errata sheets: DS80000691D or DS80000692D.

The errata sheet recommends to force the phy into master mode whenever
there is a 1000Base-T link-up as work around. Only set the
"micrel,force-master" property if you use the phy reference clock provided
by CLK125_NDO pin as MAC reference clock in your application.

Attenation, this workaround is only usable if the link partner can
be configured to slave mode for 1000Base-T.

Signed-off-by: Markus Niebel <Markus.Niebel@tqs.de>
[m.felsch@pengutronix.de: fix dt-binding documentation]
[m.felsch@pengutronix.de: use already existing result var for read/write]
[m.felsch@pengutronix.de: add error handling]
[m.felsch@pengutronix.de: add more comments]
Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
---
 .../bindings/net/micrel-ksz90x1.txt           |  7 +++++
 drivers/net/phy/micrel.c                      | 31 +++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
index 42a248301615..e22d8cfea687 100644
--- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
+++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
@@ -57,6 +57,13 @@ KSZ9031:
       - txd2-skew-ps : Skew control of TX data 2 pad
       - txd3-skew-ps : Skew control of TX data 3 pad
 
+    - micrel,force-master:
+        Boolean, force phy to master mode. Only set this option if the phy
+        reference clock provided at CLK125_NDO pin is used as MAC reference
+        clock because the clock jitter in slave mode is to high (errata#2).
+        Attention: The link partner must be configurable as slave otherwise
+        no link will be established.
+
 Examples:
 
 	mdio {
diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index f41b224a9cdb..ab195f0916d6 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -573,9 +573,40 @@ static int ksz9031_config_init(struct phy_device *phydev)
 		ksz9031_of_load_skew_values(phydev, of_node,
 				MII_KSZ9031RN_TX_DATA_PAD_SKEW, 4,
 				tx_data_skews, 4);
+
+		/* Silicon Errata Sheet (DS80000691D or DS80000692D):
+		 * When the device links in the 1000BASE-T slave mode only,
+		 * the optional 125MHz reference output clock (CLK125_NDO)
+		 * has wide duty cycle variation.
+		 *
+		 * The optional CLK125_NDO clock does not meet the RGMII
+		 * 45/55 percent (min/max) duty cycle requirement and therefore
+		 * cannot be used directly by the MAC side for clocking
+		 * applications that have setup/hold time requirements on
+		 * rising and falling clock edges.
+		 *
+		 * Workaround:
+		 * Force the phy to be the master to receive a stable clock
+		 * which meets the duty cycle requirement.
+		 */
+		if (of_property_read_bool(of_node, "micrel,force-master")) {
+			result = phy_read(phydev, MII_CTRL1000);
+			if (result < 0)
+				goto err_force_master;
+
+			/* enable master mode, config & prefer master */
+			result |= CTL1000_ENABLE_MASTER | CTL1000_AS_MASTER;
+			result = phy_write(phydev, MII_CTRL1000, result);
+			if (result < 0)
+				goto err_force_master;
+		}
 	}
 
 	return ksz9031_center_flp_timing(phydev);
+
+err_force_master:
+	phydev_err(phydev, "failed to force the phy to master mode\n");
+	return result;
 }
 
 #define KSZ8873MLL_GLOBAL_CONTROL_4	0x06
-- 
2.17.0

^ permalink raw reply related

* [PATCH v2 1/1] net: phy: micrel: add 125MHz reference clock workaround
From: Marco Felsch @ 2018-05-15  7:45 UTC (permalink / raw)
  To: robh+dt, mark.rutland, andrew, f.fainelli
  Cc: netdev, devicetree, kernel, niebelm

From: Markus Niebel <Markus.Niebel@tqs.de>

The micrel KSZ9031 phy has a optional clock pin (CLK125_NDO) which can be
used as reference clock for the MAC unit. The clock signal must meet the
RGMII requirements to ensure the correct data transmission between the
MAC and the PHY. The KSZ9031 phy does not fulfill the duty cycle
requirement if the phy is configured as slave. For a complete
describtion look at the errata sheets: DS80000691D or DS80000692D.

The errata sheet recommends to force the phy into master mode whenever
there is a 1000Base-T link-up as work around. Only set the
"micrel,force-master" property if you use the phy reference clock provided
by CLK125_NDO pin as MAC reference clock in your application.

Attenation, this workaround is only usable if the link partner can
be configured to slave mode for 1000Base-T.

Signed-off-by: Markus Niebel <Markus.Niebel@tqs.de>
[m.felsch@pengutronix.de: fix dt-binding documentation]
[m.felsch@pengutronix.de: use already existing result var for read/write]
[m.felsch@pengutronix.de: add error handling]
[m.felsch@pengutronix.de: add more comments]
Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
---
 .../bindings/net/micrel-ksz90x1.txt           |  7 +++++
 drivers/net/phy/micrel.c                      | 31 +++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
index 42a248301615..e22d8cfea687 100644
--- a/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
+++ b/Documentation/devicetree/bindings/net/micrel-ksz90x1.txt
@@ -57,6 +57,13 @@ KSZ9031:
       - txd2-skew-ps : Skew control of TX data 2 pad
       - txd3-skew-ps : Skew control of TX data 3 pad
 
+    - micrel,force-master:
+        Boolean, force phy to master mode. Only set this option if the phy
+        reference clock provided at CLK125_NDO pin is used as MAC reference
+        clock because the clock jitter in slave mode is to high (errata#2).
+        Attention: The link partner must be configurable as slave otherwise
+        no link will be established.
+
 Examples:
 
 	mdio {
diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index f41b224a9cdb..ab195f0916d6 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -573,9 +573,40 @@ static int ksz9031_config_init(struct phy_device *phydev)
 		ksz9031_of_load_skew_values(phydev, of_node,
 				MII_KSZ9031RN_TX_DATA_PAD_SKEW, 4,
 				tx_data_skews, 4);
+
+		/* Silicon Errata Sheet (DS80000691D or DS80000692D):
+		 * When the device links in the 1000BASE-T slave mode only,
+		 * the optional 125MHz reference output clock (CLK125_NDO)
+		 * has wide duty cycle variation.
+		 *
+		 * The optional CLK125_NDO clock does not meet the RGMII
+		 * 45/55 percent (min/max) duty cycle requirement and therefore
+		 * cannot be used directly by the MAC side for clocking
+		 * applications that have setup/hold time requirements on
+		 * rising and falling clock edges.
+		 *
+		 * Workaround:
+		 * Force the phy to be the master to receive a stable clock
+		 * which meets the duty cycle requirement.
+		 */
+		if (of_property_read_bool(of_node, "micrel,force-master")) {
+			result = phy_read(phydev, MII_CTRL1000);
+			if (result < 0)
+				goto err_force_master;
+
+			/* enable master mode, config & prefer master */
+			result |= CTL1000_ENABLE_MASTER | CTL1000_AS_MASTER;
+			result = phy_write(phydev, MII_CTRL1000, result);
+			if (result < 0)
+				goto err_force_master;
+		}
 	}
 
 	return ksz9031_center_flp_timing(phydev);
+
+err_force_master:
+	phydev_err(phydev, "failed to force the phy to master mode\n");
+	return result;
 }
 
 #define KSZ8873MLL_GLOBAL_CONTROL_4	0x06
-- 
2.17.0

^ permalink raw reply related

* [PATCH v2 0/1] net: phy: micrel: add 125MHz reference clock workaround
From: Marco Felsch @ 2018-05-15  7:45 UTC (permalink / raw)
  To: robh+dt, mark.rutland, andrew, f.fainelli
  Cc: netdev, devicetree, kernel, niebelm

This patch adds the workaround for the errata#2. The KSZ9031 phy
125MHz reference clock jitter is to high if the phy is configured as
slave. This will cause errors if a application uses the clock as
MAC reference clock. The workaround is to configure the phy as master.

v2:
- add more code and dt-binding comments
- delete unnecessary local variables
- add error handling

v1:
- initial commit

Markus Niebel (1):
  net: phy: micrel: add 125MHz reference clock workaround

 .../bindings/net/micrel-ksz90x1.txt           |  7 ++++
 drivers/net/phy/micrel.c                      | 32 +++++++++++++++++++
 2 files changed, 39 insertions(+)

-- 
2.17.0

^ permalink raw reply

* Re: [PATCH v4 1/1] drivers core: multi-threading device shutdown
From: Greg KH @ 2018-05-15  7:37 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: steven.sistare, daniel.m.jordan, linux-kernel, jeffrey.t.kirsher,
	intel-wired-lan, netdev, alexander.duyck, tobin
In-Reply-To: <20180514194254.14748-2-pasha.tatashin@oracle.com>

On Mon, May 14, 2018 at 03:42:54PM -0400, Pavel Tatashin wrote:
> When system is rebooted, halted or kexeced device_shutdown() is
> called.
> 
> This function shuts down every single device by calling either:
> 
> 	dev->bus->shutdown(dev)
> 	dev->driver->shutdown(dev)
> 
> Even on a machine with just a moderate amount of devices, device_shutdown()
> may take multiple seconds to complete. This is because many devices require
> a specific delays to perform this operation.
> 
> Here is a sample analysis of time it takes to call device_shutdown() on a
> two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.
> 
> device_shutdown		2.95s
> -----------------------------
>  mlx4_shutdown		1.14s
>  megasas_shutdown	0.24s
>  ixgbe_shutdown		0.37s x 4 (four ixgbe devices on this machine).
>  the rest		0.09s
> 
> In mlx4 we spent the most time, but that is because there is a 1 second
> sleep, which is defined by hardware specifications:
> mlx4_shutdown
>  mlx4_unload_one
>   mlx4_free_ownership
>    msleep(1000)
> 
> With megasas we spend a quarter of a second, but sometimes longer (up-to
> 0.5s) in this path:
> 
>     megasas_shutdown
>       megasas_flush_cache
>         megasas_issue_blocked_cmd
>           wait_event_timeout
> 
> Finally, with ixgbe_shutdown() it takes 0.37 for each device, but that time
> is spread all over the place, with bigger offenders:
> 
>     ixgbe_shutdown
>       __ixgbe_shutdown
>         ixgbe_close_suspend
>           ixgbe_down
>             ixgbe_init_hw_generic
>               ixgbe_reset_hw_X540
>                 msleep(100);                        0.104483472
>                 ixgbe_get_san_mac_addr_generic      0.048414851
>                 ixgbe_get_wwn_prefix_generic        0.048409893
>               ixgbe_start_hw_X540
>                 ixgbe_start_hw_generic
>                   ixgbe_clear_hw_cntrs_generic      0.048581502
>                   ixgbe_setup_fc_generic            0.024225800
> 
>     All the ixgbe_*generic functions end-up calling:
>     ixgbe_read_eerd_X540()
>       ixgbe_acquire_swfw_sync_X540
>         usleep_range(5000, 6000);
>       ixgbe_release_swfw_sync_X540
>         usleep_range(5000, 6000);
> 
> While these are short sleeps, they end-up calling them over 24 times!
> 24 * 0.0055s = 0.132s. Adding-up to 0.528s for four devices. Also we have
> four msleep(100). Totaling to:  0.928s
> 
> While we should keep optimizing the individual device drivers, in some
> cases this is simply a hardware property that forces a specific delay, and
> we must wait.
> 
> So, the solution for this problem is to shutdown devices in parallel.
> However, we must shutdown children before shutting down parents, so parent
> device must wait for its children to finish.
> 
> With this patch, on the same machine devices_shutdown() takes 1.142s, and
> without mlx4 one second delay only 0.38s
> 
> This feature can be optionally disabled via kernel parameter:
> device_shutdown_serial. When booted with this parameter, device_shutdown()
> will shutdown devices one by one.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  drivers/base/core.c | 292 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 242 insertions(+), 50 deletions(-)

Can you refactor this to be at least 2 patches?  One that moves code
around to comon functions to make the second patch, that adds the new
functionality, easier to review and understand?

And I echo the "don't use kerneldoc for static functions" review
comment, that's not needed at all.

thanks,

greg k-h

^ permalink raw reply

* Re: possible deadlock in sk_diag_fill
From: Dmitry Vyukov @ 2018-05-15  7:26 UTC (permalink / raw)
  To: Andrei Vagin; +Cc: syzbot, avagin, David Miller, LKML, netdev, syzkaller-bugs
In-Reply-To: <20180515061854.GA30523@outlook.office365.com>

On Tue, May 15, 2018 at 8:18 AM, Andrei Vagin <avagin@virtuozzo.com> wrote:
>> >> >> Hello,
>> >> >>
>> >> >> syzbot found the following crash on:
>> >> >>
>> >> >> HEAD commit:    c1c07416cdd4 Merge tag 'kbuild-fixes-v4.17' of git://git.k..
>> >> >> git tree:       upstream
>> >> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12164c97800000
>> >> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=5a1dc06635c10d27
>> >> >> dashboard link: https://syzkaller.appspot.com/bug?extid=c1872be62e587eae9669
>> >> >> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>> >> >> userspace arch: i386
>> >> >>
>> >> >> Unfortunately, I don't have any reproducer for this crash yet.
>> >> >>
>> >> >> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> >> >> Reported-by: syzbot+c1872be62e587eae9669@syzkaller.appspotmail.com
>> >> >>
>> >> >>
>> >> >> ======================================================
>> >> >> WARNING: possible circular locking dependency detected
>> >> >> 4.17.0-rc3+ #59 Not tainted
>> >> >> ------------------------------------------------------
>> >> >> syz-executor1/25282 is trying to acquire lock:
>> >> >> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at: sk_diag_dump_icons
>> >> >> net/unix/diag.c:82 [inline]
>> >> >> 000000004fddf743 (&(&u->lock)->rlock/1){+.+.}, at:
>> >> >> sk_diag_fill.isra.5+0xa43/0x10d0 net/unix/diag.c:144
>> >> >>
>> >> >> but task is already holding lock:
>> >> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: spin_lock
>> >> >> include/linux/spinlock.h:310 [inline]
>> >> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_dump_icons
>> >> >> net/unix/diag.c:64 [inline]
>> >> >> 00000000b6895645 (rlock-AF_UNIX){+.+.}, at: sk_diag_fill.isra.5+0x94e/0x10d0
>> >> >> net/unix/diag.c:144
>> >> >>
>> >> >> which lock already depends on the new lock.
>> >> >
>> >> > In the code, we have a comment which explains why it is safe to take this lock
>> >> >
>> >> > /*
>> >> >  * The state lock is outer for the same sk's
>> >> >  * queue lock. With the other's queue locked it's
>> >> >  * OK to lock the state.
>> >> >  */
>> >> > unix_state_lock_nested(req);
>> >> >
>> >> > It is a question how to explain this to lockdep.
>> >>
>> >> Do I understand it correctly that (&u->lock)->rlock associated with
>> >> AF_UNIX is locked under rlock-AF_UNIX, and then rlock-AF_UNIX is
>> >> locked under (&u->lock)->rlock associated with AF_NETLINK? If so, I
>> >> think we need to split (&u->lock)->rlock by family too, so that we
>> >> have u->lock-AF_UNIX and u->lock-AF_NETLINK.
>> >
>> > I think here is another problem. lockdep woried about
>> > sk->sk_receive_queue vs unix_sk(s)->lock.
>> >
>> > sk_diag_dump_icons() takes sk->sk_receive_queue and then
>> > unix_sk(s)->lock.
>> >
>> > unix_dgram_sendmsg takes unix_sk(sk)->lock and then sk->sk_receive_queue.
>> >
>> > sk_diag_dump_icons() takes locks for two different sockets, but
>> > unix_dgram_sendmsg() takes locks for one socket.
>> >
>> > sk_diag_dump_icons
>> >         if (sk->sk_state == TCP_LISTEN) {
>> >                 spin_lock(&sk->sk_receive_queue.lock);
>> >                 skb_queue_walk(&sk->sk_receive_queue, skb) {
>> >                         unix_state_lock_nested(req);
>> >                                 spin_lock_nested(&unix_sk(s)->lock,
>> >
>> >
>> > unix_dgram_sendmsg
>> >         unix_state_lock(other)
>> >                 spin_lock(&unix_sk(s)->lock)
>> >         skb_queue_tail(&other->sk_receive_queue, skb);
>> >                 spin_lock_irqsave(&list->lock, flags);
>>
>>
>> Do you mean the following?
>> There is socket 1 with state lock (S1) and queue lock (Q2), and socket
>> 2 with state lock (S2) and queue lock (Q2). unix_dgram_sendmsg lock
>> S1->Q1. And sk_diag_dump_icons locks Q1->S2.
>> If yes, then this looks pretty much as deadlock. Consider that 2
>> unix_dgram_sendmsg in 2 different threads lock S1 and S2 respectively.
>> Now 2  sk_diag_dump_icons in 2 different threads lock Q1 and Q2
>> respectively. Now sk_diag_dump_icons want to lock S's, and
>> unix_dgram_sendmsg want to lock Q's. Nobody can proceed.
>
> Q1 and S1 belongs to a listen socket, so they can't be taken from
> unix_dgram_sendmsg().

Should we then split Q1/S1 for listening and data sockets? I don't
know it lockdep allows changing lock class on the fly, though. Always
wondered if there was a single reason to mix listening and data
sockets into a single thing on API level...

^ permalink raw reply

* Re: maximal supported bit rate in traffic control
From: Eitan Raviv @ 2018-05-15  7:25 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, Dan Kenigsberg
In-Reply-To: <f5797c00-f51f-430c-abf8-db1762f8a993@gmail.com>

Thanks a lot for the fast reply.
However we are using HFSC, and I did not find a 64bit HFSC qdisc. Does
one exits?
If not, are there plans to create one?

Thank you

On Tue, May 15, 2018 at 1:26 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 05/14/2018 03:15 PM, Eitan Raviv wrote:
>> Hi,
>>
>> In RHEVM we use traffic control to enable users to shape their QoS on
>> the networks they use.
>>
>> IIUC from the traffic control man page, the maximal bit rate currently
>> supported (RHEL 7.4) is 2^32 - 1 bytes per sec, which translates to 32
>> Gibit\s if I have done my maths correctly.
>>
>> Does anyone know whether there are any plans to support larger bit
>> rate values upstream or downstream soon or at all?
>>
>
> At least 3 qdisc got 64bit rates :
>
> HTB (  TCA_HTB_RATE64 ), tbf ( TCA_TBF_RATE64), netem ( TCA_NETEM_RATE64 )
>
> No plans yet for others.
>

^ permalink raw reply

* Re: Kernel panic on kernel-3.10.0-693.21.1.el7 in ndisc.h
From: Michal Kubecek @ 2018-05-15  6:55 UTC (permalink / raw)
  To: Roman Makhov; +Cc: Stephen Hemminger, Alexander Aring, linux-wpan, netdev
In-Reply-To: <20180514140611.26eec64f@xeon-e3>

On Mon, May 14, 2018 at 02:06:11PM -0700, Stephen Hemminger wrote:
> On Mon, 14 May 2018 21:29:03 +0300 Roman Makhov <roman.makhov@gmail.com> wrote:
> > Thank you for the answer.
> > Unfortunately CentOS goes with these dinosaurs.
> > So we will try to debug the problem in the current one and try to
> > reproduce on the latest kernel.
> 
> If you are stuck in old kernels, please bug the CentOs maintainers not
> upstream developers.

Actually, it's not even the dinosaur. Kernels from RHEL (and therefore
CentOS) or SLES differ from the original mainline version they are based
on quite a lot. It's possible that this bug didn't really exist in the
old 3.10 and is related to one of the backports added to CentOS (or
rather RHEL in this case) kernel. People outside of the distribution
have little idea what was backported and why so unless the issue can be
reproduced with mainline kernel they cannot be expected to help.

Michal Kubecek

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox