* Re: [PATCH 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Alexander Duyck @ 2018-03-15 1:44 UTC (permalink / raw)
To: Sinan Kaya
Cc: Timur Tabi, Netdev, sulrich, linux-arm-msm, linux-arm-kernel,
Jeff Kirsher, intel-wired-lan, LKML
In-Reply-To: <d565406d-1bc7-6fb0-3f36-87a28ac69070@codeaurora.org>
On Wed, Mar 14, 2018 at 3:57 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Hi Alexander,
>
> On 3/14/2018 5:49 PM, Alexander Duyck wrote:
>> On Wed, Mar 14, 2018 at 5:13 AM, <okaya@codeaurora.org> wrote:
>>> On 2018-03-14 01:08, Timur Tabi wrote:
>>>>
>>>> On 3/13/18 10:20 PM, Sinan Kaya wrote:
>
>> Actually I would argue this whole patch set is pointless. For starters
>> why is it we are only updating the Intel Ethernet drivers here?
>
> I did a regex search for wmb() followed by writel() in each drivers directory.
> I scrubbed the ones I care about and posted this series. Note also that
> I have one Infiniband patch in the series.
I didn't see it as it I was only looking at the patches that had ended
up in intel-wired-lan. Also was there a cover page, I couldn't seem to
find that on LKML.
> I considered "ease of change", "popular usage" and "performance critical
> path" as the determining criteria for my filtering.
It might be advisable to break things up to subsystem or family. So
for example if you are going to update the Intel Ethernet drivers I
would focus on that and maybe spin the infiniband patch of into a
separate set that can be applied to a separate tree. This is something
I would consider more of a driver optimization than a fix. In our case
it makes it easier for us to maintain the patches to the Intel drivers
if you could submit just those to Jeff and Intel-wired-lan so that we
can take care of test and review as well as figure out what other
drivers will would still need to update in order to handle all the
cases involved in this.
>> This
>> seems like something that is going to impact the whole kernel tree
>> since many of us have been writing drivers for some time assuming x86
>> style behavior.
>
> That's true. We used relaxed API heavily on ARM for a long time but
> it did not exist on other architectures. For this reason, relaxed
> architectures have been paying double penalty in order to use the common
> drivers.
>
> Now that relaxed API is present on all architectures, we can go and scrub
> all drivers to see what needs to change and what can remain.
My only real objection is that you are going to be having to scrub
pretty much ALL the drivers. It seems a little like trying to fix a
bad tire on your car by paving the road to match the shape of the
tire.
>>
>> It doesn't make sense to be optimizing the drivers for one subset of
>> architectures. The scope of the work needed to update the drivers for
>> this would be ridiculous. Also I don't see how this could be expected
>> to work on any other architecture when we pretty much need to have a
>> wmb() before calling the writel on x86 to deal with accesses between
>> coherent and non-coherent memory. It seems to me more like somebody
>> added what they considered to be an optimization somewhere that is a
>> workaround for a poorly written driver. Either that or the barrier is
>> serving a different purpose then the one we were using.
>
> Is there a semantic problem with the definition of wmb() vs. writel() vs.
> writel_relaxed()? I thought everything is well described in barriers
> document about what to expect from these APIs.
>
> AFAIK, writel() is equal to writel_relaxed() on x86 architecture.
> It doesn't really change anything for x86 but it saves barriers on
> other architectures.
Yeah. I had to go through and do some review since my concerns have
been PowerPC, IA64, and x86 historically. From what I can tell all
those architectures are setup the same way so that shouldn't be an
issue.
>>
>> It would make more sense to put in the effort making writel and
>> writel_relaxed consistent between architectures before we go through
>> and start modifying driver code to support different architectures.
>>
>
> Is there an arch problem that I'm not seeing?
>
> Sinan
It isn't really an arch problem I have as a logistical one. It just
seems like this is really backwards in terms of how this has been
handled. For the x86 we have historically had to deal with the
barriers for this kind of stuff ourselves, now for ARM and a couple
other architectures they seem to have incorporated the barriers into
writel and are expecting everyone to move over to writel_relaxed. It
seems like instead of going that route they should have probably just
looked at pushing the ARM drivers to something like a "writel_strict"
and adopted the behavior of the other architectures for writel.
I'll go back through and review. It looks like a number of items were missed.
^ permalink raw reply
* Re: [PATCH 6/7] e1000: eliminate duplicate barriers on weakly-ordered archs
From: Alexander Duyck @ 2018-03-15 1:41 UTC (permalink / raw)
To: Sinan Kaya
Cc: Netdev, Timur Tabi, sulrich, linux-arm-msm, linux-arm-kernel,
Jeff Kirsher, intel-wired-lan, LKML
In-Reply-To: <1520997629-17361-6-git-send-email-okaya@codeaurora.org>
On Tue, Mar 13, 2018 at 8:20 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Code includes wmb() followed by writel(). writel() already has a barrier
> on some architectures like arm64.
>
> This ends up CPU observing two barriers back to back before executing the
> register write.
>
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
>
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
> drivers/net/ethernet/intel/e1000/e1000_main.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 3dd4aeb..e0e583a 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -4573,7 +4573,7 @@ e1000_alloc_jumbo_rx_buffers(struct e1000_adapter *adapter,
> * such as IA-64).
> */
> wmb();
> - writel(i, adapter->hw.hw_addr + rx_ring->rdt);
> + writel_relaxed(i, adapter->hw.hw_addr + rx_ring->rdt);
> }
> }
>
> @@ -4688,7 +4688,7 @@ static void e1000_alloc_rx_buffers(struct e1000_adapter *adapter,
> * such as IA-64).
> */
> wmb();
> - writel(i, hw->hw_addr + rx_ring->rdt);
> + writel_relaxed(i, hw->hw_addr + rx_ring->rdt);
> }
> }
>
So you missed the writel in e1000_xmit_frame. You should probably get
that one too while you are doing these updates. The wmb() is in
e1000_tx_queue().
^ permalink raw reply
* Re: [PATCH net-next v2 0/4] net: qualcomm: rmnet: Updates 2018-03-12
From: David Miller @ 2018-03-15 1:29 UTC (permalink / raw)
To: subashab; +Cc: netdev
In-Reply-To: <a099fdae0a1b6def9ac7db6bcafaf664@codeaurora.org>
From: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Date: Wed, 14 Mar 2018 13:27:41 -0600
>> Please address Joe's feedback and only update the copyright date on
>> files that actually had changes this year.
>> Thank you.
>
> Hi David
>
> I have fixed that now in v3.
> However, patchwork is not showing the entire series.
> It shows only the first patch for some reason even if I search with me
> as submitter.
We're having some problems with patchwork dropping patches,
especially when they are sent very quickly as a series one
after another.
The patchwork developers have been looking into it.
Meanwhile don't worry about it, the series is in my inbox.
^ permalink raw reply
* Re: KASAN: use-after-free Read in pfifo_fast_enqueue
From: Eric Dumazet @ 2018-03-15 0:47 UTC (permalink / raw)
To: syzbot, davem, jhs, jiri, linux-kernel, netdev, syzkaller-bugs,
xiyou.wangcong
In-Reply-To: <a37dbc2c-2c43-2f71-7b36-275fcee83eb1@gmail.com>
On 03/14/2018 05:16 PM, Eric Dumazet wrote:
>
> typical use after free...
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 190570f21b208d5a17943360a3a6f85e1c2a2187..663e016491773f40f81d9bbfeab3dd68e1c2fc5c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -628,6 +628,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
> int band = prio2band[skb->priority & TC_PRIO_MAX];
> struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> struct skb_array *q = band2list(priv, band);
> + unsigned int pkt_len = qdisc_pkt_len(skb);
> int err;
>
> err = skb_array_produce(q, skb);
> @@ -636,7 +637,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
> return qdisc_drop_cpu(skb, qdisc, to_free);
>
> qdisc_qstats_cpu_qlen_inc(qdisc);
> - qdisc_qstats_cpu_backlog_inc(qdisc, skb);
> + this_cpu_add(qdisc->cpu_qstats->backlog, pkt_len);
> return NET_XMIT_SUCCESS;
> }
>
There is also a similar issue right after qdisc_enqueue_skb_bad_txq() call.
We should move the following code in qdisc_enqueue_skb_bad_txq() to benefit from the locking
if (qdisc_is_percpu_stats(q)) {
qdisc_qstats_cpu_backlog_inc(q, nskb);
qdisc_qstats_cpu_qlen_inc(q);
} else {
qdisc_qstats_backlog_inc(q, nskb);
q->q.qlen++;
}
I will post a patch with the two fixes.
^ permalink raw reply
* Re: [PATCH bpf-next v6 0/2] bpf stackmap with build_id+offset
From: Song Liu @ 2018-03-15 0:34 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev@vger.kernel.org, ast@kernel.org, peterz@infradead.org,
Kernel Team, hannes@cmpxchg.org, Teng Qin
In-Reply-To: <46bec038-f793-7ce6-686f-adee7f462dec@iogearbox.net>
> On Mar 14, 2018, at 5:29 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 03/14/2018 06:23 PM, Song Liu wrote:
>> Changes v4 -> v6:
>>
>> 1. When kernel stack is added to stackmap with build_id, use fallback
>> mechanism to store ip (status == BPF_STACK_BUILD_ID_IP).
>>
>> Changes v4 -> v5:
>>
>> 1. Only allow build_id lookup in non-nmi context. Added comment and
>> commit message to highlight this limitation.
>> 2. Minor fix reported by kbuild test robot.
>>
>> Changes v3 -> v4:
>>
>> 1. Add fallback when build_id lookup failed. In this case, status is set
>> to BPF_STACK_BUILD_ID_IP, and ip of this entry is saved.
>> 2. Handle cases where vma is only part of the file (vma->vm_pgoff != 0).
>> Thanks to Teng for helping me identify this issue!
>> 3. Address feedbacks for previous versions.
>
> Applied to bpf-next, thanks Song!
>
> P.s.: In the merge commit I included your description from the original
> v1 cover letter to provide readers with some more context. Please don't
> remove such things in future from cover letter. Only the above doesn't
> really say much about the series overall.
Get it! Thanks Daniel!
Best,
Song
^ permalink raw reply
* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Alexei Starovoitov @ 2018-03-15 0:29 UTC (permalink / raw)
To: Daniel Borkmann, Alexei Starovoitov
Cc: Alexei Starovoitov, davem, netdev, kernel-team
In-Reply-To: <54d95e56-02dc-b766-2221-9d5d6ce9985c@iogearbox.net>
On 3/14/18 4:27 PM, Daniel Borkmann wrote:
> On 03/14/2018 07:11 PM, Alexei Starovoitov wrote:
>> On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>>> BPF_PROG_TYPE_SOCK_OPS,
>>>> BPF_PROG_TYPE_SK_SKB,
>>>> BPF_PROG_TYPE_CGROUP_DEVICE,
>>>> + BPF_PROG_TYPE_CGROUP_INET4_BIND,
>>>> + BPF_PROG_TYPE_CGROUP_INET6_BIND,
>>>
>>> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
>>> confused by the many sock_*/sk_* prog types we have. The attach hook could
>>> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
>>> storing some prog-type specific void *private_data in prog's aux during
>>> verification could be a way (similarly as you mention) which can later be
>>> retrieved at attach time to reject with -ENOTSUPP or such.
>>
>> that's exacly what I mentioned in the cover letter,
>> but we need to extend attach cmd with verifier-like log_buf+log_size.
>> since simple enotsupp will be impossible to debug.
>
> Hmm, adding verifier-like log_buf + log_size feels a bit of a kludge just
> for this case, but it's the usual problem where getting a std error code
> is like looking into a crystal ball to figure where it's coming from. I'd see
> only couple of other alternatives: distinct error code like ENAVAIL for such
> mismatch, or telling the verifier upfront where this is going to be attached
> to - same as we do with the dev for offloading or as you did with splitting
> the prog types or some sort of notion of sub-prog types; or having a query
> API that returns possible/compatible attach types for this loaded prog via
> e.g. bpf_prog_get_info_by_fd() that loader can precheck or check when error
> occurs. All nothing really nice, though.
query after loading isn't great, since possible attach combinations
will be too high for human to understand,
but I like "passing attach_type into prog_load" idea.
That should work and it fits existing prog_ifindex too.
So we'll add '__u32 attach_type' to prog_load cmd.
elf loader would still need to parse section name to
figure out prog type and attach type.
Something like:
SEC("sock_addr/bind_v4") my_prog(struct bpf_sock_addr *ctx)
SEC("sock_addr/connect_v6") my_prog(struct bpf_sock_addr *ctx)
We still need new prog type for bind_v4/bind_v6/connect_v4/connect_v6
hooks with distinct 'struct bpf_sock_addr' context,
since the prog is accessing both sockaddr and sock.
Adding user_ip4, user_ip6 fields to 'struct bpf_sock_ops'
is doable, but it would be too confusing to users, so imo that's
not a good option.
For post-bind hook we probably can reuse 'struct bpf_sock_ops'
and BPF_PROG_TYPE_SOCK_OPS, since there only sock is the context.
> Making verifier-like log_buf + log_size generic meaning not just for the case
> of BPF_PROG_ATTACH specifically might be yet another option, meaning you'd
> have a new BPF_GET_ERROR command to pick up the log for the last failed bpf(2)
> cmd, but either that might require adding a BPF related pointer to task struct
> for this or any other future BPF feature (maybe not really an option); or to
> have some sort of bpf cmd to config and obtain an fd for error queue/log once,
> where this can then be retrieved from and used for all cmds generically.
I don't think we want to hold on to error logs in the kernel,
since user may not query it right away or ever.
verifier log is freed right after prog_load cmd is done.
> Seems like it would potentially be on top of that, plus having an option to
> attach from within the app instead of orchestrator.
right. let's worry about it as potential next step.
^ permalink raw reply
* Re: [PATCH bpf-next v6 0/2] bpf stackmap with build_id+offset
From: Daniel Borkmann @ 2018-03-15 0:29 UTC (permalink / raw)
To: Song Liu, netdev, ast, peterz; +Cc: kernel-team, hannes, qinteng
In-Reply-To: <20180314172323.3957483-1-songliubraving@fb.com>
On 03/14/2018 06:23 PM, Song Liu wrote:
> Changes v4 -> v6:
>
> 1. When kernel stack is added to stackmap with build_id, use fallback
> mechanism to store ip (status == BPF_STACK_BUILD_ID_IP).
>
> Changes v4 -> v5:
>
> 1. Only allow build_id lookup in non-nmi context. Added comment and
> commit message to highlight this limitation.
> 2. Minor fix reported by kbuild test robot.
>
> Changes v3 -> v4:
>
> 1. Add fallback when build_id lookup failed. In this case, status is set
> to BPF_STACK_BUILD_ID_IP, and ip of this entry is saved.
> 2. Handle cases where vma is only part of the file (vma->vm_pgoff != 0).
> Thanks to Teng for helping me identify this issue!
> 3. Address feedbacks for previous versions.
Applied to bpf-next, thanks Song!
P.s.: In the merge commit I included your description from the original
v1 cover letter to provide readers with some more context. Please don't
remove such things in future from cover letter. Only the above doesn't
really say much about the series overall.
^ permalink raw reply
* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Eric Dumazet @ 2018-03-15 0:17 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, David S. Miller, Daniel Borkmann,
Network Development, Kernel Team
In-Reply-To: <CAADnVQL4PDiNq07ip-uh=n0+ba3nzTy+QJwew1x71UKeGgsiEw@mail.gmail.com>
On 03/14/2018 11:41 AM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 11:00 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>>> It seems this is exactly the case where a netns would be the correct answer.
>>
>> Unfortuantely that's not the case. That's what I tried to explain
>> in the cover letter:
>> "The setup involves per-container IPs, policy, etc, so traditional
>> network-only solutions that involve VRFs, netns, acls are not applicable."
>> To elaborate more on that:
>> netns is l2 isolation.
>> vrf is l3 isolation.
>> whereas to containerize an application we need to punch connectivity holes
>> in these layered techniques.
>> We also considered resurrecting Hannes's afnetns work
>> and even went as far as designing a new namespace for L4 isolation.
>> Unfortunately all hierarchical namespace abstraction don't work.
>> To run an application inside cgroup container that was not written
>> with containers in mind we have to make an illusion of running
>> in non-containerized environment.
>> In some cases we remember the port and container id in the post-bind hook
>> in a bpf map and when some other task in a different container is trying
>> to connect to a service we need to know where this service is running.
>> It can be remote and can be local. Both client and service may or may not
>> be written with containers in mind and this sockaddr rewrite is providing
>> connectivity and load balancing feature that you simply cannot do
>> with hierarchical networking primitives.
>
> have to explain this a bit further...
> We also considered hacking these 'connectivity holes' in
> netns and/or vrf, but that would be real layering violation,
> since clean l2, l3 abstraction would suddenly support
> something that breaks through the layers.
> Just like many consider ipvlan a bad hack that punches
> through the layers and connects l2 abstraction of netns
> at l3 layer, this is not something kernel should ever do.
> We really didn't want another ipvlan-like hack in the kernel.
> Instead bpf programs at bind/connect time _help_
> applications discover and connect to each other.
> All containers are running in init_nens and there are no vrfs.
> After bind/connect the normal fib/neighbor core networking
> logic works as it should always do. The whole system is
> clean from network point of view.
We apparently missed something when deploying ipvlan and one netns per
container/job
Full access to 64K ports, no more ports being reserved/abused.
If one job needs more, no problem, just use more than one IP per netns.
It also works with UDP just fine. Are you considering adding a hook
later for sendmsg() (unconnected socket or not), or do you want to use
the existing one in ip_finish_output(), adding per-packet overhead ?
This notion of 'clean l2, l3 abstraction' is very subjective.
I find netns isolation very clean, powerful, and it is there already.
eBPF is certainly nice, but pretending netns/ipvlan are hacks is not
credible.
^ permalink raw reply
* Re: KASAN: use-after-free Read in pfifo_fast_enqueue
From: Eric Dumazet @ 2018-03-15 0:16 UTC (permalink / raw)
To: syzbot, davem, jhs, jiri, linux-kernel, netdev, syzkaller-bugs,
xiyou.wangcong
In-Reply-To: <000000000000081c6a056767c154@google.com>
On 03/14/2018 04:30 PM, syzbot wrote:
> syzbot has found reproducer for the following crash on net-next commit
> a870a02cc963de35452bbed932560ed69725c4f2 (Tue Mar 13 20:58:39 2018 +0000)
> pktgen: use dynamic allocation for debug print buffer
>
> So far this crash happened 7 times on mmots, net-next.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+ed43b6903ab968b16f54@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed.
>
> ==================================================================
> BUG: KASAN: use-after-free in qdisc_pkt_len include/net/sch_generic.h:610 [inline]
> BUG: KASAN: use-after-free in qdisc_qstats_cpu_backlog_inc include/net/sch_generic.h:712 [inline]
> BUG: KASAN: use-after-free in pfifo_fast_enqueue+0x4bc/0x5e0 net/sched/sch_generic.c:639
> Read of size 4 at addr ffff8801cede37e8 by task syzkaller717588/5543
>
> CPU: 1 PID: 5543 Comm: syzkaller717588 Not tainted 4.16.0-rc4+ #265
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:17 [inline]
> dump_stack+0x194/0x24d lib/dump_stack.c:53
> print_address_description+0x73/0x250 mm/kasan/report.c:256
> kasan_report_error mm/kasan/report.c:354 [inline]
> kasan_report+0x23c/0x360 mm/kasan/report.c:412
> __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
> qdisc_pkt_len include/net/sch_generic.h:610 [inline]
> qdisc_qstats_cpu_backlog_inc include/net/sch_generic.h:712 [inline]
> pfifo_fast_enqueue+0x4bc/0x5e0 net/sched/sch_generic.c:639
> __dev_xmit_skb net/core/dev.c:3216 [inline]
> __dev_queue_xmit+0xd50/0x2f30 net/core/dev.c:3517
> dev_queue_xmit+0x17/0x20 net/core/dev.c:3582
> neigh_hh_output include/net/neighbour.h:472 [inline]
> neigh_output include/net/neighbour.h:480 [inline]
> ip6_finish_output2+0x1472/0x23d0 net/ipv6/ip6_output.c:120
> ip6_finish_output+0x69b/0xaf0 net/ipv6/ip6_output.c:154
> NF_HOOK_COND include/linux/netfilter.h:277 [inline]
> ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171
> dst_output include/net/dst.h:444 [inline]
> ip6_local_out+0x95/0x160 net/ipv6/output_core.c:176
> ip6_send_skb+0xa1/0x330 net/ipv6/ip6_output.c:1677
> ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1697
> rawv6_push_pending_frames net/ipv6/raw.c:616 [inline]
> rawv6_sendmsg+0x2f96/0x40c0 net/ipv6/raw.c:935
> inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
> sock_sendmsg_nosec net/socket.c:629 [inline]
> sock_sendmsg+0xca/0x110 net/socket.c:639
> ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
> __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
> SYSC_sendmmsg net/socket.c:2168 [inline]
> SyS_sendmmsg+0x35/0x60 net/socket.c:2163
> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x447739
> RSP: 002b:00007fb444290d58 EFLAGS: 00000297 ORIG_RAX: 0000000000000133
> RAX: ffffffffffffffda RBX: 00000000006f0084 RCX: 0000000000447739
> RDX: 0000000000000249 RSI: 0000000020001300 RDI: 0000000000000353
> RBP: 0000000000000000 R08: 0000000020000000 R09: 0000000020000000
> R10: 0000000000000000 R11: 0000000000000297 R12: 00000000006f0080
> R13: 00000000007ffd8f R14: 00007fb4442919c0 R15: 000000000000000a
>
> Allocated by task 5543:
> save_stack+0x43/0xd0 mm/kasan/kasan.c:447
> set_track mm/kasan/kasan.c:459 [inline]
> kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:552
> kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
> kmem_cache_alloc_node+0x144/0x760 mm/slab.c:3631
> __alloc_skb+0xf1/0x780 net/core/skbuff.c:193
> alloc_skb include/linux/skbuff.h:986 [inline]
> sock_wmalloc+0x140/0x1d0 net/core/sock.c:1942
> __ip6_append_data.isra.43+0x26b9/0x3390 net/ipv6/ip6_output.c:1416
> ip6_append_data+0x189/0x290 net/ipv6/ip6_output.c:1571
> rawv6_sendmsg+0x1e09/0x40c0 net/ipv6/raw.c:928
> inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
> sock_sendmsg_nosec net/socket.c:629 [inline]
> sock_sendmsg+0xca/0x110 net/socket.c:639
> ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
> __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
> SYSC_sendmmsg net/socket.c:2168 [inline]
> SyS_sendmmsg+0x35/0x60 net/socket.c:2163
> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> Freed by task 5531:
> save_stack+0x43/0xd0 mm/kasan/kasan.c:447
> set_track mm/kasan/kasan.c:459 [inline]
> __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:520
> kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:527
> __cache_free mm/slab.c:3485 [inline]
> kmem_cache_free+0x83/0x2a0 mm/slab.c:3743
> kfree_skbmem+0x1a1/0x1d0 net/core/skbuff.c:582
> __kfree_skb net/core/skbuff.c:642 [inline]
> kfree_skb+0x165/0x4c0 net/core/skbuff.c:659
> ip_tunnel_xmit+0x710/0x3550 net/ipv4/ip_tunnel.c:778
> __gre_xmit+0x546/0x8b0 net/ipv4/ip_gre.c:449
> erspan_xmit+0x779/0x22a0 net/ipv4/ip_gre.c:731
> __netdev_start_xmit include/linux/netdevice.h:4078 [inline]
> netdev_start_xmit include/linux/netdevice.h:4087 [inline]
> xmit_one net/core/dev.c:3026 [inline]
> dev_hard_start_xmit+0x24e/0xac0 net/core/dev.c:3042
> sch_direct_xmit+0x40d/0x1140 net/sched/sch_generic.c:327
> qdisc_restart net/sched/sch_generic.c:393 [inline]
> __qdisc_run+0x57d/0x19c0 net/sched/sch_generic.c:401
> __dev_xmit_skb net/core/dev.c:3217 [inline]
> __dev_queue_xmit+0xd5e/0x2f30 net/core/dev.c:3517
> dev_queue_xmit+0x17/0x20 net/core/dev.c:3582
> neigh_hh_output include/net/neighbour.h:472 [inline]
> neigh_output include/net/neighbour.h:480 [inline]
> ip6_finish_output2+0x1472/0x23d0 net/ipv6/ip6_output.c:120
> ip6_finish_output+0x69b/0xaf0 net/ipv6/ip6_output.c:154
> NF_HOOK_COND include/linux/netfilter.h:277 [inline]
> ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:171
> dst_output include/net/dst.h:444 [inline]
> ip6_local_out+0x95/0x160 net/ipv6/output_core.c:176
> ip6_send_skb+0xa1/0x330 net/ipv6/ip6_output.c:1677
> ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1697
> rawv6_push_pending_frames net/ipv6/raw.c:616 [inline]
> rawv6_sendmsg+0x2f96/0x40c0 net/ipv6/raw.c:935
> inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
> sock_sendmsg_nosec net/socket.c:629 [inline]
> sock_sendmsg+0xca/0x110 net/socket.c:639
> ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
> __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
> SYSC_sendmmsg net/socket.c:2168 [inline]
> SyS_sendmmsg+0x35/0x60 net/socket.c:2163
> do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> The buggy address belongs to the object at ffff8801cede37c0
> which belongs to the cache skbuff_head_cache of size 232
> The buggy address is located 40 bytes inside of
> 232-byte region [ffff8801cede37c0, ffff8801cede38a8)
> The buggy address belongs to the page:
> page:ffffea00073b78c0 count:1 mapcount:0 mapping:ffff8801cede3040 index:0x0
> flags: 0x2fffc0000000100(slab)
> raw: 02fffc0000000100 ffff8801cede3040 0000000000000000 000000010000000c
> raw: ffffea00074946a0 ffffea00074c6d60 ffff8801d940fcc0 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff8801cede3680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801cede3700: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
>> ffff8801cede3780: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ^
> ffff8801cede3800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8801cede3880: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
>
typical use after free...
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f21b208d5a17943360a3a6f85e1c2a2187..663e016491773f40f81d9bbfeab3dd68e1c2fc5c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -628,6 +628,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
int band = prio2band[skb->priority & TC_PRIO_MAX];
struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
struct skb_array *q = band2list(priv, band);
+ unsigned int pkt_len = qdisc_pkt_len(skb);
int err;
err = skb_array_produce(q, skb);
@@ -636,7 +637,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
return qdisc_drop_cpu(skb, qdisc, to_free);
qdisc_qstats_cpu_qlen_inc(qdisc);
- qdisc_qstats_cpu_backlog_inc(qdisc, skb);
+ this_cpu_add(qdisc->cpu_qstats->backlog, pkt_len);
return NET_XMIT_SUCCESS;
}
^ permalink raw reply related
* [PATCH] netns: send uevent messages
From: Christian Brauner @ 2018-03-15 0:12 UTC (permalink / raw)
To: ebiederm, gregkh, netdev, linux-kernel; +Cc: serge, Christian Brauner
This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
to allow sending uevent messages into the network namespace the socket
belongs to.
Currently non-initial network namespaces are already isolated and don't
receive uevents. There are a number of cases where it is beneficial for a
sufficiently privileged userspace process to send a uevent into a network
namespace.
One such use case would be debugging and fuzzing of a piece of software
which listens and reacts to uevents. By running a copy of that software
inside a network namespace, specific uevents could then be presented to it.
More concretely, this would allow for easy testing of udevd/ueventd.
This will also allow some piece of software to run components inside a
separate network namespace and then effectively filter what that software
can receive. Some examples of software that do directly listen to uevents
and that we have in the past attempted to run inside a network namespace
are rbd (CEPH client) or the X server.
Implementation:
The implementation has been kept as simple as possible from the kernel's
perspective. Specifically, a simple input method uevent_net_rcv() is added
to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
af_netlink infrastructure and does neither add an additional netlink family
nor requires any user-visible changes.
For example, by using netlink_rcv_skb() we can make use of existing netlink
infrastructure to report back informative error messages to userspace.
Furthermore, this implementation does not introduce any overhead for
existing uevent generating codepaths. The struct netns gets a new uevent
socket member that records the uevent socket associated with that network
namespace. Since we record the uevent socket for each network namespace in
struct net we don't have to walk the whole uevent socket list.
Instead we can directly retrieve the relevant uevent socket and send the
message. This keeps the codepath very performant without introducing
needless overhead.
Uevent sequence numbers are kept global. When a uevent message is sent to
another network namespace the implementation will simply increment the
global uevent sequence number and append it to the received uevent. This
has the advantage that the kernel will never need to parse the received
uevent message to replace any existing uevent sequence numbers. Instead it
is up to the userspace process to remove any existing uevent sequence
numbers in case the uevent message to be sent contains any.
Security:
In order for a caller to send uevent messages to a target network namespace
the caller must have CAP_SYS_ADMIN in the owning user namespace of the
target network namespace. Additionally, any received uevent message is
verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
needed to append the uevent sequence number.
Testing:
This patch has been tested and verified to work with the following udev
implementations:
1. CentOS 6 with udevd version 147
2. Debian Sid with systemd-udevd version 237
3. Android 7.1.1 with ueventd
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
include/net/net_namespace.h | 1 +
lib/kobject_uevent.c | 88 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 88 insertions(+), 1 deletion(-)
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f306b2aa15a4..467bde763a9b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -78,6 +78,7 @@ struct net {
struct sock *rtnl; /* rtnetlink socket */
struct sock *genl_sock;
+ struct sock *uevent_sock; /* uevent socket */
struct list_head dev_base_head;
struct hlist_head *dev_name_head;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 9fe6ec8fda28..10b2144b9fc3 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -25,6 +25,7 @@
#include <linux/uuid.h>
#include <linux/ctype.h>
#include <net/sock.h>
+#include <net/netlink.h>
#include <net/net_namespace.h>
@@ -602,12 +603,94 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
EXPORT_SYMBOL_GPL(add_uevent_var);
#if defined(CONFIG_NET)
+static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
+ struct netlink_ext_ack *extack)
+{
+ int ret;
+ u64 seqnum;
+ /* u64 to chars: 2^64 - 1 = 21 chars */
+ char buf[sizeof("SEQNUM=") + 21];
+ struct sk_buff *skbc;
+
+ /* bump sequence number */
+ mutex_lock(&uevent_sock_mutex);
+ seqnum = ++uevent_seqnum;
+ mutex_unlock(&uevent_sock_mutex);
+
+ /* prepare sequence number */
+ ret = snprintf(buf, sizeof(buf), "SEQNUM=%llu", seqnum);
+ if (ret < 0 || (size_t)ret >= sizeof(buf))
+ return -ENOMEM;
+ ret++;
+
+ /* verify message does not overflow */
+ if ((skb->len + ret) > UEVENT_BUFFER_SIZE) {
+ NL_SET_ERR_MSG(extack, "uevent message too big");
+ return -EINVAL;
+ }
+
+ /* copy skb and extend to accommodate sequence number */
+ skbc = skb_copy_expand(skb, 0, ret, GFP_KERNEL);
+ if (!skbc)
+ return -ENOMEM;
+
+ /* append sequence number */
+ skb_put_data(skbc, buf, ret);
+
+ /* remove msg header */
+ skb_pull(skbc, NLMSG_HDRLEN);
+
+ /* set portid 0 to inform userspace message comes from kernel */
+ NETLINK_CB(skbc).portid = 0;
+ NETLINK_CB(skbc).dst_group = 1;
+
+ ret = netlink_broadcast(usk, skbc, 0, 1, GFP_KERNEL);
+ /* ENOBUFS should be handled in userspace */
+ if (ret == -ENOBUFS || ret == -ESRCH)
+ ret = 0;
+
+ return ret;
+}
+
+static int uevent_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
+ struct netlink_ext_ack *extack)
+{
+ void *msg;
+ struct net *target_net;
+ struct sock *target_sk;
+
+ msg = nlmsg_data(nlh);
+ if (!msg)
+ return -EINVAL;
+
+ target_net = sock_net(NETLINK_CB(skb).sk);
+ target_sk = target_net->uevent_sock;
+
+ /*
+ * Verify that we are allowed to send messages to the target network
+ * namespace. The caller must have CAP_SYS_ADMIN in the owning user
+ * namespace of the target network namespace.
+ */
+ if (!netlink_ns_capable(skb, target_net->user_ns, CAP_SYS_ADMIN)) {
+ NL_SET_ERR_MSG(extack, "missing CAP_SYS_ADMIN capability");
+ return -EPERM;
+ }
+
+ return uevent_net_send(target_sk, skb, extack);
+}
+
+static void uevent_net_rcv(struct sk_buff *skb)
+{
+ netlink_rcv_skb(skb, &uevent_rcv_msg);
+}
+
static int uevent_net_init(struct net *net)
{
struct uevent_sock *ue_sk;
struct netlink_kernel_cfg cfg = {
.groups = 1,
- .flags = NL_CFG_F_NONROOT_RECV,
+ .input = uevent_net_rcv,
+ .flags = NL_CFG_F_NONROOT_RECV
};
ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
@@ -621,6 +704,9 @@ static int uevent_net_init(struct net *net)
kfree(ue_sk);
return -ENODEV;
}
+
+ net->uevent_sock = ue_sk->sk;
+
mutex_lock(&uevent_sock_mutex);
list_add_tail(&ue_sk->list, &uevent_sock_list);
mutex_unlock(&uevent_sock_mutex);
--
2.15.1
^ permalink raw reply related
* Re: [PATCH RFC 5/7] net: phy: make phy_stop synchronous
From: Florian Fainelli @ 2018-03-15 0:00 UTC (permalink / raw)
To: Heiner Kallweit, Andrew Lunn; +Cc: Geert Uytterhoeven, netdev@vger.kernel.org
In-Reply-To: <1ce9431b-a109-e7ac-f974-aeda8155e40e@gmail.com>
On 03/14/2018 01:16 PM, Heiner Kallweit wrote:
> Currently phy_stop() just sets the state to PHY_HALTED and relies on the
> state machine to do the remaining work. It can take up to 1s until the
> state machine runs again what causes issues in situations where e.g.
> driver / device is brought down directly after executing phy_stop().
>
> Fix this by executing all phy_stop() activities synchronously.
>
> Add a function phy_stop_suspending() which does basically the same as
> phy_stop() and just adopts the state adjustment logic from
> phy_stop_machine() to inform the resume callback about the status of
> the PHY before suspending.
Definitively an improvement, thanks!
>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
> drivers/net/phy/phy.c | 48 ++++++++++++++++++++++++++++++++----------------
> include/linux/phy.h | 12 +++++++++++-
> 2 files changed, 43 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 0ca1672a5..54144cd10 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -737,21 +737,49 @@ int phy_stop_interrupts(struct phy_device *phydev)
> }
> EXPORT_SYMBOL(phy_stop_interrupts);
>
> +static void phy_link_up(struct phy_device *phydev)
> +{
> + phydev->phy_link_change(phydev, true, true);
> + phy_led_trigger_change_speed(phydev);
> +}
> +
> +static void phy_link_down(struct phy_device *phydev, bool do_carrier)
> +{
> + phydev->phy_link_change(phydev, false, do_carrier);
> + phy_led_trigger_change_speed(phydev);
> +}
> +
> /**
> - * phy_stop - Bring down the PHY link, and stop checking the status
> + * __phy_stop - Bring down the PHY link, and stop checking the status
> * @phydev: target phy_device struct
> + * @suspending: called from a suspend callback
Can you put the same comment as what phy_stop_machine() has such that we
understand why there is a check for phydev->state > PHY_UP and what
happens when suspend is set to false and explain what happens when
@suspending is set to false.
> */
> -void phy_stop(struct phy_device *phydev)
> +void __phy_stop(struct phy_device *phydev, bool suspending)
> {
> mutex_lock(&phydev->lock);
>
> if (PHY_HALTED == phydev->state)
> goto out_unlock;
>
> + /* stop state machine */
> + cancel_delayed_work_sync(&phydev->state_queue);
> +
> if (phy_interrupt_is_valid(phydev))
> phy_disable_interrupts(phydev);
>
> - phydev->state = PHY_HALTED;
> + if (phydev->link) {
> + phydev->link = 0;
> + phy_link_down(phydev, true);
> + }
> +
> + phy_suspend(phydev);
> +
> + if (suspending) {
> + if (phydev->state > PHY_UP && phydev->state != PHY_HALTED)
> + phydev->state = PHY_UP;
> + } else {
> + phydev->state = PHY_HALTED;
> + }
>
> out_unlock:
> mutex_unlock(&phydev->lock);
> @@ -761,7 +789,7 @@ void phy_stop(struct phy_device *phydev)
> * will not reenable interrupts.
> */
> }
> -EXPORT_SYMBOL(phy_stop);
> +EXPORT_SYMBOL(__phy_stop);
>
> /**
> * phy_start - start or restart a PHY device
> @@ -804,18 +832,6 @@ void phy_start(struct phy_device *phydev)
> }
> EXPORT_SYMBOL(phy_start);
>
> -static void phy_link_up(struct phy_device *phydev)
> -{
> - phydev->phy_link_change(phydev, true, true);
> - phy_led_trigger_change_speed(phydev);
> -}
> -
> -static void phy_link_down(struct phy_device *phydev, bool do_carrier)
> -{
> - phydev->phy_link_change(phydev, false, do_carrier);
> - phy_led_trigger_change_speed(phydev);
> -}
> -
> /**
> * phy_state_machine - Handle the state machine
> * @work: work_struct that describes the work to be done
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index bc7aa93c6..be43bd270 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -940,7 +940,7 @@ struct phy_device *phy_connect(struct net_device *dev, const char *bus_id,
> void phy_disconnect(struct phy_device *phydev);
> void phy_detach(struct phy_device *phydev);
> void phy_start(struct phy_device *phydev);
> -void phy_stop(struct phy_device *phydev);
> +void __phy_stop(struct phy_device *phydev, bool suspending);
> int phy_start_aneg(struct phy_device *phydev);
> int phy_aneg_done(struct phy_device *phydev);
>
> @@ -964,6 +964,16 @@ static inline const char *phydev_name(const struct phy_device *phydev)
> return dev_name(&phydev->mdio.dev);
> }
>
> +static inline void phy_stop(struct phy_device *phydev)
> +{
> + __phy_stop(phydev, false);
> +}
> +
> +static inline void phy_stop_suspending(struct phy_device *phydev)
> +{
> + __phy_stop(phydev, true);
> +}
I am not a huge fond of these inline functions, I would just move thos
down to phy.c and actually export both of these symbols, this is just
personal preference though.
--
Florian
^ permalink raw reply
* Re: [PATCH RFC 0/7] net: phy: patch series aiming to improve few aspects of phylib
From: Florian Fainelli @ 2018-03-14 23:53 UTC (permalink / raw)
To: Heiner Kallweit, Andrew Lunn; +Cc: Geert Uytterhoeven, netdev@vger.kernel.org
In-Reply-To: <421476ed-03c7-63a0-44a4-c6d9c7702647@gmail.com>
On 03/14/2018 01:10 PM, Heiner Kallweit wrote:
> This patch series aims to tackle few issues with phylib:
>
> - address issues with patch series [1] (smsc911x + phylib changes)
> - make phy_stop synchronous
> - get rid of phy_start/stop_machine and handle it in phy_start/phy_stop
> - in mdio_suspend consider runtime pm state of mdio bus parent
> - consider more WOL conditions when deciding whether PHY is allowed to
> suspend
> - only resume phy after system suspend if needed
>
> [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
>
> It works fine here but other NIC drivers may use phylib differently.
> Therefore I'd appreciate feedback and more testing.
>
> I could think of some subsequent patches, e.g. phy_error() could be
> reduced to calling phy_stop() and printing an error message
> (today it silently sets the PHY state to PHY_HALTED).
Thanks for the patch series, I will give it a spin on a number of
devices using different PHYLIB integration and see if something breaks.
>
> Heiner Kallweit (7):
> net: phy: unconditionally resume and re-enable interrupts in phy_start
> net: phy: improve checking for when PHY is allowed to suspend
> net: phy: resume PHY only if needed in mdio_bus_phy_suspend
> net: phy: remove phy_start_machine
> net: phy: make phy_stop synchronous
> net: phy: use new function phy_stop_suspending in mdio_bus_phy_suspend
> net: phy: remove phy_stop_machine
>
> drivers/net/phy/phy.c | 102 +++++++++++++++++--------------------------
> drivers/net/phy/phy_device.c | 80 ++++++++++++++++++++-------------
> drivers/net/phy/phylink.c | 1 -
> include/linux/phy.h | 14 ++++--
> 4 files changed, 100 insertions(+), 97 deletions(-)
>
--
Florian
^ permalink raw reply
* Re: [PATCH RFC 3/7] net: phy: resume PHY only if needed in, mdio_bus_phy_suspend
From: Florian Fainelli @ 2018-03-14 23:50 UTC (permalink / raw)
To: Heiner Kallweit, Andrew Lunn; +Cc: Geert Uytterhoeven, netdev@vger.kernel.org
In-Reply-To: <4e5a3bdd-8d17-a2ec-891f-7111e6b660c9@gmail.com>
On 03/14/2018 01:16 PM, Heiner Kallweit wrote:
> Currently the PHY is unconditionally resumed in mdio_bus_phy_suspend().
> In cases where the PHY was sleepinh before suspending or if somebody else
> takes care of resuming later, this is not needed and wastes energy.
>
> Also start the state machine only if it's used by the driver (indicated
> by the adjust_link callback being defined).
>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
> drivers/net/phy/phy_device.c | 33 +++++++++++++++++++++++----------
> 1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index a5691536f..c6fd79758 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -124,6 +124,18 @@ static bool phy_may_suspend(struct phy_device *phydev)
> }
>
> #ifdef CONFIG_PM
> +
> +static bool mdio_bus_phy_needs_start(struct phy_device *phydev)
> +{
> + bool start;
How about needs_start? This is uber nitpicking but it seems to be more
in line with what is being tested for here.
> +
> + mutex_lock(&phydev->lock);
> + start = phydev->state == PHY_UP && phydev->adjust_link;
You probably need to add an || phydev->phy_link_change here because that
is what PHYLINK uses, it does not register an adjust_link callback, but
would likely expect similar semantics. Even better, introduce a helper
function that tests for both to avoid possible issues...
> + mutex_unlock(&phydev->lock);
> +
> + return start;
> +}
> +
> static int mdio_bus_phy_suspend(struct device *dev)
> {
> struct phy_device *phydev = to_phy_device(dev);
> @@ -142,25 +154,25 @@ static int mdio_bus_phy_suspend(struct device *dev)
> static int mdio_bus_phy_resume(struct device *dev)
> {
> struct phy_device *phydev = to_phy_device(dev);
> - int ret;
> + int ret = 0;
>
> - ret = phy_resume(phydev);
> - if (ret < 0)
> - return ret;
> + if (!phydev->attached_dev)
> + return 0;
>
> - if (phydev->attached_dev && phydev->adjust_link)
> - phy_start_machine(phydev);
> + if (mdio_bus_phy_needs_start(phydev))
> + phy_start(phydev);
> + else if (!phydev->adjust_link)
> + ret = phy_resume(phydev);
Humm, under which conditions can you not have phydev->attached_dev and
also not phydev->adjust_link being set? As mentioned earlier, you would
likely need to test for phy_link_change too here.
>
> - return 0;
> + return ret;
> }
>
> static int mdio_bus_phy_restore(struct device *dev)
> {
> struct phy_device *phydev = to_phy_device(dev);
> - struct net_device *netdev = phydev->attached_dev;
> int ret;
>
> - if (!netdev)
> + if (!phydev->attached_dev)
> return 0;
That does not seem to be making any functional difference, so I would
just drop this for now.
>
> ret = phy_init_hw(phydev);
> @@ -171,7 +183,8 @@ static int mdio_bus_phy_restore(struct device *dev)
> phydev->link = 0;
> phydev->state = PHY_UP;
>
> - phy_start_machine(phydev);
> + if (mdio_bus_phy_needs_start(phydev))
> + phy_start(phydev);
>
> return 0;
> }
>
--
Florian
^ permalink raw reply
* Re: [PATCH v6 0/6] staging: Introduce DPAA2 Ethernet Switch driver
From: Andrew Lunn @ 2018-03-14 23:44 UTC (permalink / raw)
To: Razvan Stefanescu
Cc: gregkh, devel, linux-kernel, netdev, agraf, arnd,
alexandru.marginean, ruxandra.radulescu, ioana.ciornei,
laurentiu.tudor, stuyoder
In-Reply-To: <20180314155558.6898-1-razvan.stefanescu@nxp.com>
On Wed, Mar 14, 2018 at 10:55:52AM -0500, Razvan Stefanescu wrote:
> This patchset introduces the Ethernet Switch Driver for Freescale/NXP SoCs
> with DPAA2 (DataPath Acceleration Architecture v2). The driver manages
> switch objects discovered on the fsl-mc bus. A description of the driver
> can be found in the associated README file.
Hi Greg
This code has much better quality than the usual stuff in staging. I
see no reason not to merge it.
Andrew
^ permalink raw reply
* Re: [PATCH net-next v2 0/2] skbuff: Fix applications not being woken for errors
From: Andrew Lunn @ 2018-03-14 23:39 UTC (permalink / raw)
To: Vinicius Costa Gomes; +Cc: netdev, randy.e.witt, davem, eric.dumazet
In-Reply-To: <20180314215437.14726-1-vinicius.gomes@intel.com>
On Wed, Mar 14, 2018 at 02:54:35PM -0700, Vinicius Costa Gomes wrote:
> Hi,
>
> Changes from v1:
> - Fixed comments from Willem de Bruijn, about the order of the
> options passed to getopt();
> - Added Reviewed-by and Fixes tags to patch (2);
>
> Changes from the RFC:
> - tweaked commit messages;
>
> Original cover letter:
>
> This is actually a "bug report"-RFC instead of the more usual "new
> feature"-RFC.
>
> We are developing an application that uses TX hardware timestamping to
> make some measurements, and during development Randy Witt initially
> reported that the application poll() never unblocked when TX hardware
> timestamping was enabled.
Hi Vinicius
This sounds a lot like an issue i was chasing a month ago with ptp4l
and the newly added PTP support for Marvell switch chips. I was seeing
poll not unblocking, and when it did, the sequence numbers for PTP
messages indicating they were old. But it suddenly started working,
and i've not been able to reproduce it again.
> (my hypothesis is that only triggers when the error is reported from
> a different task context than the application).
The case i was having problems with, it is a kernel task which does
the error reporting, which is clearly a different context to the ptp4l
application. So your hypothesis could be correct.
Andrew
^ permalink raw reply
* Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
From: Daniel Borkmann @ 2018-03-14 23:27 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Alexei Starovoitov, davem, netdev, kernel-team
In-Reply-To: <20180314181158.bsvgnffv3rurvdcx@ast-mbp>
On 03/14/2018 07:11 PM, Alexei Starovoitov wrote:
> On Wed, Mar 14, 2018 at 03:37:01PM +0100, Daniel Borkmann wrote:
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -133,6 +133,8 @@ enum bpf_prog_type {
>>> BPF_PROG_TYPE_SOCK_OPS,
>>> BPF_PROG_TYPE_SK_SKB,
>>> BPF_PROG_TYPE_CGROUP_DEVICE,
>>> + BPF_PROG_TYPE_CGROUP_INET4_BIND,
>>> + BPF_PROG_TYPE_CGROUP_INET6_BIND,
>>
>> Could those all be merged into BPF_PROG_TYPE_SOCK_OPS? I'm slowly getting
>> confused by the many sock_*/sk_* prog types we have. The attach hook could
>> still be something like BPF_CGROUP_BIND/BPF_CGROUP_CONNECT. Potentially
>> storing some prog-type specific void *private_data in prog's aux during
>> verification could be a way (similarly as you mention) which can later be
>> retrieved at attach time to reject with -ENOTSUPP or such.
>
> that's exacly what I mentioned in the cover letter,
> but we need to extend attach cmd with verifier-like log_buf+log_size.
> since simple enotsupp will be impossible to debug.
Hmm, adding verifier-like log_buf + log_size feels a bit of a kludge just
for this case, but it's the usual problem where getting a std error code
is like looking into a crystal ball to figure where it's coming from. I'd see
only couple of other alternatives: distinct error code like ENAVAIL for such
mismatch, or telling the verifier upfront where this is going to be attached
to - same as we do with the dev for offloading or as you did with splitting
the prog types or some sort of notion of sub-prog types; or having a query
API that returns possible/compatible attach types for this loaded prog via
e.g. bpf_prog_get_info_by_fd() that loader can precheck or check when error
occurs. All nothing really nice, though.
Making verifier-like log_buf + log_size generic meaning not just for the case
of BPF_PROG_ATTACH specifically might be yet another option, meaning you'd
have a new BPF_GET_ERROR command to pick up the log for the last failed bpf(2)
cmd, but either that might require adding a BPF related pointer to task struct
for this or any other future BPF feature (maybe not really an option); or to
have some sort of bpf cmd to config and obtain an fd for error queue/log once,
where this can then be retrieved from and used for all cmds generically.
[...]
>>> +struct bpf_sock_addr {
>>> + __u32 user_family; /* Allows 4-byte read, but no write. */
>>> + __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write.
>>> + * Stored in network byte order.
>>> + */
>>> + __u32 user_ip6[4]; /* Allows 1,2,4-byte read an 4-byte write.
>>> + * Stored in network byte order.
>>> + */
>>> + __u32 user_port; /* Allows 4-byte read and write.
>>> + * Stored in network byte order
>>> + */
>>> + __u32 family; /* Allows 4-byte read, but no write */
>>> + __u32 type; /* Allows 4-byte read, but no write */
>>> + __u32 protocol; /* Allows 4-byte read, but no write */
>>
>> I recall bind to prefix came up from time to time in BPF context in the sense
>> to let the app itself be more flexible to bind to BPF prog. Do you see also app
>> to be able to add a BPF prog into the array itself?
>
> I'm not following. In this case the container management framework
> will attach bpf progs to cgroups and apps inside the cgroups will
> get their bind/connects overwritten when necessary.
Was mostly just thinking whether it could also cover the use case that was
brought up from time to time e.g.:
https://www.mail-archive.com/netdev@vger.kernel.org/msg100914.html
Seems like it would potentially be on top of that, plus having an option to
attach from within the app instead of orchestrator.
^ permalink raw reply
* Re: [PATCH] lan78xx: Connect phy early
From: Andrew Lunn @ 2018-03-14 23:25 UTC (permalink / raw)
To: Alexander Graf
Cc: woojung.huh, UNGLinuxDriver, netdev, linux-usb, linux-kernel,
Thomas Bogendoerfer, Phil Elwell
In-Reply-To: <20180314145456.69029-1-agraf@suse.de>
> @@ -2082,8 +2082,6 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>
> dev->fc_autoneg = phydev->autoneg;
>
> - phy_start(phydev);
> -
> netif_dbg(dev, ifup, dev->net, "phy initialised successfully");
>
> return 0;
> @@ -2512,9 +2510,7 @@ static int lan78xx_open(struct net_device *net)
> if (ret < 0)
> goto done;
>
> - ret = lan78xx_phy_init(dev);
> - if (ret < 0)
> - goto done;
> + phy_start(net->phydev);
Should the debug message be moved as well?
Andrew
^ permalink raw reply
* [PATCH net-next] liquidio: Allow RX descriptors to be consumed before disabling NAPI.
From: Felix Manlunas @ 2018-03-14 23:17 UTC (permalink / raw)
To: davem
Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
felix.manlunas
From: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Acked-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
drivers/net/ethernet/cavium/liquidio/lio_core.c | 23 ++++++++++++++++++++
drivers/net/ethernet/cavium/liquidio/lio_main.c | 25 +++++++++++++---------
drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 23 ++++++++++++--------
.../net/ethernet/cavium/liquidio/octeon_network.h | 1 +
4 files changed, 53 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index e7b6eb8..8e4716a 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -1157,3 +1157,26 @@ int liquidio_change_mtu(struct net_device *netdev, int new_mtu)
octeon_free_soft_command(oct, sc);
return 0;
}
+
+int lio_wait_for_clean_oq(struct octeon_device *oct)
+{
+ int retry = 100, pending_pkts = 0;
+ int idx;
+
+ do {
+ pending_pkts = 0;
+
+ for (idx = 0; idx < MAX_OCTEON_OUTPUT_QUEUES(oct); idx++) {
+ if (!(oct->io_qmask.oq & BIT_ULL(idx)))
+ continue;
+ pending_pkts +=
+ atomic_read(&oct->droq[idx]->pkts_pending);
+ }
+
+ if (pending_pkts > 0)
+ schedule_timeout_uninterruptible(1);
+
+ } while (retry-- && pending_pkts);
+
+ return pending_pkts;
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 140085b..345fe6e 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2240,16 +2240,6 @@ static int liquidio_stop(struct net_device *netdev)
struct octeon_device *oct = lio->oct_dev;
struct napi_struct *napi, *n;
- if (oct->props[lio->ifidx].napi_enabled) {
- list_for_each_entry_safe(napi, n, &netdev->napi_list, dev_list)
- napi_disable(napi);
-
- oct->props[lio->ifidx].napi_enabled = 0;
-
- if (OCTEON_CN23XX_PF(oct))
- oct->droq[0]->ops.poll_mode = 0;
- }
-
ifstate_reset(lio, LIO_IFSTATE_RUNNING);
netif_tx_disable(netdev);
@@ -2275,6 +2265,21 @@ static int liquidio_stop(struct net_device *netdev)
lio->ptp_clock = NULL;
}
+ /* Wait for any pending Rx descriptors */
+ if (lio_wait_for_clean_oq(oct))
+ netif_info(lio, rx_err, lio->netdev,
+ "Proceeding with stop interface after partial RX desc processing\n");
+
+ if (oct->props[lio->ifidx].napi_enabled == 1) {
+ list_for_each_entry_safe(napi, n, &netdev->napi_list, dev_list)
+ napi_disable(napi);
+
+ oct->props[lio->ifidx].napi_enabled = 0;
+
+ if (OCTEON_CN23XX_PF(oct))
+ oct->droq[0]->ops.poll_mode = 0;
+ }
+
dev_info(&oct->pci_dev->dev, "%s interface is stopped\n", netdev->name);
return 0;
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 3342d64..66655df 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -1281,15 +1281,6 @@ static int liquidio_stop(struct net_device *netdev)
/* tell Octeon to stop forwarding packets to host */
send_rx_ctrl_cmd(lio, 0);
- if (oct->props[lio->ifidx].napi_enabled) {
- list_for_each_entry_safe(napi, n, &netdev->napi_list, dev_list)
- napi_disable(napi);
-
- oct->props[lio->ifidx].napi_enabled = 0;
-
- oct->droq[0]->ops.poll_mode = 0;
- }
-
netif_info(lio, ifdown, lio->netdev, "Stopping interface!\n");
/* Inform that netif carrier is down */
lio->intf_open = 0;
@@ -1302,6 +1293,20 @@ static int liquidio_stop(struct net_device *netdev)
txqs_stop(netdev);
+ /* Wait for any pending Rx descriptors */
+ if (lio_wait_for_clean_oq(oct))
+ netif_info(lio, rx_err, lio->netdev,
+ "Proceeding with stop interface after partial RX desc processing\n");
+
+ if (oct->props[lio->ifidx].napi_enabled == 1) {
+ list_for_each_entry_safe(napi, n, &netdev->napi_list, dev_list)
+ napi_disable(napi);
+
+ oct->props[lio->ifidx].napi_enabled = 0;
+
+ oct->droq[0]->ops.poll_mode = 0;
+ }
+
dev_info(&oct->pci_dev->dev, "%s interface is stopped\n", netdev->name);
return 0;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_network.h b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
index 76803a5..b7ee2d3 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_network.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
@@ -190,6 +190,7 @@ irqreturn_t liquidio_msix_intr_handler(int irq __attribute__((unused)),
int octeon_setup_interrupt(struct octeon_device *oct, u32 num_ioqs);
+int lio_wait_for_clean_oq(struct octeon_device *oct);
/**
* \brief Register ethtool operations
* @param netdev pointer to network device
^ permalink raw reply related
* Re: [PATCH 7/7] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
From: Sinan Kaya @ 2018-03-14 22:57 UTC (permalink / raw)
To: Alexander Duyck
Cc: Timur Tabi, Netdev, sulrich, linux-arm-msm, linux-arm-kernel,
Jeff Kirsher, intel-wired-lan, LKML
In-Reply-To: <CAKgT0UcriMiyV0q6y_x9r3-HJRAWp5_CpsK2jTqh3qvOV=Kkzw@mail.gmail.com>
Hi Alexander,
On 3/14/2018 5:49 PM, Alexander Duyck wrote:
> On Wed, Mar 14, 2018 at 5:13 AM, <okaya@codeaurora.org> wrote:
>> On 2018-03-14 01:08, Timur Tabi wrote:
>>>
>>> On 3/13/18 10:20 PM, Sinan Kaya wrote:
> Actually I would argue this whole patch set is pointless. For starters
> why is it we are only updating the Intel Ethernet drivers here?
I did a regex search for wmb() followed by writel() in each drivers directory.
I scrubbed the ones I care about and posted this series. Note also that
I have one Infiniband patch in the series.
I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.
> This
> seems like something that is going to impact the whole kernel tree
> since many of us have been writing drivers for some time assuming x86
> style behavior.
That's true. We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.
Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.
>
> It doesn't make sense to be optimizing the drivers for one subset of
> architectures. The scope of the work needed to update the drivers for
> this would be ridiculous. Also I don't see how this could be expected
> to work on any other architecture when we pretty much need to have a
> wmb() before calling the writel on x86 to deal with accesses between
> coherent and non-coherent memory. It seems to me more like somebody
> added what they considered to be an optimization somewhere that is a
> workaround for a poorly written driver. Either that or the barrier is
> serving a different purpose then the one we were using.
Is there a semantic problem with the definition of wmb() vs. writel() vs.
writel_relaxed()? I thought everything is well described in barriers
document about what to expect from these APIs.
AFAIK, writel() is equal to writel_relaxed() on x86 architecture.
It doesn't really change anything for x86 but it saves barriers on
other architectures.
>
> It would make more sense to put in the effort making writel and
> writel_relaxed consistent between architectures before we go through
> and start modifying driver code to support different architectures.
>
Is there an arch problem that I'm not seeing?
Sinan
--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
^ permalink raw reply
* Re: regression: ti: cpsw: warning from phy_connect()->sysfs_create_link()-sysfs_warn_dup()
From: Andrew Lunn @ 2018-03-14 22:48 UTC (permalink / raw)
To: Grygorii Strashko; +Cc: Florian Fainelli, netdev, David Miller, Sekhar Nori
In-Reply-To: <d7b6d13b-ecec-2039-4a77-514f9666b729@ti.com>
> I've got additional testing data and this actually a *regression*, because
> second CPSW Port became broken after above commits due to Net PHY
> connection failure.
Hi Grygorii
I'm not sure this works by design, more by chance. And i think this is
the only MAC driver which does this. Can i suggest you test net-next
frequently, and review phylib patches, because i expect it will break
again unless you keep a close eye on changes. Developers simply don't
expect two phys connected to one MAC.
Thanks
Andrew
^ permalink raw reply
* Re: [PATCH 00/47] arch-removal: device drivers
From: Rafael J. Wysocki @ 2018-03-14 22:45 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Ulf Hansson, open list:ULTRA-WIDEBAND (UWB) SUBSYSTEM:,
Wolfram Sang, linux-iio, Viresh Kumar, Linus Walleij,
alexandre.belloni, linux-ide, netdev, linux-mtd, linux-i2c,
linux-rtc, Lars-Peter Clausen, Herbert Xu, Jonathan Corbet,
open list:DOCUMENTATION, Alan Stern, linux-serial, Jiri Slaby,
linux-mmc, shli, wg, linux-crypto, Linux PWM List, linux-watchdog
In-Reply-To: <20180314153603.3127932-1-arnd@arndb.de>
On Wed, Mar 14, 2018 at 4:35 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> Hi driver maintainers,
>
> I just posted one series with the removal of eight architectures,
> see https://lkml.org/lkml/2018/3/14/505 for details, or
> https://lwn.net/Articles/748074/ for more background.
>
> These are the device drivers that go along with them. I have already
> picked up the drivers for arch/metag/ into my tree, they were reviewed
> earlier.
>
> Please let me know if you have any concerns with the patch, or if you
> prefer to pick up the patches in your respective trees. I created
> the patches with 'git format-patch -D', so they will not apply without
> manually removing those files.
>
> For anything else, I'd keep the removal patches in my asm-generic tree
> and will send a pull request for 4.17 along with the actual arch removal.
Please include the cpufreq patches into that and you can add ACKs from
me on them if that helps.
Thanks!
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH net] net/sched: act_simple: don't leak 'index' in the error path
From: Davide Caratti @ 2018-03-14 22:43 UTC (permalink / raw)
To: Cong Wang
Cc: Jamal Hadi Salim, Jiri Pirko, David S. Miller,
Linux Kernel Network Developers
In-Reply-To: <CAM_iQpUg5sDfXk1i1aJh_28cV7VpR2PpggnoR3jOJNxqDJiNSw@mail.gmail.com>
hello Cong, thank you for reviewing this.
On Wed, 2018-03-14 at 11:41 -0700, Cong Wang wrote:
> On Tue, Mar 13, 2018 at 7:13 PM, Davide Caratti <dcaratti@redhat.com> wrote:
>
> Looks like we just need to replace the tcf_idr_cleanup() with
> tcf_idr_release()? Which is also simpler.
I just tried it on act_simple, and I can confirm: 'index' does not leak
anymore if alloc_defdata() fails to kzalloc(), and then tcf_idr_release()
is called in place of of tcf_idr_cleanup().
> Looks like all other callers of tcf_idr_cleanup() need to be replaced too,
> but I don't audit all of them...
no problem, I can try to do that, it's not going to be a big series
anyway.
while at it, I will also fix other spots where the same bug can be
reproduced, even if tcf_idr_cleanup() is not there: for example, when
tcf_vlan_init() fails allocating struct tcf_vlan_params *p,
ASSERT_RTNL();
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p) {
if (ovr)
tcf_idr_release(*a, bind);
return -ENOMEM;
}
the followinng behavior can be observed:
# tc actions flush action vlan
# tc actions add action vlan pop index 5
RTNETLINK answers: Cannot allocate memory
We have an error talking to the kernel
# tc actions add action vlan pop index 5
RTNETLINK answers: No space left on device
We have an error talking to the kernel
# tc actions add action vlan pop index 5
RTNETLINK answers: No space left on device
We have an error talking to the kernel
Probably testing the value of 'ovr' here is wrong, or maybe it's
not enough: I will also verify what happens using 'replace'
keyword instead of 'add'.
>
> [...]
>
> > if (!exists) {
> > + defdata = nla_strdup(tb[TCA_DEF_DATA], GFP_KERNEL);
> > + if (unlikely(!defdata))
> > + return -ENOMEM;
> > +
> > ret = tcf_idr_create(tn, parm->index, est, a,
> > &act_simp_ops, bind, false);
> > if (ret)
> > return ret;
>
> You leak memory here on failure, BTW.
Ouch, you are right. I was wrongly convinced that act_simp_ops.cleanup()
was called also in case of failure of tcf_idr_create(), but it's not true.
Indeed, a call to act_simp_ops.cleanup() happens if we call
tcf_idr_release() after tcf_idr_create() returned 0.
regards,
--
davide
^ permalink raw reply
* [PATCH 2/2] net: phy: relax error checking when creating sysfs link netdev->phydev
From: Grygorii Strashko @ 2018-03-14 22:26 UTC (permalink / raw)
To: David S. Miller, netdev, Andrew Lunn, Florian Fainelli,
Greg Kroah-Hartman
Cc: Sekhar Nori, linux-kernel, linux-omap, Grygorii Strashko
In-Reply-To: <20180314222624.12744-1-grygorii.strashko@ti.com>
Some ethernet drivers (like TI CPSW) may connect and manage >1 Net PHYs per
one netdevice, as result such drivers will produce warning during system
boot and fail to connect second phy to netdevice when PHYLIB framework
will try to create sysfs link netdev->phydev for second PHY
in phy_attach_direct(), because sysfs link with the same name has been
created already for the first PHY. As result, second CPSW external
port will became unusable.
Fix it by relaxing error checking when PHYLIB framework is creating sysfs
link netdev->phydev in phy_attach_direct(), suppressing warning by using
sysfs_create_link_nowarn() and adding debug message instead.
Cc: Florian Fainelli <f.fainelli@gmail.com>
Fixes: a3995460491d ("net: phy: Relax error checking on sysfs_create_link()")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
drivers/net/phy/phy_device.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 478405e..fe16f58 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1012,10 +1012,17 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
err = sysfs_create_link(&phydev->mdio.dev.kobj, &dev->dev.kobj,
"attached_dev");
if (!err) {
- err = sysfs_create_link(&dev->dev.kobj, &phydev->mdio.dev.kobj,
- "phydev");
- if (err)
- goto error;
+ err = sysfs_create_link_nowarn(&dev->dev.kobj,
+ &phydev->mdio.dev.kobj,
+ "phydev");
+ if (err) {
+ dev_err(&dev->dev, "could not add device link to %s err %d\n",
+ kobject_name(&phydev->mdio.dev.kobj),
+ err);
+ /* non-fatal - some net drivers can use one netdevice
+ * with more then one phy
+ */
+ }
phydev->sysfs_links = true;
}
--
2.10.5
^ permalink raw reply related
* [PATCH 1/2] sysfs: symlink: export sysfs_create_link_nowarn()
From: Grygorii Strashko @ 2018-03-14 22:26 UTC (permalink / raw)
To: David S. Miller, netdev, Andrew Lunn, Florian Fainelli,
Greg Kroah-Hartman
Cc: Sekhar Nori, linux-kernel, linux-omap, Grygorii Strashko
In-Reply-To: <20180314222624.12744-1-grygorii.strashko@ti.com>
The sysfs_create_link_nowarn() is going to be used in phylib framework in
suseuent patch which can be built as module. Hence, export
sysfs_create_link_nowarn() to avoid build errors.
Cc: Florian Fainelli <f.fainelli@gmail.com>
Fixes: a3995460491d ("net: phy: Relax error checking on sysfs_create_link()")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
"Fixes" added as there is dependency this and subsequent patch.
fs/sysfs/symlink.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 8664db2..215c225 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -106,6 +106,7 @@ int sysfs_create_link_nowarn(struct kobject *kobj, struct kobject *target,
{
return sysfs_do_create_link(kobj, target, name, 0);
}
+EXPORT_SYMBOL_GPL(sysfs_create_link_nowarn);
/**
* sysfs_delete_link - remove symlink in object's directory.
--
2.10.5
^ permalink raw reply related
* [PATCH 0/2] net: phy: relax error checking when creating sysfs link netdev->phydev
From: Grygorii Strashko @ 2018-03-14 22:26 UTC (permalink / raw)
To: David S. Miller, netdev, Andrew Lunn, Florian Fainelli,
Greg Kroah-Hartman
Cc: Sekhar Nori, linux-kernel, linux-omap, Grygorii Strashko
Some ethernet drivers (like TI CPSW) may connect and manage >1 Net PHYs per
one netdevice, as result such drivers will produce warning during system
boot and fail to connect second phy to netdevice when PHYLIB framework
will try to create sysfs link netdev->phydev for second PHY
in phy_attach_direct(), because sysfs link with the same name has been
created already for the first PHY.
As result, second CPSW external port will became unusable.
This issue was introduced by commits:
5568363f0cb3 ("net: phy: Create sysfs reciprocal links for attached_dev/phydev"
a3995460491d ("net: phy: Relax error checking on sysfs_create_link()"
Patch 1: exports sysfs_create_link_nowarn() function as preparation for Patch 2.
Patch 2: relaxes error checking when PHYLIB framework is creating sysfs
link netdev->phydev in phy_attach_direct(), suppress warning by using
sysfs_create_link_nowarn() and adds debug message instead.
This is stable material 4.13+.
Cc: Florian Fainelli <f.fainelli@gmail.com>
Grygorii Strashko (2):
sysfs: symlink: export sysfs_create_link_nowarn()
net: phy: relax error checking when creating sysfs link netdev->phydev
drivers/net/phy/phy_device.c | 15 +++++++++++----
fs/sysfs/symlink.c | 1 +
2 files changed, 12 insertions(+), 4 deletions(-)
--
2.10.5
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox